CN-121996774-A - Second-hand vehicle data set construction method and device based on data management and storage medium
Abstract
The invention provides a method, a device and a storage medium for constructing a second-hand vehicle data set based on data management. An original vehicle description text from a second-hand vehicle transaction scene is obtained, then cleaned and standardized to yield a vehicle description text set to be processed. Event units containing operation behaviors and state changes are identified and extracted from this text set. Each event unit undergoes double-base vector decomposition to obtain a double-base vector, and the double-base vectors are embedded into a preset three-dimensional event space. The double-base vectors in that space are checked, contradictory event units are identified and corrected, and a correction vector set that eliminates the contradictions is generated. Finally, the correction vector set is projected into a preset second-hand vehicle estimation model data field space to generate structured data units aligned with that field space.
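As a rough illustration of the data structures implied by the abstract, the following Python sketch models an event unit and its double-base vector. The class names, field names and the one-hot vocabulary lookup are assumptions made for illustration only; the patent does not specify a concrete encoding, and a real decomposition would use the vocabulary sets and weighting described in the claims.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EventUnit:
    time_node: str      # time description taken from the text (hypothetical format)
    operation: str      # operation behavior applied to the vehicle
    state_change: str   # state change that followed the operation


@dataclass
class DoubleBaseVector:
    operation_vec: List[float]  # stands in for the operation vector
    state_vec: List[float]      # stands in for the state vector


def decompose(event: EventUnit, op_terms: List[str], state_terms: List[str]) -> DoubleBaseVector:
    """One-hot stand-in for double-base vector decomposition (an assumption)."""
    # Each dimension is 1.0 when the vocabulary term appears in the description.
    op = [1.0 if term in event.operation else 0.0 for term in op_terms]
    st = [1.0 if term in event.state_change else 0.0 for term in state_terms]
    return DoubleBaseVector(op, st)


event = EventUnit("2023-05", "replaced brake pads", "braking distance shortened")
vec = decompose(event, ["replaced", "repainted"], ["braking", "paint"])
```

Keeping the operation and state components as two separate vectors mirrors the abstract's pairing of an intervention with its effect, which is what the later contradiction check relies on.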
Inventors
- ZHU JIXI
- Cheng Yingang
- ZHU LIN
Assignees
- 贵州引擎科技产业有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20251225
Claims (10)
- 1. A method for constructing a second-hand vehicle data set based on data governance, the method comprising: acquiring an original vehicle description text in a second-hand vehicle transaction scene, and performing text cleaning and standardization processing on the original vehicle description text to obtain a vehicle description text set to be processed, wherein the vehicle description text set to be processed comprises state description information and operation record information of a vehicle in different use stages; performing event unit extraction on the vehicle description text set to be processed, and identifying and extracting event units containing operation behaviors and state changes from the vehicle description text set to be processed, wherein the event units are used for representing the state change process of the vehicle at corresponding time nodes; performing double-base vector decomposition processing on each event unit, and decomposing the event unit into a double-base vector composed of an operation vector and a state vector, wherein the operation vector is used for representing the force direction of active intervention applied to the vehicle, and the state vector is used for representing the offset amplitude of the vehicle attributes under the intervention; embedding the double-base vectors into a preset three-dimensional event space, detecting the double-base vectors in the three-dimensional event space through a geometric constraint model of the vehicle physical state, identifying and correcting event units containing contradictions, and generating a correction vector set that eliminates the contradictions; and performing projection conversion processing on the correction vector set, projecting the correction vector set to a preset second-hand vehicle estimation model data field space, and generating structured data units aligned with the second-hand vehicle estimation model data field space, wherein the structured data units are used for constructing a standardized second-hand vehicle data set.
- 2. The method according to claim 1, wherein the extracting the event unit from the set of vehicle description texts to be processed, identifying and extracting the event unit from the set of vehicle description texts to be processed, which includes the operation behavior and the state change, includes: Performing time mark analysis on the vehicle description text set to be processed, extracting description information representing time nodes in the text, dividing the vehicle description text set to be processed into text segment units with sequence based on the time node description information, performing semantic consistency analysis on adjacent text segment units, calculating semantic overlap between the tail end of a preceding segment and the head of a subsequent segment, and performing content completion on text segments with the semantic overlap not reaching a set standard; Performing semantic role recognition on the text segment units, analyzing an action initiator, an action object and a state change description part in each text segment unit, and determining a concrete vehicle composition part or an operation execution main body pointed by a pronoun by combining a context pointing relation to obtain a labeling text content containing semantic role identification and the pointing relation; Based on the association condition of the action initiator and the action object in the labeling text content, constructing a basic composition relation of the vehicle operation behavior, analyzing the time duration characteristic of the action, and distinguishing the one-time operation from the continuous operation; Based on a semantic association analysis result of a basic composition relation between the state change description part and the vehicle operation behaviors, determining state change description content corresponding to each operation behavior, and identifying main state changes directly caused by the operation behaviors and association state changes 
indirectly caused by the operation behaviors by combining with a causal association relation of vehicle attributes to obtain an event unit preliminary set containing the corresponding relation between the operation behaviors and the state changes; Performing redundancy check on the preliminary set of event units, calculating the repeatability among the event units according to the combination characteristics of the operation behavior type, the action object and the state change amplitude, removing the event units with the repeatability reaching the set standard, and obtaining event unit groups after de-duplication; and carrying out dependency relationship identification on event unit groups after de-duplication, analyzing causal association and conditional association among event units, and adding pre-event identification information and post-event identification information for each event unit to obtain a structured event unit containing time sequence, semantic roles and dependency relationships.
- 3. The method according to claim 2, wherein the performing semantic role recognition on the text segment units, analyzing the action initiator, the action object, and the state change description part in each text segment unit, determining the specific vehicle component part or the operation execution body pointed by the pronoun in combination with the context reference relationship, and obtaining the labeled text content including the semantic role identifier and the reference relationship includes: Word segmentation is carried out on the text segment units, a continuous text sequence is segmented into word components with independent ideographic functions, the word components are divided into different categories according to part-of-speech characteristics, and each category of words bears different semantic functions in event description; the method comprises the steps of calling a context semantic coding model to carry out semantic association processing on word components, analyzing the position relation of the word components in text fragments and the semantic influence of adjacent words, and generating a word vector sequence containing context-dependent information, wherein the word vector sequence reflects the semantic meaning of the word components in a specific context; inputting the word vector sequence into a semantic role recognition flow, performing role classification processing on each word component, and outputting a semantic role classification result corresponding to each word component, wherein the classification result reflects the functional positioning of the word component in the event description; Determining semantic role identification of each word component according to the semantic role classification result, wherein the semantic role identification reflects the functional positioning of the word component in the event description; performing association storage on the semantic role identifications and the corresponding word 
components, and establishing a corresponding relation between the word components and the semantic role identifications to obtain an association data set containing the word components, the part-of-speech categories and the semantic role identifications; And carrying out cross verification on the associated data set, comparing the common semantic role configuration conditions in the field, and adjusting the deviation part in the recognition result to enable the associated data set to accord with the semantic specification in the field of the second-hand vehicle, so as to obtain the marked text content.
- 4. The method of claim 1, wherein the performing double-base vector decomposition processing on each event unit, and decomposing the event unit into a double-base vector composed of an operation vector and a state vector, comprises: performing structural conversion on the event unit, converting the event unit in natural language form into a structured event expression consisting of an operation behavior description, an action object description, an operation implementation tool, environmental influence factors and a state change condition, wherein all components have a clear logical association in the event description; extracting new terms from the operation behavior description and the state change condition, supplementing the extracted new terms to a vehicle operation behavior vocabulary set and a vehicle state attribute vocabulary set, and updating the term entries and semantic descriptions of the vocabulary sets; based on the updated vehicle operation behavior vocabulary set, performing joint vectorization processing on the operation behavior description and the operation implementation tool in the structured event expression, and generating a comprehensive operation vector integrating operation type, implementation strength and tool influence by combining the type characteristics of the operation behavior and the functional characteristics of the operation implementation tool; based on the updated vehicle state attribute vocabulary set, performing joint vectorization processing on the state change condition and the environmental influence factors in the structured event expression, distinguishing recoverable state changes from unrecoverable state changes, adding a continuous influence weight for unrecoverable state changes, and generating a comprehensive state vector containing the state attribute change amplitude and the environmental influence degree; performing causal relation degree analysis on the comprehensive operation vector and the comprehensive state vector, counting
co-occurrence conditions of operation behaviors and state changes in historical event data, judging association tightness of the comprehensive operation vector and the comprehensive state vector, screening vector pairs with association tightness being greater than a first threshold, and eliminating abnormal vector pairs with association tightness being less than a second threshold; And carrying out dimension optimization processing on the comprehensive operation vector and the comprehensive state vector, and reserving key information to obtain a double-base vector formed by the operation vector and the state vector.
- 5. The method of claim 4, wherein the performing joint vectorization processing on the operation behavior description and the operation implementation tool in the structured event expression based on the updated vehicle operation behavior vocabulary set, and generating a comprehensive operation vector integrating operation type, implementation strength and tool influence by combining type characteristics of operation behaviors and functional characteristics of the operation implementation tool, includes: Performing principle analysis on the operation behavior description in the structured event expression, dividing the operation behavior into different action types according to the action difference of the operation on a physical structure or an electronic system of the vehicle, and recording the proportion condition of each category in the operation behavior description; performing functional deconstructment on the operation implementation tool in the structured event expression, extracting description information of an action part, an operation mode and an accuracy level of the tool, and establishing a mapping relation between tool functional parameters and operation effects; constructing a dynamic operation dictionary based on the updated vehicle operation behavior vocabulary set, mapping operation behaviors of different action types to corresponding entries in the dictionary, and generating a basic vector representing the operation type; According to the strength modifier in the operation behavior description and the analysis result of the tool function parameters, determining the strength grade of operation implementation, and converting the strength grade into a strength coefficient matched with the dimension of the basic vector; element-by-element fusion is carried out on the basic vector and the intensity coefficient to obtain an intermediate vector containing operation types and implementation intensity, and the dimension of the intermediate vector is 
consistent with the number of entries of a dynamic operation dictionary; based on the mapping relation between the tool function parameters and the operation effects, differential weights are distributed to each dimension of the intermediate vector, and a comprehensive operation vector integrating operation types, implementation strength and tool influence is generated through weighted calculation; The method includes the steps that based on the updated vehicle state attribute vocabulary set, the state change condition and the environment influence factor in the structured event expression are subjected to joint vectorization processing, recoverable state change and unrecoverable state change are distinguished, continuous influence weight is added for the unrecoverable state change, and a comprehensive state vector containing state attribute change amplitude and environment influence degree is generated, and the method comprises the following steps: Performing attribute tracing on the state change condition in the structured event expression, determining the vehicle core attribute related to the state change, and establishing the association corresponding relation between the state change description and the core attribute; performing action analysis on environmental influence factors in the structured event expression, and calculating the contribution ratio of each factor to state change according to the duration of the environmental factors and the action range description information; constructing an attribute feature space based on the updated vehicle state attribute vocabulary set, mapping the associated core attribute to a corresponding dimension in the feature space, and generating a basic vector representing the state attribute; Calculating the variation amplitude value of each core attribute according to the variation description in the state variation description and the contribution proportion of the environmental factors, and converting the variation 
amplitude value into an amplitude coefficient matched with the dimension of the basic vector; judging the reversibility of the state change, distinguishing recoverable state change from unrecoverable state change, adding time attenuation weight for the amplitude coefficient corresponding to the unrecoverable state change, and generating a weighted amplitude coefficient; And fusing the basic vector and the weighted amplitude coefficient element by element to obtain an intermediate vector containing state attributes and variation amplitude, wherein the dimension of the intermediate vector is consistent with the dimension of the attribute feature space, and the value of each dimension of the intermediate vector is adjusted according to the contribution proportion of environmental factors to generate a comprehensive state vector.
- 6. The method according to claim 1, wherein the embedding the double-base vectors into a preset three-dimensional event space, detecting the double-base vectors in the three-dimensional event space through a geometric constraint model of the vehicle physical state, identifying and correcting event units containing contradictions, and generating a correction vector set that eliminates the contradictions, comprises: determining the dimensional composition of the three-dimensional event space, wherein the time dimension comprises an absolute time mark of event occurrence and a time interval between adjacent events, the operation strength dimension comprises an operation action force and an operation frequency in a unit time period, the state change dimension comprises an instant state expression and an accumulated state expression after operation, and each dimension converts original numerical values into standardized coordinate values in a unified interval through linear conversion; performing space mapping preprocessing on the double-base vector, extracting a characteristic component reflecting action force and a characteristic component reflecting operation frequency from the operation vector, extracting a characteristic component reflecting the instant state and a characteristic component reflecting the accumulated state from the state vector, and performing dimension consistency processing on the extracted characteristic components so that the vector lengths of the components match; embedding the preprocessed characteristic components into the three-dimensional event space, generating time dimension coordinates through joint conversion of absolute time marks and time intervals, generating operation strength dimension coordinates through fusion calculation of action force characteristic components and operation frequency characteristic components, generating state change dimension coordinates through weighted combination of instant state characteristic components and accumulated state characteristic
components, and obtaining space coordinate points; constructing a geometric constraint rule base of the physical state of the vehicle, wherein the geometric constraint rule base comprises a time sequence rule, an operation-state association rule and a state evolution rule, the time sequence rule determines a threshold range through statistics of historical event time intervals, and the operation-state association rule determines direction constraint through association analysis of historical operation and state change; Performing rule matching detection on space coordinate points in a three-dimensional event space, comparing each dimension value of the space coordinate points with a threshold range of a geometric constraint rule base, identifying contradictory coordinate points exceeding the threshold range, and recording event units and contradictory types corresponding to contradiction; And carrying out parameter adjustment on the double-base vector corresponding to the contradictory coordinate point, calling a corresponding correction strategy in the rule base according to the contradictory type, and adjusting the action force characteristic component of the operation vector or the accumulated state characteristic component of the state vector to ensure that the corrected space coordinate point meets the threshold range of the geometric constraint rule base and generate a correction vector set for eliminating the contradiction.
- 7. The method of claim 6, wherein the determining the dimension of the three-dimensional event space comprises determining a time dimension comprising an absolute time stamp of occurrence of an event and a time interval of an adjacent event, the operation strength dimension comprises an operation action force and an operation frequency in a unit time period, the state change dimension comprises an instantaneous state representation after operation and an accumulated state representation, and each dimension converts an original numerical value into a standardized coordinate value in a unified interval through linear conversion, and the method comprises: Performing double-coordinate division on the time dimension, analyzing and converting the absolute time stamp into a time sequence value through time description information of an event unit, and calculating and generating a time interval of adjacent events through a time stamp difference value of a current event and a preamble event to obtain two sub-coordinate components of the time dimension; Performing double-coordinate division on the operation intensity dimension, wherein the operation action force is generated by extracting intensity related features in the operation vector, and counting the occurrence times of similar operation vectors by using operation frequency in a unit time period through a fixed time window to generate two sub-coordinate components of the operation intensity dimension; performing double-coordinate division on the state change dimension, directly extracting and generating the instant state expression through instant change characteristics in the state vector after operation, and generating the accumulated state expression through similar characteristic accumulation calculation in the history state vector to obtain two sub-coordinate components of the state change dimension; carrying out standardization processing on the sub-coordinate components of each dimension, converting original 
values such as absolute time marks, time intervals, operation action forces and the like into standardized coordinate values in a unified interval through linear conversion, and reserving the relative magnitude relation among the values; Determining coordinate scale intervals of each dimension according to the distribution characteristics of the historical event data, dynamically adjusting time dimension scale intervals according to the event distribution density, adjusting operation strength dimension scale intervals according to the action force distribution characteristics, and adjusting state change dimension scale intervals according to the fluctuation range of state values; Mapping and associating the standardized coordinate values of each dimension with scale intervals to generate a coordinate mapping table of a three-dimensional event space, and recording the corresponding relation between the original numerical value and the standardized coordinate values; The spatial mapping preprocessing is performed on the double basis vectors, the characteristic components reflecting the action force and the characteristic components reflecting the operation frequency are extracted from the operation vectors, the characteristic components reflecting the instant state and the characteristic components reflecting the accumulated state are extracted from the state vectors, the dimensional consistency processing is performed on the extracted characteristic components, so that the vector lengths of the components are matched, and the method comprises the following steps: The operation vector is subjected to vector decomposition, the operation vector is divided into a basic operation matrix and an intensity coefficient matrix, the basic operation matrix reflects the type attribute of the operation, the intensity coefficient matrix reflects the action force of the operation, and characteristic components reflecting the action force are extracted from the intensity coefficient 
matrix; Performing frequency statistical analysis on the operation vectors, constructing a time window according to a time mark sequence of the event unit, counting the occurrence times of similar operation vectors in the window, taking a statistical result as a characteristic component reflecting the operation frequency, and forming an operation intensity characteristic pair with an action intensity characteristic component; The method comprises the steps of carrying out vector decomposition on a state vector, splitting the state vector into an instant state matrix and a historical state matrix, wherein the instant state matrix reflects short-term state change after operation, the historical state matrix reflects state change accumulated for a long time, and extracting characteristic components reflecting accumulated states from the historical state matrix; Performing timeliness analysis on the state vector, determining the effective duration of the instant state characteristic component according to the duration description of the state change, merging the instant state characteristic component values exceeding the effective duration into the accumulated state characteristic component, and updating the value of the accumulated state characteristic component; performing dimension alignment processing on the operation intensity feature pair and the state change feature pair, reserving key dimensions through feature selection, deleting redundant dimensions, and enabling the vector lengths of the operation intensity feature pair and the state change feature pair to be consistent; and carrying out standardization processing on the operation intensity characteristic pairs and the state change characteristic pairs after the dimension alignment, and adjusting the numerical range of each characteristic component to unify the dimensions of different characteristic components so as to combine the characteristic groups to be mapped.
- 8. The method of claim 1, wherein the performing projection conversion processing on the correction vector set, projecting the correction vector set to a preset second-hand vehicle estimation model data field space, and generating structured data units aligned with the second-hand vehicle estimation model data field space, comprises: performing hierarchical analysis on the field definitions of the second-hand vehicle estimation model data, extracting parent-child dependency relationships and peer-level association relationships among fields, and constructing a field dependency network containing field hierarchy paths and association weights, wherein the parent-child dependency relationships represent containment relationships among fields, and the peer-level association relationships represent cooperative relationships among fields; performing feature layering processing on the correction vector set, dividing the double-base vectors into core feature vectors and auxiliary feature vectors based on the business influence degree of the vector dimensions, wherein the core feature vectors correspond to key data fields in the estimation model, and the auxiliary feature vectors correspond to secondary data fields; performing multipath mapping on the core feature vectors based on the field dependency network, mapping one core feature vector dimension to a plurality of associated fields according to the field hierarchy paths, and distributing the feature proportion of each field through the association weights to generate a preliminary field mapping result; performing supplementary mapping processing on the auxiliary feature vectors, mapping the dimensions of the auxiliary feature vectors to fields not covered by the core feature vectors according to the peer-level association relationships, weighting the supplementary field values by the feature proportion, and completing the field mapping result; performing a field logic relation check on the completed field mapping result, analyzing the numerical containment relationships between parent and child fields and the numerical cooperative relationships between peer-level fields, identifying conflict fields with logical contradictions, and recording the conflict types and corresponding feature vectors; and calling a field adjustment strategy according to the conflict type, adjusting the proportion of the core feature vector or auxiliary feature vector corresponding to the conflict field so that the field values satisfy the association relationships of the field dependency network, and generating structured data units aligned with the second-hand vehicle estimation model data field space.
- 9. A data set construction apparatus comprising a memory and a processor, the memory storing a computer program executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the program is executed.
- 10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, realizes the steps in the method according to any one of claims 1 to 8.
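The linear conversion, rule matching and correction steps recited in claim 6 can be illustrated with a minimal Python sketch. Everything concrete here is an assumption for illustration: the `EventPoint` class, the threshold values in `RULES`, and the use of simple clamping as the correction strategy are hypothetical stand-ins, since the claims leave the rule base contents and correction strategies unspecified.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple


def linear_scale(value: float, lo: float, hi: float) -> float:
    """Linearly convert a raw value into a standardized coordinate in [0, 1]."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


@dataclass(frozen=True)
class EventPoint:
    event_id: str
    time: float      # standardized time coordinate
    strength: float  # standardized operation strength coordinate
    state: float     # standardized state change coordinate


# Hypothetical rule base: allowed (min, max) range per dimension.
Rules = Dict[str, Tuple[float, float]]
RULES: Rules = {"time": (0.0, 1.0), "strength": (0.0, 0.8), "state": (0.0, 0.9)}


def find_contradictions(points: List[EventPoint], rules: Rules) -> List[Tuple[str, str]]:
    """Return (event_id, dimension) for every coordinate outside its range."""
    found = []
    for p in points:
        for dim, (lo, hi) in rules.items():
            if not lo <= getattr(p, dim) <= hi:
                found.append((p.event_id, dim))
    return found


def correct(point: EventPoint, rules: Rules) -> EventPoint:
    """Clamp out-of-range coordinates to the nearest threshold (assumed strategy)."""
    fixed = {dim: min(max(getattr(point, dim), lo), hi) for dim, (lo, hi) in rules.items()}
    return replace(point, **fixed)


points = [
    EventPoint("e1", time=0.2, strength=0.5, state=0.4),
    EventPoint("e2", time=0.3, strength=0.95, state=0.4),
]
contradictions = find_contradictions(points, RULES)
corrected = [correct(p, RULES) for p in points]
```

In this sketch the recorded dimension name plays the role of the contradiction type, and the corrected points form the analogue of the correction vector set; a fuller version would adjust the underlying operation or state vector components rather than the coordinates directly.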
Description
Second-hand vehicle data set construction method and device based on data management and storage medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular to a method and an apparatus for constructing a second-hand vehicle data set based on data management, and a storage medium.
Background
As second-hand vehicle transactions become increasingly digitalized, constructing a standardized second-hand vehicle data set from unstructured text data has become a basic technology supporting vehicle value evaluation and transaction risk control. The key requirement is to extract effective information reflecting changes in vehicle state from text data such as vehicle usage records and maintenance descriptions.
In the related art, second-hand vehicle text data is processed by extracting discrete attributes through word segmentation, entity recognition, keyword matching and similar means, and then filling the extracted attribute values into a preset data template to obtain a structured data set containing basic parameters. However, when this approach faces descriptions of dynamic evolution during vehicle use, it is difficult to fully retain the association between operation behaviors and state changes, so the implicit causal rules in the data cannot be effectively converted into usable features. In addition, the prior art lacks a systematic check of whether data records conform to the objective rules of vehicle use, so abnormal data inconsistent with the vehicle's physical characteristics or usage logic may remain.
These problems limit how well existing second-hand vehicle data sets can support accurate analysis and decision making, and make it difficult to fully exploit the potential value of unstructured text data.
Disclosure of Invention
In view of the above, the present invention provides a method, an apparatus and a storage medium for constructing a second-hand vehicle data set based on data management. The technical scheme of the embodiments of the invention is realized as follows:
In one aspect, an embodiment of the invention provides a method for constructing a second-hand vehicle data set based on data management, comprising: obtaining an original vehicle description text in a second-hand vehicle transaction scene, and performing text cleaning and standardization processing on the original vehicle description text to obtain a vehicle description text set to be processed, wherein the vehicle description text set to be processed comprises state description information and operation record information of a vehicle in different use stages; performing event unit extraction on the vehicle description text set to be processed, and identifying and extracting event units containing operation behaviors and state changes, wherein the event units are used for representing the state change process of the vehicle at corresponding time nodes; performing double-base vector decomposition processing on each event unit, and decomposing the event unit into a double-base vector composed of an operation vector and a state vector, wherein the operation vector is used for representing the force direction of active intervention applied to the vehicle, and the state vector is used for representing the offset amplitude of the vehicle attributes under the intervention; embedding the double-base vectors into a preset three-dimensional event space, detecting the double-base vectors in the three-dimensional event space through a geometric constraint model of the vehicle physical state, identifying and correcting event units containing contradictions, and generating a correction vector set that eliminates the contradictions; and performing projection conversion processing on the correction vector set, projecting the correction vector set to a preset second-hand vehicle estimation model data field space, and generating structured data units aligned with that field space, wherein the structured data units are used for constructing a standardized second-hand vehicle data set.
In another aspect, an embodiment of the present invention provides a data set construction apparatus, including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements the steps of the method described above when executing the program.
In a third aspect, embodiments of the present invention provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method described above.
The method for constructing the s