CN-122019844-A - Data processing method, device, electronic equipment and storage medium

Abstract

The application relates to a data processing method, a device, electronic equipment and a storage medium. The method comprises: acquiring a plurality of types of data to be processed, wherein the plurality of types of data to be processed comprise at least two of document data, audio data and image data; processing the data to be processed in a corresponding data processing mode to obtain corresponding structured data, wherein the structured data comprises information related to entities; constructing a knowledge graph and a vector database based on the structured data; determining, based on obtained query information, a target query intention for a target entity in the query information, and determining a target query strategy based on the target query intention; and querying the knowledge graph and/or the vector database based on the target query strategy to obtain a query result. The method can improve the accuracy and reliability of knowledge retrieval.
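
The intention-to-strategy routing summarized above can be illustrated with a short sketch. The three intention types are those named in claim 1; everything else is an assumption made for illustration and is not the applicant's implementation: `QueryIntent`, `select_stores` and `run_query` are hypothetical names, and plain dictionaries stand in for the knowledge graph and the vector database.

```python
from enum import Enum, auto


class QueryIntent(Enum):
    """The three target query intentions named in claim 1."""
    RELATIONAL_INFERENCE = auto()
    CONTENT_RETRIEVAL = auto()
    FUZZY_INFERENCE = auto()


def select_stores(intent: QueryIntent) -> tuple[bool, bool]:
    """Map an intention to (query_graph, query_vectors).

    Relational inference -> knowledge graph only.
    Content retrieval    -> vector database only.
    Fuzzy inference      -> both stores.
    """
    query_graph = intent in (QueryIntent.RELATIONAL_INFERENCE,
                             QueryIntent.FUZZY_INFERENCE)
    query_vectors = intent in (QueryIntent.CONTENT_RETRIEVAL,
                               QueryIntent.FUZZY_INFERENCE)
    return query_graph, query_vectors


def run_query(intent: QueryIntent, entity: str,
              graph: dict, vector_index: dict) -> dict:
    """Query the stand-in stores according to the selected strategy.

    ASSUMPTION: `graph` maps an entity to a list of (relation, neighbour)
    pairs, and `vector_index` maps an entity to its candidate chunks.
    A real system would run a graph traversal and a similarity search.
    """
    query_graph, query_vectors = select_stores(intent)
    result = {}
    if query_graph:
        result["subgraph"] = graph.get(entity, [])
    if query_vectors:
        result["candidates"] = vector_index.get(entity, [])
    return result
```

For example, calling `run_query(QueryIntent.FUZZY_INFERENCE, "T1", graph, vector_index)` returns a dictionary with both a `subgraph` and a `candidates` entry, while the other two intentions each touch only one store.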

Inventors

SHAO YING
XU JUN

Assignees

北京和信瑞通电力技术股份有限公司
上海正泰自动化软件系统有限公司

Dates

Publication Date: 2026-05-12
Application Date: 2025-11-27

Claims (10)

1. A data processing method, the method comprising: acquiring a plurality of types of data to be processed, wherein the plurality of types of data to be processed comprise at least two of document data, audio data and image data; processing the data to be processed in a corresponding data processing mode to obtain corresponding structured data, wherein the structured data comprises information related to an entity; constructing a knowledge graph and a vector database based on the structured data, wherein the knowledge graph comprises a plurality of nodes and edges among the nodes, the nodes correspond to the entities, and the edges indicate the relations among the entities corresponding to the nodes; determining, based on obtained query information, a target query intention for a target entity in the query information, and determining a target query strategy based on the target query intention, wherein the target query intention comprises one of a relational inference intention, a content retrieval intention and a fuzzy inference intention; wherein if the target query intention comprises the relational inference intention, the target query strategy comprises querying a subgraph associated with the target entity in the knowledge graph; if the target query intention comprises the content retrieval intention, the target query strategy comprises querying candidate feature vectors associated with the target entity in the vector database; and if the target query intention comprises the fuzzy inference intention, the target query strategy comprises querying the subgraph associated with the target entity in the knowledge graph and querying the candidate feature vectors associated with the target entity in the vector database; and querying the knowledge graph and/or the vector database based on the target query strategy to obtain a query result.
2. The method of claim 1, wherein at least one node in the knowledge graph corresponds to a plurality of edges, each edge corresponding to one relationship type; wherein if the target query intention comprises the fuzzy inference intention, the target query strategy further comprises querying the knowledge graph first and then querying the vector database; and wherein the querying the knowledge graph and/or the vector database based on the target query strategy to obtain a query result comprises: determining a target relationship type corresponding to the target entity based on the query information; querying the knowledge graph based on the target relationship type and the query information to obtain a subgraph associated with the target entity; querying the vector database based on the nodes contained in the subgraph to obtain candidate feature vectors having a mapping relation with the nodes contained in the subgraph, or, alternatively, generating an enhanced query vector based on the subgraph and the query information and querying the vector database based on the enhanced query vector to obtain candidate feature vectors matching the enhanced query vector; and determining the query result based on the subgraph and the candidate feature vectors.
3. The method of claim 2, wherein the determining the query result based on the subgraph and the candidate feature vectors comprises: determining an association degree between the subgraph and the target entity and a similarity between each candidate feature vector and the target entity; determining a first weight corresponding to the subgraph and a second weight corresponding to the candidate feature vectors based on structural features of the subgraph; for each candidate feature vector, performing weighting based on the association degree of the subgraph and the first weight, and on the similarity of the candidate feature vector and the second weight, to obtain a matching score of the candidate feature vector; determining a target feature vector from the plurality of candidate feature vectors based on the matching score of each candidate feature vector; and obtaining the query result based on the target feature vector and the subgraph.
4. The method according to any one of claims 1 to 3, wherein the processing the data to be processed in a corresponding data processing mode to obtain corresponding structured data comprises: processing the data to be processed in a manner corresponding to the type of the data to be processed to obtain standardized data; evaluating the standardized data to obtain at least one of sensitivity information and complexity information of the standardized data, wherein the sensitivity information comprises at least one of a sensitivity score, sensitive entities, a sensitive type and a severity level corresponding to each sensitive entity, and a sensitivity level mark of the data to be processed to which the standardized data belongs; and if the sensitivity information meets a preset condition and/or the complexity is greater than a preset complexity threshold, performing format conversion on the standardized data to obtain converted standardized data, and performing structuring processing on the converted standardized data to obtain the corresponding structured data, wherein the preset condition comprises at least one of: the sensitivity score exceeds a preset score threshold; the number of sensitive entities whose severity level is greater than a preset level threshold exceeds a preset number threshold; the standardized data contains a sensitive entity with a target severity level and/or a target sensitive type; and the sensitivity level mark is a preset sensitivity level mark.
5. The method according to claim 4, wherein the data to be processed is document data, and the processing the data to be processed in a manner corresponding to the type of the data to be processed to obtain standardized data comprises: parsing the document data and extracting at least one first content block; and obtaining standardized data corresponding to the document data based on the at least one first content block, wherein each first content block has at least one piece of attribute information describing the first content block, the attribute information of the first content block comprises type information and hierarchy information of the first content block, and the type information of the first content block comprises at least one of a title, a list, a table or a paragraph.
6. The method according to claim 4, wherein the data to be processed is audio data, and the processing the data to be processed in a manner corresponding to the type of the data to be processed to obtain standardized data comprises: segmenting a transcribed text obtained from the audio data to obtain at least one speaking unit; identifying a question text and an answer text in the at least one speaking unit; associating the question text with the answer text to obtain at least one question-answer pair; and obtaining standardized data corresponding to the audio data based on the at least one question-answer pair.
7. The method according to claim 4, wherein the data to be processed is image data, and the processing the data to be processed in a manner corresponding to the type of the data to be processed to obtain standardized data comprises: performing recognition on the image data to obtain text description information describing the content of the image data; parsing the text description information and extracting at least one second content block; and obtaining standardized data corresponding to the image data based on the at least one second content block, wherein each second content block has at least one piece of attribute information describing the second content block, the attribute information of the second content block comprises type information of the second content block, and the type information of the second content block comprises at least one of a title, a list, a table or a paragraph.
8. A data processing apparatus, the apparatus comprising: an acquisition module, configured to acquire a plurality of types of data to be processed, wherein the plurality of types of data to be processed comprise at least two of document data, audio data and image data; and a processing module, configured to process the data to be processed in a corresponding data processing mode to obtain corresponding structured data, wherein the structured data comprises information related to an entity; wherein the apparatus is further used for: constructing a knowledge graph and a vector database based on the structured data, wherein the knowledge graph comprises a plurality of nodes and edges among the nodes, the nodes correspond to the entities, and the edges indicate the relations among the entities corresponding to the nodes; determining, based on obtained query information, a target query intention for a target entity in the query information, and determining a target query strategy based on the target query intention, wherein the target query intention comprises one of a relational inference intention, a content retrieval intention and a fuzzy inference intention; wherein if the target query intention comprises the relational inference intention, the target query strategy comprises querying a subgraph associated with the target entity in the knowledge graph; if the target query intention comprises the content retrieval intention, the target query strategy comprises querying candidate feature vectors associated with the target entity in the vector database; and if the target query intention comprises the fuzzy inference intention, the target query strategy comprises querying the subgraph associated with the target entity in the knowledge graph and querying the candidate feature vectors associated with the target entity in the vector database; and querying the knowledge graph and/or the vector database based on the target query strategy to obtain a query result.
9. An electronic device, comprising: a memory having a computer program stored thereon; and a processor for executing the computer program in the memory to implement the steps of the method of any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
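
The weighted scoring of claim 3 can be made concrete with a small sketch. The claims state only that the two weights are determined from structural features of the subgraph; they do not give a formula. The density-based rule below is therefore purely an assumption chosen for illustration, and `structure_weights` and `pick_target` are hypothetical names.

```python
def structure_weights(num_nodes: int, num_edges: int) -> tuple[float, float]:
    """Return (first_weight, second_weight) for the subgraph and the vectors.

    ASSUMPTION: a denser subgraph is trusted more. The source does not
    specify how structural features map to weights.
    """
    if num_nodes < 2:
        return 0.0, 1.0  # a lone node carries no relational evidence
    # Directed-graph density, clamped because one node pair may share
    # several typed edges.
    density = min(1.0, num_edges / (num_nodes * (num_nodes - 1)))
    return density, 1.0 - density


def pick_target(association: float, similarities: dict[str, float],
                num_nodes: int, num_edges: int) -> tuple[str, dict[str, float]]:
    """Score every candidate and return the best one with all scores.

    score = first_weight * association + second_weight * similarity
    """
    if not similarities:
        raise ValueError("at least one candidate feature vector is required")
    first_weight, second_weight = structure_weights(num_nodes, num_edges)
    scores = {
        candidate_id: first_weight * association + second_weight * similarity
        for candidate_id, similarity in similarities.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

With a single subgraph, the association term is the same for every candidate, so in this sketch it shifts all scores by the same amount rather than reordering them. It would matter if the scores were compared against a threshold or across several subgraphs.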

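The preset condition of claim 4 is a disjunction of several checks, which is easier to follow as code. The sketch below is an illustration under stated assumptions, not the applicant's implementation: all class names, field names and threshold values are hypothetical, and severity levels are assumed to be integers.

```python
from dataclasses import dataclass, field


@dataclass
class SensitiveEntity:
    text: str
    sensitive_type: str
    severity: int  # ASSUMPTION: higher number means more severe


@dataclass
class Sensitivity:
    score: float
    entities: list[SensitiveEntity] = field(default_factory=list)
    level_mark: str = "public"


@dataclass
class Thresholds:
    """All default values are placeholders chosen for the example."""
    score: float = 0.7
    severity: int = 2
    count: int = 3
    target_severity: int = 5
    target_types: frozenset[str] = frozenset({"credential"})
    level_mark: str = "confidential"
    complexity: float = 0.8


def meets_preset_condition(s: Sensitivity, t: Thresholds) -> bool:
    """True if at least one of the four checks listed in claim 4 holds."""
    severe_count = sum(1 for e in s.entities if e.severity > t.severity)
    has_target = any(
        e.severity == t.target_severity or e.sensitive_type in t.target_types
        for e in s.entities
    )
    return (
        s.score > t.score
        or severe_count > t.count
        or has_target
        or s.level_mark == t.level_mark
    )


def needs_format_conversion(s: Sensitivity, complexity: float,
                            t: Thresholds) -> bool:
    """Convert when the data is sensitive enough or complex enough."""
    return meets_preset_condition(s, t) or complexity > t.complexity
```

Data that passes neither check would skip the format conversion step in this sketch; the claim itself does not say how such data is handled.
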
Description

Data processing method, device, electronic equipment and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular to a data processing method, a data processing device, an electronic device, and a storage medium.

Background

With the continuous advancement of digital transformation, enterprises accumulate a large amount of unstructured text data in the course of equipment operation and maintenance, technical management, and the passing on of experience, including document fragments such as technical manuals, maintenance reports, expert interview records, and on-site meeting minutes. Enterprises currently store these document fragments in various documents, databases, and file systems. However, this storage approach easily leads to problems such as information fragmentation and loss of context, resulting in low accuracy of query results.

Disclosure of Invention

The embodiments of the application provide a data processing method, a data processing device, electronic equipment and a storage medium, which are used to solve the problem of low accuracy of query results from a knowledge base.

To achieve the above object, according to a first aspect of the present application, there is provided a data processing method comprising: acquiring a plurality of types of data to be processed, wherein the plurality of types of data to be processed comprise at least two of document data, audio data and image data; processing the data to be processed in a corresponding data processing mode to obtain corresponding structured data, wherein the structured data comprises information related to an entity; constructing a knowledge graph and a vector database based on the structured data, wherein the knowledge graph comprises a plurality of nodes and edges among the nodes, the nodes correspond to the entities, and the edges indicate the relations among the entities corresponding to the nodes; determining, based on obtained query information, a target query intention for a target entity in the query information, and determining a target query strategy based on the target query intention, wherein the target query intention comprises one of a relational inference intention, a content retrieval intention and a fuzzy inference intention; wherein if the target query intention comprises the relational inference intention, the target query strategy comprises querying a subgraph associated with the target entity in the knowledge graph; if the target query intention comprises the content retrieval intention, the target query strategy comprises querying candidate feature vectors associated with the target entity in the vector database; and if the target query intention comprises the fuzzy inference intention, the target query strategy comprises querying the subgraph associated with the target entity in the knowledge graph and querying the candidate feature vectors associated with the target entity in the vector database; and querying the knowledge graph and/or the vector database based on the target query strategy to obtain a query result.

Optionally, at least one node in the knowledge graph corresponds to a plurality of edges, and each edge corresponds to one relationship type; if the target query intention comprises the fuzzy inference intention, the target query strategy further comprises querying the knowledge graph first and then querying the vector database; and the querying the knowledge graph and/or the vector database based on the target query strategy to obtain a query result comprises: determining a target relationship type corresponding to the target entity based on the query information; querying the knowledge graph based on the target relationship type and the query information to obtain a subgraph associated with the target entity; querying the vector database based on the nodes contained in the subgraph to obtain candidate feature vectors having a mapping relation with the nodes contained in the subgraph, or, alternatively, generating an enhanced query vector based on the subgraph and the query information and querying the vector database based on the enhanced query vector to obtain candidate feature vectors matching the enhanced query vector; and determining the query result based on the subgraph and the candidate feature vectors.

Optionally, the determining the query result based on the subgraph and the candidate feature vectors comprises: determining an association degree between the subgraph and the target entity and a similarity between each candidate feature vector and the target entity; determining a first weight corresponding to the subgraph and a second weight corresponding to the candidate feature vectors based on structural features of the subgraph; for each candidate feature vector, performing weighting based on the association degree of the subgraph and the first weight, and the