CN-115658846-B - Intelligent search method and device suitable for open source software supply chain
Abstract
The invention relates to an intelligent search method and device suitable for an open source software supply chain. The method comprises: 1) receiving a natural language query question sent by a user and identifying its key elements using multiple methods, the identified elements comprising entities, concepts, relationship names, attribute names and numerical attributes; 2) generating candidate paths according to which key elements were identified; and 3) converting the candidate paths and ranking them with a matching and sorting model to obtain the search results. The invention provides a natural language search algorithm based on the knowledge graph data of an open source software supply chain, and gives users a high-performance interface for searching the graph data. By combining multiple methods for key element recognition, the invention safeguards result recall; by generating candidate paths case by case, it narrows the range of paths to be matched and sorted, thereby effectively improving natural language search over knowledge graph data.
Inventors
- CUI XING
- WU JINGZHENG
- LUO TIANYUE
- WU YANJUN
- GUO ZHI
Assignees
- 中国科学院软件研究所
Dates
- Publication Date
- 20260508
- Application Date
- 20220930
Claims (9)
- 1. An intelligent search method suitable for an open source software supply chain, comprising the following steps: receiving a natural language query question input by a user, and identifying key elements of the natural language query question, wherein the key elements are key elements in an open source software supply chain knowledge graph; generating candidate paths by adopting a multi-condition candidate path generation strategy according to the identified key elements; matching and sorting the candidate paths against the natural language query question, and taking the candidate path with the highest score in the matching and sorting results as the final search result; wherein the multi-condition candidate path generation strategy comprises the following cases: 1) if only a single key entity in the knowledge graph is identified among the key elements, the entity is used as an initialization node and is expanded along the triple directions of the knowledge graph, the expansion taking the first-order or second-order paths of the key entity as candidate paths; 2) if a single key entity and a single relation name in the knowledge graph are identified among the key elements, the entity is used as an initialization node, and a first-order path equal to the identified relation name or a second-order path containing the identified relation is used as a candidate path; 3) if two entities in the knowledge graph are identified among the key elements, a first-order or second-order path connecting the two entities is taken as a candidate path, and the candidate path is expanded by adding the first-order or second-order paths of the head entity or the tail entity; 4) if other relationships are identified among the key elements in addition to the two entities, the candidate paths obtained in 3) are screened in the same manner as in 2); 5) if a numerical attribute of the knowledge graph is identified during key element identification, candidate attributes are added to each node in the candidate path for restriction and screening, and nodes without the type attribute are removed.
- 2. The method of claim 1, wherein the key element identification on the natural language query question is performed by combining a plurality of methods, including: entity, concept, relationship name and attribute name recognition based on a synonym dictionary; entity recognition based on a sequence labeling model; rule-based entity recognition; and numerical attribute discovery and normalization.
- 3. The method of claim 1, wherein generating candidate paths using the multi-condition candidate path generation strategy based on the identified key elements comprises: first-order and second-order path expansion for a single key entity; candidate path selection and first-order and second-order path expansion for multiple key entities; filtering the candidate paths using the identified relationships; and screening and filtering the candidate paths using the specific type attribute.
- 4. The method of claim 1, wherein matching and sorting the candidate paths against the natural language query question comprises using a Sentence-BERT model as the path ranking model, extracting features with average pooling, and computing similarity scores with cosine similarity.
- 5. The method of claim 4, wherein the objective function for the sentence vectors generated by the Sentence-BERT model is constructed in one of three ways: ① concatenating the two sentence vectors and their element-wise difference vector: $o = \mathrm{softmax}(W_t(u, v, |u - v|))$, where $u$ and $v$ are the two sentence vectors, $o$ is the objective function, $W_t$ is a learnable weight parameter, and $|\cdot|$ denotes the element-wise difference; ② computing the cosine similarity of the two sentence vectors: $\cos(\theta) = \frac{u \cdot v}{\|u\|\,\|v\|}$, $l = \mathrm{MSE}(y, y') = \frac{1}{n}\sum_{i=1}^{n}(y_i - y'_i)^2$, where $l$ is the loss function, $\cos(\theta)$ is the cosine similarity, $n$ is the number of samples, $y$ is the actual label, $y'$ is the predicted label, and $\mathrm{MSE}(y, y')$ is the mean square error; ③ using an anchor sentence with positive and negative samples: $l = \max(\|s_a - s_p\| - \|s_a - s_n\| + \epsilon, 0)$, where $s_a$, $s_p$ and $s_n$ are the anchor sentence, the positive sample and the negative sample respectively, $l$ is the loss function, $\|\cdot\|$ denotes the Euclidean distance, and $\epsilon$ is a margin requiring $s_p$ to be at least $\epsilon$ closer to $s_a$ than $s_n$ is.
- 6. The method of claim 5, wherein way ③ is used during training and the trained model is then transferred to way ② for prediction.
- 7. An intelligent search device for an open source software supply chain employing the method of any one of claims 1-6, comprising: a key element identification module for receiving a natural language query question input by a user and performing key element identification on the natural language query question, wherein the key elements are key elements in an open source software supply chain knowledge graph; a candidate path generation module for generating candidate paths by adopting a multi-condition candidate path generation strategy according to the identified key elements; and a matching and sorting module for matching and sorting the candidate paths against the natural language query question, and taking the candidate path with the highest score in the matching and sorting results as the final search result.
- 8. A computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-6.
- 9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1-6.
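The multi-condition candidate path generation strategy of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the toy triples, the in-memory index and every function name below are hypothetical, and a real system would read paths from the graph database. The sketch covers cases 1) to 4) only; the path extension step of case 3) and the numerical attribute screening of case 5) are left out for brevity.

```python
from collections import defaultdict

# Toy (head, relation, tail) triples; all names are hypothetical examples.
TRIPLES = [
    ("openssl", "depends_on", "zlib"),
    ("curl", "depends_on", "openssl"),
    ("openssl", "maintained_by", "alice"),
    ("zlib", "maintained_by", "bob"),
]


def build_index(triples):
    """Index every triple under both of its endpoints."""
    index = defaultdict(list)
    for triple in triples:
        head, _, tail = triple
        index[head].append(triple)
        index[tail].append(triple)
    return index


def other_end(triple, node):
    """Return the endpoint of `triple` that is not `node`."""
    head, _, tail = triple
    return tail if node == head else head


def expand(index, entity):
    """Case 1: first-order and second-order paths starting at `entity`."""
    paths = []
    for first in index[entity]:
        paths.append((first,))
        middle = other_end(first, entity)
        for second in index[middle]:
            if second != first:
                paths.append((first, second))
    return paths


def filter_by_relation(paths, relation):
    """Cases 2 and 4: keep paths that contain the identified relation."""
    return [p for p in paths if any(t[1] == relation for t in p)]


def connect(index, source, target):
    """Case 3: first-order or second-order paths joining two entities."""
    joined = []
    for path in expand(index, source):
        end = source
        for triple in path:
            end = other_end(triple, end)
        if end == target:
            joined.append(path)
    return joined


def generate_candidates(index, entities, relations=()):
    """Pick the generation case from what was identified in the question."""
    if len(entities) == 1:
        paths = expand(index, entities[0])
    elif len(entities) == 2:
        paths = connect(index, entities[0], entities[1])
    else:
        return []
    for relation in relations:
        paths = filter_by_relation(paths, relation)
    return paths


index = build_index(TRIPLES)
for path in generate_candidates(index, ["openssl"], ["maintained_by"]):
    print(path)
```

Because the relation filter runs before any ranking, only paths that mention the identified relation reach the matching and sorting model, which is how the strategy narrows the ranking range.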
Description
Intelligent search method and device suitable for open source software supply chain
Technical Field
The invention belongs to the technical field of computers, and relates to an intelligent search method and device suitable for an open source software supply chain.
Background
Open source software has become the foundation of today's applications. During its development and operation, the related upstream communities, source code packages, binary packages, third-party component distribution markets, application software distribution markets, developers, maintainers, communities, foundations and so on form a supply relationship network through dependence, combination and similar relations; this network constitutes the open source software supply chain. The open source software supply chain generally uses a knowledge graph as its information carrier and a Neo4j graph database for data storage. Querying the supply chain knowledge graph data usually requires a query language such as Cypher: the querier must not only master the basic syntax but also have some knowledge of the entity and relationship types in the graph data, and in most cases ordinary users lack this ability.
At present, the mainstream methods for querying a knowledge graph directly in natural language are semantic parsing (Semantic Parser) and information extraction (Information Retrieval). The method based on semantic parsing converts a natural language question into a series of formal logical expressions that capture the semantic information of the whole question, converts these into query statements executable on the knowledge graph, and finally obtains the target data by running those query statements against the knowledge graph.
The method based on information extraction identifies and extracts the central entity in the question, queries the knowledge graph within the neighborhood of that entity node, takes each node, edge or path found there as a candidate answer, builds a model that converts the candidate answers and the question into feature vectors, and then compares their similarity to rank the candidates and obtain the final result.
However, faced with complex and diverse real-world data, neither recall nor precision is satisfactory when only a single method is used. Therefore, when building a query system, multiple algorithms should be combined to broaden recall in the early stage and to screen precisely in the later stage.
Disclosure of Invention
The invention aims to provide an intelligent search method and device suitable for an open source software supply chain, comprising three parts: key element mining based on multi-method cooperation, multi-condition candidate path generation, and candidate path matching and sorting. The method combines multiple methods to identify key elements so as to safeguard result recall, and generates candidate paths case by case so as to effectively narrow the range of paths to be matched and sorted.
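The vector-based candidate sorting described above, together with the average pooling, cosine similarity and anchor/positive/negative loss named in claims 4 and 5, can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the patented model: the vectors are made-up numbers standing in for Sentence-BERT outputs, and the function names and the default margin of 1.0 are hypothetical.

```python
import math


def mean_pool(token_vectors):
    """Average token vectors component-wise into one sentence vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def norm(vector):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| |v|); 0.0 if either vector is all zeros."""
    denominator = norm(u) * norm(v)
    if denominator == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / denominator


def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """l = max(|s_a - s_p| - |s_a - s_n| + margin, 0), used for training."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(gap + margin, 0.0)


def rank_paths(question_vector, path_vectors):
    """Score each candidate path by cosine similarity, best first."""
    scored = [
        (cosine_similarity(question_vector, vector), name)
        for name, vector in path_vectors.items()
    ]
    return sorted(scored, reverse=True)


# Made-up token vectors for a question and two candidate paths.
question = mean_pool([[1.0, 0.0], [1.0, 0.0]])
candidates = {"path_a": [0.0, 1.0], "path_b": [2.0, 0.0]}
print(rank_paths(question, candidates))
```

The first entry of the ranked list is the highest-scoring candidate path, which the method returns as the final search result.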
In order to achieve the above purpose, the invention adopts the following technical scheme:
An intelligent search method suitable for an open source software supply chain, comprising the following steps:
Receiving a natural language query question input by a user, and identifying key elements of the natural language query question, wherein the key elements are key elements in an open source software supply chain knowledge graph;
Generating candidate paths by adopting a multi-condition candidate path generation strategy according to the identified key elements;
Matching and sorting the candidate paths against the natural language query question, and taking the candidate path with the highest score in the matching and sorting results as the final search result.
Further, the key elements identified specifically comprise entities, concepts, relation names, attribute names and numerical attributes in the open source software supply chain knowledge graph.
Further, performing key element identification on the query question comprises the following steps:
1) Entity, concept, relationship name, and attribute name identification based on the synonym dictionary. In this step, the query question is segmented by a word segmentation tool, and the tokens obtained after segmentation are matched against a dictionary tree generated offline. The dictionary tree contains entity names, concept names, relationship names and attribute names in the knowledge graph, and synonyms and paraphrasing correspondi