CN-121979983-A - Question-answer pair retrieval enhancement generation method based on three-library separation architecture and cross vector conflict verification

Abstract

The invention relates to a question-answer pair retrieval enhancement generation method based on a three-library separation architecture and cross vector conflict verification. (2) An indicative aggregate vector library, a vector list library and an answer library are established and associated through a case_id. (3) The user query is parsed into an indicative evidence set and an exclusionary evidence set, and the corresponding vectors are generated. (4) The similarity between the indicative aggregate vector of the user query and the vectors recorded in the indicative aggregate vector library is calculated, the records exceeding a set threshold are screened out, and an initial candidate set is generated. (5) Forward and reverse conflict verification is performed on the initial candidate set, and conflicting records are eliminated. (6) The answers corresponding to the case_id values of the final candidate set are output as the query result. Through three-library separation and the establishment of exclusionary evidence, the method addresses the "answer leakage" problem of traditional indexes and improves retrieval accuracy under complex logic.

Inventors

LUO GUIZHONG
LUO GAN

Assignees

南京华淦信息技术有限公司

Dates

Publication Date: 20260505
Application Date: 20260115

Claims (7)

1. A question-answer pair retrieval enhancement generation method based on a three-library separation architecture and cross vector conflict verification, characterized by comprising the following steps: step 1, data structuring, namely parsing an original question-answer pair into an indicative evidence set, an exclusionary evidence set and an answer text; step 2, vectorization processing, namely vectorizing the two types of evidence sets to generate an indicative vector list and an exclusionary vector list respectively, and performing aggregation processing on the indicative vector list to generate the corresponding indicative aggregate vector, wherein the keyword list of the indicative evidence set is spliced by semantic connectors into a coherent text sequence, and the exclusionary evidence set is likewise represented as a keyword list; step 3, separate storage, namely establishing three databases, namely an indicative aggregate vector library, a vector list library and an answer library, wherein the indicative aggregate vector library stores the indicative aggregate vectors of the question-answer pairs, and the vector list library stores the indicative vector lists and the exclusionary vector lists; step 4, query analysis, namely receiving a user query, parsing the user query into an indicative evidence set and an exclusionary evidence set, and generating the corresponding indicative aggregate vector, indicative vector list and exclusionary vector list; step 5, preliminary screening recall, namely calculating the similarity between the indicative aggregate vector of the user query and the vectors in the indicative aggregate vector library, screening out the records exceeding a set threshold, and generating, according to the case_id, an initial candidate set comprising an indicative vector list and an exclusionary vector list; step 6, bidirectional logic conflict filtering, namely performing positive and negative conflict verification on the initial candidate set and eliminating conflicting records; and step 7, outputting a result, namely outputting the answers corresponding to the case_id values of the filtered candidate set as the query result.
2. The question-answer pair retrieval enhancement generation method based on the three-library separation architecture and the cross vector conflict verification according to claim 1, characterized in that in step 1, the indicative evidence set is a keyword list corresponding to the features confirmed to exist in the question description, the exclusionary evidence set is a keyword list corresponding to the features confirmed not to exist in the question description, and the answer text is the corresponding question-answer pair text.
3. The question-answer pair retrieval enhancement generation method based on the three-library separation architecture and the cross vector conflict verification according to claim 1, wherein in step 2, a word vector model is used to convert each keyword of the indicative evidence set into a word vector to obtain the indicative vector list; the coherent text sequence of the indicative evidence set is converted into a word vector to obtain the indicative aggregate vector; each keyword of the exclusionary evidence set is converted into a word vector to obtain the exclusionary vector list; the lengths of the two lists are the numbers of keywords in the indicative evidence set and the exclusionary evidence set respectively; and all vectors referred to above and below are normalized.
4. The question-answer pair retrieval enhancement generation method based on the three-library separation architecture and the cross vector conflict verification according to claim 1, wherein in step 4, the user query is received and parsed into an indicative evidence set and an exclusionary evidence set, and the corresponding query indicative aggregate vector, query indicative vector list and query exclusionary vector list are generated; the lengths of the two lists are the numbers of keywords in the query indicative evidence set and the query exclusionary evidence set respectively.
5. The question-answer pair retrieval enhancement generation method based on the three-library separation architecture and the cross vector conflict verification according to claim 1, wherein in step 5, the similarity is the cosine similarity between the query indicative aggregate vector and a vector in the indicative aggregate vector library, i.e. their dot product divided by the product of their norms; the records whose similarity to the query indicative aggregate vector is greater than a threshold are screened from the indicative aggregate vector library, and an initial candidate set including an indicative vector list and an exclusionary vector list is generated by case_id.
6. The question-answer pair retrieval enhancement generation method based on the three-library separation architecture and the cross vector conflict verification according to claim 1, wherein in step 6, the positive and negative conflict verification is as follows: positive conflict verification, namely calculating the pairwise similarity between the vectors of the indicative vector list of each record in the initial candidate set and the vectors of the query exclusionary vector list, and if the maximum similarity exceeds a threshold, judging that the evidence conflicts and removing the record from the initial candidate set; negative conflict verification, namely calculating the pairwise similarity between the vectors of the exclusionary vector list of each record in the initial candidate set and the vectors of the query indicative vector list, and if the maximum similarity exceeds a threshold, judging that the evidence conflicts and removing the record from the initial candidate set.
7. The question-answer pair retrieval enhancement generation method based on the three-library separation architecture and the cross vector conflict verification according to claim 6, wherein the positive and negative conflict verification proceeds as follows: for the initial candidate set C_init, the following two conflict checks are executed to eliminate conflicting records: detection logic I, exclusion item collision, namely detecting whether a feature explicitly excluded by the user appears among the indicative features of a record in the initial candidate set; given the user query exclusionary vector list and the indicative vector list of the p-th record in C_init, together with the number of vectors in each list, the pairwise similarity between the vectors of the two lists is computed; given a threshold, if the maximum pairwise similarity exceeds the threshold, it is determined that a first-type logical conflict exists, i.e. a feature excluded by the user is present among the indicative features of the p-th record, and the record is removed from C_init; detection logic II, indication item collision, namely detecting whether a feature the user has confirmed to exist appears among the exclusionary features of a record in the initial candidate set; given the user query indicative vector list and the exclusionary vector list of the p-th record in C_init, together with the number of vectors in each list, the pairwise similarity between the vectors of the two lists is computed; if the maximum pairwise similarity exceeds the threshold, it is determined that a second-type logical conflict exists, i.e. a feature confirmed by the user is present among the exclusionary features of the p-th record, and the record is removed from C_init.
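
The preliminary recall of claim 5 and the two conflict checks of claims 6 and 7 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Record`, `recall`, `conflict_filter`, `recall_threshold` and `conflict_threshold` are hypothetical, the 3-dimensional vectors are hand-picked toy values rather than outputs of a word vector model, and the thresholds 0.7 and 0.9 are arbitrary.

```python
import math
from dataclasses import dataclass


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_pairwise(list_a, list_b):
    """Largest cosine similarity over all vector pairs; 0.0 if either list is empty."""
    return max((cosine(a, b) for a in list_a for b in list_b), default=0.0)


@dataclass
class Record:  # hypothetical in-memory view of one stored question-answer pair
    case_id: str
    aggregate: list     # indicative aggregate vector
    indicative: list    # list of indicative keyword vectors
    exclusionary: list  # list of exclusionary keyword vectors


def recall(query_aggregate, records, recall_threshold):
    """Step 5: keep records whose aggregate vector is similar enough to the query's."""
    return [r for r in records
            if cosine(query_aggregate, r.aggregate) > recall_threshold]


def conflict_filter(candidates, query_indicative, query_exclusionary,
                    conflict_threshold):
    """Step 6: drop every candidate that fails either conflict check."""
    kept = []
    for r in candidates:
        # Detection logic I: a feature the user excluded appears among
        # the record's indicative features.
        if max_pairwise(query_exclusionary, r.indicative) > conflict_threshold:
            continue
        # Detection logic II: a feature the user confirmed appears among
        # the record's exclusionary features.
        if max_pairwise(query_indicative, r.exclusionary) > conflict_threshold:
            continue
        kept.append(r)
    return kept


# Toy axes (assumed): [cannot boot, power failure, network failure].
NO_BOOT, POWER, NETWORK = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
records = [
    Record("A", [0.8, 0.6, 0.0], [NO_BOOT, POWER], []),  # power-failure case
    Record("B", [1.0, 0.0, 0.0], [NO_BOOT], [POWER]),    # power already ruled out
    Record("C", [0.0, 0.0, 1.0], [NETWORK], []),         # unrelated case
]

# Query: "cannot boot, power failure has been ruled out".
initial = recall(NO_BOOT, records, recall_threshold=0.7)
final = conflict_filter(initial, [NO_BOOT], [POWER], conflict_threshold=0.9)
print([r.case_id for r in initial], [r.case_id for r in final])
```

In this toy run the recall step keeps records A and B, and detection logic I then removes A because the excluded "power failure" feature matches one of A's indicative vectors, leaving only B. Using a single `conflict_threshold` for both checks is a simplification of this sketch.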

Description

Question-answer pair retrieval enhancement generation method based on three-library separation architecture and cross vector conflict verification

Technical Field

The invention relates to the technical field of artificial intelligence and natural language processing, and in particular to an optimization technique for Question-Answer pair (QA-Pair) based retrieval-augmented generation (RAG) systems, which achieves noise-resistant retrieval through three-dimensional decoupled storage of the indicative evidence set, the exclusionary evidence set and the answer, combined with a cross conflict filtering algorithm.

Background

RAG technology is widely used in intelligent question-answering systems, and in enterprise-level knowledge bases the data usually takes the form of standardized QA-Pairs. However, existing QA-Pair based RAG techniques have the following technical bottlenecks:

(1) Confused index granularity leads to "answer leakage". Conventional techniques typically splice the question and the answer into a single text block for embedding indexing. When a user enters a query, the search engine easily matches high-frequency generic words in the answer (e.g., "restart", "check power"), and therefore recalls erroneous records that are unrelated to the problem description and similar only in the wording of the solution, polluting the search results.

(2) No discrimination of "exclusionary logic". Traditional QA-Pair retrieval relies on semantic similarity in a single vector space; it can handle "what happened" (positive matching) but cannot handle "what did not happen / what has been ruled out" (negative logic). For example, when a user query explicitly states that "power failure has been ruled out", a conventional system may still recall the answer for "power failure" because of the high relevance of the term "power".
Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a question-answer pair retrieval enhancement generation method based on a three-library separation architecture and cross vector conflict verification. The method stores QA-Pair data in an "indicative-exclusionary-answer" three-dimensional decoupled storage architecture, constructs two independent vector spaces for the question using a word vector model, and realizes logic filtering through a cross-dimensional vector conflict algorithm.

In order to achieve the above object, the invention adopts the following technical scheme: a question-answer pair retrieval enhancement generation system based on a three-library separation architecture and cross vector conflict verification, which does not involve the parsing of unstructured documents but operates directly on structured QA-Pair data.

1. Database construction (three-dimensional decoupling). Each question-answer pair C stored by the system consists of the following three independent parts:
(I) Indicative evidence set: the set of features confirmed to exist in the question description;
(II) Exclusionary evidence set: the set of features confirmed to be absent or excluded in the question description;
(III) Answer (A_c): the corresponding question-answer pair text.

2. Vector database architecture. Three independent databases are set up and associated by a unique question-answer identifier (case_id):
(I) Indicative aggregate vector database: stores the aggregate vectors of the indicative evidence sets (used for prescreening).
(II) Vector list database: stores the indicative vector lists and the exclusionary vector lists.
(III) Answer database: stores only the original answer text of the question-answer pair and does not participate in vector retrieval calculation, which prevents semantic interference.
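
The three-database layout keyed by case_id can be pictured with a small in-memory sketch. Everything here is an assumption made for illustration: the class name `ThreeLibraryStore`, its method names, the use of plain dictionaries in place of real vector databases, and the sample vectors and answer text.

```python
from dataclasses import dataclass, field


@dataclass
class ThreeLibraryStore:
    """Hypothetical in-memory stand-in for the three separate databases."""
    # case_id -> indicative aggregate vector (the only library searched at prescreening)
    aggregate_lib: dict = field(default_factory=dict)
    # case_id -> {"indicative": [...], "exclusionary": [...]} keyword-vector lists
    vector_list_lib: dict = field(default_factory=dict)
    # case_id -> answer text; stored as plain text and never embedded
    answer_lib: dict = field(default_factory=dict)

    def add(self, case_id, aggregate, indicative, exclusionary, answer):
        """Write one question-answer pair into all three libraries under one case_id."""
        self.aggregate_lib[case_id] = aggregate
        self.vector_list_lib[case_id] = {
            "indicative": indicative,
            "exclusionary": exclusionary,
        }
        self.answer_lib[case_id] = answer

    def vector_lists(self, case_id):
        """Fetch the vector lists of a recalled record for conflict verification."""
        return self.vector_list_lib[case_id]

    def answer(self, case_id):
        """Fetch the answer text only after a record survives filtering."""
        return self.answer_lib[case_id]


store = ThreeLibraryStore()
store.add(
    case_id="case-001",                 # illustrative identifier
    aggregate=[1.0, 0.0],               # toy vectors, not model output
    indicative=[[1.0, 0.0]],
    exclusionary=[[0.0, 1.0]],
    answer="Illustrative answer text.",
)
print(store.answer("case-001"))
```

Because the answer text lives only in `answer_lib` and is looked up by case_id after filtering, generic wording in an answer has no way to influence similarity scores in this sketch.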
The indicative aggregate vector database and the vector list database are associated by the unique identifier case_id, and the answer database does not participate in vector computation.

3. User query data structure. The query Q entered by the user is structurally defined as:
(I) Query indicative evidence set: the set of features the user confirms to exist.
(II) Query exclusionary evidence set: the set of features the user confirms to be absent.

A question-answer pair retrieval enhancement generation method based on three-library separation architecture and cross vector conflict verification comprises the following steps:

Step S1, dual-granularity independent vectorization and separate storage: a word vector model (such as Qwen-Embedding) is used to vectorize, encode and store the indicative evidence set and the exclusionary evidence set of the QA-Pair data.

1. Indicative evidence processing:
(a) Aggregation: the keyword list in the indicative evidence set is spliced by semantic connectors into a coherent text sequence.
(b) Vectorization: the word vector model encodes the keywords of the list one by one to generate the indicative vector list. Consecutive text sequences that would indicate evid