CN-122019735-A - Academic question and answer oriented multi-model paper retrieval method
Abstract
The invention discloses a multi-model paper retrieval method for academic question answering, relating to the technical fields of natural language processing and information retrieval. The method first constructs a unified corpus and a training data set, and encodes them into document embedding vector sets using the models of a first model set and a second target model. Each model in the resulting model group then performs similarity search in parallel for a target query, producing a corresponding original similarity matrix; documents are screened based on these matrices to generate a first target document list for the target query. The method effectively improves the accuracy and robustness of academic document retrieval.
Inventors
- Cai Zulei
- He Shan
- Chen Fugui
- Li Jiayang
- Da Hai
Assignees
- Southwest Petroleum University (西南石油大学)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-04-13
Claims (10)
- 1. A multi-model paper retrieval method for academic question answering, characterized by comprising the following steps: Step S1, constructing a unified corpus containing candidate documents and a training data set containing query texts and their positive-document association relations; Step S2, presetting at least two models, of which one is a second target model and the others form a first model set comprising at least one language model different from the second target model; Step S3, encoding each query text with the second target model, and performing a primary retrieval in the document embedding vector set corresponding to the second target model based on each encoding result; Step S4, for each query text, screening documents that satisfy preset relevance conditions but are not positive documents, based on the primary retrieval result and the positive-document association relation, as hard negative samples for that query text, and constructing contrastive learning training samples comprising the query text, the positive documents and the hard negative samples; Step S5, performing contrastive learning fine-tuning on the second target model with the contrastive learning training samples, and re-encoding the unified corpus with the fine-tuned second target model to obtain its corresponding document embedding vector set; Step S6, encoding the target query with each model in the model group to generate corresponding query vectors, and performing similarity retrieval in parallel in the document embedding vector set corresponding to each model to obtain a corresponding original similarity matrix representing the similarity between the target query and the candidate documents under each model; Step S7, performing adaptive denoising and normalization on each original similarity matrix to obtain each model's similarity score for each candidate document, determining the weight of each model in the model group according to a weighting strategy, computing the weighted sum of each candidate document's similarity scores and the corresponding model weights to obtain a fusion score, screening documents according to the fusion scores of the candidate documents, and generating a first target document list for the target query.
- 2. The multi-model paper retrieval method for academic question answering according to claim 1, wherein in step S1, when constructing the unified corpus, the following operations are performed on each candidate document: Step S11, extracting the title text field and the abstract text field, and if the abstract is missing, extracting information from the introduction of the body text as the abstract; Step S12, cleaning each extracted text field to increase the density of effective information; and Step S13, linearly concatenating the cleaned title text field with the abstract text field to generate a document text sequence representing the core content of the document.
- 3. The multi-model paper retrieval method for academic question answering according to claim 2, wherein in step S1, constructing the unified corpus further comprises the following operations: Step S14, pre-constructing an instruction prompt lexicon covering multiple academic intent types, the lexicon comprising an instruction template string for each academic intent; and Step S15, identifying the academic intent of each candidate document from its abstract, prepending the corresponding instruction template to the document text sequence representing the core content of the document, and generating a model input sequence containing intent semantics to represent the candidate document.
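Steps S14 and S15 above can be sketched as a template lookup plus string concatenation. This is a minimal illustrative sketch: the intent labels, template wording, and the `[SEP]` separator are assumptions, since the patent does not specify them.

```python
# Hypothetical instruction-prompt lexicon (claim 3, step S14).
# Intent names and template strings are illustrative, not from the patent.
INTENT_TEMPLATES = {
    "method": "Instruct: retrieve papers proposing a method. Document: ",
    "survey": "Instruct: retrieve survey papers on the topic. Document: ",
    "result": "Instruct: retrieve papers reporting empirical results. Document: ",
}

def build_model_input(intent: str, title: str, abstract: str) -> str:
    """Step S15: prepend the intent template to the linearized
    title+abstract sequence produced in claim 2, step S13."""
    core = f"{title.strip()} [SEP] {abstract.strip()}"  # assumed separator
    return INTENT_TEMPLATES.get(intent, "Document: ") + core
```

A real system would derive the intent label with a classifier over the abstract; here it is simply passed in.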
- 4. The multi-model paper retrieval method for academic question answering according to claim 3, wherein in step S2, encoding the unified corpus comprises the following operations: Step S21, inputting the model input sequence of each candidate document into the model and performing forward propagation to obtain the corresponding last-layer hidden-state tensor sequence rich in semantic information; Step S22, aggregating each candidate document's last-layer hidden-state tensor sequence into a fixed-dimension vector using a mean-pooling strategy; and Step S23, applying L2-norm normalization to the fixed-dimension vector obtained in step S22 to obtain the document embedding vector of each candidate document.
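Steps S22 and S23 are standard mean pooling followed by L2 normalization. A minimal sketch, assuming the encoder's last-layer hidden states arrive as a `(seq_len, dim)` array (attention masking is omitted for brevity):

```python
import numpy as np

def embed_document(hidden_states: np.ndarray) -> np.ndarray:
    """Collapse a (seq_len, dim) last-layer hidden-state sequence into one
    fixed-dimension, L2-normalized document embedding (steps S22-S23)."""
    v = hidden_states.mean(axis=0)           # S22: mean aggregation over tokens
    return v / (np.linalg.norm(v) + 1e-12)   # S23: L2-norm normalization
```

After normalization, the dot product of two embeddings equals their cosine similarity, which is what the later similarity retrieval steps rely on.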
- 5. The multi-model paper retrieval method for academic question answering according to claim 1, wherein in step S5 the contrastive learning fine-tuning comprises: Step S51, keeping the backbone network parameters of the second target model frozen throughout fine-tuning; Step S52, attaching trainable low-rank adaptation matrices in parallel to the query, key, value and output projection layers of the attention module of the second target model, and to the up, down and gate projection layers of its feed-forward network module; Step S53, computing the loss function gradient on the contrastive learning training samples and updating only the weight parameters of the low-rank adaptation matrices. The loss function is computed as

  $$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\!\big(\mathrm{sim}(q_i, d_i^{+})/\tau\big)}{\exp\!\big(\mathrm{sim}(q_i, d_i^{+})/\tau\big) + \sum_{j} \exp\!\big(\mathrm{sim}(q_i, d_{i,j}^{-})/\tau\big)}$$

  where $\mathcal{L}$ denotes the in-batch contrastive loss, $q_i$ is the vector representation generated for the query text, $d_i^{+}$ is the vector representation generated for the positive document, $d_{i,j}^{-}$ is the vector representation generated for the $j$-th hard negative sample, $N$ is the total number of samples, $\mathrm{sim}(\cdot,\cdot)$ is a dot-product or cosine similarity function, and $\tau$ is a preset temperature coefficient.
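The contrastive loss in step S53 can be sketched for a single query as follows. This is a minimal numerical illustration using dot-product similarity; the temperature value is an assumption, and a training implementation would of course compute this over batches with autograd.

```python
import numpy as np

def info_nce_loss(q: np.ndarray, d_pos: np.ndarray,
                  d_negs: np.ndarray, tau: float = 0.05) -> float:
    """Contrastive loss for one query: q and d_pos are (dim,) vectors,
    d_negs is (num_negatives, dim). Similarity is the dot product."""
    pos = np.dot(q, d_pos) / tau        # temperature-scaled positive score
    negs = d_negs @ q / tau             # scores of the hard negative samples
    logits = np.concatenate([[pos], negs])
    # negative log-probability of the positive under a softmax over all docs
    return float(np.log(np.exp(logits).sum()) - pos)
```

The loss is near zero when the query is much closer to the positive document than to every hard negative, and grows as a hard negative overtakes the positive.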
- 6. The multi-model paper retrieval method for academic question answering according to claim 5, wherein in step S5, re-encoding the unified corpus with the fine-tuned second target model comprises: Step S55, numerically merging the updated weight parameters of the low-rank adaptation matrices into the backbone network parameters to generate a merged second target model; Step S56, encoding the unified corpus again with the merged second target model to generate an optimized second document embedding vector set.
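The numerical merge in step S55 amounts to folding the low-rank factors back into each frozen projection weight. A minimal sketch, assuming the common LoRA parameterization with scaling `alpha / r` (the patent does not state the exact scaling convention):

```python
import numpy as np

def merge_lora(W: np.ndarray, A: np.ndarray, B: np.ndarray,
               alpha: float, r: int) -> np.ndarray:
    """Step S55: fold a trained adapter into a frozen projection weight.
    W is (out, in), A is (r, in), B is (out, r); the merged weight is
    W' = W + (alpha / r) * B @ A, so inference needs no extra layers."""
    return W + (alpha / r) * (B @ A)
```

After merging, the corpus is simply re-encoded with the merged weights (step S56), with no adapter modules left in the forward pass.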
- 7. The multi-model paper retrieval method for academic question answering according to claim 1, wherein in step S6, after each model in the model group performs similarity retrieval with its query vector, the top K candidate documents with the highest similarity are retained to construct a corresponding second candidate document set, and the original similarity matrix represents the similarity between the target query and each candidate document in the second candidate document set corresponding to that model.
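The per-model top-K retrieval of claim 7 can be sketched as a brute-force scan; K is a tunable parameter not fixed by the patent, and production systems would typically use an approximate nearest-neighbor index instead.

```python
import numpy as np

def top_k_search(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 5):
    """Step S6 / claim 7: score every document embedding by dot product
    against the query vector and keep the K highest-scoring indices."""
    scores = doc_matrix @ query_vec        # one similarity per candidate
    idx = np.argsort(scores)[::-1][:k]     # indices of the top-K documents
    return idx, scores[idx]
```

With L2-normalized embeddings (claim 4, step S23), these dot products are cosine similarities, so the returned row is one model's slice of the original similarity matrix.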
- 8. The multi-model paper retrieval method for academic question answering according to claim 7, wherein step S7 comprises: Step S71, determining each model's similarity score for each candidate document: for each original similarity matrix, sorting its elements in descending order to generate a score sequence, computing the second-order difference sequence of the score sequence, taking the position corresponding to the maximum element of the second-order difference sequence as a cut-off point, constructing a third candidate document set from the candidate documents before the cut-off point in the score sequence, and normalizing the similarity scores of the candidate documents in the third candidate document set as the scores of the corresponding model for each candidate document; Step S72, determining the dynamic weight of each model in the model group: for each model, converting the similarity scores of the candidate documents in its second candidate document set into probability values, computing the model's retrieval entropy from these probabilities, and determining the dynamic weights from the retrieval entropies of all models, according to

  $$w_m = \frac{\exp(-\beta H_m)}{\sum_{k=1}^{M} \exp(-\beta H_k)}, \qquad H_m = -\sum_{d \in D_m} p_m(d)\,\log p_m(d)$$

  where $w_m$ is the dynamic weight of the $m$-th model, $\beta$ is a preset sensitivity adjustment factor, $M$ is the total number of models, $H_m$ is the retrieval entropy of the $m$-th model for the target query, $D_m$ is the second candidate document set of the $m$-th model, and $p_m(d)$ is the probability value of candidate document $d$ under model $m$; Step S73, computing the weighted sum of each candidate document's similarity scores under the models of the model group and the dynamic weights of those models to obtain the corresponding fusion score:

  $$S(d) = \sum_{m=1}^{M} w_m \, s_m(d)$$

  where $S(d)$ is the fusion score of candidate document $d$ for the target query and $s_m(d)$ is the similarity score of the $m$-th model for candidate document $d$; Step S74, screening documents according to the fusion scores of the candidate documents, and generating a first target document list for the target query.
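The denoising cut-off of step S71 and the entropy weighting of step S72 can be sketched as follows. The softmax conversion of scores to probabilities and the value of the sensitivity factor `beta` are assumptions; the patent only states that scores are "converted into probability values".

```python
import numpy as np

def elbow_cutoff(scores: np.ndarray) -> np.ndarray:
    """S71: sort descending and cut at the position where the second-order
    difference peaks, i.e. just after the sharpest drop in the scores."""
    s = np.sort(scores)[::-1]
    if len(s) <= 2:
        return s
    d2 = np.diff(s, n=2)               # second-order difference sequence
    cut = int(np.argmax(d2)) + 1       # documents before the elbow survive
    return s[:cut]

def entropy_weights(per_model_scores, beta: float = 1.0) -> np.ndarray:
    """S72: softmax each model's scores into probabilities, compute the
    retrieval entropy H_m, and weight models by exp(-beta * H_m), so a
    confident (low-entropy) model gets a larger dynamic weight."""
    H = []
    for s in per_model_scores:
        p = np.exp(s - s.max())
        p /= p.sum()
        H.append(-(p * np.log(p + 1e-12)).sum())
    w = np.exp(-beta * np.array(H))
    return w / w.sum()
```

A peaked score distribution (one clear winner) yields low entropy and hence a high weight; a flat distribution signals an unsure model and is down-weighted in the fusion sum of step S73.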
- 9. The multi-model paper retrieval method for academic question answering according to claim 8, further comprising secondarily filtering the candidate documents in the first target document list to generate a second target document list, specifically comprising: Step S81, taking the weighted average of the retrieval entropies of the target query over the models of the model group as the retrieval entropy $H_q$ of the target query; Step S82, converting the retrieval entropy $H_q$ of the target query into a diversity adjustment coefficient, according to

  $$\lambda = \frac{1}{1 + \exp\!\big((H_q - H_0)/\gamma\big)}$$

  where $\lambda$ is the diversity adjustment coefficient, $H_0$ is a pivot point of the retrieval entropy, and $\gamma$ is a hyperparameter; Step S83, computing a maximal marginal relevance score for each candidate document in the first target document list, using a maximal marginal relevance algorithm based on the diversity adjustment coefficient and the fusion scores, and iteratively selecting candidate documents, from high score to low, into the second target document list until a preset number of documents is reached, generating the final second target document list. The maximal marginal relevance score is computed as

  $$\mathrm{MMR}(d) = \lambda \, S(d) - (1 - \lambda) \max_{d' \in L_2} \mathrm{sim}(d, d')$$

  where $\mathrm{MMR}(d)$ is the maximal marginal relevance score of candidate document $d$, and $\max_{d' \in L_2} \mathrm{sim}(d, d')$ is the semantic similarity between the current candidate document $d$ and the most similar document already selected into the second target document list $L_2$, representing a redundancy penalty.
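The iterative selection of step S83 is the classic maximal-marginal-relevance loop. A minimal sketch, assuming the standard MMR form in which `lam` trades off the fusion score against redundancy with already-selected documents (the patent leaves the exact mapping from entropy to `lam` unspecified beyond a sigmoid-style conversion):

```python
def mmr_select(fusion, sim, lam, n_out):
    """S83: greedily pick the document maximizing
    lam * fusion[d] - (1 - lam) * max_{s in selected} sim[d][s].
    fusion: list of fusion scores; sim: pairwise similarity matrix."""
    remaining = list(range(len(fusion)))
    selected = []
    while remaining and len(selected) < n_out:
        def mmr(d):
            # redundancy penalty: similarity to the closest chosen document
            redundancy = max((sim[d][s] for s in selected), default=0.0)
            return lam * fusion[d] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a near-duplicate pair in the candidate list, the second copy is penalized and a less similar but still relevant document is promoted instead.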
- 10. The multi-model paper retrieval method for academic question answering according to claim 8, further comprising introducing graph computation to secondarily verify the first target document list, specifically comprising: Step S91, obtaining the reference list and cited-by list of each candidate document in the first target document list, and building a local citation topology graph with the candidate documents as nodes and citation relations as directed edges; Step S92, in the local citation topology graph, taking the fusion score of each candidate document in the first target document list as the activation energy of the corresponding node, and computing the weighted centrality of each node, according to

  $$C(v) = \gamma \sum_{u \in N_{\mathrm{out}}(v)} S(u)$$

  where $C(v)$ is the weighted centrality of node $v$, $N_{\mathrm{out}}(v)$ is the set of out-neighbor nodes of node $v$, $S(u)$ is the fusion score of the candidate document corresponding to out-neighbor node $u$, and $\gamma$ is an attenuation coefficient; Step S93, normalizing the weighted centralities of all nodes to generate topology weights for the corresponding candidate documents, and computing rearrangement scores by combining them with the fusion scores of the candidate documents, according to

  $$R(d) = S(d) + \mu \, T(d)$$

  where $R(d)$ is the rearrangement score of candidate document $d$, $T(d)$ is the topology weight of candidate document $d$, and $\mu$ is a regulating factor; Step S94, reordering the first target document list according to the final rearrangement scores and outputting the final target document list.
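Steps S92-S94 can be sketched as a small reranking pass over the citation graph. The min-max normalization, the additive combination of fusion score and topology weight, and the constants `gamma` and `mu` are assumptions; the patent specifies only "normalization" and "combining".

```python
def rerank_with_citations(fusion, out_neighbors, gamma=0.85, mu=0.3):
    """S92: centrality C(v) = gamma * sum of out-neighbor fusion scores.
    S93: min-max normalize centralities into topology weights and form
    rerank(d) = fusion[d] + mu * topo[d].  S94: sort by rerank score.
    fusion: {node: score}; out_neighbors: {node: list of cited nodes}."""
    cent = {v: gamma * sum(fusion[u] for u in nbrs)
            for v, nbrs in out_neighbors.items()}
    lo, hi = min(cent.values()), max(cent.values())
    topo = {v: (c - lo) / (hi - lo) if hi > lo else 0.0
            for v, c in cent.items()}
    rerank = {v: fusion[v] + mu * topo[v] for v in cent}
    return sorted(cent, key=lambda v: rerank[v], reverse=True)
```

A document that cites several high-scoring candidates gains centrality and can overtake a slightly higher-scored but topologically isolated document, which is the intended secondary verification effect.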
Description
Academic question and answer oriented multi-model paper retrieval method

Technical Field

The invention relates to the technical field of natural language processing and information retrieval, and in particular to a multi-model paper retrieval method oriented to academic question answering.

Background

With the deepening of academic research and the exponential growth in the number of documents, academic question-answering systems have become a key tool for researchers to obtain knowledge efficiently. To break through the limited semantic understanding of traditional keyword-matching techniques, mainstream schemes mostly adopt a dense retrieval paradigm based on deep learning: a pre-trained language model serves as an encoder that maps a user's natural language query and academic documents into the same high-dimensional vector space, and related documents are recalled by computing geometric distances between vectors.

However, queries in the academic field are usually highly abstract and terminologically specialized, and only very fine semantic boundaries separate correct answers from interference documents with similar surface wording. When a retrieval system relies only on a general pre-trained model, the model, lacking deep understanding of domain-specific logic, is easily misled by highly similar hard samples and cannot accurately distinguish subtle differences in core semantics. Conversely, if a single model undergoes aggressive fine-tuning to improve domain adaptation, it easily overfits a specific data distribution and loses its generalized understanding of diverse question patterns or cross-domain knowledge.
A single model thus struggles to reconcile domain specificity with semantic generalization. When facing high-difficulty academic question-answering tasks, the prior art can hardly suppress noise interference while maintaining broad semantic coverage, so retrieval results have low accuracy and the system lacks sufficient robustness.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a multi-model paper retrieval method oriented to academic question answering. To achieve the above object, the technical scheme of the invention is as follows. The multi-model paper retrieval method for academic question answering comprises the following steps:

Step S1, constructing a unified corpus containing candidate documents and a training data set containing query texts and their positive-document association relations.

Step S2, presetting at least two models, of which one is a second target model and the others form a first model set comprising at least one language model different from the second target model, thereby constructing a multi-dimensional semantic feature space and minimizing the systematic bias caused by a single architecture.

Step S3, encoding each query text with the second target model, and performing a primary retrieval in the document embedding vector set corresponding to the second target model based on each encoding result.

Step S4, for each query text, screening documents that satisfy preset relevance conditions but are not positive documents, based on the primary retrieval result and the positive-document association relation, as hard negative samples for that query text, and constructing contrastive learning training samples comprising the query text, the positive documents and the hard negative samples.

Step S5, performing contrastive learning fine-tuning on the second target model with the contrastive learning training samples, and re-encoding the unified corpus with the fine-tuned second target model to obtain its corresponding document embedding vector set.

Step S6, encoding the target query with each model in the model group to generate corresponding query vectors, and performing similarity retrieval in parallel in the document embedding vector set corresponding to each model to obtain a corresponding original similarity matrix representing the similarity between the target query and each candidate document under each model.

Step S7, performing adaptive denoising and normalization on each original similarity matrix to obtain each model's similarity score for each candidate document, determining the weight of each model in the model group according to a weighting strategy, computing the weighted sum of each candidate document's similarity scores and the corresponding model weights to obtain a fusion score, screening documents according to the fusion scores of each candidate document, and generating a first target document list for the target query.