CN-121980021-A - Domain-to-entity self-adaptive document rearrangement method
Abstract
The invention relates to the technical field of information retrieval and machine learning, and provides a domain-to-entity self-adaptive document rearrangement method. The method comprises inputting a plurality of discrimination sample pairs into a rearrangement model to calculate comprehensive relevance scores and obtain a corresponding document rearrangement result. In the rearrangement model, a pre-training rearrangement module outputs semantic basic relevance scores; a domain/entity identification module generates domain weight parameters for a plurality of target domains through a classification network and determines the corresponding company entity; a joint self-adaptive adjustment module performs sparse selection on the low-rank adaptation branches corresponding to the plurality of target domains and models incremental features of the discrimination sample pairs based on the activated low-rank matrices to generate corresponding incremental relevance scores; and a reordering execution module sorts the candidate document fragments by comprehensive relevance score and outputs the result.
Inventors
- WANG XINYU
- JIANG CHAOLONG
- LI MUZHI
- LU PENG
- Wang Suyuchen
- HUANG JIERUI
- MA LIHENG
- TIAN JINGRUI
- WANG TAO
- ZHOU LING
- Chi Jijun
- Tai Zhenghan
- HE HAILIN
- WU HANWEI
- HU QINGCHEN
- DING LEI
- Guo Tongshen
Assignees
- 辰光幻影(上海)文化科技有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-11-27
Claims (10)
- 1. A domain-to-entity adaptive document reordering method, comprising the steps of: obtaining a plurality of discrimination sample pairs, each composed of a query field and a corresponding candidate document fragment; inputting the plurality of discrimination sample pairs into a trained rearrangement model; calculating a corresponding comprehensive relevance score for each discrimination sample pair; and sorting the candidate document fragments based on the comprehensive relevance scores to obtain a corresponding document rearrangement result; wherein the rearrangement model comprises a pre-training rearrangement module, a domain/entity identification module, a joint self-adaptive adjustment module, and a reordering execution module; the pre-training rearrangement module models the semantic relevance of each discrimination sample pair and outputs a semantic basic relevance score; the domain/entity identification module performs semantic encoding on the query field, generates domain weight parameters for a plurality of target domains through a classification network, and determines a corresponding company entity based on a company name or entity characterization information contained in the query field; the joint self-adaptive adjustment module performs sparse selection on low-rank adaptation branches corresponding to the plurality of target domains according to the output domain weight parameters, activates the low-rank matrices in those low-rank adaptation branches whose domain weight parameters exceed a preset activation threshold, models incremental features of the discrimination sample pairs based on the activated low-rank matrices, and generates corresponding incremental relevance scores, wherein, when selecting the low-rank matrices that participate in the incremental relevance score calculation, a subdivided low-rank matrix corresponding to the company entity is loaded preferentially; and the reordering execution module sorts the candidate document fragments of the query field by the comprehensive relevance score, obtained by linearly superposing the semantic basic relevance score and the incremental relevance score, and outputs the document rearrangement result.
- 2. The document reordering method of claim 1, wherein obtaining a plurality of discrimination sample pairs composed of query fields and corresponding candidate document fragments comprises: in response to a query request input by a user, retrieving from an external retriever a candidate document fragment set matching the query field in the query request; denoting the query field as $q$ and each candidate document fragment in the candidate document fragment set as $d_i$; and, according to a unified input specification, combining and packaging the query field $q$ with each candidate document fragment $d_i$ to form a plurality of discrimination sample pairs $(q, d_i)$, each composed of the query field and a fragment field; the plurality of discrimination sample pairs $(q, d_i)$ together form a discrimination batch centered on the same query field $q$, which serves as the relevance discrimination input of the rearrangement model, wherein $d_i$ is the $i$-th candidate document fragment, $i$ is the index of the candidate document fragment and ranges from 1 to $n$, and $n$ is a natural number greater than 0 representing the number of candidate document fragments.
- 3. The document reordering method of claim 1, wherein the pre-training rearrangement module modeling the semantic relevance of each discrimination sample pair and outputting a semantic basic relevance score comprises: the pre-training rearrangement module receives the plurality of input discrimination sample pairs and semantically encodes the query field and the candidate document fragment of each pair to generate a query semantic vector and a document semantic vector, respectively; and, based on the query semantic vector and the document semantic vector, calculating a log-odds (logit) score for each discrimination sample pair using a binary discrimination structure, and outputting the score after Sigmoid normalization to the range 0 to 1.
- 4. The document reordering method of claim 1, wherein the domain/entity identification module performing semantic encoding on the query field, generating domain weight parameters for a plurality of target domains through a classification network, and determining a corresponding company entity based on a company name or entity characterization information contained in the query field comprises: the router receives the plurality of input discrimination sample pairs, performs word embedding and contextual feature extraction on the query field, and generates a query semantic vector for discriminating the domain and the entity; the query semantic vector is input into the classification network, which performs domain discrimination over a plurality of preset target domains to obtain a discrimination probability distribution over the target domains; and the domain weight parameter corresponding to each target domain is determined according to the discrimination probability distribution, the corresponding company entity is identified based on the company name or contextual entity characteristics contained in the query field, and the domain weight parameters and the company entity are output, wherein the domain weight parameters and the company entity are used by the joint self-adaptive adjustment module to perform semantic adaptation at the domain and entity levels.
- 5. The document reordering method of claim 1, wherein performing sparse selection on the low-rank adaptation branches corresponding to the plurality of target domains according to the output domain weight parameters, and activating the low-rank matrices in the low-rank adaptation branches whose domain weight parameters exceed the preset activation threshold, comprises: receiving the domain weight parameters and the company entity output by the router, performing weight screening on the plurality of low-rank adaptation branches, and determining the target low-rank adaptation branches whose domain weight parameters exceed the preset activation threshold; and performing branch selection or on-demand combination on the activated target low-rank adaptation branches so that their low-rank matrices participate in the semantic feature adjustment of the current discrimination sample pair, generating a low-rank difference matrix for adjusting the backbone weights of the pre-training rearrangement module, applying the low-rank difference matrix to the semantic feature representation of the discrimination sample pair, and calculating the adjusted semantic matching score as the corresponding incremental relevance score.
- 6. The document reordering method of claim 5, wherein receiving the domain weight parameters and the company entity output by the router, performing weight screening on the plurality of low-rank adaptation branches, and determining the target low-rank adaptation branches whose domain weight parameters exceed the preset activation threshold comprises: receiving the plurality of domain weight parameters and the company entity information output by the router, and determining the corresponding target low-rank adaptation branches among the plurality of low-rank adaptation branches based on each domain weight parameter; and, for each target low-rank adaptation branch, loading the corresponding low-rank matrix according to the company entity information, wherein the subdivided low-rank matrix is selected when a subdivided low-rank matrix corresponding to the company entity information is found, and the domain general low-rank matrix is selected when no such subdivided low-rank matrix is found.
- 7. The document reordering method of claim 1, wherein the reordering execution module sorting the candidate document fragments of the query field by the comprehensive relevance score, obtained by linearly superposing the semantic basic relevance score and the incremental relevance score, and outputting the document rearrangement result comprises: receiving the semantic basic relevance score output by the pre-training rearrangement module and the incremental relevance score output by the joint self-adaptive adjustment module; fusing the semantic basic relevance score and the incremental relevance score with a linear superposition function to obtain the comprehensive relevance score; and sorting the candidate document fragments in descending order of comprehensive relevance score to output the corresponding document rearrangement result.
- 8. The document reordering method of claim 1, further comprising executing a pre-training phase of the rearrangement model, the pre-training phase being configured to generate a domain general low-rank matrix for each of the target domains and a subdivided low-rank matrix for the company entity information in the target domains, for invocation by the low-rank adapter.
- 9. The document reordering method of claim 8, wherein generating the domain general low-rank matrix for each of the target domains comprises: constructing an entity-abstracted training data set based on a general financial corpus, in which explicit entity names, including company, person, and product names, are masked or replaced with abstract placeholders; and training the pre-training rearrangement module on positive and negative samples automatically generated by a large language model, so as to obtain the domain general low-rank matrix that represents the semantic features of each target domain.
- 10. The document reordering method of claim 9, wherein generating the subdivided low-rank matrix corresponding to the company entity information in the target domain comprises: after training of the domain general low-rank matrix is completed, constructing a self-adaptive training set based on the real retrieval distribution of the target company entity information; automatically labeling positive and negative samples using the large language model; performing contrastive fine-tuning on the domain general low-rank matrix, using hard negative examples generated during retrieval together with random negative examples sampled from the corpus, to generate the subdivided low-rank matrix corresponding to the company entity information; and, after the contrastive fine-tuning is completed, freezing the domain general low-rank matrix and the subdivided low-rank matrix and storing them in the corresponding domain low-rank adaptation branch for loading and invocation by the low-rank adapter.
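As a non-limiting illustration, the scoring flow recited in claims 1 and 5 to 7 may be sketched as follows: branches are gated by the activation threshold, a company-specific (subdivided) adapter is preferred over the domain general one, and the incremental score is linearly superposed on the Sigmoid-normalized base score before sorting in descending order. The claims do not fix the details, so the following are assumptions made only for this sketch: the names `LowRankAdapter`, `DomainBranch`, and `rerank`, the threshold value 0.3, the scalar-output form of the low-rank update (delta = B · (A x)), the weighting of activated branches by their domain weights, and the toy feature vectors.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


@dataclass
class LowRankAdapter:
    """Hypothetical rank-r update with a scalar output: delta(x) = B . (A x)."""
    A: Matrix  # r x d down-projection
    B: Vector  # length-r up-projection to a scalar score

    def delta(self, x: Vector) -> float:
        hidden = [sum(a * v for a, v in zip(row, x)) for row in self.A]
        return sum(b * h for b, h in zip(self.B, hidden))


@dataclass
class DomainBranch:
    """One low-rank adaptation branch: a domain general adapter plus per-company adapters."""
    general: LowRankAdapter
    by_company: Dict[str, LowRankAdapter] = field(default_factory=dict)

    def select(self, company: Optional[str]) -> LowRankAdapter:
        # Prefer the subdivided (company-specific) adapter; otherwise fall back
        # to the domain general adapter.
        if company is not None and company in self.by_company:
            return self.by_company[company]
        return self.general


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def rerank(
    base_logits: Sequence[float],
    features: Sequence[Vector],
    domain_weights: Dict[str, float],
    branches: Dict[str, DomainBranch],
    company: Optional[str],
    threshold: float = 0.3,  # assumed activation threshold
) -> List[Tuple[int, float]]:
    """Return (fragment index, comprehensive score) pairs, highest score first."""
    # Sparse selection: only branches whose domain weight exceeds the threshold run.
    active = [
        (weight, branches[name].select(company))
        for name, weight in domain_weights.items()
        if weight > threshold and name in branches
    ]
    scored = []
    for i, (logit, x) in enumerate(zip(base_logits, features)):
        base = sigmoid(logit)  # semantic basic relevance score in (0, 1)
        # Assumption: activated branches are combined weighted by domain weight.
        incremental = sum(w * adapter.delta(x) for w, adapter in active)
        scored.append((i, base + incremental))  # linear superposition
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data: two candidate fragments, two domains, one company-specific adapter.
branches = {
    "banking": DomainBranch(
        general=LowRankAdapter(A=[[1.0, 0.0]], B=[0.1]),
        by_company={"ExampleCo": LowRankAdapter(A=[[0.0, 1.0]], B=[0.5])},
    ),
    "insurance": DomainBranch(general=LowRankAdapter(A=[[1.0, 1.0]], B=[10.0])),
}
ranking = rerank(
    base_logits=[0.0, 0.0],
    features=[[1.0, 0.0], [0.0, 1.0]],
    domain_weights={"banking": 0.8, "insurance": 0.2},
    branches=branches,
    company="ExampleCo",
)
print(ranking)  # fragment 1 ranks ahead of fragment 0
```

In this toy run the "insurance" branch stays inactive because its weight is below the assumed threshold, and the "ExampleCo" adapter is loaded in place of the domain general "banking" adapter. Passing `company=None` falls back to the domain general adapter and reverses the order of the two fragments.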
Description
Domain-to-entity self-adaptive document rearrangement method
Technical Field
The invention relates to the technical field of information retrieval and machine learning, and in particular to a domain-to-entity self-adaptive document rearrangement method that calculates and outputs relevance ranking scores for a candidate set generated by any retriever, improving retrieval quality and usability under cross-entity migration.
Background
With the development of financial technology and information retrieval technology, the volume of financial disclosures and regulatory filings keeps growing. These documents are large and structurally complex, dense with terminology, and often mix several forms such as text and tables. Automatic question answering and retrieval-augmented generation (RAG) over such documents have become an important technical direction for financial information analysis. Existing systems generally rely on general-purpose semantic retrieval and document rearrangement models, improving relevance ranking performance by training or fine-tuning on large-scale financial corpora. In recent research and evaluation tasks, retrieval quality, numerical reasoning capability, and cross-company migration performance have become key indicators of the robustness and interpretability of a financial question-answering system, and a strongly relevance-focused rearrangement module has become a core component for improving the performance of an end-to-end RAG system.
However, existing financial document rearrangement techniques still have significant limitations. On the one hand, general question-answering or retrieval-augmented models are prone to migration degradation and semantic distribution mismatch when applied across companies, resulting in insufficient ranking stability and reduced end-to-end answer accuracy.
On the other hand, existing schemes often depend on a large-scale, expensive end-to-end training pipeline, or require a large number of manually labeled samples when migrating to a target company. The models lack an efficient self-adaptive training path from general financial semantics to target entity characteristics, and stable improvement is difficult to achieve with limited specialized data. In addition, rearrangement models are tightly coupled to specific retrieval implementations and lack a flexible mechanism for rapidly adapting to different entities and domains, which limits their extensibility and practical deployability in real financial scenarios. Therefore, a document rearrangement training method and system that is decoupled from the retriever, adapts quickly, and keeps labeling cost controllable is needed to effectively improve retrieval quality under cross-entity migration.
Disclosure of Invention
The invention aims to solve the problems of cross-company migration degradation, high labeling cost, and poor model adaptability in financial document rearrangement, and provides a domain-to-entity self-adaptive document rearrangement method that achieves rapid migration of a model from general semantics to a target entity, improves ranking stability and retrieval precision, and reduces training and deployment cost. The aim of the invention can be achieved by the following technical scheme:
The invention provides a domain-to-entity self-adaptive document rearrangement method, which comprises the following steps: obtaining a plurality of discrimination sample pairs, each composed of a query field and a corresponding candidate document fragment; inputting the plurality of discrimination sample pairs into a trained rearrangement model; calculating a corresponding comprehensive relevance score for each discrimination sample pair; and sorting the candidate document fragments based on the comprehensive relevance scores to obtain a corresponding document rearrangement result, wherein the rearrangement model comprises a pre-training rearrangement module, a domain/entity identification module, a joint self-adaptive adjustment module, and a reordering execution module; the pre-training rearrangement module models the semantic relevance of each discrimination sample pair and outputs a semantic basic relevance score; the domain/entity identification module performs semantic encoding on the query field, generates domain weight parameters for a plurality of target domains through a classification network, and determines a corresponding company entity based on a company name or entity characterization information contained in the query field; the joint self-adaptive adjustment module performs sparse selection on the low-rank adaptation branches corresponding to the plurality of target domains according to the output domain weight parameters, and activates the low-rank matrices in the low-rank adaptation branches with the domain weight para