CN-122019761-A - Field tracing method and system for scientific literature extraction tasks
Abstract
The invention discloses a field tracing method and system for scientific literature extraction tasks. Based on the extraction results and the layout analysis results of a scientific document, the method quickly and accurately locates and highlights the extracted content within the document. Multi-level query construction addresses the problem that, when hierarchically structured extracted content is traced, content at different levels drifts apart. Matching table text separately from paragraph text, and numerical text separately from non-numerical text, shortens tracing time while improving tracing accuracy. Because the reranker faces an O(n²) computational bottleneck on long text input (n being the length of the input text) and the information density of the extracted content is concentrated, a sliding-window chunking strategy is introduced, which markedly improves inference speed while preserving tracing accuracy.
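As a rough illustration of why chunking helps, the estimate below assumes that the bottleneck the abstract refers to is the quadratic cost of self-attention in the reranker, and counts only pairwise token interactions. The window size w = 64 and stride s = 48 are the values given in claim 6; the query length q = 32 and fragment length n = 1024 are assumptions chosen for the example, not figures from the patent.

```latex
\begin{align*}
C_{\text{full}}(n)  &\propto (q + n)^2, \\
k(n)                &= \left\lceil \frac{n - w}{s} \right\rceil + 1 \qquad (n > w), \\
C_{\text{chunk}}(n) &\propto k(n)\,(q + w)^2 .
\end{align*}
% Worked example with the assumed q = 32, n = 1024 and w = 64, s = 48:
\begin{align*}
k &= \left\lceil \tfrac{960}{48} \right\rceil + 1 = 21, \\
C_{\text{full}}  &\propto 1056^2 = 1\,115\,136, \\
C_{\text{chunk}} &\propto 21 \cdot 96^2 = 193\,536 \approx C_{\text{full}} / 5.8 .
\end{align*}
```

Under these assumptions the chunked cost grows linearly in n (through k), while the unchunked cost grows quadratically, so the gap widens for longer fragments.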
Inventors
- LI SHUAN
- QI YAO
- Xiang Zongyuan
- PENG ZHONG
- SUN HEYANG
- YANG JIANG
- Wei Juye
- WEI SHUPING
- SHAN LIQUN
- ZHOU SHUNXIANG
- SONG ZIQI
- YE YUFEI
Assignees
- Zhejiang Lab (之江实验室)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-08
Claims (9)
- 1. A field tracing method for scientific literature extraction tasks, characterized by comprising the following steps: (1) performing layout analysis on the scientific literature to obtain structured data containing text content and layout position information; (2) extracting the fields to be traced using an LLM, wherein the fields are divided by hierarchy level into main keys and sub keys, a main key being the identity of an entity or object, and a sub key being an attribute attached to a main key that describes the characteristics and state of that main key; (3) filtering the fragments in the layout analysis file according to the characteristics of the main key/sub key values; (4) constructing recall queries separately for the main key level and the sub key level; (5) constructing a semantic similarity matching algorithm that takes a query and the filtered fragments as input and outputs the fragment with the highest similarity to the query; (6) outputting the coordinate region (Bounding Box) corresponding to the fragment with the highest similarity, the position information of the Bounding Box being used to locate and highlight the corresponding field content on a display interface.
- 2. The field tracing method for scientific literature extraction tasks according to claim 1, wherein in step (3), filtering the fragments in the layout analysis file according to the characteristics of the main key/sub key values specifically comprises the following steps: (3.1) determining whether any fragments contain the complete main key/sub key value; if so, returning all fragments that contain the complete main key/sub key value and jumping to step (3.4); if not, jumping to step (3.2); (3.2) extracting the numerical fields [num1, num2, ...] from the main key/sub key value and determining whether any fragments contain more than half of the numerical fields; if so, returning all fragments that contain more than half of the numerical fields of the main key/sub key value and jumping to step (3.4); if not, jumping to step (3.3); (3.3) returning all fragments and jumping to step (3.4); (3.4) outputting the fragments filtered in steps (3.1), (3.2) and (3.3).
- 3. The field tracing method for scientific literature extraction tasks according to claim 1, wherein in step (4), recall queries are constructed separately for the main key level and the sub key level, comprising a main key recall query and a sub key recall query; the main key recall query is constructed as "[main key value]: the [main key name] mentioned in the text contains [main key value]", and the sub key recall query is constructed as "[sub key value]: the [sub key name] of [main key value] in the text is [sub key value]".
- 4. The field tracing method for scientific literature extraction tasks according to claim 3, wherein the main key value is an entity identifier, that is, the specific name or ID of an entity; the main key name is the entity type; the sub key value is an attribute value, that is, the specific content of an attribute; and the sub key name is the attribute type.
- 5. The field tracing method for scientific literature extraction tasks according to claim 1, wherein the semantic similarity matching algorithm of step (5) specifically comprises the following steps: (5.1) obtaining the inputs of the semantic similarity matching algorithm, namely the query and the filtered fragments (chunks); (5.2) determining the number Num of filtered fragments; if Num > Q, where Q takes a value in the range [768, 1280], determining whether cached (fragment, embedding vector) pairs exist; if so, reading the cache directly; if not, computing the embedding vectors of the chunks with an embedding model and caching the (fragment, embedding vector) pairs; then computing the dot product of the vectorized query and the embedding vector of each fragment to obtain the semantic similarity between the query and the filtered fragments, taking the 256 most semantically similar fragments, and passing them to step (5.3); if Num ≤ Q, returning all fragments and jumping to step (5.3); (5.3) splitting each initial fragment returned by step (5.2) into a plurality of sub-fragments using a sliding window mechanism, and building a mapping table between each initial fragment and its sub-fragments; and (5.4) concatenating each sub-fragment with the query, that is, prepending the query to each sub-fragment, feeding the results in batches to a reranker similarity model, the model inputs being divided into batches according to the GPU memory capacity, outputting the similarity between every sub-fragment and the query, and, using the mapping table built in step (5.3), returning the initial fragment that corresponds to the sub-fragment with the highest similarity.
- 6. The field tracing method for scientific literature extraction tasks according to claim 5, wherein in step (5.3) the sliding window has a window size of 64 and a step length of 48, so that adjacent windows overlap by 16.
- 7. A field tracing system for scientific literature extraction tasks, characterized by comprising a data storage unit, a fragment filtering unit, a semantic matching algorithm unit, and a result output and positioning unit; the data storage unit is used for storing the layout analysis files of scientific documents, the fields to be traced, and the cached (fragment, embedding vector) pairs; the fragment filtering unit is used for filtering out invalid and redundant scientific literature fragments; the semantic matching algorithm unit is used for computing the similarity between the query and the scientific literature fragments and returning the fragment most similar to the query; and the result output and positioning unit jumps to the position of the fragment content in the scientific document, according to the fragment output by the semantic matching algorithm unit and the Bounding Box information corresponding to that fragment, and highlights the fragment content.
- 8. A computing device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the field tracing method for scientific literature extraction tasks according to any one of claims 1 to 6.
- 9. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the field tracing method for scientific literature extraction tasks according to any one of claims 1 to 6.
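The filtering cascade of claim 2, the query templates of claim 3, and the sliding-window split with its mapping table from claims 5 and 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, numbers are pulled out with a simple regular expression, the English wording of the query templates follows the translated claim, and window units are whitespace-separated words because the claims do not say what the sizes 64 and 48 are measured in.

```python
import re
from typing import Dict, List, Sequence, Tuple

# Assumption: a "numerical field" is an integer or decimal literal.
NUM_RE = re.compile(r"\d+(?:\.\d+)?")


def filter_fragments(fragments: List[str], value: str) -> List[str]:
    """Hypothetical helper following claim 2, steps (3.1) to (3.4)."""
    # (3.1) keep fragments that contain the complete key value
    exact = [f for f in fragments if value in f]
    if exact:
        return exact
    # (3.2) otherwise keep fragments holding more than half of the numbers
    nums = NUM_RE.findall(value)
    if nums:
        partial = []
        for f in fragments:
            frag_nums = set(NUM_RE.findall(f))
            if sum(n in frag_nums for n in nums) > len(nums) / 2:
                partial.append(f)
        if partial:
            return partial
    # (3.3) fall back to all fragments
    return list(fragments)


def build_queries(main_name: str, main_value: str,
                  sub_name: str, sub_value: str) -> Tuple[str, str]:
    """Main key and sub key recall queries in the shape given by claim 3."""
    main_q = f"{main_value}: the {main_name} mentioned in the text contains {main_value}"
    sub_q = f"{sub_value}: the {sub_name} of {main_value} in the text is {sub_value}"
    return main_q, sub_q


def sliding_windows(tokens: Sequence, size: int = 64, stride: int = 48) -> List[Sequence]:
    """Split tokens into windows of `size` advanced by `stride` (claim 6 values)."""
    if len(tokens) <= size:
        return [tokens]
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last window reached the end
            break
        start += stride
    return windows


def split_with_mapping(fragments: List[str]) -> Tuple[List[str], Dict[int, int]]:
    """Step (5.3): sub-fragments plus a sub-fragment -> initial-fragment index map."""
    subs: List[str] = []
    mapping: Dict[int, int] = {}
    for i, frag in enumerate(fragments):
        for window in sliding_windows(frag.split()):  # assumption: word units
            mapping[len(subs)] = i
            subs.append(" ".join(window))
    return subs, mapping
```

In step (5.4) each sub-fragment would be scored by the reranker with the query prepended, and `mapping` would then take the index of the best-scoring sub-fragment back to its initial fragment. The sub key query deliberately repeats the main key value, which is how the claims keep a sample's attribute from being traced to a passage about a different sample.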
Description
Field tracing method and system for scientific literature extraction tasks
Technical Field
The invention belongs to the technical fields of natural language processing and natural science, and particularly relates to a field tracing method and system for scientific literature extraction tasks.
Background
Conventional field extraction is generally treated as a sequence labeling task and handled with rules or common sequence labeling algorithms: early machine learning methods such as hidden Markov models (HMM) and conditional random fields (CRF), and deep learning models such as recurrent neural networks (LSTM) and deep language understanding models (BERT). These algorithms convert the text into a token sequence and make a prediction for each token, so the position of an extracted field is easy to determine (it is the position of the tokens predicted as belonging to the field). Since the emergence of large models, their convenience and generality have made them the better choice for current text extraction tasks. A large model, however, extracts the specified fields as a generation task, so an additional rule or algorithm is needed to determine the source of each extracted field, that is, its position in the original text, so that a user can conveniently check whether the extracted field is correct and fix it if not. Tracing the extracted fields is therefore necessary (tracing means finding the position of an extracted field in the original text). Extraction from scientific literature often has a hierarchical structure, such as a sample and certain properties of that sample.
To ensure a good user experience and timely verification, fields at different levels must be prevented from drifting apart: the traced position of a sample's sub-attribute should, as far as possible, contain both the sample name and the sub-attribute name. The prior art cannot complete this tracing task promptly and accurately. Moreover, the literal expression of the same field (such as the name of a sample) in a scientific document is often inconsistent: abbreviations, short forms, expanded forms, near-synonyms and synonyms all make tracing harder. A technical innovation is therefore needed that completes the tracing task for extracted fields promptly and accurately for both long and short fields.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a field tracing method and system for scientific literature extraction tasks. For field content at different levels (such as the name of a sample and the names of certain attributes of that sample), the method keeps the sample name and the sample attribute name from drifting apart. For field content of different lengths (such as a short sample name and a long summary of sample characteristics), short fields are traced at the sentence level and long fields at the paragraph level. Sentence-level tracing of short fields lets the user verify quickly, and paragraph-level tracing of long fields keeps the information as complete as possible so that nothing is lost.
To achieve the above object, the invention provides a field tracing method for scientific literature extraction tasks, comprising the following steps: (1) performing layout analysis on the scientific literature to obtain structured data containing text content and layout position information; (2) extracting the fields to be traced using an LLM, wherein the fields are divided by hierarchy level into main keys and sub keys, a main key being the identity of an entity or object, and a sub key being an attribute attached to a main key that describes the characteristics and state of that main key; (3) filtering the fragments in the layout analysis file according to the characteristics of the main key/sub key values; (4) constructing recall queries separately for the main key level and the sub key level; (5) constructing a semantic similarity matching algorithm that takes a query and the filtered fragments as input and outputs the fragment with the highest similarity to the query; (6) outputting the coordinate region (Bounding Box) corresponding to the fragment with the highest similarity, the position information of the Bounding Box being used to locate and highlight the corresponding field content on a display interface. Further, in step (3), filtering the fragments in the layout analysis file according to the characteristics of the main key/sub key values specifically comprises the following steps: (3.1) determining whether any fragments contain the complete main key/sub key value; if so, returning all fragments that contain the complete main key/sub key value and jumping to step (3.4); if not, jumpi