CN-121996778-A - Combined extraction method and device based on semantic calibration and task bottleneck

CN 121996778 A

Abstract

The application provides a joint extraction method, device, equipment and medium based on semantic calibration and task bottlenecks, belonging to the technical field of entity recognition. Whole-sentence features are weighted and denoised through a low-rank semantic calibration mechanism, improving overall extraction accuracy. Through relation-aware interaction modeling, the model perceives different relation semantics already at the encoding stage, reducing the risk of relation confusion. A residual task-aware bottleneck constrains the information flow and stabilizes the multi-task joint optimization process, reducing unstable updates caused by gradient fluctuation, mitigating overfitting and feature collapse, and enhancing robustness and generalization. Through the synergy of the semantic calibration mechanism, the relation-aware interaction mechanism and the joint extraction output process, the model aligns entity and relation cues more accurately in scenarios with overlapping triples and complex structures, reduces erroneous associations and missed detections, and improves the overall consistency and completeness of the output triples.
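The low-rank semantic calibration described above (projecting features into a low-rank subspace, measuring global token correlations via a Gram matrix, and gating the original features) can be illustrated with a minimal numpy sketch. All shapes, matrix names (`W_down`, `W_up`) and the row-normalization step are illustrative assumptions; the patent does not fix concrete parameterizations.

```python
import numpy as np

def low_rank_calibration(H, W_down, W_up):
    """Sketch of a low-rank semantic calibration gate (assumed form).

    H:      (L, d)  initial semantic feature sequence
    W_down: (d, r)  down-projection into a rank-r subspace, r << d
    W_up:   (r, d)  up-projection back to the model dimension
    """
    Z = H @ W_down                  # (L, r) low-rank representation
    G = Z @ Z.T                     # (L, L) Gram matrix: global token correlations
    A = G / (np.abs(G).sum(axis=-1, keepdims=True) + 1e-9)  # row-normalize
    Z_rec = A @ Z                   # aggregate/reconstruct low-rank features
    gate = 1.0 / (1.0 + np.exp(-(Z_rec @ W_up)))  # (L, d) sigmoid gating weights
    return gate * H                 # element-wise calibrated features

rng = np.random.default_rng(0)
L, d, r = 6, 8, 2
H = rng.normal(size=(L, d))
H1 = low_rank_calibration(H, rng.normal(size=(d, r)), rng.normal(size=(r, d)))
```

Because the gate is a sigmoid in (0, 1), the calibrated features are an element-wise attenuation of the originals, which is what lets the mechanism act as a denoiser rather than a full re-encoding.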

Inventors

  • Zhang Xiaoqin
  • Nie Shuhan
  • Feng Zhaoxing

Assignees

  • Chongqing Information and Communication Consulting and Design Institute Co., Ltd. (重庆市信息通信咨询设计院有限公司)

Dates

Publication Date
2026-05-08
Application Date
2026-01-15

Claims (17)

  1. A joint extraction method based on semantic calibration and task bottlenecks, characterized by comprising the following steps: obtaining unstructured text to be extracted, deriving a standardized TokenID sequence and a corresponding mask matrix from the unstructured text, and semantically encoding the TokenID sequence and the corresponding mask matrix with a pre-trained language model to obtain an initial semantic feature sequence; denoising the initial semantic feature sequence to obtain a first semantic feature sequence; constructing a dynamic relation prototype based on the initial semantic feature sequence and a predefined finite relation set; performing relation-aware interactive modeling on the first semantic feature sequence and the dynamic relation prototype to generate a relation-aware second semantic feature sequence, and performing variational compression and residual fusion on the second semantic feature sequence to obtain an entity feature sequence; decoding the entity feature sequence with a global pointer network, identifying in parallel an entity boundary scoring matrix covering all possible head-entity and tail-entity spans in the sentences of the entity feature sequence, aggregating head- and tail-entity features according to the entity boundaries in the scoring matrix, and, after secondary bottleneck processing, predicting in parallel head-entity and tail-entity association matrices under specific relation categories; and parsing the entity boundary scoring matrix and the head-entity and tail-entity association matrices to obtain a structured set of entity-relation triples, constructing a composite objective function comprising a sparse multi-label task loss and an information-bottleneck regularization term, and adjusting the optimization strategy through dynamic weights.
  2. The method according to claim 1, wherein obtaining the unstructured text to be extracted and deriving the standardized TokenID sequence and corresponding mask matrix from it comprises: acquiring a natural-language sentence of the unstructured text and performing word segmentation and subword processing on it with the tokenizer corresponding to the pre-trained language model; adding a start marker at the head of the segmented sequence and an end marker at the tail, and mapping the segmentation result to a numerical TokenID sequence; and generating a valid-bit mask matrix for the position identifiers according to a preset maximum sequence length, and outputting the standardized TokenID sequence and the corresponding mask matrix.
  3. The method according to claim 1 or 2, wherein semantically encoding the TokenID sequence and corresponding mask matrix with the pre-trained language model to obtain the initial semantic feature sequence further comprises: feeding the standardized TokenID sequence, the corresponding mask matrix and the position identifiers as input tensors into the encoder of the pre-trained language model, passing them sequentially through 12 Transformer layers, and extracting layer by layer context-sensitive representations containing syntactic structure and deep semantic associations to obtain the initial semantic feature sequence; wherein each Transformer layer comprises a multi-head self-attention mechanism and a feed-forward neural network for modeling the dependencies among tokens in context.
  4. The method according to claim 1, wherein denoising the initial semantic feature sequence to obtain the first semantic feature sequence comprises: computing global semantic correlations over the initial semantic feature sequence through low-rank subspace projection to generate position-sensitive gating weights, and dynamically calibrating and denoising the initial semantic feature sequence; defining a dimension-reducing projection matrix and projecting the initial semantic feature sequence into a low-rank space to obtain a low-rank representation; and determining a Gram matrix of the initial semantic feature sequence based on the low-rank representation in the low-rank space.
  5. The method of claim 4, wherein denoising the initial semantic feature sequence to obtain the first semantic feature sequence further comprises: aggregating and reconstructing the low-rank features using the Gram matrix of the initial semantic feature sequence, applying a dimension-raising matrix mapping to the reconstructed features and the correlation matrix, and applying an activation function to generate a feature gating weight for each token in the TokenID sequence; and multiplying the feature gating weights element-wise with the initial semantic feature sequence to obtain the calibrated first semantic feature sequence.
  6. The method according to claim 3, wherein constructing the dynamic relation prototype based on the initial semantic feature sequence in combination with the predefined finite relation set comprises: aggregating the initial semantic feature sequence to generate a text-level semantic representation; evaluating the relevance of a predefined set of relation types against the text-level semantic representation, and screening the relation type set by a preset threshold or Top-K rule to obtain a candidate relation subset for the current text; and initializing the screened candidate relation subset to generate a relation embedding matrix for it, and expanding the relation embedding matrix to the currently configured dimension to obtain the dynamic relation prototype.
  7. The method of claim 1, wherein generating the relation-aware second semantic feature sequence by relation-aware interactive modeling of the first semantic feature sequence and the dynamic relation prototype comprises: defining a first, a second and a third linear projection matrix; treating the first semantic feature sequence as the query via the first linear projection matrix, the dynamic relation prototype as the key via the second linear projection matrix, and the dynamic relation prototype as the value via the third linear projection matrix; and splitting the features of the first semantic feature sequence into heads of a set dimension, computing the attention scores of each text token over all relation prototypes, and aggregating the relation values weighted by those scores to obtain context features containing the relation prior.
  8. The method of claim 7, wherein generating the relation-aware second semantic feature sequence by relation-aware interactive modeling of the first semantic feature sequence and the dynamic relation prototype further comprises: concatenating the context features output by the heads along the feature dimension and fusing them linearly through an output projection matrix to obtain the relation-aware sequence features; and performing residual fusion of the relation-aware sequence features with the first semantic feature sequence, and obtaining the final relation-aware second semantic feature sequence through a linear-layer mapping.
  9. The method of claim 1, wherein performing variational compression and residual fusion on the second semantic feature sequence to obtain the entity feature sequence comprises: inputting the second semantic feature sequence into a residual task-aware bottleneck module on the entity side; and defining a bottleneck dimension for the bottleneck module, predicting the mean and log-variance of the latent distribution through independent linear layers, and generating latent variables from that mean and log-variance via reparameterization.
  10. The method of claim 9, wherein performing variational compression and residual fusion on the second semantic feature sequence to obtain the entity feature sequence further comprises: defining a second dimension-raising matrix and mapping the latent variables from the bottleneck dimension back to the dimension of the initial semantic feature sequence to obtain reconstruction features; and introducing a mixing coefficient to residually fuse the second semantic feature sequence with the reconstruction features, obtaining an entity feature sequence that supports entity localization and relation discrimination.
  11. The method according to claim 1, wherein decoding the entity feature sequence with the global pointer network and identifying in parallel the entity boundary scoring matrix covering all possible head-entity and tail-entity spans in the sentence comprises: inputting the entity feature sequence into an entity decoder and predicting in parallel, through the global pointer network, the head-entity and tail-entity spans of the sentences in the entity feature sequence to generate the entity boundary scoring matrix; and determining entity probabilities from the scores by setting a channel value, and constraining the scores with a configured threshold parameter to obtain the predicted entity set of the corresponding channel.
  12. The method according to claim 1, wherein aggregating head- and tail-entity features according to the entity boundaries in the entity boundary scoring matrix and, after secondary bottleneck processing, predicting in parallel the head-entity and tail-entity association matrices under a specific relation category comprises: according to the head-entity and tail-entity spans of the sentences in the entity feature sequence, inputting the aggregated head-entity features and tail-entity features into a residual task-aware bottleneck module on the relation side; and compressing and fusing them in the residual task-aware bottleneck module to obtain purified relation features, and predicting the head-entity association matrix and the tail-entity association matrix under the specific relation with a classifier.
  13. The method of claim 1, wherein constructing the composite objective function comprising the sparse multi-label task loss and the information-bottleneck regularization term and adjusting the optimization strategy through dynamic weights comprises: associating entity scoring labels with the entity boundary scoring matrix, and the scoring labels of the relation head entity and relation tail entity with the relation head-entity and relation tail-entity association matrices; constructing three subtasks for entity recognition, relation head-entity classification and relation tail-entity classification based on a sparse multi-label cross-entropy task loss, and computing the entity recognition loss, the relation head-entity loss and the relation tail-entity loss; and determining the total task loss from the entity recognition loss, the relation head-entity loss and the relation tail-entity loss.
  14. The method of claim 13, wherein constructing the composite objective function comprising the sparse multi-label task loss and the information-bottleneck regularization term and adjusting the optimization strategy through dynamic weights comprises: for each dimension of the bottleneck layer, computing the KL divergence of the latent distribution from a Gaussian prior, the total regularization loss being a weighted sum of the entity-side and relation-side KL divergences, and defining the final joint optimization objective function together with a dynamic warm-up strategy; whereby the model is led to learn discriminative features preferentially, and the information-bottleneck constraint tightens the model's generalization bound and suppresses noise.
  15. A joint extraction device based on semantic calibration and task bottlenecks, comprising: an initial semantic feature unit, configured to obtain unstructured text to be extracted, derive a standardized TokenID sequence and a corresponding mask matrix from the unstructured text, and semantically encode the TokenID sequence and the corresponding mask matrix with a pre-trained language model to obtain an initial semantic feature sequence; a first semantic feature unit, configured to denoise the initial semantic feature sequence to obtain a first semantic feature sequence; a dynamic relation prototype unit, configured to construct a dynamic relation prototype based on the initial semantic feature sequence and a predefined finite relation set; an entity feature sequence unit, configured to perform relation-aware interactive modeling on the first semantic feature sequence and the dynamic relation prototype to generate a relation-aware second semantic feature sequence, and to perform variational compression and residual fusion on the second semantic feature sequence to obtain an entity feature sequence; an entity association matrix unit, configured to decode the entity feature sequence with a global pointer network, identify in parallel entity boundary scoring matrices covering all possible head-entity and tail-entity spans in the sentences of the entity feature sequence, aggregate head- and tail-entity features according to the entity boundaries in the scoring matrices, and, after secondary bottleneck processing, predict in parallel head-entity and tail-entity association matrices under specific relation categories; and a composite objective function unit, configured to parse the entity boundary scoring matrix and the head-entity and tail-entity association matrices to obtain a structured set of entity-relation triples, construct a composite objective function comprising a sparse multi-label task loss and an information-bottleneck regularization term, and adjust the optimization strategy through dynamic weights.
  16. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus; the memory stores a computer program; and the processor, when executing the program stored in the memory, implements the joint extraction method based on semantic calibration and task bottlenecks according to any one of claims 1 to 14.
  17. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the joint extraction method based on semantic calibration and task bottlenecks according to any one of claims 1 to 14.
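The residual task-aware bottleneck of claims 9, 10 and 14 (independent linear heads for mean and log-variance, reparameterized sampling, up-projection, residual fusion with a mixing coefficient, and a Gaussian KL regularizer) can be sketched as follows. All matrix names, shapes, and the exact fusion form are assumptions for illustration only.

```python
import numpy as np

def residual_task_bottleneck(H2, W_mu, W_logvar, W_up, alpha=0.5, rng=None):
    """Sketch of a residual task-aware variational bottleneck (assumed form).

    H2:       (L, d) relation-aware second semantic feature sequence
    W_mu:     (d, b) linear head predicting the latent mean
    W_logvar: (d, b) linear head predicting the latent log-variance
    W_up:     (b, d) maps the latent back to the model dimension
    alpha:    mixing coefficient for the residual fusion
    """
    if rng is None:
        rng = np.random.default_rng(0)
    mu = H2 @ W_mu
    logvar = H2 @ W_logvar
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps     # reparameterization trick
    H_rec = z @ W_up                        # reconstruction in model dimension
    # KL divergence of N(mu, sigma^2) from the standard Gaussian prior,
    # usable as the information-bottleneck regularizer of claim 14
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return alpha * H2 + (1.0 - alpha) * H_rec, kl

rng = np.random.default_rng(1)
H2 = rng.normal(size=(6, 8))
E, kl = residual_task_bottleneck(
    H2,
    rng.normal(size=(8, 3)),
    rng.normal(size=(8, 3)),
    rng.normal(size=(3, 8)),
)
```

The residual mixing keeps the fused entity features in the same space as the input, so entity-side and relation-side bottlenecks can be stacked without changing downstream decoder shapes; the KL term is non-negative by construction.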

Description

Combined extraction method and device based on semantic calibration and task bottleneck

Technical Field

The application belongs to the technical field of entity recognition, and in particular relates to a joint extraction method, device, equipment and medium based on semantic calibration and task bottlenecks.

Background

Entity-relation joint extraction aims to simultaneously identify entities and their semantic relations from unstructured text and to generate structured relation-triple information; it is a key foundational technology in knowledge graph construction, semantic retrieval and intelligent question-answering systems. Various solutions to this task have been proposed in the prior art. Early methods mostly adopt pipeline processing: named entity recognition is performed first, and relation classification is then performed on the recognized entity pairs. Such methods are relatively simple to implement, but because the entity recognition results directly affect the subsequent relation discrimination, errors accumulate step by step along the pipeline, and effective cooperative constraints between entity recognition and relation discrimination are hard to impose, so overall extraction performance tends to fluctuate under complex sentence patterns, long-distance dependencies or noisy text. In recent years, end-to-end joint extraction has become the mainstream. One class of methods jointly models subtasks such as entity boundary recognition and relation discrimination under a unified encoding framework, realizing collaborative learning through shared feature representations so as to alleviate pipeline error propagation.
However, in multi-task joint optimization, different subtasks may place different demands on the features, which easily produces feature competition or unstable decision boundaries; especially when the sample distribution is uneven or labeling noise exists, model robustness and generalization remain limited. Another class of methods builds a unified structured prediction space or decoding mechanism, mapping entity span information and entity-pair relation information into the same prediction space for parallel output, so as to better handle complex scenarios such as multiple triples in one sentence, entity overlap or relation overlap. However, since the structured prediction space tends to expand with text length, a large number of redundant candidates and negative samples are easily introduced, resulting in unbalanced training, decoding conflicts or increased computational overhead, which affects stability and efficiency on large-scale data and in complex contexts. In summary, the prior art has the following defects: first, text representations contain a large amount of background information and redundant features irrelevant to the extraction task and are easily disturbed by noise; second, fine-grained alignment and discrimination remain insufficient when multiple relations coexist, semantically similar relations must be distinguished, or relations overlap; third, the information-flow constraint mechanism in multi-task joint learning is inadequate, so training is unstable and performance fluctuates strongly across datasets.
Therefore, a method for entity-relation joint extraction is needed that enhances relation semantic alignment while reducing the influence of redundant noise and stabilizing the multi-task learning process of joint extraction, so as to meet the accuracy, stability and consistency requirements of complex contexts and large-scale knowledge graph construction.

Disclosure of Invention

In view of the above problems, the present application provides a method, apparatus, device and medium for joint extraction based on semantic calibration and task bottlenecks, so as to better satisfy the accuracy, stability and consistency requirements of complex contexts and large-scale knowledge graph construction. An embodiment of the application provides a joint extraction method based on semantic calibration and task bottlenecks, which comprises the following steps: obtaining unstructured text to be extracted, deriving a standardized TokenID sequence and a corresponding mask matrix from the unstructured text, and semantically encoding the TokenID sequence and the corresponding mask matrix with a pre-trained language model to obtain an initial semantic feature sequence; denoising the initial semantic feature sequence to obtain a first semantic feature sequence; constructing a dynamic relationship prototype based on the initial semantic f