CN-121835683-B - Multi-modal named entity recognition method and device
Abstract
The invention discloses a multi-modal named entity recognition method and device in the technical field of named entity recognition. The method inputs a text to be recognized and an image to be recognized into a trained multi-modal named entity recognition model to determine an entity recognition result, where the model comprises a text extraction module, an image extraction module, a sentence-level soft gating module, a self-attention Transformer module, a cross-modal Transformer module, MLP expert heads, a visual gating module, and a CRF decoder. This scheme reduces information loss in cross-modal interaction and improves fine-grained alignment and decoding robustness, thereby enhancing entity recognition performance.
Inventors
- FENG GUANG
- SUN XIANGLI
- HUANG JUNHUI
- LIU XINTING
- CAO YUQIAO
- ZHAO ZHIWEN
- SU XU
- ZHOU KEDONG
- LIAO BEIRONG
- LIN YIBAO
Assignees
- Guangdong University of Technology (广东工业大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-03-13
Claims (10)
- 1. A multi-modal named entity recognition method, comprising: inputting a text to be recognized and an image to be recognized into a trained multi-modal named entity recognition model; extracting a word-level feature matrix of the text to be recognized through a text extraction module, and performing instance extraction of a visual feature matrix of the image to be recognized with an image extraction module; determining main-task text features of the word-level feature matrix with a self-attention Transformer module, and applying an attention operation to the word-level feature matrix followed by a linear mapping to output auxiliary emission scores; generating a gating weight from the cosine similarity between the word-level feature matrix and the visual feature matrix through a sentence-level soft gating module, and weighting the visual feature matrix to determine visual region features; applying cross-attention between the main-task text features and the visual region features in a cross-modal Transformer module to generate region alignment features and visual perception features, and using the main-task text features to guide the visual perception features interactively to construct context perception features; feeding the region alignment features and the context perception features into respective MLP expert heads, and then performing a weighted linear fusion with the gating weight to generate multi-modal fusion features; concatenating the region alignment features and the context perception features through a visual gating module to generate a gating coefficient, and weighting the region alignment features to generate gated region alignment features; concatenating the context perception features, the multi-modal fusion features, and the gated region alignment features to output enhanced fusion features, and linearly mapping the enhanced fusion features into main emission scores; and fusing the main emission scores with the auxiliary emission scores to output fused emission scores, and determining an entity recognition result from the fused emission scores through a CRF decoder.
- 2. The multi-modal named entity recognition method according to claim 1, wherein extracting, by the text extraction module, the word-level feature matrix of the text to be recognized comprises: adding a [CLS] label and a [SEP] label to the head and tail of the text to be recognized, inputting the result into a RoBERTa pre-trained model, and outputting subword vectors; and after applying random dropout to the subword vectors, screening them according to a first-character alignment strategy to form the word-level feature matrix (a sketch follows the claims).
- 3. The multi-modal named entity recognition method according to claim 1, wherein performing instance extraction of the visual feature matrix of the image to be recognized with the image extraction module comprises: performing instance mask segmentation on the image to be recognized through a SAM model to generate region masks; inputting each region mask into a CLIP model to output a corresponding mask embedding; averaging each mask embedding to construct a corresponding region vector; and mapping the region vectors with a linear layer to form the visual feature matrix (see the second sketch after the claims).
- 4. The multi-modal named entity recognition method according to claim 1, wherein generating the gating weight from the cosine similarity between the word-level feature matrix and the visual feature matrix through the sentence-level soft gating module, and weighting the visual feature matrix to determine the visual region features, comprises: extracting the [CLS] vector of the word-level feature matrix as a text semantic summary; mean-pooling the visual feature matrix to determine a visual whole-image feature; computing the cosine similarity between the text semantic summary and the visual whole-image feature; generating the gating weight from the cosine similarity through a Sigmoid function with a learnable temperature coefficient and bias threshold; and broadcasting the gating weight across channels and multiplying it with the visual feature matrix to determine the visual region features.
- 5. The multi-modal named entity recognition method according to claim 1, wherein the training process of the trained multi-modal named entity recognition model comprises: determining, through the multi-modal named entity recognition model to be trained, a training word-level feature matrix, a training visual feature matrix, training main-task text features, and training auxiliary emission scores for any paired training text and training image in the current training batch of the training set; standardizing, within the batch, the cosine similarity between the training word-level feature matrix and the training visual feature matrix to generate a gating weight, and weighting the training visual feature matrix to determine training visual region features; determining training fused emission scores based on the training main-task text features, the training visual region features, and the training auxiliary emission scores through the model to be trained; determining an auxiliary-task path energy score of the training auxiliary emission scores using a CRF decoder; computing a CRF joint loss, weighted by the inverse class frequencies of the training set, from the training fused emission scores and the auxiliary-task path energy score; and iteratively optimizing the model parameters of the multi-modal named entity recognition model to be trained with the objective of minimizing the CRF joint loss, until the trained multi-modal named entity recognition model is obtained.
- 6. The multi-modal named entity recognition method according to claim 5, wherein the CRF joint loss is (see the loss sketch after the claims):
  $$\mathcal{L}_{\mathrm{CRF}} = -\frac{1}{B}\sum_{i=1}^{B}\Bigg[\frac{1}{T_i}\sum_{t=1}^{T_i} w_{y_{i,t}}\Big(S(E_i, y_i)-\log\sum_{\tilde{y}\in\mathcal{Y}} e^{S(E_i,\tilde{y})}\Big)+\lambda\Big(S^{\mathrm{aux}}(E_i^{\mathrm{aux}}, b_i)-\log\sum_{\tilde{b}\in\mathcal{B}} e^{S^{\mathrm{aux}}(E_i^{\mathrm{aux}},\tilde{b})}\Big)\Bigg]$$
  where $\mathcal{L}_{\mathrm{CRF}}$ denotes the CRF joint loss; $B$ the batch size; $i$ the training-sample index; $\log$ the logarithm with natural base; $E_i$ the training fused emission scores of the $i$-th training sample; $y_i$ the main-task true tag sequence of the $i$-th training sample; $\tilde{y}$ any main-task predicted tag sequence in the main-task tag space $\mathcal{Y}$; $S(E_i, y_i)$ the main-task path energy score of the $i$-th sample under the main-task true tag sequence; $S(E_i, \tilde{y})$ the main-task path energy score under any main-task predicted tag sequence; $e$ the natural constant; $T_i$ the effective sentence length; $t$ the token time-step index within the sentence; $w_{y_{i,t}}$ the inverse class frequency weight of the main-task true label of the $i$-th sample at time step $t$; $\lambda$ the auxiliary-task weight coefficient; $\tilde{b}$ any auxiliary-task predicted boundary tag sequence in the auxiliary-task tag space $\mathcal{B}$; $E_i^{\mathrm{aux}}$ the auxiliary emission scores of the $i$-th sample; $b_i$ the auxiliary-task true boundary tag sequence of the $i$-th sample; $S^{\mathrm{aux}}(E_i^{\mathrm{aux}}, \tilde{b})$ the auxiliary-task path energy score under any auxiliary-task predicted boundary tag sequence; and $S^{\mathrm{aux}}(E_i^{\mathrm{aux}}, b_i)$ the auxiliary-task path energy score under the auxiliary-task true boundary tag sequence.
- 7. A multi-modal named entity recognition device, comprising: a data input module for inputting a text to be recognized and an image to be recognized into a trained multi-modal named entity recognition model; a data extraction module for extracting a word-level feature matrix of the text to be recognized through a text extraction module, and for performing instance extraction of a visual feature matrix of the image to be recognized with an image extraction module; a self-attention processing module for determining main-task text features of the word-level feature matrix with a self-attention Transformer module, applying an attention operation to the word-level feature matrix, and outputting auxiliary emission scores through a linear mapping; a sentence-level enhancement module for generating a gating weight from the cosine similarity between the word-level feature matrix and the visual feature matrix through a sentence-level soft gating module, and for weighting the visual feature matrix to determine visual region features; a cross-modal interaction module for applying cross-attention between the main-task text features and the visual region features in a cross-modal Transformer module to generate region alignment features and visual perception features, and for using the main-task text features to guide the visual perception features interactively to construct context perception features; an expert fusion module for feeding the region alignment features and the context perception features into respective MLP expert heads and then performing a weighted linear fusion with the gating weight to generate multi-modal fusion features; a visual enhancement module for concatenating the region alignment features and the context perception features through a visual gating module to generate a gating coefficient, and for weighting the region alignment features to generate gated region alignment features; a score determination module for concatenating the context perception features, the multi-modal fusion features, and the gated region alignment features to output enhanced fusion features, and for linearly mapping the enhanced fusion features into main emission scores; and an entity recognition module for fusing the main emission scores with the auxiliary emission scores to output fused emission scores, and for determining an entity recognition result from the fused emission scores through a CRF decoder.
- 8. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the multi-modal named entity recognition method of any one of claims 1-6.
- 9. A computer-readable storage medium having stored thereon a computer program or instructions which, when executed by a processor, implement the steps of the multi-modal named entity recognition method of any one of claims 1-6.
- 10. A computer program product comprising a computer program or instructions which, when executed by a processor, implement the steps of the multi-modal named entity recognition method of any one of claims 1-6.
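The following three sketches illustrate the components referenced in claims 2, 3, 5, and 6. They are minimal illustrations under stated assumptions, not the patent's definitive implementation. First, the first-character alignment of claim 2, assuming a Hugging Face `roberta-base` checkpoint and a 0.1 dropout rate (both illustrative); for brevity the special-token rows are dropped here, whereas claim 4 retains the [CLS] row as the sentence summary:

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

# add_prefix_space=True is required when feeding pre-split words to RoBERTa.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
encoder = AutoModel.from_pretrained("roberta-base")
dropout = nn.Dropout(p=0.1)  # the claim's "random dropout" of subword vectors; rate assumed

def word_level_features(words: list[str]) -> torch.Tensor:
    # The tokenizer automatically adds <s>/</s>, RoBERTa's analogues of [CLS]/[SEP].
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    sub = encoder(**enc).last_hidden_state.squeeze(0)  # (num_subwords, hidden)
    sub = dropout(sub)
    # First-character alignment: keep only the first subword vector of each word.
    keep, seen = [], set()
    for idx, wid in enumerate(enc.word_ids(0)):
        if wid is not None and wid not in seen:
            seen.add(wid)
            keep.append(idx)
    return sub[keep]  # (num_words, hidden) word-level feature matrix

print(word_level_features(["Lionel", "Messi", "visits", "Miami"]).shape)
```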
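Next, a sketch of the instance-level visual pipeline of claim 3, assuming the `segment-anything` and `transformers` packages, illustrative checkpoint names and hidden sizes, and CLIP's pooled image embedding standing in for the averaged mask embedding:

```python
import numpy as np
import torch
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry  # pip install segment-anything
from transformers import CLIPModel, CLIPProcessor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # checkpoint path is an assumption
mask_generator = SamAutomaticMaskGenerator(sam)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
to_text_dim = torch.nn.Linear(512, 768)  # maps region vectors into the text feature space

def visual_feature_matrix(image: Image.Image) -> torch.Tensor:
    arr = np.array(image.convert("RGB"))
    regions = []
    for m in mask_generator.generate(arr):             # one dict per instance mask
        crop = arr * m["segmentation"][..., None]      # zero out pixels outside the mask
        inputs = processor(images=Image.fromarray(crop), return_tensors="pt")
        with torch.no_grad():
            emb = clip.get_image_features(**inputs)    # (1, 512) pooled mask embedding
        regions.append(emb.squeeze(0))
    V = torch.stack(regions)                           # (num_regions, 512) region vectors
    return to_text_dim(V)                              # (num_regions, 768) visual feature matrix
```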
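Finally, a sketch of the joint CRF objective of claims 5 and 6, built on the `pytorch-crf` package. Since `torchcrf.CRF` exposes no per-token class weighting, the inverse-class-frequency term is approximated here by a per-sequence mean weight; the tag counts and `lambda_aux` are assumptions:

```python
import torch
from torchcrf import CRF  # pip install pytorch-crf

num_main_tags, num_aux_tags, lambda_aux = 9, 3, 0.5  # illustrative assumptions
main_crf = CRF(num_main_tags, batch_first=True)
aux_crf = CRF(num_aux_tags, batch_first=True)

def joint_crf_loss(main_emissions, main_tags, aux_emissions, aux_tags, mask, tag_weights):
    """main_emissions: (B, T, num_main_tags); tags: (B, T) long; mask: (B, T) bool;
    tag_weights: (num_main_tags,) inverse class frequencies of the training set."""
    # Per-sequence negative log-likelihood from each CRF head.
    main_nll = -main_crf(main_emissions, main_tags, mask=mask, reduction="none")
    aux_nll = -aux_crf(aux_emissions, aux_tags, mask=mask, reduction="none")
    # Approximation: average the per-token inverse class frequency over each sentence,
    # since torchcrf scores whole tag paths rather than individual tokens.
    w = (tag_weights[main_tags] * mask).sum(dim=1) / mask.sum(dim=1)
    return (w * main_nll + lambda_aux * aux_nll).mean()
```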
Description
Multi-modal Named Entity Recognition Method and Device

Technical Field

The present invention relates to the field of named entity recognition technologies, and in particular to a multi-modal named entity recognition method and apparatus.

Background

With the spread of mixed image-text content on social media, multi-modal named entity recognition (MNER) aims to recognize and extract named entities by integrating information from multiple modalities such as text and images. Conventional multi-modal named entity recognition methods generally extract text and image features separately, perform simple concatenation or encoder-level fusion, and finally output named entity tags through a sequence labeler. Such methods tend to suffer from insufficient modal interaction, large alignment errors, and image noise interference, which results in poor entity recognition performance.

Disclosure of Invention

The invention provides a multi-modal named entity recognition method and device, which address the technical problem of the poor entity recognition performance of existing multi-modal named entity recognition methods.

The multi-modal named entity recognition method provided by the first aspect of the invention comprises the following steps: inputting a text to be recognized and an image to be recognized into a trained multi-modal named entity recognition model; extracting a word-level feature matrix of the text to be recognized through a text extraction module, and performing instance extraction of a visual feature matrix of the image to be recognized with an image extraction module; determining main-task text features of the word-level feature matrix with a self-attention Transformer module, and applying an attention operation to the word-level feature matrix followed by a linear mapping to output auxiliary emission scores; generating a gating weight from the cosine similarity between the word-level feature matrix and the visual feature matrix through a sentence-level soft gating module, and weighting the visual feature matrix to determine visual region features; applying cross-attention between the main-task text features and the visual region features in a cross-modal Transformer module to generate region alignment features and visual perception features, and using the main-task text features to guide the visual perception features interactively to construct context perception features; feeding the region alignment features and the context perception features into respective MLP expert heads, and then performing a weighted linear fusion with the gating weight to generate multi-modal fusion features; concatenating the region alignment features and the context perception features through a visual gating module to generate a gating coefficient, and weighting the region alignment features to generate gated region alignment features; concatenating the context perception features, the multi-modal fusion features, and the gated region alignment features to output enhanced fusion features, and linearly mapping the enhanced fusion features into main emission scores; and fusing the main emission scores with the auxiliary emission scores to output fused emission scores, and determining an entity recognition result from the fused emission scores through a CRF decoder.
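As a minimal sketch of the fusion tail described above, the following PyTorch module wires together the two cross-attention directions, the text-guided context interaction, the MLP expert heads, the visual gate, and the main emission head. All dimensions, the head count, the expert layer shapes, and the gate/(1-gate) split are illustrative assumptions; the fused emission scores would then be obtained, for example, as a weighted sum of the main and auxiliary emission scores before Viterbi decoding in the CRF, with the fusion coefficient likewise not fixed by the claims.

```python
import torch
from torch import nn

class FusionTail(nn.Module):
    """Hedged sketch: dimensions, head count, and the gate split are assumptions."""
    def __init__(self, d: int = 768, num_tags: int = 9, heads: int = 8):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(d, heads, batch_first=True)   # text queries regions
        self.img2txt = nn.MultiheadAttention(d, heads, batch_first=True)   # regions query text
        self.ctx_attn = nn.MultiheadAttention(d, heads, batch_first=True)  # text guides perception
        self.expert_align = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
        self.expert_ctx = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
        self.visual_gate = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())
        self.main_head = nn.Linear(3 * d, num_tags)  # enhanced fusion -> main emission scores

    def forward(self, text_feats, region_feats, gate):
        # text_feats: (B, T, d); region_feats: (B, R, d); gate: (B, 1, 1) sentence-level weight.
        align, _ = self.txt2img(text_feats, region_feats, region_feats)   # region alignment
        percept, _ = self.img2txt(region_feats, text_feats, text_feats)  # visual perception
        context, _ = self.ctx_attn(text_feats, percept, percept)         # context perception
        # MLP expert heads fused linearly by the sentence-level gate (the split is assumed).
        fusion = gate * self.expert_align(align) + (1 - gate) * self.expert_ctx(context)
        g = self.visual_gate(torch.cat([align, context], dim=-1))        # gating coefficient
        gated_align = g * align                                          # gated region alignment
        enhanced = torch.cat([context, fusion, gated_align], dim=-1)     # enhanced fusion
        return self.main_head(enhanced)                                  # (B, T, num_tags)
```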
Optionally, extracting, by the text extraction module, the word-level feature matrix of the text to be recognized comprises: adding a [CLS] label and a [SEP] label to the head and tail of the text to be recognized, inputting the result into a RoBERTa pre-trained model, and outputting subword vectors; and after applying random dropout to the subword vectors, screening them according to a first-character alignment strategy to form the word-level feature matrix.

Optionally, performing instance extraction of the visual feature matrix of the image to be recognized with the image extraction module comprises: performing instance mask segmentation on the image to be recognized through a SAM model to generate region masks; inputting each region mask into a CLIP model to output a corresponding mask embedding; averaging each mask embedding to construct a corresponding region vector; and mapping the region vectors with a linear layer to form the visual feature matrix.

Optionally, generating the gating weight from the cosine similarity between the word-level feature matrix and the visual feature matrix through the sentence-level soft gating module, and weighting the visual feature matrix to determine the visual region features, comprises: extracting the [CLS] vector of the word-level feature matrix as a text semantic summary; mean-pooling the visual feature matrix to determine a visual whole-image feature; computing the cosine similarity between the text semantic summary and the visual whole-image feature; generating the gating weight from the cosine similarity through a Sigmoid function with a learnable temperature coefficient and bias threshold; and broadcasting the gating weight across channels and multiplying it with the visual feature matrix to determine the visual region features.
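A minimal sketch of the sentence-level soft gate described in the preceding paragraph, assuming a scalar gate computed as a Sigmoid over a scaled and shifted cosine similarity; the initial temperature and bias values are assumptions, and the in-batch standardization used during training (claim 5) is omitted:

```python
import torch
import torch.nn.functional as F
from torch import nn

class SentenceSoftGate(nn.Module):
    """Hedged sketch: initial temperature and bias values are assumptions."""
    def __init__(self):
        super().__init__()
        self.temperature = nn.Parameter(torch.tensor(1.0))  # learnable temperature coefficient
        self.bias = nn.Parameter(torch.tensor(0.0))         # learnable bias threshold

    def forward(self, word_feats: torch.Tensor, visual_feats: torch.Tensor):
        # word_feats: (seq_len, d) with the [CLS] row first; visual_feats: (num_regions, d).
        text_summary = word_feats[0]             # [CLS] vector as the text semantic summary
        whole_image = visual_feats.mean(dim=0)   # mean-pooled visual whole-image feature
        sim = F.cosine_similarity(text_summary, whole_image, dim=0)
        gate = torch.sigmoid(self.temperature * sim + self.bias)  # scalar gate in (0, 1)
        # Broadcast across channels: regions of a poorly matching image are down-weighted.
        return gate * visual_feats, gate
```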