
CN-121981118-A - Multi-modal named entity recognition system and method based on multi-granularity query guidance

CN121981118A

Abstract

The application provides a multi-modal named entity recognition system and method based on multi-granularity query guidance, and relates to the technical field of multi-modal named entity recognition. The method comprises: preprocessing input multi-modal data to obtain a text representation and a visual representation; constructing a multi-granularity query set; performing multi-granularity query-guided fusion of the text representation and the visual representation using the query set, and outputting a fused text representation, a fused visual region representation, and a query set representation; and finally performing multi-modal named entity recognition based on these three representations and outputting the recognition result. The application effectively improves the accuracy and efficiency of multi-modal named entity recognition.
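As a rough orientation only, the four stages summarized in the abstract can be sketched in plain Python. Every function name, dimension, and scoring rule below is a hypothetical toy illustration, not the patent's implementation:

```python
import math

def extract_features(text_tokens, regions):
    # Stage 1 (toy): map each token and each candidate region to a 2-d vector.
    text_repr = [[float(len(t)), 1.0] for t in text_tokens]
    return text_repr, [list(r) for r in regions]

def build_query_set(entity_types):
    # Stage 2 (toy): one query vector per entity type (type granularity).
    return [[float(i + 1), 0.5] for i in range(len(entity_types))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse(queries, feats):
    # Stage 3 (toy): softmax attention over the features for each query.
    out = []
    for q in queries:
        logits = [dot(q, f) for f in feats]
        m = max(logits)
        w = [math.exp(x - m) for x in logits]
        z = sum(w)
        out.append([sum(wi / z * f[d] for wi, f in zip(w, feats))
                    for d in range(2)])
    return out

def recognize(queries, visual_repr):
    # Stage 4 (toy): ground each query in its best-matching image region.
    return [max(range(len(visual_repr)),
                key=lambda j, q=q: dot(q, visual_repr[j]))
            for q in queries]
```

A usage pass would feed tokenized text and region vectors through the four functions in order, mirroring the abstract's pipeline.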

Inventors

  • YIN JIAN
  • TANG JIELONG
  • YU JIANXING
  • LI FAN
  • WANG SHIQI
  • LIU WEI
  • LAI HANJIANG

Assignees

  • Sun Yat-sen University (中山大学)

Dates

Publication Date
2026-05-05
Application Date
2025-11-27

Claims (10)

  1. A multi-modal named entity recognition system based on multi-granularity query guidance, comprising: a feature extraction unit for preprocessing input multi-modal data to obtain a text representation and a visual representation respectively; a multi-granularity query set construction unit for constructing a multi-granularity query set; a query guidance fusion unit for fusing the text representation and the visual representation using the multi-granularity query set, and outputting a fused text representation, a fused visual region representation, and a query set representation; and a multi-modal entity recognition unit for performing multi-modal named entity recognition using the fused text representation, the fused visual region representation, and the query set representation, and outputting a multi-modal named entity recognition result.
  2. The multi-modal named entity recognition system based on multi-granularity query guidance of claim 1, wherein the multi-modal data comprises text data and image data, the feature extraction unit comprises a text encoder and a visual encoder, and preprocessing the input multi-modal data to obtain a text representation and a visual representation respectively comprises: S11, performing context modeling on the text data with the text encoder to obtain the text representation; S12, extracting visual features of the image data with the visual encoder to obtain the visual representation.
  3. The multi-modal named entity recognition system based on multi-granularity query guidance of claim 2, wherein extracting visual features of the image data with the visual encoder to obtain the visual representation comprises: S121, performing region proposal detection on the image data to obtain a plurality of candidate regions; S122, encoding the candidate regions through the visual encoder to obtain region-level visual features corresponding to the candidate regions, and adding to the candidate regions a special region embedding for representing an entity that cannot be located in the image; S123, combining the region-level visual features with the special region embedding to obtain the visual representation.
  4. The multi-modal named entity recognition system based on multi-granularity query guidance of claim 1, wherein the multi-granularity query set construction unit comprises a type-granularity query generator, a query embedder, and a query fusion module, and wherein constructing the multi-granularity query set comprises: S21, parsing a preset prompt template with the type-granularity query generator, and outputting type-granularity query embeddings; S22, initializing the type-granularity query embeddings as learnable embedding vectors with the query embedder, and constructing the learnable embedding vectors into entity-granularity query embeddings; S23, fusing the type-granularity query embeddings with the entity-granularity query embeddings using the query fusion module to obtain the multi-granularity query set.
  5. The multi-modal named entity recognition system based on multi-granularity query guidance of claim 1, wherein the query guidance fusion unit is provided with a cross-attention mechanism by which the multi-granularity query set, the text representation, and the visual representation are fused, comprising: S31, inputting the multi-granularity query set and the text representation into a preset Transformer structure, and extracting query-related text features; S32, injecting the features of the multi-granularity query set as prefix information into the key and value layers of a preset visual Transformer, performing prefix modulation on the visual representation, and outputting query-guided visual region features; S33, performing weighted fusion of the text features and the visual region features using the similarity matrix between the multi-granularity query set and those features, and outputting the fused text representation, fused visual region representation, and query set representation.
  6. The multi-modal named entity recognition system based on multi-granularity query guidance of claim 5, wherein the similarity matrix between the multi-granularity query set and the visual region features or text features is computed by a formula [equation not reproduced in this extraction], in which h denotes a visual feature or a text feature, h_i is the representation of the i-th visual region or the i-th text token, and Q is the multi-granularity query set.
  7. The multi-modal named entity recognition system based on multi-granularity query guidance of claim 1, wherein performing multi-modal named entity recognition based on the fused text representation, fused visual region representation, and query set representation comprises: generating a boundary probability matrix containing the start and end indices of predicted candidate entities in the text, using the fused text representation and the query set representation; generating a matching probability matrix containing the region indices of predicted candidate entities in the image, using the fused visual region representation and the query set representation; generating presence detections for the type-granularity queries using the boundary probability matrix and the matching probability matrix; and performing global optimal matching using the presence detections of the type-granularity queries and a matching algorithm, and outputting the multi-modal named entity recognition result.
  8. The multi-modal named entity recognition system based on multi-granularity query guidance of claim 7, wherein the boundary probability matrix comprises a probability matrix of the start index and a probability matrix of the end index; the probability matrix of the start index is computed by a formula [equation not reproduced in this extraction] involving a first learnable parameter and a joint representation for segment-boundary localization; the probability matrix of the end index is computed by a corresponding formula [equation not reproduced in this extraction] involving a second learnable parameter.
  9. The multi-modal named entity recognition system based on multi-granularity query guidance of claim 7, wherein the matching probability matrix is computed by a formula [equation not reproduced in this extraction] involving a third, a fourth, and a fifth learnable parameter, a joint representation for candidate-region matching, the fused visual region representation, and the query set representation.
  10. A multi-modal named entity recognition method based on multi-granularity query guidance, characterized by comprising the following steps: S1, preprocessing input multi-modal data to obtain a text representation and a visual representation respectively; S2, constructing a multi-granularity query set; S3, performing multi-granularity query-guided fusion of the text representation and the visual representation using the multi-granularity query set, and outputting a fused text representation, a fused visual region representation, and a query set representation; S4, performing multi-modal named entity recognition based on the fused text representation, the fused visual region representation, and the query set representation, and outputting a multi-modal named entity recognition result.
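Claims 5 and 6 describe similarity-weighted fusion between the query set and the text/visual features, but the formula of claim 6 did not survive extraction. A common formulation for such a similarity matrix, offered purely as a hedged illustration of the idea (the patent's exact formula may differ), is a softmax-normalized dot product between each feature h_i and the queries:

```python
import math

def similarity_matrix(features, queries):
    # Hypothetical reconstruction: S[i][k] = softmax over k of (h_i . q_k),
    # where h_i is the i-th visual-region or text-token representation
    # and q_k is the k-th query in the multi-granularity query set.
    S = []
    for h in features:
        logits = [sum(a * b for a, b in zip(h, q)) for q in queries]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        S.append([e / z for e in exps])
    return S

def fuse(features, queries):
    # Weighted fusion in the spirit of claim 5, step S33: each feature is
    # augmented with a similarity-weighted combination of the query vectors.
    S = similarity_matrix(features, queries)
    fused = []
    for h, row in zip(features, S):
        ctx = [sum(w * q[d] for w, q in zip(row, queries))
               for d in range(len(h))]
        fused.append([a + b for a, b in zip(h, ctx)])
    return fused
```

The additive combination at the end is one design choice among several; a gated or concatenation-based fusion would fit the claim language equally well.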
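Claims 7 through 9 describe boundary (start/end) probabilities over text tokens and a matching probability over image regions, with learnable parameters whose formulas are likewise missing from this extraction. The sketch below is one plausible shape for such scorers; the parameter names and the sigmoid/softmax choices are assumptions, not the patent's definitions:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def boundary_probs(query, token_reprs, w_start, w_end):
    # Hypothetical: P_start[i] = sigmoid(w_start . (query * h_i)), and
    # analogously for the end index, giving one row of each boundary matrix.
    start = [sigmoid(sum(w * q * h for w, q, h in zip(w_start, query, t)))
             for t in token_reprs]
    end = [sigmoid(sum(w * q * h for w, q, h in zip(w_end, query, t)))
           for t in token_reprs]
    return start, end

def region_match_probs(query, region_reprs):
    # Hypothetical matching probability over candidate regions: a softmax
    # over query/region dot products, so each query commits to one region.
    logits = [sum(a * b for a, b in zip(query, r)) for r in region_reprs]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

The "global optimal matching" of claim 7 would then assign queries to (span, region) pairs from these matrices, e.g. with a Hungarian-style assignment.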

Description

Multi-modal named entity recognition system and method based on multi-granularity query guidance

Technical Field

The application relates to the technical field of multi-modal named entity recognition, in particular to a multi-modal named entity recognition system and method based on multi-granularity query guidance.

Background

With the rapid development of artificial intelligence, big data, and natural language processing technology, social media has become an important platform for information transmission and user interaction, and is widely applied in fields such as public opinion monitoring, automated news reporting, intelligent question answering, commodity recommendation, and security monitoring. Unlike traditional text-only content, current social media content is multi-modal, comprising both text information and image information. Relying on text alone for Named Entity Recognition (NER) has significant limitations. For example, when a post's text refers to a person but the accompanying image shows a different object, a single-modality recognition method easily produces misjudgments or omissions, and struggles to meet the practical requirements of multi-modal content analysis. Grounded Multimodal Named Entity Recognition (GMNER) has therefore become a research focus: the task must not only extract named entities and their types (such as person names, place names, organization names, and brand names) from text, but also locate the corresponding regions of those entities in images to achieve cross-modal alignment. The prior art mainly comprises two classes of methods. The first is the pipeline method, which decomposes the GMNER task into subtasks such as multi-modal entity recognition, entity binding, and entity localization.
For example, a sequence labeling model is used to identify text entities, and an image region is then determined by a visual grounding model. However, this approach suffers from error propagation and low inference efficiency, and is difficult to adapt to complex social media scenes. The second is the end-to-end method, which models text entity recognition and image entity localization jointly, for example by introducing query guidance through a machine reading comprehension framework to achieve simultaneous recognition and localization. However, such methods generally rely on manually constructed query sentences, which is time-consuming and labor-intensive, struggles to cover all entity types, and easily causes misjudgment under fuzzy semantics or in multi-entity scenes.

Disclosure of Invention

The invention provides a multi-modal named entity recognition system and method based on multi-granularity query guidance, aiming to solve the problems of low recognition accuracy and low recognition efficiency for multi-modal named entities in the prior art.
In order to achieve the above technical effects, the technical scheme of the invention is as follows: a multi-modal named entity recognition system based on multi-granularity query guidance, comprising: a feature extraction unit for preprocessing input multi-modal data to obtain a text representation and a visual representation respectively; a multi-granularity query set construction unit for constructing a multi-granularity query set; a query guidance fusion unit for fusing the text representation and the visual representation using the multi-granularity query set, and outputting a fused text representation, a fused visual region representation, and a query set representation; and a multi-modal entity recognition unit for performing multi-modal named entity recognition using the fused text representation, the fused visual region representation, and the query set representation, and outputting a multi-modal named entity recognition result.

Preferably, the multi-modal data includes text data and image data, and preprocessing the input multi-modal data to obtain a text representation and a visual representation respectively includes: S11, inputting the text data into a preset text encoder, and performing context modeling on the text data with the text encoder to obtain the text representation; S12, inputting the image data into a pre-trained visual encoder, and extracting visual features of the image data with the visual encoder to obtain the visual representation.

Preferably, said extracting visual features of the image data with said visual encoder to obtain said visual representation comprises: S121, performing region proposal detection on the image data to obtain a plurality of candidate regions; S122, encoding the candidate regions through the visual encoder to obtain region-level visual features corresponding to the candidate regions, and adding a special region embedding for representing an entity which cannot