CN-121997116-A - Knowledge-enhanced Chinese multimodal hate speech detection method
Abstract
The invention discloses a knowledge-enhanced Chinese multimodal hate speech detection method, belonging to the technical field of natural language processing. The method comprises: obtaining an image-text pair; extracting visual, scene text, text, and hate knowledge features in parallel through a feature representation module; inputting the text together with a prompt template into a large language model to generate a background knowledge text, which is BERT-encoded and concatenated with the text features and the hate knowledge features to form the final text features; and applying a knowledge-guided cross-modal attention mechanism that introduces a CLIP-generated image-text affinity matrix into the attention map to fuse the multimodal features and predict the hate category. By introducing large-language-model background knowledge and a Chinese hate vocabulary, the invention markedly alleviates the problems of Chinese context understanding, modality differences, and hate word recognition, and achieves higher accuracy, precision, recall, and F1 score on the CMMHS dataset, making the method suitable for auditing Chinese multimodal hate content on social platforms.
Inventors
- Huang Qingbao
- Li Pijian
- Chen Yifei
Assignees
- Guangxi University (广西大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-12-26
Claims (7)
- 1. A knowledge-enhanced Chinese multimodal hate speech detection method, comprising the steps of: S1, an acquisition step of acquiring an image-text pair; S2, a feature representation step, wherein the image-text pair is processed in parallel by a feature representation module to obtain: visual features $F_v$, wherein the appearance features $A$ of $M$ targets obtained through a pretrained Mask R-CNN and the 4-dimensional relative position features $P$ are mapped through learnable weight matrices $W_A$ and $W_P$, layer-normalized, and added element-wise, $F_v = \mathrm{LN}(W_A A) + \mathrm{LN}(W_P P)$, where $W_A$ and $W_P$ are learnable parameters and $\mathrm{LN}$ denotes layer normalization; scene text features $F_s$, wherein the 300-dimensional FastText semantic features $S$ of $N$ scene texts obtained through the Baidu Intelligent Cloud OCR API, the normalized bounding-box position features $B$, and the confidence features $C$ are mapped through learnable weight matrices $W_S$, $W_B$, and $W_C$, layer-normalized, and added element-wise, $F_s = \mathrm{LN}(W_S S) + \mathrm{LN}(W_B B) + \mathrm{LN}(W_C C)$, where $W_S$, $W_B$, and $W_C$ are learnable parameters and $\mathrm{LN}$ denotes layer normalization; text features $F_t$, obtained by extracting 768-dimensional sentence-level features through BERT; and hate knowledge features $F_h$, obtained by matching words in the text against a preset Chinese hate vocabulary via table lookup and then embedding the matched words; S3, a background supplementing step, wherein the text is spliced with a preset prompt template and input into a large language model to generate a background knowledge text, external knowledge features $F_k$ are obtained from the background knowledge text through BERT encoding and are concatenated with the text features $F_t$ and the hate knowledge features $F_h$ along the feature dimension to form the final text features $F_T = \mathrm{cat}(F_t, F_k, F_h)$; S4, a multimodal reconstruction step, wherein a correction matrix is calculated by a knowledge-guided cross-modal attention mechanism to bridge the semantic gap between image and text, the multimodal features are fused, and the hate category is predicted (a minimal sketch of the feature representation step follows the claims).
- 2. The method of claim 1, wherein the text in step S1 comprises one or more of Chinese microblog text, short-video text, comments, or bullet screens (danmaku).
- 3. The method according to claim 1, wherein the multimodal reconstruction step specifically comprises: 1) an affinity matrix generation step, wherein the image characterization $V$, obtained by passing the visual features $F_v$ through a CLIP image encoder, and the aggregated text characterization $T$, obtained by pooling the final text features, are compared by cosine similarity and then weighted by a learnable Gaussian kernel function to generate an affinity matrix $G$ that quantifies the strength of semantic association between the image and text modalities; 2) a KGCA network processing step, wherein each encoder layer contains a KGCA network that integrates the affinity matrix into the attention map, the initial input of the encoder layer is formed from the visual features $F_v$ and the joint text features $F_j$, the representations are further fused by a self-attention layer and a multi-layer perceptron (MLP) after cross-modal interaction in the KGCA layer, and the output is finally fed into a decoder as an intermediate state; 3) a decoder output step, wherein the initial features of the input text serve as the query, the intermediate state serves as the key and value, and the final prediction result is obtained from the decoder output through MLP and softmax layers.
- 4. The method of claim 3, wherein the KGCA network processing step specifically includes: 1) mapping the input to a high-dimensional space through learnable weight matrices and performing a matrix multiplication to generate the corresponding weight map; 2) performing a Hadamard product between the weight map and the affinity map, which carries the external knowledge, to effectively bridge the differences between the modalities; and 3) normalizing the resulting attention map by the L1 norm to obtain the optimized attention map.
- 5. The method according to claim 3, wherein in the affinity matrix generation step, the affinity matrix $G$ is calculated as: $G = \alpha \cdot \exp\!\left(-\dfrac{1 - \mathrm{L2}(V)\,\mathrm{L2}(T)^{\top}}{2\sigma^{2}}\right)$, where $\alpha$ and $\sigma$ are learnable parameters and $\mathrm{L2}$ denotes L2 normalization.
- 6. The method according to claim 3, wherein the joint text features $F_j$ are obtained by concatenating the final text features $F_T$ with the scene text features $F_s$: $F_j = \mathrm{cat}(F_T, F_s)$.
- 7. The method of claim 4, wherein the optimized attention map $\tilde{A}$ is calculated as follows: $\tilde{A} = \mathrm{L1}\!\left(\dfrac{Q K^{\top}}{\sqrt{d}} \odot G\right)$, $Q = W_Q X$, $K = W_K X$, where $Q$ and $K$ denote the query and key vectors respectively, $W_Q$ and $W_K$ are the learnable query and key matrices, $\odot$ denotes the Hadamard product operator, $d$ is the input channel dimension, and $\mathrm{L1}$ is the L1 normalization operation (a minimal sketch of this computation follows the claims).
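The following is a minimal PyTorch sketch of the feature representation step of claim 1. It is an illustrative reconstruction, not the patent's implementation: the 2048-dimensional appearance features, the shared 768-dimensional hidden size, the shared LayerNorm, and all module names are assumptions; only the $\mathrm{LN}(W x)$ element-wise-sum structure and the final concatenation come from the claims.

```python
import torch
import torch.nn as nn

class FeatureRepresentation(nn.Module):
    """Sketch of the parallel feature representation step (claim 1).

    Assumed dimensions: d_app for Mask R-CNN appearance features,
    300 for FastText scene-text vectors, d for the shared hidden size.
    """

    def __init__(self, d_app=2048, d=768):
        super().__init__()
        # Visual branch: appearance features + 4-d relative position
        self.w_a = nn.Linear(d_app, d)
        self.w_p = nn.Linear(4, d)
        # Scene-text branch: FastText semantics + 4-d box + 1-d confidence
        self.w_s = nn.Linear(300, d)
        self.w_b = nn.Linear(4, d)
        self.w_c = nn.Linear(1, d)
        # A single shared LayerNorm is an assumption; per-branch norms
        # would match the claim equally well.
        self.ln = nn.LayerNorm(d)

    def visual(self, appearance, position):
        # F_v = LN(W_A A) + LN(W_P P): element-wise addition
        return self.ln(self.w_a(appearance)) + self.ln(self.w_p(position))

    def scene_text(self, semantics, boxes, confidence):
        # F_s = LN(W_S S) + LN(W_B B) + LN(W_C C)
        return (self.ln(self.w_s(semantics))
                + self.ln(self.w_b(boxes))
                + self.ln(self.w_c(confidence)))

def final_text_features(f_text, f_knowledge, f_hate):
    # F_T = cat(F_t, F_k, F_h): concatenation along the feature dimension
    return torch.cat([f_text, f_knowledge, f_hate], dim=-1)
```

Each branch projects its heterogeneous inputs into a common space before addition, which is what allows appearance, position, and confidence cues of different dimensionalities to be summed element-wise.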
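Likewise, here is a minimal sketch of the optimized attention map of claims 4 and 7, assuming the formula as reconstructed above; the class name, the bias-free projections, and the row-wise L1 normalization axis are assumptions.

```python
import torch
import torch.nn as nn

class KGCAAttention(nn.Module):
    """Sketch of the knowledge-guided cross-modal attention map
    (claims 4 and 7): scaled dot-product weights modulated by the
    CLIP-derived affinity matrix G, then L1-normalized."""

    def __init__(self, d):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)  # learnable query matrix W_Q
        self.w_k = nn.Linear(d, d, bias=False)  # learnable key matrix W_K
        self.d = d

    def forward(self, x_query, x_key, affinity):
        # Q = W_Q X, K = W_K X
        q = self.w_q(x_query)
        k = self.w_k(x_key)
        # Weight map via matrix multiplication, scaled by sqrt(d)
        weight = q @ k.transpose(-2, -1) / self.d ** 0.5
        # Hadamard product with the affinity matrix G (external knowledge)
        attn = weight * affinity
        # L1 normalization over the key axis yields the optimized map
        return attn / attn.abs().sum(dim=-1, keepdim=True).clamp_min(1e-9)
```

Multiplying the weight map by the affinity, rather than adding to it, lets the external knowledge act as a gate: attention that the raw dot product would spread over semantically unrelated image regions is suppressed wherever the image-text affinity is low.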
Description
Knowledge-enhanced Chinese multimodal hate speech detection method

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a knowledge-enhanced Chinese multimodal hate speech detection method.

Background

With the development of social platforms, a great deal of discriminatory, aggressive, and hostile language aimed at specific groups or individuals has been produced. This phenomenon has evolved into a serious social hazard that can mislead the public and cause panic. To address this problem, the hate speech detection task has emerged. The multimodal hate speech detection task aims to identify whether content contains hate speech by integrating images with textual descriptions, and to identify the hate categories present therein. The task poses three main challenges: 1) how to alleviate the difficulty of understanding the Chinese language background; 2) how to alleviate the difficulty caused by differences in multimodal expression; and 3) how to detect hate vocabulary in the Chinese context. In recent years, owing to the development of multimodal large models, some researchers have proposed multimodal hate speech detection methods and datasets. However, existing work focuses on the English domain, Chinese hate speech detection has mainly been confined to the text domain, and research on Chinese multimodal hate speech detection remains insufficient. Research on a knowledge-enhanced Chinese multimodal hate speech detection framework is therefore of real significance, as it helps improve the performance of multimodal large models on Chinese multimodal hate speech detection tasks.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a knowledge-enhanced Chinese multimodal hate speech detection method that integrates an image and a textual description to identify whether content contains hate speech and to identify the hate categories present therein, thereby improving the performance of multimodal large models on Chinese multimodal hate speech detection tasks.
In order to achieve the above object, the invention is specifically as follows. The invention provides a knowledge-enhanced Chinese multimodal hate speech detection method comprising the following steps: S1, an acquisition step of acquiring an image-text pair; S2, a feature representation step, wherein the image-text pair is processed in parallel by a feature representation module to obtain: visual features $F_v$, wherein the appearance features $A$ of $M$ targets obtained through a pretrained Mask R-CNN and the 4-dimensional relative position features $P$ are mapped through learnable weight matrices $W_A$ and $W_P$, layer-normalized, and added element-wise, $F_v = \mathrm{LN}(W_A A) + \mathrm{LN}(W_P P)$, where $W_A$ and $W_P$ are learnable parameters and $\mathrm{LN}$ denotes layer normalization; scene text features $F_s$, wherein the 300-dimensional FastText semantic features $S$ of $N$ scene texts obtained through the Baidu Intelligent Cloud OCR API, the normalized bounding-box position features $B$, and the confidence features $C$ are mapped through learnable weight matrices $W_S$, $W_B$, and $W_C$, layer-normalized, and added element-wise, $F_s = \mathrm{LN}(W_S S) + \mathrm{LN}(W_B B) + \mathrm{LN}(W_C C)$, where $W_S$, $W_B$, and $W_C$ are learnable parameters and $\mathrm{LN}$ denotes layer normalization; text features $F_t$, obtained by extracting 768-dimensional sentence-level features through BERT; and hate knowledge features $F_h$, obtained by matching words in the text against a preset Chinese hate vocabulary via table lookup and then embedding the matched words; S3, a background supplementing step, wherein the text is spliced with a preset prompt template and input into a large language model to generate a background knowledge text, external knowledge features $F_k$ are obtained from the background knowledge text through BERT encoding and are concatenated with the text features $F_t$ and the hate knowledge features $F_h$ along the feature dimension to form the final text features $F_T = \mathrm{cat}(F_t, F_k, F_h)$; S4, a multimodal reconstruction step, wherein a correction matrix is calculated by a knowledge-guided cross-modal attention mechanism to bridge the semantic gap between image and text, the multimodal features are fused, and the hate category is predicted. Further, the text in step S1 includes one or more of Chinese microblog text, short-video text, comments, or bullet screens (danmaku). Further, the multimodal reconstruction step specifically includes: 1) an affinity matrix generation step, wherein the image characterization $V$, obtained by passing the visual features $F_v$ through a CLIP image encoder, and the aggregated text characterization $T$, obtained by pooling the final text features, are compared by cosine similarity and then weighted by a learnable Gaussian kernel function to generate an affinity matrix $G$ that quantifies the strength of semantic association between the image and text modalities (a minimal sketch of this step follows this section); 2) KGCA netw
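As a companion to the affinity matrix generation step above, here is a minimal PyTorch sketch assuming the Gaussian-kernel form reconstructed in claim 5; the tensor shapes, the initialization of $\alpha$ and $\sigma$, and the inputs v and t (which stand in for real CLIP encoder outputs) are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianAffinity(nn.Module):
    """Sketch of the affinity matrix G (claims 3 and 5): cosine similarity
    between the image characterization V and the pooled text
    characterization T, weighted by a Gaussian kernel with learnable
    alpha and sigma."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(1.0))
        self.sigma = nn.Parameter(torch.tensor(1.0))

    def forward(self, v, t):
        # v: (M, d) per-region image characterization from the CLIP encoder
        # t: (d,)  aggregated text characterization from pooling
        v = F.normalize(v, p=2, dim=-1)   # L2 normalization
        t = F.normalize(t, p=2, dim=-1)
        cos = v @ t                        # cosine similarity per region
        # Gaussian-kernel weighting (assumed form, as in claim 5)
        return self.alpha * torch.exp(-(1.0 - cos) / (2 * self.sigma ** 2))
```

Because cosine similarity of L2-normalized vectors reduces to a dot product, the kernel maps a similarity of 1 to the peak value alpha and smoothly discounts weakly related regions, giving the KGCA layers a soft, learnable measure of image-text association.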