
CN-116563619-B - Multimodal meme emotion detection method

CN116563619B

Abstract

The invention discloses a multimodal meme emotion detection method. The method performs fine-grained preprocessing of meme data, takes the three multimodal pre-training models VL-BERT, UNITER, and Villa as a basis, introduces adversarial training together with input entity and character feature information, and improves the loss function and classification head for hierarchical multi-label classification, thereby improving the multimodal pre-training models. The improved models perform well on both the meme emotion detection task and the fine-grained emotion type classification task, and in experiments their performance exceeds that of all the basic multimodal classification models and their improved versions.

Inventors

  • LIU RUIKANG
  • LIN HONGFEI
  • YANG LIANG

Assignees

  • Dalian University of Technology (大连理工大学)

Dates

Publication Date
2026-05-08
Application Date
2023-05-05
Priority Date
2022-06-27

Claims (10)

  1. A multimodal meme emotion detection method, characterized by comprising the following steps:
     Step 1, removing the text content in a meme image to obtain a meme image A;
     Step 2, when the meme image A comprises a plurality of sub-images, separating the sub-images in the meme image A with a trained Faster R-CNN model, then treating the region where each sub-image is located as a region of interest, and extracting the features of the corresponding regions of interest with the RoI Pooling module of Faster R-CNN;
     Step 3, identifying entity information in the meme image A, and, if the meme image A comprises a plurality of sub-images, identifying the entity information in the sub-images; identifying character feature information in the meme image A with a face detector, and, if the meme image A comprises a plurality of sub-images, identifying the character feature information in the sub-images;
     Step 4, training basic multimodal classification models based on VL-BERT, UNITER, and Villa, where the classification head adopts a multi-layer perceptron; by training the base and large versions of VL-BERT, UNITER, and Villa, 6 models are trained for each of the meme emotion detection task and the fine-grained emotion type classification task;
     Step 5, improving the basic multimodal classification models of step 4 according to the following schemes:
     Scheme one, inputting the entity information and character feature information of step 3 into the models of step 4 as text, and training 12 models for each of the meme emotion detection task and the fine-grained emotion type classification task;
     Scheme two, adding adversarial training on the language modality of the base version of VL-BERT of step 4, and adding adversarial training on the language modality, the visual modality, and both the language and visual modalities of the base versions of UNITER and Villa, training 7 models for each of the meme emotion detection task and the fine-grained emotion type classification task;
     Scheme three, training the base and large versions of VL-BERT, UNITER, and Villa of step 4 with the FL, ASL, NTR-FL, and MC loss functions respectively, training 24 models for the fine-grained emotion type classification task;
     Scheme four, training the base and large versions of VL-BERT, UNITER, and Villa of step 4 with the MR-head and P-head classification heads respectively, training 12 models for the fine-grained emotion type classification task;
     Step 6, inputting the region-of-interest features of step 2 into the multimodal classification models obtained from the improvement schemes to obtain the detection result of the corresponding task.
  2. The multimodal meme emotion detection method according to claim 1, wherein step 5 further comprises a scheme five: taking the basic multimodal classification models of step 4 and the improved multimodal classification models of schemes one to four of step 5 as base models, selecting a plurality of base models, and integrating the selected base models by averaging their predicted logit values; the integrated model is the multimodal meme emotion detection model, and the integration method is $\hat{z}_i = \frac{1}{K}\sum_{k=1}^{K} z_i^{(k)}$, where $\hat{z}_i$ is the integrated model's predicted logit value for sample $i$, $z_i^{(k)}$ is the $k$-th base model's predicted logit value for sample $i$, and $K$ is the number of selected base models (a minimal averaging sketch appears after the claims).
  3. The multimodal meme emotion detection method according to claim 1, wherein in step 1 the text content in the meme image is removed by first detecting the text positions in the meme with OCR, then covering the text and removing the text content from the image with DeepFill image inpainting (a text-removal sketch appears after the claims).
  4. The multimodal meme emotion detection method according to claim 1, wherein in step 2 the trained Faster R-CNN model is trained as follows: a plurality of images are randomly selected from the GQA dataset to obtain a training dataset, and a Faster R-CNN model is trained on it; data enhancement is performed before the meme image A is input into the trained Faster R-CNN model, the enhancement comprising one or both of random horizontal flipping and random image-size scaling (an augmentation sketch appears after the claims).
  5. The method according to claim 1, wherein step 3 uses web entity detection to identify the entity information in the meme image A and uses the FairFace face detector to identify the character feature information in the image.
  6. The multimodal meme emotion detection method according to claim 1, wherein in step 4 the classification head takes the multimodal meme feature vector $h$ as input, passes it sequentially through a fully connected layer FC, the Gaussian error linear unit (GELU) activation function, layer normalization LN, a Dropout layer, and a second fully connected layer FC, and finally outputs the logit value $z \in \mathbb{R}^{c}$ of each class of the meme, where $c$ is the number of classes of the task; training adopts a binary cross-entropy loss function (a module-level sketch appears after the claims).
  7. The method of claim 1, wherein after the entity information and the character feature information are added, the text input is "<CLS> meme text <SEP> entity tags <SEP> character feature tags <SEP>", where <CLS> and <SEP> are special input tokens, "meme text" is the meme's text, "entity tags" is text composed of the name of each entity in the meme with different entity names separated by <SEP>, and "character feature tags" is text composed of the character feature information in the meme with the feature information of different characters separated by <SEP> (a string-assembly sketch appears after the claims).
  8. The multimodal meme emotion detection method according to claim 1, wherein scheme two of step 5 is specifically as follows: the loss function with a perturbation added in the visual modality, $L_v$, and the loss function with a perturbation added in the language modality, $L_l$, are
     $L_v = L_{std}(x_v + \delta_v, x_l, y) + KL(f(x_v + \delta_v, x_l), z)$,
     $L_l = L_{std}(x_v, x_l + \delta_l, y) + KL(f(x_v, x_l + \delta_l), z)$;
     if perturbations are added in both the visual and language modalities at the same time, the loss function of adversarial training is
     $L_{at} = L_{std}(x_v, x_l, y) + L_v + L_l$,
     where $x_v$ and $x_l$ are the visual and language modality inputs respectively, $\delta_v$ and $\delta_l$ are the adversarial perturbations of the visual and language modalities respectively, $y$ is the sample label, $f(\cdot,\cdot)$ denotes the model's logit output, $L_{std}$ is the standard classification loss function used when adversarial training is not applied, which makes the model still output the correct label when the input perturbation is added, $KL$ is the KL divergence, used to keep the logit values output by the model consistent before and after the input perturbation is added, and $z$ is the logit value output by the model when no input perturbation is added (a training-step sketch appears after the claims).
  9. The multimodal meme emotion detection method according to claim 1, wherein scheme three of step 5 is specifically as follows (loss sketches appear after the claims):
     the FL loss function is
     $L_{FL} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}\left[\alpha\,(1-p_{i,c})^{\gamma}\,y_{i,c}\log p_{i,c} + (1-\alpha)\,p_{i,c}^{\gamma}\,(1-y_{i,c})\log(1-p_{i,c})\right]$, with $p_{i,c} = \sigma(z_{i,c})$,
     where $N$ is the size of the dataset, $C$ is the number of categories, $z_{i,c}$ is the logit value of sample $i$ for category $c$, $y_{i,c}$ is the label of sample $i$ for category $c$, $\sigma$ is the sigmoid function, $\gamma$ is the focusing factor, which controls the degree of attenuation of simple-sample weights, and $\alpha$ is a hyperparameter used to balance the weights of the positive and negative samples;
     the ASL loss function is
     $L_{ASL} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}\left[(1-p_{i,c})^{\gamma_{+}}\,y_{i,c}\log p_{i,c} + p_{m,i,c}^{\gamma_{-}}\,(1-y_{i,c})\log(1-p_{m,i,c})\right]$, with $p_{m,i,c} = \max(p_{i,c} - m,\, 0)$,
     where $\gamma_{+}$ and $\gamma_{-}$ are the focusing factors of the positive and negative samples respectively, controlling the degree of attenuation of simple-sample weights among the positive and negative samples, and $m$ is a probability shift threshold;
     the NTR-FL loss function is
     $L_{NTR\text{-}FL} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}\left[(1-p_{i,c})^{\gamma}\,y_{i,c}\log p_{i,c} + \frac{1}{\lambda}\,q_{i,c}^{\gamma}\,(1-y_{i,c})\log(1-q_{i,c})\right]$, with $q_{i,c} = \sigma(\lambda\,(z_{i,c} - \nu_c))$, $\nu_c = \kappa\log(1/\hat{p}_c - 1)$, and $\hat{p}_c = n_c / N$,
     where $\lambda$ is the negative-sample tolerance regularization strength, $\nu_c$ is the intrinsic bias of the model, $\hat{p}_c$ is the class probability prior, $n_c$ is the number of positive samples of category $c$, $\kappa$ is a hyperparameter used to embed the intrinsic bias of the model into the training process, and $1/\lambda$ is a normalization coefficient;
     the MC loss function calculates a modified probability that a sample belongs to a positive class, specifically
     $\tilde{p}_{i,c} = \max_{s \in S_c} p_{i,s}$,
     where $p_{i,c}$ is the probability that sample $i$ belongs to category $c$, $\tilde{p}_{i,c}$ is the maximum-constraint-modified probability that sample $i$ belongs to category $c$, and $S_c$ is the set of subclasses of category $c$ with $c \in S_c$; the MC loss is then the binary cross-entropy loss computed with the modified probabilities.
  10. The multimodal meme emotion detection method according to claim 1, wherein scheme four of step 5 is specifically as follows (head sketches appear after the claims):
     the MR-head classification head is expressed as a tree structure G, obtained by adding, to each non-leaf node of the category hierarchy tree T other than the root node, a new leaf node that is the special child node of that non-leaf node. In G, the root node corresponds to the input feature vector $x$ of the MR-head; each of the remaining non-leaf nodes corresponds to a hidden activation output vector $h$ in the MR-head; each leaf node corresponds to the output logit value of a certain class; and each edge corresponds to a network module whose input is the feature vector represented by the node at the tail of the edge and whose output is the feature vector represented by the node at the head of the edge. Each edge pointing to a non-leaf node corresponds to a basic fully connected block formed by connecting an FC layer, a GELU layer, an LN layer, and a Dropout layer in series; each edge pointing to a leaf node corresponds to a fully connected layer of output dimension 1 whose output is the logit value of the class corresponding to that leaf node. The logit value of a class at a leaf node of T is the output corresponding to the leaf node at the same position in G, and the logit value of a class at a non-leaf node of T is the output corresponding to the special child node of the non-leaf node at the same position in G. In the hierarchical multi-label classification task, the MR-head averages the elements of the hidden vector $h_c$ of a non-leaf node (the output of the FC layer in the basic fully connected block) and adds the mean to the output $\tilde{z}_c$ corresponding to the special child node of that non-leaf node:
     $z_c = \tilde{z}_c + \frac{1}{d}\sum_{j=1}^{d} h_{c,j}$,
     where $\tilde{z}_c$ is the original logit value, i.e. the output corresponding to the special child node of the non-leaf node, $h_c$ is the hidden vector of the non-leaf node, $d$ is the hidden-layer output dimension, and $z_c$ is the final logit value of class $c$;
     the P-head classification head further introduces, on the basis of the MR-head, a parent-class probability prior through the law of total probability to strengthen the relation between the subclass probability values and the parent-class probability value output by the model. Let class $c$ be a subclass of the parent class $u$. By the law of total probability, the probability that sample $i$ belongs to class $c$ can be expressed as
     $P(y_{i,c} = 1) = P(y_{i,c} = 1 \mid y_{i,u} = 1)\,P(y_{i,u} = 1)$.
     Since the subclass logit value $z_{i,c}$ output by the MR-head is influenced by the parent-class logit value $z_{i,u}$, $\sigma(z_{i,c})$ is taken as $P(y_{i,c} = 1 \mid y_{i,u} = 1)$ and $P(y_{i,u} = 1)$ is taken as the parent-class prior probability; thus in the P-head the probability that sample $i$ belongs to class $c$ is expressed as
     $p_{i,c} = \sigma(z_{i,c})\,P(y_{i,u} = 1)$,
     and the P-head computes the parent-class prior probability as
     $P(y_{i,u} = 1) = \sigma(w\,z_{i,u} + b)$,
     where $w$ and $b$ are hyperparameters used to linearly transform the parent-class logit value $z_{i,u}$.
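The logit averaging of claim 2 is a one-liner; a minimal sketch in PyTorch, assuming each selected base model has already produced a logit tensor of shape (num_samples, num_classes):

```python
import torch

def ensemble_logits(base_logits: list[torch.Tensor]) -> torch.Tensor:
    """Integrate K base models by averaging their predicted logits (claim 2).
    Each tensor in base_logits has shape (num_samples, num_classes)."""
    return torch.stack(base_logits, dim=0).mean(dim=0)
```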
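Claim 3 removes meme text via OCR detection plus inpainting. A rough sketch follows; easyocr is an assumed OCR choice (the claim only says "OCR"), and OpenCV's inpaint stands in for the DeepFill model the claim names:

```python
import cv2
import numpy as np
import easyocr  # assumed OCR library; the claim does not name one

def remove_meme_text(image_path: str) -> np.ndarray:
    """Detect text regions, mask them, and fill the masked area (claim 3).
    cv2.inpaint is a stand-in for the DeepFill inpainting in the claim."""
    img = cv2.imread(image_path)
    mask = np.zeros(img.shape[:2], dtype=np.uint8)
    for bbox, _text, _conf in easyocr.Reader(["en"]).readtext(img):
        cv2.fillPoly(mask, [np.array(bbox, dtype=np.int32)], 255)
    return cv2.inpaint(img, mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
```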
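Claim 4's data enhancement (random horizontal flip, random image-size scaling) can be sketched with torchvision; the 0.8-1.2 scale range below is an assumption, since the claim does not fix one:

```python
import random
from torchvision.transforms import functional as TF

def enhance(img):
    """Random horizontal flip and random image-size scaling (claim 4).
    `img` is a PIL image; the scale range is an assumption."""
    if random.random() < 0.5:
        img = TF.hflip(img)
    scale = random.uniform(0.8, 1.2)
    w, h = img.size
    return TF.resize(img, [int(h * scale), int(w * scale)])
```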
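Claim 6's classification head maps directly onto a small PyTorch module. The hidden width (kept equal to the input width) and the dropout rate are assumptions; the claim fixes only the layer order:

```python
import torch.nn as nn

class ClassificationHead(nn.Module):
    """FC -> GELU -> LayerNorm -> Dropout -> FC, per claim 6.
    Outputs one logit per class; train with nn.BCEWithLogitsLoss."""

    def __init__(self, in_dim: int, num_classes: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, in_dim),  # hidden width = in_dim (assumption)
            nn.GELU(),
            nn.LayerNorm(in_dim),
            nn.Dropout(dropout),
            nn.Linear(in_dim, num_classes),
        )

    def forward(self, h):
        return self.net(h)
```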
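The claim-7 text input is plain string assembly; a minimal sketch:

```python
def build_text_input(meme_text: str, entities: list[str],
                     characters: list[str]) -> str:
    """Assemble "<CLS> meme text <SEP> entity tags <SEP> character feature
    tags <SEP>" per claim 7; items within a segment are <SEP>-separated."""
    sep = " <SEP> "
    return (f"<CLS> {meme_text}{sep}"
            f"{sep.join(entities)}{sep}{sep.join(characters)} <SEP>")
```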
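Claim 8 leaves the perturbation-generation procedure open; the sketch below uses a single FGM-style gradient step on both modality embeddings, which is one common choice, not necessarily the patent's. The Bernoulli KL term keeps perturbed logits close to the clean logits, as the claim requires:

```python
import torch
import torch.nn.functional as F

def bernoulli_kl(z_ref, z):
    """KL(sigmoid(z_ref) || sigmoid(z)) per class, summed over classes,
    averaged over the batch."""
    p = torch.sigmoid(z_ref)
    kl = (p * (F.logsigmoid(z_ref) - F.logsigmoid(z))
          + (1 - p) * (F.logsigmoid(-z_ref) - F.logsigmoid(-z)))
    return kl.sum(dim=-1).mean()

def adversarial_loss(model, x_v, x_l, y, eps=1e-3):
    """Claim-8 sketch: perturb the vision and language inputs, keep labels
    correct and logits consistent. x_v, x_l are embedding tensors that
    require gradients; the single-step FGM update is an assumption."""
    z = model(x_v, x_l)  # clean logits
    loss_std = F.binary_cross_entropy_with_logits(z, y)
    g_v, g_l = torch.autograd.grad(loss_std, (x_v, x_l), retain_graph=True)
    d_v = eps * g_v / (g_v.norm() + 1e-12)
    d_l = eps * g_l / (g_l.norm() + 1e-12)
    z_ref = z.detach()

    def modality_loss(z_pert):
        return (F.binary_cross_entropy_with_logits(z_pert, y)
                + bernoulli_kl(z_ref, z_pert))

    loss_v = modality_loss(model(x_v + d_v, x_l))  # L_v: perturb vision
    loss_l = modality_loss(model(x_v, x_l + d_l))  # L_l: perturb language
    return loss_std + loss_v + loss_l              # L_at
```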
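The FL and ASL losses of claim 9, as reconstructed above, can be written compactly; the default hyperparameter values below follow the focal-loss and ASL papers and are assumptions, as the claim does not fix them:

```python
import torch

def focal_loss(z, y, alpha=0.25, gamma=2.0):
    """FL over logits z and 0/1 labels y, both (N, C) tensors (claim 9)."""
    p = torch.sigmoid(z)
    pos = alpha * (1 - p) ** gamma * y * torch.log(p.clamp_min(1e-8))
    neg = (1 - alpha) * p ** gamma * (1 - y) * torch.log((1 - p).clamp_min(1e-8))
    return -(pos + neg).mean()

def asymmetric_loss(z, y, gamma_pos=0.0, gamma_neg=4.0, m=0.05):
    """ASL: separate focusing factors for positives and negatives, plus a
    probability shift m on the negative branch (claim 9)."""
    p = torch.sigmoid(z)
    p_m = (p - m).clamp_min(0.0)  # shifted probability for negatives
    pos = (1 - p) ** gamma_pos * y * torch.log(p.clamp_min(1e-8))
    neg = p_m ** gamma_neg * (1 - y) * torch.log((1 - p_m).clamp_min(1e-8))
    return -(pos + neg).mean()
```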
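The two head-level operations of claim 10 reduce to a few lines of logit arithmetic each; the sketch below shows only that arithmetic, not the tree-structured network itself:

```python
import torch

def mr_head_logit(z_special: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    """MR-head (claim 10): add the mean of a non-leaf node's hidden vector
    h to the logit output by that node's special child node."""
    return z_special + h.mean(dim=-1)

def p_head_probability(z_child, z_parent, w=1.0, b=0.0):
    """P-head (claim 10): p(c) = p(c|u) * p(u), with sigmoid(z_child) as
    p(c|u) and a sigmoid of a linear transform of the parent logit as the
    parent prior p(u); w and b are the claim's hyperparameters."""
    return torch.sigmoid(z_child) * torch.sigmoid(w * z_parent + b)
```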

Description

Multimodal meme emotion detection method

Technical Field

The invention belongs to the field of semantic recognition, and particularly relates to a multimodal meme emotion detection method.

Background

With the popularity of memes on social platforms, different emotions are increasingly expressed on the network with memes as the medium, which makes it necessary to develop a multimodal meme detection model that can automatically identify the emotion of a meme and classify its emotion types. Meme emotion detection means that, given a meme, the model completes the binary classification problem of whether the meme expresses a certain emotion. Fine-grained classification of meme emotion types is a hierarchical multi-label classification problem: given a meme, the model needs to judge whether the meme expresses a certain emotion, and further needs to judge the fine-grained emotion types of the meme. For example, the fine-grained types of positive emotion include endorsement, admiration, excitement, and the like; the fine-grained types of negative emotion include anger, sadness, disappointment, and the like; and the fine-grained types of neutral emotion include confusion, curiosity, surprise, and the like.

For meme emotion detection, the common practice is to fine-tune a multimodal pre-training model directly on a meme dataset. For the binary classification problem of meme emotion detection, the classification head of the multimodal pre-training model during fine-tuning adopts a multi-layer perceptron with an output dimension of 1, and the loss function adopts a binary cross-entropy loss. For the hierarchical multi-label classification problem of fine-grained emotion type classification, the multi-label classification problem is generally decomposed into several binary classification problems: the classification head during fine-tuning adopts a multi-layer perceptron with an output dimension of c, where c is the number of emotion types the model can identify, and the loss function is the sum of c binary cross-entropy losses. In addition, to make the model output satisfy the category hierarchy constraint, a post-processing step is required. A commonly used post-processing method is the maximum limit: the parent-class probability value output by the model takes the maximum of all its subclass probability values, where each category also counts as its own subclass (a minimal sketch of this post-processing follows below).

To further improve meme emotion detection performance, Zhu R et al., in "Enhance multimodal transformer with external label and in-domain pretrain: Hateful meme challenge winning solution" (arXiv preprint arXiv:2012.08290), perform fine-grained preprocessing of meme data, extract the entity and character feature information appearing in the meme, input this information into a multimodal pre-training model, and at the same time propose an extended VL-BERT structure to promote the cross-modal fusion capability of the model.
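The maximum-limit post-processing described above takes only a few lines; a minimal sketch, with illustrative class names in the docstring example:

```python
def max_limit(probs: dict, children: dict) -> dict:
    """Maximum-limit post-processing: each parent class's probability is
    the max over its subclasses' (post-processed) probabilities, with
    every class counted as its own subclass.

    Example: probs={"positive": 0.3, "excitement": 0.7},
             children={"positive": ["excitement"]}
             -> {"positive": 0.7, "excitement": 0.7}
    """
    out = {}

    def resolve(c):
        if c not in out:
            out[c] = max([probs[c]] + [resolve(s) for s in children.get(c, [])])
        return out[c]

    for c in probs:
        resolve(c)
    return out
```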
In addition, in order to improve the performance of a meme emotion detection model as much as possible, a common method is to perform model integration based on different kinds of multimodal pre-training models. For example, Zhu R et al. performed model integration based on the base and large versions of the 4 multimodal pre-training models VL-BERT, UNITER, Villa, and ERNIE-ViL. For the hierarchical multi-label classification task of fine-grained emotion type classification, a multi-layer-perceptron classification head cannot exploit the necessary category hierarchy relations, and binary cross-entropy loss cannot handle the imbalance between positive and negative sample distributions that is common in hierarchical multi-label classification tasks. In addition, the superior performance of an integrated model depends on having many different base learners, and model integration based only on different kinds of multimodal pre-training models can train few base models, so the performance of the integrated model is limited.

Disclosure of the Invention

In order to overcome the defects in the prior art, the invention aims to construct a multimodal meme emotion detection model that can realize the following two classification tasks: (1) meme emotion detection, that is, given a meme, the model completes the binary classification problem of whether the meme expresses a certain emotion, such as a goodwill attitude versus a malicious attitude