CN-121981713-A - Social media irony information detection method and system based on tri-modal fusion
Abstract
The invention belongs to the technical field of ironic information detection, and provides a method and a system for detecting social media ironic information based on tri-modal fusion. The method acquires the image, the original text and image attribute text information of social media data to form a multi-modal input, extracts semantic features from each modality by utilizing a pre-trained model, and maps the semantic features to a shared semantic space in a cross-modal manner; in the shared semantic space, deep interactive information among the three modalities is dynamically captured and aggregated through a mixed-modal attention mechanism to generate a final fusion representation containing multi-modal context; a multi-task learning paradigm is built by introducing mutual learning, adversarial training and modal consistency losses, a classifier is trained by utilizing the multi-task learning paradigm, and the final fusion representation is processed by the trained classifier to determine whether the social media data contains ironic information. The invention can remarkably improve the detection performance for ironic information in the social media environment.
Inventors
- ZHANG LINA
- FENG JING
- YANG RUNTAO
Assignees
- 山东大学
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-01-29
Claims (10)
- 1. The method for detecting social media irony information based on tri-modal fusion is characterized by comprising the following steps: acquiring the image, the original text and image attribute text information of social media data to form a multi-modal input, extracting semantic features from each modality by utilizing a pre-trained model, and mapping the semantic features to a shared semantic space in a cross-modal manner; in the shared semantic space, dynamically capturing and aggregating deep interactive information among the three modalities through a mixed-modal attention mechanism, and generating a final fusion representation containing multi-modal context; introducing mutual learning, adversarial training and modal consistency losses, constructing a multi-task learning paradigm, and training a classifier by using the multi-task learning paradigm; and processing the final fusion representation using the trained classifier to determine whether the social media data includes ironic information.
- 2. The method for detecting social media irony information based on tri-modal fusion of claim 1, wherein the process of extracting semantic features from each modality using a pre-trained model comprises: extracting a high-dimensional feature vector of the image using the visual encoder of CLIP; extracting a feature vector of the original text using the text encoder of CLIP; and generating a descriptive text from each image by adopting a pre-trained BLIP model, and processing the generated image attribute text with the text encoder of CLIP to obtain a feature vector.
- 3. The method for detecting social media irony information based on tri-modal fusion according to claim 1, wherein the process of cross-modal mapping of semantic features to a shared semantic space comprises utilizing three independent cross-modal alignment modules, wherein each module is responsible for mapping the corresponding CLIP feature to a task-specific shared semantic space with a uniform dimension; for the CLIP feature of each modality, the cross-modal alignment module is composed of a two-layer feedforward neural network, and the aligned tri-modal features are obtained through the learned projection of the cross-modal alignment module.
- 4. The method for detecting social media irony information based on tri-modal fusion as claimed in claim 1, wherein, in the shared semantic space, the process of dynamically capturing and aggregating deep interactive information among the three modalities through the mixed-modal attention mechanism comprises stacking the tri-modal features with consistent batch sizes into a tensor and inputting the tensor into a multi-head self-attention layer, wherein each modal feature serves as query, key and value, the attention weights over all modalities are calculated, the outputs of the attention heads are concatenated, and the output of the attention mechanism is obtained through a linear projection layer.
- 5. The method for detecting social media irony information based on tri-modal fusion as claimed in claim 4, wherein the output of the attention mechanism is passed through a residual connection and layer normalization, and the processed modal features are averaged to obtain the final fusion representation.
- 6. The method for detecting social media irony information based on tri-modal fusion of claim 1, wherein the process of introducing mutual learning, adversarial training and modal consistency losses comprises a total loss function comprising a main classification loss function together with the mutual learning, adversarial training and modal consistency loss functions, wherein the main classification loss function adopts a standard cross-entropy loss to measure the difference between the probability distribution predicted by the classifier and the ground-truth label, and the main classification loss function has the highest weight in the total loss function.
- 7. The method for detecting social media irony information based on tri-modal fusion as claimed in claim 1, wherein the process of introducing the mutual learning loss comprises configuring a lightweight auxiliary classifier for each aligned modal feature, wherein each auxiliary classifier shares the same classification label supervision information with the main classifier, each auxiliary classifier predicts the irony probability from its own modal feature, and the loss of each auxiliary classifier adopts a standard cross-entropy loss; the weighted average of all auxiliary classifier losses forms part of the total loss function.
- 8. The method for detecting social media irony information based on tri-modal fusion of claim 1, wherein introducing the adversarial training loss comprises introducing an adversarial training mechanism that generates modality-independent representations through modality discriminator learning, the modality discriminator predicting the source modality category of each feature, the discriminator loss employing a cross-entropy loss, and a weighted average of all discriminator losses forming part of the total loss function.
- 9. The method for detecting social media irony information based on tri-modal fusion as claimed in claim 1, wherein the process of introducing the modal consistency loss comprises penalizing the distances between aligned modal features, encouraging them to be semantically close to each other; a cosine embedding loss is employed for the measurement, which focuses on encouraging the similarity of feature vectors in direction rather than strict scale matching, and the weighted value of the modal consistency loss forms part of the total loss function.
- 10. A social media irony information detection system based on tri-modal fusion, comprising: a tri-modal feature extraction and alignment module configured to acquire the image, the original text and image attribute text information of social media data to form a multi-modal input, extract semantic features from each modality by utilizing a pre-trained model, and map the semantic features to a shared semantic space in a cross-modal manner; a mixed-modal attention fusion module configured to dynamically capture and aggregate deep interactive information among the three modalities through a mixed-modal attention mechanism in the shared semantic space, and generate a final fusion representation containing multi-modal context; a collaborative training module configured to introduce mutual learning, adversarial training and modal consistency losses, construct a multi-task learning paradigm and train the classifier by using the multi-task learning paradigm; and a sarcasm classifier module configured to process the final fusion representation with the trained classifier to determine whether the social media data includes sarcastic information.
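The tri-modal extraction step of claim 2 can be illustrated with the following minimal sketch. The CLIP and BLIP calls here are stand-in stubs (a real implementation would load the pre-trained models, e.g. via an inference library), and the 512-dimensional feature size and the example caption are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
D_CLIP = 512  # assumed CLIP embedding width (e.g. ViT-B/32)

def clip_image_encoder(image: np.ndarray) -> np.ndarray:
    """Stand-in for CLIP's visual encoder: image -> 512-d feature vector."""
    return rng.standard_normal(D_CLIP)

def clip_text_encoder(text: str) -> np.ndarray:
    """Stand-in for CLIP's text encoder: string -> 512-d feature vector."""
    return rng.standard_normal(D_CLIP)

def blip_caption(image: np.ndarray) -> str:
    """Stand-in for BLIP captioning: image -> descriptive attribute text."""
    return "a person standing in the rain"  # hypothetical caption

def extract_tri_modal(image: np.ndarray, text: str):
    """Claim 2: three feature vectors -- image, original text, attribute text."""
    f_img = clip_image_encoder(image)
    f_txt = clip_text_encoder(text)
    f_attr = clip_text_encoder(blip_caption(image))  # caption re-encoded by CLIP text encoder
    return f_img, f_txt, f_attr

f_img, f_txt, f_attr = extract_tri_modal(np.zeros((224, 224, 3)), "what a lovely day")
print(f_img.shape, f_txt.shape, f_attr.shape)
```

The essential point is that all three modalities end up as fixed-width vectors produced by pre-trained encoders, with the image attribute text acting as a bridge between vision and language.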
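The cross-modal alignment module of claim 3 is a two-layer feedforward network per modality. The sketch below assumes a 512-d CLIP input, a 256-d shared space, and a ReLU activation; none of these values are stated in the claims.

```python
import numpy as np

rng = np.random.default_rng(1)

class AlignmentModule:
    """Claim 3: two-layer feedforward projection, CLIP dim -> shared dim (sizes assumed)."""
    def __init__(self, d_in: int = 512, d_hidden: int = 256, d_shared: int = 256):
        self.W1 = rng.standard_normal((d_in, d_hidden)) * 0.02
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.standard_normal((d_hidden, d_shared)) * 0.02
        self.b2 = np.zeros(d_shared)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        h = np.maximum(x @ self.W1 + self.b1, 0.0)  # ReLU (assumed activation)
        return h @ self.W2 + self.b2

# three independent modules, one per modality, as the claim specifies
align_img, align_txt, align_attr = AlignmentModule(), AlignmentModule(), AlignmentModule()

batch = rng.standard_normal((4, 512))  # a batch of CLIP features for one modality
z_img, z_txt, z_attr = align_img(batch), align_txt(batch), align_attr(batch)
print(z_img.shape)
```

Keeping the three projections independent lets each modality learn its own mapping into the shared space while the dimensions are forced to agree, which is what makes the subsequent stacking and joint attention possible.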
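The fusion step of claims 4 and 5 stacks the three aligned features into a tensor, runs multi-head self-attention over the modality axis, applies a residual connection with layer normalization, and averages the result. The head count and dimensions below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
B, M, D, H = 4, 3, 256, 4  # batch, modalities, shared dim, attention heads (H assumed)

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo):
    """Claim 4: X is (B, M, D); each modality token attends to all three modalities."""
    B, M, D = X.shape
    d_h = D // H
    Q = (X @ Wq).reshape(B, M, H, d_h).transpose(0, 2, 1, 3)
    K = (X @ Wk).reshape(B, M, H, d_h).transpose(0, 2, 1, 3)
    V = (X @ Wv).reshape(B, M, H, d_h).transpose(0, 2, 1, 3)
    A = softmax(Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_h))  # (B, H, M, M) attention weights
    out = (A @ V).transpose(0, 2, 1, 3).reshape(B, M, D)     # concatenate the heads
    return out @ Wo                                           # final linear projection layer

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) * 0.02 for _ in range(4))
X = np.stack([rng.standard_normal((B, D)) for _ in range(M)], axis=1)  # stacked tri-modal features

# Claim 5: residual connection + layer norm, then average over the modality axis
fused_tokens = layer_norm(X + multi_head_self_attention(X, Wq, Wk, Wv, Wo))
fusion = fused_tokens.mean(axis=1)  # final fusion representation, (B, D)
print(fusion.shape)
```

Because each modality acts as query, key and value simultaneously, the attention matrix is 3x3 per head, so every pairwise cross-modal interaction (image-text, image-attribute, text-attribute) is weighted dynamically per sample.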
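The multi-task loss of claims 6 through 9 combines a main cross-entropy loss, per-modality auxiliary (mutual learning) losses, a modality-discriminator loss, and a pairwise cosine consistency loss. The sketch below uses random logits in place of real model outputs, and the loss weights are illustrative: the claims only state that the main loss carries the highest weight. A full adversarial setup would also need a gradient-reversal layer, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(3)

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Standard cross-entropy over a batch of logits and integer labels."""
    p = softmax(logits)
    return float(-np.log(p[np.arange(len(labels)), labels] + 1e-12).mean())

def cosine_consistency(za: np.ndarray, zb: np.ndarray) -> float:
    """Claim 9: 1 - cos similarity for same-sample pairs (direction, not scale)."""
    cos = (za * zb).sum(-1) / (np.linalg.norm(za, axis=-1) * np.linalg.norm(zb, axis=-1) + 1e-12)
    return float((1.0 - cos).mean())

B, D = 4, 256
y = rng.integers(0, 2, B)                                   # sarcasm / non-sarcasm labels
z = {m: rng.standard_normal((B, D)) for m in ("img", "txt", "attr")}  # aligned features

main_logits = rng.standard_normal((B, 2))                   # fusion classifier output
aux_logits = {m: rng.standard_normal((B, 2)) for m in z}    # claim 7: auxiliary heads
disc_logits = {m: rng.standard_normal((B, 3)) for m in z}   # claim 8: 3-way modality discriminator
mod_id = {"img": 0, "txt": 1, "attr": 2}

L_main = cross_entropy(main_logits, y)                      # claim 6: main classification loss
L_mutual = np.mean([cross_entropy(aux_logits[m], y) for m in z])
L_adv = np.mean([cross_entropy(disc_logits[m], np.full(B, mod_id[m])) for m in z])
L_cons = np.mean([cosine_consistency(z["img"], z["txt"]),
                  cosine_consistency(z["img"], z["attr"]),
                  cosine_consistency(z["txt"], z["attr"])])

# weights are hypothetical; the patent specifies only that L_main is weighted highest
total = 1.0 * L_main + 0.3 * L_mutual + 0.1 * L_adv + 0.2 * L_cons
print(round(float(total), 4))
```

Each auxiliary term supervises a different property of the aligned features: the mutual learning heads keep each modality individually predictive, the discriminator pushes the shared space toward modality invariance, and the cosine term pulls the three views of one sample together in direction.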
Description
Social media irony information detection method and system based on tri-modal fusion
Technical Field
The invention belongs to the technical field of ironic information detection, and particularly relates to a social media ironic information detection method and system based on tri-modal fusion.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In today's highly interconnected digital society, social media platforms have become the core hub for information dissemination, emotional expression, and the exchange of viewpoints. The massive volume of user-generated content, especially multi-modal posts that combine images and text, has greatly enriched the forms of human communication. However, this rich expression also presents a unique challenge for the accurate understanding of irony (sarcasm). Irony is a ubiquitous rhetorical practice whose core feature is that the words do not mean what they literally say, i.e., the literal meaning is opposite to the actual intent; it is commonly used to express cynicism, humor, criticism or contempt. This misalignment between utterance and real intent makes automatic detection of ironic information a long-standing and extremely challenging problem in the fields of Natural Language Processing (NLP) and artificial intelligence. Failure to accurately identify irony has profound effects on numerous downstream applications. For example, in sentiment analysis tasks, if a literally positive but sarcastic statement is misjudged as positive sentiment, the user's true emotional profile will be distorted; in the field of recommendation systems or human-computer interaction, undetected sarcasm may cause mismatched recommended content, embarrassment, or even conflict in the interaction.
Therefore, developing an intelligent system capable of accurately and robustly understanding irony has important academic research value and huge practical application potential. Traditional ironic information detection studies focus mainly on the text modality, capturing ironic cues by analyzing linguistic features, sentiment dictionaries and syntactic structures, or by using advanced pre-trained language models. These approaches exhibit strong capabilities in plain-text scenarios and can recognize long-distance dependencies between words and contextual semantics. However, human irony understanding goes far beyond the language level. The true meaning of a large amount of ironic expression is often deeply embedded in non-verbal information, such as facial expressions, intonation, or the image context common on social media. For example, if a simple "Awesome!" is paired with a picture of a disaster site, the irony becomes apparent. The contrast or complementarity between text and images is a common form of ironic expression, and it is precisely the critical information that plain-text models cannot capture. This clearly shows that irony detection has exceeded the scope of a single modality and is becoming an increasingly typical multi-modal understanding task. Introducing image information into irony detection significantly deepens the model's understanding, but at the same time brings the complex challenges inherent in multi-modal learning, particularly in the following aspects. Modal heterogeneity and representation gap: text data exists as discrete symbol sequences (words), while image data consists of continuous pixel matrices; the two differ fundamentally in data structure, feature dimensions and semantic expression logic.
How to effectively extract high-quality features that can be semantically aligned with each other from these two distinct modalities, and how to construct a unified representation space, is the primary and fundamental problem of multi-modal fusion. Simple feature concatenation often fails to capture deep correlations between modalities and may even introduce noise. Cross-modal interaction: ironic intent often does not reside solely in the independent information of text or image, but is expressed through complex interactions such as contrast, complementarity, reinforcement or contradiction between modalities. For example, the text may express something positive while the image implies something negative, or text and image together may point to the object of the cynicism. How to design a refined and efficient fusion mechanism, so that the model can dynamically capture these deep, nonlinear cross-modal interactions and integrate them into a coherent and discriminative multi-modal representation, is the key factor determining the performance of a fusion model. Context dependency and semantic ambiguity: irony is a highly context-dependent language phenomenon. Even with multi-modal information, it may be difficult to accurately judge its true intent in the absence of sufficient external background knowledge or insufficient understanding of the