Search

CN-122024128-A - False news video depolarization interpretation generation method and system

CN122024128A

Abstract

The invention relates to the technical field of computer vision and natural language processing, in particular to a false news video depolarization interpretation generation method and system. The method comprises the steps of: obtaining a training set containing false news videos and headline texts; extracting semantic keywords and classifying them into semantically mutually exclusive categories; extracting visual features of candidate regions of the training samples through a Faster R-CNN; constructing a video object hybrid dictionary; obtaining category probabilities through a pre-trained classifier; calculating candidate-region intervention likelihoods from the classifier weight matrix and the hybrid dictionary, thereby determining the IVOD visual features of the training samples; obtaining context-aware representation matrices based on the IVOD features, the false news videos, and the headline texts; generating natural language interpretations; and training the model by comparing the generated interpretations with the real interpretations, so that natural language interpretations of actual false news videos can be obtained with the trained model. The invention can eliminate confounding interference in both the visual and the interpretation aspects and improve the accuracy of false news identification.

Inventors

  • CHEN LIZHI
  • QIAN ZHONG

Assignees

  • Soochow University (苏州大学)

Dates

Publication Date
2026-05-12
Application Date
2025-12-30

Claims (10)

  1. A false news video depolarization interpretation generation method, characterized by comprising the following steps: obtaining a false news video training set, wherein each training sample in the training set comprises a false news video and a false news headline text; extracting the semantic keywords of all training samples in the false news video training set, and classifying the semantic keywords into a plurality of semantically mutually exclusive categories; inputting the video frame sequence of each training sample into the Faster R-CNN of a false news video depolarization interpretation generation model, and extracting the visual features of each candidate region in each training sample; constructing a video object hybrid dictionary based on the visual features of the candidate regions in all training samples under each category; based on the video object hybrid dictionary, the weight matrix of the classifier, the visual features of each candidate region in each training sample, and the probability that the visual features of each candidate region belong to each category, acquiring the intervention likelihood of each candidate region in each training sample, and determining the IVOD visual features of each candidate region in each training sample; based on the IVOD visual features, the false news video, and the false news headline text of each candidate region in each training sample, obtaining a context-aware representation matrix of each candidate region in each training sample; generating a natural language interpretation of each training sample based on the context-aware representation matrix of each candidate region in each training sample; and training the false news video depolarization interpretation generation model based on the natural language interpretation and the real interpretation of each training sample, and acquiring the natural language interpretation of an actual false news video by using the trained false news video depolarization interpretation generation model.
  2. The false news video depolarization interpretation generation method according to claim 1, wherein constructing the video object hybrid dictionary based on the visual features of the candidate regions in all training samples under each category comprises: for each category, taking the average of the visual features of the candidate regions in all training samples under that category as the standard visual feature representation vector of the category in the video object hybrid dictionary; and combining the standard visual feature representation vectors of all categories to obtain the video object hybrid dictionary.
  3. The false news video depolarization interpretation generation method according to claim 2, wherein the formula for obtaining the intervention likelihood of each candidate region in each training sample based on the video object hybrid dictionary, the weight matrix of the classifier, the visual features of each candidate region in each training sample, and the probability that each candidate region belongs to each category is: P(y|do(x)) = Softmax(W(x + Σ_k p_k·d_k)), wherein P(y|do(x)) is the intervention likelihood of the current candidate region, Softmax(·) is the Softmax classifier, W is the weight matrix of the classifier, x is the visual feature of the current candidate region, p_k is the probability that the visual feature of the current candidate region belongs to the k-th category, d_k is the standard visual feature representation vector of the k-th category in the video object hybrid dictionary, and k is the category index.
  4. The false news video depolarization interpretation generation method according to claim 1, wherein generating a natural language interpretation of each training sample based on the context-aware representation matrix of each candidate region in each training sample comprises: constructing an initial interpretation aspect space based on the different interpretation aspects of false news; labeling the interpretation aspect of each candidate region in each training sample in the training set by using a large language model, and generating a corresponding basis text; encoding all basis texts into text feature vectors, and building an interpretation aspect hybrid dictionary through a learnable linear projection layer; passing the context-aware representation matrix of each candidate region in each training sample through a classifier to obtain the probability that the context-aware representation matrix of each candidate region belongs to each interpretation aspect in the interpretation aspect hybrid dictionary; acquiring the intervention vocabulary prediction probability of each candidate region in each training sample based on the interpretation aspect hybrid dictionary, the weight matrix of the projection layer in a pre-trained Transformer decoder, the context-aware representation matrix of each candidate region in each training sample, and the probabilities of belonging to each interpretation aspect; determining a depolarized multimodal representation of each candidate region in each training sample based on the intervention vocabulary prediction probability of each candidate region in each training sample; and passing the depolarized multimodal representation of each candidate region in each training sample through the pre-trained Transformer decoder to obtain the natural language interpretation of each training sample.
  5. The false news video depolarization interpretation generation method according to claim 4, wherein the formula for obtaining the intervention vocabulary prediction probability of each candidate region in each training sample based on the interpretation aspect hybrid dictionary, the weight matrix of the projection layer in the pre-trained Transformer decoder, the context-aware representation matrix of each candidate region in each training sample, and the probability that the context-aware representation matrix belongs to each interpretation aspect in the interpretation aspect hybrid dictionary is: P(w|do(h)) = Softmax(W_p(h + Σ_j q_j·e_j)), wherein P(w|do(h)) is the intervention vocabulary prediction probability of the current candidate region, Softmax(·) is the Softmax classifier, W_p is the weight matrix of the projection layer in the pre-trained Transformer decoder, h is the context-aware representation matrix of the current candidate region, q_j is the probability that the context-aware representation matrix of the current candidate region belongs to the j-th interpretation aspect, e_j is the feature representation of the j-th interpretation aspect in the interpretation aspect hybrid dictionary, and j is the interpretation aspect index.
  6. The false news video depolarization interpretation generation method according to claim 1, wherein extracting the semantic keywords of the false news video training set comprises the following steps: for each training sample in the false news video training set, converting the audio stream in the false news video into a text transcript; and extracting the semantic keywords of the text transcript of each training sample in the false news video training set by using a large language model.
  7. The false news video depolarization interpretation generation method according to claim 1, wherein extracting the semantic keywords of all training samples in the false news video training set and classifying the semantic keywords into a plurality of semantically mutually exclusive categories comprises: extracting the semantic keywords of all training samples in the false news video training set, and classifying the semantic keywords into a plurality of semantically mutually exclusive categories by using a clustering algorithm.
  8. The false news video depolarization interpretation generation method according to claim 1, wherein obtaining the probability that the visual features of the candidate regions in each training sample belong to each category comprises: acquiring the probability that the visual features of the candidate regions in each training sample belong to each category by using a pre-trained classifier.
  9. The false news video depolarization interpretation generation method according to claim 1, wherein obtaining the context-aware representation matrix of each candidate region in each training sample based on the IVOD visual features, the false news video, and the false news headline text of each candidate region in each training sample comprises: converting the audio stream in the false news video of each training sample into a text transcript, and respectively segmenting and embedding the text transcript and the false news headline text to obtain audio text features and headline text features; and splicing the IVOD visual features, the audio text features, and the headline text features of each candidate region in each training sample, and obtaining the context-aware representation matrix of each training sample through a Transformer encoder.
  10. A false news video depolarization interpretation generation system, comprising: a data acquisition module, configured to acquire a false news video training set, wherein each training sample in the training set comprises a false news video and a false news headline text; a category acquisition module, configured to extract the semantic keywords of all training samples in the false news video training set and classify the semantic keywords into a plurality of semantically mutually exclusive categories; a visual feature extraction module, configured to extract the visual features of the candidate regions in each training sample by inputting the video frame sequence of each training sample into the Faster R-CNN of a false news video depolarization interpretation generation model; a video object hybrid dictionary construction module, configured to construct a video object hybrid dictionary based on the visual features of the candidate regions in all training samples under each category; an IVOD visual feature acquisition module, configured to acquire the intervention likelihood of each candidate region in each training sample based on the video object hybrid dictionary, the weight matrix of the classifier, the visual features of each candidate region in each training sample, and the probabilities that the visual features belong to each category, and to determine the IVOD visual features of each candidate region in each training sample; a context-aware representation matrix acquisition module, configured to acquire the context-aware representation matrix of each candidate region in each training sample based on the IVOD visual features, the false news video, and the false news headline text of each candidate region in each training sample; a natural language interpretation generation module, configured to generate the natural language interpretation of each training sample based on the context-aware representation matrix of each candidate region in each training sample; and a generation module, configured to train the false news video depolarization interpretation generation model based on the natural language interpretation and the real interpretation of each training sample, and to acquire the natural language interpretation of an actual false news video by using the trained false news video depolarization interpretation generation model.
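The dictionary construction of claim 2 and the intervention likelihood of claim 3 can be sketched in a few lines of NumPy. This is a hedged illustration, not the patented implementation: the function names, the toy dimensions, and the exact form Softmax(W(x + Σ_k p_k·d_k)) of the adjustment are assumptions inferred from the claim language.

```python
import numpy as np

def build_hybrid_dictionary(features_by_category):
    """Claim 2 sketch: one standard vector per category, taken as the
    mean of that category's candidate-region visual features."""
    return np.stack([np.mean(feats, axis=0)
                     for feats in features_by_category])

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def intervention_likelihood(W, x, p, D):
    """Claim 3 sketch (assumed form): Softmax(W (x + sum_k p_k d_k)),
    where p[k] is the probability that x belongs to category k and
    D[k] is the k-th standard vector of the hybrid dictionary."""
    adjusted = x + p @ D      # feature plus probability-weighted dictionary mix
    return softmax(W @ adjusted)

# Toy example: 3 categories, 4-dimensional visual features.
rng = np.random.default_rng(0)
features_by_category = [rng.normal(size=(5, 4)) for _ in range(3)]
D = build_hybrid_dictionary(features_by_category)   # shape (3, 4)
W = rng.normal(size=(3, 4))                          # classifier weights
x = rng.normal(size=4)                               # one candidate region
p = softmax(W @ x)                                   # category probabilities
lik = intervention_likelihood(W, x, p, D)
assert lik.shape == (3,) and np.isclose(lik.sum(), 1.0)
```

Claim 5 reuses the same adjustment pattern with the decoder projection weights, the context-aware representation matrix, and the interpretation aspect hybrid dictionary in place of W, x, and D.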
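Claim 7's grouping of semantic keywords into semantically mutually exclusive categories can be illustrated with any clustering algorithm over keyword embeddings; the claim does not name one, so the minimal k-means below, and the toy 2-D "embeddings", are purely illustrative assumptions.

```python
import numpy as np

def cluster_keywords(X, k=2, iters=20):
    """Minimal k-means over keyword embedding rows of X. Categories are
    mutually exclusive by construction: each keyword gets one label."""
    # Deterministic farthest-point initialization.
    centers = [X[0]]
    while len(centers) < k:
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy keyword "embeddings": two well-separated groups in 2-D.
X = np.array([[0.0, 0.0], [0.1, -0.1], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels = cluster_keywords(X, k=2)
# Keywords 0-2 land in one category, keywords 3-5 in the other.
assert len(set(labels[:3])) == 1 and len(set(labels[3:])) == 1
assert labels[0] != labels[3]
```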

Description

False news video depolarization interpretation generation method and system

Technical Field

The invention relates to the technical field of computer vision and natural language processing, in particular to a false news video depolarization interpretation generation method and system.

Background

In recent years, with the popularization of social media platforms, the propagation of false news videos has become a serious problem affecting social public opinion, public safety, and personal trust. To address this challenge, researchers have gradually moved from traditional binary classification of false content towards interpretable false news analysis, which aims to generate human-understandable reasons that elucidate why content is judged to be false. In this context, the false news video explanation (FNVE) task has arisen, which generates natural language interpretations by analyzing inconsistencies (e.g., logical fallacies, visual-text mismatches) between video content and related textual statements. To realize accurate false news video interpretation, existing methods mostly rely on pre-trained multi-modal models (such as Transformer-based architectures or multi-modal large language models, MLLMs), and generate interpretations by extracting multi-modal information from the video, such as visual object features, transcribed audio text, and news headlines, and then fusing and decoding this information. The extraction of visual object features is a key link, and conventional methods mostly introduce Faster R-CNN, a classical two-stage object detector, as the core visual feature extraction tool.
The core advantage of selecting Faster R-CNN lies in its strong regional object localization and feature characterization capability: the model generates a global feature map through a backbone convolutional network, efficiently generates candidate object regions by means of a Region Proposal Network (RPN), extracts region feature vectors of fixed dimension through RoI pooling, and finally realizes accurate regression and classification of object bounding boxes. It can therefore provide fine-grained, highly discriminative visual object features for the FNVE task and lays a foundation for multi-modal fusion and false evidence mining. Although existing multi-modal methods based on Faster R-CNN have remarkable advantages in visual feature extraction, there are key defects in the model learning process, so that semantic deviation exists in the extracted visual features, which ultimately affects the accuracy of interpretation generation. From the perspective of the learning mechanism, the classifier training of Faster R-CNN relies on the statistical correlations of the observed data, with the core learning goal being to fit the conditional probability P(Y|X) (where X is the input visual feature and Y is the object class label). However, in actual visual scenes there are a large number of confounding factors Z (i.e., visual context, such as background objects and environmental elements) that affect both the generation of the visual features X and the determination of the category labels Y.
For example, a "wire" often co-occurs with an "electrician" (Y) as a background element (Z), directly affecting the generation of the visual feature (X) of the "electrician". This causes the model to learn spurious correlations through the backdoor path X ← Z → Y, i.e., to mistakenly bind the visual feature of the "wire" strongly to the "electrician" category, rather than learning the true causal relationship P(Y|do(X)) between the core features of the "electrician" itself and the category. This learning deviation directly causes semantic impurity of the visual features and produces an object confounding effect, since the model easily learns entangled semantics among co-occurring objects. The visual features of the foreground object become highly coupled with the background context features, the model easily and erroneously relates background elements to foreground semantics to form a confounding effect at the object level, the capability of recognizing true semantic concepts is weakened, and finally the generated interpretation deviates from the core false evidence, resulting in a serious bias problem.

Disclosure of Invention

Therefore, the technical problem to be solved by the invention is to overcome the defect that existing multi-modal false news video interpretation methods based on Faster R-CNN, owing to defects of their learning mechanism, cause visual feature semantic deviation and produce an object confounding effect that degrades interpretation accuracy. In order to solve this technical problem, the invention provides a false news video depolarization interpretation generation method, which comprises the following steps: obtaining a false news video training set, wherein each training sample in the training set comprises a fal