CN-117036885-B - Cross-mode video positive energy evaluation method, system and computer storage medium for fusing video picture and scene text information
Abstract
A cross-modal video positive energy evaluation method, system, and computer storage medium fusing video picture and scene text information relate to the field of video sensitive content analysis. The method solves the problem that existing regression-based methods only consider visual features in the video and ignore scene text information in the video. The method comprises the following steps: acquiring a video clip; performing feature extraction on the video clip with a pre-trained R3D model to obtain a plurality of feature vectors; carrying out a global average pooling operation on the obtained feature vectors and obtaining video picture features through a fully connected layer; extracting scene text from the video frames and de-duplicating it, namely deleting repeated scene text or sentences; using a text encoder to encode the extracted scene text, and carrying out a mean pooling operation on the token embeddings output by the BERT component to obtain a feature vector for each sentence; inputting the obtained video picture features and scene text features into a feature fusion module at the same time, jointly encoding the two modalities with a visual encoder and a scene text encoder and aggregating cross-modal fusion tokens; and taking the output of the feature fusion module as the input of an MLP module, which produces the positive energy score of the video. The method is also applicable to the extraction of video picture information and scene text information.
Inventors
- Liu Shaohui
- Mi Yachun
- Shu Yan
- Jiang Feng
Assignees
- People.cn Co., Ltd. (人民网股份有限公司)
- Harbin Institute of Technology (哈尔滨工业大学)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2023-08-14
Claims (6)
- 1. A cross-modal video positive energy evaluation method fusing video picture and scene text information, characterized by comprising the following steps: S1, acquiring a video clip, the number of frames of the video clip being a preset number of frames; S2, performing feature extraction on the video clip by using a pre-trained R3D model to obtain a plurality of feature vectors; S3, carrying out a global average pooling (GAP) operation on the plurality of feature vectors obtained in S2, and obtaining video picture features through a fully connected (FC) layer; S4, extracting scene text in the video frames of S3 by using an OCR tool, and performing de-duplication on the extracted scene text by setting a frame-interval value, namely deleting repeated scene text or sentences; S5, using a sub-network of a pre-trained SBERT model as a text encoder, wherein the text encoder is used for encoding the scene text extracted in S4, and carrying out a mean pooling operation on the token embeddings output by the BERT component so as to obtain a feature vector for each sentence; S6, inputting the video picture features obtained in S3 and the scene text features obtained in S5 into a feature fusion module at the same time, and jointly encoding the two modalities using a visual encoder and a scene text encoder, respectively, together with aggregated cross-modal fusion tokens; S7, taking the output of the feature fusion module in S6 as the input of an MLP module, and obtaining the positive energy score of the video through the processing of the MLP module; in S4, the frame-interval value is 10; the feature fusion module in S7 comprises two Transformer Encoder structures; in S6, the input image tokens of the visual encoder and the input scene text tokens of the scene text encoder at each aggregation layer are jointly encoded together with the input fusion tokens shared by the visual and scene text branches; the workflow of the visual Transformer layer in the aggregation stage is updated so that the output features of the multi-head attention part are combined with the input features to obtain an intermediate feature Y, from which the output image features corresponding to the fusion tokens are produced; the workflow of the scene text Transformer layer in the aggregation stage is updated analogously, producing the next-layer output features corresponding to the scene text tokens and the output scene text features corresponding to the fusion tokens.
- 2. The cross-modal video positive energy evaluation method fusing video picture and scene text information according to claim 1, wherein the pre-trained R3D model in S2 is an R3D model pre-trained on the Kinetics dataset.
- 3. The cross-modal video positive energy evaluation method fusing video picture and scene text information according to claim 1, wherein the SBERT model in S5 comprises two BERT models, and the two BERT models form a twin (Siamese) network.
- 4. The cross-modal video positive energy evaluation method fusing video picture and scene text information according to claim 1, wherein the number of layers of both Transformer Encoder structures is set to 24.
- 5. A cross-modal video positive energy evaluation system fusing video picture and scene text information, the system being implemented based on the method of claim 1 and comprising: a video feature extraction module, used for extracting features of the video clip by using a pre-trained R3D model, carrying out a global average pooling (GAP) operation on the obtained feature vectors, and obtaining video picture features through a fully connected (FC) layer; a text feature extraction module, used for extracting scene text in the video frames by using an OCR tool, and performing de-duplication on the extracted scene text according to a frame-interval value, namely deleting repeated scene text or sentences; and a feature fusion module, used for receiving the obtained video picture features and scene text features at the same time, jointly encoding the two modalities using a visual encoder and a scene text encoder, respectively, together with aggregated cross-modal fusion tokens, taking the output of the feature fusion module as the input of an MLP module, and obtaining the positive energy score of the video through the processing of the MLP module.
- 6. A computer readable storage medium, characterized in that it stores a computer program configured to implement the steps of the cross-modal video positive energy evaluation method fusing video picture and scene text information according to any one of claims 1-4 when invoked by a processor.
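The front end described in claims 1 and 5 (steps S1-S5) maps a video clip to video picture features and sentence-level scene text features. The sketch below is a minimal illustration of those steps, assuming off-the-shelf components that the claims do not name: torchvision's r3d_18 as the pre-trained R3D backbone, easyocr as the OCR tool, and a sentence-transformers model as the SBERT text encoder. All module names, dimensions, and defaults are illustrative assumptions, not the patented implementation.

```python
# Hedged sketch of S1-S5 of the claimed method, using assumed off-the-shelf parts.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights
import easyocr
from sentence_transformers import SentenceTransformer

class VideoFeatureExtractor(nn.Module):
    """S1-S3: R3D feature extraction, global average pooling (GAP), and an FC layer."""
    def __init__(self, out_dim=768):
        super().__init__()
        backbone = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)  # pre-trained on Kinetics
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.gap = nn.AdaptiveAvgPool3d(1)          # global average pooling
        self.fc = nn.Linear(512, out_dim)           # fully connected projection

    def forward(self, clip):                        # clip: (B, 3, T, H, W)
        feats = self.backbone(clip)                 # (B, 512, t, h, w)
        pooled = self.gap(feats).flatten(1)         # (B, 512)
        return self.fc(pooled)                      # video picture features

def extract_scene_text(frames, frame_interval=10, langs=("ch_sim", "en")):
    """S4: OCR on frames sampled at a fixed frame interval, dropping repeated sentences."""
    reader = easyocr.Reader(list(langs))
    seen, sentences = set(), []
    for idx in range(0, len(frames), frame_interval):   # frames: list of HxWx3 arrays
        for _, text, _ in reader.readtext(frames[idx]):
            if text not in seen:                        # de-duplication of scene text
                seen.add(text)
                sentences.append(text)
    return sentences

def encode_scene_text(sentences, model_name="paraphrase-multilingual-MiniLM-L12-v2"):
    """S5: SBERT-style text encoder; mean pooling over token embeddings per sentence."""
    encoder = SentenceTransformer(model_name)           # mean pooling is built in
    return torch.tensor(encoder.encode(sentences))      # (num_sentences, text_dim)
```

In this reading, the GAP and FC of S3 correspond to AdaptiveAvgPool3d and nn.Linear, and the frame-interval de-duplication of S4 simply skips frames and discards repeated OCR strings; the claims themselves do not prescribe these particular libraries.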
Description
Cross-mode video positive energy evaluation method, system and computer storage medium for fusing video picture and scene text information
Technical Field
The present invention relates to the field of video sensitive content analysis.
Background
In recent years, user-generated content (UGC) video has become the primary content form on platforms such as Douyin, Kuaishou, and Xiaohongshu, which attract hundreds of millions of users. With recent advances in video shooting equipment and increasingly affordable prices, the cost of producing short videos has fallen sharply. As a result, a large number of short-video users are both consumers and creators, leading to a dramatic increase in the number of short videos on the network. Short video has thus become a major source of human information. However, the quality of these short videos tends to be uneven, which is mainly reflected in the subjective positive energy problem of the video content. Because different users create videos for different purposes, some videos contain vulgar, negative, or unhealthy content. To address this problem, most current work draws on the video classification task in deep learning. However, since a short video often belongs to multiple categories, evaluating video content solely by means of a classification task has significant limitations. Therefore, a few efforts have proposed regression-based methods, i.e., assigning each video a score based on its positive energy, to more accurately evaluate whether the content carried by a short video conforms to mainstream social values. However, existing regression-based methods only consider visual features in the video and ignore scene text information in the video, even though in many cases text semantic cues in the video play a very important role in the final subjective content quality assessment.
Disclosure of the Invention
The invention aims to solve the problem that the existing regression-based method only considers visual characteristics in the video and ignores scene text information in the video.
In order to achieve the above purpose, the present invention provides the following technical solutions.
Scheme one: a cross-modal video positive energy evaluation method fusing video picture and scene text information, the method comprising the following steps: S1, acquiring a video clip, the number of frames of the video clip being a preset number of frames; S2, performing feature extraction on the video clip by using a pre-trained R3D model to obtain a plurality of feature vectors; S3, carrying out a global average pooling (GAP) operation on the plurality of feature vectors obtained in S2, and obtaining video picture features through a fully connected (FC) layer; S4, extracting scene text in the video frames of S3 by using an OCR tool, and performing de-duplication on the extracted scene text by setting a frame-interval value, namely deleting repeated scene text or sentences; S5, using a sub-network of a pre-trained SBERT model as a text encoder, wherein the text encoder is used for encoding the scene text extracted in S4, and carrying out a mean pooling operation on the token embeddings output by the BERT component so as to obtain a feature vector for each sentence; S6, inputting the video picture features obtained in S3 and the scene text features obtained in S5 into a feature fusion module at the same time, and jointly encoding the two modalities using a visual encoder and a scene text encoder, respectively, together with aggregated cross-modal fusion tokens; and S7, taking the output of the feature fusion module in S6 as the input of an MLP module, and obtaining the positive energy score of the video through the processing of the MLP module.
Further, in a preferred embodiment, the pre-trained R3D model in S2 is an R3D model pre-trained on the Kinetics dataset.
Further, in a preferred embodiment, the frame-interval value in S4 is 10.
Further, in a preferred embodiment, the SBERT model in S5 comprises two BERT models, and the two BERT models form a twin (Siamese) network.
Further, in a preferred embodiment, the feature fusion module in S7 comprises two Transformer Encoder structures.
Further, in a preferred embodiment, the number of layers of both Transformer Encoder structures is set to 24.
Further, in a preferred embodiment, in S6 the input image tokens of the visual encoder and the input scene text tokens of the scene text encoder at each aggregation layer are jointly encoded together with the fusion tokens shared by the visual and scene text branches; the workflow of the visual Transformer layer in the aggregation stage is updated so that the output features of the multi-head attention part are combined with the input features to obtain an intermediate feature Y, from which the output image features corresponding to the fusion tokens are produced.
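The aggregation-stage equations referenced above are not reproduced in this text, so the sketch below only illustrates the general fusion pattern they describe: two Transformer Encoder branches (visual and scene text, 24 layers each per the preferred embodiment) that jointly encode their own tokens together with shared cross-modal fusion tokens, with the fusion tokens from the two branches aggregated after every layer. The averaging rule, the number of fusion tokens, and the MLP head sizes are assumptions for illustration, not the patented formulas.

```python
# Hedged sketch of the feature fusion module and MLP head (S6-S7); all details assumed.
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Two Transformer Encoder branches sharing cross-modal fusion tokens."""
    def __init__(self, dim=768, depth=24, heads=8, num_fusion_tokens=4):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                        batch_first=True)
        self.visual_layers = nn.ModuleList([make_layer() for _ in range(depth)])
        self.text_layers = nn.ModuleList([make_layer() for _ in range(depth)])
        self.fusion_tokens = nn.Parameter(torch.randn(1, num_fusion_tokens, dim))

    def forward(self, img_tokens, text_tokens):
        # img_tokens: (B, N_img, dim); text_tokens: (B, N_txt, dim)
        B = img_tokens.size(0)
        fusion = self.fusion_tokens.expand(B, -1, -1)
        n_f = fusion.size(1)
        for vis_layer, txt_layer in zip(self.visual_layers, self.text_layers):
            # each branch jointly encodes its own tokens plus the shared fusion tokens
            vis_out = vis_layer(torch.cat([img_tokens, fusion], dim=1))
            txt_out = txt_layer(torch.cat([text_tokens, fusion], dim=1))
            img_tokens, fusion_vis = vis_out[:, :-n_f], vis_out[:, -n_f:]
            text_tokens, fusion_txt = txt_out[:, :-n_f], txt_out[:, -n_f:]
            fusion = (fusion_vis + fusion_txt) / 2   # assumed cross-modal aggregation rule
        return fusion.mean(dim=1)                    # (B, dim) fused representation

class PositiveEnergyHead(nn.Module):
    """S7: MLP regression head mapping the fused feature to a positive energy score."""
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, fused):
        return self.mlp(fused).squeeze(-1)           # (B,) positive energy scores
```

As a usage sketch under the same assumptions, the video picture feature from the extractor and the sentence features from the text encoder would be treated as token sequences, e.g. unsqueezed to (B, 1, dim) and (B, num_sentences, dim) with a linear projection to a common dimension where needed, fused by FusionModule, and scored by PositiveEnergyHead.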