CN-122024238-A - Visual text multi-modal action recognition method based on deep learning
Abstract
The application relates to the technical field of action recognition, in particular to a visual-text multi-modal action recognition method based on deep learning, which comprises: obtaining video data and label set data, and processing the video data with a video encoder; screening out the visual tokens to be fused from the label set data; calculating multi-criterion fusion scores among the tokens based on a multi-criterion token fusion strategy; fusing the visual tokens based on a bipartite matching strategy and the multi-criterion fusion scores; encoding the video feature input and interacting the fused visual tokens with the information tokens to obtain the video encoding; performing text enhancement on the information tokens with a text encoder to obtain the text representation; and performing cross-modal alignment between the video encoding and the text representation. Through designs such as the multi-criterion token fusion strategy, the bipartite matching strategy, and several lightweight adapters, the method comprehensively improves the accuracy of action recognition.
Inventors
- LEI JIANJUN
- HU DINGYUAN
- WANG YING
Assignees
- 重庆邮电大学
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-02-05
Claims (9)
- 1. A visual-text multi-modal action recognition method based on deep learning, characterized by comprising the following steps: S1, acquiring video data and label set data, and processing the video data with a video encoder to obtain the video feature input, wherein learnable information tokens and position information are embedded in the video feature input; S2, screening out the visual tokens to be fused from the label set data, and calculating multi-criterion fusion scores among the tokens based on a multi-criterion token fusion strategy, wherein the strategy quantifies the information content of each token and takes it as the main criterion for measuring the token's fusion tendency, with the similarity score and the fusion frequency score as secondary criteria; S3, fusing the visual tokens based on the multi-criterion fusion scores to obtain the fused visual tokens; S4, encoding the video feature input, interacting the fused visual tokens with the information tokens, and integrating the information tokens into the video feature input to obtain the video encoding; S5, performing text enhancement on the information tokens with a text encoder to obtain the text representation, wherein the text encoder consists of several Transformer layers and a text adapter; S6, performing cross-modal alignment between the video encoding and the text representation with a similarity calculation framework to obtain the action recognition result.
- 2. The deep-learning-based visual-text multi-modal action recognition method of claim 1, wherein the multi-criterion token fusion strategy comprises: S21, calculating the average attention score of each token based on the self-attention mechanism; S22, calculating the information quantity score for token fusion based on the average attention score; S23, calculating the similarity scores and fusion frequency scores between tokens, wherein the fusion frequency score is inversely related to the number of tokens already fused; S24, calculating the multi-criterion fusion score between tokens from the information quantity score, the similarity score, and the fusion frequency score, wherein the criterion set comprises the information quantity score, the similarity score, and the fusion frequency score, each criterion score other than the information quantity score is computed between the i-th and j-th tokens and scaled by a temperature coefficient that adjusts the influence of the corresponding criterion, and the results are combined with the information quantity score (a minimal sketch of this scoring follows the claims).
- 3. The deep-learning-based visual-text multi-modal action recognition method of claim 2, wherein the fusion frequency score of a token is a decreasing function of the number of tokens that the token has already fused: the more tokens a token has absorbed, the lower its fusion frequency score.
- 4. The deep-learning-based visual-text multi-modal action recognition method of claim 1, wherein the visual tokens are fused based on the multi-criterion fusion scores, a bipartite matching strategy is adopted during fusion, and one round of fusion is split into two passes, each pass handling half of the fusion targets.
- 5. The deep-learning-based visual-text multi-modal action recognition method of claim 4, wherein the bipartite matching strategy comprises: S31, evenly splitting the token set into a source token set and a target token set according to the parity of the token indices; S32, constructing a decision matrix from the bipartite edge matrix between the source and target sets together with the multi-criterion fusion scores, and optimizing the decision matrix; S33, obtaining the set of tokens to be fused from the decision matrix; S34, performing a weighted average over the tokens to be fused; S35, placing the weighted-average tokens into the screened target token set to obtain the fused source and target token sets, then exchanging the roles of the source and target token sets and repeating S32-S34 to obtain the fused visual tokens (see the bipartite-matching sketch after the claims).
- 6. The deep-learning-based visual-text multi-modal action recognition method of claim 1, wherein encoding the fused visual tokens comprises: S41, encoding the video feature input and letting the fused visual tokens interact with the information tokens through the Transformer layers, wherein an information interaction adapter is arranged after each Transformer layer and consists of a fixed number of Transformer layers; S42, collecting the information tokens of the last round of information interaction, adding an additional class token to the information tokens, and feeding them into a spatio-temporal adapter to obtain a class-token representation, wherein the spatio-temporal adapter consists of the same fixed number of Transformer layers as the information interaction adapter; S43, projecting the processed class-token representation into the video-language space through a linear layer to obtain the final video encoding (see the adapter sketch after the claims).
- 7. The deep-learning-based visual-text multi-modal action recognition method of claim 1, wherein the text enhancement of the information tokens comprises: S51, acquiring the text label of the video and performing context expansion on it with the text adapter to obtain an enhanced semantic representation, wherein the text adapter consists of an expander and a semantic optimizer: for a given text label, the expander expands it into a semantically rich description, the semantic optimizer comprises a convolutional sampling network and a linear layer, and the semantically rich description is processed in turn by downsampling to extract semantic information, upsampling to reinforce it, and GeLU activation followed by the linear layer, yielding the enhanced semantic representation; S52, tokenizing the enhanced semantic representation, projecting it into word embeddings, and constructing the Transformer-layer input from the word embeddings; S53, processing the input with the Transformer layers; S54, taking the output features of the last Transformer layer and linearly projecting them into the video-language space to obtain the final text representation (see the text-adapter sketch after the claims).
- 8. The deep-learning-based visual-text multi-modal action recognition method of claim 1, wherein the video encoding and the text representation are cross-modally aligned using a symmetric similarity calculation framework, in which the probability of a visual-to-text match and the probability of a text-to-visual match are computed from the cosine similarities between the video encodings and the text representations within a training batch, scaled by a temperature coefficient, with the batch size at training time determining the number of candidate matches (see the similarity-loss sketch after the claims).
- 9. The deep-learning-based visual-text multi-modal action recognition method of claim 8, wherein the symmetric similarity calculation framework constructs a video-text contrastive loss function using the Kullback-Leibler divergence between the predicted match probabilities and the true category labels, and realizes cross-modal alignment by maximizing the similarity of matched sample pairs.
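Illustrative sketch for claims 2-3 (multi-criterion fusion score). The claims describe an information quantity score derived from average attention as the main criterion, with similarity and fusion frequency scores as temperature-scaled secondary criteria. The sketch below assumes an additive combination, a "less attention means more willing to fuse" sign convention, and a 1/(1+n) fusion frequency score; these choices are illustrative assumptions, not the patent's exact formulas.

```python
import torch
import torch.nn.functional as F

def multi_criterion_fusion_scores(attn, tokens, fused_counts, tau_sim=1.0, tau_freq=1.0):
    """Sketch of the multi-criterion fusion score of claims 2-3.

    attn:         (N, N) self-attention matrix of the current layer
    tokens:       (N, D) visual token features
    fused_counts: (N,)   number of tokens each token has already absorbed
    """
    # S21: average attention each token receives across all queries
    avg_attn = attn.mean(dim=0)                                   # (N,)
    # S23: cosine similarity between every pair of tokens
    feat = F.normalize(tokens, dim=-1)
    sim = feat @ feat.t()                                         # (N, N)
    # S22: information quantity score; tokens receiving little attention are
    # assumed to carry little information and are more willing to be fused
    info = (1.0 - avg_attn).unsqueeze(0).expand_as(sim)           # (N, N)
    # S23: fusion frequency score, inversely related to the tokens already merged
    freq = (1.0 / (1.0 + fused_counts.float())).unsqueeze(0).expand_as(sim)
    # S24: information quantity as the main criterion, similarity and frequency
    # as secondary criteria scaled by their temperature coefficients
    return info + sim / tau_sim + freq / tau_freq
```

In a real encoder, attn would be the attention weights of the current layer and fused_counts would start at zero for every token.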
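Illustrative sketch for claims 4-5 (bipartite matching fusion). The claims split the tokens by index parity into source and target sets, build a decision matrix from the bipartite edges and the multi-criterion scores, fuse the selected tokens by a weighted average, and then swap the two sets for a second pass. The sketch below shows one pass; for brevity it greedily merges every source token into its highest-scoring target and weights the average by how many tokens each side has already absorbed, whereas the patent handles half of the fusion targets per pass with an optimized decision matrix. The greedy argmax and the count-based weights are assumptions.

```python
import torch

def bipartite_fusion_pass(tokens, counts, scores):
    """One pass of bipartite-matching token fusion (claims 4-5, sketch).

    tokens: (N, D) visual tokens, counts: (N,) tokens already absorbed by each,
    scores: (N, N) multi-criterion fusion scores.
    """
    src_idx = torch.arange(0, tokens.size(0), 2)          # S31: source set = even indices
    tgt_idx = torch.arange(1, tokens.size(0), 2)          # S31: target set = odd indices
    # S32-S33: decision matrix restricted to source-target edges; pick the best
    # target for each source token (greedy argmax stands in for the patent's optimizer)
    decision = scores[src_idx][:, tgt_idx]                 # (|src|, |tgt|)
    best_tgt = decision.argmax(dim=1)                      # (|src|,)
    new_tokens = tokens.clone()
    new_counts = counts.clone()
    for s, t in zip(src_idx.tolist(), tgt_idx[best_tgt].tolist()):
        # S34: weighted average of the matched pair; weights reflect absorbed tokens
        w_s, w_t = counts[s] + 1, counts[t] + 1
        new_tokens[t] = (w_s * tokens[s] + w_t * tokens[t]) / (w_s + w_t)
        new_counts[t] = counts[s] + counts[t] + 1
    # S35: fused tokens now live in the target set; swap source/target and repeat
    return new_tokens, new_counts
```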
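Illustrative sketch for claim 6 (information interaction and spatio-temporal adapters). The claim places a lightweight adapter made of a fixed number of Transformer layers after each encoder layer, and a spatio-temporal adapter over the collected information tokens plus an extra class token, followed by a linear projection. The sketch below uses a small nn.TransformerEncoder stack as a stand-in adapter; the layer count (12 encoder layers, depth-2 adapters) and the dimensions (768 to 512) are assumptions.

```python
import torch
import torch.nn as nn

class LightweightAdapter(nn.Module):
    """A small stack of Transformer layers used as an adapter (claim 6, sketch)."""
    def __init__(self, dim=768, depth=2, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                        # x: (B, N, dim)
        return self.blocks(x)

# S41: one information interaction adapter behind each encoder layer (counts assumed)
info_adapters = nn.ModuleList([LightweightAdapter(depth=2) for _ in range(12)])
# S42: a spatio-temporal adapter over the collected information tokens + class token
st_adapter = LightweightAdapter(depth=2)
cls_token = nn.Parameter(torch.zeros(1, 1, 768))
# S43: linear projection of the class-token representation into the video-language space
to_video_language = nn.Linear(768, 512)
```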
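Illustrative sketch for claim 7 (text adapter). The claim's text adapter expands a class label into a richer description and then refines it with a convolutional sampling network (downsampling, upsampling), GeLU activation, and a linear layer. Below is a minimal sketch; the prompt-template expander, the specific 1-D convolution settings, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class TextAdapter(nn.Module):
    """Sketch of the expander + semantic optimizer of claim 7 (S51)."""
    def __init__(self, dim=512):
        super().__init__()
        self.down = nn.Conv1d(dim, dim // 2, kernel_size=3, padding=1, stride=2)        # downsample
        self.up = nn.ConvTranspose1d(dim // 2, dim, kernel_size=4, padding=1, stride=2)  # upsample
        self.act = nn.GELU()
        self.linear = nn.Linear(dim, dim)

    def expand(self, label: str) -> str:
        # Expander: a simple prompt template standing in for the patent's expander
        return f"a video of a person {label}"

    def forward(self, desc_emb):                 # desc_emb: (B, L, dim) embedded description
        x = desc_emb.transpose(1, 2)             # (B, dim, L) for 1-D convolution
        x = self.up(self.down(x)).transpose(1, 2)
        return self.linear(self.act(x))          # enhanced semantic representation
```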
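Illustrative sketch for claims 8-9 (symmetric similarity and KL-based contrastive loss). The claims describe temperature-scaled cosine similarities over a training batch, match probabilities in both directions, and a Kullback-Leibler contrastive loss against the true category labels. The sketch below assumes the standard softmax-over-batch form and a symmetric average of the two directions; these are consistent with, but not quoted from, the claims.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(video_emb, text_emb, labels, tau=0.07):
    """Sketch of the symmetric similarity framework and KL-based loss (claims 8-9).

    video_emb: (B, D) video encodings, text_emb: (B, D) text representations,
    labels: (B, B) ground-truth match distribution (rows sum to 1), tau: temperature.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.t() / tau                          # cosine similarities scaled by temperature
    p_v2t = sim.softmax(dim=1)                     # probability of a visual-to-text match
    p_t2v = sim.t().softmax(dim=1)                 # probability of a text-to-visual match
    # KL divergence between the true label distribution and the predicted probabilities
    loss_v2t = F.kl_div(p_v2t.log(), labels, reduction="batchmean")
    loss_t2v = F.kl_div(p_t2v.log(), labels.t(), reduction="batchmean")
    return (loss_v2t + loss_t2v) / 2
```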
Description
Visual text multi-modal action recognition method based on deep learning

Technical Field

The application relates to the technical field of action recognition, in particular to a visual-text multi-modal action recognition method based on deep learning.

Background

Video action recognition is one of the key technologies in the fields of human-computer interaction and intelligent perception, and mainstream action recognition methods in the prior art are usually based on convolutional neural networks or Transformer networks. Both solutions, however, have their own shortcomings. For mainstream action recognition systems based on convolutional neural networks, as shown in fig. 1, optical flow computation in the convolutional network is expensive and depends on predefined motion representations; three-dimensional convolution must operate over the temporal and spatial dimensions simultaneously, so the number of parameters and the amount of computation grow rapidly; and the fixed size of the convolution kernel makes long-range temporal dependencies difficult to model. For mainstream action recognition systems based on Transformer networks, as shown in fig. 2, the self-attention mechanism at the core of the Transformer has quadratic computational complexity in both time and space, which makes long videos difficult to process and makes training more dependent on large-scale data; moreover, some tokens (such as background regions) may contribute little to action recognition, yet standard self-attention computes all token relations uniformly, wasting computation. In addition, jointly modeling spatio-temporal attention is computationally costly, while modeling time and space separately may lose spatio-temporal correlation. There is therefore a need for an action recognition method that comprehensively solves the above problems.

Disclosure of Invention

In view of the above, the present application discloses a visual-text multi-modal action recognition method based on deep learning to solve the problems in the prior art, comprising: S1, acquiring video data and label set data, and processing the video data with a video encoder to obtain the video feature input, wherein learnable information tokens and position information are embedded in the video feature input; S2, screening out the visual tokens to be fused from the label set data, and calculating multi-criterion fusion scores among the tokens based on a multi-criterion token fusion strategy, wherein the strategy quantifies the information content of each token as the main criterion for measuring its fusion tendency, with the similarity score and the fusion frequency score as secondary criteria; S3, fusing the visual tokens based on the multi-criterion fusion scores to obtain the fused visual tokens; S4, encoding the video feature input, interacting the fused visual tokens with the information tokens, and integrating the information tokens into the video feature input to obtain the video encoding; S5, performing text enhancement on the information tokens with a text encoder, which consists of several Transformer layers and a text adapter, to obtain the text representation; S6, performing cross-modal alignment between the video encoding and the text representation with a similarity calculation framework to obtain the action recognition result. A minimal end-to-end sketch of this pipeline is given below.
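Pipeline sketch for S1-S6. The code below only illustrates the data flow, assuming hypothetical video_encoder and text_encoder modules (the fusion, adapter, and loss routines are sketched after the claims); it is not the patent's actual implementation.

```python
import torch
import torch.nn.functional as F

def recognize_action(video, label_texts, video_encoder, text_encoder, tau=0.07):
    """End-to-end data flow of the method (S1-S6, sketch with hypothetical modules).

    video_encoder embeds the video, fuses visual tokens with the multi-criterion
    strategy, and returns the video encoding built from the information tokens;
    text_encoder expands each label with its text adapter and returns text representations.
    """
    # S1-S4: video feature input -> token fusion -> information-token interaction -> video encoding
    video_code = video_encoder(video)                       # (1, D)
    # S5: text enhancement of the class labels -> text representations
    text_reprs = text_encoder(label_texts)                  # (C, D)
    # S6: cross-modal alignment by temperature-scaled cosine similarity
    v = F.normalize(video_code, dim=-1)
    t = F.normalize(text_reprs, dim=-1)
    probs = (v @ t.t() / tau).softmax(dim=-1)               # (1, C)
    return label_texts[probs.argmax(dim=-1).item()]         # predicted action label
```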
The beneficial effects of the application include the following. The invention discloses a visual-text multi-modal action recognition method based on deep learning, which comprises the design of a feature embedding layer, a visual encoder, and a text encoder, together with three lightweight adapters: an information interaction adapter, a spatio-temporal adapter, and a text adapter. A multi-criterion token fusion strategy is creatively designed in the visual encoding process; the recognition accuracy of the model is improved based on the knowledge distillation principle; multi-modal action features can be fused effectively; the computation of the model in the inference stage is greatly reduced, so that action recognition can be carried out more efficiently; and a new design idea is provided for those skilled in the art. The self-attention mechanism helps to establish long-range temporal dependencies and extract spatio-temporal information more effectively, and a Transformer-based adapter completes the fusion of frame-level features, overcoming the limited effectiveness of simple two-dimensional convolutional fusion strategies while avoiding the heavy computation and difficult deployment of three-dimensional convolution. Compared with models based on Transformer networks, the method exploits the rich semantic information in the label text through visual-text multi-modal contrastive learning, further strengthening the video action representation and enabling actions to be recognized more accurately.