CN-117372927-B - Small sample behavior recognition method based on cross-modal contrastive learning network
Abstract
A small sample behavior recognition method based on a cross-modal contrastive learning network, relating to computer vision technology. A. Given a set of videos, $n_{seg}$ frames are randomly sampled from each video. B. The sampled video frames are input into a spatio-temporal enhancement module to obtain enhanced visual vectors. C. The visual vector generated in step B is input into a semantic generation network to generate a semantic vector. D. The visual vector and the semantic vector are concatenated to form a mixed feature vector, which is input into a nonlinear contrastive projection head to obtain a transformed mixed feature vector. E. A synthetic vector is generated from the visual vector obtained in step B and Gaussian noise, and input into the nonlinear contrastive projection head to obtain a final synthetic vector. F. The transformed mixed feature vectors generated in step D and the synthetic vectors generated in step E are regarded as class prototypes, and the distances between class prototypes are calculated with cosine similarity to obtain the prediction probability. Compared with current mainstream small sample behavior recognition methods, the method achieves improved classification performance.
Inventors
- WANG HANZI
- WANG XIAO
- YAN YAN
Assignees
- Xiamen University
- Shanghai Artificial Intelligence Innovation Center
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-10-23
Claims (5)
- 1. A small sample behavior recognition method based on a cross-modal contrastive learning network, characterized by comprising the following steps: 1) giving a small sample behavior recognition data set, wherein the data set comprises a plurality of videos, each video consists of a plurality of video frames, and the video frames are sampled using a sparse temporal sampling strategy; 2) inputting the sampled video frames into a backbone network to obtain visual features $F_v=\{F_1,F_2,\ldots,F_i,\ldots,F_{n_{seg}}\}$, wherein $F_i$ represents the visual feature of the $i$-th video frame, and inputting the visual features $F_v$ into a spatio-temporal enhancement module to obtain a visual vector; 3) inputting the visual vector generated in step 2) into a semantic generation network to generate a semantic vector; 4) concatenating the visual vector generated in step 2) and the semantic vector generated in step 3) to form a mixed feature vector, and inputting the mixed feature vector into a nonlinear contrastive projection head to obtain a transformed mixed feature vector $z$; the nonlinear contrastive projection head comprises two fully connected layers and a ReLU activation function; to better learn the transformed mixed feature vectors, a cross-modal contrastive learning loss is designed to supervise network training, wherein a support-set sample is a positive sample if it has the same class label as the query sample, and a negative sample if it has a different class label; the cross-modal contrastive learning loss $\mathcal{L}_{con}$ is defined as follows: $\mathcal{L}_{con}=-\log\frac{\exp(\psi(z_q,z^{+})/\tau)}{\exp(\psi(z_q,z^{+})/\tau)+\sum_{z^{-}}\exp(\psi(z_q,z^{-})/\tau)}$, wherein $z_q$ represents the mixed feature vector of the query sample, $z^{+}$ represents the mixed feature vector of a positive sample, $z^{-}$ represents the mixed feature vector of a negative sample, $\psi(\cdot,\cdot)$ represents the cosine similarity measure, and $\tau$ represents a temperature parameter; 5) generating a synthetic vector from the visual vector obtained in step 2) and Gaussian noise, and inputting the synthetic vector into the nonlinear contrastive projection head to obtain a final synthetic vector; 6) regarding the transformed mixed feature vectors generated in step 4) and the synthetic vectors generated in step 5) as class prototypes, and calculating the distances between class prototypes using the cosine similarity measure $\psi(\cdot,\cdot)$; the prediction probability is computed as follows: $p(y=c\mid q)=\frac{\exp(\psi(z_q,p_c))}{\sum_{c'=1}^{C}\exp(\psi(z_q,p_{c'}))}$, wherein $q$ represents a query sample, $p_c$ represents the class prototype of the action class with label $c$, $p_{c'}$ ranges over the class prototypes of the support set, $C$ represents the number of classes in the support set, and $p(y=c\mid q)$ represents the predicted probability that query sample $q$ belongs to class label $c$ (a code sketch of this loss and classification rule follows the claims).
- 2. The small sample behavior recognition method based on a cross-modal contrastive learning network as set forth in claim 1, wherein in step 1), said sampling of video frames comprises: randomly dividing a video $V_a$ into $n_{seg}$ video segments and randomly extracting one frame from each video segment; the randomly extracted $n_{seg}$ video frames form a new video clip $V'_a=\{I_1,I_2,\ldots,I_i,\ldots,I_{n_{seg}}\}$, wherein $I_i$ represents the video frame randomly extracted from the $i$-th video segment (a sketch of this sampling strategy follows the claims).
- 3. The small sample behavior recognition method based on a cross-modal contrastive learning network as claimed in claim 2, wherein in step 2), the specific steps of obtaining the visual vector are as follows: first, three different $1\times 1$ 2D convolution layers are applied to the visual features $F_v$ to obtain query, key, and value features, denoted $Q$, $K$, and $V$, respectively; a weight $W_j$ is obtained from $Q$ and $K$ using an element-wise multiplication operation, and $W_j$ is then used to weight $V$; the process is defined as follows: $W_j=Q\odot K$, $X=W_j\odot V$, wherein $\odot$ represents element-wise multiplication and $X$ represents the weighted value features; to preserve the original visual features, the features $X$ are then combined with $F_v$ through a residual operation to obtain the visual features $X'$; to acquire temporal information, a $3\times 3$ 2D convolution layer processes the visual features $X'$; the process is expressed as $Y(t)=\mathrm{conv}(X'_{t+1})-X'_t,\ 1\le t\le n_{seg}-1$, wherein $Y(t)$ represents the motion feature at time $t$, $\mathrm{conv}(\cdot)$ represents the $3\times 3$ 2D convolution operation, and $X'_{t+1}$ and $X'_t$ represent the visual features at times $t+1$ and $t$; the motion feature at the last time step is set to zero, and the motion features at different times are concatenated to form the final motion feature, defined as $Y=[Y(1),Y(2),\ldots,Y(n_{seg}-1),0]$, wherein $[\cdot]$ represents the concatenation operation; an average pooling operation over the spatial and temporal dimensions then yields the pooled motion features $Y_p$; afterwards, query, key, and value embeddings are obtained from the pooled motion features $Y_p$ using three different fully connected layers, expressed as $\hat{Q}=FC_q(Y_p)$, $\hat{K}=FC_k(Y_p)$, $\hat{V}=FC_v(Y_p)$, wherein $\hat{Q}$, $\hat{K}$, $\hat{V}$ respectively represent the query, key, and value embeddings, and $\hat{V}$ is weighted by the product of $\hat{Q}$ and $\hat{K}$; finally, a residual connection preserves the original temporal context information to obtain the visual vector $f_v$ (a sketch of the motion-feature computation appears at the end of the Description).
- 4. The small sample behavior recognition method based on a cross-modal contrastive learning network as set forth in claim 3, wherein in step 3), the specific steps of generating the semantic vector are as follows: a semantic generation network $G_s$ generates the semantic vector, expressed as $f_s=G_s(f_v)$, wherein $f_s$ represents the generated semantic vector; $G_s$ is a nonlinear neural network comprising two fully connected layers and a ReLU activation function; a semantic generation loss makes the obtained semantic vector closer to the corresponding real class semantic vector, defined as $\mathcal{L}_{sem}=\frac{1}{C}\sum_{c=1}^{C}\left\|f_s^{c}-W(y_c)\right\|_2^{2}$, wherein $y_c$ represents a class label, $W(\cdot)$ represents a Word2Vec network, and $C$ represents the number of action classes (a sketch of $G_s$ and this loss follows the claims).
- 5. The small sample behavior recognition method based on a cross-modal contrastive learning network of claim 4, wherein in step 5), the specific steps of generating the synthetic vector from the visual vector obtained in step 2) and Gaussian noise are as follows: a generator $G$ generates a synthetic vector using the visual vector obtained in step 2) and Gaussian noise $\epsilon$, and a discriminator $D$ distinguishes whether the synthetic vector is real or fake; to better optimize the prototype-generating contrastive network, the generator $G$ and the discriminator $D$ are learned using the objective function $\mathcal{L}_{GAN}$, defined as follows: $\min_{G}\max_{D}\mathcal{L}_{GAN}=\mathbb{E}_{f_v\sim p_r}\left[\log D(f_v)\right]+\mathbb{E}_{\tilde{f}\sim p_g}\left[\log\left(1-D(\tilde{f})\right)\right]$, wherein $f_v$ represents the visual vector, $f_s$ represents the semantic vector, $\tilde{f}=G(f_v,\epsilon)$ represents the synthetic feature, $p_r$ represents the distribution of the training samples, and $p_g$ represents the distribution of the synthetic vectors and semantic vectors; the synthetic vector is input into the nonlinear contrastive projection head to obtain the final synthetic vector (a sketch of this generator/discriminator objective follows the claims).
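The sparse temporal sampling of claim 2 admits a very direct implementation. Below is a minimal sketch, assuming frames are indexed from 0 and that each of the $n_{seg}$ equal segments contributes exactly one random frame; the function name and signature are illustrative, not from the patent.

```python
import random

def sparse_sample(num_frames: int, n_seg: int) -> list[int]:
    """Split a video of num_frames frames into n_seg equal segments and draw
    one random frame index from each, forming the new clip V'_a = {I_1, ..., I_nseg}."""
    bounds = [round(i * num_frames / n_seg) for i in range(n_seg + 1)]
    # Guard against degenerate (empty) segments in very short videos.
    return [random.randrange(bounds[i], max(bounds[i] + 1, bounds[i + 1]))
            for i in range(n_seg)]

# Example: sample n_seg = 8 frame indices from a 120-frame video.
print(sparse_sample(120, 8))
```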
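The contrastive loss and prototype classification of claim 1 can be sketched as follows in PyTorch. This is a minimal sketch under assumptions: the projection head is taken literally as two FC layers with a ReLU, the loss averages over all positives when a query has several (the claim's formula shows a single positive), and the hidden size and temperature default are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveHead(nn.Module):
    """Nonlinear contrastive projection head: two FC layers with a ReLU."""
    def __init__(self, dim: int, hidden: int = 512):  # hidden size assumed
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

def contrastive_loss(z_q, z_support, y_q, y_support, tau: float = 0.1):
    """L_con = -log( exp(psi(z_q, z+)/tau) / (exp(psi(z_q, z+)/tau)
    + sum_{z-} exp(psi(z_q, z-)/tau)) ), psi = cosine similarity.
    Support samples sharing the query's label are positives."""
    sim = F.normalize(z_q, dim=-1) @ F.normalize(z_support, dim=-1).t() / tau
    pos = (y_q.unsqueeze(1) == y_support.unsqueeze(0)).float()      # [Q, S]
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average the loss over each query's positives (assumption for multi-positive case).
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

def predict(z_q, prototypes):
    """p(y=c|q): softmax over cosine similarities to the class prototypes."""
    sim = F.normalize(z_q, dim=-1) @ F.normalize(prototypes, dim=-1).t()
    return sim.softmax(dim=1)                                       # [Q, C]
```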
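The semantic generation network $G_s$ of claim 4 maps a visual vector to a semantic vector and is supervised against Word2Vec class embeddings. The sketch below assumes a 300-dimensional Word2Vec space, a mean-squared-error form for $\mathcal{L}_{sem}$, and illustrative layer sizes; none of these specifics are confirmed by the patent text.

```python
import torch
import torch.nn as nn

class SemanticGenerator(nn.Module):
    """G_s: two fully connected layers with a ReLU, visual -> semantic space."""
    def __init__(self, visual_dim: int, semantic_dim: int = 300, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, semantic_dim))

    def forward(self, f_v):
        return self.net(f_v)

def semantic_loss(f_s: torch.Tensor, w2v_targets: torch.Tensor) -> torch.Tensor:
    """L_sem (assumed MSE form): squared distance between generated semantic
    vectors f_s and the real class semantic vectors W(y) from a frozen
    Word2Vec lookup, averaged over the action classes in the batch."""
    return (f_s - w2v_targets).pow(2).sum(dim=-1).mean()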
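The generator/discriminator pair of claim 5 synthesizes extra class-prototype features from the visual vector plus Gaussian noise. The sketch below uses the standard binary-cross-entropy form of the min-max GAN objective, which is an assumption consistent with, but not dictated by, the claim; layer sizes, the noise dimension, and all names are illustrative.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """G: synthetic vector from the visual vector f_v and Gaussian noise eps."""
    def __init__(self, dim: int, noise_dim: int = 128):  # noise size assumed
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(nn.Linear(dim + noise_dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, f_v):
        eps = torch.randn(f_v.size(0), self.noise_dim, device=f_v.device)
        return self.net(torch.cat([f_v, eps], dim=-1))

class Discriminator(nn.Module):
    """D: scores whether a feature vector is real or synthetic."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1))

    def forward(self, x):
        return self.net(x)

def gan_losses(D, f_real, f_fake):
    """BCE form of min_G max_D E[log D(x)] + E[log(1 - D(G(f_v, eps)))]."""
    bce = nn.functional.binary_cross_entropy_with_logits
    real_logits = D(f_real)
    fake_logits = D(f_fake.detach())            # detach: D step only
    d_loss = bce(real_logits, torch.ones_like(real_logits)) + \
             bce(fake_logits, torch.zeros_like(fake_logits))
    g_logits = D(f_fake)                        # G step sees gradients
    g_loss = bce(g_logits, torch.ones_like(g_logits))
    return d_loss, g_loss
```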
Description
Small sample behavior recognition method based on cross-modal contrastive learning network
Technical Field
The invention relates to computer vision technology, in particular to a small sample behavior recognition method based on a cross-modal contrastive learning network.
Background
Small sample behavior recognition is one of the important research directions in computer vision and plays an important role in intelligent monitoring, abnormal behavior detection, human-computer interaction, and other fields. Small sample behavior recognition aims to identify the category of a given video using few samples. In recent years, deep-learning-based methods have achieved significant success in computer vision. However, these powerful methods require a large number of labeled samples to train a good model. In real life, acquiring massive labeled data is very expensive or impractical, because data collection is a tedious process that consumes substantial manpower and material resources. These problems may limit the application of deep-learning-based methods in real-world scenarios. Therefore, how to obtain a robust model from a small number of labeled samples under the small sample learning setting is one of the key issues in video classification. In addition, although great effort has been devoted to image classification, the performance of image classification methods applied directly to video classification may drop drastically, because video contains an additional time dimension compared with images and therefore has more complex intrinsic properties. Using existing image classification methods to address video classification tasks often produces unsatisfactory results. Therefore, how to efficiently capture temporal information in video is another key issue in video classification.
Disclosure of Invention
The invention aims to provide a small sample behavior recognition method based on a cross-modal contrastive learning network with higher recognition accuracy, addressing problems such as insufficient data and inadequate spatio-temporal information mining in small sample behavior recognition.
The invention comprises the following steps: 1) giving a small sample behavior recognition data set, wherein the data set comprises a plurality of videos, each video consists of a plurality of video frames, and sampling the video frames using a sparse temporal sampling strategy; 2) inputting the sampled video frames into a backbone network to obtain visual features $F_v=\{F_1,F_2,\ldots,F_i,\ldots,F_{n_{seg}}\}$, where $F_i$ represents the visual feature of the $i$-th video frame, and inputting the visual features $F_v$ into a spatio-temporal enhancement module to obtain a visual vector; 3) inputting the visual vector generated in step 2) into a semantic generation network to generate a semantic vector; 4) concatenating the visual vector generated in step 2) and the semantic vector generated in step 3) to form a mixed feature vector, and inputting the mixed feature vector into a nonlinear contrastive projection head to obtain a transformed mixed feature vector; 5) generating a synthetic vector from the visual vector obtained in step 2) and Gaussian noise, and inputting the synthetic vector into the nonlinear contrastive projection head to obtain a final synthetic vector; 6) taking the transformed mixed feature vectors generated in step 4) and the synthetic vectors generated in step 5) as class prototypes, and calculating the distances between class prototypes using the cosine similarity measure $\psi(\cdot,\cdot)$.

In step 1), the specific step of sampling the video frames may be to randomly divide a video $V_a$ into $n_{seg}$ video segments and randomly extract one frame from each segment, so that the randomly extracted $n_{seg}$ video frames form a new video clip $V'_a=\{I_1,I_2,\ldots,I_i,\ldots,I_{n_{seg}}\}$, where $I_i$ represents the video frame randomly extracted from the $i$-th video segment.

In step 2), the specific steps of obtaining the visual vector may be: first, three different $1\times 1$ 2D convolution layers are applied to the visual features $F_v$ to obtain query, key, and value features, denoted $Q$, $K$, and $V$, respectively. A weight $W_j$ is obtained from $Q$ and $K$ by an element-wise multiplication operation, and $W_j$ is then used to weight $V$; the process is defined as follows: $W_j=Q\odot K$, $X=W_j\odot V$, where $\odot$ denotes element-wise multiplication and $X$ denotes the weighted value features. To preserve the original visual features, $X$ is then combined with $F_v$ through a residual operation to obtain the visual features $X'$. To acquire temporal information, a $3\times 3$ 2D convolution layer processes the visual features $X'$, which can be expressed as $Y(t)=\mathrm{conv}(X'_{t+1})-X'_t,\ 1\le t\le n_{seg}-1$, where $Y(t)$ represents the motion feature at time $t$, $\mathrm{conv}(\cdot)$ represents the $3\times 3$ 2D convolution operation, and $X'_{t+1}$ and $X'_t$ represent the visual features at times $t+1$ and $t$; in particular, the motion feature at the last time step is set to zero, and the motion features at different times are concatenated to form the final motion feature.
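The temporal part of step 2) reduces to a shifted convolution and a difference. Below is a minimal PyTorch sketch of the motion-feature computation $Y(t)=\mathrm{conv}(X'_{t+1})-X'_t$ with the last step zeroed, followed by the spatial-temporal average pooling; the $[B,T,C,H,W]$ tensor layout, the single shared $3\times 3$ convolution, and the module name are assumptions for illustration (the spatial attention part is omitted).

```python
import torch
import torch.nn as nn

class MotionFeature(nn.Module):
    """Sketch of the temporal enhancement: Y(t) = conv(X'_{t+1}) - X'_t,
    with the motion feature of the last time step set to zero, followed by
    average pooling over the temporal and spatial dimensions."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                       # x: [B, T, C, H, W]
        b, t, c, h, w = x.shape
        # conv(X'_{t+1}) for t = 1 .. n_seg - 1, applied frame-wise.
        nxt = self.conv(x[:, 1:].reshape(-1, c, h, w)).reshape(b, t - 1, c, h, w)
        y = nxt - x[:, :-1]                     # Y(t), 1 <= t <= n_seg - 1
        # Last time step's motion feature is set to zero, then concatenated.
        y = torch.cat([y, torch.zeros_like(x[:, :1])], dim=1)
        return y.mean(dim=(1, 3, 4))            # pooled motion features: [B, C]
```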