CN-121999302-A - Fine granularity multi-mode video behavior recognition method and system guided by motion saliency
Abstract
The invention discloses a motion-saliency-guided fine-grained multi-modal video behavior recognition method and system. The method comprises: selecting the top K most motion-salient image blocks from all image blocks of the DRDI image of each time segment as dynamic image blocks, and selecting the K spatially corresponding image blocks from the RGB image as static image blocks; splicing the dynamic and static image blocks of each time segment and inputting the spliced image blocks into a visual encoder for fine-grained multi-modal interactive learning to obtain the feature representation of the current time segment; processing the feature representations of all time segments to obtain a video global feature representation; extracting text features from the text descriptions of the behavior categories to obtain text feature representations; and calculating the similarity between the video global feature representation and the text feature representations and determining the video behavior recognition result according to the similarity. The method and system can improve the accuracy of video behavior recognition.
Inventors
- WU HANBO
- SONG RUI
- WANG CHAOQUN
Assignees
- Shandong University (山东大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-04-08
Claims (10)
- 1. A motion-saliency-guided fine-grained multi-modal video behavior recognition method, characterized by comprising the following steps: obtaining an RGB video to be subjected to behavior recognition and its corresponding depth video, uniformly dividing the RGB video and its corresponding depth video into T time segments, and constructing RGB-DRDI image pairs for all time segments; uniformly dividing the RGB image and the DRDI image in the RGB-DRDI image pair of each time segment into a plurality of mutually non-overlapping image blocks, respectively; selecting the top K most motion-salient image blocks from all image blocks of the DRDI image of each time segment as dynamic image blocks, and selecting the K image blocks at the corresponding spatial positions from the RGB image in the same time segment as static image blocks; splicing the dynamic image blocks and the static image blocks of each time segment, and inputting the spliced image blocks into a visual encoder of a CLIP model for fine-grained multi-modal interactive learning to obtain the feature representation of the current time segment; processing the feature representations of all time segments to obtain a video global feature representation; inputting text descriptions of the behavior categories into a text encoder of the CLIP model for text feature extraction to obtain text feature representations; and calculating the similarity between the video global feature representation and the text feature representations, and determining the video behavior recognition result according to the text description corresponding to the maximum similarity.
- 2. The motion-saliency-guided fine-grained multi-modal video behavior recognition method of claim 1, wherein uniformly dividing the RGB video and its corresponding depth video into T time segments and constructing RGB-DRDI image pairs for all time segments comprises: uniformly dividing the RGB video into T RGB time segments and randomly sampling one RGB image frame in each RGB time segment, wherein T is a positive integer; dividing the depth video into T depth time segments in the same manner as the RGB video, and calculating residual frames between adjacent depth frames within each depth time segment to obtain the residual image sequence of each depth time segment; sorting and pooling the residual image sequence of each depth time segment to obtain the depth residual dynamic image DRDI corresponding to each depth time segment; and taking the RGB image frame and the depth residual dynamic image DRDI in the same time segment as a group of image pairs, thereby constructing the RGB-DRDI image pairs of all time segments corresponding to the RGB video and its corresponding depth video.
- 3. The motion-saliency-guided fine-grained multi-modal video behavior recognition method of claim 2, wherein calculating residual frames between adjacent depth frames within each depth time segment to obtain the residual image sequence of each depth time segment comprises: within each segment, calculating residuals between adjacent depth frames to capture motion changes; denoting the depth frame sequence of the $i$-th depth time segment as $\{d_1^i, d_2^i, \dots, d_{n_i}^i\}$, wherein $d_t^i$ is the $t$-th frame depth image of the $i$-th depth time segment and $n_i$ is the total number of frames in the segment, the $t$-th residual frame $r_t^i$ of the $i$-th segment is expressed as: $r_t^i = d_{t+1}^i - d_t^i$, $t = 1, \dots, n_i - 1$; wherein the residual image sequence of the $i$-th segment is $R^i = \{r_1^i, r_2^i, \dots, r_{n_i-1}^i\}$ and $d_{t+1}^i$ is the $(t+1)$-th frame depth image of the $i$-th depth time segment.
- 4. The motion-saliency-guided fine-grained multi-modal video behavior recognition method according to claim 2, wherein sorting and pooling the residual image sequence of each depth time segment to obtain the depth residual dynamic image DRDI corresponding to each depth time segment specifically comprises: to represent the temporal evolution of the behavior within each segment, applying rank pooling to the residual image sequence $R^i$ to generate a single depth residual dynamic image, wherein rank pooling is a temporal encoding method that preserves the temporal order of a frame sequence by learning a time-varying linear ranking function; first performing a time-varying mean operation on the residual image sequence $R^i$ to obtain a smoothed feature sequence $\{v_t^i\}$: $v_t^i = \frac{1}{t}\sum_{\tau=1}^{t} r_\tau^i$, wherein $v_t^i$ represents the smoothed feature vector after the $t$-th residual frame in the $i$-th time segment and $r_\tau^i$ represents the $\tau$-th residual frame of the $i$-th segment; the final dynamic image is obtained by optimizing a linear ranking function: if the time steps satisfy $t_1 > t_2$, then $u^\top v_{t_1}^i > u^\top v_{t_2}^i$, wherein the ranking vector $u$ is used to represent the temporal order information; the ranking vector $u$ is solved from the objective function: $\min_{u,\,\epsilon} \frac{1}{2}\|u\|^2 + \lambda \sum_{t_1 > t_2} \epsilon_{t_1 t_2}$, subject to $u^\top (v_{t_1}^i - v_{t_2}^i) \ge 1 - \epsilon_{t_1 t_2}$ and $\epsilon_{t_1 t_2} \ge 0$, wherein $\epsilon_{t_1 t_2}$ are the slack (relaxation) variables and $\lambda$ is the regularization parameter; and performing dimension transformation on the optimized vector $u$ to obtain the depth residual dynamic image.
- 5. The motion-saliency-guided fine-grained multi-modal video behavior recognition method according to claim 1, wherein selecting the top K most motion-salient image blocks from all image blocks of the DRDI image of each time segment as dynamic image blocks, and selecting the K image blocks at the corresponding spatial positions from the RGB image in the same time segment as static image blocks, specifically comprises: calculating the motion saliency score of each image block of the DRDI image; selecting the top K most motion-salient image blocks from the DRDI image as dynamic image blocks according to the motion saliency scores; and selecting K static image blocks from the RGB image according to the principle of consistent spatial positions, wherein the spatial positions of the static image blocks in the RGB image are consistent with the spatial positions of the dynamic image blocks in the DRDI image.
- 6. The motion-saliency-guided fine-grained multi-modal video behavior recognition method as set forth in claim 5, wherein calculating the motion saliency score of each image block of the DRDI image comprises: inputting the DRDI image $M_i$ of the $i$-th time segment into a 2D convolutional neural network for feature learning, the 2D convolutional neural network adopting ResNet18; after feature learning, obtaining a spatial feature map $F_i$ of size $C \times 14 \times 14$, wherein the image blocks of the spatial feature map correspond one-to-one with the $14 \times 14$ non-overlapping image blocks of the RGB and DRDI images; and after applying an average pooling operation to the spatial feature map along the channel dimension, applying an activation function to obtain a motion saliency score map $S_i$: $S_i = \sigma(\mathrm{AvgPool}(\mathrm{CNN_{2D}}(M_i)))$, wherein each value of $S_i$ represents the importance estimate of the corresponding image block, $\mathrm{AvgPool}$ represents the average pooling along the channel dimension, $\sigma$ represents the activation function, and $\mathrm{CNN_{2D}}$ represents the two-dimensional convolutional neural network.
- 7. The motion-saliency-guided fine-grained multi-modal video behavior recognition method according to claim 1, wherein splicing the dynamic image blocks and the static image blocks of each time segment, inputting the spliced image blocks into a visual encoder of a CLIP model for fine-grained multi-modal interactive learning to obtain the feature representation of the current time segment, and applying an average pooling operation to the feature representations of all time segments to obtain the video global feature representation comprises: mapping the static image blocks and the dynamic image blocks of each time segment into a static image block embedding sequence and a dynamic image block embedding sequence, respectively, through a linear projection layer; splicing the static image block embedding sequence and the dynamic image block embedding sequence, and adding spatial position encoding and modality encoding to the splicing result to obtain the input sequence of the CLIP visual encoder; inputting the input sequence into the visual encoder of the CLIP model to obtain an output sequence of image block tokens; applying an averaging operation to the output sequence of image block tokens to obtain the feature representation of each time segment; linearly projecting the feature representation of each time segment to obtain the projection result of each time segment; and applying a temporal average pooling operation to the projection results of all time segments to obtain the video global feature representation.
- 8. The motion-saliency-guided fine-grained multi-modal video behavior recognition method according to claim 7, wherein applying the averaging operation to the output sequence of image block tokens to obtain the feature representation of each time segment, linearly projecting the feature representation of each time segment to obtain the projection result of each time segment, and applying the temporal average pooling operation to the projection results of all time segments to obtain the video global feature representation comprises: after $L$ encoder layers, applying an averaging operation to the output sequence $z_i^L$ of all image block tokens to obtain the segment-level multi-modal feature representation, and mapping it through a linear projection matrix $W$ into a common latent space for subsequent visual-text feature alignment, the calculation formula being: $f_i = W \cdot \mathrm{Avg}(z_i^L)$; wherein $z_i^L$ represents the output of the $L$-th layer, $\mathrm{Avg}$ represents the averaging operation, and $f_i$ represents the feature representation of the $i$-th segment; and performing average pooling along the time dimension over the feature representations $\{f_1, \dots, f_T\}$ of the T segments to obtain the final video-level feature representation $v = \frac{1}{T}\sum_{i=1}^{T} f_i$.
- 9. The motion-saliency-guided fine-grained multi-modal video behavior recognition method of claim 1, wherein the text descriptions of the behavior categories are first enhanced by a large language model and then input into a text encoder of the CLIP model for text feature extraction to obtain the text feature representations.
- 10. A motion-saliency-guided fine-grained multi-modal video behavior recognition system, comprising: an acquisition module configured to acquire an RGB video to be subjected to behavior recognition and its corresponding depth video, uniformly divide the RGB video and its corresponding depth video into T time segments, and construct RGB-DRDI image pairs for all time segments; a dividing module configured to uniformly divide the RGB image and the DRDI image in the RGB-DRDI image pair of each time segment into a plurality of mutually non-overlapping image blocks, respectively; a selecting module configured to select the top K most motion-salient image blocks from all image blocks of the DRDI image of each time segment as dynamic image blocks, and to select the K image blocks at the corresponding spatial positions from the RGB image in the same time segment as static image blocks; an extraction module configured to splice the dynamic image blocks and the static image blocks of each time segment and input the spliced image blocks into a visual encoder of a CLIP model for fine-grained multi-modal interactive learning to obtain the feature representation of the current time segment; and a recognition module configured to input text descriptions of the behavior categories into a text encoder of the CLIP model for text feature extraction to obtain text feature representations, calculate the similarity between the video global feature representation and the text feature representations, and determine the video behavior recognition result according to the text description corresponding to the maximum similarity.
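The DRDI construction of claims 3 and 4 (residual frames followed by rank pooling) can be sketched in NumPy as below. Note a substitution: the patent solves a RankSVM-style objective with slack variables, whereas this sketch uses the well-known closed-form approximate rank-pooling coefficients $\alpha_t = 2t - T - 1$ applied to the time-varying means as a cheap stand-in; all function names are illustrative, not from the patent.

```python
import numpy as np

def residual_sequence(depth_frames):
    """Residual frames between adjacent depth frames: r_t = d_{t+1} - d_t."""
    return [depth_frames[t + 1] - depth_frames[t]
            for t in range(len(depth_frames) - 1)]

def approximate_rank_pool(residuals):
    """Approximate rank pooling over the residual sequence.

    The patent optimizes a ranking vector u under a RankSVM objective;
    here we use the closed-form coefficients alpha_t = 2t - T - 1 on the
    time-varying mean vectors v_t as a simplified approximation.
    """
    T = len(residuals)
    stacked = np.stack(residuals, axis=0)                      # (T, H, W)
    # Time-varying means: v_t = (1/t) * sum_{tau<=t} r_tau
    counts = np.arange(1, T + 1).reshape(-1, *([1] * residuals[0].ndim))
    means = np.cumsum(stacked, axis=0) / counts
    alphas = 2.0 * np.arange(1, T + 1) - T - 1                 # ranking weights
    # Weighted sum over time gives the (pre-normalization) dynamic image
    return np.tensordot(alphas, means, axes=(0, 0))
```

In practice the pooled array would be rescaled to image range and reshaped into the single DRDI per time segment described in claim 4.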
Description
Fine granularity multi-mode video behavior recognition method and system guided by motion saliency

Technical Field

The invention relates to the technical field of video human behavior recognition, and in particular to a motion-saliency-guided fine-grained multi-modal video behavior recognition method and system.

Background

Video human behavior recognition is a fundamental task in the field of computer vision and has attracted wide attention because of its broad use in application scenarios such as human-computer interaction, intelligent surveillance, and service robots. For RGB-D behavior recognition, researchers have proposed a variety of information fusion strategies. Current mainstream methods mainly adopt decision-level or feature-level fusion schemes. In decision-level fusion, each modality is processed independently, and the overall prediction is obtained at the end by averaging or weighting the recognition results of the independent channels. However, such methods typically model each modality individually during training and fuse the outputs only at the final stage, failing to fully mine the complementary information between modalities and thereby limiting the overall discriminative ability. In contrast, feature-level fusion approaches attempt to jointly model features from different modalities. Some methods simply concatenate or weight the single-modality features at the last layer, while others realize cross-modal feature interaction through the backbone network during feature learning. However, most of these methods rely on coarse feature aggregation mechanisms, cannot effectively describe fine-grained interaction relationships among modalities, and struggle to capture local cross-modal dependencies, so the representational capacity of the fused features, and hence the performance gain, is limited.
Disclosure of Invention

To overcome the defects of the prior art, the invention provides a motion-saliency-guided fine-grained multi-modal video behavior recognition method and system. In one aspect, a motion-saliency-guided fine-grained multi-modal video behavior recognition method is provided, comprising: obtaining an RGB video to be subjected to behavior recognition and its corresponding depth video, uniformly dividing the RGB video and its corresponding depth video into T time segments, and constructing RGB-DRDI image pairs for all time segments; uniformly dividing the RGB image and the DRDI image in the RGB-DRDI image pair of each time segment into a plurality of mutually non-overlapping image blocks, respectively; selecting the top K most motion-salient image blocks from all image blocks of the DRDI image of each time segment as dynamic image blocks, and selecting the K image blocks at the corresponding spatial positions from the RGB image in the same time segment as static image blocks; splicing the dynamic image blocks and the static image blocks of each time segment, and inputting the spliced image blocks into a visual encoder of a CLIP model for fine-grained multi-modal interactive learning to obtain the feature representation of the current time segment; processing the feature representations of all time segments to obtain a video global feature representation; inputting text descriptions of the behavior categories into a text encoder of the CLIP model for text feature extraction to obtain text feature representations; and calculating the similarity between the video global feature representation and the text feature representations, and determining the video behavior recognition result according to the text description corresponding to the maximum similarity.
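The motion-saliency-guided block selection described above can be sketched as follows, assuming the saliency score map from the 2D CNN branch and the flattened per-block arrays have already been computed; all names and shapes here are illustrative, not from the patent.

```python
import numpy as np

def select_topk_blocks(saliency, rgb_blocks, drdi_blocks, k):
    """Select the K most motion-salient DRDI blocks as dynamic blocks and
    the RGB blocks at the same spatial positions as static blocks.

    saliency    : (H, W) motion saliency score map, one score per block
    rgb_blocks  : (H*W, D) flattened RGB image blocks
    drdi_blocks : (H*W, D) flattened DRDI image blocks
    """
    flat = saliency.reshape(-1)
    topk = np.argsort(flat)[::-1][:k]   # indices of the K highest scores
    dynamic = drdi_blocks[topk]         # motion-salient dynamic blocks
    static = rgb_blocks[topk]           # spatially aligned static blocks
    return topk, static, dynamic
```

The same indices are used for both modalities, which is exactly the "consistent spatial positions" principle: static appearance context is kept only where the depth residuals indicate motion.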
In another aspect, a motion-saliency-guided fine-grained multi-modal video behavior recognition system is provided, comprising: an acquisition module configured to acquire an RGB video to be subjected to behavior recognition and its corresponding depth video, uniformly divide the RGB video and its corresponding depth video into T time segments, and construct RGB-DRDI image pairs for all time segments; a dividing module configured to uniformly divide the RGB image and the DRDI image in the RGB-DRDI image pair of each time segment into a plurality of mutually non-overlapping image blocks, respectively; a selecting module configured to select the top K most motion-salient image blocks from all image blocks of the DRDI image of each time segment as dynamic image blocks, and to select the K image blocks at the corresponding spatial positions from the RGB image in the same time segment as static image blocks; an extraction module configured to splice the dynamic image blocks and the static image blocks of each time segment and input the spliced image blocks into a visual encoder of a CLIP model for fine-grained multi-modal interactive learning to obtain the feature representation of the current time segment; and a recognition module configured to input text descriptions of the behavior categories into a text encoder of the CLIP model for text feature extraction to obtain text feature representations, calculate the similarity between the video global feature representation and the text feature representations, and determine the video behavior recognition result according to the text description corresponding to the maximum similarity.
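The recognition module's final step (similarity between the video global feature and the per-category text features, prediction by maximum similarity) follows standard CLIP-style zero-shot classification and can be sketched as below; function names and feature shapes are assumptions for illustration.

```python
import numpy as np

def classify_by_similarity(video_feat, text_feats):
    """Cosine similarity between one video-level feature (D,) and the text
    features of all behavior categories (N, D); the recognized behavior is
    the category whose text description has the maximum similarity."""
    v = video_feat / np.linalg.norm(video_feat)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sims = t @ v                      # one cosine similarity per category
    return int(np.argmax(sims)), sims
```

Both feature sets live in the common latent space produced by the linear projection of claim 8, which is what makes a plain dot product after L2 normalization a meaningful similarity.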