CN-122023915-A - Multimodal fusion human-object interaction recognition method and system based on a large model
Abstract
The invention provides a large-model-based multimodal fusion human-object interaction (HOI) recognition method and system. The method first performs object detection, pose estimation, and caption extraction on each frame of a target video to obtain the video's visual, geometric, and text features. The structured geometric features are then treated as entities of different types: a bidirectional recurrent neural network (BiRNN) extracts hidden states along the temporal dimension, a message attention mechanism realizes intra-class and inter-class information exchange between entities, and the HOI recognition result is finally output through decision-level fusion. The invention fully mines the topological relations between skeletal joints and between entities, and its semantic-visual dual-guide module realizes deep interaction between semantic and visual information, improving cross-modal alignment as well as the robustness and recognition accuracy of the model in complex scenes.
Inventors
- GAO QING
- WU BOHONG
Assignees
- Sun Yat-sen University (Shenzhen Campus)
- Sun Yat-sen University
Dates
- Publication Date
- 20260512
- Application Date
- 20260202
Claims (10)
- 1. A large-model-based multimodal fusion human-object interaction recognition method, characterized by comprising the following steps: S1) acquiring a target video, and performing object detection, pose estimation, and caption extraction on each frame of the target video to acquire visual features, geometric features, and text features of the target video; S2) constructing a feature extraction model comprising a geometric feature extraction module and a semantic-visual dual-guide module; extracting structured geometric features from the geometric features with the geometric feature extraction module; performing multi-head cross attention over the text semantic features, the HOI action labels, and the visual features with the semantic-visual dual-guide module, so as to realize deep interaction between semantic and visual information and obtain semantics-fused visual features; S3) constructing a vision-centered spatio-temporal graph, treating the structured geometric features as entities of different types, acquiring hidden states along the temporal dimension through a bidirectional recurrent neural network (BiRNN), realizing intra-class and inter-class information exchange between entities through a message attention mechanism, and finally outputting the human-object interaction recognition result through decision-level fusion (a minimal sketch of this three-stage pipeline follows the claims).
- 2. The large-model-based multimodal fusion human-object interaction recognition method as claimed in claim 1, wherein in step S1), extracting the visual features of the target video specifically comprises the following steps: S11) acquiring, with a pre-trained object detector, the feature map of each frame image and the bounding boxes of all object entities in the frame; S12) acquiring the skeleton-point matrices of all human entities in each frame through a pose estimator; computing, for each individual, the minimum bounding rectangle of the skeleton-point matrix and defining it as the corresponding human-entity bounding box, thereby obtaining the bounding boxes of all human entities; S13) mapping the human-entity bounding boxes and the object-entity bounding boxes onto the feature map, and applying ROI-Align and global average pooling to obtain the human visual feature set and the object visual feature set (see the ROI-Align sketch after the claims).
- 3. The large-model-based multimodal fusion human-object interaction recognition method as claimed in claim 2, wherein in step S1), the geometric features of the target video are extracted as follows: the diagonal corner points of each object bounding box are taken as object key points and the human skeleton-point matrices as human body key points, and the human key points and object key points are combined to construct a multi-interaction graph set from which the geometric features are acquired; a human-centered graph is constructed between each person and all objects; an object association graph is constructed over all objects; a human contact graph is constructed over all human bodies; a global position graph is constructed over all entities; in all constructed graphs, besides the coordinates of each skeleton point and key point, velocity information is additionally introduced, represented as a one-step forward difference (see the forward-difference sketch after the claims); the human-centered graphs, object association graphs, human contact graphs, and global position graph together form the geometric features.
- 4. The large-model-based multimodal fusion human-object interaction recognition method as claimed in claim 3, wherein in step S1), the text semantic features are extracted as follows: the Caption of each frame of the video is obtained by calling the API of a pre-trained visual language model (VLM) under Prompt constraints; meanwhile, the texts of all interactive action labels in the dataset are tokenized and text-encoded through BLIP and CLIP to obtain the label knowledge; the Captions are passed through the CLIP text encoder to obtain the corresponding text encoding features; the same video segment is passed frame by frame through the CLIP visual encoding module to obtain the corresponding visual encoding features; the frame-by-frame cosine distance between the text encoding features and the visual encoding features replaces the distance value in the dynamic time warping algorithm, yielding temporally aligned text semantic features (see the DTW sketch after the claims).
- 5. The large-model-based multimodal fusion human-object interaction recognition method of claim 4, wherein in step S2), the geometric feature extraction module comprises a plurality of weight-independent graph convolutional networks and linear layers; a weight-independent graph convolutional network is allocated to each graph in the interaction graph set, each graph convolutional network adapts to a different data distribution through an adaptive adjacency matrix, and the structured geometric features are obtained through graph convolution operations and feature mapping.
- 6. The large-model-based multimodal fusion human-object interaction recognition method as claimed in claim 5, wherein in step S2), for each graph its adaptive adjacency matrix is first computed; then a graph convolution operation is performed on the graph with a learnable state-transition matrix and the adaptive adjacency matrix to obtain the corresponding human-centered geometric features; the geometric features obtained from all graphs are flattened and embedded into the same dimension by an MLP, and all human-centered geometric features are added to the corresponding human visual features through a learnable scalar weight to obtain the human visual-geometric features; the same operation is performed on the object association graphs to obtain the object visual-geometric features; the human contact graphs and the global position graph are concatenated and mapped to obtain the interaction geometric features (see the adaptive graph convolution sketch after the claims).
- 7. The large-model-based multimodal fusion human-object interaction recognition method of claim 6, wherein in step S2), the semantic-visual dual-guide module comprises two Transformer decoders; the text semantic features of the image Captions are incorporated into the HOI action labels by a Transformer decoder, yielding semantic features that contain HOI interaction information aligned from the global image level to a specific sub-action space.
- 8. The large-model-based multimodal fusion human-object interaction recognition method of claim 7, wherein in step S2), in the first Transformer decoder, multi-head cross attention is computed over the text semantic features at each time step and the label knowledge, with the label knowledge as queries and the text semantic features as keys and values, and the output of the last layer is taken as the semantic features; the semantic features are then fed to the second Transformer decoder, where the human visual feature set serves as the query vector and the semantic features serve as keys and values, finally outputting the semantics-fused visual features (see the cross-attention sketch after the claims).
- 9. The large-model-based multimodal fusion human-object interaction recognition method as claimed in claim 8, wherein step S3) is specifically as follows: for the temporal edges, a bidirectional recurrent neural network (BiRNN) extracts the bidirectional hidden states of the human visual-geometric features at each time step; for the spatial edges, a message attention mechanism is designed to realize intra-class and inter-class message passing between entities; the messages of the human category and the object category are aggregated to obtain human aggregation information and object aggregation information; a segment-level network outputs segment-level recognition results; after the spatio-temporal graph outputs the vision-centered recognition result of the human-object interaction sub-actions, it is fused with the semantics-fused visual features through weighted summation to realize decision-level fusion, and the final recognition result is output (see the fusion sketch after the claims).
- 10. A large-model-based multimodal fusion human-object interaction recognition system, comprising: a preliminary feature extraction module for performing object detection, pose estimation, and caption extraction on each frame of a target video to obtain the visual features, geometric features, and text features of the target video; a feature extraction and fusion module for invoking a feature extraction model, extracting structured geometric features from the geometric features through the geometric feature extraction module of the feature extraction model, and performing multi-head cross attention over the text semantic features, the HOI action labels, and the visual features through a semantic-visual dual-guide module, so as to realize deep interaction between semantic and visual information and obtain semantics-fused visual features; and an interaction recognition module for constructing a vision-centered spatio-temporal graph, treating the structured geometric features as entities of different types, acquiring hidden states along the temporal dimension through a bidirectional recurrent neural network (BiRNN), realizing intra-class and inter-class information exchange between entities through a message attention mechanism, and finally outputting the human-object interaction recognition result through decision-level fusion.
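The three-stage pipeline of claims 1 and 10 can be summarized as a minimal PyTorch-style sketch. Every module name and interface below is a hypothetical placeholder chosen for illustration; the patent does not prescribe this decomposition.

```python
# Minimal sketch of the S1-S3 pipeline from claim 1. Each submodule is a
# hypothetical stand-in for the components named in the claims.
import torch.nn as nn

class HOIPipeline(nn.Module):
    def __init__(self, s1_frontend, geo_module, dual_guide, st_graph):
        super().__init__()
        self.s1 = s1_frontend          # S1: object detector + pose estimator + captioner
        self.geo = geo_module          # S2: geometric feature extraction module
        self.dual_guide = dual_guide   # S2: semantic-visual dual-guide module
        self.st_graph = st_graph       # S3: BiRNN + message attention + decision fusion

    def forward(self, video):
        visual, geometric, text = self.s1(video)       # per-frame features
        structured_geo = self.geo(geometric)           # structured geometric features
        fused_visual = self.dual_guide(text, visual)   # semantics-fused visual features
        return self.st_graph(structured_geo, fused_visual)
```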
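Steps S12-S13 of claim 2 (minimum bounding rectangle from skeleton points, then ROI-Align plus global average pooling) admit a short sketch using torchvision's `roi_align`. Box format `(x1, y1, x2, y2)` and the 7x7 pooled size are assumptions, not values stated in the patent.

```python
# Sketch of claim 2, steps S12-S13: entity boxes -> pooled visual features.
import torch
from torchvision.ops import roi_align

def skeleton_bbox(skeleton):
    """S12: skeleton (J, 2) joint coordinates of one person -> minimum bounding rectangle."""
    x1, y1 = skeleton.min(dim=0).values
    x2, y2 = skeleton.max(dim=0).values
    return torch.stack([x1, y1, x2, y2])

def entity_visual_features(feature_map, boxes, spatial_scale, out_size=7):
    """S13: feature_map (1, C, H, W) from the detector backbone;
    boxes (N, 4) human or object bounding boxes in image coordinates."""
    rois = roi_align(feature_map, [boxes], output_size=(out_size, out_size),
                     spatial_scale=spatial_scale, aligned=True)   # (N, C, 7, 7)
    return rois.mean(dim=(2, 3))  # global average pooling -> (N, C) feature set
```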
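The velocity feature of claim 3 is a one-step forward difference of node coordinates over time. A minimal sketch follows; padding the last frame by repeating the final velocity is an assumption made only to keep the sequence length unchanged.

```python
# Sketch of claim 3's velocity channel: v_t = p_{t+1} - p_t, concatenated to position.
import torch

def forward_difference_velocity(coords):
    """coords: (T, N, 2) coordinates of N skeleton/key points over T frames."""
    vel = coords[1:] - coords[:-1]             # one-step forward difference
    vel = torch.cat([vel, vel[-1:]], dim=0)    # repeat last velocity to keep length T
    return torch.cat([coords, vel], dim=-1)    # (T, N, 4): position + velocity per node
```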
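Claim 4's temporal alignment replaces the local distance in dynamic time warping with the cosine distance between CLIP text features of the per-frame Captions and CLIP visual features of the frames. The sketch below assumes both feature sets have already been extracted; the plain-Python DTW recursion and backtracking are standard, not patent-specific.

```python
# Sketch of claim 4: DTW over a cosine-distance cost matrix between Caption text
# encodings and frame visual encodings, returning the warping path.
import torch
import torch.nn.functional as F

def dtw_align(text_feats, vis_feats):
    """text_feats: (M, D) Caption encodings; vis_feats: (T, D) frame encodings."""
    # Local cost: cosine distance = 1 - cosine similarity (replaces the usual DTW distance).
    cost = 1 - F.cosine_similarity(text_feats[:, None, :], vis_feats[None, :, :], dim=-1)
    C = cost.tolist()
    M, T = len(C), len(C[0])
    acc = [[float('inf')] * (T + 1) for _ in range(M + 1)]
    acc[0][0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, T + 1):
            acc[i][j] = C[i-1][j-1] + min(acc[i-1][j], acc[i][j-1], acc[i-1][j-1])
    # Backtrack the accumulated-cost matrix to recover the Caption <-> frame alignment.
    path, i, j = [], M, T
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min(((i-1, j-1), (i-1, j), (i, j-1)), key=lambda p: acc[p[0]][p[1]])
    return path[::-1]
```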
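Claims 5-6 describe weight-independent graph convolutions whose adjacency is adaptive. The patent does not disclose the exact formula, so the sketch below borrows a common construction (learnable node embeddings, as in Graph WaveNet / 2s-AGCN) as an explicit assumption; the linear layer stands in for the learnable state-transition matrix.

```python
# Sketch of one adaptive-adjacency graph convolution (claims 5-6). In the patent's
# design, each graph in the interaction graph set would get its own instance,
# so the weights are not shared across graphs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGraphConv(nn.Module):
    def __init__(self, num_nodes, in_dim, out_dim, embed_dim=16):
        super().__init__()
        self.e1 = nn.Parameter(torch.randn(num_nodes, embed_dim))  # learnable node embeddings
        self.e2 = nn.Parameter(torch.randn(num_nodes, embed_dim))  # (assumed construction)
        self.theta = nn.Linear(in_dim, out_dim)  # learnable state-transition / weight matrix

    def forward(self, x):
        """x: (T, N, in_dim) node features of one interaction graph over time."""
        adj = F.softmax(F.relu(self.e1 @ self.e2.t()), dim=-1)  # adaptive adjacency (N, N)
        return F.relu(self.theta(adj @ x))                      # graph convolution step
```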
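The dual-guide cross attention of claim 8 maps directly onto two `nn.MultiheadAttention` calls. The query/key-value roles follow the claim wording; the surrounding Transformer-decoder machinery (self-attention, feed-forward, layer norm, stacking) is omitted for brevity, so this is a sketch of the attention core only.

```python
# Sketch of claim 8's two-decoder cross attention. All tensors are batch-first:
# label_knowledge (B, L, D), text_feats (B, M, D), human_visual (B, N, D).
import torch.nn as nn

class DualGuideAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn1 = nn.MultiheadAttention(dim, heads, batch_first=True)  # decoder 1 core
        self.attn2 = nn.MultiheadAttention(dim, heads, batch_first=True)  # decoder 2 core

    def forward(self, label_knowledge, text_feats, human_visual):
        # Decoder 1: HOI label knowledge queries the Caption text semantic features.
        semantic, _ = self.attn1(label_knowledge, text_feats, text_feats)
        # Decoder 2: human visual features query the semantic features,
        # yielding the semantics-fused visual features.
        fused, _ = self.attn2(human_visual, semantic, semantic)
        return fused
```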
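For claim 9, the temporal-edge BiRNN and the final decision-level fusion can be sketched as below. A bidirectional GRU stands in for the BiRNN, and the message attention mechanism and segment-level network are omitted; the learnable scalar fusion weight and the two classifier heads are shape assumptions, not the patent's exact design.

```python
# Sketch of claim 9's temporal modeling and weighted-sum decision-level fusion.
import torch
import torch.nn as nn

class TemporalFusionHead(nn.Module):
    def __init__(self, feat_dim, hidden, num_classes):
        super().__init__()
        self.birnn = nn.GRU(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.graph_head = nn.Linear(2 * hidden, num_classes)   # vision-centered branch
        self.semantic_head = nn.Linear(feat_dim, num_classes)  # semantics-fused branch
        self.alpha = nn.Parameter(torch.tensor(0.5))           # learnable fusion weight

    def forward(self, vis_geo_feats, fused_visual):
        """vis_geo_feats: (B, T, feat_dim) human visual-geometric features;
        fused_visual: (B, T, feat_dim) semantics-fused visual features."""
        h, _ = self.birnn(vis_geo_feats)        # bidirectional hidden states (B, T, 2H)
        logits_graph = self.graph_head(h)
        logits_sem = self.semantic_head(fused_visual)
        # Decision-level fusion: weighted summation of the two branches' predictions.
        return self.alpha * logits_graph + (1 - self.alpha) * logits_sem
```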
Description
Multimodal fusion human-object interaction recognition method and system based on a large model

Technical Field

The invention relates to the technical field of computer vision, and in particular to a large-model-based multimodal fusion human-object interaction recognition method and system.

Background

Human-Object Interaction (HOI) recognition is one of the core tasks in computer vision, aimed at recognizing the ternary relationship <person, interaction category, object> from images or videos. Existing HOI recognition methods focus on static images, yet most real-world interactions, such as picking up, putting down, and grasping, are temporally dynamic; static images carry almost no temporal information about people and objects, so image-based HOI detection methods can hardly distinguish such actions. Human-object interaction recognition on video data is therefore key to solving this problem. The core difficulty of video HOI recognition is that occlusion and viewpoint changes in the video itself, dynamic background interference, and the ambiguity of fine-grained actions all leave any single modality unable to capture these subtle differences. Multimodal fusion is regarded as a potential way to break through the limitations of a single modality, and the human pose, as an abstract representation of human actions, is the first choice. Prior-art methods such as 2G-GCN and Hier-GAT improve video HOI recognition performance by introducing human poses; however, they encode pose information shallowly with simple graph structures and cannot fully mine the topological associations between skeletal joints. The semantic modality serves as a bridge connecting visual features with external knowledge and can help a model maintain a robust understanding of interactive behavior in data-limited or complex environments. Although visual language models (VLMs) have made remarkable progress in recent years, their enormous parameter counts make fine-tuning for downstream tasks costly; some studies attempt to use VLM-generated image descriptions (Captions) to guide semantic learning, but the inter-frame correlation of dynamic interactions in HOI videos makes accurate frame-by-frame Caption generation challenging. In addition, the intrinsic characteristics of the data differ markedly across modalities, and both complementarity and conflict exist between them; existing methods (such as simple feature concatenation) can hardly weight the contributions of different modalities adaptively, and modality conflicts arise easily, especially in complex scenes.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a large-model-based multimodal fusion human-object interaction recognition method and system. The invention adopts a multi-level fusion strategy of message attention, feature-level fusion, and decision-level fusion to fuse the modalities effectively, thereby realizing accurate segmentation and recognition of video HOI.
In a first aspect, the invention provides a large-model-based multimodal fusion human-object interaction recognition method, comprising the following steps: S1) acquiring a target video, and performing object detection, pose estimation, and caption extraction on each frame of the target video to acquire visual features, geometric features, and text features of the target video; S2) constructing a feature extraction model comprising a geometric feature extraction module and a semantic-visual dual-guide module, extracting structured geometric features from the geometric features with the geometric feature extraction module, and performing multi-head cross attention over the text semantic features, the HOI action labels, and the visual features with the semantic-visual dual-guide module, so as to realize deep interaction between semantic and visual information and obtain semantics-fused visual features; S3) constructing a vision-centered spatio-temporal graph, treating the structured geometric features as entities of different types, acquiring hidden states along the temporal dimension through a bidirectional recurrent neural network (BiRNN), realizing intra-class and inter-class information exchange between entities through a message attention mechanism, and finally outputting the human-object interaction recognition result through decision-level fusion. Preferably, in step S1), extracting the visual features of the target video specifically comprises the following steps: S11) acquiring, with a pre-trained object detector, the feature map of each frame image and the bounding boxes of all object enti