CN-121999538-A - Behavior recognition method and system based on attribute-aware semantic collaborative network


Abstract

The invention discloses a behavior recognition method and system based on an attribute-aware semantic collaborative network. The method comprises: obtaining a skeleton sequence from a video; extracting geometric statistical information from the skeleton sequence and constructing a geometry-aware prompt vector; obtaining modulation parameters from the geometry-aware prompt vector and, according to those parameters, linearly modulating the spatiotemporal features extracted by a backbone network; inputting the modulated features into the attribute-aware semantic collaborative network, which jointly models the spatial structure information and the temporal dynamic information of the skeleton sequence through a spatial semantic channel and a temporal semantic channel; obtaining a behavior recognition model through semantic-consistency optimization training; and inputting a video to be recognized into the behavior recognition model, which outputs the behavior category together with the geometric interpretation basis corresponding to the recognition result. The invention explicitly injects skeletal geometric structure information into spatiotemporal representation learning, so that the recognition result and its decision basis are output together, improving the stability and interpretability of the model in complex scenes.

Inventors

XU XIN
YU JIAHUI
CUI YANXIN
WANG JINLONG
MOU TAO
ZHANG XINGHAI

Assignees

Zhejiang University (浙江大学)
Hangzhou Wocai High-Tech Co., Ltd. (杭州沃才高科技有限公司)

Dates

Publication Date: 20260508
Application Date: 20260410

Claims (10)

1. A behavior recognition method based on an attribute-aware semantic collaborative network, characterized by comprising the following steps: (1) extracting a sequence of human skeleton key-point coordinates from a video containing human behavior, and performing coordinate alignment and scale normalization on the sequence to obtain a standardized skeleton sequence; (2) extracting geometric statistical information from the standardized skeleton sequence and constructing a geometry-aware prompt vector; (3) obtaining modulation parameters from the geometry-aware prompt vector, and linearly modulating the spatiotemporal features extracted by the backbone network according to the modulation parameters to obtain attribute-modulated spatiotemporal features; (4) inputting the linearly modulated spatiotemporal features into the attribute-aware semantic collaborative network, jointly modeling the spatial structure information and the temporal dynamic information of the skeleton sequence through a spatial semantic channel and a temporal semantic channel, and obtaining a behavior recognition model through semantic-consistency optimization training; (5) inputting a video to be recognized into the behavior recognition model, outputting the behavior category, and synchronously outputting the geometric interpretation basis corresponding to the recognition result.
2. The behavior recognition method based on an attribute-aware semantic collaborative network according to claim 1, wherein step (2) comprises: (2-1) constructing the joint-pair distance matrix of the t-th frame of the standardized skeleton sequence, and performing smooth saturation normalization on the joint-pair distance matrix to obtain a joint-pair distance distribution; (2-2) performing soft quantization encoding on the distance distribution with a group of radial basis function (RBF) kernels to obtain a global distance spectrum attribute vector; (2-3) constructing, based on the bone segment set, a bone-segment-aware distance spectrum attribute vector of the t-th frame; (2-4) concatenating the global distance spectrum attribute vector with the bone-segment-aware distance spectrum attribute vector to obtain a frame-level geometric attribute vector, and averaging all frame-level geometric attribute vectors over the time dimension to obtain a sequence-level geometry-aware prompt vector.
3. The behavior recognition method based on an attribute-aware semantic collaborative network according to claim 2, wherein step (2-1) comprises: constructing the joint-pair distance matrix of the t-th frame of the standardized skeleton sequence, the standardized skeleton sequence having C coordinate channels, T time frames and a given number of joints, wherein the entry in row i and column j of the matrix is the Euclidean distance between the i-th joint and the j-th joint of the t-th frame, $d_{t,i,j} = \lVert \mathbf{x}_{t,i} - \mathbf{x}_{t,j} \rVert_2$, with $\mathbf{x}_{t,i}$ and $\mathbf{x}_{t,j}$ the standardized coordinate vectors of the i-th and j-th joints of the t-th frame; and performing smooth saturation normalization on the joint-pair distance matrix with respect to a robust reference scale to obtain normalized joint-pair distances, which together form the joint-pair distance distribution.
4. The behavior recognition method based on an attribute-aware semantic collaborative network according to claim 2, wherein step (2-2) comprises: performing soft quantization encoding on the distance distribution with a group of RBF kernels to obtain the global distance spectrum attribute vector of the t-th frame, wherein the RBF kernel group comprises a set number of distance kernels, the k-th kernel has its own kernel center, and the kernel bandwidth is determined by a bandwidth adjustment coefficient; and constructing, from the normalized distances of all joint pairs, the global distance spectrum attribute vector of the t-th frame, whose k-th entry is the global response value of the t-th frame on the k-th distance kernel.
5. The behavior recognition method based on an attribute-aware semantic collaborative network according to claim 2, wherein step (2-3) comprises: defining a bone segment set, each element (i, j) of which denotes the bone segment connecting the i-th joint and the j-th joint; and constructing, based on the bone segment set, the bone-segment-aware distance spectrum attribute vector of the t-th frame, whose k-th entry is the bone segment response value of the t-th frame on the k-th distance kernel, computed over all bone segments in the set, with the number of bone segments as a parameter of the computation.
6. The behavior recognition method based on an attribute-aware semantic collaborative network according to claim 1, wherein step (3) comprises: (3-1) normalizing the sequence-level geometry-aware prompt vector to obtain an attribute prompt vector; (3-2) performing nonlinear mapping and dimension reduction on the attribute prompt vector through a two-layer perceptron to obtain an embedding vector; (3-3) applying linear transformations to the embedding vector with a FiLM (feature-wise linear modulation) condition generator to generate the modulation parameters; (3-4) extracting the spatiotemporal features of the skeleton sequence with a graph convolutional network, and linearly modulating these spatiotemporal features according to the modulation parameters to obtain the linearly modulated spatiotemporal features.
7. The behavior recognition method based on an attribute-aware semantic collaborative network according to claim 6, wherein step (4) comprises: (4-1) constructing a spatial semantic channel and obtaining the output features of the spatial semantic channel; (4-2) constructing a temporal semantic channel and obtaining the output features of the temporal semantic channel; (4-3) mapping the output features of the spatial and temporal semantic channels, through respective projection heads, to a spatial alignment vector and a temporal alignment vector in a contrastive learning space, and fusing the spatial alignment vector with the temporal alignment vector to obtain a global fused feature; (4-4) providing momentum-updated momentum encoders whose outputs are the momentum-branch representations corresponding respectively to the spatial alignment vector, the temporal alignment vector and the global fused feature; (4-5) constructing a cross-domain contrastive loss and performing semantic-consistency optimization training to obtain the behavior recognition model.
8. The behavior recognition method based on an attribute-aware semantic collaborative network according to claim 7, wherein step (4-1) comprises: constructing a spatial semantic channel for capturing the semantic relations of the human skeleton over its spatial topology, and obtaining the output features of the spatial semantic channel as $F_{\mathrm{spa}} = \gamma_{\mathrm{spa}} \odot \mathcal{G}_{\mathrm{spa}}(\tilde{F}; A_{\mathrm{spa}}) + \beta_{\mathrm{spa}}$, wherein $\odot$ denotes channel-wise multiplication; $\mathcal{G}_{\mathrm{spa}}(\cdot\,; A_{\mathrm{spa}})$ denotes the spatial graph convolution based on the skeleton topology adjacency matrix $A_{\mathrm{spa}}$; $\tilde{F}$ denotes the linearly modulated spatiotemporal features; and $\gamma_{\mathrm{spa}}$ and $\beta_{\mathrm{spa}}$ are respectively the scaling parameters and offset parameters of the spatial semantic channel, obtained by mapping the embedding vector through the FiLM condition generator.
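9. The behavior recognition method based on an attribute-aware semantic collaborative network according to claim 7, wherein step (4-2) comprises: constructing a temporal semantic channel for capturing how an action changes over time, and obtaining the output features of the temporal semantic channel as $F_{\mathrm{tem}} = \gamma_{\mathrm{tem}} \odot \mathcal{G}_{\mathrm{tem}}(\tilde{F}; A_{\mathrm{tem}}) + \beta_{\mathrm{tem}}$, wherein $\odot$ denotes channel-wise multiplication; $\mathcal{G}_{\mathrm{tem}}(\cdot\,; A_{\mathrm{tem}})$ denotes graph convolution using the temporal adjacency matrix $A_{\mathrm{tem}}$; $\tilde{F}$ denotes the linearly modulated spatiotemporal features; and $\gamma_{\mathrm{tem}}$ and $\beta_{\mathrm{tem}}$ are respectively the scaling parameters and offset parameters of the temporal semantic channel, obtained by mapping the embedding vector through the FiLM condition generator.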
10. A behavior recognition system based on an attribute-aware semantic collaborative network, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein a video to be recognized is input into a behavior recognition model, and the processor executes the computer program to predict the behavior category in the video to be recognized and to synchronously output the geometric interpretation basis corresponding to the recognition result; and the behavior recognition model is obtained by the training steps of the method according to any one of claims 1 to 9.
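Step (1) of claim 1 names coordinate alignment and scale normalization without fixing either operation. The Python sketch below assumes one common choice: translating every joint so that a chosen root joint of the first frame becomes the origin, then dividing by the mean length of a chosen reference bone. The function name `standardize_skeleton` and the parameters `root` and `ref_bone` are hypothetical.

```python
import math


def standardize_skeleton(sequence, root=0, ref_bone=(0, 1), eps=1e-6):
    """Align and scale-normalize a skeleton sequence (hypothetical sketch).

    `sequence[t][v]` is the coordinate tuple of joint v in frame t.
    ASSUMPTIONS (not taken from the claims):
      * alignment: subtract the root-joint position of the FIRST frame from every
        joint, which keeps the body's global motion relative to its start;
      * scale: divide by the mean length of `ref_bone` over all frames.
    """
    origin = sequence[0][root]
    i, j = ref_bone
    lengths = [math.dist(frame[i], frame[j]) for frame in sequence]
    scale = max(sum(lengths) / len(lengths), eps)  # guard against zero length
    return [
        [tuple((c - o) / scale for c, o in zip(joint, origin)) for joint in frame]
        for frame in sequence
    ]


# Usage: two frames, two joints in 2-D; the reference bone has length 2.
seq = [[(1.0, 1.0), (3.0, 1.0)], [(2.0, 1.0), (4.0, 1.0)]]
print(standardize_skeleton(seq))
```

Centering on the first-frame root rather than on each frame's own root preserves translational movement, which may matter for actions such as walking; per-frame centering is an equally plausible alternative.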
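For step (2-1), claim 3 states that the joint-pair distance is Euclidean, but this text does not give the form of the smooth saturation normalization or of the robust reference scale. The sketch below assumes `tanh(d / s)` as the saturating function and the median of all joint-pair distances in the sequence as the reference scale `s`; both choices, and all function names, are illustrative.

```python
import math
from statistics import median


def joint_pair_distances(frame):
    """Euclidean distance between every pair of joints in one frame.

    `frame` is a list of joints, each a sequence of C standardized coordinates.
    Returns a square nested list with zeros on the diagonal.
    """
    return [[math.dist(p, q) for q in frame] for p in frame]


def robust_scale(sequence, eps=1e-6):
    """ASSUMPTION: reference scale = median of all i < j distances in the sequence."""
    values = []
    for frame in sequence:
        d = joint_pair_distances(frame)
        v = len(frame)
        values.extend(d[i][j] for i in range(v) for j in range(i + 1, v))
    return max(median(values), eps)


def saturate(dist, scale):
    """ASSUMPTION: smooth saturation tanh(d / scale), mapping [0, inf) into [0, 1)."""
    return [[math.tanh(d / scale) for d in row] for row in dist]


def normalized_distance_sequence(sequence):
    """Per-frame joint-pair distance distribution, as sketched for step (2-1)."""
    s = robust_scale(sequence)
    return [saturate(joint_pair_distances(f), s) for f in sequence]


# Usage: a 3-4-5 right triangle of joints; the median distance is 4.
frame = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
print(normalized_distance_sequence([frame])[0])
```

Dividing by a median-based scale makes the normalized distances insensitive to body size, and `tanh` keeps a few unusually long distances from dominating later encoding stages.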
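Steps (2-2) to (2-4) can be illustrated by soft-quantizing the normalized distances with a bank of Gaussian RBF kernels. The sketch assumes evenly spaced kernel centers on [0, 1], a single bandwidth equal to the bandwidth adjustment coefficient `lam` times the center spacing, and soft assignments normalized across kernels; the kernel formulas of claims 4 and 5 are not recoverable from this text. The input is a list of already-normalized per-frame distance matrices and a list of bone segments `(i, j)`; all names are hypothetical.

```python
import math


def kernel_bank(num_kernels, lam=1.0):
    """ASSUMPTION: K centers evenly spaced on [0, 1]; bandwidth = lam * spacing."""
    if num_kernels < 2:
        raise ValueError("need at least two kernels")
    spacing = 1.0 / (num_kernels - 1)
    centers = [k * spacing for k in range(num_kernels)]
    return centers, lam * spacing


def soft_spectrum(values, centers, sigma):
    """Soft quantization: each value is softly assigned to the kernels (weights
    sum to 1), and the assignments are averaged into a K-bin soft histogram."""
    hist = [0.0] * len(centers)
    for x in values:
        resp = [math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) for mu in centers]
        total = sum(resp)
        for k, r in enumerate(resp):
            hist[k] += r / total
    return [h / len(values) for h in hist]


def frame_attribute(norm_dist, bones, centers, sigma):
    """Frame-level attribute vector: [global spectrum | bone-segment spectrum]."""
    v = len(norm_dist)
    pairs = [norm_dist[i][j] for i in range(v) for j in range(i + 1, v)]
    segs = [norm_dist[i][j] for i, j in bones]
    return soft_spectrum(pairs, centers, sigma) + soft_spectrum(segs, centers, sigma)


def geometric_prompt(norm_dists, bones, num_kernels=8, lam=1.0):
    """Sequence-level prompt: time average of the frame-level attribute vectors."""
    centers, sigma = kernel_bank(num_kernels, lam)
    frames = [frame_attribute(d, bones, centers, sigma) for d in norm_dists]
    return [sum(col) / len(frames) for col in zip(*frames)]


# Usage: two identical frames, three joints, bones (0, 1) and (1, 2).
nd = [[0.0, 0.5, 0.9], [0.5, 0.0, 0.2], [0.9, 0.2, 0.0]]
print(geometric_prompt([nd, nd], bones=[(0, 1), (1, 2)], num_kernels=4))
```

Because every soft assignment sums to 1, each half of the resulting 2K-dimensional prompt is a histogram over distance bins, which is one way a prompt entry could be read back as a geometric explanation (for example, mass concentrated in short-distance bins).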
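Claim 6 normalizes the prompt, embeds it with a two-layer perceptron, and lets a FiLM condition generator produce the modulation parameters. The dependency-free sketch below assumes L2 normalization in (3-1), a ReLU hidden layer, randomly initialized untrained weights, and a scale parameterized as 1 plus a learned offset so that an untrained generator starts near the identity. `FiLMConditioner` and `film` are hypothetical names; features are stored channel-first, one row per channel.

```python
import math
import random


def l2_normalize(v, eps=1e-8):
    """(3-1) ASSUMED normalization of the sequence-level prompt: unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / (n + eps) for x in v]


def linear(x, w, b):
    """y = W x + b, with W stored as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(v):
    return [max(0.0, x) for x in v]


def init_layer(rows, cols, rng, std=0.1):
    """Random Gaussian weights and zero bias (untrained, for illustration only)."""
    w = [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]
    return w, [0.0] * rows


class FiLMConditioner:
    """Hypothetical prompt -> (gamma, beta) generator following steps (3-1) to (3-3)."""

    def __init__(self, prompt_dim, hidden_dim, embed_dim, channels, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_layer(hidden_dim, prompt_dim, rng)
        self.w2, self.b2 = init_layer(embed_dim, hidden_dim, rng)
        self.wg, self.bg = init_layer(channels, embed_dim, rng)
        self.wb, self.bb = init_layer(channels, embed_dim, rng)

    def embed(self, prompt):
        a = l2_normalize(prompt)                  # (3-1) attribute prompt vector
        h = relu(linear(a, self.w1, self.b1))     # (3-2) layer 1 with nonlinearity
        return linear(h, self.w2, self.b2)        # (3-2) layer 2: reduced embedding

    def __call__(self, prompt):
        e = self.embed(prompt)
        gamma = [1.0 + d for d in linear(e, self.wg, self.bg)]  # (3-3) scale near 1
        beta = linear(e, self.wb, self.bb)                      # (3-3) offset
        return gamma, beta


def film(features, gamma, beta):
    """(3-4) Channel-wise affine modulation: out[c] = gamma[c] * features[c] + beta[c]."""
    return [[g * x + b for x in row] for row, g, b in zip(features, gamma, beta)]


# Usage: a 16-dim prompt modulating 3 feature channels of length 2.
cond = FiLMConditioner(prompt_dim=16, hidden_dim=8, embed_dim=4, channels=3)
gamma, beta = cond([0.1] * 16)
print(film([[1.0, 2.0]] * 3, gamma, beta))
```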
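Claims 8 and 9 pair a graph convolution with FiLM-style scaling and offsets. The sketch below assumes the symmetric normalization D^(-1/2)(A + I)D^(-1/2) used in standard GCNs, one weight matrix `w` per channel type, and a temporal adjacency that links each frame only to its immediate neighbours; none of these details is given in the claims as reproduced here, and all function names are hypothetical. Features are indexed as `x[t][v][c]`.

```python
import math


def normalize_adjacency(adj):
    """ASSUMED GCN normalization D^-1/2 (A + I) D^-1/2."""
    n = len(adj)
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def chain_adjacency(n):
    """Hypothetical temporal adjacency: frame t linked to frames t - 1 and t + 1."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def spatial_channel(x, a_spa, w, gamma, beta):
    """Mix joints within each frame via the skeleton graph, then apply FiLM."""
    a_hat = normalize_adjacency(a_spa)
    out = []
    for frame in x:                                   # frame: V x C
        h = matmul(matmul(a_hat, frame), w)           # V x C'
        out.append([[g * v + b for v, g, b in zip(row, gamma, beta)] for row in h])
    return out


def temporal_channel(x, w, gamma, beta):
    """Mix frames along each joint's trajectory via the temporal graph, then FiLM."""
    t_len, v_len = len(x), len(x[0])
    a_hat = normalize_adjacency(chain_adjacency(t_len))
    out = [[None] * v_len for _ in range(t_len)]
    for v in range(v_len):
        seq = [x[t][v] for t in range(t_len)]         # T x C trajectory of joint v
        h = matmul(matmul(a_hat, seq), w)             # T x C'
        for t in range(t_len):
            out[t][v] = [g * val + b for val, g, b in zip(h[t], gamma, beta)]
    return out


# Usage: one frame, two connected joints, one channel.
print(spatial_channel([[[1.0], [3.0]]], [[0.0, 1.0], [1.0, 0.0]], [[1.0]], [2.0], [1.0]))
```

Keeping the two channels as separate functions mirrors the claim structure; in a trained model the `gamma` and `beta` of each channel would come from the FiLM condition generator rather than being passed in by hand.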
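Claim 7 names momentum encoders and a cross-domain contrastive loss but does not spell out the loss. The sketch below assumes a MoCo-style exponential moving average for the momentum update and an InfoNCE loss with cosine similarity and temperature `tau`. How the spatial, temporal and global terms are paired in `cross_domain_loss` is a hypothetical choice for illustration, not the claimed formula.

```python
import math


def ema_update(momentum_params, online_params, m=0.999):
    """Momentum update theta_k <- m * theta_k + (1 - m) * theta_q, element-wise."""
    return [m * pk + (1.0 - m) * pq for pk, pq in zip(momentum_params, online_params)]


def cosine(a, b, eps=1e-8):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + eps)


def info_nce(queries, keys, tau=0.1):
    """InfoNCE over a batch: keys[i] is the positive for queries[i];
    all other keys in the batch act as negatives."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / tau for k in keys]
        mx = max(logits)
        log_sum = mx + math.log(sum(math.exp(z - mx) for z in logits))  # stable
        total += log_sum - logits[i]
    return total / len(queries)


def cross_domain_loss(z_spa, z_tem, z_glob, k_spa, k_tem, k_glob, tau=0.1):
    """HYPOTHETICAL pairing: spatial queries vs temporal momentum keys, temporal
    queries vs spatial momentum keys, and global queries vs global momentum keys."""
    return (info_nce(z_spa, k_tem, tau)
            + info_nce(z_tem, k_spa, tau)
            + info_nce(z_glob, k_glob, tau)) / 3.0


# Usage: matched pairs give a small loss, swapped pairs a large one.
eye = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(eye, eye), info_nce(eye, [[0.0, 1.0], [1.0, 0.0]]))
```

Contrasting each online view against the momentum representation of the other domain is what makes the loss "cross-domain" in this sketch: a spatial embedding is pulled toward the temporal embedding of the same sample and pushed away from those of other samples.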

Description

Behavior recognition method and system based on attribute-aware semantic collaborative network

Technical Field

The invention relates to the technical field of artificial intelligence and computer vision, and in particular to a behavior recognition method and system based on an attribute-aware semantic collaborative network.

Background

Human behavior recognition is widely applied in fields such as medical rehabilitation and sports, and existing methods achieve high recognition accuracy. In high-reliability application scenarios, however, they remain insufficiently interpretable: a conventional model outputs only the action category and cannot explain why that decision was made, so it lacks both interpretability and semantic transparency. In rehabilitation training, for example, a doctor cannot learn the specific basis on which the model judged a posture to be incorrect.

In the prior art (Lu, Mingqi, Xiaobo Lu, and Jun Liu. "Momentum Contrastive Teacher for Semi-Supervised Skeleton Action Recognition." IEEE Transactions on Image Processing, 2025), a student-teacher dual encoder is designed: positive sample pairs are generated through skeleton-cloud colorization data augmentation, spatiotemporal features are extracted along two pathways (a spatial GCN and a temporal convolution), and both classification accuracy and feature clustering are optimized. Chinese patent publication CN120708290A discloses a spatiotemporally decoupled human behavior recognition method based on a dynamic semantic-guided mask: the skeleton sequence is masked under semantic guidance in the spatial and temporal dimensions, a recognition model is obtained by training an encoder with cross-domain contrastive learning, and the model finally outputs a human behavior prediction for an input video.

The above prior art, together with behavior recognition methods that combine self-supervised contrastive learning with spatiotemporally decoupled networks (e.g. SkeleMoCLR, ActCLR), has improved recognition accuracy but still has the following shortcomings: (1) Missing geometry-semantics mapping: the model establishes no explicit mapping between geometric features, such as joint-pair distances and the structural relations among bone segments, and action semantics, so its decisions cannot be interpreted. (2) No collaborative regulation of spatial and temporal information: the spatial branch models local relations over a preset skeleton topology, and the temporal branch models dynamic changes between adjacent frames; the two usually extract features separately and fuse them afterwards, without jointly regulating spatial structure information and temporal dynamic information around a unified geometric attribute prompt (for example, how the rate of change of a joint angle over time affects the decision). (3) Missing interpretation-generation mechanism: existing methods generally output only the behavior category and provide no mechanism for outputting the skeletal geometric basis corresponding to the recognition result.

A new skeleton-based behavior recognition method is therefore needed that encodes the statistical distribution of skeletal geometry into a geometry-aware prompt, modulates the skeletal spatiotemporal features with that prompt, and jointly models spatial and temporal semantics, so that the model outputs the geometric interpretation basis corresponding to the recognition result together with the behavior category, thereby improving the interpretability, traceability and application reliability of behavior recognition.
Disclosure of Invention

The invention provides a behavior recognition method and system based on an attribute-aware semantic collaborative network that achieve behavior recognition with both high recognition performance and interpretability. The technical solution of the invention is as follows: a behavior recognition method based on an attribute-aware semantic collaborative network comprises the following steps: (1) extracting a sequence of human skeleton key-point coordinates from a video containing human behavior, and performing coordinate alignment and scale normalization on the sequence to obtain a standardized skeleton sequence; (2) extracting geometric statistical information from the standardized skeleton sequence and constructing a geometry-aware prompt vector; (3) obtaining modulation parameters from the geometry-aware prompt vector, and linearly modulating the spatiotemporal features extracted by the backbone network according to the modulation parameters to obtain attribute-modulated spatiotemporal features; (4) inputting the attribute-modulated spatiotemporal features into the attribute-aware semantic collaborative network, performing collaborat