CN-120612732-B - Human behavior recognition method based on fusion of multi-modal data acquired by unmanned aerial vehicle
Abstract
The invention provides a human behavior recognition method based on fusion of multi-modal data acquired by an unmanned aerial vehicle. The method splices joint tokens with the spatial CLS token of the same frame after feature extraction by a spatial Transformer, obtaining a feature fusion module; structurally encodes the joint point sequence based on the human anatomical structure, constructing a human body topology association module; constructs a temporal cross-Transformer module; and stacks the spatial Transformer, the feature fusion module, the human body topology association module and the temporal cross-Transformer module into a backbone feature extraction network. A 10-layer backbone feature extraction network performs deep feature extraction on the input early-fusion data and feature-fusion data, and a classification head maps the features output by the backbone network to a dimension equal to the number of behavior categories, outputting a prediction score for each category, from which the human behavior recognition result is obtained. The method can effectively recognize human behaviors in multi-modal data collected by unmanned aerial vehicles.
Inventors
- JI XIAOFEI
- TIAN SHUWEN
- SONG YIFENG
Assignees
- Shenyang Aerospace University (沈阳航空航天大学)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2025-05-27
Claims (5)
- 1. A human behavior recognition method based on fusion of multi-modal data acquired by an unmanned aerial vehicle, characterized by comprising the following steps:
  S1, acquiring brightness weights for the RGB data and the infrared data through a brightness decoupling and perception module, embedding the brightness weights into the respective input sequences, and then performing early fusion of the two modalities as the network input;
  S2, performing multi-frame encoding of the joint point sequence through a multi-layer perceptron to generate joint tokens, and splicing the joint tokens with the spatial CLS token of the same frame after feature extraction by a spatial Transformer, to obtain a feature fusion module;
  S3, structurally encoding the joint point sequence based on the human anatomical structure and adding the result to the attention matrix after feature fusion, to construct a human body topology association module;
  S4, grouping the fused features of S3 along the time dimension, applying average pooling and maximum pooling respectively to generate a temporal difference correlation matrix, and constructing a temporal cross-Transformer module;
  S5, stacking the spatial Transformer, the feature fusion module, the human body topology association module and the temporal cross-Transformer module to obtain a backbone feature extraction network;
  S6, performing deep feature extraction on the input early-fusion data and feature-fusion data with a 10-layer backbone feature extraction network, classifying with a classification head that maps the features output by the backbone network to a dimension equal to the number of behavior categories, outputting a prediction score for each category, and obtaining the human behavior recognition result from the prediction scores;
  S1 specifically comprises:
  S11, converting the RGB data into the corresponding HSL color space, decoupling the HSL channels, retaining the lightness channel L, and stacking the three RGB channels with the corresponding lightness channel L to obtain the input data of the brightness perception network;
  S12, dividing the input data obtained in S11 into strong and weak classes according to brightness value, labeling the data, feeding it into a brightness perception network formed by two 3×3 convolutional layers and an average pooling layer to extract brightness features, and obtaining strong/weak classification probabilities in the interval [0, 1] through two fully connected layers and a Softmax function;
  S13, taking the strong-class Softmax value obtained in S12 as a continuous brightness value and feeding it into a gating function; exploiting the continuity of the Softmax probability, the gating function, which introduces an exponential function, constructs a nonlinear weight curve and realizes adaptive weight assignment, the gating function being shown in formula (1) (a code sketch of this gating appears as the first sketch after the claims):
  g(s) = 1 / (1 + e^(−τ(s − 0.5)))  (1)
  where s is the continuous brightness value, τ is a smooth transition coefficient that adjusts the weights in the brightness transition region, the output g(s) of the gating function is the weight of the RGB modality, and the infrared modality weight is 1 − g(s).
- 2. The human behavior recognition method based on fusion of multi-modal data acquired by an unmanned aerial vehicle according to claim 1, wherein in S2 the joint point sequence is multi-frame encoded by the multi-layer perceptron to generate joint tokens as follows: human joint point data are extracted from the RGB sequence and the infrared sequence with a human pose estimation algorithm, and the multi-frame data are stacked in temporal order to generate the joint point sequence; the joint point sequence is then multi-frame encoded by the multi-layer perceptron, generating joint tokens with the same feature dimension as the spatial CLS token (see the second sketch after the claims).
- 3. The human behavior recognition method based on fusion of multi-modal data acquired by an unmanned aerial vehicle of claim 1, wherein S3 specifically comprises:
  S31, abstracting the joint points into nodes of a graph structure according to the topology of the human skeleton, assigning a distance value of 1 to the edges between adjacent joints, and computing the distance between non-adjacent joints by accumulating shortest paths, to generate a physical distance matrix D whose element D_ij represents the anatomical distance between joints i and j;
  S32, taking the physical distance matrix D generated in S31 as prior knowledge and assigning to each distance value d a learnable scalar parameter θ_d, generating a parameter vector θ; the elements of the structural coding matrix B are generated by table lookup, B_ij = θ[D_ij], so that the structural coding matrix B indexes into the parameter vector θ by the anatomical distance of the corresponding joint pair, and training automatically adjusts the importance of each distance component in θ;
  S33, in the standard self-attention mechanism, computing the attention weight matrix through the dot product of Query and Key to obtain semantic similarity; after the structural coding matrix B is introduced, it is superimposed directly on the dot-product result as a bias term to correct the attention weight computation; denoting by A_ij each element of the attention score matrix A, the human-body-topology-associated attention weight is computed as in formula (2) (see the third sketch after the claims):
  A_ij = (Q_i · K_j) / √d_k + B_ij  (2)
  where Q_i and K_j denote the i-th element of the query matrix Q and the j-th element of the key matrix K respectively, Q_i · K_j reflects the content relevance of the feature vectors of joints i and j, d_k is the dimension of the key matrix, whose square root scales the dot-product result, and B_ij of the structural coding matrix B is superimposed directly into the attention score as a bias term, so that the model is subject to the rigid constraints of the anatomical structure while attending to semantic similarity.
- 4. The human behavior recognition method based on fusion of multi-modal data acquired by an unmanned aerial vehicle of claim 1, wherein S4 comprises:
  S41, compressing the input feature channels to C' and transposing the time dimension to the front, to obtain the spatio-temporally separated time-dimension features F ∈ R^(T×C'×N), where T is the time dimension, C' is the number of channels after dimensionality reduction, and N is the number of human joints;
  S42, dividing the time-dimension features F obtained in S41 into two groups F1 and F2 along the channel dimension, the channel dimension C' being split equally so that each group has C'/2 channels and F1, F2 ∈ R^(T×(C'/2)×N); applying average pooling to F1 and maximum pooling to F2 respectively to generate a global temporal trend vector g and a local saliency vector m:
  g = σ(FC(GAP(F1))), m = σ(FC(GMP(F2)))
  where GAP and GMP denote the global average pooling and global maximum pooling operations that compress the features of each time frame into a scalar, FC is a fully connected layer that adjusts the feature distribution, and σ is the Sigmoid operation that normalizes the weights to [0, 1];
  S43, performing a difference-attention dot-product operation on the global temporal trend vector g and the local saliency vector m obtained in S42 to generate a difference perception matrix, the calculation being shown in formula (3) (see the fourth sketch after the claims):
  M = (g − m)(g − m)^T  (3)
  where (g − m) is the difference perception term, which is automatically expanded into the matrix M representing the inter-frame difference weights.
- 5. The human behavior recognition method based on fusion of multi-modal data acquired by an unmanned aerial vehicle according to claim 1, wherein the classification head in S6 consists of two fully connected layers and a Softmax function (see the final sketch below).
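The luminance-gated early fusion of claim 1 (S11–S13) can be sketched in a few lines of PyTorch. This is a minimal sketch, not the patented implementation: the class name LuminanceGate, the layer widths, the 3-channel infrared input and the sigmoid-shaped exponential gate centered at the strong/weak boundary s = 0.5 are all assumptions; the claim fixes only the two 3×3 convolutional layers, the average pooling, the two fully connected layers with Softmax, and the weights w and 1 − w for the two modalities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LuminanceGate(nn.Module):
    """Brightness perception network and gating of claim 1, S11-S13 (sketch)."""
    def __init__(self, tau: float = 10.0):
        super().__init__()
        # S12: two 3x3 convolutional layers followed by average pooling.
        self.features = nn.Sequential(
            nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # S12: two fully connected layers produce the strong/weak logits.
        self.classifier = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 2))
        self.tau = tau  # smooth transition coefficient (assumed symbol and value)

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        # S11: HSL lightness channel L = (max(R,G,B) + min(R,G,B)) / 2, stacked with RGB.
        l = 0.5 * (rgb.amax(dim=1, keepdim=True) + rgb.amin(dim=1, keepdim=True))
        x = torch.cat([rgb, l], dim=1)                             # (B, 4, H, W)
        # S13: strong-class Softmax probability as the continuous brightness value s.
        s = F.softmax(self.classifier(self.features(x)), dim=1)[:, :1]
        w = torch.sigmoid(self.tau * (s - 0.5)).view(-1, 1, 1, 1)  # gate of formula (1)
        # Early fusion: RGB weighted by w, infrared by (1 - w); ir assumed (B, 3, H, W).
        return w * rgb + (1.0 - w) * ir
```

During training, the strong/weak labels of S12 would supervise the classifier with a standard cross-entropy loss; that loop is omitted here.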
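The feature fusion module of claim 2 / step S2 amounts to an MLP over stacked multi-frame joint coordinates followed by a per-frame splice with the spatial CLS token. A minimal sketch under assumed shapes: JointTokenFusion, the 2-D joint coordinates and the 192-dimensional token width are illustrative choices, not taken from the patent.

```python
import torch
import torch.nn as nn

class JointTokenFusion(nn.Module):
    """Joint-token generation and CLS splice of claim 2 / S2 (sketch)."""
    def __init__(self, n_joints: int, coord_dim: int = 2, token_dim: int = 192):
        super().__init__()
        # Multi-layer perceptron encoding one frame's joints into a token with
        # the same dimension as the spatial CLS token.
        self.mlp = nn.Sequential(
            nn.Linear(n_joints * coord_dim, token_dim), nn.GELU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, joints: torch.Tensor, cls_tokens: torch.Tensor) -> torch.Tensor:
        # joints: (B, T, N, coord_dim), stacked in temporal order from pose estimation.
        # cls_tokens: (B, T, token_dim), the spatial CLS token of each frame.
        b, t, n, c = joints.shape
        joint_tok = self.mlp(joints.reshape(b, t, n * c))   # (B, T, token_dim)
        return torch.cat([cls_tokens, joint_tok], dim=-1)   # spliced per frame
```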
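Formula (2) of claim 3 is a dot-product attention score with a learnable, anatomy-indexed bias. The sketch below builds the distance matrix D with shortest paths and looks the bias up from the parameter vector θ; the edge-list input, the module name StructuralBiasAttention and the use of networkx are assumptions for illustration.

```python
import torch
import torch.nn as nn
import networkx as nx  # used only to compute shortest-path joint distances (S31)

def distance_matrix(edges: list[tuple[int, int]], n_joints: int) -> torch.Tensor:
    # Adjacent joints have distance 1; non-adjacent distances accumulate along
    # shortest paths, giving the physical distance matrix D (S31).
    g = nx.Graph(edges)
    d = dict(nx.all_pairs_shortest_path_length(g))
    return torch.tensor([[d[i][j] for j in range(n_joints)] for i in range(n_joints)])

class StructuralBiasAttention(nn.Module):
    """Human-body-topology-associated attention of claim 3 / formula (2) (sketch)."""
    def __init__(self, dim: int, dist: torch.Tensor):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        # S32: one learnable scalar per distance value; B is built by table lookup.
        self.theta = nn.Parameter(torch.zeros(int(dist.max()) + 1))
        self.register_buffer("dist", dist)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, N, dim) joint tokens
        q, k = self.q(x), self.k(x)
        scores = q @ k.transpose(-2, -1) * self.scale      # semantic similarity term
        bias = self.theta[self.dist]                       # B_ij = theta[D_ij]
        return torch.softmax(scores + bias, dim=-1)        # formula (2) plus softmax
```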
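The temporal-difference branch of claim 4 (S41–S43) follows the pool → FC → Sigmoid pipeline stated in the claim. Because formula (3) is only partially recoverable from the text, the outer product of the difference term (g − m) used below to expand it into a T × T matrix is an assumption, as are the module name TemporalDifference and the tensor layout.

```python
import torch
import torch.nn as nn

class TemporalDifference(nn.Module):
    """Difference perception matrix of claim 4, S41-S43 (sketch)."""
    def __init__(self, channels: int, reduced: int):
        super().__init__()
        self.reduce = nn.Linear(channels, reduced)  # S41: compress channels to C'
        self.fc_g = nn.Linear(1, 1)                 # FC adjusting the averaged features
        self.fc_m = nn.Linear(1, 1)                 # FC adjusting the max-pooled features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, T, C) joint features; bring the time dimension forward (S41).
        f = self.reduce(x).permute(0, 2, 3, 1)      # (B, T, C', N)
        f1, f2 = f.chunk(2, dim=2)                  # S42: split C' into two equal groups
        # GAP / GMP compress each time frame's features to a scalar, then FC + Sigmoid.
        g = torch.sigmoid(self.fc_g(f1.mean(dim=(2, 3)).unsqueeze(-1))).squeeze(-1)
        m = torch.sigmoid(self.fc_m(f2.amax(dim=(2, 3)).unsqueeze(-1))).squeeze(-1)
        diff = g - m                                # difference perception term, (B, T)
        return diff.unsqueeze(-1) @ diff.unsqueeze(-2)  # (B, T, T) inter-frame weights
```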
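Claim 5 fixes the classification head as two fully connected layers and a Softmax, which step S6 uses to map backbone features to one score per behavior category. A minimal sketch; the hidden width of 256 is an assumption.

```python
import torch
import torch.nn as nn

def classification_head(feat_dim: int, num_classes: int, hidden: int = 256) -> nn.Sequential:
    # Two fully connected layers + Softmax (claim 5): one score per category (S6).
    return nn.Sequential(
        nn.Linear(feat_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, num_classes),
        nn.Softmax(dim=-1),
    )

# The recognized behavior is the category with the highest prediction score:
scores = classification_head(192, 10)(torch.randn(4, 192))
pred = scores.argmax(dim=-1)
```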
Description
Human behavior recognition method based on fusion of multi-modal data acquired by unmanned aerial vehicle

Technical Field

The invention relates to the technical field of computer vision and discloses a human behavior recognition method based on fusion of multi-modal data acquired by an unmanned aerial vehicle.

Background

Human behavior recognition is an important research direction in computer vision and is widely applied in security, rescue, traffic and other fields. Although data collected by conventional fixed equipment can achieve high recognition accuracy, such equipment suffers from a single viewing angle, poor environmental adaptability and weak model generalization, and is expensive, fixed in position and difficult to deploy at scale. Unmanned aerial vehicles, by contrast, are low-cost and highly maneuverable and can carry various acquisition devices to collect multi-modal data; the data cover diverse environments and are information-rich and complementary, and fusion processing can significantly improve the robustness and generalization of a model, making the approach better suited to practical application. Human behavior recognition based on fusion of multi-modal data collected by unmanned aerial vehicles is therefore attracting increasing attention.

Current human behavior recognition methods based on multi-modal data fusion fall into three categories: data-level fusion, feature-level fusion and decision-level fusion. Data-level fusion aligns the data of different modalities in the preprocessing stage and then performs operations such as splicing and stacking, preserving the integrity and authenticity of the raw data. Addressing the difficulty of single-modality Transformer input in jointly modeling spatial and temporal features, the prior art document Jing Y, Wang F. TP-ViT: A two-pathway vision transformer for video action recognition [C]// IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022: 2185-2189 proposes a two-pathway cooperative framework with multi-modal input. The architecture splices serialized skeleton tokens generated from human joint points with two-pathway RGB vision tokens at the input layer, constructing a multi-modal embedding that is fed into a two-pathway Transformer. The two pathways work cooperatively in parallel: the slow pathway uses a high-resolution (224×224), low-frame-rate (8 frames) configuration to capture fine static spatial detail, while the fast pathway uses a low-resolution (112×112), high-frame-rate (32 frames) configuration to efficiently extract dynamic temporal information; the two pathways share encoder parameters to reduce computational redundancy. However, data splicing achieves only shallow interaction and does not fully mine deep cross-modal semantic associations, and while parameter sharing reduces computational redundancy, it limits the learning of modality-specific features and thereby weakens the collaborative modeling of spatio-temporal relationships.
Feature-level fusion realizes multi-modal feature interaction in the intermediate layers of a neural network, integrating the deep semantic associations of different modalities through feature splicing, mapping or attention mechanisms to construct a joint feature representation, and can effectively capture cross-modal complementary information. To improve the recognition of fine-grained actions, the prior art document Das S, Sharma S, Dai R, et al. VPN: Learning video-pose embedding for activities of daily living [C]// European Conference on Computer Vision. Cham: Springer International Publishing, 2020: 72-90 proposes a video-pose embedding network (VPN) that achieves cross-modal feature complementarity by jointly modeling RGB video and 3D skeleton data. For the joint-point stream, a graph convolutional network (GCN) models the topological relations of human joints to generate high-level semantic pose features. A spatial embedding module maps the two types of features into a unified semantic space to enhance modality alignment and interaction, and a normalized Euclidean loss is introduced to optimize embedding consistency. However, the superposition of the graph convolutional network and the attention mechanism leaves the model insufficiently integrated and increases the number of redundant parameters in the network structure, while feature fusion that relies only on the spatial embedding mapping limits the exploitation of complementary information. Decision-level fusion performs feature extraction and classification independently on each modality's data and integrates the decision results at the model output layer through weighted averaging, a voting mechanism or a probability