CN-116386147-B - Human interactive behavior recognition method and system based on multi-view comparison

Abstract

The invention belongs to the field of computer vision and provides a human interactive behavior recognition method and system based on multi-view comparison. The method acquires the position information of the human joints in each frame of video data and constructs a skeleton space-time graph from that information. Based on the skeleton space-time graph, a graph convolutional neural network adaptively deletes edges or nodes to construct enhanced views with nodes or edges deleted, which alleviates the problem of unevenly distributed data. Using the information bottleneck principle, the difference between each enhanced view and the original skeleton space-time graph is increased while the information relevant to the behavior recognition task is maximized, so that each view retains the minimal sufficient information for the task. This yields a multi-view representation, from which the human interactive behavior recognition result is obtained. Multi-view representations of interactive behavior can thereby be learned better from different aspects.
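
As an illustration of the skeleton space-time graph described above, the following minimal Python sketch builds a 0/1 adjacency matrix for a toy skeleton. The function name `build_st_adjacency`, the node-indexing scheme, and the three-joint chain are assumptions made for illustration and are not taken from the patent. Spatial edges follow a list of bones within each frame, and temporal edges connect the same joint in adjacent frames.

```python
from typing import List, Tuple


def build_st_adjacency(num_joints: int,
                       bones: List[Tuple[int, int]],
                       num_frames: int) -> List[List[int]]:
    """Return the n x n 0/1 adjacency matrix of a skeleton space-time graph.

    Assumed indexing: joint j in frame t is node t * num_joints + j,
    so n = num_joints * num_frames.
    """
    n = num_joints * num_frames
    adj = [[0] * n for _ in range(n)]
    for t in range(num_frames):
        base = t * num_joints
        # Spatial edges: natural bone connections inside one frame.
        for a, b in bones:
            adj[base + a][base + b] = 1
            adj[base + b][base + a] = 1
        # Temporal edges: the same joint in adjacent frames.
        if t + 1 < num_frames:
            for j in range(num_joints):
                u, v = base + j, base + num_joints + j
                adj[u][v] = 1
                adj[v][u] = 1
    return adj


# Hypothetical toy skeleton: a 3-joint chain (0-1-2) observed over 2 frames.
toy_bones = [(0, 1), (1, 2)]
adj = build_st_adjacency(3, toy_bones, 2)
```

For a real skeleton, the bone list would come from the joint layout of the pose estimator or sensor in use. For two interacting people, the joints of both bodies would be indexed in the same graph.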

Inventors

LV LEI
PANG CHEN
GENG PEI

Assignees

Shandong Normal University

Dates

Publication Date: 20260508
Application Date: 20230522

Claims (9)

1. A human interactive behavior recognition method based on multi-view comparison, characterized by comprising the following steps: acquiring position information of the human joints in each frame of video data; constructing a skeleton space-time graph based on the position information of the human joints in each frame; based on the skeleton space-time graph, adaptively deleting edges or nodes of the skeleton space-time graph through a graph convolutional neural network, and constructing enhanced views with nodes or edges deleted; wherein constructing the enhanced views comprises: learning, at each layer of the graph convolutional neural network, which nodes may be deleted, and creating a node deletion view after masking the influential nodes; and, at the same time, learning, at each layer of the graph convolutional neural network, which edges may be deleted, and creating an edge deletion view after filtering out noisy edges; wherein a node of the skeleton space-time graph that is to be deleted is replaced with the representation of its local subgraph, so as to obscure its original representation while retaining the corresponding edges; and wherein a multi-layer perceptron controls the parameter that determines whether a node is masked, the parameter indicating whether the i-th node of the l-th layer requires masking; adopting the information bottleneck principle to increase the difference between the enhanced views and the original skeleton space-time graph while maximizing the information relevant to the behavior recognition task, so that each view retains the minimal sufficient information for the behavior recognition task, thereby obtaining a multi-view representation; and classifying based on the obtained multi-view representation to obtain a human interactive behavior recognition result.
2. The human interactive behavior recognition method based on multi-view comparison according to claim 1, wherein constructing the skeleton space-time graph based on the position information of the human joints in each frame specifically comprises: in the spatial dimension, determining the spatial position of each joint according to its coordinate information in each frame, and then drawing the corresponding edges according to the natural structure of the human body to obtain a spatial topological graph of the skeleton sequence; and, after the spatial topological graph of the skeleton sequence has been constructed, connecting the nodes that represent the same joint in adjacent frames to form the skeleton space-time graph.
3. The human interactive behavior recognition method based on multi-view comparison according to claim 1, wherein the expression for creating the node deletion view after masking the influential nodes is defined in terms of the following quantities: the i-th node in the l-th network layer; a mask drawn from a parameterized Bernoulli distribution, which indicates whether that node is retained; and the set of edges.
4. The human interactive behavior recognition method based on multi-view comparison according to claim 1, wherein adopting the information bottleneck principle to increase the difference between the enhanced view and the original skeleton space-time graph while maximizing the information relevant to the behavior recognition task specifically comprises: minimizing the mutual information between the enhanced view and the original graph through a negative contrastive learning loss, so as to remove the redundant information in each view while retaining the remaining information.
5. The human interactive behavior recognition method based on multi-view comparison according to claim 1, wherein the skeleton space-time graph is input to the graph convolutional neural network in the form of an adjacency matrix of size n × n, where n is the number of nodes in the graph; if two nodes are related, the element at the corresponding position in the adjacency matrix is 1, and otherwise it is 0.
6. The human interactive behavior recognition method based on multi-view comparison according to claim 1, wherein a negative contrastive learning loss is adopted to minimize the mutual information between the enhanced view and the original view, so as to remove the redundant information in each view while retaining the remaining information, the corresponding expression being defined in terms of the following quantities: the original view; the enhanced view; the BPR loss, by which the difference between the enhanced view and the original view is maximized as far as possible; the mutual information between two enhanced views with nodes deleted; the mutual information between two enhanced views with edges deleted; a similarity measure between two vectors; two different enhanced views with nodes deleted; two different enhanced views with edges deleted; the vector representations of the corresponding nodes in the two different node-deleted enhanced views; and the vector representations of the corresponding edges in the two different edge-deleted enhanced views.
7. A human interactive behavior recognition system based on multi-view comparison, comprising: a joint information acquisition module, used for acquiring position information of the human joints in each frame of video data; a skeleton space-time graph construction module, used for constructing a skeleton space-time graph based on the position information of the human joints in each frame; an enhanced view construction module, used for adaptively deleting edges or nodes of the skeleton space-time graph through a graph convolutional neural network based on the skeleton space-time graph, and constructing enhanced views with nodes or edges deleted; wherein constructing the enhanced views comprises: learning, at each layer of the graph convolutional neural network, which nodes may be deleted, and creating a node deletion view after masking the influential nodes; and, at the same time, learning, at each layer of the graph convolutional neural network, which edges may be deleted, and creating an edge deletion view after filtering out noisy edges; wherein a node of the skeleton space-time graph that is to be deleted is replaced with the representation of its local subgraph, so as to obscure its original representation while retaining the corresponding edges; and wherein a multi-layer perceptron controls the parameter that determines whether a node is masked, the parameter indicating whether the i-th node of the l-th layer requires masking; a multi-view representation module, used for adopting the information bottleneck principle to increase the difference between the enhanced views and the original skeleton space-time graph while maximizing the information relevant to the behavior recognition task, so that each view retains the minimal sufficient information for the behavior recognition task, thereby obtaining a multi-view representation; and a behavior recognition module, used for classifying based on the obtained multi-view representation to obtain a human interactive behavior recognition result.
8. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the human interactive behavior recognition method based on multi-view comparison according to any one of claims 1-6.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the human interactive behavior recognition method based on multi-view comparison according to any one of claims 1-6.
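
The node deletion and edge deletion views recited in the claims can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the claimed implementation. In the claims the retention parameters are learned per layer by a multi-layer perceptron, whereas here they are passed in as fixed probabilities. The mean of the neighbor features stands in for the "representation of the local subgraph". The names `local_subgraph_repr`, `node_drop_view`, and `edge_drop_view` are hypothetical.

```python
import random
from typing import List

Vector = List[float]


def local_subgraph_repr(i: int, feats: List[Vector],
                        adj: List[List[int]]) -> Vector:
    """Mean of the neighbor features of node i (assumed stand-in for
    the local-subgraph representation)."""
    nbrs = [j for j, e in enumerate(adj[i]) if e]
    if not nbrs:
        return list(feats[i])
    dim = len(feats[i])
    return [sum(feats[j][d] for j in nbrs) / len(nbrs) for d in range(dim)]


def node_drop_view(feats: List[Vector], adj: List[List[int]],
                   keep_prob: List[float],
                   rng: random.Random) -> List[Vector]:
    """Sample a Bernoulli(keep_prob[i]) mask per node. A masked node gets
    its local-subgraph representation; all edges stay unchanged."""
    view = []
    for i, f in enumerate(feats):
        keep = rng.random() < keep_prob[i]
        view.append(list(f) if keep else local_subgraph_repr(i, feats, adj))
    return view


def edge_drop_view(adj: List[List[int]], keep_prob: List[List[float]],
                   rng: random.Random) -> List[List[int]]:
    """Keep each undirected edge with probability keep_prob[i][j];
    node features are not touched."""
    n = len(adj)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if adj[i][j] and rng.random() < keep_prob[i][j]:
                out[i][j] = out[j][i] = 1
    return out


# Toy chain 0-1-2 with hypothetical 2-d features and fixed probabilities.
rng = random.Random(0)
feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
node_view = node_drop_view(feats, adj, [1.0, 0.0, 1.0], rng)
edge_keep = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
edge_view = edge_drop_view(adj, edge_keep, rng)
```

With probabilities of exactly 1.0 and 0.0 the sampling is deterministic: node 1 is masked and takes the mean of nodes 0 and 2, and only the edge between nodes 1 and 2 is removed. In a trainable model the hard Bernoulli sample would need a differentiable relaxation, which this sketch does not attempt.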

Description

Human interactive behavior recognition method and system based on multi-view comparison

Technical Field

The invention belongs to the field of computer vision, and particularly relates to a human interactive behavior recognition method and system based on multi-view comparison.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

With the gradual spread of high-definition video surveillance, the volume of surveillance video data keeps growing. In the security field, especially in public places, people's behavior needs to be monitored in real time to prevent accidents. With the rapid development of computer vision technology, the accuracy of individual behavior recognition has improved greatly, but the recognition of human activities that involve complex relationships among multiple people has not been fully solved. In real life, many common behaviors are interactive, such as shaking hands, hugging, and fighting. Compared with single-person actions, interactive actions are more complex: completing them involves a greater variety of limb movements, and the changes among limbs are more diverse. Therefore, how to efficiently extract the features of interactive behavior and to model and analyze it is a challenging problem.

According to the source of the human motion data, human behavior recognition methods can be classified into three types: methods based on RGB video, methods based on depth maps, and methods based on skeleton sequences.
RGB video data provides the spatial and temporal information required for human behavior recognition, but it does not include the structural information of human motion distributed in three-dimensional space (such as the positions and angles of the joints or body parts and the relative relationships among them). It can only provide the two-dimensional spatial state of a person and is easily disturbed by factors such as complex backgrounds, illumination, and viewpoint changes, which reduces recognition accuracy. Compared with RGB video data, depth maps can provide information such as the distance between the viewpoint and the object, the three-dimensional coordinates of each joint or body part, and the contour and texture of the human body in three-dimensional space, and can separate the person from the background; however, depth-map methods require more memory and greater computing power. A skeleton sequence defines the human pose through the relative positions of the joints and represents the geometric structure of human motion patterns more faithfully. Compared with image features, skeleton features are more compact, describe human motion more specifically, and are less susceptible to changes in illumination and background.

In summary, the inventors found that the following technical problems exist in the prior art:

(1) Interaction noise. During interaction, sensor errors or occlusion often introduce noise, so that the relationships between the interacting body parts of the subjects, which are the key information for interaction recognition, cannot be modeled clearly.
Moreover, models based on graph convolution are sensitive to the quality of the input graph, which means that aggregating misleading neighborhood information may lead to suboptimal performance.

(2) Diversity and complexity of skeleton data. Different people may differ in height, body shape, posture, and manner of movement, and the same behavior may be performed at different speeds, amplitudes, and angles. Such unevenly distributed data tends to bias graph-convolution-based models toward data with one particular distribution or a few distributions, which hinders learning of the behavior representation.

Disclosure of Invention

In order to solve at least one technical problem in the background art, the invention provides a human interactive behavior recognition method and system based on multi-view comparison, which can learn whether to delete edges or nodes so as to convert the original skeleton graph into related views, integrate the different views into a compact representation for the downstream behavior recognition task, and jointly optimize with the downstream behavior recognition task in an end-to-end manner, thereby further improving the robustness of the model.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

The first aspect of the invention provides a human interactive beh