CN-121999527-A - Pedestrian gesture recognition method, device, equipment, storage medium and program product

CN 121999527 A

Abstract

The application discloses a pedestrian gesture recognition method, device, equipment, storage medium and program product in the technical field of image recognition. The method extracts multi-scale feature maps from consecutive video frames, combining the model's small-target recognition and detail-perception capability. Bidirectional path fusion then generates a set of enhanced feature maps that are semantically rich while retaining detail, avoiding a problem of traditional feature-extraction schemes: edge details such as hands and ankles are easily lost during downsampling, so the accuracy of small-scale pedestrian pose estimation drops sharply. Spatio-temporal feature-enhancement modeling of the enhanced feature maps avoids the loss of long-range dependencies caused by the limited local receptive field of purely convolutional networks, the weak occlusion-inference capability, and the inter-frame prediction jitter caused by a single-frame model's loss of temporal information.

Inventors

  • CHEN MING
  • WU JIE
  • TAO BAOQUAN
  • LI JING

Assignees

  • Hubei University of Arts and Science (湖北文理学院)

Dates

Publication Date
2026-05-08
Application Date
2025-12-31

Claims (10)

  1. A pedestrian gesture recognition method, characterized in that the method comprises: acquiring an input sequence of consecutive video frames, and extracting multi-scale features from the sequence to obtain a multi-scale feature map; performing bidirectional path fusion on the multi-scale feature map to obtain an enhanced feature map; performing feature-enhancement modeling on the enhanced feature map to obtain spatio-temporal enhancement features; and performing detection based on the spatio-temporal enhancement features to obtain a pedestrian gesture recognition result.
  2. The pedestrian gesture recognition method of claim 1, wherein the step of performing multi-scale feature extraction on the sequence of consecutive video frames to obtain a multi-scale feature map comprises: performing multi-scale feature extraction on the sequence through a cross-stage partial network to obtain a multi-scale feature map; the cross-stage partial network comprises a plurality of stacked cross-stage partial modules, each comprising a first branch, a second branch and a splicing layer; correspondingly, the step of extracting multi-scale features through the cross-stage partial network comprises: performing convolutional compression through the first branch to obtain first features; performing deep global feature extraction on the input sequence through a plurality of residual blocks of the second branch to obtain high-dimensional second features; splicing the first features and the second features through the splicing layer to obtain the output feature map of the cross-stage partial module; and concatenating the output feature maps of the cross-stage partial modules in series to obtain the multi-scale feature map.
  3. The pedestrian gesture recognition method of claim 1, wherein the step of performing bidirectional path fusion on the multi-scale feature map to obtain an enhanced feature map comprises: inputting the multi-scale feature map into a path aggregation network for semantic enhancement and detail enhancement to obtain an initial enhancement map; and performing multi-resolution enhancement on the initial enhancement map through a high-resolution network to obtain the enhanced feature map.
  4. The pedestrian gesture recognition method of claim 1, wherein the step of performing feature-enhancement modeling on the enhanced feature map to obtain spatio-temporal enhancement features comprises: performing spatial local-perception enhancement on the enhanced feature map to obtain local enhancement features; and performing joint spatio-temporal modeling of the local enhancement features to obtain the spatio-temporal enhancement features.
  5. The pedestrian gesture recognition method of claim 4, wherein the step of performing joint spatio-temporal modeling of the local enhancement features to obtain the spatio-temporal enhancement features comprises: in the spatial dimension, processing the local enhancement features through a shifted-window self-attention mechanism to obtain spatial attention weights; in the temporal dimension, adding a temporal position code to the local enhancement features and determining the temporal attention weight of the current frame from the local enhancement features of consecutive frames; and correcting the local enhancement features of the current frame based on the spatial and temporal attention weights to obtain the spatio-temporal enhancement features.
  6. The pedestrian gesture recognition method of claim 1, further comprising, prior to the step of acquiring the input sequence of consecutive video frames: training a pedestrian gesture recognition model, wherein the model is used to acquire an input sequence of consecutive video frames, extract multi-scale features to obtain a multi-scale feature map, perform bidirectional path fusion to obtain an enhanced feature map, perform feature-enhancement modeling to obtain spatio-temporal enhancement features, and perform detection based on the spatio-temporal enhancement features to obtain a pedestrian gesture recognition result; and in the training stage of the model, optimizing the network with a hybrid loss function comprising a bounding-box regression loss and a weighted keypoint mean-square-error loss.
  7. A pedestrian gesture recognition device, characterized in that the device comprises: a multi-scale feature extraction module for acquiring an input sequence of consecutive video frames and performing multi-scale feature extraction on the sequence to obtain a multi-scale feature map; a bidirectional path fusion module for performing bidirectional path fusion on the multi-scale feature map to obtain an enhanced feature map; a spatio-temporal feature enhancement module for performing feature-enhancement modeling on the enhanced feature map to obtain spatio-temporal enhancement features; and a gesture recognition module for performing detection based on the spatio-temporal enhancement features to obtain a pedestrian gesture recognition result.
  8. A pedestrian gesture recognition apparatus, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the computer program being configured to implement the steps of the pedestrian gesture recognition method of any one of claims 1 to 6.
  9. A storage medium, characterized in that the storage medium is a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the pedestrian gesture recognition method of any one of claims 1 to 6.
  10. A computer program product, characterized in that it comprises a computer program which, when executed by a processor, implements the steps of the pedestrian gesture recognition method of any one of claims 1 to 6.
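Claim 6 names a hybrid loss combining a bounding-box regression loss with a weighted keypoint mean-square-error loss, but the patent does not give the formulas. The following is a minimal numpy sketch under stated assumptions: smooth-L1 is assumed for the box term (a common choice, not confirmed by the source), the keypoint weights are assumed to down-weight occluded joints, and `lam` is a hypothetical balancing coefficient.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber-style) box regression loss; an assumed choice,
    since the patent does not specify the exact bbox loss."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).mean()

def weighted_keypoint_mse(pred_kpts, gt_kpts, weights):
    """Weighted MSE over keypoints; pred/gt are (N, K, 2), weights (N, K).
    The per-keypoint weights can down-weight occluded or low-confidence
    joints (an assumed weighting scheme)."""
    sq = ((pred_kpts - gt_kpts) ** 2).sum(axis=-1)          # squared error per keypoint
    return (weights * sq).sum() / np.maximum(weights.sum(), 1e-8)

def hybrid_loss(pred_box, gt_box, pred_kpts, gt_kpts, kpt_weights, lam=1.0):
    """Hybrid loss = bbox regression loss + lam * weighted keypoint MSE;
    `lam` is a hypothetical balancing coefficient."""
    return smooth_l1(pred_box, gt_box) + lam * weighted_keypoint_mse(
        pred_kpts, gt_kpts, kpt_weights)
```

With perfect predictions both terms vanish, so the hybrid loss is zero; the weighting lets training tolerate noisy annotations on hard-to-see joints.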

Description

Pedestrian gesture recognition method, device, equipment, storage medium and program product

Technical Field

The present application relates to the field of image recognition technology, and in particular, to a pedestrian gesture recognition method, apparatus, device, storage medium, and program product.

Background

Currently, in the fields of autonomous driving, intelligent video surveillance and human-machine interaction, pedestrian gesture recognition systems are mainly realized with deep learning architectures. Existing mainstream systems are typically deployed on vehicle-mounted edge computing platforms (e.g., the NVIDIA Jetson series, embedded FPGAs) or on high-performance cloud servers. The hardware system mainly comprises image acquisition equipment (cameras), a preprocessing unit and a deep learning inference accelerator (GPU/NPU). Although the above techniques have made some progress, in practical complex road scenes (e.g., autonomous driving environments) existing single-frame detection models lack modeling of the video's temporal dimension. In a continuous video stream, illumination changes or pose deformation make the predictions of adjacent frames discontinuous, so the gesture key points exhibit visible jitter, which affects the stability of downstream behavior prediction.

Disclosure of Invention

The application mainly aims to provide a pedestrian gesture recognition method, device, equipment, storage medium and program product, to solve the technical problem that existing pedestrian gesture recognition, being realized with single-frame detection models, readily produces visible jitter.
In order to achieve the above object, the present application provides a pedestrian gesture recognition method, comprising: acquiring an input sequence of consecutive video frames, and extracting multi-scale features from the sequence to obtain a multi-scale feature map; performing bidirectional path fusion on the multi-scale feature map to obtain an enhanced feature map; performing feature-enhancement modeling on the enhanced feature map to obtain spatio-temporal enhancement features; and performing detection based on the spatio-temporal enhancement features to obtain a pedestrian gesture recognition result. In one embodiment, the step of extracting multi-scale features from the sequence to obtain a multi-scale feature map comprises: performing multi-scale feature extraction on the sequence through a cross-stage partial network to obtain a multi-scale feature map; the cross-stage partial network comprises a plurality of stacked cross-stage partial modules, each comprising a first branch, a second branch and a splicing layer; correspondingly, the step of extracting multi-scale features through the cross-stage partial network comprises: performing convolutional compression through the first branch to obtain first features; performing deep global feature extraction on the input sequence through a plurality of residual blocks of the second branch to obtain high-dimensional second features; splicing the first features and the second features through the splicing layer to obtain the output feature map of the cross-stage partial module; and concatenating the output feature maps of the cross-stage partial modules in series to obtain the multi-scale feature map.
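The cross-stage partial module described above (a compression branch, a residual-block branch, and a splicing layer) can be sketched structurally in numpy. This is a minimal illustration, not the patent's implementation: 1x1 convolutions are modeled as channel-mixing matmuls on flattened feature maps, the residual-stack depth of 2 is assumed, and the random weights stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """1x1 convolution as a channel-mixing matmul: x is (H*W, C_in), w is (C_in, C_out)."""
    return x @ w

def residual_block(x, w1, w2):
    """Residual block: two channel mixes with a ReLU in between, plus a skip connection."""
    h = np.maximum(conv1x1(x, w1), 0.0)
    return x + conv1x1(h, w2)

def csp_block(x, c_out):
    """Cross-stage-partial block sketch:
       branch 1: convolutional compression to c_out//2 channels (first features);
       branch 2: a stack of residual blocks, then projection to c_out//2 (second features);
       splicing layer: channel-wise concatenation of the two branches."""
    c_in = x.shape[1]
    w_a = rng.standard_normal((c_in, c_out // 2)) * 0.1       # branch-1 compression
    x_a = conv1x1(x, w_a)
    x_b = x
    for _ in range(2):                                        # residual stack (depth assumed)
        w1 = rng.standard_normal((c_in, c_in)) * 0.1
        w2 = rng.standard_normal((c_in, c_in)) * 0.1
        x_b = residual_block(x_b, w1, w2)
    w_b = rng.standard_normal((c_in, c_out // 2)) * 0.1       # branch-2 projection
    x_b = conv1x1(x_b, w_b)
    return np.concatenate([x_a, x_b], axis=1)                 # splicing layer

# a flattened 16x16 feature map with 32 channels
feat = rng.standard_normal((256, 32))
out = csp_block(feat, 64)   # concatenated output: 64 channels
```

The point of the split is that only the second branch pays the cost of the residual stack, while the first branch carries a cheap compressed copy of the input forward; the splice recombines them so later stages see both.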
In an embodiment, the step of performing bidirectional path fusion on the multi-scale feature map to obtain an enhanced feature map comprises: inputting the multi-scale feature map into a path aggregation network for semantic enhancement and detail enhancement to obtain an initial enhancement map; and performing multi-resolution enhancement on the initial enhancement map through a high-resolution network to obtain the enhanced feature map. In an embodiment, the step of performing feature-enhancement modeling on the enhanced feature map to obtain spatio-temporal enhancement features comprises: performing spatial local-perception enhancement on the enhanced feature map to obtain local enhancement features; and performing joint spatio-temporal modeling of the local enhancement features to obtain the spatio-temporal enhancement features. In an embodiment, the step of performing joint spatio-temporal modeling of the local enhancement features to obtain the spatio-temporal enhancement features comprises: in the spatial dimension, processing the local enhancement features through a shifted-window self-attention mechanism to obtain spatial attention weights; in the temporal dimension, adding a temporal position code to the local enhancement features, and determining the temporal attention weight of the current frame based on the local
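The shifted-window spatial attention and temporal weighting of claim 5 might be sketched as follows. This is a hedged illustration, not the patent's network: the window size, the cyclic shift, the sinusoidal temporal position code, and the mean-pooled frame descriptor used for the temporal weights are all assumptions filling in details the source does not specify.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(x, win, shift=0):
    """Self-attention restricted to non-overlapping windows of `win` tokens,
    optionally cyclically shifted (Swin-style); x is (L, C)."""
    if shift:
        x = np.roll(x, -shift, axis=0)
    L, C = x.shape
    out = np.empty_like(x)
    for s in range(0, L, win):
        w = x[s:s + win]
        attn = softmax(w @ w.T / np.sqrt(C))   # spatial attention weights within the window
        out[s:s + win] = attn @ w
    if shift:
        out = np.roll(out, shift, axis=0)
    return out

def temporal_attention(frames):
    """frames: (T, L, C). Adds a sinusoidal temporal position code (assumed form),
    then weights frames by similarity of their mean-pooled descriptor to the
    current (last) frame, returning a corrected current-frame feature (L, C)."""
    T, L, C = frames.shape
    t = np.arange(T)[:, None]
    pe = np.sin(t / 10000 ** (np.arange(C)[None, :] / C))     # temporal position code
    f = frames + pe[:, None, :]
    cur = f[-1].mean(axis=0)                                   # current-frame descriptor
    scores = np.array([f[i].mean(axis=0) @ cur for i in range(T)]) / np.sqrt(C)
    w = softmax(scores)                                        # temporal attention weights
    return (w[:, None, None] * f).sum(axis=0)                  # weighted correction
```

Restricting attention to windows keeps the spatial step linear in the number of tokens, while the temporal weighting pulls the current frame toward consistent neighboring frames, which is how such designs damp inter-frame jitter.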