
CN-122024320-A - Lightweight human body posture estimation method based on shared convolution and layered receptive field

CN 122024320 A

Abstract

The invention belongs to the technical field of computer vision and relates to a lightweight human body posture estimation method based on shared convolution and a layered receptive field, implemented on an improved YOLO11n-pose network architecture comprising a backbone network, a neck network and a head network connected in sequence. The backbone network comprises a first standard convolution layer, a plurality of feature processing module groups, a fast spatial pyramid pooling module and a cross-stage partial spatial attention module connected in sequence, each feature processing module group being formed by sequentially connecting a second standard convolution layer and a dilated re-parameterization module based on the layered receptive field. The neck network comprises a network structure for up-sampling and feature fusion, and the head network comprises a lightweight detection head. The method addresses the problems that existing human body posture estimation methods are difficult to deploy on edge devices and that non-rigid deformation of the human body reduces accuracy.

Inventors

  • Qi Hui
  • Xu Qian
  • Yang Jing
  • Li Shengli
  • Li Huajun
  • Shi Chao

Assignees

  • 太原师范学院 (Taiyuan Normal University)
  • 山西华兴科软有限公司 (Shanxi Huaxing Kesoft Co., Ltd.)
  • 太原市迎泽区山水城小学校 (Shanshuicheng Primary School, Yingze District, Taiyuan)

Dates

Publication Date
2026-05-12
Application Date
2026-02-02

Claims (7)

  1. A lightweight human body posture estimation method based on shared convolution and a layered receptive field, characterized in that it is implemented on an improved YOLO11n-pose network architecture, wherein the improved YOLO11n-pose network architecture comprises a backbone network, a neck network and a head network connected in sequence; the backbone network comprises a first standard convolution layer, a plurality of feature processing module groups, a fast spatial pyramid pooling module and a cross-stage partial spatial attention module connected in sequence, each feature processing module group being formed by sequentially connecting a second standard convolution layer and a dilated re-parameterization module based on the layered receptive field; the neck network comprises a network structure for up-sampling and feature fusion; the head network comprises a lightweight detection head, the lightweight detection head being a structural improvement of the detection head of the original YOLO11n-pose network architecture; the method comprises the following steps:
     S1, acquiring an input image to be detected and inputting it into the backbone network;
     S2, performing preliminary feature extraction on the input image through the first standard convolution layer to obtain an initial feature map;
     S3, processing the initial feature map sequentially through the feature processing module groups to obtain a first multi-scale feature map, wherein in each feature processing module group the input features are transformed through the second standard convolution layer and feature information of different scales is then extracted through the dilated re-parameterization module;
     S4, performing pooling operations of different scales on the first multi-scale feature map in parallel through the fast spatial pyramid pooling module, concatenating and fusing the results, and extracting multi-scale spatial features to obtain a second multi-scale feature map;
     S5, processing the second multi-scale feature map through the cross-stage partial spatial attention module, focusing on key information through an attention mechanism, and performing feature enhancement to obtain an optimized feature map;
     S6, inputting the optimized feature map output by the backbone network into the neck network, and performing multi-scale feature fusion through the up-sampling and feature fusion network structure of the neck network to obtain a fused feature map;
     and S7, inputting the fused feature map output by the neck network into the lightweight detection head of the head network for processing, and outputting a final human body posture estimation result.
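The data flow of steps S1–S7 can be sketched at the shape level as follows. This is a minimal illustrative sketch, not the patented implementation: the claim fixes only the module order, so the channel widths, the number of feature processing module groups, the stride-2 downsampling and the 640×640 input are assumptions modeled on typical YOLO-style networks.

```python
def conv_shape(shape, out_c, stride=1):
    """Shape of a (C, H, W) tensor after a 'same'-padded standard convolution."""
    c, h, w = shape
    return (out_c, h // stride, w // stride)

def pipeline_shapes(shape=(3, 640, 640)):
    """Track tensor shapes through the claimed backbone steps S2-S5."""
    shapes = {}
    shape = conv_shape(shape, 16, stride=2)          # S2: first standard conv
    shapes["initial"] = shape
    for out_c in (32, 64, 128, 256):                 # S3: feature processing groups
        shape = conv_shape(shape, out_c, stride=2)   # second standard conv
        # the dilated re-parameterization module is shape-preserving
    shapes["first_multiscale"] = shape
    shapes["second_multiscale"] = shape              # S4: pyramid pooling preserves shape
    shapes["optimized"] = shape                      # S5: attention preserves shape
    return shapes

print(pipeline_shapes())
```

The neck (S6) then fuses such maps across scales before the head (S7) produces keypoint outputs.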
  2. The lightweight human body posture estimation method based on shared convolution and a layered receptive field according to claim 1, wherein the dilated re-parameterization module is constructed as follows: first, the sub-modules used to extract medium-scale and large-scale feature information in a dilated residual module are each replaced with an efficient feature extraction module, giving a replaced dilated residual module; second, the bottleneck module in a double-convolution cross-stage module is replaced with the replaced dilated residual module, thereby forming the dilated re-parameterization module; the convolution kernel sizes of the sub-modules for extracting medium-scale and large-scale feature information are set to 5 and 7, respectively.
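The arithmetic identity that makes dilated re-parameterization possible can be checked directly: a k×k kernel applied with dilation d is equivalent to an ordinary dense ((k−1)·d+1) kernel whose taps are spread out with zeros, so dilated branches can be merged into one dense convolution at inference time. The single-channel, no-padding correlation below is a sketch of this identity only, not of the patented module.

```python
import numpy as np

def correlate2d(x, k, dilation=1):
    """Naive 2-D correlation, stride 1, no padding, optional dilation."""
    kh, kw = k.shape
    eh, ew = (kh - 1) * dilation + 1, (kw - 1) * dilation + 1  # effective size
    H, W = x.shape[0] - eh + 1, x.shape[1] - ew + 1
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            patch = x[i:i + eh:dilation, j:j + ew:dilation]
            out[i, j] = np.sum(patch * k)
    return out

def to_dense(k, dilation):
    """Insert zeros so a dilated kernel becomes an equivalent dense kernel."""
    kh, kw = k.shape
    dense = np.zeros(((kh - 1) * dilation + 1, (kw - 1) * dilation + 1))
    dense[::dilation, ::dilation] = k
    return dense

rng = np.random.default_rng(0)
x = rng.standard_normal((12, 12))
k = rng.standard_normal((3, 3))
a = correlate2d(x, k, dilation=2)     # 3x3 kernel, dilation 2
b = correlate2d(x, to_dense(k, 2))    # equivalent dense 5x5 kernel
```

Here `a` and `b` agree element-wise, which is why the 5- and 7-tap branches of the claim can be folded together after training.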
  3. The lightweight human body posture estimation method based on shared convolution and a layered receptive field according to claim 1, wherein in step S4 the fast spatial pyramid pooling module specifically performs the following operations: executing pooling operations of at least two different scales on the input first multi-scale feature map in parallel while retaining the original features; concatenating the pooling results of the different scales with the original features along the channel dimension; and compressing and fusing the concatenated features over the channel dimension with a 1×1 convolution, outputting the second multi-scale feature map.
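The pooling-concatenate-compress sequence of claim 3 can be sketched as below. The kernel sizes 5 and 9 and the uniform 1×1 weights are illustrative assumptions; the claim only requires "at least two different scales" and leaves the learned weights unspecified.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def maxpool_same(x, k):
    """Channel-wise k x k max pooling, stride 1, 'same' padding. x: (C, H, W)."""
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)), constant_values=-np.inf)
    win = sliding_window_view(xp, (k, k), axis=(1, 2))   # (C, H, W, k, k)
    return win.max(axis=(-1, -2))

def sppf_like(x, out_c, ks=(5, 9)):
    """Parallel multi-scale pooling, channel concat, then 1x1 compression."""
    pooled = [x] + [maxpool_same(x, k) for k in ks]      # retain original features
    cat = np.concatenate(pooled, axis=0)                 # concat along channel dim
    w = np.ones((out_c, cat.shape[0])) / cat.shape[0]    # toy 1x1 conv weights
    return np.einsum("oc,chw->ohw", w, cat)              # per-pixel channel mix

x = np.random.default_rng(1).standard_normal((8, 20, 20))
y = sppf_like(x, out_c=8)
```

The spatial size is preserved throughout; only the channel count grows at the concat and is restored by the 1×1 convolution.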
  4. The lightweight human body posture estimation method based on shared convolution and a layered receptive field according to claim 1, wherein in step S5 the cross-stage partial spatial attention module specifically performs the following operations: dividing the input second multi-scale feature map into a first branch and a second branch; the first branch performing identity mapping or direct transfer; the second branch being processed sequentially by a partial spatial attention mechanism and a plurality of convolution blocks; and fusing the output features of the first branch with the output features of the second branch, outputting the optimized feature map through a convolution layer.
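The two-branch structure of claim 4 can be sketched as follows. The claim does not specify the attention blocks, so the sigmoid-of-channel-mean spatial gate and the random fusing weights below are stand-in assumptions that only illustrate the split / attend / fuse topology.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def csp_spatial_attention(x, seed=0):
    """Split channels; pass one half through; gate the other spatially; fuse."""
    c = x.shape[0] // 2
    identity, attended = x[:c], x[c:]                     # first / second branch
    gate = sigmoid(attended.mean(axis=0, keepdims=True))  # (1, H, W) spatial weights
    attended = attended * gate                            # focus key spatial positions
    cat = np.concatenate([identity, attended], axis=0)
    w = np.random.default_rng(seed).standard_normal((x.shape[0], cat.shape[0]))
    return np.einsum("oc,chw->ohw", w / cat.shape[0], cat)  # fusing 1x1 conv

x = np.random.default_rng(2).standard_normal((16, 10, 10))
y = csp_spatial_attention(x)
```

Keeping half the channels on an identity path is what makes the module "cross-stage partial": only part of the feature map pays for the attention computation.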
  5. The lightweight human body posture estimation method based on shared convolution and a layered receptive field according to claim 1, wherein in step S6 the up-sampling and feature fusion network structure realizes multi-scale feature fusion by: receiving the optimized feature map output from the backbone network; fusing high-level semantic features with low-level detail features through a top-down up-sampling path; enhancing the semantic information of the low-level features through a bottom-up path aggregation network; and performing dimension adjustment and quality optimization on the fused features through a convolution layer, outputting the fused feature map.
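The top-down then bottom-up fusion of claim 5 can be sketched at the shape level. The pyramid levels, channel widths and 2× resampling factors below are assumptions typical of path-aggregation necks; the claim itself does not fix them.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour upsampling, (C, H, W) -> (C, 2H, 2W)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2x(x):
    """Stride-2 subsampling as a stand-in for a stride-2 convolution."""
    return x[:, ::2, ::2]

p3 = np.zeros((64, 80, 80))    # low-level, high-resolution features
p5 = np.zeros((256, 20, 20))   # high-level, low-resolution semantics

# Top-down: upsample semantics to P3 resolution and fuse by channel concat.
td = np.concatenate([upsample2x(upsample2x(p5)), p3], axis=0)
# Bottom-up: downsample the fused map back to P5 resolution and fuse again.
bu = np.concatenate([downsample2x(downsample2x(td)), p5], axis=0)
```

In the real neck each concat would be followed by the convolution layer of the claim, which adjusts dimensions before the head.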
  6. The lightweight human body posture estimation method based on shared convolution and a layered receptive field according to claim 1, wherein the structural improvement of the lightweight detection head is that the separate initial convolution processing paths of the localization branch and the classification branch in the detection head of the original YOLO11n-pose network architecture are replaced by a shared convolution processing path sharing the same group of convolution kernel parameters.
  7. The lightweight human body posture estimation method based on shared convolution and a layered receptive field according to claim 6, wherein the improvement of the lightweight detection head further comprises introducing a scale layer before the shared convolution processing path for scaling adaptation of input features of different sizes, and performing feature normalization in the shared convolution processing path with a group normalization layer instead of a batch normalization layer.
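Claims 6–7 can be sketched together: one set of convolution weights serves both branches, a per-level scalar adapts features of different sizes, and group normalization (statistics over channel groups of a single sample, so independent of batch size) replaces batch normalization. The 1×1 shared convolution, the half/half branch split and the group count are illustrative assumptions.

```python
import numpy as np

def group_norm(x, groups, eps=1e-5):
    """Normalize a (C, H, W) map per channel group of one sample."""
    c, h, w = x.shape
    g = x.reshape(groups, c // groups, h, w)
    mu = g.mean(axis=(1, 2, 3), keepdims=True)
    var = g.var(axis=(1, 2, 3), keepdims=True)
    return ((g - mu) / np.sqrt(var + eps)).reshape(c, h, w)

def shared_head(feat, shared_w, scale, groups=4):
    """Scale layer -> shared conv path -> group norm -> branch outputs."""
    x = feat * scale                              # per-level scaling adaptation
    x = np.einsum("oc,chw->ohw", shared_w, x)     # SAME weights for both branches
    x = group_norm(x, groups)
    loc = x[: x.shape[0] // 2]                    # localization branch features
    cls = x[x.shape[0] // 2:]                     # classification branch features
    return loc, cls

rng = np.random.default_rng(3)
w = rng.standard_normal((8, 16))                  # one shared kernel parameter group
loc, cls = shared_head(rng.standard_normal((16, 10, 10)), w, scale=0.5)
```

Sharing the kernel parameters roughly halves the head's initial-convolution parameter count, which is the lightweighting claimed for edge deployment.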

Description

Lightweight human body posture estimation method based on shared convolution and layered receptive field

Technical Field

The invention belongs to the technical field of computer vision and specifically relates to a lightweight human body posture estimation method based on shared convolution and layered receptive fields.

Background

Human body posture estimation, a core task in computer vision, shows broad application value in fields such as intelligent monitoring, human-machine interaction and exercise rehabilitation by accurately sensing the spatial positions and motion states of key parts of the human body. For example, in intelligent monitoring, real-time early warning of abnormal behavior can be realized from posture analysis; in human-machine interaction, a natural and smooth body-based interaction mode can be provided to the user; and in sports rehabilitation, quantitative posture analysis can supply data for optimizing training plans.

However, in practical applications a large number of posture estimation tasks must be completed on edge devices, whose computing capacity and storage space are markedly limited by factors such as hardware cost, physical size and power consumption. To pursue detection accuracy, mainstream posture estimation algorithms generally adopt complex network structures containing large numbers of parameters, generating very high computational and storage overheads at run time that are difficult to fit within the hardware constraints of edge devices; this problem severely restricts the large-scale application of human body posture estimation in edge computing scenarios.
Accordingly, developing a lightweight human body posture estimation algorithm suited to scenarios with limited computing resources, one that markedly reduces the consumption of computing and storage resources while the estimation accuracy still meets practical requirements, has become a key technical problem to be solved in the field.

Disclosure of Invention

The invention provides a lightweight human body posture estimation method based on shared convolution and layered receptive fields, aiming to solve the problems that existing human body posture estimation methods are difficult to deploy on edge devices and lose accuracy under non-rigid deformation of the human body. The invention is realized by adopting the following technical scheme: the lightweight human body posture estimation method based on shared convolution and a layered receptive field is implemented on an improved YOLO11n-pose network architecture, wherein the improved YOLO11n-pose network architecture comprises a backbone network, a neck network and a head network connected in sequence; the backbone network comprises a first standard convolution layer, a plurality of feature processing module groups, a fast spatial pyramid pooling module and a cross-stage partial spatial attention module connected in sequence, each feature processing module group being formed by sequentially connecting a second standard convolution layer and a dilated re-parameterization module based on the layered receptive field; the neck network comprises a network structure for up-sampling and feature fusion; the head network comprises a lightweight detection head, which is a structural improvement of the detection head of the original YOLO11n-pose network architecture. The method comprises the following steps: S1, acquiring an input image to be detected and inputting it into the backbone network; S2, performing preliminary feature extraction on the input image through the first standard convolution layer to obtain an initial feature map; S3, processing the initial feature map sequentially through the feature processing module groups to obtain a first multi-scale feature map, wherein in each feature processing module group the input features are transformed through the second standard convolution layer and feature information of different scales is then extracted through the dilated re-parameterization module; S4, performing pooling operations of different scales on the first multi-scale feature map in parallel through the fast spatial pyramid pooling module, concatenating and fusing the results, and extracting multi-scale spatial features to obtain a second multi-scale feature map; S5, processing the second multi-scale feature map through the cross-stage partial spatial attention module, focusing on key information through an attention mechanism, and perfor