CN-122024332-A - Whole-body posture estimation method and system based on a downward fisheye and an enhanced EgoPoseFormer model
Abstract
The application relates to a whole-body posture estimation method and system based on a downward fisheye and an enhanced EgoPoseFormer model. The method enlarges the human-body capture range through a bottom-facing camera layout on the head-mounted display, compensates for geometric nonlinearity in the feature-extraction stage with distortion-aware convolution, and preserves feature-space consistency under dynamic camera sway through an attention mechanism with an embedded pose-compensation term. In addition, cross-frame temporal gating is combined with a multi-modal filtering algorithm so that inertial data kinematically corrects the visual predictions, yielding smooth trajectory output in occlusion and rapid-motion scenes. The scheme reduces compute overhead through model quantization and operator fusion, achieving low-latency, real-time whole-body pose tracking on mobile-terminal chips.
Inventors
- ZHANG YIBAI
- HUANG CUNLIN
- Hao Zhentong
Assignees
- Hangzhou Wuzhi Mixed Reality Technology Co., Ltd. (杭州吾知混合现实技术有限公司)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-04-13
Claims (10)
- 1. A whole-body posture estimation method based on a downward fisheye and an enhanced EgoPoseFormer model, characterized by comprising the following steps: acquiring an original image captured by a fisheye camera arranged at the bottom of a head-mounted display device, together with real-time six-degree-of-freedom pose data calculated by sensors of the head-mounted display device; preprocessing the original image to generate a normalized image; inputting the normalized image, the real-time six-degree-of-freedom pose data and preset user body-type parameters into an enhanced EgoPoseFormer model to perform pose prediction, obtaining first 3D whole-body joint-point coordinates, wherein the enhanced EgoPoseFormer model performs a distortion-aware convolution operation in the feature-extraction stage and introduces a compensation term based on the real-time six-degree-of-freedom pose data into the attention-mechanism calculation; and inputting the first 3D whole-body joint-point coordinates and high-frequency motion data from an inertial measurement unit of the head-mounted display device into a fusion stabilizer, updating the state with a filtering algorithm, and outputting smoothed second 3D whole-body joint-point coordinates.
- 2. The whole-body posture estimation method based on the downward fisheye and enhanced EgoPoseFormer model according to claim 1, wherein the fisheye camera is mounted 40 mm to 70 mm below the binocular center of the head-mounted display device, the optical axis of the fisheye camera has a pitch angle of 25° to 40° relative to the horizontal axis of the head-mounted display device, and the field of view (FOV) of the fisheye camera is 180° to 200°.
- 3. The method according to claim 1, wherein preprocessing the original image specifically includes: performing distortion correction using a geometric imaging formula based on an equidistant projection model; or inputting the original image into a fisheye distortion-correction network, outputting a pixel-level displacement field through the network, and resampling the original image based on the displacement field, to generate the normalized image.
- 4. The method according to claim 1, wherein the distortion-aware convolution operation includes: mapping a standard convolution kernel W to the original image coordinate system based on the intrinsic parameters and distortion model of the fisheye camera, obtaining a distortion-aware convolution kernel W_distorted(u, v): W_distorted(u, v) = W(Φ⁻¹(u, v)), where (u, v) are pixel coordinates and Φ⁻¹ is the inverse of the distortion-correction mapping.
- 5. The method according to claim 1, wherein the attention-mechanism calculation includes performing the following operation: Attention(Q, K, V) = softmax(QKᵀ/√d_k + λ·P_pose·M_align)·V, where Q is the query vector, K the key vector, V the value vector, d_k the vector dimension, λ a weight parameter, P_pose the pose-encoding data, and M_align a dynamic coordinate-alignment matrix.
- 6. The method according to claim 5, wherein the dynamic coordinate-alignment matrix M_align is calculated from the real-time six-degree-of-freedom pose data as: M_align = R_currentᵀ·R_reference, where R_current is the rotation matrix of the head-mounted display at the current moment and R_reference is the rotation matrix of the head-mounted display at the reference moment.
- 7. The method according to claim 1, wherein the enhanced EgoPoseFormer model performs cross-frame temporal feature extraction through the following formula: h_t = LSTM(h_{t−1}, x_t)·g_t + (1−g_t)·x_t, where h_t is the hidden state at the current time step, x_t the raw input data at the current time step, and g_t the gating weight parameter.
- 8. The method according to claim 1, wherein the filtering algorithm is a square-root unscented Kalman filter, and the state update specifically includes: converting the first 3D whole-body joint-point coordinates from the camera coordinate system to the world coordinate system; constructing a state vector X containing position, velocity and acceleration information for all body joints, the state vector X being a data matrix of 3N rows and 3 columns, where N is the number of joints; executing the filter prediction step using angular-velocity and acceleration data acquired by the inertial measurement unit; and executing the filter update step with the first 3D whole-body joint-point coordinates converted to the world coordinate system as observations, to output the second 3D whole-body joint-point coordinates.
- 9. The method according to claim 1, wherein the enhanced EgoPoseFormer model includes at least one of the following features: model parameters are quantized to INT8; a grouped-query attention (GQA) mechanism is adopted; and the convolutional layer, batch-normalization layer and activation layer are fused within the convolutional neural network.
- 10. A whole-body posture estimation system based on a downward fisheye and an enhanced EgoPoseFormer model, comprising: a data acquisition module for acquiring an original image from the fisheye camera at the bottom of the head-mounted display and real-time six-degree-of-freedom pose data; a preprocessing module for performing geometric distortion processing on the original image to generate a normalized image; a pose estimation engine for inputting the normalized image, the real-time six-degree-of-freedom pose data and the user body-type parameters into the enhanced EgoPoseFormer model to perform pose prediction and generate first 3D whole-body joint-point coordinates, wherein the enhanced EgoPoseFormer model performs a distortion-aware convolution operation in the feature-extraction stage and introduces a compensation term based on the real-time six-degree-of-freedom pose data into the attention-mechanism calculation; and a multi-modal fusion module for performing filtering fusion on the first 3D whole-body joint-point coordinates and the motion data of the inertial measurement unit, and outputting second 3D whole-body joint-point coordinates.
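The equidistant projection model of claim 3 (radial image distance r = f·θ for a ray at angle θ from the optical axis) can be sketched with a minimal NumPy remap. This is an illustrative sketch, not the patented preprocessing pipeline: the focal lengths `f_fish` and `f_pin` are assumed calibration values, and nearest-neighbour sampling stands in for proper interpolation.

```python
import numpy as np

def equidistant_remap(fish_img, f_fish, f_pin):
    """Build a pinhole-projection view from an equidistant fisheye image.

    Equidistant model: sensor radius r = f_fish * theta, where theta is the
    angle between the incoming ray and the optical axis. For each pixel of
    the rectified (pinhole) output we compute theta from pinhole geometry,
    find the matching fisheye radius, and sample with nearest-neighbour
    lookup.
    """
    h, w = fish_img.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    dx, dy = xs - cx, ys - cy
    r_pin = np.hypot(dx, dy)
    theta = np.arctan2(r_pin, f_pin)      # ray angle for a pinhole camera
    r_fish = f_fish * theta               # equidistant projection radius
    scale = np.divide(r_fish, r_pin, out=np.ones_like(r_pin), where=r_pin > 0)
    src_x = np.clip(np.rint(cx + dx * scale), 0, w - 1).astype(int)
    src_y = np.clip(np.rint(cy + dy * scale), 0, h - 1).astype(int)
    return fish_img[src_y, src_x]
```

In a real deployment the intrinsics would come from fisheye calibration, and the alternative learned displacement-field network of claim 3 would replace this closed-form remap.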
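The distortion-aware convolution of claim 4 warps the kernel's sampling grid through the inverse distortion-correction mapping Φ⁻¹. A slow but explicit sketch, assuming a caller-supplied `phi_inv` function standing in for the camera's calibrated mapping (the patent does not specify its parametric form):

```python
import numpy as np

def distortion_aware_conv(img, kernel, phi_inv):
    """3x3 distortion-aware convolution sketch (claim 4).

    Instead of sampling the regular 3x3 grid, each tap location is pushed
    through phi_inv -- a stand-in for the inverse distortion-correction
    mapping -- so the effective kernel footprint follows the fisheye
    geometry. Nearest-neighbour sampling, zero padding at the borders.
    """
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float64)
    offs = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for k, (dy, dx) in enumerate(offs):
                sy, sx = phi_inv(y + dy, x + dx)   # warp the tap position
                sy, sx = int(round(sy)), int(round(sx))
                if 0 <= sy < h and 0 <= sx < w:
                    acc += kernel.flat[k] * img[sy, sx]
            out[y, x] = acc
    return out
```

With the identity mapping `phi_inv = lambda y, x: (y, x)` this reduces to an ordinary convolution, which is a useful sanity check; a production version would precompute the warped offsets once per camera, as in deformable convolution.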
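Claim 5's attention with a pose-compensation term can be sketched as ordinary scaled dot-product attention plus an additive score bias. The claims do not spell out how the product P_pose·M_align is lifted to an (n, n) score map, so this sketch assumes it has been precomputed into a single `pose_bias` matrix:

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def pose_compensated_attention(Q, K, V, pose_bias, lam=0.1):
    """Attention(Q,K,V) = softmax(QK^T/sqrt(d_k) + lam*pose_bias) V.

    pose_bias stands in for the claim-5 term P_pose * M_align, assumed here
    to be precomputed as an (n, n) bias in score space; lam is the weight
    parameter lambda of the claim.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + lam * pose_bias
    return softmax(scores, axis=-1) @ V
```

Setting `pose_bias` to zero recovers standard attention, so the compensation term acts purely as a head-pose-dependent re-weighting of the score map.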
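The claim-6 formula for the dynamic coordinate-alignment matrix is garbled in the translation; a natural reading, consistent with R_current and R_reference being head-pose rotation matrices, is the relative rotation M_align = R_currentᵀ·R_reference. A minimal sketch under that assumption:

```python
import numpy as np

def alignment_matrix(R_current, R_reference):
    """Relative rotation between current and reference head pose (claim 6).

    Assumed reading: M_align = R_current^T @ R_reference, i.e. the rotation
    carrying features expressed in the current camera frame back to the
    reference frame. When the head has not moved, M_align is the identity.
    """
    return R_current.T @ R_reference
```

Since both inputs are rotation matrices, M_align is itself a rotation (orthogonal with determinant 1), which keeps the attention compensation term well conditioned.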
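The cross-frame gating of claim 7 blends a recurrent update with the raw current-frame features. A sketch with a plain tanh RNN cell standing in for the LSTM (the weight matrices `W_h`, `W_x`, `W_g` are illustrative parameters, not from the patent):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_temporal_step(h_prev, x_t, W_h, W_x, W_g):
    """One cross-frame gating step: h_t = cell(h_{t-1}, x_t)*g_t + (1-g_t)*x_t.

    A tanh RNN cell stands in for the LSTM of claim 7 to keep the sketch
    short. g_t is a learned sigmoid gate; as g_t -> 0 the temporal memory
    is bypassed and the current frame's features pass through unchanged.
    """
    cell = np.tanh(W_h @ h_prev + W_x @ x_t)   # recurrent feature update
    g_t = sigmoid(W_g @ x_t)                   # gating weight in (0, 1)
    return cell * g_t + (1.0 - g_t) * x_t
```

The residual path `(1 - g_t)·x_t` is what lets the model fall back on per-frame evidence when the temporal context is unreliable, e.g. after a tracking loss.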
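The fusion stabilizer of claim 8 uses a square-root unscented Kalman filter over a position/velocity/acceleration state per joint. As a simplified stand-in, here is one predict-plus-update step of a plain linear constant-acceleration Kalman filter for a single joint axis; the state layout matches the claim, but the square-root unscented machinery and the IMU-driven prediction inputs are omitted:

```python
import numpy as np

def ca_kalman_step(x, P, z, dt, q=1e-2, r=1e-3):
    """One step of a constant-acceleration Kalman filter for one joint axis.

    x: state [position, velocity, acceleration]; P: 3x3 covariance;
    z: observed joint position from the vision model; q, r: assumed
    process/measurement noise levels (illustrative values).
    """
    F = np.array([[1.0, dt, 0.5 * dt * dt],
                  [0.0, 1.0, dt],
                  [0.0, 0.0, 1.0]])
    H = np.array([[1.0, 0.0, 0.0]])           # we observe position only
    Q = q * np.eye(3)
    R = np.array([[r]])
    # predict
    x = F @ x
    P = F @ P @ F.T + Q
    # update with the visual measurement
    S = H @ P @ H.T + R
    K_gain = P @ H.T @ np.linalg.inv(S)
    x = x + (K_gain @ (np.array([z]) - H @ x)).ravel()
    P = (np.eye(3) - K_gain @ H) @ P
    return x, P
```

In the patented scheme the prediction step would additionally be driven by IMU angular velocity and acceleration, and all N joints would be stacked into one state, as claim 8 describes.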
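The operator fusion of claim 9 (folding batch normalization into the preceding convolution) is a standard algebraic identity; a sketch of the weight-folding step:

```python
import numpy as np

def fuse_conv_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a batch-norm layer into the preceding convolution.

    For each output channel c, y = gamma*(conv(x) - mean)/sqrt(var+eps) + beta
    is equivalent to a single convolution with rescaled parameters:
        W'[c] = W[c] * gamma[c] / sqrt(var[c] + eps)
        b'[c] = (b[c] - mean[c]) * gamma[c] / sqrt(var[c] + eps) + beta[c]

    W has shape (out_ch, in_ch, kh, kw); the remaining arguments are
    per-output-channel vectors of the frozen BN statistics.
    """
    scale = gamma / np.sqrt(var + eps)
    W_fused = W * scale[:, None, None, None]
    b_fused = (b - mean) * scale + beta
    return W_fused, b_fused
```

This removes one memory round-trip per layer at inference time, which, together with INT8 quantization and GQA, is what claim 9 relies on for mobile-chip latency.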
Description
Technical Field
The application relates to the technical field of human-body posture estimation, and in particular to a whole-body posture estimation method and system based on a downward fisheye and an enhanced EgoPoseFormer model.
Background
With the rapid development of extended reality (XR) technologies, including virtual reality (VR), augmented reality (AR) and mixed reality (MR), real-time whole-body motion capture has become a core technology for improving user immersion, driving avatars and enabling human-computer interaction. Current whole-body motion-capture schemes fall mainly into external and internal categories. Traditional external schemes rely on external positioning base stations, such as laser positioning or external multi-camera motion-capture systems; although these offer high positioning accuracy, they generally face bottlenecks such as high system cost, complex deployment, a limited user activity range, and difficulty of adoption in consumer-level scenarios. To improve portability and ease of use, inside-out visual tracking based on the cameras of a head-mounted display device (HMD) is becoming mainstream. However, existing inside-out whole-body pose estimation schemes still face the following challenges. First, consumer-level head-mounted displays mainly adopt front-facing cameras whose field of view (FOV) is concentrated on the hand and controller tracking area, so the user's lower body, particularly the legs and feet, often falls outside the cameras' view, making complete whole-body posture recovery difficult.
Second, in the egocentric pose estimation task, the head-mounted camera shakes frequently with the user's head movement; the dynamic camera coordinate system causes severe temporal drift of image features, while the motion blur and limb self-occlusion produced by rapid movement often cause the pose prediction to jitter visibly or be lost.
Disclosure of Invention
To solve these problems, the application provides a whole-body posture estimation method and system based on a downward fisheye and an enhanced EgoPoseFormer model, which require no peripheral equipment and offer high robustness. To achieve the above object, in a first aspect, an embodiment of the present application provides a whole-body posture estimation method based on a downward fisheye and an enhanced EgoPoseFormer model, comprising the steps of: acquiring an original image captured by a fisheye camera arranged at the bottom of a head-mounted display device, together with real-time six-degree-of-freedom pose data calculated by sensors of the head-mounted display device; preprocessing the original image to generate a normalized image; inputting the normalized image, the real-time six-degree-of-freedom pose data and preset user body-type parameters into an enhanced EgoPoseFormer model to perform pose prediction, obtaining first 3D whole-body joint-point coordinates, wherein the enhanced EgoPoseFormer model performs a distortion-aware convolution operation in the feature-extraction stage and introduces a compensation term based on the real-time six-degree-of-freedom pose data into the attention-mechanism calculation; and inputting the first 3D whole-body joint-point coordinates and high-frequency motion data from an inertial measurement unit of the head-mounted display device into a fusion stabilizer, updating the state with a filtering algorithm, and outputting smoothed second 3D whole-body joint-point coordinates.
Preferably, the fisheye camera is mounted 40 mm to 70 mm below the binocular center of the head-mounted display device, the pitch angle of the optical axis of the fisheye camera relative to the horizontal axis of the head-mounted display device is 25° to 40°, and the field of view (FOV) of the fisheye camera is in the range of 180° to 200°.
Preferably, preprocessing the original image specifically includes: performing distortion correction using a geometric imaging formula based on an equidistant projection model; or inputting the original image into a fisheye distortion-correction network, outputting a pixel-level displacement field through the network, and resampling the original image based on the displacement field, to generate the normalized image.
Preferably, the distortion-aware convolution operation includes: mapping a standard convolution kernel W to the original image coordinate system based on the intrinsic parameters and distortion model of the fisheye camera, obtaining a distortion-aware convolution kernel W_distorted(u, v): W_distorted(u, v) = W(Φ⁻¹(u, v)), where (u, v) are pixel coordinates and Φ⁻¹ is the inverse of the distortion-correction mapping.