CN-121810667-B - Physical perception and graph diffusion-based articulating object posture generation method


Abstract

The invention discloses an articulated object posture generation method based on physical perception and graph diffusion, and relates to the field of computer vision. First, the part-level geometry and physical properties of an object are reconstructed from images by a dual-branch neural implicit network, and the connection relations between components are initialized. Next, the component relation graph is refined by kinematic fitting and a temporal consistency check, and the prior pose distribution of each component on the SE(3) manifold is inferred by a physics-enhanced graph diffusion process. Finally, a conditional diffusion model on the SE(3) manifold is constructed, conditioned on this prior and the reconstructed information, and a physically guided two-step reverse sampling framework generates a set of diverse and physically plausible pose hypotheses for the articulated object. By tightly coupling physical laws with data-driven generation, the invention efficiently generates diverse, physically plausible poses for articulated objects.

Inventors

Fan Manping
ZHANG XINYUE
ZHENG YINGYING
WANG ZIHENG
GUAN QING
YU YONGBIN
WANG XIANGXIANG

Assignees

University of Electronic Science and Technology of China (电子科技大学)

Dates

Publication Date: 2026-05-05
Application Date: 2026-03-06

Claims (8)

1. An articulated object posture generation method based on physical perception and graph diffusion, characterized by comprising the following steps:
Step 1: constructing a dual-branch joint representation network, reconstructing the part-level geometry and the physical attributes of an articulated object from image data, initializing the joint connection relations among the components, and outputting the initialization parameters of the component relation graph;
the dual-branch joint representation network comprises a geometric branch and a physical branch;
the geometric branch extracts component-level geometric features of each component with an independent geometric feature extraction network, stacks all component-level geometric feature vectors into a matrix, applies max pooling to the matrix along the component dimension, obtains a fixed-dimension global geometric code through a fully connected layer, and at the same time predicts, through a linear transformation layer, the signed distance value and the part-membership logits of each query point from the component-level geometric features;
the physical branch predicts, from the extracted physical features, a component-level physical attribute vector for each component i with an independent physical feature extractor, obtains a global physical code describing the global physical attributes through a global physical feature extractor and a linear layer, and predicts, from the component-level physical attribute vectors and with a lightweight joint parameter encoder, the joint parameters of the joint k that connects component i to its parent component; the complete output parameters comprise the axis direction, the joint type, the screw pitch and the range of motion;
Step 2: physics-enhanced hierarchical component relation graph diffusion: inputting the relation graph initialization parameters into a physics-enhanced graph diffusion model and inferring the pose probability distribution of each component under task guidance;
Step 201: defining a joint constraint energy function according to the joint type and optimizing it to accurately estimate the joint parameters; then using multi-view time-series observation data to perform a motion consistency test and a confidence assessment on the kinematic relation defined by the accurately estimated joint parameters; and outputting a determined component relation graph with the accurate joint parameters and the confidences;
Step 202: defining hidden states and message functions of the nodes based on component poses and physical information, computing attention weights with a physics-aware graph attention mechanism, and aggregating the messages of the neighbor nodes weighted by the attention weights to obtain the aggregated message of the current node;
Step 203: defining a pose energy function that measures how well a pose satisfies the joint constraints and how consistent it is with the task description, approximating the pose distribution of each component at the minimum of the pose energy function by a Gaussian distribution on the SE(3) manifold, and outputting the pose prior distribution parameters of each component;
Step 3: constructing a conditional diffusion probability model on the SE(3) manifold conditioned on the geometry, the physical attributes and the pose probability distribution of the components, and using a physically guided two-step reverse sampling framework to gradually recover and optimize poses from noise, finally generating a set of diverse and physically plausible pose hypotheses for the articulated object;
Step 301: the conditional diffusion probability model comprises a forward diffusion process and a reverse denoising process; the forward diffusion process starts from the component pose and injects noise in the corresponding Lie algebra $\mathfrak{se}(3)$; the reverse denoising process starts from random noise, is guided by multi-modal condition information, learns a parameterized denoising network, and predicts and removes the noise step by step to recover component poses that satisfy the physical and geometric constraints;
Step 302: a task condition feature vector is obtained from the task description by a pre-trained text encoder; the global physical code, the global geometric code and the task condition feature vector are concatenated; the current noisy pose and the concatenated features are dynamically fused by a cross-modal attention mechanism to obtain a fused condition vector; a score network for the reverse denoising process is constructed which, given the noisy pose at step t, the diffusion step and the fused condition vector, performs reverse denoising; the training loss comprises the standard diffusion-model denoising loss and a differentiable physical regularization loss based on physical rules, the physical regularization loss comprising a penetration penalty, a static stability penalty, a motion energy penalty and a joint constraint penalty;
Step 303: in the $\mathfrak{se}(3)$ tangent space, performing a Langevin dynamics update based on the learned score network to obtain a preliminary denoising proposal from the noise, and then obtaining a proposed pose on SE(3) through the exponential map; computing the degree of physical violation of the proposed pose from the physical regularization loss, back-propagating to obtain a physical correction gradient, and correcting the prediction in the tangent space; repeating this step until the iteration stop condition is reached;
Step 304: starting from different random noise samples, running step 303 to generate N independent sampling trajectories and finally obtain N candidate pose hypotheses; performing an offline physical evaluation of each candidate pose, comprising a collision score for penetration between components, a stability score, an energy score and a joint compliance score; sorting all candidate poses by the evaluation results; and outputting the M poses with the highest scores.
2. The articulated object posture generation method based on physical perception and graph diffusion according to claim 1, wherein step 1 is specifically as follows:
Step 101: applying three encodings (position, direction and pose) to the input image data and concatenating the three encoding results to obtain shared features;
Step 102: constructing the dual-branch joint representation network, which comprises the geometric branch and the physical branch;
Step 103: outputting the initialization parameters of the component relation graph, comprising node attributes, an edge candidate set, the global geometric code and the global physical code, wherein the node attributes are component attributes comprising signed distance values, component-level physical attribute vectors and local coordinate system transformations, and each edge of the edge candidate set carries joint parameters.
3. The articulated object posture generation method based on physical perception and graph diffusion according to claim 2, wherein the joint parameters are expressed as a tuple of four elements: a unit vector in the direction of the joint axis; a joint type, which is one of revolute, prismatic, fixed and helical (screw); a pitch parameter, which is valid only for a helical joint; and a range of motion, which is an interval of real numbers $\mathbb{R}$ bounded by a lower limit and an upper limit and indicates the allowed range of articulation.
4. The articulated object posture generation method based on physical perception and graph diffusion according to claim 3, wherein step 201 is specifically as follows:
Step 2011: for each candidate edge between components i and j, using the multi-view time-series observation data, which contain the pose of component i and the pose of component j at each time step, the joint parameters are accurately estimated by minimizing the joint constraint energy function [formula missing in source], where the optimization variable is the joint parameter vector of joint k and the minimizer is its accurate estimate; the joint constraint energy function is defined according to the joint type [formula missing in source], where the rotation matrices of components i and j belong to $SO(3)$, the three-dimensional rotation group; $\log$ is the logarithmic map from $SO(3)$ to its Lie algebra $\mathfrak{so}(3)$; the positions of components i and j are three-dimensional vectors; and the energy further involves the desired relative rotation vector, the unit vector of the joint axis direction, and the component of a vector perpendicular to the joint axis;
Step 2012: using the multi-view time-series observation data and the accurately estimated joint parameters, computing the theoretical relative motion that components i and j should exhibit at each time step, and at the same time extracting the measured relative motion at the corresponding time directly from the observation data; the specific form of the relative motion depends on the joint type: the relative rotation angle for a revolute joint, the axial linear displacement for a prismatic joint, and the generalized displacement of coupled rotation and translation for a helical joint, its definition being consistent with the motion constraint of the corresponding joint type in step 2011; by comparing the theoretical prediction with the actual observation, the confidence score of the candidate connection is computed [formula missing in source], where the formula depends on the number of time steps and a scale parameter; a threshold is set, and a candidate connection whose confidence score falls below the threshold is judged to be a false connection and filtered out;
Step 2013: outputting the refined, determined component relation graph, in which the nodes comprise component codes with signed distance values, physical attribute vectors and local coordinate systems, and the edges comprise the accurate joint parameters and the confidence.
5. The articulated object posture generation method based on physical perception and graph diffusion according to claim 4, wherein step 202 is specifically as follows:
Step 2021: the hidden state of each node is maintained and updated by a recurrent neural network encoder [formula missing in source], whose inputs are the vector obtained by mapping the pose of component i on SE(3), the angular and linear velocities of the component computed from multi-view RGB-D time-series images or video, the component physical attribute vector, and the external force; a message function maps the hidden states of component nodes i and j, together with the joint parameters and the external force, to a message vector; message aggregation uses a physics-aware graph attention mechanism in which the attention weight is determined jointly by node state similarity and connection stiffness [formula missing in source], where the attention weight is normalized over the set of neighbor nodes of node i, k is the neighbor node index, the similarity function measures the similarity between the hidden state of node i and that of node j or of neighbor node k, and the stiffness function estimates the connection stiffness between the same pairs of hidden states; the aggregated message of node i is the attention-weighted sum of all neighbor messages;
Step 2022: performing the dynamics computation according to the joint type, specifically computing the force and torque on each node under a quasi-static assumption with an added Coulomb friction model; from the computed force and torque, computing the linear acceleration and the angular acceleration of node i according to Newton's second law and the law of rotation; then encoding four physical quantities, namely the linear acceleration and angular acceleration of node i and the angular velocity and linear velocity of the component computed from the multi-view RGB-D time-series images or video, into a fixed-dimension feature vector with a lightweight multi-layer perceptron; the final node hidden state is updated by a gated recurrent unit that takes the current hidden state, the aggregated message and the feature vector as input.
6. The articulated object posture generation method based on physical perception and graph diffusion according to claim 5, wherein step 203 is specifically as follows:
the pose energy function is given by [formula missing in source], where its arguments are the component poses, the relation graph, the parameters of the joint between components i and j, and the task, and where a task term measures the consistency of the pose with the task description;
the pose distribution is assumed to follow the Boltzmann distribution [formula missing in source], that is, the conditional probability of a pose decays exponentially with its pose energy;
a second-order Taylor expansion is performed at the minimum of the pose energy function, so that the pose distribution of each component is approximated by a tangent-space Gaussian distribution centered on the optimal pose, whose covariance matrix is defined in the tangent space at the optimal pose [formula missing in source], where the distribution is a Gaussian distribution on the SE(3) manifold whose mean lies on the manifold and whose covariance is a matrix defined in the tangent space at the mean;
the pose prior distribution parameters of each component i are output as a strong conditional prior for the conditional diffusion model in step 3.
7. The articulated object posture generation method based on physical perception and graph diffusion according to claim 6, wherein the physical regularization loss comprises four physical constraint terms, each weighted by a weight coefficient:
Penetration penalty [formula missing in source]: defined over the points sampled from a component, where the formula uses the total number of points in the point set, an activation function, and the value of each point in the signed distance field of component j;
Static stability penalty [formula missing in source]: where a small threshold is used; the overall center of gravity is computed from the predicted component masses and the component positions, a support polygon is estimated as the two-dimensional convex hull of the projections of the contact points onto the horizontal plane, and the penalty depends on the distance from the horizontal projection of the center of gravity to the convex hull boundary;
Motion energy penalty [formula missing in source]: where the formula uses the height of the center of gravity of component i relative to the reference plane and the gravitational acceleration constant;
Joint constraint penalty [formula missing in source]: where the formula uses the set of edges in the component relation graph, the total number of joints, the joint displacement computed from the relative pose, and the lower and upper limits of the joint displacement.
8. The articulated object posture generation method based on physical perception and graph diffusion according to claim 7, wherein step 303 is specifically as follows:
Step 3031: in the $\mathfrak{se}(3)$ tangent space, a Langevin dynamics update based on the learned score network is performed to obtain a preliminary denoising proposal from the noise [formula missing in source], where the noisy pose at step t is the high-dimensional vector formed by concatenating the Lie algebra coordinates of all component poses at step t of the reverse diffusion process, and the update uses the score network, the fused condition vector, the update step size of step t and a random noise vector;
Step 3032: the predicted Lie algebra coordinates are mapped to a proposed pose on SE(3) by the exponential map; the physical regularization loss of the proposed pose is then computed and back-propagated to obtain a physical correction gradient, and the prediction is corrected in the tangent space [formula missing in source], where the correction step size is a hyperparameter;
Step 3033: the corrected noisy pose is projected back onto the SE(3) manifold by the exponential map to obtain the noisy pose of the next step; steps 3031 to 3033 are repeated until t = 0.
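
The two-step update of claim 8 (a Langevin proposal followed by a physical correction) can be sketched on a toy problem. This is only an illustration under stated assumptions: the claim's own update formulas are not present in this text, so the standard unadjusted Langevin form is used; the pose is reduced to a single revolute joint angle, so the tangent coordinate is the angle itself and no exponential map is needed; an analytic Gaussian score stands in for the learned score network; and a quadratic joint-limit penalty stands in for the full physical regularization loss. All function and parameter names are hypothetical.

```python
import math
import random

def relu(x):
    return x if x > 0.0 else 0.0

def gaussian_score(x, mean, sigma):
    """Gradient of the log density of an isotropic Gaussian.
    Assumption: stands in for the learned score network."""
    return [(m - xk) / sigma ** 2 for xk, m in zip(x, mean)]

def joint_limit_grad(x, lo, hi):
    """Gradient of sum_k relu(lo - x_k)^2 + relu(x_k - hi)^2.
    Assumption: stands in for the physical regularization loss."""
    return [2.0 * relu(xk - hi) - 2.0 * relu(lo - xk) for xk in x]

def guided_reverse_sampling(x, score, phys_grad, steps, eps, eta, rng,
                            noise_scale=1.0):
    """Alternate a Langevin proposal and a physical correction step."""
    for _ in range(steps):
        # Step 1: Langevin proposal in tangent coordinates.
        s = score(x)
        proposal = [
            xk + 0.5 * eps * sk
            + noise_scale * math.sqrt(eps) * rng.gauss(0.0, 1.0)
            for xk, sk in zip(x, s)
        ]
        # Step 2: gradient step that reduces the physical violation.
        g = phys_grad(proposal)
        x = [pk - eta * gk for pk, gk in zip(proposal, g)]
    return x

# The prior prefers an angle of 2.0 rad, but the joint only allows [-1, 1].
score = lambda x: gaussian_score(x, mean=[2.0], sigma=1.0)
limits = lambda x: joint_limit_grad(x, lo=-1.0, hi=1.0)
no_physics = lambda x: [0.0 for _ in x]

rng = random.Random(0)
guided = guided_reverse_sampling([0.0], score, limits, 200, 0.1, 0.4, rng,
                                 noise_scale=0.0)
unguided = guided_reverse_sampling([0.0], score, no_physics, 200, 0.1, 0.4,
                                   rng, noise_scale=0.0)
```

With the noise switched off (`noise_scale=0.0`) the unguided chain settles at the prior mean outside the joint range, while the guided chain settles just past the upper limit, where the pull of the score and the correction gradient balance. Leaving the default `noise_scale=1.0` and changing the seed gives the different trajectories that step 304 would rank.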

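The penetration penalty of claim 7 and the ranking of step 304 can be illustrated with a small sketch. The penalty formula itself is not present in this text, so the form below (the mean of ReLU applied to the negated signed distance over the sampled points) is an assumption, as are the sphere obstacle, the use of the negated penalty as the only score, and every function name.

```python
import math

def sphere_sdf(p, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return math.dist(p, center) - radius

def penetration_penalty(points, sdf):
    """Mean penetration depth of the points into the shape given by sdf.
    Assumption: ReLU(-sdf) averaged over the point set."""
    return sum(max(0.0, -sdf(p)) for p in points) / len(points)

def top_m(candidates, score, m):
    """Return the m candidates with the highest score, best first."""
    return sorted(candidates, key=score, reverse=True)[:m]

# A unit sphere plays the role of a neighboring component.
obstacle = lambda p: sphere_sdf(p, (0.0, 0.0, 0.0), 1.0)

# Three hypothetical candidate poses, each reduced to two sampled points.
candidates = {
    "deep": [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    "graze": [(0.9, 0.0, 0.0), (2.0, 0.0, 0.0)],
    "clear": [(1.5, 0.0, 0.0), (2.0, 0.0, 0.0)],
}
score = lambda name: -penetration_penalty(candidates[name], obstacle)
best = top_m(candidates, score, 2)
```

Here `best` keeps the two candidates that penetrate the obstacle least. A fuller evaluation would combine this collision score with the stability, energy and joint compliance scores that the claims list.
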
Description

Physical perception and graph diffusion-based articulated object posture generation method

Technical Field

The invention relates to the field of computer vision, in particular to an articulated object posture generation method based on physical perception and graph diffusion.

Background

An articulated object is an object formed by two or more components that are connected by joints and can move relative to one another, such as tables and chairs, cabinets and robotic arms. Articulated object pose generation is one of the central challenges in computer vision and robotic manipulation. With the rapid development of these fields, ever higher requirements are placed on the accuracy, physical plausibility and diversity of the generated poses.

Current articulated object pose generation methods fall into two groups: traditional geometric modeling methods and deep learning-based generation methods. Traditional geometric modeling methods rely on hand-designed features and an accurate physical model, and solve for the object pose through kinematic equations. They guarantee a certain degree of physical plausibility, but adapt poorly to complex scenes, struggle with interference in real images such as occlusion and blur, and generate poses inefficiently. Deep learning-based generation methods are the current research mainstream thanks to their strong feature extraction capability. Methods based on graph neural networks model the articulated object as a graph of component nodes and joint edges and infer the relations among components through graph message passing, which has brought some progress in modeling pose structure. However, such methods often ignore the physical properties of the object (such as mass, inertia and joint constraints), so the generated poses are prone to physical violations such as component penetration and joint motion beyond its limits, and cannot be applied directly to real physical interaction scenes.

In recent years, diffusion models have increasingly been applied to pose generation because of their excellent generation diversity. However, most existing diffusion-based methods perform the sampling update in Euclidean space, whereas the pose of an articulated object component lies on the SE(3) manifold, the smooth manifold formed by all three-dimensional rigid body motions (rotation plus translation); directly transferring a Euclidean diffusion process to the SE(3) manifold distorts the pose representation. Moreover, existing methods do not effectively integrate physical constraints into the diffusion update, so the diversity and the physical plausibility of the poses are hard to balance. How to improve the physical plausibility and structural accuracy of the poses while preserving generation diversity is therefore a core challenge for current articulated object pose generation.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides an articulated object posture generation method based on physical perception and graph diffusion. The method obtains the geometry and physical properties of the articulated object through component-level, physics-aware reconstruction, which provides a reliable basis of physical constraints for pose generation; it improves the accuracy of the component relations and the soundness of the pose prior inference through graph refinement and a physics-enhanced graph diffusion process; and it constructs a physically guided conditional diffusion model on the SE(3) manifold to generate articulated object poses that are both diverse and physically plausible.
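The Background's remark that Euclidean updates distort poses on SE(3) can be made concrete with a short sketch. It is not taken from the patent: it uses the standard closed-form exponential map from a twist in $\mathfrak{se}(3)$ to a rotation and translation, written in plain Python, and compares it with a direct perturbation of a rotation matrix entry. The helper names (`hat`, `se3_exp`, `orthogonality_error`) are hypothetical.

```python
import math

def hat(w):
    """Skew-symmetric matrix of a 3-vector (cross-product matrix)."""
    x, y, z = w
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def combine(a, K, b, K2):
    """Return I + a*K + b*K2 for 3x3 matrices."""
    return [[(1.0 if i == j else 0.0) + a * K[i][j] + b * K2[i][j]
             for j in range(3)] for i in range(3)]

def se3_exp(xi):
    """Exponential map from a twist (wx, wy, wz, vx, vy, vz) to (R, t)."""
    w, v = xi[:3], xi[3:]
    theta = math.sqrt(sum(c * c for c in w))
    K = hat(w)
    K2 = matmul(K, K)
    if theta < 1e-8:
        # Series limits of the coefficients for a vanishing rotation.
        a, b, c = 1.0, 0.5, 1.0 / 6.0
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    R = combine(a, K, b, K2)   # Rodrigues formula
    V = combine(b, K, c, K2)   # couples rotation into the translation
    t = [sum(V[i][k] * v[k] for k in range(3)) for i in range(3)]
    return R, t

def orthogonality_error(R):
    """Largest absolute entry of R^T R - I; zero for a valid rotation."""
    return max(
        abs(sum(R[k][i] * R[k][j] for k in range(3))
            - (1.0 if i == j else 0.0))
        for i in range(3) for j in range(3)
    )

# Tangent-space update: quarter turn about z with a unit x translation rate.
R, t = se3_exp([0.0, 0.0, math.pi / 2, 1.0, 0.0, 0.0])

# Euclidean update: perturb one matrix entry directly.
R_euclid = [row[:] for row in R]
R_euclid[0][1] += 0.1
```

The matrix produced by `se3_exp` stays orthogonal up to rounding, whereas `R_euclid` is no longer a rotation. This is the kind of distortion that updating in the tangent space and mapping back with the exponential map avoids.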
To achieve the above object, the technical scheme of the invention is as follows: an articulated object posture generation method based on physical perception and graph diffusion, comprising the following steps: Step 1: constructing a dual-branch joint representation network, reconstructing the part-level geometry and the physical attributes of an articulated object from image data, initializing the joint connection relations among the components, and outputting the initialization parameters of the component relation graph; Step 2: physics-enhanced hierarchical component relation graph diffusion: inputting the relation graph initialization parameters into a physics-enhanced graph diffusion model and inferring the pose probability distribution of each component under task guidance; Step 3: constructing a conditional diffusion probability model on the SE(3) manifold conditioned on the geometry, the physical attributes and the pose probability distribution of the components, gr