CN-121973217-A - Robot binocular vision motion mapping method and system
Abstract
The invention relates to the technical field of intelligent robot control and machine vision, and discloses a motion mapping method and system for robot binocular vision. The method generates scene perception data containing three-dimensional information of a target object and of human dual-arm joint points from acquired binocular vision input information and a current operation task instruction; fuses the task-related visual features in the scene perception data with spatiotemporal motion features to obtain fused features; and maps the fused features to corresponding robot joint motion trajectories through a sequence mapping model. Implementing the invention can therefore improve both the accuracy and the flexibility of robot motion mapping.
Inventors
- LI DONGSHENG
- LI TIANGANG
- FAN YANYAN
Assignees
- 深圳和润达科技有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20260204
Claims (10)
- 1. A motion mapping method for robot binocular vision, the method comprising: generating scene perception data containing three-dimensional information of a target object and of human dual-arm joint points based on acquired binocular vision input information and a current operation task instruction; extracting spatiotemporal motion features of the human dual-arm joint points from the scene perception data; and fusing the task-related visual features in the scene perception data with the spatiotemporal motion features to obtain fused features, and mapping the fused features to corresponding robot joint motion trajectories through a sequence mapping model; wherein the binocular vision input information includes a binocular image pair comprising a left-view image and a right-view image, and generating the scene perception data based on the acquired binocular vision input information and the current operation task instruction comprises: generating dense depth information from the binocular image pair through a preset stereo matching algorithm, the algorithm comprising multi-scale feature extraction, matching cost volume construction, cost volume regularization, and probability-distribution-based disparity computation; executing a multi-scale spatiotemporal attention mechanism on the binocular image pair, the dense depth information, and the acquired current operation task instruction to extract task-related visual features related to the current operation task instruction, the task-related visual features comprising appearance and geometric features and three-dimensional position and posture information of the target object, and human posture trajectory information; detecting and tracking a coordinate sequence of the human dual-arm joint points in three-dimensional space through a three-dimensional human pose estimation model based on the binocular image pair and the dense depth information; and fusing the task-related visual features with the coordinate sequence of the human dual-arm joint points to construct unified scene perception data.
- 2. The motion mapping method for robot binocular vision according to claim 1, wherein the binocular vision input information further includes binocular camera geometric parameters, and generating the dense depth information from the binocular image pair through the preset stereo matching algorithm comprises: performing multi-scale feature extraction on the left-view image and the right-view image to obtain feature pyramids of the binocular image pair at different resolutions; computing the matching cost between corresponding features of the left-view and right-view images within a preset disparity range based on all the feature pyramids, and constructing an initial matching cost volume; performing three-dimensional convolution regularization on the initial matching cost volume to obtain an optimized matching cost volume; computing the expected disparity value of each pixel through a sampling-Gaussian probability distribution model based on the optimized matching cost volume; and converting the expected disparity values into dense depth information of the binocular image pair according to the binocular camera geometric parameters.
- 3. The method according to claim 2, wherein computing the matching cost between corresponding features of the left-view and right-view images within the preset disparity range based on all the feature pyramids and constructing the initial matching cost volume comprises: performing channel grouping and dimension transformation on all the feature pyramids to generate a first grouped feature set for the left-view image and a second grouped feature set for the right-view image; for each discrete disparity value within the preset disparity range, performing pixel-by-pixel matching-metric computation between the first grouped feature set and the second grouped feature set shifted by that disparity value, generating a two-dimensional matching cost map at that disparity; stacking the two-dimensional matching cost maps over all discrete disparity values to form a three-dimensional matching cost volume; and filtering the three-dimensional matching cost volume to construct the initial matching cost volume.
- 4. The method according to claim 1, wherein executing the multi-scale spatiotemporal attention mechanism on the binocular image pair, the dense depth information, and the acquired current operation task instruction to extract task-related visual features comprises: inputting the binocular image pair, the dense depth information, and the encoded current operation task instruction into a multi-modal feature encoder to generate unified multi-scale spatiotemporal features; generating a task-aware query vector from the current operation task instruction; performing cross-modal attention weight assignment over the multi-scale spatiotemporal features based on the task-aware query vector, so as to focus on visual elements related to the current operation task instruction in the spatial and temporal dimensions; and performing weighted fusion and feature enhancement on the multi-scale spatiotemporal features according to the assigned attention weights, extracting the task-related visual features associated with the current operation task from the binocular vision input information.
- 5. The method according to any one of claims 1-4, wherein extracting the spatiotemporal motion features of the human dual-arm joint points from the scene perception data comprises: extracting the coordinate sequence of the human dual-arm joint points in three-dimensional space from the scene perception data; constructing, from the coordinate sequence, a spatiotemporal graph whose nodes are the joint points and whose edges are the spatial and temporal connections between joint points; performing spatiotemporal graph convolution on the graph, capturing per-node spatiotemporal features through spatial graph convolution and temporal convolution; and temporally aggregating these spatiotemporal features to generate spatiotemporal motion features representing the overall motion pattern of the human dual arms.
- 6. The motion mapping method for robot binocular vision according to any one of claims 1-4, wherein fusing the task-related visual features in the scene perception data with the spatiotemporal motion features to obtain fused features, and mapping the fused features to corresponding robot joint motion trajectories through a sequence mapping model, comprises: performing attention-based weighted fusion of the task-related visual features in the scene perception data and the spatiotemporal motion features to obtain the fused features; inputting the fused features as an input sequence to a sequence mapping model with an encoder-decoder architecture; encoding the input sequence through the encoder of the sequence mapping model to extract a semantic representation; and generating the corresponding robot joint motion trajectory sequence from the semantic representation through the decoder of the sequence mapping model.
- 7. A motion mapping system for robot binocular vision, the system comprising: a generation module for generating scene perception data containing three-dimensional information of a target object and of human dual-arm joint points based on acquired binocular vision input information and a current operation task instruction; an extraction module for extracting spatiotemporal motion features of the human dual-arm joint points from the scene perception data; and a mapping module for fusing the task-related visual features in the scene perception data with the spatiotemporal motion features to obtain fused features, and mapping the fused features to corresponding robot joint motion trajectories through a sequence mapping model; wherein the binocular vision input information includes a binocular image pair comprising a left-view image and a right-view image, and the generation module generates the scene perception data by: generating dense depth information from the binocular image pair through a preset stereo matching algorithm, the algorithm comprising multi-scale feature extraction, matching cost volume construction, cost volume regularization, and probability-distribution-based disparity computation; executing a multi-scale spatiotemporal attention mechanism on the binocular image pair, the dense depth information, and the acquired current operation task instruction to extract task-related visual features related to the current operation task instruction, the task-related visual features comprising appearance and geometric features and three-dimensional position and posture information of the target object, and human posture trajectory information; detecting and tracking a coordinate sequence of the human dual-arm joint points in three-dimensional space through a three-dimensional human pose estimation model based on the binocular image pair and the dense depth information; and fusing the task-related visual features with the coordinate sequence of the human dual-arm joint points to construct unified scene perception data.
- 8. The motion mapping system for robot binocular vision according to claim 7, wherein the binocular vision input information further comprises binocular camera geometric parameters, and the generation module generates the dense depth information from the binocular image pair through the preset stereo matching algorithm by: performing multi-scale feature extraction on the left-view image and the right-view image to obtain feature pyramids of the binocular image pair at different resolutions; computing the matching cost between corresponding features of the left-view and right-view images within a preset disparity range based on all the feature pyramids, and constructing an initial matching cost volume; performing three-dimensional convolution regularization on the initial matching cost volume to obtain an optimized matching cost volume; computing the expected disparity value of each pixel through a sampling-Gaussian probability distribution model based on the optimized matching cost volume; and converting the expected disparity values into dense depth information of the binocular image pair according to the binocular camera geometric parameters.
- 9. A motion mapping system for robot binocular vision, the system comprising: a memory storing executable program code; and a processor coupled to the memory; wherein the processor invokes the executable program code stored in the memory to perform the motion mapping method for robot binocular vision according to any one of claims 1-6.
- 10. A computer storage medium storing computer instructions which, when invoked, perform the motion mapping method for robot binocular vision according to any one of claims 1-6.
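As an illustrative sketch only (the patent does not publish code), the step in claim 1 of obtaining dual-arm joint coordinates in three-dimensional space from the binocular pair and the dense depth map can be expressed with standard pinhole back-projection. The function name, NumPy framing, and camera parameters below are assumptions for illustration, not the patent's implementation.

```python
import numpy as np

def backproject_joints(joints_uv, depth, fx, fy, cx, cy):
    """Lift 2D joint detections (u, v) into 3D camera coordinates using the
    dense depth map from stereo matching: X = (u-cx)Z/fx, Y = (v-cy)Z/fy."""
    points = []
    for u, v in joints_uv:
        z = depth[v, u]                      # depth at the joint pixel
        points.append(((u - cx) * z / fx,
                       (v - cy) * z / fy,
                       z))
    return np.array(points)                  # (num_joints, 3)
```

Applied per frame to the 2D joints produced by the pose estimation model, this yields the coordinate sequence in three-dimensional space that claim 1 fuses into the scene perception data.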
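Claim 2's final two steps, turning the optimized cost volume into expected disparities via a probability distribution and then into depth via the camera geometry, can be sketched as follows. This is a minimal softmax-expectation stand-in, assuming a cost volume of shape (D, H, W) where lower cost means a better match; the exact "sampling-Gaussian" model of the claim is not specified in code form.

```python
import numpy as np

def disparity_expectation(cost_volume, d_max):
    """Convert a (D, H, W) matching cost volume into per-pixel expected
    disparity: softmax over the disparity axis gives P(d | pixel), and the
    expectation sum_d d * P(d) gives a sub-pixel disparity estimate."""
    logits = -cost_volume                                # low cost = high score
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=0, keepdims=True)     # P(d | pixel)
    disparities = np.arange(d_max).reshape(-1, 1, 1)     # candidate d values
    return (probs * disparities).sum(axis=0)             # E[d] per pixel

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Depth Z = f * B / d from the binocular camera geometric parameters."""
    return focal_px * baseline_m / np.maximum(disparity, eps)
```

The expectation over the distribution, rather than an argmax, is what makes the disparity (and hence the dense depth information) sub-pixel accurate and differentiable.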
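The grouped-correlation construction of the initial matching cost volume in claim 3 can be sketched as below: channels are grouped, the right-view features are shifted by each discrete disparity, and a per-group matching metric (here a mean inner product, one common choice) forms one 2D cost map per disparity, which are then stacked. The metric and the zero-fill at unmatched border columns are illustrative assumptions.

```python
import numpy as np

def groupwise_cost_volume(feat_l, feat_r, d_max, groups):
    """Initial matching cost volume by channel-grouped correlation: for each
    discrete disparity d, shift the right-view features by d pixels and take
    the mean inner product within each channel group."""
    C, H, W = feat_l.shape
    gl = feat_l.reshape(groups, C // groups, H, W)   # first grouped feature set
    gr = feat_r.reshape(groups, C // groups, H, W)   # second grouped feature set
    volume = np.zeros((groups, d_max, H, W), dtype=feat_l.dtype)
    for d in range(d_max):
        shifted = np.zeros_like(gr)
        if d == 0:
            shifted[:] = gr
        else:
            shifted[..., d:] = gr[..., :-d]          # columns without a match stay 0
        volume[:, d] = (gl * shifted).mean(axis=1)   # 2D cost map at disparity d
    return volume                                     # stacked 3D cost volume
```

Filtering this volume (claim 3's last step) and 3D-convolution regularization (claim 2) would then operate on the `(groups, d_max, H, W)` array before the disparity expectation.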
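The cross-modal attention weighting of claim 4 reduces, in its simplest single-head form, to a task-aware query vector scoring flattened spatiotemporal feature tokens. The sketch below assumes pre-computed token embeddings and omits the multi-scale and feature-enhancement machinery; it only shows how attention weights focus the features on task-relevant visual elements.

```python
import numpy as np

def task_attention(task_query, tokens):
    """Single-head cross-modal attention: a task-aware query vector (D,)
    attends over N flattened spatiotemporal feature tokens (N, D), returning
    the attention-weighted feature summary and the weights themselves."""
    d = task_query.shape[-1]
    scores = tokens @ task_query / np.sqrt(d)   # task-relevance scores
    scores = scores - scores.max()              # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()           # softmax over the N tokens
    return weights @ tokens, weights            # weighted fusion, weights
```

Tokens whose features align with the encoded task instruction receive the largest weights, so the weighted sum is dominated by task-relevant visual elements in both the spatial and temporal dimensions.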
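Claim 5's spatiotemporal graph convolution over the dual-arm joints can be sketched as one spatial aggregation over a joint adjacency matrix followed by a temporal smoothing per joint, with a final temporal pooling into a motion descriptor. The moving-average temporal step and mean pooling are simplified stand-ins for the learned temporal convolution and sequence aggregation of the claim.

```python
import numpy as np

def st_graph_conv(x, adj, w_spatial, t_kernel=3):
    """One spatiotemporal graph-convolution step.
    x: (T, V, C) joint features over T frames and V joints.
    adj: (V, V) normalized spatial adjacency between joints.
    Spatial step: neighbourhood aggregation adj @ x @ w_spatial.
    Temporal step: moving average over t_kernel frames per joint."""
    spatial = np.einsum('uv,tvc,cd->tud', adj, x, w_spatial)
    pad = t_kernel // 2
    padded = np.pad(spatial, ((pad, pad), (0, 0), (0, 0)), mode='edge')
    return np.stack([padded[t:t + t_kernel].mean(axis=0)
                     for t in range(x.shape[0])])

def aggregate_motion(features):
    """Temporal aggregation of (T, V, C) features into one dual-arm motion
    descriptor of shape (C,), representing the overall motion pattern."""
    return features.mean(axis=(0, 1))
```

Stacking several such layers with learned adjacency partitions and weights recovers the usual ST-GCN pattern of spatial graph convolution interleaved with temporal convolution.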
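Finally, claim 6's fusion and sequence mapping can be sketched end to end. Both functions below are deliberately minimal stand-ins: the norm-based softmax gate approximates the claim's attention-based weighted fusion, and a tanh projection plus linear readout stands in for the encoder-decoder sequence mapping model; all weight matrices are illustrative.

```python
import numpy as np

def fuse_features(visual, motion):
    """Attention-style weighted fusion of same-dimension visual and motion
    feature vectors: softmax gates derived from each vector's magnitude."""
    scores = np.array([np.linalg.norm(visual), np.linalg.norm(motion)])
    a = np.exp(scores - scores.max())
    a = a / a.sum()                        # two attention gates summing to 1
    return a[0] * visual + a[1] * motion   # fused feature vector

def map_to_trajectory(fused_seq, enc_w, dec_w):
    """Minimal encoder-decoder stand-in: encode each fused feature (T, D)
    into a semantic code, then decode the codes into joint angles."""
    codes = np.tanh(fused_seq @ enc_w)     # encoder: semantic representation
    return codes @ dec_w                   # decoder: (T, num_joints) trajectory
```

In the patented method the encoder-decoder would be a learned sequence model, but the data flow is the same: fused features in, a robot joint motion trajectory sequence out.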
Description
Robot binocular vision motion mapping method and system

Technical Field

The invention relates to the technical field of intelligent robot control and machine vision, and in particular to a motion mapping method and system for robot binocular vision.

Background

In the field of robotics, enabling robots to observe and learn human actions, and then autonomously complete complex operation tasks, has long been an important research direction. Traditional robot motion control relies on accurate environment models and manually programmed trajectory planning: the operator must predefine the detailed actions and parameters of every task step. Such robots execute reliably in structured environments, but lack flexibility and adaptability when facing dynamically changing or unknown unstructured environments. It is therefore important to provide a technical scheme that improves the accuracy and flexibility of robot motion mapping.

Disclosure of Invention

The invention provides a robot binocular vision motion mapping method and system, which can improve the accuracy and flexibility of robot motion mapping.
In order to solve the above technical problem, a first aspect of the invention discloses a motion mapping method for robot binocular vision, the method comprising: generating scene perception data containing three-dimensional information of a target object and of human dual-arm joint points based on acquired binocular vision input information and a current operation task instruction; extracting spatiotemporal motion features of the human dual-arm joint points from the scene perception data; and fusing the task-related visual features in the scene perception data with the spatiotemporal motion features to obtain fused features, and mapping the fused features to corresponding robot joint motion trajectories through a sequence mapping model. The binocular vision input information includes a binocular image pair comprising a left-view image and a right-view image, and generating the scene perception data comprises: generating dense depth information from the binocular image pair through a preset stereo matching algorithm, the algorithm comprising multi-scale feature extraction, matching cost volume construction, cost volume regularization, and probability-distribution-based disparity computation; executing a multi-scale spatiotemporal attention mechanism on the binocular image pair, the dense depth information, and the acquired current operation task instruction to extract task-related visual features related to the current operation task instruction, the task-related visual features comprising appearance and geometric features and three-dimensional position and posture information of the target object, and human posture trajectory information; detecting and tracking a coordinate sequence of the human dual-arm joint points in three-dimensional space through a three-dimensional human pose estimation model based on the binocular image pair and the dense depth information; and fusing the task-related visual features with the coordinate sequence of the human dual-arm joint points to construct unified scene perception data. As an optional implementation of the first aspect, the binocular vision input information further includes binocular camera geometric parameters, and generating the dense depth information from the binocular image pair through the preset stereo matching algorithm comprises: performing multi-scale feature extraction on the left-view image and the right-view image to obtain feature pyramids of the binocular image pair at different resolutions; computing the matching cost between corresponding features of the left-view and right-view images within a preset disparity range based on all the feature pyramids, and constructing an initial matching cost volume; performing three-dimensional convolution regularization on the initial matching cost volume to obtain an optimized matching cost volume; computing the expected disparity value of each pixel through a sampling-Gaussian probability distribution model based on the optimized matching cost volume; and converting the expected disparity values into dense depth information of the binocular image pair according to the binocular camera geometric parameters. As an optional implementation of the first aspect, computing the matching cost within the preset disparity range based on all the feature pyramids and constructing the initial matching cost volume includes: performing channel grouping and dimension transformation processing