CN-122023506-A - Robot grasping pose generation method and device
Abstract
The application discloses a robot grasping pose generation method and device. A grasping task instruction in natural-language form is acquired and parsed by a large language model (LLM) into structured task parameters containing target object identification information and grasping constraint information. Open-vocabulary object detection is performed on the image based on the identification information and combined with pixel-level segmentation to obtain a target region mask. A target point cloud is generated from the mask, the depth map, and the camera intrinsics, and is processed by depth completion or denoising, outlier removal, and threshold determination filtering. The processed target point cloud is input into a grasp pose prediction network, which outputs a plurality of candidate six-degree-of-freedom grasp poses. Constraint screening and collision detection are performed according to the grasping constraints, and the collision-free candidate with the highest confidence is selected and output as the optimal grasp pose. By propagating the task constraints through the target extraction, point cloud construction, and pose screening stages, and by combining point cloud processing with collision verification, a six-degree-of-freedom grasp pose that satisfies the constraints and is executable is obtained.
Inventors
- LV YINGMING
- SU PENG
- YUAN WENPING
- GUO LUGANG
- CHENG YUANBIN
- JIANG QILI
- FENG YI
- WANG TING
Assignees
- Beijing Academy of Science and Technology (北京市科学技术研究院)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-28
Claims (10)
- 1. A robot grasping pose generation method, the method comprising: acquiring a grasping task instruction in natural-language form; performing task parsing on the grasping task instruction based on a large language model (LLM) to generate structured task parameters, wherein the structured task parameters comprise at least target object identification information and grasping constraint information, and the grasping constraint information comprises at least one of a grasping direction constraint, a clamping constraint, and a forbidden-zone constraint; acquiring image data and a depth map of a target scene, performing open-vocabulary object detection on the image data based on the target object identification information to obtain a detection result for the target object, and performing pixel-level segmentation on the image data based on the detection result to obtain a target region mask; projecting pixel points of the depth map into three-dimensional space based on the target region mask, the depth map, and the camera intrinsics to generate a target point cloud, and performing point cloud filtering on the target point cloud, wherein the point cloud filtering comprises at least depth completion or denoising, outlier removal, threshold determination filtering, and sampling unification; inputting the filtered target point cloud into a grasp pose prediction network and outputting a plurality of candidate six-degree-of-freedom grasp poses; performing constraint screening on the candidate six-degree-of-freedom grasp poses according to the grasping constraint information, and performing collision detection on the candidate poses that pass the constraint screening; and selecting, from the candidate six-degree-of-freedom grasp poses that pass the collision detection, the candidate with the highest confidence as the optimal grasp pose of the target object for output, wherein the optimal grasp pose comprises position parameters, orientation parameters, jaw width, confidence, and/or a grasp type label.
- 2. The method of claim 1, wherein the task parsing comprises performing retrieval augmentation on the grasping task instruction, the retrieval augmentation comprising searching a preset knowledge base for information related to the grasping task instruction and inputting the search results to the LLM as prompt context, wherein the knowledge base comprises one or more of an object attribute base, a grasping strategy base, an operation history base, a scene space description base, and a safety constraint base.
- 3. The method of claim 1, wherein the structured task parameters are output as a set of preset fields, the field set including a target name field, a target attribute field, a grasping constraint field, a forbidden-zone field, and/or a preferred grasp type field, wherein the target attribute field includes one or more of color, texture, shape, and size.
- 4. The method of claim 3, wherein, when performing open-vocabulary object detection, the target name field and the target attribute field are combined to generate a text prompt that is input to an open-vocabulary object detection model as the detection condition, and wherein a detection box and/or candidate region output by the open-vocabulary object detection model is used as input to the pixel-level segmentation to obtain the target region mask.
- 5. The method of claim 1, wherein, after the target region mask is obtained, morphological dilation is performed on the target region mask to obtain a dilated mask, and projection points in three-dimensional space are screened based on the dilated mask to generate the target point cloud.
- 6. The method of claim 1, wherein projecting the pixel points into three-dimensional space comprises mapping pixel coordinates and their corresponding depth values to three-dimensional coordinates based on a pinhole camera model in combination with the camera intrinsics, and mapping the pixel points covered by the target region mask to obtain a three-dimensional point set as the target point cloud (a back-projection sketch follows the claims).
- 7. The method of claim 1, wherein the point cloud filtering comprises: performing filtering, hole filling, and/or morphological closing on the depth map to obtain a completed depth map; performing statistical outlier removal and/or radius outlier removal on the target point cloud; and performing voxel downsampling, uniform sampling, and/or random sampling on the target point cloud to unify the number of points to a preset range (a filtering sketch follows the claims).
- 8. The method of claim 1, wherein the threshold determination filtering comprises determining a depth threshold interval based on the set of depth values within the target region mask and/or the distance distribution of the target point cloud, and retaining only three-dimensional points that fall within the depth threshold interval, wherein determining the depth threshold interval comprises: determining it from the main peak position of a histogram of the depth distribution combined with a preset bandwidth; determining it from quantile intervals of the depth distribution; or clustering the three-dimensional points and determining it from the distance range of the largest connected cluster (a histogram-based sketch follows the claims).
- 9. The method of claim 1, wherein each candidate six-degree-of-freedom grasp pose comprises at least a jaw center position, a jaw orientation, a jaw opening width, and a confidence; candidate poses whose angle between the jaw approach direction and a preset reference direction exceeds an angle threshold are screened out based on the grasping direction constraint; candidate poses whose jaw opening width does not satisfy a preset width range are screened out based on the clamping constraint; candidate poses whose jaw center position and/or jaw envelope intersects the spatial region of the forbidden zone are screened out based on the forbidden-zone constraint; and the collision detection performs collision judgment based at least on a jaw geometric model corresponding to the candidate pose together with an environmental obstacle model and/or the occupied region of the target point cloud (a screening sketch follows the claims).
- 10. A robot grasping pose generation device, the device comprising: an instruction acquisition module for acquiring a grasping task instruction in natural-language form; a task parsing module for performing task parsing on the grasping task instruction based on a large language model (LLM) and generating structured task parameters comprising target object identification information and grasping constraint information; a target recognition and segmentation module for performing open-vocabulary object detection and pixel-level segmentation on the image data based on the target object identification information to obtain a target region mask; a point cloud generation and processing module for projecting pixel points into three-dimensional space based on the target region mask, the depth map, and the camera intrinsics to generate a target point cloud, and performing point cloud filtering comprising depth completion or denoising, outlier removal, threshold determination filtering, and sampling unification on the target point cloud; a pose prediction module for inputting the processed target point cloud into a grasp pose prediction network to output a plurality of candidate six-degree-of-freedom grasp poses; and a screening output module for performing constraint screening on the candidate six-degree-of-freedom grasp poses according to the grasping constraint information, performing collision detection, and selecting, from the candidates that pass the collision detection, the pose with the highest confidence as the optimal grasp pose for output.
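The following illustrative sketches are not part of the claims. First, a minimal sketch of the mask dilation and pinhole back-projection of claims 5 and 6, in Python with NumPy and OpenCV; the millimetre depth scale, the 5-pixel dilation kernel, and the function name are assumptions for illustration, not values from the application:

```python
import cv2
import numpy as np

def mask_to_point_cloud(depth_mm, mask, fx, fy, cx, cy, dilate_px=5):
    """Back-project masked depth pixels into 3-D camera coordinates
    with the pinhole model (claims 5 and 6 sketch)."""
    # Morphological dilation of the target region mask (claim 5),
    # so points near the object boundary are not lost.
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    dilated = cv2.dilate(mask, kernel, iterations=1)

    # Valid pixels: inside the dilated mask and with nonzero depth.
    v, u = np.nonzero((dilated > 0) & (depth_mm > 0))
    z = depth_mm[v, u].astype(np.float32) / 1000.0  # assumed mm -> m

    # Pinhole back-projection: X = (u - cx) z / fx, Y = (v - cy) z / fy.
    x = (u.astype(np.float32) - cx) * z / fx
    y = (v.astype(np.float32) - cy) * z / fy
    return np.stack([x, y, z], axis=1)  # (N, 3) target point cloud
```

Here fx, fy, cx, and cy are the camera intrinsics of the pinhole model; screening by the dilated mask keeps boundary points that a tight mask would discard.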
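Next, a sketch of the outlier removal and sampling unification of claim 7, assuming the Open3D library; the neighbour count, standard-deviation ratio, voxel size, and target point count are illustrative parameters:

```python
import numpy as np
import open3d as o3d

def filter_point_cloud(points, target_points=2048):
    """Statistical outlier removal, voxel downsampling, then random
    sampling to unify the point count (claim 7 sketch)."""
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)

    # Statistical outlier removal: drop points whose mean neighbour
    # distance deviates strongly from the global statistics.
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

    # Voxel downsampling to a roughly uniform spatial density.
    pcd = pcd.voxel_down_sample(voxel_size=0.005)  # 5 mm voxels

    # Random sampling to unify the number of points to a preset value.
    pts = np.asarray(pcd.points)
    if len(pts) > target_points:
        idx = np.random.choice(len(pts), target_points, replace=False)
        pts = pts[idx]
    return pts
```

Radius outlier removal (pcd.remove_radius_outlier) and uniform sampling could be substituted under the same claim.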
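A sketch of the histogram-main-peak variant of the threshold determination filtering in claim 8; the bin count and preset bandwidth are assumed values:

```python
import numpy as np

def depth_threshold_interval(depths, bins=64, bandwidth=0.08):
    """Depth threshold interval from the main peak of the depth
    histogram plus a preset bandwidth (claim 8, first variant)."""
    hist, edges = np.histogram(depths, bins=bins)
    peak = int(np.argmax(hist))                     # main peak bin
    center = 0.5 * (edges[peak] + edges[peak + 1])  # bin centre
    return center - bandwidth, center + bandwidth

def threshold_filter(points, **kwargs):
    """Retain only 3-D points whose depth (z) lies in the interval."""
    lo, hi = depth_threshold_interval(points[:, 2], **kwargs)
    keep = (points[:, 2] >= lo) & (points[:, 2] <= hi)
    return points[keep]
```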
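Finally, a sketch of the constraint screening and highest-confidence selection of claims 1 and 9; the Grasp fields, the downward reference direction, the width range, and the axis-aligned forbidden-zone box are assumptions, and full collision detection against a gripper/obstacle model is omitted:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Grasp:
    position: np.ndarray   # (3,) jaw centre position
    approach: np.ndarray   # (3,) unit jaw approach direction
    width: float           # jaw opening width in metres
    confidence: float      # network confidence score

def screen_and_select(grasps, ref_dir=np.array([0.0, 0.0, -1.0]),
                      max_angle_deg=45.0, width_range=(0.0, 0.085),
                      forbidden_aabb=None):
    """Apply direction, clamping, and forbidden-zone constraints
    (claim 9), then return the surviving grasp with the highest
    confidence (claim 1, final step)."""
    survivors = []
    for g in grasps:
        # Direction constraint: angle between approach and reference.
        cosang = np.clip(np.dot(g.approach, ref_dir), -1.0, 1.0)
        if np.degrees(np.arccos(cosang)) > max_angle_deg:
            continue
        # Clamping constraint: jaw width inside the preset range.
        if not (width_range[0] <= g.width <= width_range[1]):
            continue
        # Forbidden-zone constraint: jaw centre outside an AABB.
        if forbidden_aabb is not None:
            lo, hi = forbidden_aabb
            if np.all(g.position >= lo) and np.all(g.position <= hi):
                continue
        survivors.append(g)
    return max(survivors, key=lambda g: g.confidence) if survivors else None
```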
Description
Robot grasping pose generation method and device
Technical Field
The application relates to the technical field of robot grasping pose generation, and in particular to a method and a device for generating a robot grasping pose.
Background
With the development of application scenarios such as industrial automation, warehouse logistics, and service robots, the demand for robots to perform pick-and-place tasks in unstructured or semi-structured environments is increasing. To achieve autonomous grasping, a robot usually needs to integrate sensing information such as vision and depth to recognize and localize a target object, and to compute a grasp pose to drive the end effector to complete the clamping action. As a key intermediate result linking perception to execution, the accuracy and executability of the grasp pose (e.g., the six-degree-of-freedom grasp pose) directly affect the grasping success rate and system stability.
Prior-art grasp pose generation generally follows a perception, reconstruction, pose generation, and feasibility verification pipeline: an image is acquired by an RGB or RGB-D camera; a target region is obtained by object detection/segmentation; a target point cloud is generated by projection based on the depth map and the camera intrinsics; the point cloud is preprocessed by filtering, completion, outlier removal, downsampling, depth/space clipping, and the like; candidate grasp poses are output by analytic geometry or a grasp detection network; and executable poses are selected by combining rules such as a confidence threshold, the jaw opening range, a safe distance from the tabletop, collision detection, and kinematic reachability checks.
However, in practical applications, natural-language instructions are parsed by templates or keywords, and fine-grained constraints such as "grasp from above, avoid squeezing, avoid the forbidden zone" are difficult to express and extract reliably, so pose selection deviates from the task intent. Factors such as occlusion and stacking, background similarity, and reflective or transparent materials easily make detection or segmentation unstable, so the point cloud is contaminated by background or key surfaces are missing. Point cloud preprocessing often adopts fixed thresholds and empirical parameters, so over-clipping or residual interference easily occurs when the target distance, noise level, or target size changes, causing fluctuation in the predicted grasp pose and even output of unreachable or high-collision-risk poses, which affects grasping efficiency and stability.
In the related art, patent publication No. CN 118254169A discloses a sparse-convolutional-neural-network-based unordered grasping method for a mechanical arm, a storage medium, and electronic equipment. The method uses point cloud data generated in a simulation environment to construct a PointGrasp-Net sparse convolutional neural network, adopts a repeated U-Net structure and a channel attention mechanism to enhance feature extraction and fusion, and the network directly outputs a predicted grasp pose. Patent publication No. CN 116645636A discloses a visual grasp detection method based on a convolutional neural network, which takes an RGB image and a depth map as input, divides the CNN network into two parallel branches that predict the target detection box and the grasp configuration respectively, and establishes the joint association between the target and the grasp pose by computing the correlation matrix of the deep features of the two branches, thereby screening out the final grasp pose.
Therefore, in the process of generating a grasp pose for a target object based on multimodal perception, insufficient extraction and expression of task semantic constraints, unstable target recognition and segmentation in open scenes, and poor executability of grasp poses caused by point cloud reconstruction and filtering relying on fixed thresholds are problems urgently needing to be solved.
Disclosure of Invention
The application provides a robot grasping pose generation method and device, aiming to solve the prior-art problems of insufficient extraction and expression of task semantic constraints, unstable target recognition and segmentation under occlusion and in open scenes, and poor executability of grasp poses caused by point cloud reconstruction and filtering relying on fixed thresholds in the process of generating the grasp pose of a target object based on multimodal perception.
In a first aspect, a robot grasping pose generation method comprises: acquiring a grasping task instruction in natural-language form; and performing task parsing on the grasping task instruction based on a large language model (LLM) to generate structured task parameters comprising target object identification information and grasping constraint information.
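For illustration only, the structured task parameters of the first aspect (see also claims 1 and 3) might take a form such as the following; the field names and values are hypothetical, since the application does not fix a concrete serialization:

```python
# Hypothetical structured task parameters parsed by the LLM from
# "grasp the red mug from above, avoid the area near the laptop":
task_params = {
    "target_name": "mug",                   # target name field
    "target_attributes": {"color": "red"},  # target attribute field
    "grasp_constraints": {
        "direction": "top_down",            # grasping direction constraint
        "max_width_m": 0.08,                # clamping constraint
    },
    "forbidden_zone": {                     # forbidden-zone field (AABB)
        "min": [0.40, -0.10, 0.00],
        "max": [0.70, 0.20, 0.30],
    },
    "preferred_grasp_type": "pinch",        # preferred grasp type field
}
```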