CN-121733578-B - Method, device, and storage medium for robotic arm grasping
Abstract
The application relates to a method, device, and storage medium for robotic arm grasping. An initial frame of a target object, determined by a user on an interactive interface from any frame of the arm's visual video, is acquired; the boundary of the target object is extracted from that initial frame and the object is visually rendered. Based on the extracted boundary, the target object is then continuously tracked, segmented, and rendered in every frame of the visual video to obtain a segmentation result. A mapping is established between the rendered target object and the grasping operation of the robotic arm, and the mapping and the segmentation result are input into a VLA model for training. The trained VLA model identifies the specific color feature in a real-time video and drives the robotic arm to execute the grasping operation based on the identified color feature. The model generalizes well, requires no retraining data, and is simple to deploy and apply.
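The staining step at the heart of the abstract — segmenting the target and rendering it with a fixed "specific color feature" — can be sketched roughly as below. This is a minimal illustration only, assuming a binary mask produced by any off-the-shelf segmenter; `stain_target` and its parameters are hypothetical names introduced here, not part of the patent.

```python
import numpy as np

def stain_target(frame: np.ndarray, mask: np.ndarray,
                 color=(255, 0, 0), alpha=1.0) -> np.ndarray:
    """Overlay a fixed color on the pixels selected by a binary mask.

    frame : H x W x 3 uint8 image
    mask  : H x W boolean array marking the target object
    alpha : 1.0 replaces the pixels outright; smaller values blend.
    """
    stained = frame.astype(np.float32).copy()
    stained[mask] = ((1 - alpha) * stained[mask]
                     + alpha * np.array(color, np.float32))
    return stained.astype(np.uint8)

# Example: stain the centre of a uniform gray image pure red.
frame = np.full((4, 4, 3), 128, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
out = stain_target(frame, mask)
```

The stained frames, rather than language prompts, are what the claimed method feeds to the VLA model, so the color itself acts as the grasp cue.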
Inventors
- LI HONGMING
- JING ZHIYUAN
Assignees
- 天翼数字生活科技有限公司
Dates
- Publication Date
  - 2026-05-05
- Application Date
  - 2026-02-25
Claims (9)
- 1. A method of robotic arm grasping, the method comprising: collecting data relating to a single grasping action of a robotic arm and displaying the data on an interactive interface, wherein the data comprise a visual video; acquiring an initial frame of a target object determined by a user on the interactive interface from any frame of the visual video; extracting the boundary of the target object from the initial frame, visually rendering the target object, and assigning the target object a specific color feature; continuously tracking and segmenting the target object in each frame of the visual video based on the extracted boundary, and rendering the segmented target object in real time with the specific color feature to obtain a segmentation result; establishing a mapping between the specific color feature of the target object and the grasping operation of the robotic arm, and inputting the mapping and the segmentation result into a VLA model to train the VLA model; and identifying the specific color feature in a real-time video with the trained VLA model and driving the robotic arm, based on the identified color feature, to execute the grasping operation on the target object; wherein identifying the specific color feature in the real-time video with the trained VLA model and driving the robotic arm comprises: acquiring a target object selected by the user in a frame of the real-time video; segmenting and rendering the target object based on the frame region in which it is located to obtain a stained image, the stained image containing the target object with the specific color feature; and inputting the stained image into the trained VLA model, outputting a robotic arm action instruction for grasping the target object based on the specific color feature, and driving the robotic arm according to the action instruction.
- 2. The method of claim 1, wherein collecting data relating to a single grasping action of the robotic arm comprises: acquiring data generated while a user remotely controls the robotic arm to complete a single grasping action, wherein the data comprise state data of the robotic arm and visual videos, and the visual videos comprise a main-view video and a wrist-view video.
- 3. The method of claim 2, wherein acquiring the initial frame of the target object determined by the user on the interactive interface from any frame of the visual video comprises: acquiring the initial frame of the target object as determined by the user on the interactive interface from any frame of the wrist-view video in which the target object appears.
- 4. The method of claim 2, wherein inputting the mapping and the segmentation result into the VLA model comprises: time-synchronizing the state data, the main-view video, and the segmentation result obtained from the wrist-view video to obtain annotation data; and inputting the annotation data and the mapping into the VLA model, wherein the state data enable the VLA model to associate each frame of the visual video with the corresponding joint angles of the robotic arm.
- 5. The method of claim 1, wherein continuously tracking and segmenting the target object in each frame of the visual video based on the extracted boundary, and rendering the segmented target object in real time with the specific color feature, comprises: tracking the target object in real time across the frames of the visual video based on the extracted boundary, segmenting the target at its new position frame by frame, and rendering the target at its new position in each frame in real time with the specific color feature.
- 6. The method of claim 1, wherein, when a plurality of target objects are obtained, the method comprises: segmenting the target objects and tracking them independently in parallel; and adopting a time-division staining strategy in which one target object is rendered at a time and the grasping operation is performed on that rendered object, the remaining target objects being rendered and grasped in turn after it has been grasped.
- 7. The method of claim 1, wherein training the VLA model further comprises: replacing the entire background region, other than the target object, of a single frame image to generate annotation samples, wherein the annotation samples contain the same stained target object against different background features and are used to train the VLA model.
- 8. An electronic device, comprising: one or more processors; and a memory storing computer-readable instructions that, when executed, cause the processors to perform the operations of the method of any one of claims 1 to 7.
- 9. A computer readable storage medium having computer readable instructions stored thereon, wherein the computer readable instructions are executable by a processor to implement the method of any one of claims 1 to 7.
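The background-replacement augmentation of claim 7 — keeping the stained target while swapping everything else in the frame for a different background — can be sketched as follows. This is a schematic example under assumed conventions (NumPy images, a boolean target mask); `replace_background` is an illustrative name, not terminology from the patent.

```python
import numpy as np

def replace_background(frame: np.ndarray, mask: np.ndarray,
                       background: np.ndarray) -> np.ndarray:
    """Keep the (stained) target pixels; replace all others with a new background.

    frame, background : H x W x 3 uint8 images of the same shape
    mask              : H x W boolean array marking the target object
    """
    # Broadcast the mask over the color channels and pick per pixel.
    return np.where(mask[..., None], frame, background).astype(frame.dtype)

# Example: one red (stained) target pixel, new uniform dark background.
frame = np.zeros((2, 2, 3), dtype=np.uint8)
frame[0, 0] = (255, 0, 0)            # stained target pixel
mask = np.zeros((2, 2), dtype=bool)
mask[0, 0] = True
bg = np.full((2, 2, 3), 50, dtype=np.uint8)
aug = replace_background(frame, mask, bg)
```

Generating many such samples with varied backgrounds but an identical stained target is what, per claim 7, pushes the model to key on the color feature rather than the scene.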
Description
Method, device and storage medium for robotic arm grasping
Technical Field
The application relates mainly to the field of robotic engineering, and in particular to a method, device, and storage medium for robotic arm grasping.
Background
Robots generally grasp objects with a robotic arm and a two-finger gripper: the arm is built from several motors capable of precise rotation, and the gripper is mounted at the end of the arm and can open and close. Cameras are mounted at the main viewpoint and at the wrist of the arm to observe the overall scene and the grasping state of the gripper. Given a specified target, the grasping algorithm analyses the camera images, coordinates the rotation of the motors to bring the end of the arm to a position from which the target can be grasped, and then closes the gripper to complete the grasp. Existing robotic arm grasping algorithms fall into two main classes. The first is the traditional model-based grasping algorithm, which completes the task in two steps using two different algorithms: an end-pose estimation algorithm predicts the end pose from which the arm can grasp the target, and an inverse kinematics (IK) algorithm then moves the arm into that pose before the gripper is closed. The second is the deep-learning-based vision-language-action (VLA) model method: a target object is first annotated with a language prompt, a data collector then manually controls the arm to grasp the object while the arm's motor state data are recorded, and the VLA model is trained on these data.
After training, the user commands the VLA model to grasp a specified object directly with a language prompt (e.g., "grasp the banana"). The VLA model needs neither end-pose estimation nor an inverse kinematics algorithm; it directly controls the rotation angles of all the arm's motors to complete the grasp. However, in the first class of methods, the end-pose estimation algorithm depends heavily on high-precision 3D point-cloud image quality, and the inverse kinematics algorithm requires a professional robotics engineer to tune it for the actual scene and arm, which makes large-scale deployment difficult. The existing VLA method, meanwhile, often fails to follow the instruction and grasps the wrong target, a failure known as posterior collapse. For example, if red objects are over-represented in the training data, the trained model tends to grasp red objects regardless of the user's prompt.
Disclosure of Invention
The application aims to provide a method, device, and storage medium for robotic arm grasping that solve the prior-art problems of grasping algorithms being difficult to deploy and apply, instructions not being followed, and wrong objects being grasped.
According to one aspect of the present application, there is provided a method of robotic arm grasping, the method comprising: collecting data relating to a single grasping action of a robotic arm and displaying the data on an interactive interface, wherein the data comprise a visual video; acquiring an initial frame of a target object determined by a user on the interactive interface from any frame of the visual video; extracting the boundary of the target object from the initial frame, visually rendering the target object, and assigning the target object a specific color feature; continuously tracking and segmenting the target object in each frame of the visual video based on the extracted boundary, and rendering the segmented target object in real time with the specific color feature to obtain a segmentation result; establishing a mapping between the specific color feature of the target object and the grasping operation of the robotic arm, and inputting the mapping and the segmentation result into a VLA model to train the VLA model; and identifying the specific color feature in a real-time video with the trained VLA model and driving the robotic arm, based on the identified color feature, to execute the grasping operation on the target object. Optionally, collecting the data relating to a single grasping action of the robotic arm comprises: acquiring data generated while a user remotely controls the robotic arm to complete a single grasping action, wherein the data comprise state data and visual videos of