CN-121267892-B - Training method for a visual language action model and robotic arm operating device
Abstract
The invention provides a training method for a visual language action model and a robotic arm operating device. A multi-modal training data set is acquired, comprising a plurality of first multi-modal training data and a plurality of second multi-modal training data; each item of first and second multi-modal training data comprises a robotic arm operation video, text description information, and true joint angle information corresponding to each video frame. Based on the multi-modal training data set, a first visual language action model to be trained undergoes multiple rounds of iterative training until a preset end-of-training condition is met, yielding a target visual language action model. In each round of iterative training, policy-failure-associated video frames are determined, and their frame weights are increased in the next round of iterative training.
Inventors
- Zhu Zheng
- Wang Xiaofeng
- Huang Guan
- Dong Zhegao
Assignees
- 北京极佳视界科技有限公司
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2025-09-22
Claims (12)
- 1. A method for training a visual language action model, comprising: acquiring a multi-modal training data set, wherein the multi-modal training data set comprises a plurality of first multi-modal training data and a plurality of second multi-modal training data; each of the first multi-modal training data and each of the second multi-modal training data comprises a robotic arm operation video, text description information, and true joint angle information of a robotic arm corresponding to each video frame in the robotic arm operation video; the text description information describes the video content of the robotic arm operation video; the plurality of first multi-modal training data are acquired from different robotic arm viewing angles; and the text description information of the second multi-modal training data is obtained by adjusting target object attribute information and/or background information in the text description information of the first multi-modal training data; and performing multiple rounds of iterative training on a first visual language action model to be trained based on the multi-modal training data set until a preset end-of-training condition is met, to obtain a target visual language action model from the first visual language action model to be trained, wherein in each round of iterative training, policy-failure-associated video frames are determined based on the predicted policy inference results output by the first visual language action model to be trained for each video frame of each robotic arm operation video, and the frame weights of the policy-failure-associated video frames are increased in the next round of iterative training (an illustrative sketch of this reweighting loop follows the claims); the predicted policy inference result for a video frame comprises a predicted joint angle sequence of the robotic arm, the predicted joint angle sequence comprising predicted joint angle information for that video frame and for future video frames predicted from it; and a policy-failure-associated video frame is a video frame that causes the first visual language action model to be trained to generate a policy inference result leading to task execution failure.
- 2. The method of claim 1, wherein the text description information comprises foreground information, background information, and an interaction relation of the robotic arm operation video; the foreground information comprises target object attribute information, and the interaction relation represents the process and goal of the policy inference result executed by the robotic arm on the target object; each first multi-modal training data corresponds to at least one second multi-modal training data, and corresponding first and second multi-modal training data differ in target object attribute information and/or background information.
- 3. The method of claim 1, wherein the robotic arm operation video of the second multi-modal training data is determined based on the robotic arm operation video of the first multi-modal training data and the text description information of the second multi-modal training data, by: extracting a depth map for each video frame in the robotic arm operation video of the first multi-modal training data to obtain a depth map sequence; and obtaining the robotic arm operation video of the second multi-modal training data using a video generation model, based on the depth map sequence and the text description information of the second multi-modal training data (see the depth-conditioned generation sketch after the claims).
- 4. The method of claim 1, wherein constructing the multi-modal training data set based on at least one first multi-modal training data and at least one second multi-modal training data comprises: for at least one second multi-modal training data, extracting depth maps of all video frames in its robotic arm operation video to obtain a test depth map sequence; determining a depth map restoration degree for the robotic arm operation video based on the test depth map sequence and a ground-truth depth map sequence, the ground-truth depth map sequence being the depth map sequence used to generate the robotic arm operation video of that second multi-modal training data; determining, based on the depth map restoration degrees of the robotic arm operation videos, which second multi-modal training data to use in constructing the multi-modal training data set (a filtering sketch follows the claims); and constructing the multi-modal training data set from the selected second multi-modal training data and at least one first multi-modal training data.
- 5. The method of claim 4, wherein constructing the multi-modal training data set further comprises: generating a plurality of initial data sets, each with a different ratio of the number of first multi-modal training data to the number of second multi-modal training data; training a second visual language action model to be trained on each initial data set, to obtain a plurality of test visual language action models; and selecting the multi-modal training data set from the plurality of initial data sets based on the policy performance of each test visual language action model, where policy performance represents how well the robotic arm, driven by the test visual language action model, completes tasks (a ratio-search sketch follows the claims).
- 6. The method according to any one of claims 1-5, wherein determining policy-failure-associated video frames based on the predicted policy inference results output by the first visual language action model to be trained for each video frame of each robotic arm operation video comprises: for each video frame, determining an action deviation score based on the predicted joint angle sequence of the video frame and the true joint angle information of the corresponding video frames; determining a joint angular acceleration from the predicted joint angle sequence of the video frame, and determining an action smoothness score based on the joint angular acceleration; determining a joint angle compliance score based on the predicted joint angle sequence of the video frame and preset joint angle threshold information; and determining the policy-failure-associated video frames among the video frames using a preset reward function over the action deviation score, action smoothness score, and joint angle compliance score of each video frame (see the reward-function sketch after the claims).
- 7. The method according to claim 3, wherein the video generation model is obtained by: acquiring a video model training data set, wherein the video model training data set comprises a plurality of first training data and a plurality of second training data; each first training data and each second training data comprises an annotated depth map sequence and shared description information corresponding to that depth map sequence, the shared description information comprising target object attribute information, background information, and interaction information of the annotated depth map sequence; the plurality of annotated depth map sequences are obtained by annotating robotic arm operation videos acquired from different robotic arm viewing angles; each first training data corresponds to at least one second training data, and corresponding first and second training data share the same annotated depth map sequence but differ in target object attribute information and/or background information (a pairing sketch follows the claims); and performing multiple rounds of iterative training on a video generation model to be trained based on the video model training data set until a preset model training condition is met, to obtain the video generation model from the video generation model to be trained.
- 8. A robotic arm operating device, applied to a robot, the robotic arm operating device comprising a robotic arm, a robotic arm control device, and a target visual language action model trained according to the method of any one of claims 1-7; the target visual language action model performs inference based on an input operation instruction and a video of the robotic arm's working environment, and outputs a policy inference result comprising a joint angle sequence of the robotic arm; and the robotic arm control device controls the robotic arm to execute the policy inference result.
- 9. A training device for a visual language action model, comprising: a first training data acquisition module configured to acquire a multi-modal training data set, wherein the multi-modal training data set comprises a plurality of first multi-modal training data and a plurality of second multi-modal training data; each first multi-modal training data and each second multi-modal training data comprises a robotic arm operation video, text description information, and true joint angle information of a robotic arm corresponding to each video frame in the robotic arm operation video; the text description information describes the video content of the robotic arm operation video; the first multi-modal training data are acquired from different robotic arm viewing angles; and the text description information of the second multi-modal training data is obtained by adjusting target object attribute information and/or background information in the text description information of the first multi-modal training data; and a first model training module configured to perform multiple rounds of iterative training on a first visual language action model to be trained based on the multi-modal training data set until a preset end-of-training condition is met, to obtain a target visual language action model from the first visual language action model to be trained, wherein in each round of iterative training, policy-failure-associated video frames are determined based on the predicted policy inference results output by the first visual language action model to be trained for each video frame of each robotic arm operation video, and the frame weights of the policy-failure-associated video frames are increased in the next round of iterative training; the predicted policy inference result for a video frame comprises a predicted joint angle sequence of the robotic arm, the predicted joint angle sequence comprising predicted joint angle information for that video frame and for future video frames predicted from it; and a policy-failure-associated video frame is a video frame that causes the first visual language action model to be trained to generate a policy inference result leading to task execution failure.
- 10. An electronic device, comprising: a memory for storing a computer program; and a processor for executing the computer program stored in the memory, wherein the computer program, when executed, implements the method of any one of claims 1-7.
- 11. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-7.
- 12. A computer program product comprising computer program instructions which, when executed by a processor, implement the method of any one of claims 1-7.
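The following is a minimal sketch of the failure-driven frame reweighting loop of claims 1 and 9, assuming a PyTorch model that maps per-frame inputs to a joint-angle sequence. Every name here (`train_with_frame_reweighting`, the MSE objective, `boost`, `fail_frac`, the use of a `TensorDataset`) is an illustrative assumption, not the patent's actual implementation.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, WeightedRandomSampler

def train_with_frame_reweighting(model, dataset, rounds=5, boost=2.0, fail_frac=0.1):
    """Multi-round training; frames judged policy-failure-associated at the
    end of a round get a larger sampling weight in the next round."""
    weights = torch.ones(len(dataset))  # one sampling weight per video frame
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(rounds):
        sampler = WeightedRandomSampler(weights, num_samples=len(dataset))
        for feats, gt_seq in DataLoader(dataset, batch_size=32, sampler=sampler):
            loss = nn.functional.mse_loss(model(feats), gt_seq)
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Score every frame once per round; the worst `fail_frac` fraction is
        # treated as policy-failure-associated and upweighted for next round.
        with torch.no_grad():
            feats, gt_seq = dataset.tensors  # assumes a TensorDataset
            per_frame_err = ((model(feats) - gt_seq) ** 2).flatten(1).mean(dim=1)
        k = max(1, int(fail_frac * len(dataset)))
        weights[per_frame_err.topk(k).indices] *= boost
    return model
```

Here the prediction error stands in for the patent's reward-based failure criterion; claim 6's reward function (sketched further below) could replace the error score directly.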
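As an illustration of claim 3's pipeline, the sketch below first extracts a per-frame depth map sequence and then conditions a video generator on it. The Hugging Face `transformers` depth-estimation pipeline is used only as a stand-in for the patent's unspecified depth extractor, and `depth_conditioned_video_model` is a purely hypothetical interface for the video generation model; neither is named by the patent.

```python
import cv2
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline("depth-estimation")  # stand-in monocular depth extractor

def extract_depth_sequence(video_path):
    """One depth map per frame of the first multi-modal data's arm video."""
    cap = cv2.VideoCapture(video_path)
    depths = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        depths.append(depth_estimator(Image.fromarray(rgb))["depth"])
    cap.release()
    return depths

# Hypothetical generation step: synthesize a new arm-operation video whose
# geometry follows the depth sequence while appearance follows the edited text.
# new_video = depth_conditioned_video_model(extract_depth_sequence(path), edited_text)
```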
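A sketch of claim 4's restoration-degree filter follows. The patent does not define the metric; a mean relative depth error mapped into [0, 1] is assumed here, and `threshold` along with the dictionary keys are illustrative.

```python
import numpy as np

def depth_restoration_degree(test_seq, gt_seq, eps=1e-6):
    """Per-video score in [0, 1]; 1 means the generated video's depth
    perfectly reproduces the ground-truth depth map sequence."""
    test = np.stack(test_seq).astype(np.float32)
    gt = np.stack(gt_seq).astype(np.float32)
    rel_err = np.abs(test - gt) / (np.abs(gt) + eps)
    return float(1.0 / (1.0 + rel_err.mean()))

def select_second_modal_data(candidates, threshold=0.8):
    """Keep only generated samples whose depth is faithfully restored."""
    return [c for c in candidates
            if depth_restoration_degree(c["test_depths"], c["gt_depths"]) >= threshold]
```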
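Claim 5's ratio search can be sketched as below; `train_fn` and `eval_fn` are injected stand-ins for training a second (test) visual language action model and measuring its task success rate, since the patent does not fix either procedure, and the candidate ratios and set size are arbitrary.

```python
import random

def build_initial_set(first_data, second_data, ratio, size=1000):
    """ratio = count of first multi-modal samples per second multi-modal sample."""
    n_first = round(size * ratio / (ratio + 1))
    return (random.sample(first_data, n_first)
            + random.sample(second_data, size - n_first))

def pick_training_set(first_data, second_data, train_fn, eval_fn,
                      ratios=(3.0, 1.0, 1 / 3)):
    """Train one test model per mixing ratio; keep the initial data set whose
    model drives the arm to the highest task success rate (policy performance)."""
    best_set, best_success = None, -1.0
    for r in ratios:
        candidate = build_initial_set(first_data, second_data, r)
        success = eval_fn(train_fn(candidate))
        if success > best_success:
            best_set, best_success = candidate, success
    return best_set
```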
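The three scores and the reward of claim 6, sketched with NumPy. The exact normalizations, the weights `w`, and the bottom-quantile selection rule are assumptions; the patent only names the score types and a preset reward function.

```python
import numpy as np

def action_deviation_score(pred_seq, gt_seq):
    # Mean L2 distance between predicted and true joint angles (lower is better).
    return float(np.linalg.norm(pred_seq - gt_seq, axis=-1).mean())

def smoothness_score(pred_seq, dt=1.0):
    # Second finite difference of the joint-angle sequence approximates joint
    # angular acceleration; large values indicate jerky motion.
    acc = np.diff(pred_seq, n=2, axis=0) / dt ** 2
    return float(np.abs(acc).mean())

def compliance_score(pred_seq, low, high):
    # Fraction of predicted angles violating the preset joint-angle limits.
    return float(((pred_seq < low) | (pred_seq > high)).mean())

def frame_reward(pred_seq, gt_seq, low, high, w=(1.0, 0.1, 10.0)):
    # Lower reward -> more likely a policy-failure-associated frame.
    return -(w[0] * action_deviation_score(pred_seq, gt_seq)
             + w[1] * smoothness_score(pred_seq)
             + w[2] * compliance_score(pred_seq, low, high))

def failure_associated_frames(rewards, quantile=0.1):
    # Select the frames whose reward falls in the bottom quantile.
    cut = np.quantile(rewards, quantile)
    return [i for i, r in enumerate(rewards) if r <= cut]
```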
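Finally, a sketch of the paired data construction behind claim 7's video model training set: each first training item spawns second items that reuse its annotated depth map sequence while the object attributes and/or background of the shared description are edited. `edit_description` is a hypothetical helper (e.g. an LLM rewriting prompt), not something the patent names.

```python
def build_video_model_pairs(first_items, edit_description, n_variants=2):
    """Each first-training item yields >= 1 second item sharing its annotated
    depth map sequence but with edited object attributes / background text."""
    pairs = []
    for item in first_items:
        seconds = [{"depth_seq": item["depth_seq"],
                    "description": edit_description(item["description"])}
                   for _ in range(n_variants)]
        pairs.append((item, seconds))
    return pairs
```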
Description
Training method of a visual language action model and robotic arm operating device

Technical Field

The disclosure relates to deep learning technology, and in particular to a training method for a visual language action model and a robotic arm operating device.

Background

A Vision-Language-Action (VLA) model is a decision-making system for a robot whose core function is to convert the robot's observations into actions. The visual language action model is a core technical component in the field of robot Imitation Learning (IL). Its core value lies in providing end-to-end "perception-understanding-execution" capability for IL by deeply coupling the robot's environment perception (visual modality, e.g. operation scene video) and task semantic understanding (language modality, e.g. natural language instructions) with action sequence generation (action modality, e.g. robot joint motion trajectories) through a cross-modal fusion architecture. However, mainstream visual language action models are generally trained under an isolated "generation-as-endpoint" paradigm: the model only completes a unidirectional generation task from visual-language input to action sequence output, and no feedback mechanism with the downstream IL policy is formed during training. As a result, robots driven by such visual language action models have a low success rate when actually executing tasks.

Disclosure of Invention

To solve the above technical problems, embodiments of the present disclosure provide a training method for a visual language action model and a robotic arm operating device. According to one aspect of the disclosed embodiments, a training method for a visual language action model is provided. A multi-modal training data set is acquired, comprising a plurality of first multi-modal training data and a plurality of second multi-modal training data; each of the first and second multi-modal training data comprises a robotic arm operation video, text description information, and true joint angle information of the robotic arm corresponding to each video frame in the video, the text description information describing the video content of the robotic arm operation video. Based on the multi-modal training data set, a first visual language action model to be trained undergoes multiple rounds of iterative training until a preset end-of-training condition is met, yielding a target visual language action model. In each round of iterative training, policy-failure-associated video frames are determined based on the predicted policy inference results output by the model for each video frame of each robotic arm operation video, and the frame weights of the policy-failure-associated video frames are increased in the next round. The predicted policy inference result for a video frame comprises a predicted joint angle sequence of the robotic arm, containing predicted joint angle information for that frame and for future frames predicted from it; a policy-failure-associated video frame is one that causes the model to generate a policy inference result leading to task execution failure.
In another aspect of the disclosed embodiments, a robotic arm operating device is provided and applied to a robot. The device comprises a robotic arm, a robotic arm control device, and the target visual language action model. The target visual language action model performs inference based on an input operation instruction and a video of the robotic arm's working environment, and outputs a policy inference result comprising a joint angle sequence of the robotic arm; the robotic arm control device controls the robotic arm to execute the policy inference result (a minimal inference-loop sketch follows below). In yet another aspect of the disclosed embodiments, a training device for a visual language action model is provided, including a first training data acquisition module configured to acquire a multi-modal training data set, where the multi-modal training data set includes a plurality of first multi-modal training data and a plurality of second multi-modal training data, each including a robotic arm operation video, text description information, and true joint angle information of the robotic arm corresponding to each video frame in the video, the text description information describing the video content; and a first model training module configured to perform multiple rounds of iterative training on a first visual language action model to be trained based on the multi-modal training data set until a preset end-of-training condition is met.
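A minimal sketch of that device-side loop, assuming hypothetical `camera`, `controller`, and `vla_model` interfaces; none of these names or signatures come from the patent.

```python
def run_manipulation(vla_model, camera, controller, instruction, horizon=16):
    """Closed loop: observe the working environment -> infer a joint-angle
    sequence (the policy inference result) -> the controller executes it."""
    while not controller.task_done():
        frames = camera.read_clip()                       # working-environment video
        joint_seq = vla_model.infer(frames, instruction)  # joint angle sequence
        for angles in joint_seq[:horizon]:
            controller.move_to_joint_angles(angles)       # execute predicted angles
```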