
CN-122008201-A - Lightweight visual language action system oriented to robot operation

CN122008201A

Abstract

The invention discloses a lightweight visual language action system oriented to robot operation, which comprises a training-free lightweight architecture, a parameterized action-step control framework and six basic action primitives, in which mobile-CPU-level computing requirements are met through the modular integration of FastSAM segmentation and GPT-4V reasoning. Through intelligent modular integration of pre-trained components, the robot operation process is decomposed into a plurality of mutually decoupled functional modules such as visual perception, visual language reasoning, action planning and robot execution. In the visual perception stage, the lightweight zero-shot segmentation model FastSAM is used to segment the scene image and obtain mask information for the multiple target objects in the scene; the segmented target regions are post-processed and screened in combination with the depth image data to extract the spatial center position and posture features of the target objects, thereby realizing three-dimensional pose estimation of the scene objects and providing a reliable spatial perception basis for subsequent operation.
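
The mask post-processing and screening step summarized above can be pictured with a short illustrative sketch (not part of the patent text): assuming FastSAM has already produced a list of candidate boolean masks and the depth image is aligned with the color image, candidate regions are dropped by an area threshold and by a depth-consistency check. The threshold names and default values are assumptions made only for illustration.

```python
import numpy as np

def filter_masks(masks, depth, min_area=500, max_depth_std=0.05):
    """Screen candidate masks by area and depth consistency.

    masks: list of HxW boolean arrays (FastSAM zero-shot segmentation output)
    depth: HxW depth image in metres, aligned with the color image
    """
    kept = []
    for mask in masks:
        if mask.sum() < min_area:            # area threshold: drop small noise regions
            continue
        d = depth[mask]
        d = d[d > 0]                         # ignore invalid (zero) depth readings
        if d.size == 0 or d.std() > max_depth_std:
            continue                         # drop masks with excessive depth dispersion
        kept.append(mask)
    return kept
```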

Inventors

  • LIU CHUNFANG
  • LIANG KAIMING
  • LIU XIAOYU

Assignees

  • Beijing University of Technology (北京工业大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-28

Claims (9)

  1. A lightweight visual language action system for robot operation, characterized in that, through intelligent modular integration of pre-trained components, the complex robot operation process is decomposed into mutually decoupled functional modules, namely a visual perception module, a visual language reasoning module, an action primitive planning module and a robot execution module; the visual perception module performs target segmentation on the scene image with the lightweight zero-shot segmentation model FastSAM to obtain mask information for the multiple target objects in the scene, post-processes and screens the segmented target regions in combination with the depth image data, and extracts the spatial center position and posture features of the target objects, thereby realizing three-dimensional pose estimation of the scene objects and providing a reliable spatial perception basis for subsequent operation; the visual language reasoning module performs semantic analysis on the natural language instruction, semantically binds the target objects mentioned in the language to the target masks obtained in the visual perception stage, and inputs the scene image with target markers to the visual language model to map the correspondence between the language description and the visual targets, so that the language instruction is accurately grounded to a specific operation object without relying on target-class training data; the action primitive planning module decomposes a complex robot operation task into a plurality of reusable atomic action units and operates at the semantic level, converting natural language instructions through a two-stage decomposition process in which the complex task is decomposed into atomic subtasks and the subtasks are decomposed into parameterized action primitives; and when the execution of an action primitive fails, the robot execution module automatically triggers local retry, parameter recalculation, scene re-perception or overall action sequence re-planning according to the failure type.
  2. The lightweight visual language action system for robot operation as recited in claim 1, wherein the visual perception module first acquires a color image I ∈ R^{H×W×3} and a depth image D ∈ R^{H×W} of the current scene with an RGB-D camera, then applies median filtering to the depth image to remove noise and aligns it with the color image, D' = median_k(D), where D' denotes the filtered depth map, k is the filtering kernel, and H and W are respectively the numbers of pixels along the image height and width; the color image is input to the zero-shot target segmentation model FastSAM, which performs category-agnostic segmentation of all segmentable objects in the scene to obtain two-dimensional mask regions {M_1, M_2, ..., M_n} of a plurality of candidate targets; the mask regions are filtered by an area threshold to remove noise regions whose area is smaller than a preset threshold, the depth consistency of the pixel points within each mask region is judged in combination with the depth image, and invalid masks with excessive depth dispersion are removed to optimize the masks and achieve accurate object boundary extraction; and the visual language model semantically interprets the targets in the scene according to the natural language content and outputs the mapping between the target objects mentioned in the language and the corresponding target numbers, thereby completing target semantic binding between the natural language description and the visual perception result.
  3. The lightweight visual language action system for robot operation according to claim 2, characterized in that, after target semantic binding is completed, three-dimensional pose estimation of the target object is carried out based on the RGB-D information: the depth information inside the target mask region is back-projected into a point cloud, principal component analysis (PCA) is performed on the point cloud, the principal axis direction and the spatial center position of the target object in the camera coordinate system are calculated, and the pose information of the target object is converted into the robot base coordinate system by combining the hand-eye calibration result between the camera and the robot base, giving the three-dimensional pose of the target object in the robot coordinate system: T_obj^base = T_cam^base · T_obj^cam, where T_obj^base is the pose of the target object relative to the robot base, T_cam^base is the camera-to-base transformation, and T_obj^cam is the pose of the object relative to the camera (an illustrative sketch of this step follows the claims).
  4. The lightweight visual language action system for robot operation according to claim 3, wherein the action primitive planning module performs semantic planning by hierarchically decomposing complex tasks into executable action primitives; first, in the high-level atomic task decomposition stage, a structured prompt engineering strategy is adopted for semantic subtask reasoning: S = VLM(L, O; R, T), where L is the natural language input, O denotes the operable objects, S is the task decomposition result, and R and T are the action primitive rules and task templates, respectively.
  5. The lightweight visual language action system for robot operation according to claim 4, wherein the structured prompt engineering strategy has the following characteristics: (1) structured semantic decomposition: six semantic action primitives, APPROACHTARGET, INTERACTWITHTARGET, MODIFYTCPORIENTATION, SETGRIPPERSTATE, WAITDURATION and VERTICALLIFT, are defined as the atomic building blocks of a complex operation task, each primitive being parameterized through spatial, kinematic and interaction parameters so as to realize systematic task decomposition; (2) self-adaptive parameterization: interaction parameters are determined automatically using depth-based height calculation, with four interaction height strategies defined, namely ABOVETABLESURFACE, TARGETOBJECTMIDHEIGHT, TARGETOBJECTTOPSURFACE and ABOVETARGETOBJECTSURFACE; (3) template guidance: common operation patterns, including grabbing and placing, twisting, pouring, opening containers and pressing, are encoded as template sequences that provide structured task decomposition guidance; (4) constraint-aware planning: physical constraints and system constraints are combined, including collision avoidance strategies, grasp state management and motion planning considerations.
  6. The lightweight visual language action system for robot operation of claim 5, wherein the task decomposition result output by the visual language reasoning module is mapped to a parameterized sequence of action primitives executed at the execution level: a_i = (p_i, g_i, s_i, k_i, e_i), where p_i ∈ {APPROACHTARGET, INTERACTWITHTARGET, MODIFYTCPORIENTATION, SETGRIPPERSTATE, WAITDURATION, VERTICALLIFT} is the action primitive type; g_i is the target specification of the interactive object, namely the target coordinates and the target object id; s_i contains the spatial parameters for accurate positioning, namely the target pixel offset, the interaction height type and the height offset; k_i is the kinematic parameter used to control the tool center point (TCP) orientation, namely the gripper rz angle; e_i contains the interaction parameters of the end-effector behavior, namely the gripper target state; and a_i is a fully parameterized action primitive (a data-structure sketch follows the claims).
  7. The lightweight visual language action system for robot operation according to claim 6, wherein the sequence of action primitives is sequentially parsed in the robot execution module and the mechanical arm is driven to complete the corresponding actions according to the set parameters, a mapping function operating on the structured JSON output and parsing and validating each parameter to construct a complete action specification A_i = Φ(a_i); the mapping process involves three key transformations: 1) parameter parsing and verification, in which the JSON parameters of each action primitive are parsed and verified against predefined constraints, the system extracting the target specification, spatial parameters, kinematic parameters and interaction parameters and ensuring that all required fields exist and lie within valid ranges; 2) coordinate transformation, in which image coordinates are converted into robot base coordinates through the hand-eye calibration matrix; 3) height calculation, in which the interaction height is calculated from the interaction height setting parameter, the object geometry and the depth information, the final height being obtained by the formula z_final = z_type + Δz_fine, where z_type is the interaction-height setting parameter derived from the depth map, Δz_fine is the fine height offset taken from the rule-based parameter settings, and z_final is the final operating height value; during execution, the end state of the mechanical arm and the execution result are monitored in real time, the corresponding error handling strategy is automatically triggered when gripping failure, target deviation or pose abnormality is detected, and a hierarchical error recovery mechanism is designed to improve the overall execution success rate of multi-step tasks.
  8. The lightweight visual language action system for robot operation of claim 7, wherein the hierarchical error recovery mechanism comprises 1) action-level retries: when a single action primitive fails to execute, the action is re-executed without changing the task structure; 2) parameter-level recalculation: when a change in the target pose is detected, the target pose is recalculated and the action parameters are updated; 3) perception-level refresh: when multiple retries fail, the scene image is re-acquired and the complete visual perception and semantic binding process is executed again; and 4) task-level re-planning: when none of the above strategies recovers from the failure, the robot returns to a safe initial state and the task process is re-planned (a minimal control-flow sketch follows the claims).
  9. The lightweight visual language action system for robot operation according to claim 7, wherein, in a multi-step operation task, the robot sequentially completes grabbing, placing or interaction operations on a plurality of target objects according to the structured task planning result generated by the visual language model, and after completing each action primitive, the visual perception module is re-invoked as needed to update the scene state, thereby avoiding task failure caused by environmental change or error accumulation.
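
To make the pose-estimation step of claim 3 concrete, here is a minimal Python sketch (illustrative only, not the patented implementation): it assumes the depth pixels inside a target mask have already been back-projected into an Nx3 camera-frame point cloud and that a 4x4 hand-eye calibration matrix T_cam_to_base is available; all function and variable names are invented for the example.

```python
import numpy as np

def estimate_pose_in_base(points_cam, T_cam_to_base):
    """PCA pose of an object point cloud, expressed in the robot base frame.

    points_cam:    Nx3 array of object points in the camera frame
    T_cam_to_base: 4x4 homogeneous hand-eye calibration matrix
    """
    center = points_cam.mean(axis=0)
    # Principal component analysis: eigenvectors of the covariance matrix give
    # the object's principal axes (largest eigenvalue = dominant orientation).
    cov = np.cov((points_cam - center).T)
    _, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    R_cam = eigvecs[:, ::-1]                  # columns sorted by decreasing variance
    if np.linalg.det(R_cam) < 0:              # keep a right-handed frame
        R_cam[:, -1] *= -1
    # Object pose in the camera frame, then transform into the base frame:
    # T_obj_base = T_cam_base @ T_obj_cam (claim 3).
    T_obj_cam = np.eye(4)
    T_obj_cam[:3, :3] = R_cam
    T_obj_cam[:3, 3] = center
    return T_cam_to_base @ T_obj_cam
```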
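Claims 5 to 7 describe parameterized action primitives and a mapping from the planner's structured JSON output to validated action specifications. The following Python sketch shows one possible shape of that data structure and validation step; the field names, defaults and JSON keys are assumptions, not the patent's exact schema.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Primitive(Enum):
    APPROACHTARGET = auto()
    INTERACTWITHTARGET = auto()
    MODIFYTCPORIENTATION = auto()
    SETGRIPPERSTATE = auto()
    WAITDURATION = auto()
    VERTICALLIFT = auto()

# The four interaction height strategies named in claim 5.
HEIGHT_STRATEGIES = {"ABOVETABLESURFACE", "TARGETOBJECTMIDHEIGHT",
                     "TARGETOBJECTTOPSURFACE", "ABOVETARGETOBJECTSURFACE"}

@dataclass
class ActionSpec:
    primitive: Primitive
    target_id: int                # target specification: which segmented object
    pixel_offset: tuple           # spatial parameter: offset in image pixels
    height_type: str              # spatial parameter: interaction height strategy
    height_offset: float          # spatial parameter: fine height offset
    rz_angle: float               # kinematic parameter: gripper rotation about z (TCP)
    gripper_open: bool            # interaction parameter: end-effector target state

def parse_primitive(step: dict) -> ActionSpec:
    """Parse and validate one JSON step emitted by the vision-language planner."""
    height_type = step.get("height_type", "ABOVETABLESURFACE")
    if height_type not in HEIGHT_STRATEGIES:
        raise ValueError(f"unknown interaction height strategy: {height_type}")
    return ActionSpec(
        primitive=Primitive[step["primitive"]],
        target_id=int(step["target_id"]),
        pixel_offset=tuple(step.get("pixel_offset", (0, 0))),
        height_type=height_type,
        height_offset=float(step.get("height_offset", 0.0)),
        rz_angle=float(step.get("rz", 0.0)),
        gripper_open=bool(step.get("gripper_open", True)),
    )

def final_height(strategy_height: float, fine_offset: float) -> float:
    # Claim 7: final operating height = strategy-derived height + fine offset.
    return strategy_height + fine_offset
```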
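Claim 8's four-level recovery ladder can be read as a simple control loop. The sketch below assumes hypothetical `robot`, `perceive` and `replan` interfaces and is only meant to show the ordering of the recovery strategies, not the actual execution module.

```python
def execute_with_recovery(plan, robot, perceive, replan, max_retries=2):
    """Hierarchical recovery: retry -> recompute parameters -> re-perceive -> replan."""
    i = 0
    while i < len(plan):
        action = plan[i]
        ok = robot.execute(action)
        if not ok:
            for _ in range(max_retries):              # 1) action-level retry
                if robot.execute(action):
                    ok = True
                    break
        if not ok:                                    # 2) parameter-level recalculation
            action = perceive.update_target_pose(action)
            ok = robot.execute(action)
        if not ok:                                    # 3) perception-level refresh
            scene = perceive.refresh_scene()
            action = perceive.rebind(action, scene)
            ok = robot.execute(action)
        if not ok:                                    # 4) task-level re-planning
            robot.move_to_safe_home()
            plan, i = replan(), 0
            continue
        i += 1
    return True
```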

Description

Lightweight visual language action system oriented to robot operation

Technical Field

The invention relates to the field of embodied robot intelligence, and in particular to a lightweight Vision-Language-Action (VLA) robot system that achieves efficient VLA deployment without model training or fine-tuning, so that a robot can understand natural language instructions and complete the corresponding physical operations based on visual perception.

Background

With the continuous development of artificial intelligence, robotics, and multi-modal sensing and reasoning technologies, robotic systems with natural language understanding and autonomous decision-making capabilities are becoming an important direction for performing complex operation tasks. A Vision-Language-Action (VLA) system combines visual perception and language understanding with robot control, so that the robot can autonomously complete target recognition, task planning and action execution according to natural language instructions given by humans; it is considered a key technical path for improving the generality and intelligence level of robots. However, existing VLA robotic systems still face significant engineering and deployment bottlenecks in practical applications. One mainstream class of methods adopts an end-to-end deep learning framework that uniformly models visual input, language input and control strategies, and obtains task generalization capability through large-scale real or simulated data training. Although such methods perform well in controlled experimental environments, they depend heavily on massive labelled data and high-performance graphics processors for training and inference; they consume large amounts of computing resources and energy, have high deployment costs, and are difficult to adapt to resource-constrained scenarios such as industrial sites, laboratories and edge computing. In addition, end-to-end VLA models usually tightly couple the sensing, reasoning and control processes, so the system is poorly interpretable and difficult to debug; once a perception error, target deviation or action failure occurs during execution, an effective error detection and recovery mechanism is often lacking, making it difficult for the robot to complete long-horizon, multi-step complex operation tasks. In real physical environments, environmental variation, sensing errors and execution uncertainties are prevalent, which further limits the stability and practicality of existing VLA systems. Therefore, how to construct a VLA robot operation method with a clear structure, strong interpretability and robust execution and error recovery capability, without depending on large-scale training data and high-performance computing hardware, is a technical problem to be solved in the field.

Conventional VLA systems mainly employ a task-centric design paradigm, designing specialized solutions for specific task types. However, a complex robotic manipulation task is essentially a sequence of multiple atomic-level basic motor skills. For example, the complex task of "tidying the table" may be broken down into an ordered combination of basic actions such as grabbing, placing and pushing.
Therefore, shifting from the task-centric design paradigm to a skill-centric design paradigm and decomposing the complex task into parameterized atomic operations improves the generalization capability of the system and achieves higher execution efficiency through skill reuse. This paradigm shift enables the system to handle an essentially unlimited variety of task types through combinations of a limited set of basic skills, significantly enhancing the adaptability and extensibility of the system. In real application environments, robot operation generally faces problems such as execution errors, limited computing resources and a lack of adaptive capability during execution. Therefore, how to implement a vision-language-action system that can complete multi-step robot operations based on natural language instructions without relying on large-scale training data or high-performance computing resources, that is highly robust, and that can be deployed in resource-constrained environments remains a technical problem to be solved in the art.

Disclosure of Invention

Aiming at the problems that existing VLA robot systems generally depend on large-scale model training, consume large amounts of computing resources, lack system reliability and error recovery mechanisms, and are difficult to deploy in resource-constrained environments, the invention provides a training-free lightweight VLA robot operation method and system. According to the method, without training or fine-tuning any model and without configuring a graphics processor on the robot side, the robot can finish multi-step operat