
CN-122008248-A - Auxiliary teleoperation method based on vision-language-action model

CN122008248A

Abstract

The invention belongs to the technical field of teleoperation and robot control, and in particular relates to an auxiliary teleoperation method based on a vision-language-action model. The method rests on a few-shot enhanced auxiliary teleoperation framework whose key technology is divided into a data preprocessing stage and a strategy learning and reasoning stage, and comprises the following steps: S1, injecting random noise into the supervision trajectory to construct an intent disturbance distribution; S2, extracting key frames of the supervision trajectory to construct a geometry-aware intent representation; and S3, encoding the processed trajectory into a latent embedding that conditions a vision-language-action model controller. By fusing visual information, language instructions, and action policies, the method adapts quickly to teleoperation tasks and generalizes well across operators, supporting cross-operator transfer and robust control, and achieving accurate cross-operator intent recognition and policy execution.

Inventors

  • Fan Zipei
  • Liu Yu
  • Miao Chengyu
  • Song Xuan
  • Huang Tianlv
  • Han Wei
  • Yan Fei

Assignees

  • Jilin University (吉林大学)

Dates

Publication Date
2026-05-12
Application Date
2026-04-08

Claims (4)

  1. An auxiliary teleoperation method based on a vision-language-action model, characterized in that the auxiliary teleoperation framework is few-shot enhanced on the basis of the vision-language-action model, and its key technology is divided into two core stages, a data preprocessing stage and a strategy learning and reasoning stage; the method comprises the following steps: S1, injecting random noise into the supervision trajectory to construct an intent disturbance distribution, namely, letting the supervision trajectory be $\zeta = \{a_1, \dots, a_T\}$, where $T$ is the total number of time steps of the trajectory and $a_t$ is the joint-angle vector at each moment, the core being to construct the intent disturbance distribution through a trajectory-level perturbation kernel $q_\psi(\hat{\zeta} \mid \zeta)$; S2, extracting key frames of the supervision trajectory and constructing a geometry-aware intent representation, namely, given a trajectory $\zeta = \{g_1, \dots, g_T\}$, where $T$ is the total number of time steps and $g_t$ is the trajectory geometric feature corresponding to time step $t$, defining the geometric error of the trajectory through the directed Hausdorff distance $d_H$, and solving for the minimum key-frame set $\mathcal{K}$ satisfying the constraint $\varepsilon(\mathcal{K}) \le \eta$, where $\eta$ is a preset error threshold; S3, encoding the processed trajectory into a latent embedding that conditions the vision-language-action model controller, namely, the vision-language model robustly aligns natural-language instructions with visual observations to generate a consistent semantic context representation, an intent expert fuses trajectory guidance with the semantic context to infer the latent intent, the results are concatenated into a unified multimodal context, and an action expert converts the multimodal context into action tokens via conditional flow matching and outputs them, wherein the vision-language model and the action expert inherit the initial weights of a pre-trained vision-language-action model, and the decoder-style intent expert adopts the same implementation as the Gemma backbone network.
  2. The auxiliary teleoperation method based on a vision-language-action model according to claim 1, wherein in S1 the specific steps of injecting random noise into the supervision trajectory and constructing the intent disturbance distribution are as follows: the trajectory-level perturbation kernel $q_\psi(\hat{\zeta} \mid \zeta)$ constructs the intent disturbance distribution over the supervision trajectory $\zeta$, defined as $q_\psi(\hat{\zeta} \mid \zeta) = \mathcal{N}(\hat{\zeta};\, \zeta,\, \Sigma_\psi)$, where $\Sigma_\psi = \mathrm{diag}(\Sigma_1, \dots, \Sigma_T) \in \mathbb{R}^{dT \times dT}$, $d$ is the vector dimension of a single-time-step action command $a_t$, and $T$ is the total number of time steps of the trajectory; $\Sigma_t$ is the perturbation parameter block corresponding to the $t$-th time step, specifically expressed as $\Sigma_t = \mathrm{diag}(\sigma_{t,1}^2, \dots, \sigma_{t,d}^2)$, where $\sigma_{t,j}^2$ is the perturbation variance of the $j$-th action dimension at the $t$-th time step, used to control the noise intensity, and the diagonal-matrix form ensures that perturbations at different time steps are independent (a minimal sampling sketch is given after the claims).
  3. The auxiliary teleoperation method based on a vision-language-action model according to claim 1, wherein in S2 the specific steps of extracting key frames of the trajectory and constructing the geometry-aware intent representation are as follows: given a trajectory $\zeta = \{g_1, \dots, g_T\}$ with geometric features $g_t$, define the linear interpolation $\hat{\zeta}_{\mathcal{K}}$ of the trajectory, where $\mathcal{K} = \{k_1, \dots, k_m\}$ is the key-frame index set satisfying $1 = k_1 < k_2 < \dots < k_m = T$; the point $\hat{g}_t$ obtained by linear interpolation at the $t$-th time step is $\hat{g}_t = g_{k_i} + \frac{t - k_i}{k_{i+1} - k_i}\,(g_{k_{i+1}} - g_{k_i})$, where $k_i$ and $k_{i+1}$ are the time steps of adjacent key frames and the current time step $t$ satisfies $k_i \le t \le k_{i+1}$; the geometric error of the trajectory is defined as $\varepsilon(\mathcal{K}) = \max_{1 \le t \le T} d(g_t, \hat{\zeta}_{\mathcal{K}})$, where $d(g_t, \hat{\zeta}_{\mathcal{K}})$ is the geometric distance from the original trajectory point $g_t$ to the reconstructed trajectory $\hat{\zeta}_{\mathcal{K}}$, computed as $d(g_t, \hat{\zeta}_{\mathcal{K}}) = \min_{\hat{g} \in \hat{\zeta}_{\mathcal{K}}} \lVert g_t - \hat{g} \rVert$; the goal of key-frame extraction is to find the minimum set $\mathcal{K}^*$ whose geometric error does not exceed the preset threshold $\eta$: $\mathcal{K}^* = \arg\min_{\mathcal{K}} |\mathcal{K}| \;\; \text{s.t.} \;\; \varepsilon(\mathcal{K}) \le \eta$ (a greedy solver sketch is given after the claims).
  4. The auxiliary teleoperation method based on a vision-language-action model according to claim 1, wherein in S3 the fine-tuning of the action expert of the auxiliary teleoperation system comprises the following specific steps: the action expert converts the multimodal context into an action block $A_t$; the proprioceptive state $q_t$ and the flow-matching noise term $A_t^\tau$ are mapped into the action embedding space to form queries, and these queries drive a cross-modal attention decoder that attends to the multimodal context, comprising the language instruction $\ell$, observations $o_t$, and intent $z_t$, as keys and values; this architecture ensures that the generated actions strictly follow the semantic environment and the specified intent constraints. In the training stage a mixed training strategy is adopted: the pre-trained vision-language model and the basic priors of the action expert are trained with low-rank adaptation, whose weight-update low-rank decomposition is $\Delta W = \frac{\alpha}{r} B A$, where $\Delta W$ is the weight update introduced by low-rank adaptation with dimension $d_{\text{out}} \times d_{\text{in}}$, $r$ is the low-rank dimension, much smaller than $\min(d_{\text{in}}, d_{\text{out}})$, $\alpha$ is the scaling factor, $A$ is the down-projection matrix with dimension $r \times d_{\text{in}}$, and $B$ is the up-projection matrix with dimension $d_{\text{out}} \times r$; the forward-propagation computation is modified as $h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x$, where $x$ is the input feature of the layer and $W_0$ is the pre-trained weight matrix inherited from the pre-trained vision-language model and action expert, with dimension consistent with $\Delta W$. The intent expert is trained with full parameters to ensure alignment with the vision-language-model context; the policy $\pi_\theta$ models the continuous transition from a Gaussian prior to the expert action distribution using conditional flow matching; specifically, flow-matching training relies on a target vector field $u(A_t^\tau \mid A_t)$ that makes the action converge gradually from noise to the expert action, with loss function $\mathcal{L}_{\text{FM}} = \mathbb{E}\big[\lVert v_\theta(A_t^\tau, \tau, c) - u(A_t^\tau \mid A_t) \rVert^2\big]$, where $c$ denotes the conditioning context, including the language instruction, visual observations, proprioception, and the inferred operator intent (LoRA and flow-matching sketches are given after the claims).
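In practice, the perturbation kernel of claims 1 and 2 reduces to sampling independent Gaussian noise per time step and per action dimension around the supervision trajectory. Below is a minimal NumPy sketch under that reading; the function name, the (T, d) array layout, and the example variance values are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def perturb_trajectory(zeta: np.ndarray, sigma_sq: np.ndarray, rng=None) -> np.ndarray:
    """Sample zeta_hat ~ q_psi(. | zeta) = N(zeta, diag(Sigma_1, ..., Sigma_T)).

    zeta     : (T, d) supervision trajectory of joint angles.
    sigma_sq : (T, d) per-time-step, per-dimension perturbation variances
               (the diagonal blocks Sigma_t), so noise is independent
               across time steps and action dimensions.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=0.0, scale=np.sqrt(sigma_sq), size=zeta.shape)
    return zeta + noise

# Example: a 50-step trajectory of 7 joint angles with uniform variance 1e-3.
T, d = 50, 7
zeta = np.linspace(0.0, 1.0, T)[:, None] * np.ones((1, d))
zeta_hat = perturb_trajectory(zeta, sigma_sq=np.full((T, d), 1e-3))
```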
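Claim 3 asks for the smallest key-frame set whose interpolation error stays below the threshold η. The patent does not name a solver; a common greedy choice is recursive subdivision in the style of the Ramer-Douglas-Peucker algorithm, sketched below. Using the pointwise distance at matching time steps upper-bounds the directed-Hausdorff error of the claim, so a set accepted by this check still satisfies the constraint.

```python
import numpy as np

def extract_keyframes(g: np.ndarray, eta: float) -> list[int]:
    """Greedy key-frame selection: recursively split each segment at the
    point deviating most from the straight-line interpolation until the
    (pointwise, hence conservative) geometric error is <= eta everywhere.

    g   : (T, d) trajectory geometric features g_1..g_T.
    eta : preset error threshold.
    Returns a sorted key-frame index set K with k_1 = 0 and k_m = T - 1.
    """
    T = len(g)
    keys = {0, T - 1}

    def split(lo: int, hi: int) -> None:
        if hi - lo < 2:
            return
        t = np.arange(lo + 1, hi)
        # Linear interpolation g_hat_t between key frames lo and hi.
        w = ((t - lo) / (hi - lo))[:, None]
        g_hat = g[lo] + w * (g[hi] - g[lo])
        err = np.linalg.norm(g[lo + 1:hi] - g_hat, axis=1)
        worst = int(err.argmax())
        if err[worst] > eta:          # constraint violated: add a key frame
            mid = lo + 1 + worst
            keys.add(mid)
            split(lo, mid)
            split(mid, hi)

    split(0, T - 1)
    return sorted(keys)
```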
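The low-rank adaptation in claim 4 is the standard LoRA parameterization $h = W_0 x + \frac{\alpha}{r} B A x$. A minimal PyTorch sketch of a LoRA-wrapped linear layer follows; the layer size, rank, and scaling values are placeholders, and initializing $B$ to zero (so training starts from the pre-trained behavior) is a common convention rather than something the patent specifies.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """h = W0 x + (alpha / r) * B A x, with W0 frozen and only A, B trained."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze pre-trained weights W0
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(d_out, r))        # up-projection
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(1024, 1024), r=8, alpha=16.0)
h = layer(torch.randn(2, 1024))
```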
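For the conditional flow-matching objective in claim 4, the sketch below assumes the common linear-interpolation probability path, under which the target vector field is simply $A_t - A_t^0$, the straight line from the Gaussian prior sample to the expert action chunk; the patent states the loss only abstractly, so this path choice and the tensor shapes are assumptions.

```python
import torch

def flow_matching_loss(v_theta, a_expert: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching with a linear interpolation path.

    a_expert : (B, H, d) expert action chunk A_t (H = action horizon).
    c        : (B, k)    conditioning context (language, observations,
                         proprioception, inferred intent), already embedded.
    v_theta  : callable (a_tau, tau, c) -> predicted vector field.
    """
    B = a_expert.shape[0]
    a0 = torch.randn_like(a_expert)                    # Gaussian prior sample
    tau = torch.rand(B, 1, 1, device=a_expert.device)  # flow time in [0, 1]
    a_tau = (1.0 - tau) * a0 + tau * a_expert          # point on the path
    target = a_expert - a0                             # target vector field u
    return ((v_theta(a_tau, tau, c) - target) ** 2).mean()
```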

Description

Auxiliary teleoperation method based on vision-language-action model

Technical Field

The invention relates to the technical field of teleoperation and robot control, and in particular to an auxiliary teleoperation method based on a vision-language-action model.

Background

With the continuous development of intelligent technology, teleoperation systems have become foundational systems for advancing the intelligence of tools, and are widely applied in industrial manufacturing, remote services, hazardous-environment operation, and other fields. A teleoperation system can collect demonstration data by drawing on human multimodal perception and expert decision priors, and use it to train and optimize a general robot policy. Task efficiency is improved by adopting a shared-control mode in which the human provides high-level guidance while the robot executes the low-level actions. However, the inherent mismatch between human and robot dynamics (especially under non-optimal operating conditions) may drive joints toward extreme positions or into singular configurations. At the same time, trajectory distributions are highly heterogeneous across operators owing to individual differences in habits and professional background, which also poses a major challenge to the stability of intent recognition. To improve the stability of intent inference and execution efficiency, existing teleoperation schemes can be broadly divided into full teleoperation under closed-loop control, coordinated teleoperation based on fixed policies, and supervisory-control teleoperation in which the operator acts only as a supervisor. However, these methods burden the operator: they not only depend on training with extensive scene data but also struggle to generalize intent recognition across operators and skill levels. More importantly, although expanding an intent-policy library or applying traditional reinforcement learning to the teaching trajectories of expert-operated robots can improve teleoperation efficiency and generalization to the operating environment to some extent, it greatly increases data collection and labeling costs, significantly raises model computation and supervisor burden, and still fails to improve cross-operator generalization.

Disclosure of Invention

This section is intended to outline some aspects of embodiments of the application and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section, in the description of the application, and in the title of the application, and shall not be used to limit the scope of the application.
In order to solve the above technical problems, according to one aspect of the present invention, the following technical solution is provided: an auxiliary teleoperation method based on a vision-language-action model and a few-shot generalized auxiliary teleoperation framework based on the vision-language-action model, characterized in that the key technology is divided into two core stages, a data preprocessing stage and a strategy learning and reasoning stage, and the method comprises the following steps: S1, injecting random noise into the supervision trajectory to construct an intent disturbance distribution, namely, letting the supervision trajectory be $\zeta = \{a_1, \dots, a_T\}$, where $T$ is the total number of time steps of the trajectory and $a_t$ is the joint-angle vector at each moment, the core being to construct the intent disturbance distribution through a trajectory-level perturbation kernel $q_\psi(\hat{\zeta} \mid \zeta)$; S2, extracting key frames of the supervision trajectory and constructing a geometry-aware intent representation, namely, given a trajectory $\zeta = \{g_1, \dots, g_T\}$, where $T$ is the total number of time steps and $g_t$ is the trajectory geometric feature corresponding to time step $t$, defining the geometric error of the trajectory through the directed Hausdorff distance $d_H$, and solving for the minimum key-frame set $\mathcal{K}$ satisfying the constraint $\varepsilon(\mathcal{K}) \le \eta$, where $\eta$ is a preset error threshold; S3, encoding the processed trajectory into a latent embedding that conditions the vision-language-action model controller, namely, the vision-language model robustly aligns natural-language instructions with visual observations to generate a consistent semantic context representation, an intent expert fuses trajectory guidance with the semantic context to infer the latent intent, the results are concatenated into a unified multimodal context, and an action expert converts the multimodal context into action tokens via conditional flow matching and outputs them, wherein the vision-language model and the action expert inherit the initial weights of a pre-trained vision-language-action model, and the decoder-style intent expert adopts the same implementation as the Gemma backbone network. As a preferred scheme of the auxiliary teleoperation method based on the vision-la