CN-121973233-A - Asynchronous operation method and device for robot based on optical flow prediction
Abstract
The invention discloses a robot asynchronous operation method and device based on optical flow prediction, relating to the technical field of robot asynchronous operation strategies. The method comprises: extracting features from historical visual frames and performing spatio-temporal modeling, outputting object flows through a lightweight flow prediction module, and rendering and synthesizing future observations; adopting contrastive learning with a temporal mask to align the synthesized features with real future observation features, forming consistent future visual features; inputting the future visual features together with the proprioceptive state into a diffusion-policy action generation network to generate a temporally aligned action sequence and construct an action queue; and, relying on a diffusion-policy asynchronous architecture, explicitly modeling system delay to determine the prediction horizon, dynamically maintaining the queue, and outputting time-aligned actions to complete asynchronous robot operation under delayed scenarios. The invention achieves dynamic task optimization by supplementing future observations and enforcing temporal alignment, solving the problems of timing deviation and response lag in asynchronous inference.
Inventors
- LU JIWEN
- ZHOU JIE
- YU BINGYAO
- WEI HAOYU
- CHENG ZIYANG
- XU XIUWEI
- MA ANGYUAN
- YIN HANG
Assignees
- Tsinghua University (清华大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-03-27
Claims (10)
- 1. An asynchronous operation method of a robot based on optical flow prediction, comprising: performing feature extraction and spatio-temporal modeling on historical visual frames, outputting object flows representing the motion trends of objects through a lightweight flow prediction module, and completing rendering and synthesis of future observations based on the object flows; adopting contrastive learning with a temporal mask to impose alignment constraints between the synthesized future observation features and the real future observation features, so as to form a future visual representation consistent with real perception; inputting the alignment-constrained future visual representation together with the proprioceptive state into an action generation network constructed on a diffusion policy, generating an action sequence explicitly aligned with future timestamps, and constructing a temporally uniform action queue; and constructing an asynchronous decision architecture based on the diffusion policy, accessing the aligned future visual representation, determining the prediction horizon length by explicitly modeling the total system delay, dynamically maintaining the action queue, generating an action sequence aligned with real timestamps, and completing temporally asynchronous operation of the robot under delayed scenarios.
- 2. The method of claim 1, wherein performing feature extraction and spatio-temporal modeling on historical visual frames and outputting, through the lightweight flow prediction module, object flows characterizing the motion trends of objects comprises: extracting image feature information from consecutive historical visual frames and generating a heatmap for localizing object regions; performing a soft-sampling operation with a weighted clustering algorithm, weighting and aggregating the spatial coordinates according to the response value at each point of the heatmap to obtain a coarse localization of the object flow start points; and computing local correlation features between features of different time frames, and outputting fine-grained corrections of the flow start points and the corresponding motion displacement vectors through a flow decoder, completing the prediction of the object flow.
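The weighted soft-sampling step above can be sketched as a soft-argmax over the heatmap; the function name, temperature parameter, and array shapes below are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def soft_argmax_2d(heatmap, temperature=1.0):
    """Soft-sample a response heatmap: aggregate pixel coordinates
    weighted by the softmax of the responses, yielding a coarse
    (sub-pixel) localization of a flow start point."""
    h, w = heatmap.shape
    # Softmax over all responses so the weights sum to 1.
    logits = heatmap.flatten() / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Coordinate grid for every pixel, aggregated by weight.
    ys, xs = np.mgrid[0:h, 0:w]
    x = float((weights * xs.flatten()).sum())
    y = float((weights * ys.flatten()).sum())
    return x, y

# A heatmap sharply peaked at (row=2, col=3) localizes to x=3.0, y=2.0.
hm = np.zeros((5, 5))
hm[2, 3] = 10.0
print(soft_argmax_2d(hm, temperature=0.1))  # → (3.0, 2.0)
```

A low temperature sharpens the softmax toward a hard argmax, while a higher one averages over the whole response region.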
- 3. The method of claim 2, wherein completing the rendering and synthesis of future observations based on the object flow comprises: assigning, according to a preset rule, corresponding representation information to the predicted object flow vectors, and rendering and mapping the flow vectors onto the current observation image; and intuitively reflecting the displacement and direction of motion from each origin point through the rendered flow vectors, and completing the synthesis of the future observation image based on the flow information.
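A minimal sketch of flow-based future-frame synthesis, under simplifying assumptions not stated in the patent (nearest-pixel forward warping, single-channel image, no occlusion or inpainting handling; all names are hypothetical):

```python
import numpy as np

def render_future_observation(image, flow):
    """Move pixels of the current observation along the predicted
    per-pixel object flow (dx, dy) to synthesize a future frame.
    Static background pixels (zero flow) are kept unchanged."""
    h, w = image.shape[:2]
    out = image.copy()
    moved = np.any(flow != 0, axis=-1)
    for y, x in zip(*np.nonzero(moved)):
        out[y, x] = 0                                   # vacate the source pixel
        ty = int(np.clip(y + flow[y, x, 1], 0, h - 1))  # target row
        tx = int(np.clip(x + flow[y, x, 0], 0, w - 1))  # target col
        out[ty, tx] = image[y, x]                       # paste at flow target
    return out

img = np.zeros((4, 4), dtype=np.uint8)
img[1, 1] = 255                  # a single bright "object" pixel
flow = np.zeros((4, 4, 2))
flow[1, 1] = (2.0, 1.0)          # predicted motion: +2 px in x, +1 px in y
fut = render_future_observation(img, flow)
```

After warping, the object pixel appears at (row 2, col 3) and its original location is cleared, which is the displacement the rendered flow vector encodes.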
- 4. A method according to claim 3, wherein imposing alignment constraints between the synthesized future observation features and the real future observation features using contrastive learning with a temporal mask comprises: inputting the synthesized future observations and the real future observations into a shared feature encoder and mapping both into a unified latent feature space; constructing a temporal-mask screening mechanism that filters out frame features separated by short time intervals, retaining only samples whose temporal distance satisfies a set condition to participate in contrastive learning; and taking positive-sample feature alignment and negative-sample feature separation as the optimization directions, imposing bidirectional constraints between the predicted features and the ground-truth features, and completing feature-space alignment through symmetric optimization.
- 5. The method of claim 4, wherein forming a future visual representation consistent with real perception comprises: adopting a contrastive learning objective to reduce the distribution distance between the flow-enhanced predicted features and the real future observation features; eliminating, during feature matching, the interference caused by temporally adjacent frames, improving the effectiveness of feature alignment; and measuring feature matching through normalized similarity, continuously optimizing the reliability of the predicted observation features, eliminating the feature gap between synthesized and real observations, and forming future visual features consistent with real perception.
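The time-masked, symmetric contrastive objective described above can be sketched as a masked InfoNCE loss; the `min_gap` and `tau` parameters and the NumPy formulation are illustrative assumptions:

```python
import numpy as np

def time_masked_infonce(pred, real, times, min_gap=2, tau=0.1):
    """Symmetric InfoNCE between synthesized (pred) and real future
    features, rows L2-normalized, shape (N, D). Negatives whose
    timestamps lie closer than min_gap to the anchor are masked out,
    so near-identical adjacent frames cannot act as false negatives."""
    sim = pred @ real.T / tau                        # (N, N) similarity logits
    dt = np.abs(times[:, None] - times[None, :])     # pairwise time distances
    keep = (dt >= min_gap) | np.eye(len(times), dtype=bool)
    sim = np.where(keep, sim, -np.inf)               # drop close-in-time negatives

    def nce(logits):
        # Row-wise cross-entropy with the diagonal as the positive pair.
        logits = logits - logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        return -np.log(np.diag(p)).mean()

    # Bidirectional constraint: predicted→real and real→predicted.
    return 0.5 * (nce(sim) + nce(sim.T))

# Identical, well-separated features give a near-zero loss.
loss = time_masked_infonce(np.eye(3), np.eye(3), np.array([0, 5, 10]))
```

With only temporally adjacent candidates available, the mask leaves the positive alone and the loss collapses to zero, which is the intended "filter short intervals" behavior.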
- 6. The method of claim 5, wherein inputting the aligned future visual representation together with the proprioceptive state into the action generation network constructed on the diffusion policy, generating an action sequence explicitly aligned with future timestamps, and constructing a temporally uniform action queue comprises: constructing the diffusion-policy action generation network with an encoder-decoder structure adapted to visual features; fusing the aligned future visual representation with the robot's own proprioceptive state to form a complete decision input; generating, with future timestamps as the alignment reference, a continuous and temporally regular action trajectory through the network; and constructing, based on the explicit timestamps of the generated actions, a temporally uniform, directly callable action queue.
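One plausible sketch of the timestamp-indexed action queue (the class and method names are hypothetical; newer chunks overwrite the overlapping tail of older ones):

```python
from collections import deque

class ActionQueue:
    """Timestamped action queue: each generated action carries an
    explicit future timestamp, and pop() emits only the action whose
    timestamp matches the current control step."""
    def __init__(self):
        self.q = deque()

    def push_chunk(self, actions, t_start, dt):
        # Drop queued actions at or after the new chunk's start time,
        # then append the chunk with uniform timestamps.
        while self.q and self.q[-1][0] >= t_start:
            self.q.pop()
        for i, a in enumerate(actions):
            self.q.append((t_start + i * dt, a))

    def pop(self, t_now):
        # Discard stale actions, then emit the one aligned to t_now.
        while self.q and self.q[0][0] < t_now:
            self.q.popleft()
        if self.q and self.q[0][0] == t_now:
            return self.q.popleft()[1]
        return None

q = ActionQueue()
q.push_chunk(["a", "b", "c"], t_start=0, dt=1)
first = q.pop(1)   # the action stamped t=0 is stale and skipped
```

Indexing by explicit timestamp, rather than queue position, is what lets the consumer stay aligned when inference arrives late or chunks overlap.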
- 7. The method of claim 6, wherein constructing the asynchronous decision architecture based on the diffusion policy, accessing the aligned future visual representation, determining the prediction horizon length by explicitly modeling the total system delay, dynamically maintaining the action queue, and generating an action sequence aligned with real timestamps comprises: accessing the alignment-constrained future visual representation in the asynchronous decision architecture constructed on the diffusion policy; aggregating observation delay, inference delay, and controller delay to explicitly model the total system delay, and combining the total delay with the control time step to compute the prediction horizon length required for asynchronous decision-making; and dynamically maintaining, based on the computed horizon length, an action queue aligned with real timestamps, outputting an action sequence matched to the real timeline.
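Under the delay model just described, the horizon computation reduces to a ceiling division of total delay by the control period; the millisecond units, example delay values, and function name are assumptions for illustration:

```python
import math

def predicted_horizon(obs_delay_ms, infer_delay_ms, ctrl_delay_ms, control_dt_ms):
    """Explicit total-delay model: the policy must predict at least
    ceil(total_delay / control_dt) steps ahead so that actions emitted
    now are still timestamp-aligned when they reach the controller."""
    total = obs_delay_ms + infer_delay_ms + ctrl_delay_ms
    return math.ceil(total / control_dt_ms)

# 30 ms observation + 80 ms inference + 10 ms controller delay at a
# 20 ms control period requires a 6-step prediction horizon.
print(predicted_horizon(30, 80, 10, 20))  # → 6
```

Rounding up rather than down is what guarantees the queue never runs dry before the next inference round lands.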
- 8. An asynchronous operation device of a robot based on optical flow prediction, comprising: an observation synthesis module for performing feature extraction and spatio-temporal modeling on historical visual frames, outputting object flows representing the motion trends of objects through a lightweight flow prediction module, and completing rendering and synthesis of future observations based on the object flows; a feature alignment module for imposing alignment constraints between the synthesized future observation features and the real future observation features through contrastive learning with a temporal mask, forming a future visual representation consistent with real perception; a queue construction module for inputting the alignment-constrained future visual representation together with the proprioceptive state into an action generation network constructed on a diffusion policy, generating an action sequence explicitly aligned with future timestamps, and constructing a temporally uniform action queue; and an asynchronous decision module for constructing an asynchronous decision architecture based on the diffusion policy, accessing the aligned future visual representation, determining the prediction horizon length by explicitly modeling the total system delay, dynamically maintaining the action queue, generating an action sequence aligned with real timestamps, and completing temporally asynchronous operation of the robot under delayed scenarios.
- 9. An electronic device comprising a processor and a memory, wherein the processor, by reading executable program code stored in the memory, runs a program corresponding to the executable program code so as to implement the method according to any one of claims 1-7.
- 10. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-7.
Description
Asynchronous operation method and device for robot based on optical flow prediction

Technical Field

The invention relates to the technical field of robot asynchronous operation strategies, and in particular to an asynchronous operation method and device for a robot based on optical flow prediction.

Background

With the continuous development of intelligent perception and decision algorithms, autonomous robot control has achieved breakthroughs in dynamic interaction, precise manipulation, autonomous movement, and other scenarios, with broad application prospects in industrial automation, intelligent warehousing, service robotics, and other fields. However, existing end-to-end control strategies share an inherent defect: the policy inference process carries significant computational overhead, and the resulting delay severely limits the deployment of robot systems in real scenarios. Under the traditional synchronous execution mode, the serial execution of observation acquisition, policy inference, and low-level control easily causes lagging action responses and stiff, incoherent motion trajectories. Especially in high-speed interaction tasks such as dynamic target tracking and moving-object grasping, the environment state changes continuously, and the slow response can hardly match the real-time demands of dynamic scenes, directly and substantially reducing the task success rate. To alleviate the delay bottleneck of synchronous execution, asynchronous inference frameworks that process model inference and action execution in parallel are gradually becoming the mainstream technical route.
However, asynchronous architectures do not fundamentally eliminate system delay: the total delay accumulated across observation transmission, model inference, low-level control, and other links still causes obvious temporal misalignment between the action commands output by the policy and the real-time environment state. To ensure execution safety and rationality, existing systems are often forced to discard the early action segments generated in each inference round, which not only further aggravates execution delay but also breaks the transitions between consecutive action chunks, significantly reducing motion continuity and manipulation accuracy. In dynamic target interaction scenarios, the temporal misalignment problem is further amplified: a control strategy generated from stale information can hardly keep tracking the target state, and stable, accurate dynamic control cannot be achieved. In the prior art, research has attempted to optimize policy timing by predicting future proprioceptive states; representative methods such as VLASH alleviate the action-lag problem by introducing future proprioceptive state prediction, but these methods still rely heavily on current and historical visual observations and cannot acquire future-oriented visual context. Lacking future visual cues, the policy network can only approximately deduce future actions from incomplete historical information rather than reason accurately over a complete spatio-temporal context, so its robustness and control precision in dynamic environments fall short of practical demands.
Owing to a series of technical problems in complex dynamic scenes, such as temporal misalignment, missing visual information, and lagging action responses, current asynchronous policy frameworks still cannot achieve efficient, accurate, and coherent real-time robot control. Therefore, in the field of asynchronous robot control for dynamic interaction tasks, there is an urgent need for an efficient policy inference method that fully exploits future visual context, eliminates temporal deviation, and improves dynamic control precision.

Disclosure of Invention

The invention mainly aims to provide a robot asynchronous operation method based on optical flow prediction. Another object of the present invention is to provide a robot asynchronous operation device based on optical flow prediction. A third object of the present invention is to propose an electronic device. A fourth object of the present invention is to propose a non-transitory computer-readable storage medium. To achieve the above objects, an embodiment of a first aspect of the present invention provides a robot asynchronous operation method based on optical flow prediction, including: performing feature extraction and spatio-temporal modeling on historical visual frames, outputting object flows representing the motion trends of objects through a lightweight flow pre