
CN-121980407-A - Embodied intelligent VLA data pre-labeling method and device based on multi-modal data

CN 121980407 A

Abstract

The invention relates to a method and a device for pre-labeling embodied intelligent VLA data based on multi-modal data. The method comprises: acquiring a multi-modal data stream comprising visual modality data, language modality data and action modality data, and performing time-sequence segmentation on the multi-modal data stream to obtain a plurality of time-period data units; for each time-period data unit, identifying a current task execution state based on the visual modality data and the language modality data, and generating at least one candidate annotation level set corresponding to that state; performing multi-modal semantic representation on each level annotation in the candidate annotation level set, and constructing a multi-modal consistency relation graph comprising visual nodes, language nodes and action nodes; and executing a consistency propagation operation in the multi-modal consistency relation graph to obtain the multi-modal consistency constraint relations among the nodes. The scheme improves the stability and reliability of multi-modal annotation in complex scenes.
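The core loop the abstract describes (build a graph of visual/language/action nodes, propagate consistency, flag conflicting node combinations) can be sketched as follows. This is a minimal illustration under assumed rules: the node names, the neighbor-averaging update, and the drift-based conflict test are invented for the sketch, not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class Node:
    """One semantic unit from a single modality, with a consistency score in [0, 1]."""
    name: str
    modality: str      # "visual", "language", or "action"
    consistency: float

def propagate(nodes: dict[str, Node], edges: list[tuple[str, str]], rounds: int = 3) -> None:
    """Illustrative propagation rule: pull each node's score halfway toward
    the mean score of its cross-modal neighbors, for a fixed number of rounds."""
    for _ in range(rounds):
        updates = {}
        for name, node in nodes.items():
            neigh = ([nodes[b].consistency for a, b in edges if a == name]
                     + [nodes[a].consistency for a, b in edges if b == name])
            if neigh:
                updates[name] = 0.5 * node.consistency + 0.5 * sum(neigh) / len(neigh)
        for name, score in updates.items():  # synchronous update after each round
            nodes[name].consistency = score

# Toy graph: vision and language agree that a cup is being picked up,
# but the recorded action evidence conflicts.
nodes = {
    "cup_seen":    Node("cup_seen", "visual", 0.9),
    "pick_up_cup": Node("pick_up_cup", "language", 0.8),
    "grasp":       Node("grasp", "action", 0.2),
}
edges = [("cup_seen", "pick_up_cup"), ("pick_up_cup", "grasp"), ("cup_seen", "grasp")]

initial = {n: node.consistency for n, node in nodes.items()}
propagate(nodes, edges)
# Nodes whose state drifted heavily during propagation are flagged as conflicting --
# a crude stand-in for the patent's "conflict evolution mode" detection.
conflicts = [n for n, node in nodes.items()
             if abs(node.consistency - initial[n]) > 0.3]
```

Here `conflicts` ends up containing only `grasp`, the node whose evidence disagreed with the other two modalities; the claims go further and inspect the full node-state evolution track for periodic oscillation, continuous degradation, or branch splitting.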

Inventors

  • YANG SHUSEN
  • CHEN YUHAO
  • YAO BIN

Assignees

  • 堃华科技(广州)有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-01-24

Claims (10)

  1. An embodied intelligent VLA data pre-labeling method based on multi-modal data, characterized in that it comprises: acquiring a multi-modal data stream containing visual modality data, language modality data and action modality data, and performing time-sequence segmentation on the multi-modal data stream to obtain a plurality of time-period data units; for each time-period data unit, identifying a current task execution state based on the visual modality data and the language modality data, and generating at least one candidate annotation level set corresponding to the task execution state; performing multi-modal semantic representation on each level annotation in the candidate annotation level set, and constructing a multi-modal consistency relation graph comprising visual nodes, language nodes and action nodes; executing a consistency propagation operation in the multi-modal consistency relation graph to obtain multi-modal consistency constraint relations among the nodes, and detecting node combinations with conflicts; for each detected conflict node combination, constructing a counterfactual multi-modal sample associated with the candidate annotation based on the corresponding candidate annotation level; introducing the counterfactual multi-modal sample into the multi-modal consistency relation graph, and executing multiple rounds of consistency backtracking verification to determine the consistency stability of the candidate annotation under different modality combinations; executing annotation lifecycle state transition control on the candidate annotation according to the consistency stability result to determine the current annotation state of the candidate annotation; and when the annotation state of the candidate annotation meets a preset freezing condition, determining the candidate annotation as a pre-labeling result of the VLA data and outputting the pre-labeling result.
  2. The pre-labeling method of claim 1, wherein obtaining the plurality of time-period data units comprises: performing time synchronization and alignment on the multi-modal data stream composed of visual modality data, language modality data and action modality data; determining candidate time-period boundaries segment by segment along the time sequence of the multi-modal data stream based on action change information in the action modality data; correcting the candidate time-period boundaries by combining scene change information in the visual modality data and semantic boundary information in the language modality data; and dividing the multi-modal data stream into a plurality of time-period data units based on the corrected time-period boundaries.
  3. The pre-labeling method of claim 1, wherein performing multi-modal semantic representation on each level annotation in the candidate annotation level set and constructing the multi-modal consistency relation graph comprising visual nodes, language nodes and action nodes comprises: constructing, for each level annotation in the candidate annotation level set, a corresponding modality sub-consistency graph, wherein the modality sub-consistency graph comprises the visual semantic units, language semantic units or action semantic units associated with that level annotation; constructing, on the basis of the modality sub-consistency graphs, a cross-modality consistency graph according to the semantic mapping relations among semantic units of different modalities, wherein the cross-modality consistency graph describes the correspondence among the different modality sub-consistency graphs; combining the modality sub-consistency graphs with the cross-modality consistency graph to form a graph collection structure of the multi-modal consistency relation graph, wherein the graph collection structure comprises at least a main consistency graph and a plurality of auxiliary consistency graphs; assigning an initial consistency state identifier to each graph node in the graph collection structure, wherein the consistency state identifier represents the consistency credibility of the node under the current candidate annotation; and performing grouped initialization of the node states in the graph collection structure based on the hierarchy type of the candidate annotation and the task execution state, so that nodes in different hierarchies or different states have different consistency state starting conditions.
  4. The pre-labeling method of claim 3, wherein each graph node in the graph collection structure of the multi-modal consistency relation graph is classified, according to the type of semantic unit it represents, as: a visual node, representing image features or object recognition results extracted from the visual modality; a language node, representing vocabulary or semantic information extracted from the language modality; or an action node, representing motion or behavior information extracted from the action modality.
  5. The pre-labeling method of claim 1 or 3, wherein executing the consistency propagation operation in the multi-modal consistency relation graph to obtain the multi-modal consistency constraint relations among the nodes and detecting node combinations with conflicts comprises: in the graph collection structure of the multi-modal consistency relation graph, sequentially executing a plurality of consistency propagation phases according to a preset propagation phase sequence, wherein each propagation phase corresponds to a different consistency propagation target; in each propagation phase, respectively executing intra-layer consistency propagation and cross-graph consistency propagation according to the consistency constraint type, wherein the intra-layer consistency propagation updates the node states within a sub-consistency graph of the same modality, and the cross-graph consistency propagation transmits consistency influence information between the main consistency graph and the auxiliary consistency graphs; during the consistency propagation, recording the propagation path information corresponding to each node state change to form a node state evolution track; after at least one complete propagation cycle is completed, analyzing the node state evolution track to identify whether a conflict evolution mode exists in which a node state exhibits periodic oscillation, continuous degradation or branch splitting across the plurality of propagation phases; and when a conflict evolution mode is detected, determining the corresponding node combination as a conflict node combination.
  6. The pre-labeling method of claim 5, wherein one or more of the following is satisfied: the intra-layer consistency propagation updates the consistency states of nodes within the same modality, the cross-modality consistency propagation performs consistency transmission between nodes of different modalities, the propagation is performed according to a preset priority order, and the propagation priority among different modalities is dynamically adjusted according to the initial states of the nodes; during propagation, whether a propagation condition is met is judged according to the initial consistency state of a node and the historical propagation results; when the consistency propagation meets the condition, the next propagation phase is executed, and when it does not, the propagation is suspended to wait for more samples or a condition update; whether a node reaches a conflict threshold is judged according to the consistency state change of the graph node, wherein the conflict threshold is determined based on the historical change of the node's consistency state, the degree of correlation among modalities, and the credibility of the candidate annotation.
  7. The pre-labeling method of claim 5, wherein constructing, for each detected conflict node combination, a counterfactual multi-modal sample associated with the candidate annotation based on the corresponding candidate annotation level comprises: determining, based on the conflict node combination, the candidate annotations most strongly associated with the conflict; replacing, perturbing or masking at least one modality's data for the determined candidate annotations; and constructing the counterfactual multi-modal sample based on the processed modality data.
  8. The pre-labeling method of claim 6, wherein determining the consistency stability of the candidate annotations under different modality combinations comprises: introducing the counterfactual multi-modal sample into the multi-modal consistency relation graph; executing multiple rounds of consistency backtracking verification on the multi-modal consistency relation graph after the counterfactual multi-modal sample has been introduced; and evaluating the consistency stability of the candidate annotations based on the results of the multiple rounds of backtracking verification.
  9. The pre-labeling method of claim 1, wherein executing annotation lifecycle state transition control on the candidate annotation according to the consistency stability result to determine the current annotation state of the candidate annotation comprises: calculating a credibility level of the candidate annotation according to the consistency stability result; and controlling the candidate annotation to transition among a plurality of preset annotation lifecycle states based on matching the credibility level against a plurality of preset credibility level bands or thresholds.
  10. An embodied intelligent VLA data pre-labeling device based on multi-modal data, comprising: an acquisition unit, configured to acquire a multi-modal data stream containing visual modality data, language modality data and action modality data, and to perform time-sequence segmentation on the multi-modal data stream to obtain a plurality of time-period data units; a generating unit, configured to identify, for each time-period data unit, the current task execution state based on the visual modality data and the language modality data, and to generate at least one candidate annotation level set corresponding to the task execution state; a graph construction unit, configured to perform multi-modal semantic representation on each level annotation in the candidate annotation level set and to construct a multi-modal consistency relation graph comprising visual nodes, language nodes and action nodes; a detection unit, configured to execute a consistency propagation operation in the multi-modal consistency relation graph to obtain the multi-modal consistency constraint relations among the nodes and to detect node combinations with conflicts; a sample construction unit, configured to construct, for each detected conflict node combination, a counterfactual multi-modal sample associated with the candidate annotation based on the corresponding candidate annotation level; a processing unit, configured to introduce the counterfactual multi-modal sample into the multi-modal consistency relation graph, to execute multiple rounds of consistency backtracking verification to determine the consistency stability of the candidate annotation under different modality combinations, and to execute annotation lifecycle state transition control on the candidate annotation according to the consistency stability result to determine its current annotation state; and an output unit, configured to determine the candidate annotation as the pre-labeling result of the VLA data and to output it when the annotation state of the candidate annotation meets the preset freezing condition.
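Claims 1 and 9 together describe a lifecycle state machine driven by credibility thresholds, ending in a frozen state that triggers output of the pre-label. A minimal sketch follows; the state names and the threshold bands (0.4 / 0.75 / 0.9) are invented for illustration and are not values from the patent:

```python
from enum import Enum

class LabelState(Enum):
    CANDIDATE = "candidate"   # freshly generated annotation
    SUSPECT = "suspect"       # low credibility, awaiting counterfactual checks
    VERIFIED = "verified"     # stable under backtracking verification
    FROZEN = "frozen"         # meets the freezing condition: emitted as pre-label

def transition(state: LabelState, credibility: float) -> LabelState:
    """Map a credibility level (derived from the consistency-stability result)
    onto preset threshold bands, following the structure of claim 9."""
    if credibility >= 0.9:
        return LabelState.FROZEN
    if credibility >= 0.75:
        return LabelState.VERIFIED
    if credibility < 0.4:
        return LabelState.SUSPECT
    return state  # middle band: hold the current state and wait for more evidence

def is_output_ready(state: LabelState) -> bool:
    """Claim 1's output rule: only frozen annotations become pre-labeling results."""
    return state is LabelState.FROZEN
```

For example, `transition(LabelState.CANDIDATE, 0.95)` returns `LabelState.FROZEN`, while a credibility of 0.6 leaves a candidate annotation unchanged in the middle band.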

Description

Embodied Intelligent VLA Data Pre-labeling Method and Device Based on Multi-modal Data

Technical Field

The invention relates to the technical field of robot control, and in particular to a method and a device for pre-labeling embodied intelligent VLA (Vision-Language-Action) data based on multi-modal data.

Background

With the development of artificial intelligence technology, multi-modal data combining Vision, Language and Action is widely applied in fields such as embodied intelligence, robot control, autonomous driving and human-machine interaction. In these application scenarios, it is often necessary to construct VLA datasets containing visual information, language descriptions and the corresponding action behaviors for training or validating multi-modal intelligent models. In practice, VLA data is usually derived from multi-modal data streams collected during the operation of a real system; such data is large in scale and long in time span, and the collection frequency and expression form of the different modalities differ significantly. How to label the multi-modal data efficiently and accurately therefore becomes a key factor limiting the construction efficiency of VLA datasets.

In existing schemes, labeling of VLA data mainly comprises manual labeling, semi-automatic labeling, and automatic labeling based on model inference. Manual labeling is accurate but costly, inefficient and difficult to scale; semi-automatic labeling generally relies on manual correction of automatically generated results and still requires substantial human involvement; and automatic labeling based on model inference generally generates the labeling result directly by extracting and matching features of the multi-modal data.
However, when processing multi-modal data, existing automatic and semi-automatic labeling methods generally treat multi-modal consistency as a static result; they lack the ability to model how the multi-modal semantic relations evolve over the course of processing, and therefore have difficulty accurately reflecting the stability and reliability of multi-modal annotation in complex scenes.

Disclosure of Invention

Based on this, the invention provides a method and a device that can improve the stability and reliability of multi-modal annotation in complex scenes. The invention provides an embodied intelligent VLA data pre-labeling method based on multi-modal data, comprising the following steps: acquiring a multi-modal data stream containing visual modality data, language modality data and action modality data, and performing time-sequence segmentation on the multi-modal data stream to obtain a plurality of time-period data units; for each time-period data unit, identifying a current task execution state based on the visual modality data and the language modality data, and generating at least one candidate annotation level set corresponding to the task execution state; performing multi-modal semantic representation on each level annotation in the candidate annotation level set, and constructing a multi-modal consistency relation graph comprising visual nodes, language nodes and action nodes; executing a consistency propagation operation in the multi-modal consistency relation graph to obtain multi-modal consistency constraint relations among the nodes, and detecting node combinations with conflicts; for each detected conflict node combination, constructing a counterfactual multi-modal sample associated with the candidate annotation based on the corresponding candidate annotation level; introducing the counterfactual multi-modal sample into the multi-modal consistency relation graph, and executing multiple rounds of consistency backtracking verification to determine the consistency stability of the candidate annotation under different modality combinations; executing annotation lifecycle state transition control on the candidate annotation according to the consistency stability result to determine the current annotation state of the candidate annotation; and when the annotation state of the candidate annotation meets a preset freezing condition, determining the candidate annotation as a pre-labeling result of the VLA data and outputting the pre-labeling result.

In one embodiment, obtaining the plurality of time-period data units comprises: performing time synchronization and alignment on the multi-modal data stream composed of visual modality data, language modality data and action modality data; determining candidate time-period boundaries segment by segment along the time sequence of the multi-modal data stream based on action change information in the action modality data; correcting the candidate time-period boundaries by combining scene change information in the visual modality data and semantic boundary information in the language modality data; and dividing the multi-modal data stream into a plurality of time-period data units based on the corrected time-period boundaries.
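The segmentation embodiment just described (action-driven candidate boundaries, corrected by visual and language cues) can be sketched roughly as follows; the change signals, the threshold, and the snap window are illustrative assumptions, not values from the patent:

```python
def segment_boundaries(action_delta: list[float],
                       scene_change: list[bool],
                       semantic_boundary: list[bool],
                       thresh: float = 0.5,
                       window: int = 1) -> list[int]:
    """Propose a boundary wherever the action signal changes sharply, then
    correct each proposal toward a nearby visual scene change or language
    semantic boundary, if one exists within `window` frames."""
    candidates = [t for t, d in enumerate(action_delta) if d > thresh]
    corrected: list[int] = []
    for t in candidates:
        snapped = t
        for off in range(-window, window + 1):
            u = t + off
            if 0 <= u < len(scene_change) and (scene_change[u] or semantic_boundary[u]):
                snapped = u
                break
        if snapped not in corrected:
            corrected.append(snapped)
    return corrected

# Two sharp action changes at frames 1 and 4; a scene change at frame 2
# pulls the first boundary onto the visual cue.
bounds = segment_boundaries(
    action_delta=[0.1, 0.9, 0.1, 0.1, 0.8, 0.1],
    scene_change=[False, False, True, False, False, False],
    semantic_boundary=[False] * 6,
)
# → [2, 4]
```

The corrected boundaries would then cut the multi-modal stream into the time-period data units that the rest of the pipeline consumes.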