CN-121973254-A - Layering alignment method and system for multi-mode sensing data set of soft manipulator

CN121973254A

Abstract

The invention discloses a hierarchical alignment method and system for a multi-modal sensing dataset of a soft manipulator. First, a three-stage design of 'single-modality noise evaluation, feature-level adaptive alignment, semantic-level constrained alignment' covers the full flow from data quality evaluation through feature space alignment to semantic consistency optimization, overcoming the limitation that traditional methods address only single-layer alignment and allowing the model to adapt progressively to the complex characteristics of tactile multi-modal sensing. Second, an adaptive temperature function is designed based on the comprehensive reliability of each modality pair, establishing a strictly monotonic mapping between the temperature coefficient and the modal noise characteristics: high-noise modality pairs automatically receive a larger temperature coefficient to enhance robustness, and low-noise modality pairs automatically receive a smaller temperature coefficient to improve alignment accuracy. This solves the technical problem that a fixed temperature coefficient cannot adapt to the heterogeneous noise characteristics of tactile multi-modal data, and a globally learnable adaptive adjustment coefficient provides end-to-end sensitivity adjustment.
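The monotonic mapping from pair reliability to temperature described above can be illustrated with a minimal sketch. The exact temperature function is not reproduced in this text, so the exponential form below, the names `base_tau` and `k`, and the reliability values are all assumptions chosen only to show the required behavior (higher reliability gives a strictly lower temperature).

```python
import math


def adaptive_temperature(reliability: float, base_tau: float = 0.1, k: float = 1.0) -> float:
    """Map a pair's comprehensive reliability in [0, 1] to a temperature.

    Assumed form: tau = base_tau * exp(-k * reliability). With k > 0 this is
    strictly decreasing in reliability, which is the only property the text
    requires; the actual function in the patent may differ.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    if k <= 0.0:
        raise ValueError("k must be positive to keep the mapping decreasing")
    return base_tau * math.exp(-k * reliability)


# Illustrative (made-up) reliabilities: noisier pairs get lower reliability.
pair_reliability = {"tactile-imu": 0.3, "tactile-vision": 0.6, "vision-text": 0.9}
temperatures = {name: adaptive_temperature(c) for name, c in pair_reliability.items()}

for name, tau in temperatures.items():
    print(f"{name}: reliability={pair_reliability[name]:.1f} -> tau={tau:.4f}")
```

In a trainable setting, `k` would be a learnable scalar updated together with the feature extractors; here it is a plain float for clarity.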

Inventors

HUANG JIAN
WANG HAOYUAN
LIAO YONGKAI
CHEN XINXING

Assignees

Huazhong University of Science and Technology (华中科技大学)

Dates

Publication Date: 20260505
Application Date: 20260407

Claims (10)

1. A hierarchical alignment method for a multi-modal sensing dataset of a soft manipulator, comprising: S1, quantifying the noise level of each modality in the multi-modal sensing dataset, calculating the data reliability of each modality based on its noise level, and calculating the comprehensive reliability of each modality pair based on the prior weight of the modality pair and the data reliability of each modality; S2, assigning differentiated temperature coefficients to different modality pairs based on the comprehensive reliability of each modality pair and a learnable adaptive adjustment coefficient, calculating a contrastive learning loss based on the perceptual feature vectors that the feature extractor of each modality extracts from the perception data of that modality for each sample and on the temperature coefficients of the different modality pairs, and adjusting the feature extractor of each modality and the adaptive adjustment coefficient based on the contrastive learning loss, wherein the adaptive adjustment coefficient is such that the higher the comprehensive reliability of any modality pair, the lower its temperature coefficient; after training converges, extracting the perceptual feature vector of each modality for each sample with the trained feature extractor of that modality, and calculating the cross-modal mean of the perceptual feature vectors of each sample as the reference center of that sample; and S3, calculating an intra-class alignment loss and a center protection loss based on the perceptual feature vector of each modality for each sample, as extracted by the feature extractor of that modality, and on the reference center of the corresponding sample, and fine-tuning the feature extractor of each modality based on the intra-class alignment loss and the center protection loss, wherein the intra-class alignment loss constrains the perceptual feature vectors of each modality for each sample to contract toward the reference center of that sample, the center protection loss constrains the actual center of the perceptual feature vectors of each sample from deviating from the reference center of that sample, and, after training converges, the perceptual feature vectors extracted by the feature extractor of each modality meet the inter-modal alignment requirement.
2. The hierarchical alignment method for a multi-modal sensing dataset of a soft manipulator of claim 1, wherein the intra-class alignment loss is calculated based on a formula (not reproduced in this text) defined over the following quantities: the intra-class alignment loss; B, the number of samples in the current training batch; M, the number of modalities; the perceptual feature vector of sample i for modality m; the reference center of sample i; and the collision score of sample i, which is higher the more concentrated the feature similarities of sample i across the different modality pairs are; wherein the feature similarity of sample i for any modality pair is obtained by scaling the cosine similarity between the perceptual feature vectors of sample i for the two modalities of that pair by the temperature coefficient of that pair.
3. The hierarchical alignment method for a multi-modal sensing dataset of a soft manipulator of claim 2, wherein the collision score of any sample i is calculated based on formulas (not reproduced in this text) defined over the following quantities: the Sigmoid function; a learnable rule weight; K, the number of fuzzy rules; a scaling factor; an activation threshold; the activation weight of the k-th fuzzy rule for sample i; the activation weight obtained after filtering that activation weight; R, the number of modality pairs; the membership degree of sample i for the k-th fuzzy rule; the feature similarity vector of sample i over the different modality pairs; the feature similarity of sample i for the r-th modality pair; and two learnable parameters that control the center position and the width of the membership function.
4. The hierarchical alignment method for a multi-modal sensing dataset of a soft manipulator of claim 1, wherein the center protection loss is calculated based on formulas (not reproduced in this text) defined over the following quantities: the center protection loss; B, the number of samples in the current training batch; the actual center of the perceptual feature vectors of sample i across the modalities; the reference center of sample i; a tolerance ratio parameter; and the global average inter-class distance, which is defined in turn over N, the total number of samples in the multi-modal sensing dataset, and the reference center of sample j.
5. The hierarchical alignment method for a multi-modal sensing dataset of a soft manipulator of claim 1, wherein the temperature coefficient of any modality pair is calculated based on a formula (not reproduced in this text) defined over the following quantities: the temperature coefficient of the modality pair (m, n); the base temperature coefficient; k, the adaptive adjustment coefficient; and the comprehensive reliability of the modality pair (m, n).
6. The hierarchical alignment method for a multi-modal sensing dataset of a soft manipulator of claim 5, wherein the contrastive learning loss is expressed by formulas (not reproduced in this text) defined over the following quantities: the contrastive learning loss; M, the number of modalities; the cross-modal loss of the modality pair (m, n); a regularization term; B, the number of samples in the current training batch; the cross-modal similarity matrix of the modality pair (m, n); the normalized perceptual feature vector of sample i for modality m; the normalized perceptual feature vector of sample j for modality n; and two preset values.
7. The hierarchical alignment method for a multi-modal sensing dataset of a soft manipulator of claim 1, wherein quantifying the noise level of each modality in the multi-modal sensing dataset, calculating the data reliability of each modality based on its noise level, and calculating the comprehensive reliability of each modality pair based on the prior weight of the modality pair and the data reliability of each modality specifically comprises: generating masked data from the perception data of each modality for each sample in the current training batch by a random masking operation; reconstructing the masked data of each modality for each sample in the current training batch with the reconstruction network of that modality, and calculating the reconstruction loss of each modality; calculating the ratio of the reconstruction loss of each modality to the historical maximum reconstruction loss of that modality as the noise level of that modality, wherein the historical maximum reconstruction loss of a modality is the largest reconstruction loss calculated while training the reconstruction network of that modality; calculating the data reliability of each modality based on its noise level, wherein the higher the noise level of a modality, the lower its data reliability; and, for any modality pair, calculating the product of the mean data reliability of the two modalities in the pair and the prior weight of the pair as the comprehensive reliability of the pair.
8. The hierarchical alignment method for a multi-modal sensing dataset of a soft manipulator of claim 1, further comprising, prior to step S1: acquiring, with a multi-modal synchronous acquisition system, perception data in different modalities for grasping actions on articles covering different material, shape, size, hardness, and color attributes, and constructing the multi-modal sensing dataset from the perception data in the different modalities corresponding to each grasping action, wherein one grasping action corresponds to one sample, and the multi-modal synchronous acquisition system covers multiple modalities among vision, inertia, tactile pressure, and text.
9. A hierarchical alignment system for a multi-modal sensing dataset of a soft manipulator, comprising: a single-modality noise evaluation unit for quantifying the noise level of each modality in the multi-modal sensing dataset, calculating the data reliability of each modality based on its noise level, and calculating the comprehensive reliability of each modality pair based on the prior weight of the modality pair and the data reliability of each modality; a feature-level adaptive alignment unit for assigning differentiated temperature coefficients to different modality pairs based on the comprehensive reliability of each modality pair and a learnable adaptive adjustment coefficient, calculating a contrastive learning loss based on the perceptual feature vectors that the feature extractor of each modality extracts from the perception data of that modality for each sample and on the temperature coefficients of the different modality pairs, and adjusting the feature extractor of each modality and the adaptive adjustment coefficient based on the contrastive learning loss, wherein the adaptive adjustment coefficient is such that the higher the comprehensive reliability of any modality pair, the lower its temperature coefficient, and, after training converges, extracting the perceptual feature vector of each modality for each sample and calculating their cross-modal mean as the reference center of that sample; and a semantic-level constraint alignment unit for calculating an intra-class alignment loss and a center protection loss based on the perceptual feature vector of each modality for each sample, as extracted by the feature extractor of that modality, and on the reference center of the corresponding sample, and fine-tuning the feature extractor of each modality based on the intra-class alignment loss and the center protection loss, wherein the intra-class alignment loss constrains the perceptual feature vectors of each modality for each sample to contract toward the reference center of that sample, the center protection loss constrains the actual center of the perceptual feature vectors of each sample from deviating from the reference center of that sample, and, after training converges, the perceptual feature vectors extracted by the feature extractor of each modality meet the cross-modal alignment requirement.
10. An electronic device comprising a computer-readable storage medium and a processor, wherein the computer-readable storage medium stores executable instructions, and the processor is configured to read the executable instructions stored in the computer-readable storage medium and perform the method of any one of claims 1-8.
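The reliability computation of step S1, as detailed in claim 7, can be sketched as follows. The ratio-based noise level and the "prior weight times mean reliability" rule come from the claim; the mapping `reliability = 1 - noise` is an assumption (the claim only requires reliability to decrease as noise increases), and the function names, loss values, and prior weights are hypothetical.

```python
from itertools import combinations


def noise_level(recon_loss: float, hist_max_loss: float) -> float:
    """Noise level = current reconstruction loss / historical maximum loss."""
    if hist_max_loss <= 0.0:
        raise ValueError("historical maximum loss must be positive")
    # Defensive clamp: the ratio should not exceed 1 if the maximum is tracked correctly.
    return min(recon_loss / hist_max_loss, 1.0)


def data_reliability(noise: float) -> float:
    """Assumed decreasing map from noise level to reliability."""
    return 1.0 - noise


def pair_reliability(reliability: dict, prior_weights: dict) -> dict:
    """Comprehensive reliability = prior weight * mean reliability of the two modalities.

    Pair keys are tuples of modality names in sorted order; a missing prior
    weight defaults to 1.0 (an assumption made for this sketch).
    """
    result = {}
    for m, n in combinations(sorted(reliability), 2):
        mean_rel = (reliability[m] + reliability[n]) / 2.0
        result[(m, n)] = prior_weights.get((m, n), 1.0) * mean_rel
    return result


# Illustrative (made-up) masked-reconstruction losses per modality.
recon_loss = {"vision": 0.2, "tactile": 0.6, "imu": 0.8}
hist_max = {"vision": 1.0, "tactile": 1.0, "imu": 1.0}

reliability = {
    m: data_reliability(noise_level(recon_loss[m], hist_max[m])) for m in recon_loss
}
comprehensive = pair_reliability(reliability, prior_weights={})

for pair, value in comprehensive.items():
    print(pair, round(value, 3))
```

The resulting per-pair values are what step S2 would consume when assigning a temperature coefficient to each modality pair.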

Description

Hierarchical alignment method and system for a multi-modal sensing dataset of a soft manipulator

Technical Field

The invention belongs to the field of soft robots, and in particular relates to a hierarchical alignment method and system for a multi-modal sensing dataset of a soft manipulator.

Background

Owing to their compliant materials, soft robots have clear advantages in scenarios such as human-robot collaboration and precise manipulation. Among them, the pneumatic soft manipulator is a practical choice because of its simple structure and low cost. Intelligent operation of a soft manipulator depends on the multi-modal sensing system accurately identifying the grasping state; key states include shaking, stable grasping, and object slipping.

Current methods for constructing multi-modal datasets have the following limitations. First, modality coverage is incomplete: most datasets contain only visual or single tactile modalities and lack synchronous acquisition of all of the visual, inertial, tactile, and text modalities. Second, object diversity is insufficient: the types of objects acquired are limited and hardly cover combinations of attributes such as material, shape, and hardness, so model generalization is weak. Third, temporal information is insufficient: single-frame static data is mostly used, and dynamic changes during grasping are ignored. Fourth, textual semantic annotation is absent: structured descriptions of object attributes are lacking, which makes visual-language joint learning difficult to support.

Current multi-modal alignment techniques mainly build on general multi-modal model frameworks and use contrastive learning to achieve cross-modal feature mapping.
The key idea is to map the features of different sensing modalities (such as vision, touch, IMU, and text) into a unified semantic space, and to adjust the sharpness of the similarity distribution through a temperature coefficient so that cross-modal features of the same semantic content have high similarity.

However, existing methods face two core challenges in multi-modal sensing for soft manipulators. First, a fixed temperature coefficient cannot adapt to the heterogeneity of modal noise. In a multi-modal system, the noise characteristics of different modality pairs differ significantly: the tactile-IMU pair has a high noise level because of mechanical vibration, environmental interference, and similar factors; the visual-text pair has low noise because of its strong semantic relevance; and the noise characteristics of the tactile-visual pair lie in between. Existing methods generally adopt a fixed temperature coefficient, which cannot simultaneously meet the robustness requirement of high-noise modality pairs and the precise-alignment requirement of low-noise modality pairs, so overall alignment quality is reduced.

Second, a hierarchical alignment mechanism is lacking. Existing methods focus on feature-level alignment and therefore struggle to resolve cross-modal semantic conflicts effectively. When differences in the noise characteristics of the modality pairs make the spatial distribution of the features inconsistent, feature-level alignment alone cannot ensure consistency at the semantic level, which limits the robustness and interpretability of multi-modal representation learning.
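The effect of the temperature coefficient on the sharpness of the similarity distribution can be illustrated with a minimal sketch of a standard one-directional InfoNCE loss for a single modality pair. This is a generic textbook formulation and not the loss defined in the claims (which also includes a regularization term); the feature values and function names are made up for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores, tau):
    """Temperature-scaled softmax; a smaller tau gives a sharper distribution."""
    scaled = [s / tau for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def info_nce(feats_m, feats_n, tau):
    """Mean one-directional InfoNCE loss; sample i in m is positive with sample i in n."""
    a = [l2_normalize(v) for v in feats_m]
    b = [l2_normalize(v) for v in feats_n]
    loss = 0.0
    for i, a_i in enumerate(a):
        sims = [sum(x * y for x, y in zip(a_i, b_j)) for b_j in b]
        loss -= math.log(softmax(sims, tau)[i])
    return loss / len(a)


# Two samples with roughly aligned features in two modalities (illustrative values).
feats_m = [[1.0, 0.0], [0.0, 1.0]]
feats_n = [[0.9, 0.1], [0.2, 0.8]]

sharp = softmax([0.9, 0.1], tau=0.05)
soft = softmax([0.9, 0.1], tau=0.5)
print("sharp:", sharp, "soft:", soft)
print("loss at tau=0.05:", info_nce(feats_m, feats_n, 0.05))
print("loss at tau=0.5:", info_nce(feats_m, feats_n, 0.5))
```

With clean, well-aligned features a small temperature drives the loss toward zero, while a larger temperature flattens the distribution and is more forgiving of noisy similarities, which is the trade-off a single fixed temperature cannot satisfy for all modality pairs at once.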
Therefore, there is a pressing need for an adaptive alignment strategy with hierarchical processing capability that ensures robust cross-modal semantic understanding and provides high-quality data alignment support for adapting and fine-tuning large models in soft-robot physical interaction scenarios.

Disclosure of Invention

In view of the above defects of, or needs for improvement in, the prior art, the invention provides a hierarchical alignment method and system for a multi-modal sensing dataset of a soft manipulator, thereby addressing the technical problems in multi-modal sensing that a fixed temperature coefficient cannot adapt to noise heterogeneity, that a hierarchical alignment mechanism is lacking, and that there is a risk of feature collapse.

To achieve the above object, according to a first aspect of the invention, there is provided a hierarchical alignment method for a multi-modal sensing dataset of a soft manipulator, including: S1, quantifying the noise level of each modality in the multi-modal sensing dataset, calculating the data reliability of each modality based on its noise level, and calculating the comprehensive reliability of each modality pair based on the prior weight of the modality pair and the data reliability of each modality; S2, assigning differentiated temperature coefficients to different modality pairs based on the comprehensive reliability of each modality pair and a learnable adaptive adjustment coefficient, calculating contrastive learning loss based on a perception feature vector extract