
CN-121982328-A - Haptic characterization extraction method with high-low frequency separation

CN 121982328 A

Abstract

A haptic characterization extraction method with high-low frequency separation belongs to the field of robot haptics. The method comprises the steps of obtaining and preprocessing tactile data, constructing a low-frequency characterization extraction model, constructing a high-frequency characterization extraction model, and applying each model to downstream tasks for supervised training. The invention realizes the decoupling of multidimensional sensing tasks, breaks through the limitation of traditional methods that treat the tactile image as a single visual signal, and accurately deconstructs the tactile information, through frequency-domain transformation, into a low-frequency component reflecting the macroscopic topology and a high-frequency component reflecting the microscopic texture. The dual-path architecture dedicates the low-frequency branch to force estimation and slip detection and the high-frequency branch to fine-grained texture classification, so that the precision of multi-task parallel processing is remarkably improved.

Inventors

  • DENG WENQIAN
  • LIU QIAN

Assignees

  • Dalian University of Technology (大连理工大学)

Dates

Publication Date
2026-05-05
Application Date
2026-01-27

Claims (6)

  1. A haptic characterization extraction method with high-low frequency separation, characterized by comprising the following steps: Step 1, obtaining and preprocessing tactile data; Step 2, constructing a low-frequency characterization extraction model; Step 3, applying the low-frequency characterization extraction model to downstream tasks for supervised training: using the low-frequency characterization extracted by the model constructed in Step 2, freezing the encoder parameters of the low-frequency characterization extraction model and building a task-specific decoder for training; Step 4, constructing a high-frequency characterization extraction model; Step 5, applying the high-frequency characterization extraction model to downstream tasks for supervised training: using the high-frequency characterization extracted by the model constructed in Step 4, freezing the encoder parameters of the high-frequency characterization extraction model, building a task-specific decoder for training, and evaluating the effect of the extracted characterization on robot operation tasks such as texture classification.
  2. The haptic characterization extraction method with high-low frequency separation according to claim 1, wherein in Step 1 the tactile data comprise tactile data collected under pressing and sliding operations on different geometric features and tactile data collected under pressing and sliding operations on different materials, and the preprocessing comprises removing the background of the original image, resizing the deformation map, and normalizing the features to obtain a single-channel tactile image I.
  3. The haptic characterization extraction method with high-low frequency separation according to claim 2, wherein Step 2 comprises: Step 2.1, space-frequency domain conversion: performing a two-dimensional discrete Fourier transform on the preprocessed single-channel tactile image I to convert it into a frequency-domain representation F(u,v) = Σ_{h=0}^{H-1} Σ_{w=0}^{W-1} I(h,w) e^{-i2π(uh/H + vw/W)}, wherein (h,w) denotes the spatial-domain coordinates, (u,v) denotes the frequency-domain coordinates, H and W denote the height and width of the image respectively, and the exponential term e^{-i2π(uh/H + vw/W)} is the Fourier basis function; Step 2.2, low-frequency component filtering: applying an ideal low-pass filter to the frequency-domain representation obtained in Step 2.1, constructing a binary mask by setting a cut-off radius r, preserving only the low-frequency information in the central region of the spectrum and filtering out high-frequency noise and detail features; the binary mask of the ideal low-pass filter in the frequency domain is expressed as M(u,v) = 1 if sqrt((u-u_c)² + (v-v_c)²) ≤ r, and 0 otherwise, wherein the spectrum centre (u_c, v_c) corresponds to the direct-current component and the cut-off radius r controls the preserved frequency band; Step 2.3, time-domain signal reconstruction: performing an inverse discrete Fourier transform on the low-pass-filtered frequency-domain signal to reconstruct a low-frequency target image Y containing only the geometric outline of the object and the pressure distribution information, Y(h,w) = (1/(HW)) Σ_{u=0}^{H-1} Σ_{v=0}^{W-1} F(u,v) M(u,v) e^{i2π(uh/H + vw/W)}, wherein the output Y is the reconstructed image with the high-frequency information removed; Step 2.4, masking and blocking: dividing the low-frequency target image Y into a plurality of non-overlapping image blocks and applying random masking at a ratio above 75%; Step 2.5, characterization encoding: inputting the unmasked visible image blocks into a mask-based MAE encoder of a Transformer architecture to obtain deep low-frequency feature vectors; Step 2.6, frequency-domain alignment constraint: local frequency-domain alignment inputs the deep low-frequency feature vectors into an MAE decoder of a Transformer architecture to obtain a reconstructed masked image, and computes the difference between the reconstructed masked image and the original low-frequency target image Y with a mean-square-error loss, ensuring that the model learns the local spatial topology; global frequency-domain alignment transforms the reconstructed masked image obtained in the local alignment process back into a frequency-domain representation, transforms the preprocessed single-channel tactile image I into a frequency-domain representation, and computes the loss between the two frequency-domain representations, namely the global frequency-domain loss, keeping the encoded characterization consistent with the original tactile data in global spectrum distribution and thereby enhancing the model's ability to capture latent cross-sensor commonalities; the global frequency-domain loss function is L_gf = Σ_c || F(Ŷ_c ⊙ M + sg(I_c) ⊙ (1 - M)) - F(I_c) ||², wherein c is the colour channel index, F(·) is the two-dimensional discrete Fourier transform, Ŷ is the reconstructed prediction image of the decoder, M is the mask matrix, sg(·) is the stop-gradient operation, and I is the preprocessed single-channel tactile image; Step 2.7, performing self-supervised training of the low-frequency characterization extraction model through a loss function consisting of three parts: the local frequency-domain alignment loss L_local, the original spatial-domain loss L_spatial of the global frequency-domain alignment, and the added global frequency-domain loss L_gf, i.e. L_low = λ1·L_local + λ2·L_spatial + λ3·L_gf, wherein λ1, λ2 and λ3 respectively denote the weights of the loss terms, obtained through experimental tests.
  4. The haptic characterization extraction method with high-low frequency separation according to claim 3, wherein in Step 3: the force estimation task branch freezes the encoder part of the low-frequency characterization extraction model, inputs the extracted low-frequency characterization into an attention-based pooler that realizes weighted aggregation of the spatial features by learning a global query vector, and then feeds the aggregated feature vector into a regression decoder consisting of a two-layer linear perceptron; through hidden-layer dimension reduction and ReLU nonlinear activation, the decoder establishes the nonlinear mapping between the tactile-image deformation features and the three-dimensional contact force (F_x, F_y, F_z), thereby constructing a force estimation model; during training, the true three-axis force measurements are normalized to the interval [-1, 1], the L1 loss function is used as the regression loss for supervised training, and an Adam optimizer iterates until the force estimation model converges; the slip detection task branch freezes the encoder part of the low-frequency characterization extraction model and inputs the low-frequency characterization into a dual-decoder framework comprising a slip-state classification decoder and a normalized force-variation regression decoder; the dual-decoder framework first aggregates the low-frequency characterization into a global feature vector through an attention pooling layer; the slip-state classification decoder reduces the dimension through a linear layer, applies a Sigmoid activation for the nonlinear transformation, and finally outputs the two-class logits of the slip/stick states; the normalized force-variation regression decoder uses ReLU activation in the hidden layer and introduces a Hardtanh function at the output, constraining the triaxial force-variation trend to the standard physical interval [-1, 1]; exploiting the physical coupling between the slip state and the force-variation trend, the slip-state classification decoder is trained with cross-entropy loss for the slip/stick binary classification, and the normalized force-variation regression decoder optimizes the force-variation prediction with the mean absolute error.
  5. The haptic characterization extraction method with high-low frequency separation according to claim 4, wherein Step 4 comprises: Step 4.1, frequency-domain conversion: performing a two-dimensional discrete Fourier transform on the preprocessed single-channel tactile image I to convert it from the spatial domain to a frequency-domain representation; Step 4.2, high-pass filtering: applying a high-pass filter to the frequency-domain representation obtained in Step 4.1 to filter out the low-frequency signals reflecting macroscopic geometry and overall deformation, retaining the high-frequency components outside the spectrum centre; Step 4.3, signal reconstruction: restoring the high-pass-filtered frequency-domain signal to the spatial domain through an inverse discrete Fourier transform to generate a high-frequency target image Y_h, which concentrates the micro-textures, edge contours and fine dynamic characteristics of the object interaction process; Step 4.4, saliency estimation: dividing the high-frequency target image Y_h into a plurality of non-overlapping visible image blocks and inputting the divided image blocks into a saliency estimator; the saliency estimator adopts a gradient-weighting method, uses a Sobel operator to extract the gradients of Y_h, and computes the sum of the gradient magnitudes within each visible image block as that block's saliency score, thereby extracting the local energy distribution of the image; the saliency estimator outputs a saliency map S matching the dimensions of the input visible image blocks, in which the value of each pixel or region represents the probability or energy intensity that the position contains key texture features; Step 4.5, importance-based adaptive sampling and characterization encoding: a sampler samples the image blocks non-uniformly according to the saliency map S, masking the regions whose saliency scores lie in the top 25%; the saliency map S is divided into an image-block sequence consistent with the input of the subsequently used Transformer-architecture asymmetric encoder, and the probability p_i of the i-th image block being selected as a mask block is positively correlated with its saliency score s_i, i.e. p_i = exp(s_i/τ) / Σ_j exp(s_j/τ), wherein τ is a temperature parameter used to adjust the smoothness of the sampling distribution; then, according to the preset mask rate and the probabilities p_i, the image blocks with the highest saliency scores are preferentially masked, and the unmasked image blocks are fed into an MAE encoder of a Transformer architecture to obtain deep high-frequency feature vectors; Step 4.6, loss function calculation: inputting the deep high-frequency feature vectors obtained in Step 4.5 into an MAE decoder based on a Transformer architecture to obtain a reconstructed masked image; the loss function design introduces a weighted combination of a feature regression loss and a pixel reconstruction loss, aiming to keep the reconstructed masked image consistent with the high-frequency target image Y_h in both detail texture and deep semantics through multi-scale constraints; the feature regression loss is expressed as L_feat = Σ_i || h(ẑ_i) - z_i^t ||², wherein ẑ_i denotes the prediction feature vector output by the encoder for the i-th masked image block, z_i^t denotes the target feature vector extracted by a pre-trained teacher network for the corresponding block of the original high-frequency image, and h(·) denotes a feature projection layer; the pixel reconstruction loss is L_pix = Σ_i || ŷ_i - y_i ||², wherein ŷ_i denotes the value of the i-th high-frequency pixel block reconstructed by the Transformer-architecture MAE decoder and y_i denotes the corresponding block of the high-frequency target image Y_h obtained in Step 4.3; Step 4.7, performing self-supervised training of the high-frequency characterization extraction model through a loss function consisting of two parts, the feature regression loss L_feat and the pixel reconstruction loss L_pix, i.e. L_high = λ4·L_feat + λ5·L_pix, wherein λ4 and λ5 respectively denote the weights of the loss terms, obtained through experimental tests.
  6. The method according to claim 5, wherein the texture classification branch in Step 5 freezes the encoder part of the high-frequency characterization extraction model and inputs the deep high-frequency characterization into a task-specific linear classification decoder; the linear classification decoder first aggregates the high-frequency characterization sequence into a global texture vector through an attention pooling layer with a cross-attention mechanism, then inputs the global texture vector into a bottleneck mapping network consisting of two linear layers, wherein the first linear layer compresses the characterization dimension to a quarter of the initial dimension and applies a ReLU activation to strengthen the nonlinear expression of the features, and the second linear layer finally maps the features into the preset material class space; supervised learning is performed with a cross-entropy loss for texture classification, L_cls = -Σ_{k=1}^{C} y_k log p_k, wherein C is the total number of texture classes, y_k is the ground-truth label, and p_k is the probability distribution output by the decoder; during training, the classification loss L_cls, the pixel reconstruction loss L_pix and the feature regression loss L_feat jointly guide the model optimization, so that the characterization vector retains the ability to restore the image while possessing strong semantic discrimination.
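The low-frequency pipeline of claim 3 (Steps 2.1-2.3: 2-D DFT, ideal low-pass binary mask with cut-off radius r, inverse DFT) can be sketched with NumPy's FFT routines. This is a minimal illustration, not the patent's implementation; the function name, image size and cut-off radius are assumptions.

```python
import numpy as np

def lowfreq_target(img, r):
    # Step 2.1: 2-D DFT of the single-channel tactile image I, with the
    # direct-current component shifted to the spectrum centre (u_c, v_c).
    H, W = img.shape
    F = np.fft.fftshift(np.fft.fft2(img))
    # Step 2.2: ideal low-pass binary mask, 1 inside radius r of the centre.
    u = np.arange(H)[:, None] - H // 2
    v = np.arange(W)[None, :] - W // 2
    mask = (np.sqrt(u**2 + v**2) <= r).astype(float)
    # Step 2.3: inverse DFT reconstructs the low-frequency target image Y.
    return np.fft.ifft2(np.fft.ifftshift(F * mask)).real

rng = np.random.default_rng(0)
I = rng.random((32, 32))      # stand-in for a preprocessed tactile image
Y = lowfreq_target(I, r=4)
```

Because the mask keeps the DC component, the reconstruction preserves the image mean while stripping high-frequency energy, which is what makes Y a target dominated by macroscopic contour and pressure distribution.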
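The saliency-guided masking of claim 5 (Steps 4.4-4.5) combines Sobel gradient energy per block with a temperature softmax over the scores. The sketch below, under assumed patch and image sizes, shows one plausible reading of that mechanism; all names are illustrative.

```python
import numpy as np

def saliency_map(img, patch=8):
    # Step 4.4: Sobel gradients of the high-frequency target image; the sum
    # of gradient magnitudes inside each non-overlapping patch is its score.
    kx = np.array([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
    ky = kx.T
    H, W = img.shape
    pad = np.pad(img, 1, mode="edge")
    gx = sum(kx[i, j] * pad[i:i + H, j:j + W] for i in range(3) for j in range(3))
    gy = sum(ky[i, j] * pad[i:i + H, j:j + W] for i in range(3) for j in range(3))
    mag = np.hypot(gx, gy)
    return mag.reshape(H // patch, patch, W // patch, patch).sum(axis=(1, 3))

def mask_probs(scores, tau=1.0):
    # Step 4.5: p_i = exp(s_i / tau) / sum_j exp(s_j / tau); blocks with
    # higher saliency are more likely to be selected as mask blocks.
    z = scores.ravel() / tau
    z = z - z.max()                 # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(1)
Yh = rng.random((32, 32))           # stand-in for the high-frequency target
S = saliency_map(Yh, patch=8)       # 4 x 4 grid of per-block saliency scores
p = mask_probs(S, tau=0.5)
```

A smaller τ sharpens the distribution toward the highest-saliency blocks, matching the claim's preference for masking the most texture-rich regions first.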
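The downstream decoder heads of claims 4 and 6 share the same skeleton: attention pooling with a learned global query, then a small two-layer perceptron. The NumPy sketch below illustrates that skeleton for the force branch; the weight shapes are assumptions, and np.clip stands in for the Hardtanh-style clamping to [-1, 1] used for normalized outputs.

```python
import numpy as np

def attention_pool(tokens, query):
    # Weighted aggregation of spatial features via a learned global query:
    # softmax(tokens @ query / sqrt(d)) gives the per-token weights.
    d = tokens.shape[-1]
    scores = tokens @ query / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ tokens

def force_head(feat, W1, W2):
    # Two-layer perceptron: hidden-layer dimension reduction + ReLU, then a
    # linear map to the three-axis force (F_x, F_y, F_z); np.clip keeps the
    # normalized outputs inside the standard physical interval [-1, 1].
    h = np.maximum(feat @ W1, 0.0)          # ReLU activation
    return np.clip(h @ W2, -1.0, 1.0)

rng = np.random.default_rng(2)
tokens = rng.standard_normal((16, 32))      # 16 frozen-encoder tokens, dim 32
query = rng.standard_normal(32)             # learned global query vector
W1 = rng.standard_normal((32, 8)) * 0.1     # hidden-layer reduction 32 -> 8
W2 = rng.standard_normal((8, 3)) * 0.1      # map to (F_x, F_y, F_z)
force = force_head(attention_pool(tokens, query), W1, W2)
```

The texture branch of claim 6 would swap the regression output for a C-way linear classifier trained with cross-entropy, reusing the same pooled vector.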

Description

Haptic characterization extraction method with high-low frequency separation

Technical Field

The invention belongs to the field of robot touch, and relates to a high-low frequency separation tactile characterization extraction method for robot operation tasks.

Background

As an important development field of the fourth industrial revolution, intelligent robot technology has become an important component of innovation strategies worldwide. From simple handling at the outset to precision assembly today, robots have been able to assist or replace humans in a range of production and handling tasks. In the current period of rapid development of artificial intelligence, the functional requirements placed on robots go beyond simple, repetitive and limited tasks in an industrial environment; more and more intelligent robots are being developed and applied to complex, changeable real scenes, such as rehabilitation therapy, deep-sea exploration, bomb disposal and space teleoperation, which place higher demands on the sensing and operating capabilities of robots. Vision is an important perception system of today's robots, but in some situations, such as robotic grasping, the machine also needs haptic perception capabilities; by adding haptic interaction, complex tasks can be performed better. Touch is one of the five main human senses, through which the human body can perceive the shape, texture, temperature, hardness and weight of an object. By analogy with visual perception, haptic perception may be divided into lower-layer, middle-layer and higher-layer information. The lower-layer information is the original tactile signal, whose form is not unique and depends on the chosen tactile sensor and its measurement principle; the middle-layer information describes the mechanical attributes of the contact interface, the object and the environment, and its type is independent of the sensor form; the higher-layer information is the understanding of operation states and behaviours, which is independent of the sensor form and closely related to the task representation layer.

In the fields of artificial intelligence and robotics, haptic perception, as an important perception modality in parallel with vision and speech, is receiving increasing research attention. Haptic sensation not only provides geometric, force and dynamic contact information for interaction with objects, but also addresses many shortcomings of visual sensing, such as poor lighting conditions or occluded views. Introducing tactile feedback into robotic manipulation tasks such as grasping, assembly and fabric manipulation significantly improves the accuracy and adaptability of the robot. The tactile sensor is one of the key core components with which a robot performs complicated fine operations, and different forms of tactile sensor vary widely. Array-based tactile sensors are compact and fast in signal measurement, but their sensing types are limited, their spatial density is low, their fabrication processes are demanding and costly, and their signal acquisition hardware is complex. In recent years, with the rapid development of machine-vision technology, vision-based tactile sensors (visual-tactile sensors) have become an important research direction in the field of tactile perception; by embedding a camera and a soft elastomer in the design, they can capture high-resolution physical interaction images at a lower cost.

These images contain rich information such as contact geometry, force-field distribution and texture properties, and show great potential in robotic manipulation tasks. Haptic characterization study is a key link in the field of haptic perception, aimed at extracting meaningful feature information from haptic data to support robot operation tasks in different scenarios. Tactile data are complex in form, containing geometric and mechanical information such as force, displacement and contact area, and may also include time-series characteristics of the dynamic contact process. Therefore, how to extract efficient characterizations from massive, multi-modal haptic data has become one of the core problems of robot haptic research. At present, deep-learning-based characterization learning methods have made breakthrough progress in the visual and speech fields, and their application in the haptic field has also demonstrated great potential. Through characterization learning, tactile data can be converted from lower-layer raw signals into middle-layer object characteristic features, and further refined into higher-layer understanding of operation behaviour. This hierarchical characterization approach can effectively reduce the complexity of data processing and improve the adaptability of the touch