CN-122021566-A - Feature Token sequence clipping method, device, equipment and storage medium
Abstract
The application discloses a feature Token sequence clipping method, apparatus, device, and storage medium, relating to the technical field of data processing. The method comprises: determining data to be input, inputting the data to be input into a preset multimodal model, and determining the content complexity corresponding to the data to be input through the preset multimodal model; dynamically clipping, based on the content complexity, the feature Token sequence corresponding to the data to be input to obtain a retained target sequence; and sending the target sequence to a cloud end, so that the cloud end performs inference on the task corresponding to the data to be input based on the target sequence. The application thereby meets the requirements of different application scenarios.
Inventors
- CUI YINGJIE
- ZHOU XIQIN
- CHEN YONG
- JIANG ZHONGLIN
- LI ZHICHENG
- ZHANG XIAOPEI
- ZHAO JIAN
- TIAN CAN
- CHEN RONGFA
- YU CHUNCUN
Assignees
- Zhejiang Geely Holding Group Co., Ltd.
- Geely Automobile Research Institute (Ningbo) Co., Ltd.
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-01-23
Claims (14)
- 1. A feature Token sequence clipping method, applied to an edge device, the feature Token sequence clipping method comprising: determining data to be input, inputting the data to be input into a preset multimodal model, and determining, through the preset multimodal model, the content complexity corresponding to the data to be input; dynamically clipping, based on the content complexity, the feature Token sequence corresponding to the data to be input to obtain a retained target sequence; and sending the target sequence to a cloud end, so that the cloud end performs inference on the task corresponding to the data to be input based on the target sequence.
- 2. The feature Token sequence clipping method according to claim 1, wherein the step of determining, through the preset multimodal model, the content complexity corresponding to the data to be input comprises: extracting, through the preset multimodal model, the state of a global feature vector corresponding to the data to be input or the richness of its scene information; and inputting the state of the global feature vector or the scene-information richness into a content-aware clipping sub-module in the preset multimodal model, and determining the content complexity corresponding to the data to be input.
- 3. The feature Token sequence clipping method according to claim 2, wherein the content-aware clipping sub-module is a trained preset lightweight network sub-model.
- 4. The feature Token sequence clipping method according to claim 2, wherein the step of dynamically clipping, based on the content complexity, the feature Token sequence corresponding to the data to be input to obtain the target sequence comprises: determining the importance degree of the feature Token sequence corresponding to the data to be input; and dynamically clipping, based on the content complexity and the importance degree, the feature Token sequence corresponding to the data to be input to obtain the target sequence.
- 5. The feature Token sequence clipping method according to claim 4, wherein the step of dynamically clipping, based on the content complexity and the importance degree, the feature Token sequence corresponding to the data to be input to obtain the target sequence comprises: predicting a suggested Token retention ratio based on the content complexity; calculating the number of Tokens to be retained according to the Token retention ratio; and sorting the feature Token sequence by importance degree from high to low, and dynamically clipping the feature Token sequence corresponding to the data to be input based on the sorting and the number of Tokens to be retained to obtain the target sequence (an illustrative clipping sketch follows the claims).
- 6. The feature Token sequence clipping method according to claim 4, wherein the step of determining the importance degree of the feature Token sequence corresponding to the data to be input comprises: inputting the feature Token sequence corresponding to the data to be input and the global feature vector corresponding to the data to be input into a target importance discrimination network of the content-aware clipping sub-module; determining importance scores of the feature Token sequence based on the target importance discrimination network; and determining the importance degree of the feature Token sequence corresponding to the data to be input according to the importance scores.
- 7. The feature Token sequence clipping method according to claim 2, wherein, before the step of inputting the data to be input into the preset multimodal model and determining, through the preset multimodal model, the content complexity corresponding to the data to be input, the method further comprises: acquiring a training sample and a label corresponding to the training sample; inputting the training sample into a preset model to be trained, and predicting the training sample based on the preset model to be trained to obtain a prediction result, wherein the preset model to be trained is provided with a corresponding multi-objective loss function, and the multi-objective loss function at least comprises a loss function of a main task corresponding to the training sample; comparing the prediction result with the label to obtain a comparison result; and performing iterative training on the preset model to be trained based on the comparison result to obtain the preset multimodal model, wherein the iterative training minimizes the overall loss value corresponding to the multi-objective loss function or continues until the number of training iterations reaches a preset number.
- 8. The feature Token sequence clipping method according to claim 7, wherein the multi-objective loss function further comprises at least one of a knowledge distillation loss function and a sparsity constraint loss function (a combined-loss sketch follows the claims).
- 9. The feature Token sequence clipping method according to claim 8, wherein, when the multi-objective loss function comprises a knowledge distillation loss function, the corresponding prediction result comprises a first prediction result and the overall loss value comprises a knowledge distillation loss value; the step of predicting the training sample based on the preset model to be trained to obtain the prediction result comprises: performing a first prediction on the training sample based on a content-aware clipping sub-module in the preset model to be trained to obtain the first prediction result; and the step of comparing the prediction result with the label to obtain the comparison result comprises: comparing the first prediction result with a knowledge distillation label among the labels to obtain the knowledge distillation loss value, wherein the knowledge distillation label comprises a teacher output, and the teacher output comprises a first output result obtained when the content-aware clipping sub-module processes the corresponding training sample without clipping, or a second output result obtained when a cloud model processes the corresponding training sample.
- 10. The feature Token sequence clipping method according to claim 9, wherein the step of performing the first prediction on the training sample based on the content-aware clipping sub-module in the preset model to be trained to obtain the first prediction result comprises: processing the training sample based on a clipping decision unit in an importance discrimination network to be trained in the content-aware clipping sub-module to obtain a third output result, wherein the clipping decision unit comprises a plurality of preset clipping decisions; and the step of performing iterative training on the preset model to be trained based on the comparison result to obtain the preset multimodal model comprises: adjusting, based on the knowledge distillation loss value, the preset clipping decision selected by the importance discrimination network to be trained and the corresponding model parameters, until the preset multimodal model containing the target importance discrimination network is obtained through training.
- 11. The feature Token sequence clipping method according to claim 8, wherein, when the multi-objective loss function comprises a sparsity constraint loss function, the corresponding prediction result comprises a second prediction result and the overall loss value comprises a sparsity constraint loss value; the step of predicting the training sample based on the preset model to be trained to obtain the prediction result comprises: performing a second prediction on the training sample based on the content-aware clipping sub-module in the preset model to be trained to obtain the second prediction result; and the step of comparing the prediction result with the label to obtain the comparison result comprises: comparing the second prediction result with a sparsity constraint label among the labels to obtain the sparsity constraint loss value, wherein the sparsity constraint label comprises a sparsity constraint output, the sparsity constraint output is a fourth output result generated when the content-aware clipping sub-module processes the corresponding training sample, and a norm penalty is introduced on the fourth output result when the content-aware clipping sub-module is trained.
- 12. A feature Token sequence clipping apparatus, applied to an edge device, the feature Token sequence clipping apparatus comprising: a determining module, configured to determine data to be input, input the data to be input into a preset multimodal model, and determine, through the preset multimodal model, the content complexity corresponding to the data to be input; a clipping module, configured to dynamically clip, based on the content complexity, the feature Token sequence corresponding to the data to be input to obtain a retained target sequence; and an inference module, configured to send the target sequence to a cloud end, so that the cloud end performs inference on the task corresponding to the data to be input based on the target sequence.
- 13. A feature Token sequence clipping device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program is configured to implement the steps of the feature Token sequence clipping method according to any one of claims 1 to 11.
- 14. A storage medium, wherein the storage medium is a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the feature Token sequence clipping method according to any one of claims 1 to 11.
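Claims 4 to 6 describe ranking Tokens by an importance score conditioned on a global feature vector, then keeping only the top portion dictated by the content complexity. The following is a minimal PyTorch sketch of that mechanism; the two-layer scoring network, the linear mapping from complexity to retention ratio, and all names and dimensions are illustrative assumptions, not the patent's concrete design.

```python
# Illustrative sketch of claims 4-6: importance scoring plus complexity-driven
# top-k Token retention. Module names, dimensions, and the linear mapping from
# complexity to retention ratio are assumptions for illustration only.
import torch
import torch.nn as nn

class ImportanceNetwork(nn.Module):
    """Hypothetical stand-in for the 'target importance discrimination network':
    scores each Token conditioned on the global feature vector."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, tokens: torch.Tensor, global_vec: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D); global_vec: (B, D) -> importance scores (B, N)
        g = global_vec.unsqueeze(1).expand(-1, tokens.size(1), -1)
        return self.scorer(torch.cat([tokens, g], dim=-1)).squeeze(-1)

def clip_tokens(tokens, global_vec, complexity, importance_net,
                min_ratio=0.2, max_ratio=1.0):
    """Keep the top-k Tokens: k is derived from the content complexity
    (claim 5), and the ranking comes from importance scores (claim 6)."""
    B, N, _ = tokens.shape
    # Higher complexity -> larger suggested retention ratio (assumed linear map).
    ratio = min_ratio + (max_ratio - min_ratio) * complexity.clamp(0.0, 1.0)
    keep = max(1, int((ratio.mean() * N).round().item()))  # Tokens to retain
    scores = importance_net(tokens, global_vec)             # (B, N)
    top_idx = scores.topk(keep, dim=1).indices.sort(dim=1).values  # keep order
    target = torch.gather(
        tokens, 1, top_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
    return target, top_idx
```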
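Claims 7 to 11 train the model with a multi-objective loss that combines a main-task loss with optional knowledge distillation and sparsity constraint terms. The sketch below assumes common instantiations of those terms: temperature-scaled KL divergence for distillation against a teacher (the un-clipped sub-module or the cloud model) and an L1 norm penalty on the sub-module's keep-probabilities. The claims name the loss terms but not these specific forms or weights.

```python
# Illustrative multi-objective loss for claims 7-11. The KL-divergence form of
# the distillation term, the L1 penalty as the norm-based sparsity constraint,
# and the weights alpha/beta/gamma/tau are assumptions, not claimed details.
import torch
import torch.nn.functional as F

def multi_objective_loss(task_logits, task_labels,
                         student_logits=None, teacher_logits=None,
                         keep_probs=None, alpha=1.0, beta=0.5, gamma=0.01,
                         tau=2.0):
    # Main-task loss (claim 7): always present.
    loss = alpha * F.cross_entropy(task_logits, task_labels)
    # Knowledge distillation loss (claims 8-9): teacher output may come from
    # the un-clipped sub-module or from the cloud model.
    if student_logits is not None and teacher_logits is not None:
        loss = loss + beta * F.kl_div(
            F.log_softmax(student_logits / tau, dim=-1),
            F.softmax(teacher_logits / tau, dim=-1),
            reduction="batchmean") * (tau * tau)
    # Sparsity constraint loss (claims 8 and 11): norm penalty on the clipping
    # sub-module's keep-probabilities, encouraging aggressive Token removal.
    if keep_probs is not None:
        loss = loss + gamma * keep_probs.abs().mean()
    return loss
```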
Description
Feature Token sequence clipping method, device, equipment and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular to a method, an apparatus, a device, and a storage medium for clipping a feature Token sequence.

Background

In a device-cloud collaborative intelligent system, an edge device (such as a smartphone or a vehicle-mounted terminal) and a cloud server jointly complete complex artificial intelligence tasks. The edge device is responsible for perception processing of front-end data, and the cloud performs deep inference using a large-scale model. However, as application requirements diversify, the content of the data collected on the edge side varies greatly. In intelligent driving, for example, the camera sometimes captures a simple road scene and sometimes a complex scene containing many pedestrians and vehicles, and scenes of different complexity place very different demands on data transmission and model computation. Since an ordinary edge device cannot upload all raw data to the cloud for processing, the data must be compressed and screened locally on the edge device. Token feature clipping technology was therefore proposed: in current device-cloud collaborative intelligent systems, a large model such as a Transformer is generally used to process multimodal tasks, but its computation cost is proportional to the length of the Token sequence corresponding to the input data, so compression and acceleration are achieved by deleting unimportant Token features from the sequence. In the prior art, however, a fixed clipping ratio or rule is used to process the Token feature sequence corresponding to the input data, which makes it difficult to meet the requirements of different application scenarios.

Disclosure of Invention

The main purpose of the present application is to provide a feature Token sequence clipping method, apparatus, device, and storage medium, aiming to solve the technical problem that the prior art processes the Token feature sequence corresponding to input data with a fixed clipping ratio or rule and thus has difficulty meeting the requirements of different application scenarios. To achieve the above object, the present application provides a feature Token sequence clipping method applied to an edge device, the feature Token sequence clipping method comprising: determining data to be input, inputting the data to be input into a preset multimodal model, and determining, through the preset multimodal model, the content complexity corresponding to the data to be input; dynamically clipping, based on the content complexity, the feature Token sequence corresponding to the data to be input to obtain a retained target sequence; and sending the target sequence to a cloud end, so that the cloud end performs inference on the task corresponding to the data to be input based on the target sequence. An illustrative sketch of this edge-side pipeline follows.
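Read together, the three claimed steps form the edge-side pipeline sketched below. The model interface, the `send_to_cloud` callable, and the tensor shapes are hypothetical placeholders; `clip_tokens` refers to the sketch following the claims. Only the three claimed steps themselves come from the patent.

```python
# Illustrative end-to-end flow of the claimed method on the edge device.
# The model interface and `send_to_cloud` are hypothetical placeholders.
import torch

def edge_side_inference(raw_input, model, importance_net, send_to_cloud):
    # Step 1: the preset multimodal model yields per-Token features, a global
    # feature vector, and a content-complexity estimate from its content-aware
    # clipping sub-module (assumed return signature).
    tokens, global_vec, complexity = model(raw_input)
    # Step 2: dynamically clip the feature Token sequence based on complexity
    # and importance (clip_tokens from the sketch after the claims).
    target_sequence, kept_indices = clip_tokens(tokens, global_vec,
                                                complexity, importance_net)
    # Step 3: transmit only the retained target sequence; the cloud performs
    # task inference on this shortened sequence.
    return send_to_cloud(target_sequence, kept_indices)
```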
In an embodiment, the step of determining, through the preset multimodal model, the content complexity corresponding to the data to be input includes: extracting, through the preset multimodal model, the state of a global feature vector corresponding to the data to be input or the richness of its scene information; and inputting the state of the global feature vector or the scene-information richness into a content-aware clipping sub-module in the preset multimodal model, and determining the content complexity corresponding to the data to be input.

In one embodiment, the content-aware clipping sub-module is a trained preset lightweight network sub-model.

In an embodiment, the step of dynamically clipping, based on the content complexity, the feature Token sequence corresponding to the data to be input to obtain the target sequence includes: determining the importance degree of the feature Token sequence corresponding to the data to be input; and dynamically clipping, based on the content complexity and the importance degree, the feature Token sequence corresponding to the data to be input to obtain the target sequence.

In an embodiment, the step of dynamically clipping, based on the content complexity and the importance degree, the feature Token sequence corresponding to the data to be input to obtain the target sequence includes: predicting a suggested Token retention ratio based on the content complexity; calculating the number of Tokens to be retained according to the Token retention ratio; and sorting the feature Token sequence by importance degree from high to low, and dynamically clipping the feature Token sequence corresponding to the data to be input based on the sorting and the number of Tokens to be retained to obtain the target sequence.

In an embodiment, the step of determining the importance degree of the feature Token sequence corresponding to the data to be input includes: inputting the feature Token sequence and the corresponding global feature vector into a target importance discrimination network of the content-aware clipping sub-module, determining importance scores of the feature Token sequence based on the network, and determining the importance degree according to the scores.
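To make the retention arithmetic of these embodiments concrete, here is a hypothetical run of the earlier `clip_tokens` sketch. With 196 Tokens, a complexity of 0.5, and the assumed [0.2, 1.0] ratio mapping, the suggested retention ratio is 0.2 + 0.8 × 0.5 = 0.6, so 0.6 × 196 ≈ 118 Tokens are retained. All shapes and values are illustrative, not taken from the patent.

```python
# Hypothetical usage of the earlier clip_tokens sketch; the shapes and the
# complexity value are illustrative assumptions.
import torch

tokens = torch.randn(1, 196, 256)   # e.g. 196 image-patch Tokens
global_vec = tokens.mean(dim=1)     # stand-in global feature vector
complexity = torch.tensor([0.5])    # mid-complexity scene
net = ImportanceNetwork(dim=256)

target, kept = clip_tokens(tokens, global_vec, complexity, net)
print(target.shape)  # torch.Size([1, 118, 256]): ratio 0.6 of 196 Tokens
```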