
CN-122021656-A - Text processing method and training method based on hierarchical vector approximation

CN122021656A

Abstract

The disclosure provides a text processing method and a training method based on hierarchical vector approximation, belonging to the technical field of neural networks and context adaptation. The method comprises: obtaining an output embedding vector of an i-th layer of a transformer-based neural network; inputting the i-th-layer output embedding vector to a trained low-rank proxy module, which performs a low-rank mapping and outputs a residual approximation, wherein the trained low-rank proxy module is obtained through self-distillation training and is configured to output, from the i-th-layer output embedding vector, an approximation of the residual produced by the i-th layer of the transformer-based neural network; and generating an approximated (i+1)-th-layer output embedding vector from the i-th-layer output embedding vector and the residual approximation.
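A minimal sketch of the forward pass just described, assuming PyTorch; the class name LowRankProxy, the function approximate_next_layer, and the sizes are illustrative stand-ins, not names fixed by the patent.

```python
import torch
import torch.nn as nn

class LowRankProxy(nn.Module):
    """Approximates the residual produced by transformer layer i via a
    down-projection followed by an up-projection (the dual linear
    projection of claim 3)."""
    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)  # dimension-reducing projection
        self.up = nn.Linear(rank, d_model, bias=False)    # dimension-restoring projection

    def forward(self, h_i: torch.Tensor) -> torch.Tensor:
        # Residual approximation: the transformation increment between
        # layer i and layer i+1.
        return self.up(self.down(h_i))

def approximate_next_layer(h_i: torch.Tensor, proxy: LowRankProxy) -> torch.Tensor:
    # A skip connection carries h_i to the proxy output, where the
    # residual approximation is added (claim 5).
    return h_i + proxy(h_i)

# Usage: h_i is the layer-i output embedding for a token sequence.
proxy = LowRankProxy(d_model=768, rank=64)  # illustrative sizes
h_i = torch.randn(1, 16, 768)               # (batch, seq_len, d_model)
h_next = approximate_next_layer(h_i, proxy)
```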

Inventors

  • Jian Haifang
  • Zhang Linghe
  • Li Yuehao
  • Wang Hongchang
  • Huang Gaobin

Assignees

  • Institute of Semiconductors, Chinese Academy of Sciences (中国科学院半导体研究所)

Dates

Publication Date
2026-05-12
Application Date
2025-12-30

Claims (10)

  1. A text processing method based on hierarchical vector approximation, comprising: obtaining an output embedding vector of an i-th layer of a transformer-based neural network, wherein the i-th-layer output embedding vector characterizes semantic context features, at the i-th layer, of the tokens in an input text sequence; inputting the i-th-layer output embedding vector to a trained low-rank proxy module, which performs a low-rank mapping and outputs a residual approximation, wherein the residual approximation represents the transformation increment of the semantic context features of the i-th layer between the i-th layer and the (i+1)-th layer; and generating an approximated (i+1)-th-layer output embedding vector from the i-th-layer output embedding vector and the residual approximation, wherein the approximated (i+1)-th-layer output embedding vector characterizes semantic context features, at the (i+1)-th layer, of the tokens in the input text sequence.
  2. The method of claim 1, wherein inputting the i-th-layer output embedding vector to the trained low-rank proxy module to perform the low-rank mapping and output the residual approximation comprises: determining the low-rank dimension of the low-rank mapping according to an information-retention threshold and a computing-resource budget of the trained low-rank proxy module (see the rank-selection sketch following the claims); and determining the residual approximation according to the low-rank dimension of the low-rank mapping and the i-th-layer output embedding vector.
  3. The method of claim 2, wherein the low-rank mapping is an incremental transformation formed by a dual linear projection that sequentially reduces and then restores the dimension of the i-th-layer output embedding vector.
  4. The method of claim 2, wherein the minimum low-rank dimension of the low-rank mapping is the smallest integer that meets the information-retention ratio threshold without exceeding the computing-resource budget.
  5. The method of claim 1, wherein generating the approximated (i+1)-th-layer output embedding vector from the i-th-layer output embedding vector and the residual approximation comprises: passing the i-th-layer output embedding vector to the output end of the trained low-rank proxy module through a skip connection; and adding the i-th-layer output embedding vector to the residual approximation to generate the approximated (i+1)-th-layer output embedding vector.
  6. A training method for a low-rank proxy module, comprising: constructing a teacher model and a student model to be trained, wherein the teacher model is the i-th layer to be approximated in an original transformer-based neural network, the parameters of the original transformer-based neural network are kept frozen during training, and the teacher model receives the i-th-layer output embedding vector and outputs the true (i+1)-th-layer output embedding vector; the student model to be trained performs a low-rank mapping on the input i-th-layer output embedding vector to determine a residual approximation, and outputs an approximated (i+1)-th-layer output embedding vector based on the residual approximation; and determining a loss function from the true (i+1)-th-layer output embedding vector and the approximated (i+1)-th-layer output embedding vector, and optimizing the parameters of the low-rank mapping according to the loss function to complete the self-distillation training and obtain a trained student model (see the training sketch following the claims).
  7. The method of claim 6, wherein the student model to be trained, which determines the residual approximation by performing the low-rank mapping on the input i-th-layer output embedding vector and outputs the approximated (i+1)-th-layer output embedding vector based on the residual approximation, comprises a low-rank proxy module, a skip connection, and an addition module; the low-rank proxy module receives the input i-th-layer output embedding vector and performs the low-rank mapping on it to output the residual approximation; the skip connection transmits the input i-th-layer output embedding vector to the addition module; and the addition module adds the input i-th-layer output embedding vector to the residual approximation and outputs the approximated (i+1)-th-layer output embedding vector.
  8. The method of claim 6, wherein determining the loss function from the true (i+1)-th-layer output embedding vector and the approximated (i+1)-th-layer output embedding vector, and optimizing the parameters of the low-rank mapping according to the loss function to complete the self-distillation training and obtain the trained student model, comprises: optimizing the parameters of the low-rank mapping by minimizing the mean squared error between the true (i+1)-th-layer output embedding vector and the approximated (i+1)-th-layer output embedding vector, so as to complete the self-distillation training and obtain the trained student model.
  9. The method of claim 6, further comprising: performing error-characteristic analysis on the loss function, decomposing the total error between the true (i+1)-th-layer output embedding vector and the approximated (i+1)-th-layer output embedding vector into an approximation error and an estimation error, wherein the approximation error characterizes the ability of the low-rank hypothesis space to approximate the true residual, the estimation error characterizes the degree to which the self-distillation training process has converged to the optimal low-rank mapping parameters, the low-rank hypothesis space is the function-class constraint of the low-rank proxy module, and the true residual is the difference between the output and the input of the i-th layer in the original transformer-based neural network; and, based on the analysis result, adjusting the approximation error according to the low-rank dimension of the low-rank mapping.
  10. The method of claim 9, wherein adjusting the approximation error according to the low-rank dimension of the low-rank mapping comprises: reducing the approximation error by increasing the low-rank dimension of the low-rank mapping.
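The rank-selection rule of claims 2 and 4 can be sketched as follows. The patent does not define the information-retention measure; this sketch assumes it to be the cumulative spectral energy of sampled true residuals, and the names choose_rank, retention, and budget are illustrative.

```python
import torch

def choose_rank(residuals: torch.Tensor, retention: float, budget: int) -> int:
    """residuals: (num_samples, d_model) matrix of true layer-i residuals,
    i.e. h_{i+1} - h_i. Returns the smallest rank whose retained spectral
    energy meets `retention`, capped by the compute-budget rank `budget`."""
    # Singular values of the residual matrix measure per-direction energy.
    s = torch.linalg.svdvals(residuals)
    energy = torch.cumsum(s ** 2, dim=0) / torch.sum(s ** 2)
    # Smallest index whose cumulative energy reaches the retention threshold.
    r = int(torch.searchsorted(energy, retention).item()) + 1
    return min(r, budget)

# Usage with synthetic residuals:
res = torch.randn(1024, 768)
rank = choose_rank(res, retention=0.95, budget=128)
```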
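A minimal self-distillation training sketch for claims 6 to 8, assuming PyTorch. Here nn.TransformerEncoderLayer is a stand-in for the frozen layer i of the original network, the placeholder random inputs stand for real layer-i embeddings, and LowRankProxy repeats the hypothetical student module sketched after the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankProxy(nn.Module):
    """Student: dual linear down/up projection, as in the earlier sketch."""
    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.down = nn.Linear(d_model, rank, bias=False)
        self.up = nn.Linear(rank, d_model, bias=False)
    def forward(self, x):
        return self.up(self.down(x))

d_model, rank = 768, 64                      # illustrative sizes
teacher_layer = nn.TransformerEncoderLayer(  # stand-in for frozen layer i
    d_model=d_model, nhead=12, batch_first=True)
teacher_layer.eval()
for p in teacher_layer.parameters():
    p.requires_grad_(False)                  # original network stays frozen (claim 6)

student = LowRankProxy(d_model, rank)        # only the low-rank mapping is optimized
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):
    h_i = torch.randn(8, 16, d_model)        # placeholder layer-i output embeddings
    with torch.no_grad():
        h_true = teacher_layer(h_i)          # true (i+1)-th-layer output (teacher)
    h_approx = h_i + student(h_i)            # skip connection + residual approximation
    loss = F.mse_loss(h_approx, h_true)      # MSE objective of claim 8
    opt.zero_grad()
    loss.backward()
    opt.step()
```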

Description

Text processing method and training method based on hierarchical vector approximation

Technical Field

The present disclosure relates to the field of neural networks and context-adaptation techniques, and more particularly to a text processing method and a training method based on hierarchical vector approximation.

Background

With the rapid development of deep learning models, particularly large language models (LLMs), model parameters and computational complexity have grown exponentially, resulting in high inference latency and high energy consumption that severely constrain deployment on edge devices and in real-time applications. To improve inference performance, many model compression and acceleration techniques have been proposed; they fall mainly into two categories: static compression methods and dynamic inference methods.

Static compression methods are typically applied after model training is completed, achieving computational acceleration by reducing the number of model parameters or lowering numerical precision. Weight quantization maps floating-point weights and activations to low-bit integer representations; although this can significantly reduce storage and computation costs, the quantization process inevitably introduces quantization noise, degrading model accuracy. Model pruning compresses the model structure by identifying and removing redundant or unimportant neurons or connections, but its effectiveness depends strongly on the soundness of the pruning criterion, and the performance loss caused by the structural damage is difficult to recover. Although some studies have attempted to combine quantization with pruning to increase the compression ratio, such static methods generally employ fixed computation paths and resource-allocation strategies and cannot adaptively adjust computational overhead to the actual semantic complexity of the input samples, making optimal efficiency difficult to achieve across diverse inference scenarios.

Dynamic inference methods attempt to break through the limitations of static methods by allocating computing resources on demand through conditional execution mechanisms. Typical schemes include layer skipping and early exit. A layer-skipping strategy dynamically decides, from intermediate features, whether to skip layers of a transformer-based neural network, executing full computation only for key layers; an early-exit mechanism embeds multiple intermediate exits in the network and terminates inference early when the confidence of some layer's output reaches a preset threshold. For example, models such as DeeBERT and FastBERT accelerate BERT through self-distillation and adaptive control of inference time. However, because such methods interrupt the standard forward information flow, they tend to suffer from inadequate context modeling, especially on long-sequence or complex semantic tasks, and they often face an accuracy-efficiency trade-off that is difficult to resolve, limiting their applicability to high-performance LLM inference.
In recent years, low-rank approximation has become a new research direction for LLM optimization owing to its potential to significantly reduce computational complexity while retaining key semantic information. The prior art focuses on three types of application. First, low-rank decomposition of weight tensors: parameter-efficient fine-tuning is achieved by superposing low-rank increment matrices on the original weights; although this effectively reduces memory and computation in the training stage, inference still requires full forward propagation, so resource consumption remains large and inference efficiency is not substantially improved. Second, low-dimensional projection of intermediate tensors (such as attention matrices) in the activation path can accelerate specific submodules (such as attention computation), but all transformer-based blocks are still executed layer by layer, so cross-layer global optimization cannot be achieved. Third, existing low-rank approximation strategies are applied to local modules, lack the ability to model inter-layer residual transformations as a whole, and struggle to preserve the consistency of the original model's deep semantic evolution; as inference depth increases, rapid deviations from the correct semantic trajectory may occur.
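The first prior-art category above, superposing a low-rank increment on frozen weights, can be made concrete with a short sketch in the style of LoRA-type adapters; every name and size here is illustrative, not from the patent.

```python
import torch
import torch.nn as nn

class LowRankAdaptedLinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank increment B @ A."""
    def __init__(self, base: nn.Linear, rank: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # W stays frozen
        d_out, d_in = base.weight.shape
        self.A = nn.Linear(d_in, rank, bias=False)   # down-projection factor
        self.B = nn.Linear(rank, d_out, bias=False)  # up-projection factor
        nn.init.zeros_(self.B.weight)                # increment starts at zero

    def forward(self, x):
        # The full base layer still runs at inference, illustrating the
        # limitation noted above: training cost drops, inference cost does not.
        return self.base(x) + self.B(self.A(x))

layer = LowRankAdaptedLinear(nn.Linear(768, 768), rank=8)
y = layer(torch.randn(4, 768))
```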