
CN-122021879-A - Large language model value alignment method, device, equipment and storage medium

CN 122021879 A

Abstract

A large language model value alignment method, together with a corresponding device, equipment and storage medium, comprises the following steps: constructing a target preference distribution from a standard preference data set; obtaining the output distribution of a target fine-tuning alignment algorithm during fine-tuning of the large language model as the distribution to be optimized; extracting the token embedding table of the large language model and computing semantic distances between tokens from the norms of the embedding vectors in the table; constructing an optimal transport cost matrix from these distances; converting the preference alignment problem between the distribution to be optimized and the target preference distribution into an optimal transport problem and computing the minimum transport distance between them to obtain a PLOT loss term; constructing a joint loss function from the PLOT loss term and the base loss term of the fine-tuning method; and fine-tuning the model end to end with the joint loss function. Through this scheme, token-level preference learning is converted into an optimal transport problem, moving from local optimization to global distribution alignment and improving the integrity of preference learning.

Inventors

  • TAN MINGHUAN
  • YANG MIN
  • ZHU LIANG

Assignees

  • Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院)

Dates

Publication Date
2026-05-12
Application Date
2025-12-31

Claims (10)

  1. A large language model value alignment method, characterized in that the method is integrated into a target fine-tuning alignment algorithm and is used to enhance the value alignment performance of a target large language model, and comprises the following steps: constructing a target preference distribution reflecting human preferences from a pre-collected standard preference data set, wherein the standard preference data set comprises at least user query data, preferred answer data and non-preferred answer data; obtaining the output distribution of the target fine-tuning alignment algorithm during fine-tuning of the target large language model as the distribution to be optimized; extracting the token embedding table of the target large language model, computing semantic distances between tokens from the norms of the embedding vectors in the token embedding table, and constructing an optimal transport cost matrix fused with semantic information to replace the original cost matrix of the target large language model; converting the preference alignment problem between the distribution to be optimized and the target preference distribution into an optimal transport problem and computing the minimum transport distance between the two to obtain a PLOT loss term; and constructing a joint loss function from the PLOT loss term and the base loss term of the target fine-tuning method, and performing end-to-end fine-tuning optimization of the model with the joint loss function.
  2. The large language model value alignment method of claim 1, wherein, in constructing the target preference distribution reflecting human preferences from the pre-collected standard preference data set, the target preference distribution is obtained either from the output distribution of a reward model or by converting the token-frequency difference distribution of positive and negative samples, and the target preference distribution is ensured to satisfy the set mathematical distribution requirements.
  3. The large language model value alignment method of claim 2, wherein, when the target preference distribution is converted from the output distribution of the reward model, the following steps are taken: scoring the preferred answers with a pre-trained, mature reward model, taking the output distribution of the reward model as the initial preference distribution, and obtaining the target preference distribution by normalizing the initial preference distribution.
  4. The large language model value alignment method of claim 2, wherein, when the target preference distribution is obtained by converting the token-frequency difference distribution of positive and negative samples, the following steps are taken: counting the token frequency distributions of the preferred answer data and the non-preferred answer data in the standard preference data set and computing their difference distribution; and applying a non-negative transformation and normalization to the difference distribution to obtain the target preference distribution.
  5. The large language model value alignment method of claim 1, wherein extracting the token embedding table of the target large language model, computing semantic distances between tokens from the L2 norms of the embedding vectors in the token embedding table, and constructing an optimal transport cost matrix fused with semantic information to replace the original cost matrix of the target large language model comprises the following steps: obtaining the token embedding table E = {e_1, ..., e_n} of the target large language model, wherein e_i is the embedding vector of the i-th token and d is the dimension of the embedding vectors; computing the L2 norm (Euclidean norm) of each embedding vector, converting the high-dimensional embedding into a one-dimensional semantic feature s_i = ||e_i||_2 = (sum over k of e_{i,k}^2)^(1/2), wherein e_{i,k} is the k-th dimension value of e_i; and constructing an n x n cost matrix C with the absolute difference between two tokens' embedding norms as the transport cost, wherein the elements are C_{ij} = |s_i - s_j| (a reconstruction of the formulas follows the claims).
  6. The large language model value alignment method of claim 1, wherein converting the preference alignment problem between the distribution to be optimized and the target preference distribution into an optimal transport problem and computing the minimum transport distance between the two to obtain the PLOT loss term uses the following formula: L_PLOT = W(Q, P) = sum over i of |F_Q(i) - F_P(i)| * Delta_i, wherein F_Q and F_P are the cumulative distribution functions (CDFs) of Q and P, respectively, Delta_i is the distance interval between adjacent tokens, and the W distance is computed as the integral of the CDF difference (a worked sketch follows the claims).
  7. The large language model value alignment method of claim 6, wherein constructing the joint loss function from the PLOT loss term and the base loss term of the target fine-tuning method and performing end-to-end fine-tuning optimization of the model with the joint loss function adopts the following formula: L_total = L_base + alpha * L_PLOT, where alpha is a hyperparameter used to balance the weights of the base fine-tuning and the preference enhancement.
  8. A large language model value alignment apparatus, comprising: a preference distribution construction module configured to construct a target preference distribution reflecting human preferences from a pre-collected standard preference data set, wherein the standard preference data set comprises at least user query data, preferred answer data and non-preferred answer data; a model output distribution extraction module configured to obtain the output distribution of a target fine-tuning alignment algorithm during fine-tuning of the target large language model as the distribution to be optimized; a semantic-aware cost matrix construction module configured to extract the token embedding table of the target large language model, compute semantic distances between tokens from the L2 norms of the embedding vectors in the token embedding table, and construct an optimal transport cost matrix fused with semantic information to replace the original cost matrix of the target large language model; an optimal transport loss calculation module configured to convert the preference alignment problem between the distribution to be optimized and the target preference distribution into an optimal transport problem and compute the minimum transport distance between the two to obtain a PLOT loss term; and a joint optimization module configured to construct a joint loss function from the PLOT loss term and the base loss term of the target fine-tuning method, and to perform end-to-end fine-tuning optimization of the model with the joint loss function.
  9. A large language model value alignment device, characterized in that it comprises a processor, a memory, and a large language model value alignment program stored in the memory and executable by the processor, wherein the large language model value alignment program, when executed by the processor, implements the steps of the large language model value alignment method according to any one of claims 1 to 7.
  10. A storage medium, characterized in that a large language model value alignment program based on an optimal transport algorithm is stored on the storage medium, and the program, when executed by a processor, implements the steps of the large language model value alignment method according to any one of claims 1 to 7.
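The formulas in claims 5 to 7 appear as images in the original publication and are only paraphrased in the text above. The LaTeX block below is a hedged reconstruction from the surrounding claim wording, together with the one-line identity that makes the CDF form in claim 6 valid; the symbols s_i, F_Q, F_P, Delta_i and alpha are inferred from context rather than quoted from the patent:

```latex
% Reconstruction inferred from claims 5-7 (not quoted from the patent).
s_i = \lVert e_i \rVert_2 = \Bigl(\textstyle\sum_{k=1}^{d} e_{i,k}^{2}\Bigr)^{1/2}
  \qquad
C_{ij} = \lvert s_i - s_j \rvert
  \quad \text{(claim 5: 1-D semantic feature and $n \times n$ cost matrix)}

% Because the cost is an absolute difference of scalar features, the transport problem is
% one-dimensional, and the Wasserstein-1 distance has the closed CDF form used in claim 6
% (with tokens sorted by $s_i$ and $\Delta_i$ the gap between adjacent sorted features):
\mathcal{L}_{\mathrm{PLOT}} = W_1(Q, P)
  = \min_{\gamma \in \Pi(Q, P)} \sum_{i,j} \gamma_{ij}\, C_{ij}
  = \int \bigl\lvert F_Q(t) - F_P(t) \bigr\rvert \, dt
  = \sum_{i} \bigl\lvert F_Q(i) - F_P(i) \bigr\rvert \, \Delta_i

\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{base}} + \alpha\, \mathcal{L}_{\mathrm{PLOT}}
  \quad \text{(claim 7: joint loss with balancing weight $\alpha$)}
```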
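A minimal PyTorch sketch of the same computation is given below, assuming a single vocabulary-level output distribution per step; the function names, tensor shapes and the softmax over logits are illustrative assumptions and are not published in the patent:

```python
import torch


def plot_loss(q_logits: torch.Tensor,
              p_target: torch.Tensor,
              token_embeddings: torch.Tensor) -> torch.Tensor:
    """PLOT-style loss sketch: 1-D Wasserstein distance between the model's output
    distribution and a target preference distribution over the vocabulary.

    q_logits:         (vocab,) model logits over the vocabulary (distribution to optimize)
    p_target:         (vocab,) target preference distribution (non-negative, sums to 1)
    token_embeddings: (vocab, d) token embedding table of the model
    """
    # One-dimensional semantic feature per token: L2 norm of its embedding vector (claim 5).
    s = token_embeddings.norm(dim=-1)                     # (vocab,)

    # Sorting tokens along this scalar axis turns the transport problem into a 1-D one.
    order = torch.argsort(s)
    s_sorted = s[order]
    q = torch.softmax(q_logits, dim=-1)[order]
    p = p_target[order]

    # Closed-form W1 on a line: integrate |F_Q - F_P| over the adjacent-token gaps (claim 6).
    cdf_q = torch.cumsum(q, dim=-1)
    cdf_p = torch.cumsum(p, dim=-1)
    gaps = s_sorted[1:] - s_sorted[:-1]                   # distance interval of adjacent tokens
    return ((cdf_q - cdf_p).abs()[:-1] * gaps).sum()


def joint_loss(base_loss: torch.Tensor,
               q_logits: torch.Tensor,
               p_target: torch.Tensor,
               token_embeddings: torch.Tensor,
               alpha: float = 0.1) -> torch.Tensor:
    """Joint objective of claim 7: base fine-tuning loss plus the alpha-weighted PLOT term."""
    return base_loss + alpha * plot_loss(q_logits, p_target, token_embeddings)
```

Because the cost matrix is built from a scalar feature per token, no Sinkhorn iteration or general linear-programming OT solver is needed: sorting tokens by embedding norm reduces the transport distance to the cumulative-distribution difference above, so the extra cost per training step is roughly one sort plus two cumulative sums over the vocabulary.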

Description

Large language model value alignment method, device, equipment and storage medium

Technical Field

The application relates to the technical field of large language models, and in particular to a large language model value alignment method, device, equipment and storage medium.

Background

The rapid development of large language models has enabled them to exhibit strong capabilities in many areas, but unaligned models may generate harmful, unhelpful or logically confused content, severely limiting their practical application. Value alignment technology lets the model learn human judgment standards through preference learning and has become one of the core directions of large language model research. At present, alignment techniques for large language models fall mainly into two types: inference-stage alignment methods and fine-tuning-stage alignment methods. Inference-stage alignment regulates model behavior through constraints imposed during decoding (such as controllable decoding or output sampling and filtering) without modifying model parameters, but it suffers from increased inference overhead, susceptibility to adversarial attacks and insufficient long-term alignment reliability. Fine-tuning-stage alignment internalizes preferences into the model's capabilities by adjusting model parameters; its performance is more stable and it has become the mainstream technical path. Its core is to optimize the model's output distribution on preference data through reasonably designed loss functions and training strategies. The central requirement of preference learning is to let the model accurately capture the difference in human preferences over outputs (such as the distinction between a "good answer" and a "bad answer") and to convert this difference into a signal the model can learn from. The main preference learning methods in the current fine-tuning stage fall into two categories: reinforcement-learning-based methods and pure fine-tuning methods. As a representative method of the alignment stage, reinforcement learning from human feedback (RLHF) constructs positive and negative sample data sets reflecting preferences through human labeling, first trains a reward model on these data sets to quantify output quality, and then performs reinforcement learning with the proximal policy optimization (PPO) algorithm. Its core mechanism is to limit the magnitude of parameter updates by introducing a KL divergence with respect to the original model, preventing the model's capabilities from drifting too far from the original. However, the method has obvious drawbacks: it requires a multi-model architecture comprising a reward model, a reference model and a critic model, the computational cost is extremely high, the training process depends on the design of the reward model and is prone to instability, and the demand for human-labeled data is large, with high labeling cost. Further, to address the complexity of RLHF, the prior art proposes direct preference optimization (DPO), which eliminates the dependence on a value function through a mathematical derivation of the PPO loss term, converting preference learning into a supervised contrastive-learning form (the standard objectives are written out below).
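For concreteness, the two background objectives referenced above can be written in their standard textbook forms; these are the usual formulations from the alignment literature, not formulas reproduced from the patent:

```latex
% KL-constrained RLHF objective optimized with PPO (standard form, not from the patent):
\max_{\pi_\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\bigl[ r_\phi(x, y) \bigr]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\bigl[ \pi_\theta(y \mid x) \,\Vert\, \pi_{\mathrm{ref}}(y \mid x) \bigr]

% DPO loss derived from the same objective, with preferred answer y_w and non-preferred answer y_l:
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \Bigl[ \log \sigma\!\Bigl(
        \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \Bigr) \Bigr]
```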
With this approach, only the language model whose parameters are updated and a fixed reference model are needed; neither a reward model nor a critic model has to be trained. The computational cost is thus greatly reduced while performance similar to RLHF is obtained, which has made the method the mainstream alternative. However, DPO still focuses on contrastive optimization of local token probabilities and does not fully consider the global semantic associations among tokens, so preference learning lacks integrity and the performance gains on complex preference tasks (such as logical reasoning and multidimensional value judgment) are limited.

Disclosure of Invention

The application provides a large language model value alignment method, device, equipment and storage medium, which can solve the problem in the prior art that global semantic associations among tokens are not fully considered during the preference learning of a large language model, so that preference learning lacks integrity and performance gains on complex preference tasks are limited. In a first aspect, an embodiment of the present application provides a large language model value alignment method, which comprises the following steps: a large language model value alignment method, integrated into a target fine-tuning alignment algorithm and used for enhancing the value alignment performance of the target large language model, comprising the foll