CN-121981195-A - Large language model training method and device based on reinforcement learning, and electronic device
Abstract
The invention provides a large language model training method and apparatus based on reinforcement learning, and an electronic device, which relate to the technical field of artificial intelligence and are applied to creative design and inspiration search scenes. The method comprises: obtaining training data of a large language model, wherein the training data comprise input data, a chain of thought, and output data of the large language model, and the chain of thought comprises all thinking steps from the input data to the output data; combining a reinforcement learning algorithm and a process reward model to evaluate the divergence degree of the divergent association words generated by each thinking step in the chain of thought, and giving rewards to the corresponding thinking steps according to the divergence degree of the divergent association words, wherein the reinforcement learning algorithm comprises a proximal policy optimization (PPO) algorithm; and superposing the rewards of all the thinking steps in the chain of thought to obtain a cumulative reward, and training the large language model with maximization of the cumulative reward as the training target to obtain a trained large language model.
Inventors
- FAN LING
- DING XINDONG
- RAO RUIJIE
Assignees
- Tezign (Shanghai) Information Technology Co., Ltd. (特赞(上海)信息科技有限公司)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-23
Claims (10)
- 1. A large language model training method based on reinforcement learning, characterized in that it is applied to creative design and inspiration search scenes and comprises the following steps: acquiring training data of a large language model, wherein the training data comprises input data, a chain of thought, and output data of the large language model, and the chain of thought comprises all thinking steps from the input data to the output data; evaluating the divergence degree of the divergent association words generated by each thinking step in the chain of thought by combining a reinforcement learning algorithm and a process reward model, and rewarding the corresponding thinking steps according to the divergence degree of the divergent association words, wherein the reinforcement learning algorithm comprises a proximal policy optimization (PPO) algorithm; and superposing the rewards of all the thinking steps in the chain of thought to obtain a cumulative reward, and training the large language model with maximization of the cumulative reward as the training target to obtain a trained large language model.
- 2. The method of claim 1, wherein the input data is an original design requirement and the output data is the last divergent association word of multiple divergent association words obtained by multiple rounds of semantic jumping from the input data; wherein evaluating the divergence degree of the divergent association words generated by each thinking step in the chain of thought by combining the reinforcement learning algorithm and the process reward model, and rewarding the corresponding thinking steps according to the divergence degree of the divergent association words, comprises: dividing all vocabularies into parent node vocabularies and corresponding child node vocabularies according to the sequential logical relation of the original design requirement and the multiple rounds of divergent association words, wherein all the vocabularies comprise the original design requirement and the multiple rounds of divergent association words; vectorizing the parent node vocabularies and the child node vocabularies using an embedding model; evaluating, by combining the proximal policy optimization algorithm and the process reward model based on semantic distance, the jump degree between a parent node vocabulary and a child node vocabulary and the diversity degree among all child node vocabularies under the same parent node vocabulary; giving a longitudinal reward to the corresponding thinking step according to the jump degree between the parent node vocabulary and the child node vocabulary; and giving a lateral reward to the corresponding thinking step according to the diversity degree among all child node vocabularies under the same parent node vocabulary.
- 3. The method of claim 2, wherein giving a longitudinal reward to the corresponding thinking step according to the jump degree between the parent node vocabulary and the corresponding child node vocabulary comprises: determining the cosine distance between the vectorized parent node vocabulary and child node vocabulary; and giving a longitudinal reward to the corresponding thinking step based on the cosine distance, wherein the greater the cosine distance, the greater the longitudinal reward.
- 4. The method of claim 2, wherein giving a lateral reward to the corresponding thinking step according to the diversity degree among all child node vocabularies under the same parent node vocabulary comprises: determining the average distance among all vectorized child node vocabularies under the same parent node vocabulary; and giving a lateral reward to the corresponding thinking step based on the average distance, wherein the larger the average distance, the larger the lateral reward.
- 5. The method of claim 1, wherein after obtaining the trained large language model, the method further comprises: receiving a design requirement input by a user using the trained large language model, and outputting a creative result; and evaluating the divergence of the creative result relative to the design requirement based on the semantic distance between the design requirement and the creative result.
- 6. The method of claim 5, wherein receiving the design requirement input by the user using the trained large language model and outputting the creative result comprises: receiving the design requirement input by the user using the trained large language model; performing multiple rounds of semantic jumping based on the design requirement to generate multiple rounds of divergent association words; and outputting the last of the multiple rounds of divergent association words as the creative result.
- 7. The method of claim 5, wherein evaluating the divergence of the creative result relative to the design requirement based on the semantic distance between the design requirement and the creative result comprises: mapping the design requirement into a vector $v_0$ using an embedding model; mapping the $i$-th creative word in the creative result into a vector $v_i$; and, based on the cosine distance $d(v_0, v_i)$ between $v_0$ and $v_i$, determining the semantic divergence SDS of the creative result relative to the design requirement according to the following formula: $\mathrm{SDS} = \frac{1}{N}\sum_{i=1}^{N} d(v_0, v_i)$, where $N$ is the total number of creative words contained in the creative result (a sketch of this computation follows the claims).
- 8. A large language model training apparatus based on reinforcement learning, characterized in that it is applied to creative design and inspiration search scenes and comprises: an acquisition unit configured to acquire training data of a large language model, wherein the training data includes input data, a chain of thought, and output data of the large language model, the chain of thought including all thinking steps from the input data to the output data; a rewarding unit configured to evaluate, by combining a reinforcement learning algorithm and a process reward model, the divergence degree of the divergent association words generated by each thinking step in the chain of thought, and to give rewards to the corresponding thinking steps according to the divergence degree of the divergent association words, wherein the reinforcement learning algorithm comprises a proximal policy optimization (PPO) algorithm; and a training unit configured to superpose the rewards of all the thinking steps in the chain of thought to obtain a cumulative reward, and to train the large language model with maximization of the cumulative reward as the training target to obtain a trained large language model.
- 9. A computer-readable storage medium storing computer instructions for causing a computer to perform the reinforcement learning-based large language model training method of any one of claims 1 to 7.
- 10. An electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores a computer program executable by the at least one processor to cause the at least one processor to perform the reinforcement learning-based large language model training method of any one of claims 1 to 7.
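The following is a minimal Python sketch of the semantic divergence score (SDS) of claim 7, under stated assumptions: the embedding step is left abstract (any embedding model producing NumPy vectors will do), cosine distance is taken as 1 minus cosine similarity (a detail the claim does not spell out), and the function names are illustrative rather than the patent's.

```python
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def semantic_divergence_score(v0: np.ndarray, creative_vecs: list) -> float:
    """SDS = (1/N) * sum_{i=1..N} d(v0, v_i): the mean cosine distance between
    the design-requirement vector v0 and each creative-word vector v_i."""
    n = len(creative_vecs)
    return sum(cosine_distance(v0, v) for v in creative_vecs) / n
```

With `v0` as the embedded design requirement and `creative_vecs` as the embedded creative words, a larger SDS indicates a creative result semantically further from the original requirement, i.e., greater divergence.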
Description
Large language model training method and device based on reinforcement learning, and electronic device

Technical Field

The disclosure relates to the technical field of artificial intelligence, and in particular to a large language model training method and device based on reinforcement learning, and an electronic device.

Background

Currently, large language models (e.g., DeepSeek-R1, OpenAI o1) perform excellently in code generation and mathematical problem solving. These excellent performances mainly benefit from reinforcement learning on datasets with well-defined correct answers. Reinforcement learning training of large language models is mainly directed at mathematics, programming, and natural science scenarios, which are characterized by convergence: questions usually have only a unique correct answer, and the training goal is to converge the large language model to a unique solution. However, if this training method is applied to creative design and inspiration search scenes without unique standard answers, the divergence capability of the large language model is suppressed, so that the more the large language model is trained, the more rigid it becomes. For this problem in the related art, that applying such a training mode to creative design and inspiration search scenes without unique standard answers suppresses the divergence capability of the large language model, no effective technical solution has been proposed at present.

Disclosure of Invention

The main objective of the present disclosure is to provide a large language model training method and apparatus based on reinforcement learning, and an electronic device, so as to solve the problem that the training method for large language models in the related art, when applied to creative design and inspiration search scenes without unique standard answers, suppresses the divergence capability of the large language model. To achieve the above objective, a first aspect of the present disclosure provides a reinforcement learning-based large language model training method applied to creative design and inspiration search scenes, including: acquiring training data of a large language model, wherein the training data comprises input data, a chain of thought, and output data of the large language model, and the chain of thought comprises all thinking steps from the input data to the output data; evaluating the divergence degree of the divergent association words generated by each thinking step in the chain of thought by combining a reinforcement learning algorithm and a process reward model, and rewarding the corresponding thinking steps according to the divergence degree of the divergent association words, wherein the reinforcement learning algorithm comprises a proximal policy optimization (PPO) algorithm; and superposing the rewards of all the thinking steps in the chain of thought to obtain a cumulative reward, and training the large language model with maximization of the cumulative reward as the training target to obtain a trained large language model.
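As a hedged illustration of the training target above, the sketch below superposes per-step rewards over a chain of thought to form the cumulative reward that the policy optimization would maximize. The `step_reward` callable is a hypothetical stand-in for the process reward model; the PPO machinery itself is not specified at this level of the disclosure and is omitted.

```python
from typing import Callable, List


def cumulative_reward(thinking_steps: List[str],
                      step_reward: Callable[[str], float]) -> float:
    """Superpose (sum) the per-step rewards over all thinking steps in the
    chain of thought; training then maximizes this cumulative reward."""
    return sum(step_reward(step) for step in thinking_steps)
```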
Optionally, the input data is an original design requirement, and the output data is the last divergent association word of multiple divergent association words obtained by multiple rounds of semantic jumping from the input data; wherein evaluating the divergence degree of the divergent association words generated by each thinking step in the chain of thought by combining the reinforcement learning algorithm and the process reward model, and rewarding the corresponding thinking steps according to the divergence degree of the divergent association words, comprises: dividing all vocabularies into parent node vocabularies and corresponding child node vocabularies according to the sequential logical relation of the original design requirement and the multiple rounds of divergent association words, wherein all the vocabularies comprise the original design requirement and the multiple rounds of divergent association words; vectorizing the parent node vocabularies and the child node vocabularies using an embedding model; evaluating, by combining the proximal policy optimization algorithm and the process reward model based on semantic distance, the jump degree between a parent node vocabulary and a child node vocabulary and the diversity degree among all child node vocabularies under the same parent node vocabulary; giving a longitudinal reward to the corresponding thinking step according to the jump degree between the parent node vocabulary and the child node vocabulary; and giving a lateral reward to the corresponding thinking step according to the diversity degree among all child node vocabularies under the same parent node vocabulary. Further, giving a longitudinal reward to the corresponding thinking step according to the jump degree between the parent node vocabulary and the corresponding child node vocabulary includes: determining the cosine distance between the vectorized parent node vocabulary and child node vocabulary; and giving a longitudinal reward to the corresponding thinking step based on the cosine distance, wherein the greater the cosine distance, the greater the longitudinal reward.
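A minimal sketch of the two reward signals described above, assuming vectors already produced by an embedding model: the longitudinal reward grows with the cosine distance between a parent vocabulary vector and a child vocabulary vector (a larger semantic jump), and the lateral reward grows with the diversity among all children of the same parent, read here as the mean pairwise cosine distance among the child vectors, which is one plausible reading of the "average distance". The linear scaling is an assumption; the disclosure only requires that each reward increase monotonically with its distance.

```python
from itertools import combinations

import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Semantic distance between two embedded vocabulary vectors.
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def longitudinal_reward(parent_vec: np.ndarray, child_vec: np.ndarray) -> float:
    # Larger parent-to-child semantic jump => larger longitudinal reward.
    return cosine_distance(parent_vec, child_vec)


def lateral_reward(child_vecs: list) -> float:
    # Mean pairwise cosine distance among all children of one parent;
    # more mutually diverse children => larger lateral reward.
    pairs = list(combinations(child_vecs, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)
```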