
CN-122021781-A - Reinforcement learning method, apparatus, device, storage medium, and computer program product

CN122021781A

Abstract

The application relates to the technical field of artificial intelligence and discloses a reinforcement learning method, apparatus, device, storage medium, and computer program product. The method comprises: obtaining a three-role model and constructing a seed prompt set, wherein the three-role model comprises an attacker model, a defender model, and an evaluator model, and is used to improve the safety performance of a large language model through iterative collaborative training of the attacker model, the defender model, and the evaluator model; and sequentially performing iterative training on the attacker model, the defender model, and the evaluator model based on the seed prompt set, wherein the attacker model is trained with a diversity reward, the defender model is trained with a three-level evaluation reward from the evaluator model, and the evaluator model constructs its training data via a multi-expert voting strategy. In this way, safety alignment of the large language model can be completed without large amounts of manual annotation, collaborative co-evolution of the three roles is realized, and attack diversity is improved.

Inventors

  • SUN LIN
  • SI JIANFENG
  • REN HAIFENG
  • ZHANG XIANGZHENG

Assignees

  • Beijing Qihoo Technology Co., Ltd. (北京奇虎科技有限公司)

Dates

Publication Date
2026-05-12
Application Date
2026-01-21

Claims (10)

  1. A reinforcement learning method, characterized in that the reinforcement learning method comprises: acquiring a three-role model and constructing a seed prompt set, wherein the three-role model comprises an attacker model, a defender model, and an evaluator model, and the three-role model is used to improve the safety performance of a large language model through iterative collaborative training of the attacker model, the defender model, and the evaluator model; and sequentially performing iterative training on the attacker model, the defender model, and the evaluator model based on the seed prompt set, wherein the attacker model is trained with a diversity reward, the defender model is trained with a three-level evaluation reward from the evaluator model, and the evaluator model constructs training data using a multi-expert voting strategy.
  2. The reinforcement learning method of claim 1, wherein sequentially performing iterative training on the attacker model, the defender model, and the evaluator model based on the seed prompt set comprises: inputting the seed prompt set into the attacker model to obtain adversarial prompts generated by the attacker model; and calculating a diversity reward based on the adversarial prompts and training the attacker model based on the diversity reward.
  3. The reinforcement learning method of claim 2, wherein the diversity reward comprises a semantic reward, a diversity penalty, and a multi-model attack reward, and wherein calculating the diversity reward based on the adversarial prompts and training the attacker model based on the diversity reward comprises: calculating the semantic reward, the diversity penalty, and the multi-model attack reward based on the adversarial prompts; and calculating a total reward from the semantic reward, the diversity penalty, and the multi-model attack reward, and training the attacker model based on the total reward.
  4. The reinforcement learning method of claim 3, wherein calculating the diversity penalty based on the adversarial prompts comprises: acquiring each prompt in an attack success pool, wherein the attack success pool stores prompts whose attacks succeeded; calculating Self-BLEU values and an average cosine similarity between the adversarial prompts and the pooled prompts; and calculating the diversity penalty through a nonlinear penalty function based on the Self-BLEU values and the average cosine similarity.
  5. The reinforcement learning method of claim 3, wherein calculating the multi-model attack reward based on the adversarial prompts comprises: counting attack success rates of the adversarial prompts on heterogeneous defense models; and calculating the multi-model attack reward from the attack success rates and the model weights corresponding to the heterogeneous defense models.
  6. The reinforcement learning method of claim 3, wherein calculating the semantic reward based on the adversarial prompts comprises: invoking a large language model to evaluate the semantic relevance between an adversarial prompt and its original base prompt; and determining the semantic reward corresponding to the adversarial prompt based on the semantic relevance.
  7. A reinforcement learning device, characterized in that the reinforcement learning device comprises: an acquisition module for acquiring a three-role model and constructing a seed prompt set, wherein the three-role model comprises an attacker model, a defender model, and an evaluator model, and the three-role model is used to improve the safety performance of a large language model through iterative collaborative training of the attacker model, the defender model, and the evaluator model; and a training module for sequentially performing iterative training on the attacker model, the defender model, and the evaluator model based on the seed prompt set, wherein the attacker model is trained with a diversity reward, the defender model is trained with a three-level evaluation reward from the evaluator model, and the evaluator model constructs training data using a multi-expert voting strategy.
  8. A reinforcement learning apparatus comprising a memory, a processor, and a reinforcement learning program stored on the memory and executable on the processor, the reinforcement learning program, when executed by the processor, implementing the reinforcement learning method of any one of claims 1 to 6.
  9. A storage medium having a reinforcement learning program stored thereon, the reinforcement learning program, when executed by a processor, implementing the reinforcement learning method of any one of claims 1 to 6.
  10. A computer program product comprising a reinforcement learning program which, when executed by a processor, implements the reinforcement learning method of any one of claims 1 to 6.
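The multi-model attack reward in claim 5 (a weighted combination of per-model attack success rates over heterogeneous defense models) can be sketched as follows. The data layout, the function name, and the idea of passing weights as a dict are illustrative assumptions; the patent specifies neither concrete weights nor a success predicate.

```python
def multi_model_attack_reward(success_flags: dict[str, list[bool]],
                              model_weights: dict[str, float]) -> float:
    """Hypothetical weighted attack-success-rate reward (claim 5 sketch).

    success_flags[m][i] is True if adversarial prompt i bypassed
    heterogeneous defense model m; model_weights[m] is that model's weight.
    """
    reward = 0.0
    for model, flags in success_flags.items():
        asr = sum(flags) / len(flags)          # per-model attack success rate
        reward += model_weights[model] * asr   # weight heterogeneous models
    return reward
```

Weighting lets harder-to-fool defense models contribute more to the reward, so the attacker is pushed toward prompts that transfer across architectures rather than overfitting one defender.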

Description

Reinforcement learning method, apparatus, device, storage medium, and computer program product

Technical Field

The present application relates to the field of artificial intelligence technology, and in particular to a reinforcement learning method, apparatus, device, storage medium, and computer program product.

Background

Currently, with the rapid development of artificial intelligence technology, large language models (LLMs) play an increasingly important role. However, while improving social productivity, large language models also expose serious security risks. Safety alignment of large language models is therefore becoming increasingly important: it is the process of keeping the behavior patterns, output content, and decision logic of large language models highly consistent with the intentions, values, and operating instructions of their designers (i.e., human operators) through systematic technical means. However, existing safety alignment approaches for large language models suffer from dependence on manual annotation, isolated optimization of each role, and insufficient attack diversity.

Disclosure of Invention

The main purpose of the present application is to provide a reinforcement learning method, apparatus, device, storage medium, and computer program product, aiming to solve the technical problems that existing safety alignment approaches for large language models depend on manual annotation, optimize each role in isolation, and lack attack diversity.
To achieve the above object, the present application provides a reinforcement learning method comprising: acquiring a three-role model and constructing a seed prompt set, wherein the three-role model comprises an attacker model, a defender model, and an evaluator model, and the three-role model is used to improve the safety performance of a large language model through iterative collaborative training of the attacker model, the defender model, and the evaluator model; and sequentially performing iterative training on the attacker model, the defender model, and the evaluator model based on the seed prompt set, wherein the attacker model is trained with a diversity reward, the defender model is trained with a three-level evaluation reward from the evaluator model, and the evaluator model constructs training data using a multi-expert voting strategy.

Optionally, sequentially performing iterative training on the attacker model, the defender model, and the evaluator model based on the seed prompt set comprises: inputting the seed prompt set into the attacker model to obtain adversarial prompts generated by the attacker model; and calculating a diversity reward based on the adversarial prompts and training the attacker model based on the diversity reward.

Optionally, the diversity reward comprises a semantic reward, a diversity penalty, and a multi-model attack reward, and calculating the diversity reward based on the adversarial prompts and training the attacker model based on the diversity reward comprises: calculating the semantic reward, the diversity penalty, and the multi-model attack reward based on the adversarial prompts; and calculating a total reward from the semantic reward, the diversity penalty, and the multi-model attack reward, and training the attacker model based on the total reward.
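A minimal sketch of the attacker's total reward as described above: a weighted combination of the semantic reward, the diversity penalty (subtracted), and the multi-model attack reward. The weights and the linear combination are assumptions for illustration; the patent does not disclose concrete values or the exact aggregation formula.

```python
def total_reward(semantic_reward: float,
                 diversity_penalty: float,
                 multi_model_attack_reward: float,
                 w_sem: float = 0.3,
                 w_div: float = 0.3,
                 w_atk: float = 0.4) -> float:
    """Hypothetical total reward for the attacker model.

    The diversity penalty is subtracted: adversarial prompts that repeat
    past successful attacks lower the total reward, which is what pushes
    the attacker toward attack diversity.
    """
    return (w_sem * semantic_reward
            - w_div * diversity_penalty
            + w_atk * multi_model_attack_reward)
```

Under this sketch, a prompt that is semantically faithful and transfers well across defense models, but duplicates an earlier attack, still scores lower than a novel prompt with the same attack success.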
Optionally, calculating the diversity penalty based on the adversarial prompts comprises: acquiring each prompt in an attack success pool, wherein the attack success pool stores prompts whose attacks succeeded; calculating Self-BLEU values and an average cosine similarity between the adversarial prompts and the pooled prompts; and calculating the diversity penalty through a nonlinear penalty function based on the Self-BLEU values and the average cosine similarity.

Optionally, calculating the multi-model attack reward based on the adversarial prompts comprises: counting attack success rates of the adversarial prompts on heterogeneous defense models; and calculating the multi-model attack reward from the attack success rates and the model weights corresponding to the heterogeneous defense models.

Optionally, calculating the semantic reward based on the adversarial prompts comprises: invoking a large language model to evaluate the semantic relevance between an adversarial prompt and its original base prompt; and determining the semantic reward corresponding to the adversarial prompt based on the semantic relevance.

Optionally, training the defender model comprises: inputting an adversarial prompt set generated by the trained attacker model into the defender model to obtain defensive responses generated by the defender model; and evaluating the defensive responses based on the three-level evaluation reward of the evaluator model, and training the defender model according to the evaluation results. Optionally, the evaluating the defensive respo
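The diversity-penalty step described above can be sketched as follows: compare a new adversarial prompt against every prompt in the attack success pool via Self-BLEU and average cosine similarity, then map the combined similarity through a nonlinear penalty function. The unigram-overlap BLEU surrogate, the bag-of-words cosine, the equal mixing of the two scores, and the exponential penalty curve are all simplifying assumptions; the patent names the components but not their exact formulas.

```python
import math
from collections import Counter


def unigram_bleu(candidate: str, reference: str) -> float:
    """Crude unigram-precision stand-in for Self-BLEU (assumption)."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    return overlap / max(1, sum(cand.values()))


def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words counts (assumption)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def diversity_penalty(prompt: str, success_pool: list[str], k: float = 3.0) -> float:
    """Hypothetical nonlinear diversity penalty in [0, 1] (claim 4 sketch)."""
    if not success_pool:
        return 0.0
    self_bleu = max(unigram_bleu(prompt, p) for p in success_pool)
    avg_cos = sum(cosine_sim(prompt, p) for p in success_pool) / len(success_pool)
    similarity = 0.5 * (self_bleu + avg_cos)
    # Nonlinear curve: near-duplicates of pooled prompts are punished much
    # harder than mildly similar ones.
    return (math.exp(k * similarity) - 1.0) / (math.exp(k) - 1.0)
```

The convex exponential mapping matters: with a linear penalty, an attacker can profitably emit many half-novel paraphrases; the nonlinear curve makes near-duplicates of pooled prompts disproportionately costly.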