WO-2026091553-A1 - BASE MODEL TRAINING METHOD AND APPARATUS
Abstract
Provided in the embodiments of the present disclosure are a base model training method and apparatus. The method comprises: generating, by means of self-play, a plurality of different response texts on the basis of a text instruction input into a base model; filtering the plurality of response texts on the basis of the response texts and the score values corresponding to them, and constructing a machine preference dataset; and training the base model on the basis of the constructed machine preference dataset.
Inventors
- LIU, Kunlin
- QU, Xinji
- TAN, Fang
- KANG, Honghui
Assignees
- ZTE Corporation (中兴通讯股份有限公司)
Dates
- Publication Date
- 2026-05-07
- Application Date
- 2025-06-17
- Priority Date
- 2024-10-28
Claims (11)
- A base model training method, comprising: generating, through self-play, a plurality of different response texts based on a text instruction input to a base model; filtering the plurality of response texts based on the plurality of response texts and the score values corresponding to them, and constructing a machine preference dataset; and training the base model based on the constructed machine preference dataset.
- The method according to claim 1, wherein the base model is an intelligent dialogue model obtained through supervised fine-tuning.
- The method according to claim 1, wherein filtering the plurality of response texts and constructing the machine preference dataset based on the plurality of response texts and their corresponding score values comprises: a generation step: generating, through self-play, a plurality of different response texts based on the text instruction input to the base model; a sorting step: sorting the plurality of response texts according to their corresponding score values to obtain the sorted response texts for the current iteration; a filtering step: filtering the sorted response texts for the current iteration to construct the machine preference dataset; an update step: updating the base model based on the constructed machine preference dataset, so that the updated base model generates a plurality of different response texts again; and a judgment step: determining whether the current iteration count reaches a preset threshold, and if it does not, repeating the generation step, the sorting step, the filtering step, the update step, and the judgment step.
- The method according to claim 3, wherein sorting the plurality of response texts according to their corresponding score values to obtain the sorted response texts for the current iteration comprises: inputting the plurality of response texts into a reward model and scoring them according to scoring rules to obtain a score value for each response text, wherein the scoring rules include at least one of: whether the response text is missing punctuation marks, the word count of the response text, the grammar of the response text, and the citations in the response text; and arranging the plurality of response texts in descending or ascending order of their score values to obtain the sorted response texts for the current iteration.
- The method according to claim 4, wherein filtering the sorted response texts for the current iteration to construct the machine preference dataset comprises: for a given text instruction, determining the response text with the highest score value among the plurality of response texts, as determined by the reward model, to be the optimal response text; for the same text instruction, determining the response text with the lowest score value among the plurality of response texts to be the worst response text; and combining the optimal response texts and the worst response texts to construct the machine preference dataset.
- The method according to claim 1, wherein training the base model based on the constructed machine preference dataset comprises: adjusting the parameters of the base model using a reinforcement learning algorithm based on the constructed machine preference dataset.
- The method according to claim 6, wherein the reinforcement learning algorithm includes at least one of: Direct Preference Optimization (DPO), Self-Play Preference Optimization (SPPO), Proximal Policy Optimization (PPO), and Kahneman-Tversky Optimization (KTO).
- A base model training apparatus, comprising: a generation module configured to generate, through self-play, a plurality of different response texts based on a text instruction input to a base model; a filtering module configured to filter the plurality of response texts based on the plurality of response texts and their corresponding score values, and to construct a machine preference dataset; and a training module configured to train the base model based on the constructed machine preference dataset.
- A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method described in any one of claims 1 to 7.
- An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
- A computer program product comprising a computer program that, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
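Claims 1 and 3 to 5 together describe an iterative pipeline: sample several responses per instruction via self-play, score them with a reward model, sort by score, and keep the highest- and lowest-scoring responses as a preference pair. The sketch below illustrates that data-construction step in Python; the response generator and the word-count-based reward are placeholders invented here for illustration, not the models disclosed in the patent:

```python
def reward_score(response: str) -> float:
    """Placeholder reward model. The claims score on punctuation,
    word count, grammar, and citations; here word count alone
    stands in as an illustrative proxy."""
    return float(len(response.split()))

def generate_responses(instruction: str, n: int = 4) -> list[str]:
    """Placeholder for self-play sampling from the base model:
    returns n distinct candidate responses for one instruction."""
    return [f"{instruction}: " + " ".join(f"point{j}" for j in range(i + 1))
            for i in range(n)]

def build_preference_pair(instruction: str) -> dict:
    """One pass of the claimed pipeline: generate, score, sort
    in descending order, then keep the best and worst responses."""
    responses = generate_responses(instruction)
    ranked = sorted(responses, key=reward_score, reverse=True)
    return {"prompt": instruction,
            "chosen": ranked[0],     # highest score -> optimal response
            "rejected": ranked[-1]}  # lowest score -> worst response

# The machine preference dataset aggregates one pair per instruction.
dataset = [build_preference_pair(p) for p in ["Explain DNS", "Summarize TCP"]]
```

In the full method of claim 3, the base model would then be updated on this dataset (e.g. with one of the algorithms named in claim 7) and the loop repeated until the preset iteration threshold is reached.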
Description
Base Model Training Method and Apparatus

Cross-Reference to Related Applications
This disclosure is based on and claims priority to Chinese patent application CN202411520303.6, filed on October 28, 2024, entitled "Base Model Training Method and Apparatus", the entire contents of which are incorporated herein by reference.

Technical Field
This disclosure relates to the field of natural language processing, and more specifically, to a base model training method and apparatus.

Background
With the rapid development of artificial intelligence technology, especially in the field of Natural Language Processing (NLP), foundation models, such as the large-scale pre-trained models Generative Pre-trained Transformer (GPT), LLaMA, and Bidirectional Encoder Representations from Transformers (BERT), have acquired, through unsupervised pre-training on massive amounts of data, powerful language understanding and generation capabilities, enabling them to perform complex language tasks such as dialogue, machine translation, and text generation. However, despite their excellent performance on many tasks, how to further train and optimize these models remains a pressing issue. Traditional training methods for base models rely primarily on supervised learning: fine-tuning the model on a large amount of manually labeled, high-quality training data. While this approach improves the training efficiency and performance of base models to some extent, the high cost and time consumption of manual labeling limit the scale and speed of base model training. In summary, no effective solution to this problem has yet been proposed in the related art.
Summary of the Invention
This disclosure provides a base model training method and apparatus to at least solve the problem in the related art that base model training relies mainly on a large amount of manually labeled data, resulting in long training cycles and high costs. According to one embodiment of this disclosure, a base model training method is provided, comprising: generating multiple different response texts through self-play based on a text instruction input to a base model; filtering the multiple response texts and constructing a machine preference dataset based on the multiple response texts and their corresponding score values; and training the base model based on the constructed machine preference dataset. According to another embodiment of this disclosure, a base model training apparatus is provided, comprising: a generation module configured to generate multiple different response texts through self-play based on a text instruction input to the base model; a filtering module configured to filter the multiple response texts and construct a machine preference dataset based on the multiple response texts and their corresponding score values; and a training module configured to train the base model based on the constructed machine preference dataset. According to yet another embodiment of this disclosure, a computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program is configured to perform the steps in any of the above method embodiments when run. According to yet another embodiment of this disclosure, an electronic device is also provided, including a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the steps in any of the above method embodiments.
According to yet another embodiment of this disclosure, a computer program product is also provided, including a computer program that, when executed by a processor, implements the steps in any of the above method embodiments.

Brief Description of the Drawings
Figure 1 is a hardware structure block diagram of a computer terminal for a base model training method according to an embodiment of the present disclosure; Figure 2 is a structural block diagram of a base model training apparatus based on large model self-play according to an embodiment of the present disclosure; Figure 3 is a schematic diagram of the overall process of training the base model according to an embodiment of the present disclosure; Figure 4 is a flowchart of a base model training method according to an embodiment of the present disclosure; Figure 5 is a structural block diagram of a base model training apparatus according to an embodiment of the present disclosure.

Detailed Description
The embodiments of this disclosure will be described in detail below with reference to the accompanying drawings and examples. It should be noted that the terms "first," "second," etc., in the specification, claims, and drawings of this disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. The methods and embodiments provide