CN-121980266-A - Text quality assessment model training method
Abstract
The application discloses a text quality assessment model training method. In the sampling stage, implicit information about the labeled scores is first added to the prompt words, introducing supervision from the labeled scores and improving sampling efficiency. A two-dimensional similarity calculation method is then provided as the evaluation metric for the scoring task, supplying fine-grained rewards for reinforcement learning. Finally, an implicit self-distillation loss function is provided, which accelerates the training convergence of the scoring model. Compared with the prior art, the method addresses the low positive-sample rate and slow training convergence of conventional fine-tuning through implicit sampling and the implicit self-distillation loss function, significantly improving sampling efficiency, reducing computing-resource consumption, improving training efficiency, and efficiently training an assessment model for a specific field.
Inventors
- WEN GUIHUA
- ZHOU BO
Assignees
- South China University of Technology (华南理工大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20260120
Claims (10)
- 1. A text quality assessment model training method, comprising: obtaining text data containing environment information and samples to be assessed, together with labeled scores of the samples to be assessed; constructing a prompt-word group for the samples to be assessed, wherein each prompt word in the group contains a standard prompt word and a hint prompt word; sampling a base model with the hint prompt words to generate a group of candidate responses; extracting the score sequence from each candidate response in the group, and calculating, using a two-dimensional similarity algorithm, a similarity value between the score sequence and the labeled scores of the samples to be assessed corresponding to the current candidate response; screening positive sample responses based on the similarity values, and performing supervised fine-tuning of the base model using the positive sample responses; and performing secondary sampling of the remaining samples to be assessed based on the hint prompt words using the fine-tuned base model to generate a second group of candidate responses, calculating advantages of the second group of candidate responses, constructing a policy loss term, introducing an average KL divergence loss, and jointly optimizing the fine-tuned base model to obtain the text quality assessment model.
- 2. The text quality assessment model training method of claim 1, wherein constructing the prompt-word group for the samples to be assessed includes constructing a corresponding prompt word for every non-empty subset of the samples to be assessed.
- 3. The text quality assessment model training method of claim 1, wherein calculating the similarity value between the score sequence and the labeled scores of the samples to be assessed corresponding to the current candidate response using the two-dimensional similarity algorithm comprises:

  $\mathrm{Sim}(X,Y)=\alpha\left(1-\frac{\sum_{i=1}^{N}\left(R_i^{X}-R_i^{Y}\right)^{2}}{D_{\max}}\right)+\beta\left(1-\frac{1}{N}\sum_{i=1}^{N}\left|x_i-y_i\right|\right)$

  where $X$ denotes the score sequence; $N$ is the length of the score sequence; $Y$ denotes the labeled scores of the samples to be assessed corresponding to the current candidate response; $\alpha$ and $\beta$ are the weights of the rank similarity and the numerical similarity, respectively; $i$ is the index of a sample to be assessed within the candidate response; $R_i^{X}$ is the rank of the i-th sample to be assessed in $X$; $R_i^{Y}$ is the rank of the i-th sample to be assessed in $Y$; $D_{\max}$ is the maximum sum of squared rank errors, attained when the two rankings are diametrically opposed; $x_i$ denotes the score of the i-th sample to be assessed in $X$; and $y_i$ denotes the score of the i-th sample to be assessed in $Y$.
- 4. The text quality assessment model training method of claim 1, wherein screening positive sample responses based on the similarity values comprises: when the similarity value meets a preset threshold, taking the current candidate response as a positive sample response; otherwise, rejecting it.
- 5. The text quality assessment model training method of claim 1, wherein in the supervised fine-tuning of the base model using the positive sample responses, the objective function includes a standard supervised fine-tuning loss and an implicit self-distillation loss, expressed as:

  $\mathcal{L}=\mathcal{L}_{\mathrm{SFT}}+\lambda\,\mathcal{L}_{\mathrm{SD}}$

  where $\mathcal{L}_{\mathrm{SFT}}$ denotes the standard supervised fine-tuning loss, $\lambda$ denotes a hyper-parameter, and $\mathcal{L}_{\mathrm{SD}}$ denotes the implicit self-distillation loss, expressed as:

  $\mathcal{L}_{\mathrm{SD}}=D_{\mathrm{KL}}\left(\mathrm{sg}\left[\pi_{\theta}\left(y^{+}\mid q_{\mathrm{hint}}\right)\right]\,\middle\|\,\pi_{\theta}\left(y^{+}\mid q_{\mathrm{std}}\right)\right)$

  where $q_{\mathrm{std}}$ denotes the standard prompt word, $q_{\mathrm{hint}}$ denotes the hint prompt word, $y^{+}$ denotes a positive sample response, $\mathrm{sg}[\cdot]$ denotes the gradient cut-off (stop-gradient) operation, $\pi_{\theta}$ denotes the base model, and $D_{\mathrm{KL}}$ denotes the KL divergence.
- 6. The text quality assessment model training method of claim 1, wherein calculating the advantages of the second group of candidate responses comprises:

  $A_j=r_j-\bar{r},\qquad \bar{r}=\frac{1}{N}\sum_{j=1}^{N}r_j,\qquad r_j=\mathrm{Sim}\left(\mathrm{Extract}\left(o_j\right),\,Y_j\right)$

  where $A_j$ denotes the advantage of the j-th candidate response in the second group of candidate responses; $o_j$ denotes the j-th candidate response; $r_j$ denotes the reward of the j-th candidate response; $\bar{r}$ denotes the average reward of the second group of candidate responses; $N$ denotes the number of candidate responses in the second group; $\mathrm{Sim}(\cdot,\cdot)$ denotes the similarity value between the score sequence of the j-th candidate response and the corresponding labeled score sequence $Y_j$; and $\mathrm{Extract}(\cdot)$ denotes extracting the score sequence from the candidate response according to a preset answer format.
- 7. The text quality assessment model training method of claim 1, wherein the constructed policy loss term is expressed as:

  $\mathcal{L}_{\mathrm{policy}}=-\frac{1}{N}\sum_{j=1}^{N}\frac{1}{T}\sum_{t=1}^{T}A_j\log\pi_{\theta}\left(o_{j,t}\mid q_{\mathrm{std}},\,o_{j,<t}\right)$

  where $N$ denotes the number of candidate responses in the second group of candidate responses, $\pi_{\theta}$ denotes the fine-tuned base model, $q_{\mathrm{std}}$ denotes the standard prompt word, $T$ denotes the token-level length of the sequence $o_j$, and $o_{j,t}$ and $o_{j,<t}$ respectively denote the t-th token of the sequence $o_j$ and the sequence of its first t−1 tokens. The introduced average KL divergence is expressed as:

  $\bar{D}_{\mathrm{KL}}=\frac{1}{N}\sum_{j=1}^{N}D_{\mathrm{KL}}\left(\pi_{\theta}\left(o_j\mid q_{\mathrm{std}}\right)\,\middle\|\,\pi_{\theta}\left(o_j\mid q_{\mathrm{hint}}\right)\right)$

  where $q_{\mathrm{hint}}$ denotes the hint prompt word and $o_j$ denotes the j-th candidate response.
- 8. The text quality assessment model training method according to any one of claims 1 to 7, wherein the text data is intelligent diagnosis and treatment data, the environmental information is patient health state description information, and the sample to be assessed is a candidate diagnosis and treatment scheme for a current patient.
- 9. The method according to any one of claims 1 to 7, wherein the text data is educational question bank data, the environmental information is knowledge points and difficulty requirements associated with questions, and the sample to be evaluated is a question text with scores.
- 10. The training method of a text quality assessment model according to any one of claims 1 to 7, wherein the text data is customer service dialogue data, the environmental information is user query information and dialogue history, and the sample to be assessed is customer service reply text.
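The prompt-group construction of claim 2 (one prompt per non-empty subset of the samples to be assessed) can be sketched as follows. The prompt wording and the way the labeled scores are embedded in the hint prompt are illustrative assumptions; the patent does not give exact templates.

```python
from itertools import combinations

def build_prompt_group(samples, scores, context):
    """Build a standard prompt word and a hint prompt word for every
    non-empty subset of the samples to be assessed (claim 2).
    Prompt templates here are assumptions for illustration only."""
    prompts = []
    for r in range(1, len(samples) + 1):
        for idx in combinations(range(len(samples)), r):
            subset = [samples[i] for i in idx]
            standard = f"Context: {context}\nScore these candidates: {subset}"
            # The hint prompt implicitly carries the labeled scores so that
            # sampling is supervised without stating the answer verbatim.
            hints = [scores[i] for i in idx]
            prompts.append({"standard": standard,
                            "hinted": standard + f"\nQuality hints: {hints}",
                            "labels": hints})
    return prompts
```

A set of n samples yields 2^n − 1 prompt pairs, which is why the first-stage sampling budget grows quickly with group size.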
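Claim 3's two-dimensional similarity — a weighted sum of a rank term and a numerical term — can be sketched as below. The equal default weights and the range normalization of the numerical term are assumptions; the patent only names the two components and the maximum-squared-rank-error normalizer.

```python
import numpy as np

def two_dim_similarity(x, y, alpha=0.5, beta=0.5):
    """Sketch of the two-dimensional similarity: alpha * rank similarity +
    beta * numerical similarity. Weights and the numerical-term scaling
    are assumptions."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    rx, ry = x.argsort().argsort(), y.argsort().argsort()  # 0-based ranks
    # D_max: sum of squared rank errors when the rankings are reversed.
    d_max = np.sum((np.arange(n) - np.arange(n)[::-1]) ** 2)
    rank_sim = 1.0 if d_max == 0 else 1.0 - np.sum((rx - ry) ** 2) / d_max
    # Numerical similarity: mean absolute score error scaled to the range.
    scale = max(x.max(), y.max()) - min(x.min(), y.min())
    num_sim = 1.0 - np.mean(np.abs(x - y)) / (scale if scale > 0 else 1.0)
    return alpha * rank_sim + beta * num_sim
```

Identical sequences score 1.0; a fully reversed ranking zeroes the rank term while the numerical term still credits close score values, which is what makes the reward fine-grained.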
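The implicit self-distillation loss of claim 5 can be sketched with NumPy standing in for the model's per-token logits. The KL direction (hint branch as the stop-gradient teacher, standard-prompt branch as the student) follows the claim's definitions; in a real trainer the teacher branch would be detached from the autograd graph.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def implicit_self_distill_loss(logits_hint, logits_std):
    """Per-token KL from the (stop-gradient) distribution under the hint
    prompt to the distribution under the standard prompt, averaged over
    the positive-sample response's tokens."""
    p_t = softmax(np.asarray(logits_hint, float))  # teacher, sg[.] branch
    p_s = softmax(np.asarray(logits_std, float))   # student branch
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return float(kl.mean())

def joint_objective(sft_loss, logits_hint, logits_std, lam=0.1):
    # L = L_SFT + lambda * L_SD; the lambda value here is an assumption.
    return sft_loss + lam * implicit_self_distill_loss(logits_hint, logits_std)
```

Matching the standard-prompt distribution to the hint-conditioned one is what transfers the labeled-score supervision into the model's unhinted behavior, which is the mechanism behind the faster convergence claimed in the abstract.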
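The group-relative advantage of claim 6 can be sketched as follows; score-sequence extraction from the raw response text is elided, and the reward simply applies the similarity function to the extracted scores.

```python
import numpy as np

def group_advantages(rewards):
    """Claim 6's advantage: each candidate response's reward minus the
    group's mean reward. (Many GRPO implementations also divide by the
    group's std; the claim only defines mean-centering.)"""
    r = np.asarray(rewards, float)
    return r - r.mean()

def response_reward(candidate_scores, label_scores, sim_fn):
    # r_j = Sim(Extract(o_j), Y); extraction of the score sequence from the
    # raw text response is elided here. sim_fn is the two-dimensional
    # similarity (any callable works for this sketch).
    return sim_fn(candidate_scores, label_scores)
```

Mean-centering makes the advantages sum to zero within each group, so responses are rewarded only relative to their peers.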
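The policy loss term of claim 7 can be sketched as an advantage-weighted, length-normalized negative log-likelihood averaged over the group. The per-token log-probabilities would come from the fine-tuned model in a real trainer; here they are passed in directly.

```python
import numpy as np

def policy_loss(token_logps, advantages):
    """Sketch of the policy loss: -(1/N) sum_j (A_j / T_j) sum_t log pi.
    `token_logps[j]` holds the per-token log-probs of candidate response j
    given the standard prompt (simulated inputs; a real trainer supplies
    them from the model)."""
    per_response = []
    for logp, adv in zip(token_logps, advantages):
        logp = np.asarray(logp, float)
        per_response.append(-adv * logp.mean())  # -(A_j / T) * sum_t log pi
    return float(np.mean(per_response))          # average over N responses
```

Positive-advantage responses thus have their token likelihoods pushed up and negative-advantage responses pushed down, while the separate average KL term of claim 7 keeps the standard-prompt and hint-prompt behaviors from drifting apart.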
Description
Text quality assessment model training method

Technical Field

The invention relates to the technical field of model training, in particular to a text quality assessment model training method.

Background

In recent years, with the rapid development of the reasoning capability of large language models, generative scoring models have shown application potential in automatic assessment tasks across many professional fields, such as intelligent diagnosis and treatment, educational assessment, and customer-service quality inspection. Compared with traditional scalar scoring or reward models, a generative scoring model can better exploit reasoning capability and reduce dependence on training data. To train such models, Rejection sampling Fine-Tuning (RFT) and Group Relative Policy Optimization (GRPO) are currently commonly employed. The RFT method samples the base model multiple times with standard prompt words and screens out high-quality answers, manually or by rules, to construct a fine-tuning data set, thereby improving the performance of the model on the target task. The GRPO method optimizes the model through an intra-group answer comparison mechanism that encourages relatively better answers and penalizes relatively worse ones, improving performance on specific tasks while retaining the original capability of the model. However, when the existing RFT and GRPO methods are directly applied to professional text quality scoring tasks in a vertical domain, their inherent drawbacks are amplified, leading to serious challenges in training efficiency and in the upper bound of model performance, embodied as follows: on the one hand, poor domain suitability leads to low sampling efficiency, since in a professional field high-quality assessment requires deep domain knowledge.
The existing methods sample with common prompt words, so a great number of sampling attempts are needed to obtain only a few usable positive samples. If the standard answer is directly given in the prompt word, the model may expose its knowledge of the correct answer in its chain of thought, rendering the data unusable. On the other hand, insufficient domain knowledge creates a bottleneck on the learning upper limit: the methods under-exploit the labeled answers, which are used only for screening answers or computing rewards. If the model cannot by itself produce answers similar to the labeled answers on some samples, then no matter how many times it is sampled, the screening mechanism cannot obtain effective positive samples; the model therefore cannot learn effectively from the high-quality labeled data, and training falls into a bottleneck. Overcoming the above problems is therefore an urgent need for those skilled in the art.

Disclosure of Invention

In view of the above, the present invention provides a text quality assessment model training method that overcomes, or at least partially solves, the above-mentioned problems.
In order to achieve the above purpose, the present invention adopts the following technical scheme. An embodiment of the invention provides a text quality assessment model training method, comprising the following steps: obtaining text data containing environment information and samples to be assessed, together with labeled scores of the samples to be assessed; constructing a prompt-word group for the samples to be assessed, wherein each prompt word in the group contains a standard prompt word and a hint prompt word; sampling a base model with the hint prompt words to generate a group of candidate responses; extracting the score sequence from each candidate response in the group, and calculating, using a two-dimensional similarity algorithm, a similarity value between the score sequence and the labeled scores of the samples to be assessed corresponding to the current candidate response; screening positive sample responses based on the similarity values, and performing supervised fine-tuning of the base model using the positive sample responses; and performing secondary sampling of the remaining samples to be assessed based on the hint prompt words using the fine-tuned base model to generate a second group of candidate responses, calculating advantages of the second group of candidate responses, constructing a policy loss term, introducing an average KL divergence loss, and jointly optimizing the fine-tuned base model to obtain the text quality assessment model. The remaining samples to be assessed include the samples not yet sampled and the samples rejected during positive-sample screening. The text quality assessment model assesses text quality based on standard prompt words. Preferably, constructing the prompt-word group for the samples to be assessed includes constructing a corresponding prompt word for every non-empty subset of the samples to be assessed. Preferably, calculating a similarity value between the score sequence and the labeling score of the sample to be evaluated corresponding to the current candidate response by adopting a