CN-121982445-A - Visual language model training method, device, robot and program product
Abstract
The application relates to the technical field of embodied intelligence, and provides a visual language model training method, a device, a robot and a program product. The method comprises: inputting each piece of multi-modal data into a visual language model to obtain a plurality of candidate answers for each piece of multi-modal data; scoring each candidate answer based on a rule reward and a preset scoring model to obtain a target score for the candidate answer; screening target answers from all the candidate answers based on the target scores of all the candidate answers of each piece of multi-modal data; constructing training samples from the target answers and their corresponding multi-modal data; and fine-tuning the visual language model on the training samples. The application can solve the problems of unstable training, reward hacking, and high sampling cost that RLHF faces in embodied tasks.
Inventors
- YANG JUNPENG
Assignees
- 深圳市优必选科技股份有限公司 (UBTECH Robotics Corp., Shenzhen)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2025-12-19
Claims (11)
- 1. A method for training a visual language model, comprising: inputting each piece of multi-modal data into a visual language model to obtain a plurality of candidate answers for each piece of multi-modal data; scoring each candidate answer based on a rule reward and a preset scoring model to obtain a target score for the candidate answer; screening target answers from all the candidate answers based on the target scores of all the candidate answers of each piece of multi-modal data; constructing a training sample based on the target answer and the multi-modal data corresponding to the target answer; and fine-tuning the visual language model based on the training sample.
- 2. The visual language model training method of claim 1, further comprising, prior to scoring each candidate answer based on the rule reward and the preset scoring model: obtaining the scoring dimensions of the rule reward, wherein the scoring dimensions comprise at least one of semantic analysis, target detection and task planning; and designing a prompt word for the preset scoring model based on task completion, task completion efficiency and content quality; wherein scoring each candidate answer based on the rule reward and the preset scoring model to obtain the target score of the candidate answer comprises: scoring the candidate answer on the scoring dimensions to obtain a first score of the candidate answer; guiding the preset scoring model with the prompt word to score the candidate answer, obtaining a second score of the candidate answer; and carrying out a weighted summation of the first score and the second score based on the weight of the rule reward and the weight of the preset scoring model to obtain the target score of the candidate answer.
- 3. The method for training a visual language model according to claim 2, wherein, in the case that the application scenario of the visual language model is a hotel and restaurant service scenario, the scoring dimensions further comprise destination accuracy and politeness; and when the application scenario of the visual language model is the hotel and restaurant service scenario, designing the prompt word of the preset scoring model based on task completion, task completion efficiency and content quality comprises: designing the prompt word based on task completion, task completion efficiency and content quality under constraints that prohibit involving user privacy and require compliance with the hotel's service specifications.
- 4. The method according to claim 2, wherein, in the case that the application scenario of the visual language model is a healthcare service scenario, the scoring dimensions further comprise restricted-area detection; and when the application scenario of the visual language model is the healthcare service scenario, designing the prompt word of the preset scoring model based on task completion, task completion efficiency and content quality comprises: designing the prompt word based on task completion, task completion efficiency and content quality under the constraint of prohibiting the provision of diagnosis-related information and medication instructions.
- 5. The method for training a visual language model according to claim 2, wherein, in the case that the application scenario of the visual language model is an educational service scenario, the scoring dimensions further comprise explanatory-wording detection; and when the application scenario of the visual language model is the educational service scenario, designing the prompt word of the preset scoring model based on task completion, task completion efficiency and content quality comprises: designing the prompt word based on task completion, task completion efficiency and content quality under the constraint of prohibiting the provision of unsafe content and allowing only age-appropriate content.
- 6. The method for training a visual language model according to claim 2, wherein, in the case that the application scenario of the visual language model is a companion service scenario, the scoring dimensions further comprise emotional soothing detection; and when the application scenario of the visual language model is the companion service scenario, designing the prompt word of the preset scoring model based on task completion, task completion efficiency and content quality comprises: designing the prompt word based on task completion, task completion efficiency and content quality under the constraint of prohibiting the provision of unsuitable behavioral guidance.
- 7. The visual language model training method according to claim 2, further comprising, after fine-tuning the visual language model based on the training sample: carrying out iterative optimization of the visual language model, and adjusting the sampling parameters of the visual language model, the weight of the rule reward and the weight of the preset scoring model during the iterative optimization, wherein the sampling parameters comprise at least one of sampling temperature, top-k value and top-p value.
- 8. The visual language model training method of any one of claims 1 to 7, wherein screening target answers from all the candidate answers based on the target scores of all the candidate answers of each piece of multi-modal data comprises: if the text length of the multi-modal data is smaller than a preset length threshold and the task response scenario of the visual language model is a fixed-output scenario, determining the candidate answers whose target scores rank in the top N among the plurality of candidate answers of the multi-modal data as the target answers, wherein N is an integer greater than zero; and if the task response scenario of the visual language model is a free-text answer scenario or a multi-step task scenario, determining the candidate answers whose target scores rank in the top M among all the candidate answers of the multi-modal data as the target answers, wherein M is an integer greater than zero.
- 9. A visual language model training apparatus, comprising: an answer acquisition module for inputting each piece of multi-modal data into the visual language model to obtain a plurality of candidate answers for each piece of multi-modal data; an answer scoring module for scoring each candidate answer based on a rule reward and a preset scoring model to obtain a target score for the candidate answer; a sampling and screening module for screening target answers from all the candidate answers based on the target scores of all the candidate answers of each piece of multi-modal data; a sample construction module for constructing a training sample based on the target answer and the multi-modal data corresponding to the target answer; and a fine-tuning training module for fine-tuning the visual language model based on the training sample.
- 10. A robot comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, causes the robot to implement the visual language model training method according to any one of claims 1 to 8.
- 11. A computer program product comprising a computer program which, when run, causes the visual language model training method of any one of claims 1 to 8 to be performed.
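The scene-dependent screening step of claim 8 can be sketched as follows. This is a minimal illustration only: the function name, the scene labels, and the default values of N and M are assumptions, not taken from the patent.

```python
def screen_target_answers(candidates, scores, scene, n=1, m=3):
    """Select target answers by target score (illustrative names).

    Per claim 8: in a fixed-output scene keep the top-N candidates;
    in a free-text or multi-step scene keep the top-M.
    """
    # Rank candidates by descending target score.
    ranked = [ans for _, ans in
              sorted(zip(scores, candidates), key=lambda pair: -pair[0])]
    keep = n if scene == "fixed_output" else m
    return ranked[:keep]
```

In this sketch the only scene-dependent behavior is the cutoff; the patent additionally conditions the fixed-output branch on the text length of the multi-modal data, which is omitted here for brevity.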
Description
Visual language model training method, device, robot and program product

Technical Field

The application belongs to the technical field of embodied intelligence, and in particular relates to a visual language model training method, a device, a robot and a program product.

Background

Traditional robot training methods mainly combine supervised learning with reinforcement learning. Reinforcement Learning from Human Feedback (RLHF) uses human preferences as the reward signal, but in embodied tasks it suffers from unstable training, reward hacking, and high sampling cost.

Disclosure of Invention

The embodiments of the application provide a visual language model training method, a device, a robot and a program product, which can solve the problems of unstable training, reward hacking and high sampling cost of RLHF in embodied tasks. In a first aspect, an embodiment of the present application provides a visual language model training method, including: inputting each piece of multi-modal data into a visual language model to obtain a plurality of candidate answers for each piece of multi-modal data; scoring each candidate answer based on a rule reward and a preset scoring model to obtain a target score for the candidate answer; screening target answers from all the candidate answers based on the target scores of all the candidate answers of each piece of multi-modal data; constructing a training sample based on the target answer and the multi-modal data corresponding to the target answer; and fine-tuning the visual language model based on the training sample.
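The five steps of the first aspect can be sketched as a single rejection-sampling round. Everything here is an invented stand-in for components the patent leaves unspecified: `ToyVLM`, the sample count, and the scoring callback are illustrative, not the patent's implementation.

```python
class ToyVLM:
    """Invented stand-in for the visual language model under training."""

    def generate(self, item, num_samples=4):
        # Offline sampling: pretend each call yields num_samples candidate answers.
        return [f"{item}-ans{i}" for i in range(num_samples)]

    def fine_tune(self, samples):
        # Stand-in for supervised fine-tuning on the screened samples.
        self.last_batch = samples


def rsft_round(model, dataset, score_fn, top_k=1):
    """One rejection-sampling fine-tuning round over the five claimed steps."""
    training_samples = []
    for item in dataset:
        candidates = model.generate(item)                        # step 1: sample candidates
        ranked = sorted(candidates, key=score_fn, reverse=True)  # step 2: score them
        for answer in ranked[:top_k]:                            # step 3: screen top answers
            training_samples.append((item, answer))              # step 4: build samples
    model.fine_tune(training_samples)                            # step 5: fine-tune
    return training_samples
```

Because the candidates are generated and scored offline before any gradient update, low-scoring samples are simply rejected rather than penalized online, which is the source of the claimed stability and cost advantages over RLHF.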
In the embodiments of the application, the visual language model generates a plurality of candidate answers offline under a training framework based on Rejection Sampling Fine-Tuning (RSFT), which reduces the online sampling cost; the candidate answers are screened by a hybrid evaluation mechanism combining rule rewards with a preset scoring model, so that bad samples can be filtered out, reward hacking is avoided, and training stability is improved, thereby solving the problems of unstable training, reward hacking and high sampling cost of RLHF in embodied tasks. In some embodiments of the first aspect, before scoring each candidate answer based on the rule reward and the preset scoring model, the method further comprises: obtaining the scoring dimensions of the rule reward, wherein the scoring dimensions comprise at least one of semantic analysis, target detection and task planning; and designing a prompt word for the preset scoring model based on task completion, task completion efficiency and content quality. Scoring each candidate answer based on the rule reward and the preset scoring model to obtain a target score for the candidate answer comprises: scoring the candidate answer on the scoring dimensions to obtain a first score; guiding the preset scoring model with the prompt word to score the candidate answer, obtaining a second score; and carrying out a weighted summation of the first score and the second score, using the weight of the rule reward and the weight of the preset scoring model, to obtain the target score of the candidate answer.
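The weighted summation of the first (rule-reward) score and the second (scoring-model) score can be written out directly. The equal weighting across rule dimensions and the default weights of 0.4/0.6 are assumptions for illustration; the patent does not fix these values.

```python
def rule_score(dimension_scores):
    """Average the rule-reward dimensions (e.g. semantic analysis,
    target detection, task planning). Equal weighting is assumed."""
    return sum(dimension_scores.values()) / len(dimension_scores)


def target_score(dimension_scores, model_score, w_rule=0.4, w_model=0.6):
    """Weighted sum of the rule-reward score (first score) and the
    preset scoring model's score (second score). Weights are illustrative
    and, per claim 7, would be adjusted during iterative optimization."""
    return w_rule * rule_score(dimension_scores) + w_model * model_score
```

For example, rule-dimension scores of 1.0, 0.5, and 0.0 average to 0.5; combined with a model score of 1.0 under the default weights, the target score is 0.4 × 0.5 + 0.6 × 1.0 = 0.8.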
In some embodiments of the first aspect, where the application scenario of the visual language model is a hotel and restaurant service scenario, the scoring dimensions further include destination accuracy and politeness; when the application scenario is the hotel and restaurant service scenario, designing the prompt word of the preset scoring model based on task completion, task completion efficiency and content quality includes: designing the prompt word based on task completion, task completion efficiency and content quality under constraints that prohibit involving user privacy and require compliance with the hotel's service specifications. In some embodiments of the first aspect, where the application scenario of the visual language model is a healthcare service scenario, the scoring dimensions further include restricted-area detection; when the application scenario is the healthcare service scenario, designing the prompt word of the preset scoring model based on task completion, task completion efficiency and content quality includes: designing the prompt word based on task completion, task completion efficiency and content quality under the constraint of prohibiting the provision of diagnosis-related information and medication instructions. In some embodiments of the first aspect, where the application scenario of the visual language model is an educational service scenario, the scoring dimensions further include explanatory-wording detection; when the application sc