
CN-121352074-B - Model training method, task processing method, system, electronic device, storage medium, and computer program product

CN 121352074 B

Abstract

The application discloses a model training method, a task processing method, a system, an electronic device, a storage medium, and a computer program product, relating to the technical fields of artificial intelligence and multimodal data processing. The model training method comprises: constructing a training set corresponding to a latent action space from multimodal corpus data, wherein the latent action space is characterized by multimodal latent actions; training an initial interaction processing model on the training set to obtain an intermediate interaction processing model; and performing reinforcement training on the intermediate interaction processing model based on the latent action space and multimodal prompt data to obtain a target interaction processing model, wherein the target interaction processing model is used to process multimodal interaction tasks and generate task processing results. The application addresses the technical problems of low efficiency and insufficient generalization capability of model training methods in the related art.

Inventors

  • LI YONGQI
  • LANG HAO
  • LI YONGBIN

Assignees

  • Alibaba (China) Co., Ltd.

Dates

Publication Date
2026-05-12
Application Date
2025-12-18

Claims (20)

  1. A model training method, comprising: constructing a training set corresponding to a latent action space by using multimodal corpus data, wherein the latent action space is characterized by multimodal latent actions, and the training set uses the multimodal latent actions in the latent action space as supervision signals; training an initial interaction processing model based on the training set to obtain an intermediate interaction processing model; and performing reinforcement training on the intermediate interaction processing model based on the latent action space and multimodal prompt data to obtain a target interaction processing model, wherein the target interaction processing model is used to process multimodal interaction tasks so as to generate task processing results; wherein the multimodal corpus data comprises first corpus data and second corpus data, constructing the training set corresponding to the latent action space by using the multimodal corpus data comprises constructing the latent action space by using the first corpus data, and the data volume of the first corpus data is larger than that of the second corpus data; and wherein the first corpus data comprises an original multimodal corpus and an original text corpus, and constructing the latent action space by using the first corpus data comprises: sampling the original multimodal corpus to obtain a first original state sequence of paired multimodal data, and encoding the first original state sequence to obtain an original multimodal representation; sampling the original text corpus to obtain a second original state sequence of the text modality, and encoding the second original state sequence to obtain an original text representation; and optimizing an initial latent action codebook based on the first original state sequence, the second original state sequence, the original multimodal representation, and the original text representation to obtain a target latent action codebook, wherein a plurality of candidate multimodal latent actions contained in the target latent action codebook define the latent action space.
  2. The model training method according to claim 1, wherein optimizing the initial latent action codebook based on the first original state sequence, the second original state sequence, the original multimodal representation, and the original text representation to obtain the target latent action codebook comprises: performing state transition modeling on the first original state sequence to determine a first loss, and performing state transition modeling on the second original state sequence to determine a fourth loss; performing modality representation alignment on the original multimodal representation to determine a second loss, and performing modality representation alignment on the original text representation to determine a third loss; and optimizing the initial latent action codebook based on the first loss, the second loss, the third loss, and the fourth loss to obtain the target latent action codebook.
  3. The model training method according to claim 2, wherein performing state transition modeling on the first original state sequence to determine the first loss comprises: performing state transition action estimation on the first original state sequence by using an initial inverse dynamics model to obtain a first estimated action, wherein coding information of the first estimated action is sampled from the initial latent action codebook; performing state deduction on the first estimated action and the original multimodal representation by using an initial world model to obtain a first reconstructed state sequence; and determining the first loss from the first original state sequence and the first reconstructed state sequence.
  4. The model training method according to claim 2, wherein performing modality representation alignment on the original multimodal representation to determine the second loss comprises: performing modality mapping on a paired text representation corresponding to the original multimodal representation by using a first mapper to obtain a first mapped multimodal representation, wherein the first mapper maps text-modality representations into multimodal representations; performing modality mapping on the original multimodal representation by using a second mapper to obtain a first mapped text representation, wherein the second mapper maps multimodal representations into text-modality representations; and determining the second loss from the original multimodal representation, the first mapped multimodal representation, the paired text representation, and the first mapped text representation.
  5. The model training method according to claim 2, further comprising: according to the first loss, performing initial warm-up training on an initial inverse dynamics model to update the initial inverse dynamics model, performing initial warm-up training on an initial world model to update the initial world model, and performing initial warm-up adjustment on the initial latent action codebook to update the initial latent action codebook; and according to the second loss, performing initial warm-up training on the first mapper to update the first mapper, and performing initial warm-up training on the second mapper to update the second mapper.
  6. The model training method according to claim 2, wherein performing modality representation alignment on the original text representation to determine the third loss comprises: performing modality mapping on the original text representation by using a first mapper to obtain a second mapped multimodal representation; performing modality mapping on the second mapped multimodal representation by using a second mapper to obtain a second mapped text representation; and determining the third loss from the original text representation and the second mapped text representation.
  7. The model training method according to claim 2, wherein performing state transition modeling on the second original state sequence to determine the fourth loss comprises: performing state transition action estimation on the second original state sequence by using an initial inverse dynamics model to obtain a second estimated action, wherein coding information of the second estimated action is sampled from the initial latent action codebook; performing state deduction on the second estimated action and a second mapped multimodal representation by using an initial world model to obtain a second reconstructed state sequence, wherein the second mapped multimodal representation is obtained by performing modality mapping on the original text representation with a first mapper; and determining the fourth loss from the second original state sequence and the second reconstructed state sequence.
  8. The model training method according to claim 2, wherein optimizing the initial latent action codebook based on the first loss, the second loss, the third loss, and the fourth loss to obtain the target latent action codebook comprises: calculating a joint loss from the first loss, the second loss, the third loss, and the fourth loss; and jointly optimizing the initial inverse dynamics model, the initial world model, the first mapper, the second mapper, and the initial latent action codebook according to the joint loss, and obtaining a target inverse dynamics model, a target world model, a target modality mapper, and the target latent action codebook after the optimization converges.
  9. The model training method according to any one of claims 1 to 8, wherein the second corpus data comprises a task multimodal corpus associated with the multimodal interaction task, and generating the training set by using the second corpus data and the latent action space comprises: extracting a state transition sequence from the task multimodal corpus; performing action estimation on the state transition sequence by using a target inverse dynamics model to obtain a target estimated action, wherein coding information of the target estimated action is sampled from a target latent action codebook corresponding to the latent action space; and constructing the training set based on the state transition sequence and the target estimated action, wherein real action labels corresponding to state samples in the training set are generated from the coding information of the target estimated action.
  10. The model training method according to any one of claims 1 to 8, wherein the initial interaction processing model comprises an initial policy model, and training the initial interaction processing model based on the training set to obtain the intermediate interaction processing model comprises: inputting state samples from the training set into the initial policy model to perform action inference and obtain an inference result; calculating a fifth loss based on the inference result and the real action labels in the training set; and adjusting model parameters of the initial policy model according to the fifth loss to obtain an intermediate policy model in the intermediate interaction processing model.
  11. The model training method according to any one of claims 1 to 8, wherein performing reinforcement training on the intermediate interaction processing model based on the latent action space and the multimodal prompt data to obtain the target interaction processing model comprises: determining an initial state based on the multimodal prompt data; performing state deduction starting from the initial state by using a target world model, the latent action space, and an intermediate policy model in the intermediate interaction processing model to obtain an interaction trajectory; determining a sixth loss by using the interaction trajectory and a target reward function; adjusting model parameters of the intermediate policy model according to the sixth loss to obtain a target policy model; and deploying the target interaction processing model based on the target policy model and the target world model.
  12. The model training method according to claim 11, wherein performing state deduction starting from the initial state by using the target world model, the latent action space, and the intermediate policy model to obtain the interaction trajectory comprises: taking the initial state as a current state corresponding to a first iteration step of the state deduction; driving the intermediate policy model to sample from the latent action space according to the current state corresponding to the current iteration step to obtain a current latent action, and driving the target world model to perform state deduction on the current state and the current latent action to obtain the current state corresponding to the next iteration step; and in response to a termination condition of the state deduction being satisfied, deriving the interaction trajectory, wherein the interaction trajectory comprises the current state and the current latent action corresponding to each of a plurality of iteration steps from the first iteration step to the last iteration step.
  13. A task processing method, comprising: acquiring multimodal interaction task data; and performing task processing on the multimodal interaction task data by using a target interaction processing model to obtain a task processing result; wherein the target interaction processing model is generated according to the model training method of any one of claims 1 to 12.
  14. A task processing method, comprising: acquiring role-playing task data; and performing task processing on the role-playing task data by using a target interaction processing model to obtain a role-playing interaction response; wherein the target interaction processing model is generated according to the model training method of any one of claims 1 to 12.
  15. A task processing method, comprising: acquiring a task processing request through a first application programming interface, wherein request data carried in the task processing request comprises multimodal interaction task data; and returning a task processing response through a second application programming interface, wherein response data carried in the task processing response comprises a task processing result, the task processing result is obtained by performing task processing on the multimodal interaction task data with a target interaction processing model, and the target interaction processing model is generated according to the model training method of any one of claims 1 to 12.
  16. A task processing method, comprising: acquiring a currently input task processing dialogue request, wherein request data carried in the task processing dialogue request comprises multimodal interaction task data; in response to the task processing dialogue request, returning a task processing dialogue reply, wherein information carried in the task processing dialogue reply comprises a task processing result, the task processing result is obtained by performing task processing on the multimodal interaction task data with a target interaction processing model, and the target interaction processing model is generated according to the model training method of any one of claims 1 to 12; and displaying the task processing result in a graphical user interface.
  17. A task processing method, comprising: in response to an input instruction acting on an operation interface, displaying multimodal interaction task data on the operation interface; and in response to a processing instruction acting on the operation interface, displaying a task processing result on the operation interface; wherein the task processing result is obtained by performing task processing on the multimodal interaction task data with a target interaction processing model, and the target interaction processing model is generated according to the model training method of any one of claims 1 to 12.
  18. A task processing system, comprising: a client configured to send multimodal interaction task data; and a server connected to the client and configured to perform task processing on the multimodal interaction task data with a target interaction processing model to obtain a task processing result, wherein the target interaction processing model is generated according to the model training method of any one of claims 1 to 12; wherein the client is further configured to output the task processing result.
  19. An electronic device, comprising: a memory storing an executable program; and a processor configured to execute the program, wherein the program, when executed, performs the model training method of any one of claims 1 to 12 or the task processing method of any one of claims 13 to 17.
  20. A computer-readable storage medium, comprising a stored executable program, wherein the executable program, when run, controls a device in which the computer-readable storage medium is located to perform the model training method of any one of claims 1 to 12 or the task processing method of any one of claims 13 to 17.
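The codebook construction described in claims 1-3 resembles vector-quantized latent-action learning: an inverse dynamics model estimates an action from a state transition, the estimate is snapped to the nearest codebook entry, and a world model rolls the state forward so the reconstruction error can be scored. The patent does not disclose model internals, so the following is a minimal illustrative sketch only; all names, dimensions, and the linear stand-ins for the inverse dynamics model and world model are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the patent does not specify any dimensions.
STATE_DIM, ACTION_DIM, CODEBOOK_SIZE = 8, 4, 16

codebook = rng.normal(size=(CODEBOOK_SIZE, ACTION_DIM))               # initial latent-action codebook
W_inv = rng.normal(size=(2 * STATE_DIM, ACTION_DIM)) * 0.1            # linear stand-in: inverse dynamics model
W_world = rng.normal(size=(STATE_DIM + ACTION_DIM, STATE_DIM)) * 0.1  # linear stand-in: world model

def quantize(a):
    """Snap a continuous action estimate to its nearest codebook entry."""
    idx = int(np.argmin(((codebook - a) ** 2).sum(axis=1)))
    return idx, codebook[idx]

def reconstruction_loss(s_t, s_next):
    """First loss in the spirit of claim 3: estimate an action from the
    transition (s_t, s_next), quantize it against the codebook, deduce the
    next state with the world model, and score the reconstruction."""
    a_est = np.concatenate([s_t, s_next]) @ W_inv   # inverse-dynamics action estimate
    _, a_q = quantize(a_est)                        # coding information sampled from the codebook
    s_rec = np.concatenate([s_t, a_q]) @ W_world    # world-model state deduction
    return float(((s_rec - s_next) ** 2).mean())    # MSE vs. the original next state

s_t, s_next = rng.normal(size=STATE_DIM), rng.normal(size=STATE_DIM)
loss = reconstruction_loss(s_t, s_next)
print(round(loss, 4))
```

In a real system the three components would be neural networks optimized jointly with the codebook (claim 8's joint loss); the linear maps here only make the data flow concrete.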
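Claims 4 and 6 describe a pair of cross-modal mappers with alignment and round-trip objectives: the first mapper takes text representations to the multimodal space, the second does the reverse, and the third loss checks that a text representation survives the text-to-multimodal-to-text round trip. A small sketch under the same caveats (hypothetical linear mappers, MSE as the distance, which the patent does not specify):

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 8  # hypothetical shared embedding width

M1 = rng.normal(size=(DIM, DIM)) * 0.3  # first mapper: text rep -> multimodal rep
M2 = rng.normal(size=(DIM, DIM)) * 0.3  # second mapper: multimodal rep -> text rep

def mse(x, y):
    return float(((x - y) ** 2).mean())

def alignment_loss(mm_rep, paired_text_rep):
    """Second loss in the spirit of claim 4: map each modality into the
    other and compare against the paired representation there."""
    mapped_mm = paired_text_rep @ M1   # text -> multimodal
    mapped_text = mm_rep @ M2          # multimodal -> text
    return mse(mm_rep, mapped_mm) + mse(paired_text_rep, mapped_text)

def cycle_loss(text_rep):
    """Third loss in the spirit of claim 6: the text -> multimodal -> text
    round trip should reproduce the original text representation."""
    return mse(text_rep, (text_rep @ M1) @ M2)

mm, txt = rng.normal(size=DIM), rng.normal(size=DIM)
print(round(alignment_loss(mm, txt), 4), round(cycle_loss(txt), 4))
```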
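The reinforcement stage of claims 11-12 alternates two operations: the intermediate policy samples a latent action from the codebook given the current state, and the world model deduces the next state, until a termination condition holds. An illustrative loop (a fixed step budget stands in for the unspecified termination condition; all components are hypothetical linear stand-ins):

```python
import numpy as np

rng = np.random.default_rng(2)
STATE_DIM, ACTION_DIM, CODEBOOK_SIZE, MAX_STEPS = 8, 4, 16, 5

codebook = rng.normal(size=(CODEBOOK_SIZE, ACTION_DIM))
W_policy = rng.normal(size=(STATE_DIM, CODEBOOK_SIZE)) * 0.1          # stand-in policy: state -> codebook logits
W_world = rng.normal(size=(STATE_DIM + ACTION_DIM, STATE_DIM)) * 0.1  # stand-in world model

def policy_sample(state):
    """Score every codebook entry and sample one latent-action index."""
    logits = state @ W_policy
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return int(rng.choice(CODEBOOK_SIZE, p=p))

def rollout(initial_state):
    """State deduction per claim 12: iterate policy sampling and world-model
    stepping; the trajectory records (state, latent-action index) per step."""
    trajectory, state = [], initial_state
    for _ in range(MAX_STEPS):  # termination condition: step budget (assumed)
        idx = policy_sample(state)
        next_state = np.concatenate([state, codebook[idx]]) @ W_world
        trajectory.append((state, idx))
        state = next_state
    return trajectory

traj = rollout(rng.normal(size=STATE_DIM))
print(len(traj))
```

The resulting trajectory is what claim 11's target reward function would score to produce the sixth loss for the policy update.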

Description

Model training method, task processing method, system, electronic device, storage medium, and computer program product

Technical Field

The application relates to the technical fields of artificial intelligence and multimodal data processing, and in particular to a model training method, a task processing method, a system, an electronic device, a storage medium, and a computer program product.

Background

As multimodal large language models demonstrate strong capability in interaction tasks such as visual question answering and image captioning, their application scenarios continue to expand into the role-playing field, which requires deep situational understanding and sustained interaction. In role-playing application scenarios in particular (e.g., virtual assistants, immersive entertainment, and emotional companions), the need for agents to maintain role consistency and natural interaction in open-domain multimodal conversations is becoming increasingly prominent. Currently, technical solutions in the related field generally adopt a token-granularity training paradigm, realizing model optimization by learning token transition probabilities in a domain-knowledge injection stage and optimizing the token generation policy in a reinforcement learning stage. However, this approach has notable drawbacks: on the one hand, token-level knowledge injection introduces a large amount of symbol and modality noise irrelevant to role semantics, weakening the model's focus on task-core knowledge; on the other hand, the high-dimensional, sparse token action space makes reinforcement-learning exploration inefficient and policy convergence difficult, constraining the agent's generalization in diverse interaction scenarios.
In view of the above problems, no effective solution has been proposed at present.

Disclosure of Invention

The embodiments of the application provide a model training method, a task processing method, a system, an electronic device, a storage medium, and a computer program product, so as to at least solve the technical problems of low efficiency and insufficient generalization capability of model training methods in the related art. According to one aspect of the embodiments of the application, a model training method is provided, comprising: constructing a training set corresponding to a latent action space by using multimodal corpus data, wherein the latent action space is characterized by multimodal latent actions; training an initial interaction processing model based on the training set to obtain an intermediate interaction processing model; and performing reinforcement training on the intermediate interaction processing model based on the latent action space and multimodal prompt data to obtain a target interaction processing model, wherein the target interaction processing model is used to process multimodal interaction tasks and generate task processing results. According to another aspect of the embodiments of the application, a task processing method is provided, comprising: acquiring multimodal interaction task data; and performing task processing on the multimodal interaction task data by using a target interaction processing model to obtain a task processing result, wherein the target interaction processing model is generated according to any one of the model training methods above.
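The intermediate supervised stage described above trains the policy against latent-action labels produced for each state sample (the "real action labels" of the claims), typically via a classification loss over codebook indices. As an illustration only (hypothetical linear policy, cross-entropy standing in for the unspecified fifth loss):

```python
import numpy as np

rng = np.random.default_rng(3)
STATE_DIM, CODEBOOK_SIZE = 8, 16

W_policy = rng.normal(size=(STATE_DIM, CODEBOOK_SIZE)) * 0.1  # stand-in policy model

def fifth_loss(state, true_action_idx):
    """Cross-entropy between the policy's distribution over codebook indices
    and the latent-action label inferred by the target inverse dynamics model."""
    logits = state @ W_policy
    log_p = logits - logits.max()
    log_p = log_p - np.log(np.exp(log_p).sum())  # numerically stable log-softmax
    return float(-log_p[true_action_idx])

s = rng.normal(size=STATE_DIM)
print(round(fifth_loss(s, 3), 4))
```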
According to another aspect of the embodiments of the application, a task processing method is provided, comprising: acquiring role-playing task data; and performing task processing on the role-playing task data by using a target interaction processing model to obtain a role-playing interaction response, wherein the target interaction processing model is generated according to any one of the model training methods above. According to another aspect of the embodiments of the application, a task processing method is provided, comprising: acquiring a task processing request through a first application programming interface, wherein request data carried in the task processing request comprises multimodal interaction task data; and returning a task processing response through a second application programming interface, wherein response data carried in the task processing response comprises a task processing result, the task processing result is obtained by performing task processing on the multimodal interaction task data with a target interaction processing model, and the target interaction processing model is generated according to any one of the model training methods above. According to another aspect of the embodiments of the application, a task processing method is provided, which compris