CN-122021927-A - Training method of visual language model, multi-agent system and task execution method
Abstract
An embodiment of the present disclosure provides a training method, device, and equipment for a visual language model for web page interaction, a multi-agent system, and a task execution method based on the multi-agent system. The method comprises: obtaining a task instruction for a target website; performing multi-round online interaction with the web page environment of the target website based on the task instruction to form interaction trajectory data corresponding to the multi-round online interaction, wherein in each round of interaction, current web page visual information and the task instruction are input into a visual language model to be trained so as to obtain and execute the web page interaction action output by the visual language model; generating a reward signal for model training based on the interaction trajectory data; and updating parameters of the visual language model according to the reward signal, thereby realizing online reinforcement learning training of the visual language model.
Inventors
- GUO YUYU
- YANG WENJIE
- HUANG YANG
- HAO GUOLIANG
- YU HANG
- LEI LEI
- DI PENG
Assignees
- 支付宝(杭州)数字服务技术有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-13
Claims (20)
- 1. A training method for a visual language model for web page interaction, comprising: acquiring a task instruction for a target website; performing multi-round online interaction with the web page environment of the target website based on the task instruction to form interaction trajectory data corresponding to the multi-round online interaction, wherein in each round of interaction, current web page visual information and the task instruction are input into a visual language model to be trained so as to acquire and execute the web page interaction action output by the visual language model; generating a reward signal for model training based on the interaction trajectory data; and updating parameters of the visual language model according to the reward signal.
- 2. The method of claim 1, further comprising: acquiring an initial access address of the target website; and acquiring initial web page visual information of the target website based on the initial access address; wherein the performing multi-round online interaction with the web page environment of the target website based on the task instruction comprises: inputting the task instruction and the initial web page visual information into the visual language model to be trained, and obtaining the web page interaction action output by the visual language model; and executing the web page interaction action in the web page environment of the target website.
- 3. The method of claim 1, wherein the acquiring the task instruction for the target website comprises: synthesizing, by a query agent, the task instruction for the target website.
- 4. The method of claim 1, wherein the multi-round online interaction continues until at least one of the following termination conditions is met: the visual language model outputs a response signal indicating completion of the task; the number of interaction steps reaches a preset step-count threshold; or a blocking event of a preset type is detected during the interaction.
- 5. The method of claim 1, wherein in each round of interaction, the contextual information input to the visual language model comprises historical interaction information in text form, the historical interaction information comprising the visual language model's historical reasoning about the task instruction and the web page interaction actions performed.
- 6. The method of claim 1, wherein in each round of interaction, the contextual information input to the visual language model does not include historical web page visual information.
- 7. The method of claim 1, wherein the reward signal comprises an outcome reward based on an overall evaluation of the interaction trajectory data, and a process reward based on an individual evaluation of each interaction step in the interaction trajectory data.
- 8. The method of claim 7, wherein the outcome reward is obtained by: inputting the interaction trajectory data and the task instruction into an evaluation model; obtaining an evaluation result output by the evaluation model, wherein the evaluation result comprises a quantization score for at least one dimension among task completion, action effectiveness, and trajectory efficiency; and calculating the outcome reward from the quantization scores.
- 9. The method of claim 8, further comprising: if the evaluation model outputs a designated score for the task completion dimension, removing the interaction trajectory data from the reinforcement learning training set, wherein the designated score indicates that a blocking event of a preset type occurred during execution of the task instruction.
- 10. The method of claim 7, wherein the process reward is obtained by: judging, for each interaction operation in the interaction trajectory data, whether the operation is valid based on a preset rule; determining a score for each interaction operation according to the judgment result; and calculating the process reward of the interaction trajectory data based on the scores of the interaction operations.
- 11. The method of claim 10, wherein the interaction trajectory data comprises a first web page state before a target interaction operation is performed and a second web page state after the target interaction operation is performed; and the judging whether the target interaction operation is valid based on the preset rule comprises: judging whether the target interaction operation is valid based on a comparison of the first web page state and the second web page state.
- 12. The method of claim 11, wherein the first web page state comprises a first URL and the second web page state comprises a second URL, and wherein the judging whether the target interaction operation is valid based on the comparison of the first web page state and the second web page state comprises: judging whether the second URL has changed relative to the first URL; and if it has changed, judging the target interaction operation to be a valid operation.
- 13. The method of claim 12, wherein the first web page state comprises first visual information and the second web page state comprises second visual information, and wherein the judging whether the target interaction operation is valid based on the comparison of the first web page state and the second web page state further comprises: calculating a visual similarity between the second visual information and the first visual information; and if the visual similarity is greater than or equal to a preset visual similarity threshold, judging the target interaction operation to be an invalid operation.
- 14. The method of claim 13, further comprising: if the visual similarity is smaller than the preset visual similarity threshold, inputting the first visual information, the second visual information, and the task instruction into a gain analysis model to obtain a gain analysis result; and if the gain analysis result indicates that the change in visual information has a positive benefit for completing the task instruction, judging the target interaction operation to be valid.
- 15. The method of claim 10, wherein the interaction trajectory data comprises target position coordinates corresponding to a target interaction operation, and wherein the judging whether the target interaction operation is valid based on the preset rule comprises: acquiring source code of the first web page state targeted by the target interaction operation; identifying, based on the source code, page position regions corresponding to interactable elements in the first web page state; judging whether the target position coordinates fall within any of the page position regions; and if the target position coordinates fall within a page position region, judging the target interaction operation to be a valid operation.
- 16. The method of claim 7, wherein the reward signal further comprises a format reward for evaluating whether the response output by the visual language model in each round of interaction meets a preset format specification, the format specification including at least one of: including reasoning process text, and conforming to a predetermined structured data format.
- 17. The method of claim 1, wherein the updating parameters of the visual language model according to the reward signal comprises: updating parameters of the visual language model according to the reward signal using a group relative policy optimization (GRPO) algorithm.
- 18. The method of claim 17, wherein when the model parameters are updated by the group relative policy optimization algorithm, the update magnitude of the model parameters is controlled by a KL divergence constraint, and the KL divergence constraint is applied to tokens in the model output sequence whose covariance is higher than a preset threshold.
- 19. The method of claim 1, further comprising: performing supervised fine-tuning training on the visual language model to be trained using multi-task supervised fine-tuning data, wherein the multi-task supervised fine-tuning data comprises at least two of the following types of data sets: a first type of data set for training page state transition prediction capability; a second type of data set for training instruction-to-action generation capability; and a third type of data set for training element positioning capability.
- 20. The method of claim 19, wherein a weighted loss function is employed when training with the multi-task supervised fine-tuning data, the weighted loss function assigning different weights to the losses of different task types, wherein the weight assigned to a target task type is inversely related to the number of samples of the target task type in the total training data.
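The three termination conditions of claim 4 can be sketched as a single predicate. The `"DONE"` completion marker, the step budget of 15, and the event encoding below are illustrative assumptions; the patent only requires that the interaction stop when at least one condition holds.

```python
def should_terminate(model_response, step_count, blocking_event, max_steps=15):
    """Claim 4: stop when the model signals task completion, the step budget
    is exhausted, or a blocking event of a preset type is detected."""
    return ("DONE" in model_response          # completion response signal
            or step_count >= max_steps        # step-count threshold reached
            or blocking_event is not None)    # e.g. a captcha or login wall
```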
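Claim 8's outcome reward folds the evaluation model's per-dimension quantization scores into one scalar. The equal-weight average below is an illustrative choice, as are the dimension keys; the patent does not fix the aggregation formula.

```python
def outcome_reward(scores, weights=None):
    """Claim 8: combine quantization scores for task completion, action
    effectiveness, and trajectory efficiency into a scalar outcome reward."""
    dims = ("completion", "effectiveness", "efficiency")
    # default to an equal-weight average over the three claimed dimensions
    weights = weights or {d: 1.0 / len(dims) for d in dims}
    return sum(weights[d] * scores[d] for d in dims)
```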
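The rule-based validity cascade of claims 12 to 15 and the process reward of claim 10 can be sketched together. The similarity threshold, score values, the `(left, top, right, bottom)` box representation for interactable regions, the `gain_fn` callable interface, and the mean aggregation are all illustrative assumptions.

```python
def step_valid(url_before, url_after, visual_sim, click_xy=None,
               interactable_boxes=(), sim_threshold=0.95, gain_fn=None):
    """Decide whether one interaction step is a valid operation."""
    # Claim 15: a click outside every interactable element region is invalid.
    if click_xy is not None:
        x, y = click_xy
        if not any(l <= x <= r and t <= y <= b
                   for (l, t, r, b) in interactable_boxes):
            return False
    # Claim 12: a URL change marks the step valid.
    if url_after != url_before:
        return True
    # Claim 13: near-identical screenshots with no URL change -> invalid.
    if visual_sim >= sim_threshold:
        return False
    # Claim 14: otherwise a gain-analysis model decides whether the visual
    # change has a positive benefit for the task.
    return bool(gain_fn()) if gain_fn else False

def process_reward(validities, valid_score=1.0, invalid_score=-0.5):
    """Claim 10: score each step by its validity, then average the scores."""
    scores = [valid_score if v else invalid_score for v in validities]
    return sum(scores) / len(scores)
```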
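Claims 17 and 18 invoke group relative policy optimization (GRPO), which replaces a learned value baseline with group statistics: each trajectory's advantage is its reward standardized against the mean and standard deviation of its sampling group. The sketch below shows only this advantage step; the clipped surrogate objective and the selective KL penalty on high-covariance tokens are omitted, and the population standard deviation with a zero-std fallback is an implementation assumption.

```python
import statistics

def group_relative_advantages(rewards):
    """Standardize each group member's reward against group statistics."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # constant-reward group -> all zeros
    return [(r - mu) / sigma for r in rewards]
```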
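Claim 20's loss weighting can be realized in several ways; normalized inverse frequency is one minimal reading of "inversely related to the number of samples", sketched below. The task-type names are hypothetical.

```python
def task_weights(sample_counts):
    """Claim 20: weight each task type's loss inversely to its sample count
    so under-represented task types are not drowned out during fine-tuning."""
    inv = {task: 1.0 / n for task, n in sample_counts.items()}
    total = sum(inv.values())
    return {task: w / total for task, w in inv.items()}  # weights sum to 1
```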
Description
Training method of visual language model, multi-agent system and task execution method

Technical Field

One or more embodiments of the present description relate to the field of artificial intelligence technology, and in particular to a training method for a visual language model for web page interaction. One or more embodiments of the present specification are also directed to a training apparatus for a visual language model for web page interaction, a computing device, a multi-agent system, and a method of performing tasks based on the multi-agent system.

Background

With the rapid development of large language models and visual language models, the construction of general-purpose agents capable of autonomously perceiving environments and performing complex tasks has become an important direction in the field of artificial intelligence. In a web page interaction scenario, an agent needs to complete multi-step, long-horizon interaction tasks in a dynamic and heterogeneous web page environment according to user instructions, and its core capability depends on a visual language model with visual understanding and accurate interaction abilities. Therefore, how to construct a visual language model suitable for web page interaction scenarios is a technical problem to be solved in the field.

Disclosure of Invention

In view of this, one or more embodiments of the present description provide a training method, an apparatus, a computing device, a multi-agent system, and a task execution method based on the multi-agent system, so as to provide an agent solution suitable for web page interaction.
According to a first aspect of one or more embodiments of the present specification, there is provided a training method for a visual language model for web page interaction, comprising: acquiring a task instruction for a target website; performing multi-round online interaction with the web page environment of the target website based on the task instruction to form interaction trajectory data corresponding to the multi-round online interaction, wherein in each round of interaction, current web page visual information and the task instruction are input into a visual language model to be trained so as to acquire and execute the web page interaction action output by the visual language model; generating a reward signal for model training based on the interaction trajectory data; and updating parameters of the visual language model according to the reward signal.

According to a second aspect of one or more embodiments of the present specification, there is provided a training apparatus for a visual language model for web page interaction, comprising: a task instruction acquisition module configured to acquire a task instruction for a target website; a web page interaction module configured to perform multi-round online interaction with the web page environment of the target website based on the task instruction to form interaction trajectory data corresponding to the multi-round online interaction, wherein in each round of interaction, current web page visual information and the task instruction are input into a visual language model to be trained so as to acquire and execute the web page interaction action output by the visual language model; a reward calculation module configured to generate a reward signal for model training based on the interaction trajectory data; and a parameter updating module configured to update parameters of the visual language model according to the reward signal.
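The first aspect's four steps reduce to a rollout-then-update loop. The sketch below shows only the rollout half with a toy environment; `MockEnv`, `collect_trajectory`, the integer page states, and the `"click"` action are illustrative assumptions, not the patent's implementation.

```python
class MockEnv:
    """Toy stand-in for a web page environment: states are integers and a
    'click' action advances the state until a goal state is reached."""
    def __init__(self, goal=3):
        self.state, self.goal = 0, goal
    def reset(self):
        self.state = 0
        return self.state                       # current page visual info
    def step(self, action):
        if action == "click":
            self.state += 1
        return self.state, self.state >= self.goal

def collect_trajectory(env, policy, instruction, max_steps=10):
    """One episode of multi-round online interaction (first-aspect step 2):
    feed the current observation and task instruction to the policy, execute
    the returned action, and record the transition."""
    trajectory = []
    obs = env.reset()
    for _ in range(max_steps):
        action = policy(obs, instruction)       # VLM stand-in
        next_obs, done = env.step(action)
        trajectory.append((obs, action, next_obs))
        obs = next_obs
        if done:
            break
    return trajectory

# A reward signal would then be computed over `traj` and used for the update.
traj = collect_trajectory(MockEnv(), lambda obs, instr: "click", "reach state 3")
```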
According to a third aspect of one or more embodiments of the present specification, there is provided a computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the processor, when executing the computer instructions, implements the steps of the training method for a visual language model for web page interaction.

According to a fourth aspect of one or more embodiments of the present specification, there is provided a multi-agent system comprising: a planning agent configured to receive a user request and generate a semantic instruction based on current task execution visual information; a positioning agent comprising a visual language model obtained by the above training method for a visual language model for web page interaction, the positioning agent being configured to receive the semantic instruction and the current task execution visual information and output a web page interaction action; and a reflection agent configured to judge the current task state according to the task execution visual information before and after the web page interaction action is executed, and output decision information for controlling the task flow; wherein the planning agent, the positioning agent, and the reflection agent are connected in sequence and cooperatively complete the user request through an iterative loop driven by the decision information.
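The fourth-aspect pipeline can be sketched as a loop over the three agents: the planner emits a semantic instruction, the positioning agent emits an action, and the reflection agent's decision drives the next iteration. All agent behaviors below are toy stand-ins (assumptions), not the patent's models, and the `"DONE"`/`"CONTINUE"` decision encoding is hypothetical.

```python
def run_pipeline(user_request, plan, position, reflect, max_rounds=10):
    """Iterate planning -> positioning -> reflection until the reflection
    agent's decision information ends the task flow."""
    history = []
    for _ in range(max_rounds):
        instruction = plan(user_request, history)   # semantic instruction
        action = position(instruction)              # web interaction action
        decision = reflect(history, action)         # task-flow decision info
        history.append((instruction, action, decision))
        if decision == "DONE":
            break
    return history

# Toy run: the reflection stand-in declares the task done on the third round.
hist = run_pipeline("book a flight",
                    plan=lambda req, h: f"step {len(h) + 1}",
                    position=lambda instr: "click",
                    reflect=lambda h, a: "DONE" if len(h) == 2 else "CONTINUE")
```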