CN-121981161-A - Multimodal large model training method and system for GUI process automation agent
Abstract
The invention belongs to the technical field of artificial intelligence, and particularly relates to a multimodal large model training method and system for a GUI (graphical user interface) process automation agent. The method comprises: S1, semi-automatic data construction, constructing training data comprising a supervised fine-tuning dataset and a reinforcement learning dataset, the reinforcement learning dataset comprising structured chains of thought generated by a teacher model; S2, supervised fine-tuning, training a multimodal large model with the supervised fine-tuning dataset so that it acquires the ability to predict target coordinates from screen images and natural language instructions; S3, reinforcement learning training, training the supervised fine-tuned multimodal large model with the reinforcement learning dataset based on the Group Relative Policy Optimization (GRPO) method, wherein the output of the multimodal large model is evaluated by a multi-dimensional hybrid reward function.
Inventors
- SONG ZHILONG
- OUYANG XIAOGANG
- WANG CAN
- CHEN JIAWEI
- SONG MINGLI
- SUN LINJUN
- SUN YUEGANG
- SUN LINCHUN
- GAO YANG
Assignees
- 浙江实在智能科技有限公司
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-04-07
Claims (9)
- 1. A multimodal large model training method for a GUI process automation agent, characterized by comprising the following steps: S1, semi-automatic data construction: constructing training data comprising a supervised fine-tuning dataset and a reinforcement learning dataset, the reinforcement learning dataset comprising structured chains of thought generated by a teacher model; S2, supervised fine-tuning: training the multimodal large model with the supervised fine-tuning dataset so that it acquires the ability to predict target coordinates from screen images and natural language instructions; S3, reinforcement learning training: training the supervised fine-tuned multimodal large model with the reinforcement learning dataset based on the Group Relative Policy Optimization (GRPO) method, wherein the output of the multimodal large model is evaluated by a multi-dimensional hybrid reward function.
- 2. The multimodal large model training method for a GUI process automation agent according to claim 1, wherein step S1 comprises the steps of: S11, collection: acquiring a screen image sequence and a corresponding sequence of user operation events in a non-invasive manner, each operation event comprising an action type and an operation coordinate; S12, parsing and conversion: identifying interactive elements in the screen image with a visual parsing model, obtaining each element's bounding box and description information, and converting the bounding box into a geometric center point coordinate label; S13, instruction generalization: generating diversified natural language instructions based on the description information; S14, dataset generation: generating the supervised fine-tuning dataset from the screen images, natural language instructions, and geometric center point coordinate labels, and generating, via the teacher model, the reinforcement learning dataset containing global task descriptions, step instructions, structured chains of thought, and ground-truth actions from the screen image sequence, the operation event sequence, and a preset prompt.
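Steps S12-S14 can be sketched as assembling one supervised fine-tuning sample per recorded interaction. This is a minimal illustration; the field names (`image`, `instruction`, `label`) are assumptions, not the patent's actual schema:

```python
def build_sft_sample(screen_image_path: str, instruction: str,
                     bbox: tuple) -> dict:
    """Assemble one supervised fine-tuning sample (steps S12-S14).

    The parsed bounding box (upper-left x1, y1 and lower-right x2, y2)
    is converted into a geometric-center coordinate label.
    Field names are illustrative, not from the patent.
    """
    x1, y1, x2, y2 = bbox
    return {
        "image": screen_image_path,
        "instruction": instruction,
        "label": ((x1 + x2) / 2, (y1 + y2) / 2),
    }
```

A sample built this way pairs a screenshot and a generalized instruction with a single click-point label, which matches the SFT objective of predicting target coordinates.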
- 3. The multimodal large model training method for a GUI process automation agent according to claim 2, wherein in step S12 the bounding box is converted into the geometric center point coordinate label by: (x_c, y_c) = ((x_1 + x_2)/2, (y_1 + y_2)/2); wherein (x_c, y_c) is the geometric center point of the bounding box; x_1 is the abscissa of the upper-left corner of the bounding box; x_2 is the abscissa of the lower-right corner; y_1 is the ordinate of the upper-left corner; y_2 is the ordinate of the lower-right corner.
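The center-point conversion of claim 3 is a two-line computation:

```python
def bbox_center(x1: float, y1: float, x2: float, y2: float) -> tuple:
    """Convert a bounding box with upper-left corner (x1, y1) and
    lower-right corner (x2, y2) into its geometric center point."""
    return ((x1 + x2) / 2, (y1 + y2) / 2)
```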
- 4. The multimodal large model training method for a GUI process automation agent according to claim 3, wherein in the reinforcement learning dataset generated in step S14, the structured chain of thought comprises the following nodes: a Task node describing the global task; a HistoryAnalysis node analyzing the historical operating state; a StepPrediction node predicting the goal of the current step; and an ActionPrediction node predicting the impending action.
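The four-node structure of claim 4 can be modeled as a simple record type. The text payloads below are invented examples for illustration only:

```python
from dataclasses import dataclass

@dataclass
class StructuredChainOfThought:
    """The four-node structured chain of thought of claim 4.
    Field names mirror the claimed node names."""
    task: str               # Task node: global task description
    history_analysis: str   # HistoryAnalysis node: past operating state
    step_prediction: str    # StepPrediction node: current step goal
    action_prediction: str  # ActionPrediction node: the impending action

# Illustrative instance (contents are hypothetical):
cot = StructuredChainOfThought(
    task="Submit the monthly expense report",
    history_analysis="The login form was completed in the previous step",
    step_prediction="Open the 'Reports' menu",
    action_prediction="click(412, 88)",
)
```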
- 5. The multimodal large model training method for a GUI process automation agent according to claim 4, wherein in step S3 the total reward R_total of the multi-dimensional hybrid reward function is calculated as: R_total = w_1·R_format + w_2·R_think + w_3·R_action + w_4·R_center; wherein R_format is the format compliance reward; R_think is the multi-node fine-grained thinking logic coherence reward; R_action is the action type reward; R_center is the center-trending geometric reward; and w_1, w_2, w_3, w_4 are weight coefficients.
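The weighted sum of claim 5 is straightforward to implement. The weight values below are illustrative hyperparameters, not values disclosed by the patent:

```python
def total_reward(r_format: float, r_think: float, r_action: float,
                 r_center: float,
                 weights: tuple = (0.1, 0.3, 0.3, 0.3)) -> float:
    """R_total = w1*R_format + w2*R_think + w3*R_action + w4*R_center.
    The default weights are illustrative, not from the patent."""
    w1, w2, w3, w4 = weights
    return w1 * r_format + w2 * r_think + w3 * r_action + w4 * r_center
```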
- 6. The multimodal large model training method for a GUI process automation agent according to claim 5, wherein the multi-node fine-grained thinking logic coherence reward uses a pre-trained text embedding model to compute a semantic similarity for each node of the structured chain of thought, specifically: R_think = (1/|N|) · Σ_{k∈N} cos(E(t_k), E(l_k)); wherein N = {Task, HistoryAnalysis, StepPrediction, ActionPrediction} is the set of four nodes in the structured chain of thought; t_k is the text content of the k-th thinking node generated by the supervised fine-tuned multimodal large model; l_k is the label text of the k-th thinking node in the data; E(·) is the semantic vector extraction function; and cos(·,·) is the cosine similarity.
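The per-node averaged cosine similarity of claim 6 can be sketched as follows. The `embed` callable stands in for the pre-trained text embedding model E(·), which the patent does not name:

```python
import math

def cosine(u: list, v: list) -> float:
    """Cosine similarity between two vectors; 0.0 for zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def thought_coherence_reward(generated: dict, labels: dict, embed) -> float:
    """R_think = (1/|N|) * sum over nodes of cos(E(t_k), E(l_k)).

    `generated` and `labels` map node names to text; `embed` is any
    text-to-vector function standing in for the embedding model E(.)."""
    nodes = ["Task", "HistoryAnalysis", "StepPrediction", "ActionPrediction"]
    return sum(cosine(embed(generated[k]), embed(labels[k]))
               for k in nodes) / len(nodes)
```

With identical generated and label texts the reward is 1.0; divergent node texts pull it toward 0.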
- 7. The multimodal large model training method for a GUI process automation agent according to claim 5, wherein the center-trending geometric reward for GUI element localization is: R_center = f(||p − c(B)||) = exp(−||p − c(B)||² / (2σ²)); wherein p is the coordinate predicted by the model; σ is a distance scaling factor, which is a hyperparameter; B is the ground-truth bounding box and c(B) its geometric center; and f is a function that decays with distance, specifically a Gaussian radial basis function.
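A minimal sketch of the claim-7 reward, assuming the Gaussian radial basis function is applied to the Euclidean distance between the predicted coordinate and the bounding-box center; the default σ of 20 pixels is an illustrative value, not from the patent:

```python
import math

def center_trending_reward(pred: tuple, bbox: tuple,
                           sigma: float = 20.0) -> float:
    """Gaussian RBF reward decaying with the distance between the
    predicted point and the ground-truth bounding-box center.
    sigma is the distance scaling hyperparameter (illustrative default)."""
    x1, y1, x2, y2 = bbox
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    d2 = (pred[0] - cx) ** 2 + (pred[1] - cy) ** 2
    return math.exp(-d2 / (2 * sigma ** 2))
```

The reward is 1.0 at the exact center and falls off smoothly with distance, giving a dense gradient signal rather than a binary hit/miss.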
- 8. The multimodal large model training method for a GUI process automation agent according to claim 1, wherein in step S3 the Group Relative Policy Optimization (GRPO) based method trains the supervised fine-tuned multimodal large model with the reinforcement learning dataset by the following steps: S31, taking a plurality of trajectories generated by the supervised fine-tuned multimodal large model in the same state as a group; S32, computing the total reward score of each trajectory in the group with the hybrid reward function; S33, computing an advantage value for each trajectory from the mean and standard deviation of the reward scores within the group; and S34, updating the parameters of the multimodal large model according to the advantage values, raising the generation probability of trajectories with higher advantage values and suppressing that of trajectories with lower advantage values.
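The group-relative advantage of steps S31-S33 is a per-group normalization of reward scores. A minimal sketch (the policy-gradient update of S34 is not shown; the epsilon term is a common numerical-stability assumption):

```python
import statistics

def group_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Per-trajectory advantage within one GRPO group:
    A_i = (r_i - mean(r)) / (std(r) + eps).
    `rewards` are the hybrid-reward scores of trajectories sampled
    from the same state (steps S31-S33)."""
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]
```

Trajectories scoring above the group mean receive positive advantages (their generation probability is raised in S34), those below receive negative advantages.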
- 9. A multimodal large model training system for a GUI process automation agent, for implementing the multimodal large model training method for a GUI process automation agent of any of claims 1-8, the system comprising: a semi-automatic data construction module for constructing training data comprising a supervised fine-tuning dataset and a reinforcement learning dataset, the reinforcement learning dataset comprising structured chains of thought generated by a teacher model; a supervised fine-tuning module for training the multimodal large model with the supervised fine-tuning dataset so that it acquires the ability to predict target coordinates from screen images and natural language instructions; and a reinforcement learning training and hybrid reward module for training the supervised fine-tuned multimodal large model with the reinforcement learning dataset based on the Group Relative Policy Optimization (GRPO) method, wherein the output of the multimodal large model is evaluated by a multi-dimensional hybrid reward function.
Description
Multimodal large model training method and system for GUI process automation agent

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to a multimodal large model training method and system for a GUI (graphical user interface) process automation agent.

Background

Graphical user interface (GUI) process automation technology aims to simulate a human user's clicking, typing, dragging, and other actions through an intelligent agent, so as to execute GUI business processes automatically. The technology has evolved from rule-based RPA (which relies on the underlying DOM tree or control handles and is fragile to interface changes) to vision-based intelligent agents. The current frontier direction is to use a multimodal large language model (MLLM) end to end: the model directly takes a screenshot as input and outputs mouse click coordinates or keyboard instructions, eliminating dependence on underlying code. Current GUI agent schemes based on multimodal large models mainly comprise the following: 1. GUI agents based on general-purpose multimodal large models: recently, general pre-trained models such as Qwen-VL and GPT-4o have been used directly as the GUI perception-decision component; the model receives high-resolution screenshots and outputs natural language reasoning together with actions and coordinates to drive the RPA process. 2. Reinforcement learning fine-tuning with sparse rewards: to adapt the model to GUI tasks, existing work fine-tunes with RL using only "task success or failure" or a generic "thinking process semantic similarity" as the reward signal to update the policy model.
Current GUI agent schemes based on multimodal large models have the following core defects: 1. Insufficient localization precision for dense small targets in GUI scenes: general-purpose multimodal models (e.g., Qwen-VL, GPT-4o) are mainly trained on natural images, in which objects (e.g., cats, dogs, cars) are typically large and have distinct boundary features. GUI interfaces, however, exhibit a significant domain gap: they contain a large number of very small interactive elements (e.g., 16x16 pixel function icons) and high-density layouts (e.g., closely packed toolbar buttons). When processing such high-resolution, high-information-density GUI images, a general model can output coordinates but often cannot reach the pixel-level accuracy required for RPA automation; the click position easily falls outside the effective area of the target element, causing the automation task to fail. 2. Lack of task-oriented structured reasoning: existing training lets the model think before deciding, but imposes no explicit constraint on the thinking process and ignores domain differences in how the model reasons, so the model cannot always reason and act in a way appropriate to the GUI automation task. 3. Sparse and inefficient reinforcement learning feedback: most existing methods use sparse rewards (only final success or failure) or a generic "thinking process semantic similarity", and lack structured process supervision. Without reward designs tailored to GUI spatial characteristics, they cannot specifically improve the model's reasoning logic and localization accuracy on GUI tasks.
It is therefore important to design a multimodal large model training method and system for a GUI process automation agent that achieve semi-automatic data construction, explicit constraints on the thinking process, and improved action decision precision and element localization precision.

Disclosure of Invention

To address the problems in existing multimodal-large-model-based GUI agent schemes of insufficient localization precision for dense small targets in GUI scenes, lack of task-oriented structured reasoning, and sparse, inefficient reinforcement learning feedback, the invention provides a multimodal large model training method and system for a GUI process automation agent that achieve semi-automatic data construction, explicit constraints on the thinking process, and improved action decision precision and element localization precision. To achieve the object of the invention, the invention adopts the following technical scheme: a multimodal large model training method for a GUI process automation agent, comprising the following steps: S1, semi-automatic data construction: constructing training data comprising a supervised fine-tuning dataset and a reinforcement learning dataset, the reinforcement learning dataset comprising