CN-121980356-A - Discrimination method, model training device, discrimination system and related products

CN121980356A

Abstract

The disclosure provides a discrimination method, a model training method, a discrimination apparatus, a discrimination system, and related products, and belongs to the technical field of artificial intelligence. The discrimination method includes: when a discrimination request for a target agent is received, invoking a first discrimination model to perform a first discrimination on the target agent based on target execution data to obtain a first discrimination result; and, when the first discrimination result meets a preset screening condition, invoking a second discrimination model to perform a second discrimination on the target agent based on the target execution data to obtain a second discrimination result. The target execution data is execution data carried in the discrimination request that characterizes the target agent's execution of a target instruction, and the preset screening condition is used to determine, according to the certainty of the first discrimination result, whether to invoke the second discrimination model to re-discriminate the target agent. Embodiments of the disclosure can reduce resource overhead and latency.

Inventors

  • Request for anonymity

Assignees

  • 摩尔线程智能科技(北京)股份有限公司 (Moore Threads Intelligent Technology (Beijing) Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2026-04-01

Claims (18)

  1. A discrimination method, characterized in that it is applied to a discrimination system comprising a first discrimination model and a second discrimination model for discriminating an instruction execution condition of a graphical user interface agent, the number of parameters of the first discrimination model being smaller than that of the second discrimination model, the method comprising: when a discrimination request for a target agent is received, invoking the first discrimination model to perform a first discrimination on the target agent based on target execution data to obtain a first discrimination result; and, when the first discrimination result meets a preset screening condition, invoking the second discrimination model to perform a second discrimination on the target agent based on the target execution data to obtain a second discrimination result; wherein the target execution data is execution data carried in the discrimination request that characterizes the target agent's execution of a target instruction, and the preset screening condition is used to determine, according to the certainty of the first discrimination result, whether to invoke the second discrimination model to re-discriminate the target agent.
  2. The method of claim 1, wherein the first discrimination result includes a first discrimination conclusion characterizing whether the target agent passes discrimination and a confidence level corresponding to the first discrimination conclusion; after invoking the first discrimination model to perform the first discrimination on the target agent based on the target execution data to obtain the first discrimination result, the method further comprises: when the confidence level is less than or equal to a preset confidence threshold, determining that the first discrimination result meets the preset screening condition, obtaining the second discrimination result based on the second discrimination model, and taking the second discrimination result as a final discrimination result of the target agent; and, when the confidence level is greater than the preset confidence threshold, determining that the first discrimination result does not meet the preset screening condition, and taking the first discrimination result as the final discrimination result of the target agent.
  3. The method of claim 2, further comprising: sending the final discrimination result to the target agent to instruct the target agent to correct and re-execute the target instruction according to the final discrimination result.
  4. The method of claim 2, further comprising: sending the final discrimination result to a downstream agent of the target agent to instruct the downstream agent to re-execute the target instruction according to the final discrimination result, and to execute the instruction received by the downstream agent based on the execution result of the target instruction.
  5. The method of any one of claims 1-4, wherein invoking the first discrimination model to perform the first discrimination on the target agent based on the target execution data to obtain the first discrimination result comprises: inputting the target execution data into the first discrimination model, so that the first discrimination model determines a target area and its sub-state in an interface screenshot by visual scanning, and determines the first discrimination result based on the target area and its sub-state; wherein the interface screenshot is a screenshot taken while the target agent executes the target instruction on a preset graphical user interface.
  6. The method of claim 5, wherein the target execution data includes the target instruction and at least one interface screenshot; the processing procedure by which the first discrimination model performs the first discrimination includes: for any interface screenshot, determining a first target area and its first sub-state in the interface screenshot according to the target instruction; performing visual scanning on the interface screenshot to determine a second target area and its second sub-state in the interface screenshot; determining a first screenshot discrimination sub-result according to the first target area, the first sub-state, the second target area, and the second sub-state; and, when the first screenshot discrimination sub-results of all interface screenshots are obtained, determining the first discrimination result according to all the first screenshot discrimination sub-results; wherein the target area includes the first target area and the second target area, and the sub-state includes the first sub-state and the second sub-state.
  7. The method of claim 5, wherein the target execution data includes the target instruction, an operation sequence corresponding to the target instruction, and at least one interface screenshot, the operation sequence includes a plurality of operations, and a correspondence exists between the interface screenshot and at least one of the operations; the processing procedure by which the first discrimination model performs the first discrimination includes: for any interface screenshot, determining a first target area and its first sub-state in the interface screenshot according to the target instruction and the operation corresponding to the interface screenshot; performing visual scanning on the interface screenshot according to the operation corresponding to the interface screenshot to determine a second target area and its second sub-state in the interface screenshot; determining a second screenshot discrimination sub-result according to the first target area, the first sub-state, the second target area, and the second sub-state; and, when the second screenshot discrimination sub-results of all interface screenshots are obtained, determining the first discrimination result according to all the second screenshot discrimination sub-results; wherein the target area includes the first target area and the second target area, and the sub-state includes the first sub-state and the second sub-state.
  8. The method of any one of claims 1-4, wherein invoking the second discrimination model to perform the second discrimination on the target agent based on the target execution data to obtain the second discrimination result comprises: inputting the target execution data into the second discrimination model, so that the second discrimination model determines the second discrimination result through chain-of-thought reasoning.
  9. The method of claim 8, wherein the target execution data includes the target instruction and at least one interface screenshot, or includes the target instruction, an operation sequence corresponding to the target instruction, and at least one interface screenshot, wherein the interface screenshot is a screenshot taken while the target agent executes the target instruction on a preset graphical user interface, the operation sequence includes a plurality of operations, and a correspondence exists between the interface screenshot and at least one operation; the processing procedure by which the second discrimination model performs the second discrimination includes: performing chain-of-thought reasoning according to the target execution data to determine the actual execution state of the target agent; and determining the second discrimination result according to the actual execution state of the target agent.
  10. The method of any one of claims 1-4, wherein, when the first discrimination result characterizes that the target agent does not pass discrimination, the first discrimination result further includes first regression location information, the first regression location information being information, determined by the first discrimination model according to the target execution data, of a correct location in the graphical user interface corresponding to the error location of the target agent; or, when the second discrimination result characterizes that the target agent does not pass discrimination, the second discrimination result further includes second regression location information and/or reasoning text, the second regression location information being information, determined by the second discrimination model according to the target execution data, of a correct location in the graphical user interface corresponding to the error location of the target agent, and the reasoning text characterizing the chain-of-thought reasoning process and reasoning result of the second discrimination model.
  11. A model training method, comprising: inputting training data into an initial discrimination model to perform a second discrimination and obtain a predicted discrimination result, wherein the training data includes a sample instruction of a sample agent, an operation sequence corresponding to the sample instruction, and at least one interface screenshot taken while the sample agent executes the sample instruction on a graphical user interface, and the predicted discrimination result includes a predicted discrimination conclusion and a predicted reasoning text; inputting the predicted discrimination result and the at least one interface screenshot into a first discrimination model for semantic consistency analysis to obtain a consistency reward component, wherein the consistency reward component characterizes the degree of consistency between the predicted reasoning text and the predicted discrimination result; adjusting part of the model parameters of the initial discrimination model according to the consistency reward component and a preset general reward component; and, when a preset convergence condition is met, obtaining a second discrimination model based on the current initial discrimination model; wherein the first discrimination model and the second discrimination model constitute a discrimination system for performing the discrimination method of any one of claims 1-10.
  12. The method of claim 11, wherein, for any of the interface screenshots, the first discrimination model is configured to: perform semantic understanding on the interface according to the interface screenshot to obtain a semantic understanding result; determine the degree of consistency between the predicted reasoning text and the predicted discrimination result according to the semantic understanding result; and map the degree of consistency to a preset reward interval to obtain the consistency reward component, wherein the preset reward interval is a continuous scalar interval.
  13. A discrimination apparatus, characterized in that it is applied to a discrimination system comprising a first discrimination model and a second discrimination model for discriminating an instruction execution condition of a graphical user interface agent, the number of parameters of the first discrimination model being smaller than that of the second discrimination model, the apparatus comprising: a first discrimination module, configured to, when a discrimination request for a target agent is received, invoke the first discrimination model to perform a first discrimination on the target agent based on target execution data to obtain a first discrimination result; and a second discrimination module, configured to, when the first discrimination result meets a preset screening condition, invoke the second discrimination model to perform a second discrimination on the target agent based on the target execution data to obtain a second discrimination result; wherein the target execution data is execution data carried in the discrimination request that characterizes the target agent's execution of a target instruction, and the preset screening condition is used to determine, according to the certainty of the first discrimination result, whether to invoke the second discrimination model to re-discriminate the target agent.
  14. A model training apparatus, comprising: a prediction module, configured to input training data into an initial discrimination model to perform a second discrimination and obtain a predicted discrimination result, wherein the training data includes a sample instruction of a sample agent, an operation sequence corresponding to the sample instruction, and at least one interface screenshot taken while the sample agent executes the sample instruction on a graphical user interface, and the predicted discrimination result includes a predicted discrimination conclusion and a predicted reasoning text; an analysis module, configured to input the predicted discrimination result and the at least one interface screenshot into a first discrimination model for semantic consistency analysis to obtain a consistency reward component, the consistency reward component characterizing the degree of consistency between the predicted reasoning text and the predicted discrimination result; an adjustment module, configured to adjust part of the model parameters of the initial discrimination model according to the consistency reward component and a preset general reward component; and an acquisition module, configured to, when a preset convergence condition is met, obtain a second discrimination model based on the current initial discrimination model; wherein the first discrimination model and the second discrimination model constitute a discrimination system for performing the discrimination method of any one of claims 1-10.
  15. A discrimination system, characterized by comprising a first discrimination model and a second discrimination model which are cascaded, wherein the discrimination system is adapted to perform the discrimination method of any one of claims 1-10.
  16. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores one or more computer programs executable by the at least one processor to enable the at least one processor to perform the discrimination method of any one of claims 1-10 or the model training method of any one of claims 11-12.
  17. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the discrimination method of any one of claims 1-10 or the model training method of any one of claims 11-12.
  18. A computer program product comprising computer-readable code, or a non-transitory computer-readable storage medium carrying computer-readable code, which, when run in a processor of an electronic device, performs the discrimination method of any one of claims 1-10 or the model training method of any one of claims 11-12.
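As a rough illustration of the cascaded discrimination of claims 1 and 2, the sketch below routes a request through a small first model and escalates to the larger second model only when the first model's confidence is at or below a threshold. The model callables, the 0.85 threshold, the `Discrimination` fields, and the stub models are illustrative assumptions, not details taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Discrimination:
    passed: bool       # discrimination conclusion: did the agent pass?
    confidence: float  # certainty of the conclusion, assumed in [0, 1]

def cascade_discriminate(
    execution_data: dict,
    first_model: Callable[[dict], Discrimination],   # small, fast model
    second_model: Callable[[dict], Discrimination],  # large, slower model
    threshold: float = 0.85,                         # assumed preset confidence threshold
) -> Discrimination:
    """Return the final discrimination result for one request."""
    first = first_model(execution_data)
    # High-confidence first result: the screening condition is not met,
    # so the large model is skipped, saving compute and latency.
    if first.confidence > threshold:
        return first
    # Otherwise escalate: the second model's result becomes final.
    return second_model(execution_data)

# Toy usage with stub models standing in for the two discrimination models.
small = lambda d: Discrimination(passed=True, confidence=d.get("conf", 0.9))
large = lambda d: Discrimination(passed=False, confidence=0.99)

print(cascade_discriminate({"conf": 0.95}, small, large).passed)  # first model suffices
print(cascade_discriminate({"conf": 0.40}, small, large).passed)  # escalated to second model
```

The escalation rule mirrors claim 2: confidence at or below the threshold means the preset screening condition is met and the second result is final; otherwise the first result is final.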

Description

Discrimination method, model training device, discrimination system and related products

Technical Field

The present disclosure relates to the field of artificial intelligence technology, and in particular to a discrimination method, a model training method, a discrimination apparatus, a model training apparatus, a discrimination system, an electronic device, a computer-readable storage medium, and a computer program product.

Background

With the development of large language models (LLMs) and visual language models (VLMs), graphical user interface (GUI) agents (e.g., computer-use agents, CUAs) are widely used to automatically execute phone-side or computer-side tasks. In the related art, a multimodal large model with a large parameter count (e.g., 7B or more) is generally used as the discrimination model, and a uniform discrimination mode is applied to all operations, which easily wastes compute, incurs high latency, and makes real-time interaction requirements difficult to meet.

Disclosure of Invention

The present disclosure provides a discrimination method, a model training method, a discrimination apparatus, a model training apparatus, a discrimination system, an electronic device, a computer-readable storage medium, and a computer program product.
In a first aspect, the disclosure provides a discrimination method applied to a discrimination system, the discrimination system comprising a first discrimination model and a second discrimination model for discriminating an instruction execution condition of a graphical user interface agent, the number of parameters of the first discrimination model being smaller than that of the second discrimination model. The discrimination method includes: when a discrimination request for a target agent is received, invoking the first discrimination model to perform a first discrimination on the target agent based on target execution data to obtain a first discrimination result; and, when the first discrimination result meets a preset screening condition, invoking the second discrimination model to perform a second discrimination on the target agent based on the target execution data to obtain a second discrimination result, wherein the target execution data is execution data carried in the discrimination request that characterizes the target agent's execution of a target instruction, and the preset screening condition is used to determine, according to the certainty of the first discrimination result, whether to invoke the second discrimination model to re-discriminate the target agent.
In a second aspect, the disclosure provides a model training method, which includes: inputting training data into an initial discrimination model to perform a second discrimination and obtain a predicted discrimination result, wherein the training data includes a sample instruction of a sample agent, an operation sequence corresponding to the sample instruction, and at least one interface screenshot taken while the sample agent executes the sample instruction on a graphical user interface, and the predicted discrimination result includes a predicted discrimination conclusion and a predicted reasoning text; inputting the predicted discrimination result and the at least one interface screenshot into a first discrimination model for semantic consistency analysis to obtain a consistency reward component, which characterizes the degree of consistency between the predicted reasoning text and the predicted discrimination result; adjusting part of the model parameters of the initial discrimination model according to the consistency reward component and a preset general reward component; and, when a preset convergence condition is met, obtaining the second discrimination model based on the current initial discrimination model, wherein the first discrimination model and the second discrimination model form a discrimination system for performing the discrimination method of the first aspect.
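The reward shaping in the second aspect can be sketched as follows. The linear mapping of the consistency degree onto the reward interval, the interval endpoints, and the equal weighting of the two reward components are all illustrative assumptions; the disclosure specifies only that the consistency degree is mapped to a preset continuous scalar interval and combined with a general reward component to adjust part of the model parameters.

```python
def consistency_reward(consistency: float, lo: float = -1.0, hi: float = 1.0) -> float:
    """Map a consistency degree in [0, 1] linearly onto an assumed preset
    continuous scalar reward interval [lo, hi]."""
    consistency = min(max(consistency, 0.0), 1.0)  # clamp to [0, 1]
    return lo + (hi - lo) * consistency

def total_reward(consistency: float, general: float, w: float = 0.5) -> float:
    """Combine the consistency reward component with a preset general reward
    component; the weight w is an illustrative assumption."""
    return w * consistency_reward(consistency) + (1.0 - w) * general

print(consistency_reward(0.75))  # 0.5
print(total_reward(0.75, 1.0))   # 0.75
```

In a typical setup this scalar reward would drive a policy-gradient update applied to only a subset of the initial discrimination model's parameters until the convergence condition is met; the disclosure does not fix a particular optimizer or parameter subset.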
In a third aspect, the disclosure provides a discrimination apparatus applied to a discrimination system, the discrimination system comprising a first discrimination model and a second discrimination model for discriminating an instruction execution condition of a graphical user interface agent, the number of parameters of the first discrimination model being smaller than that of the second discrimination model. The discrimination apparatus comprises: a first discrimination module, configured to, when a discrimination request for a target agent is received, invoke the first discrimination model to perform a first discrimination on the target agent based on target execution data to obtain a first discrimination result; and a second discrimination module, configured to invoke the second discrimination model to perform a second discrimination on the target agent based on th