CN-122019045-A - Graphical user interface instruction positioning method and system based on dynamic region search

Abstract

The invention discloses a graphical user interface instruction positioning method and system based on dynamic region search, belonging to the technical field of image or video recognition. The method comprises: receiving a natural language instruction and obtaining a current complete interface screenshot; parsing the elements of a candidate region in the screenshot and outputting a structured element set; calculating index scores for the candidate region; generating new candidate regions and calculating their index scores; scoring the quality of all candidate regions; planning and searching over the candidate regions by constructing a region search tree, selecting an optimal search path and outputting an optimal region; and inputting the optimal region and the natural language instruction into a base multimodal large model, which outputs a position prediction result for the target element. The invention can more accurately locate the target interface element corresponding to a user instruction in complex interfaces that have high resolution, dense elements and a large amount of irrelevant interference information.
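The index scores mentioned above are named in the claims (semantic relevance, interface coverage consistency, element relevance, semantic concentration), but the formula images did not survive in this text. The following is a minimal Python sketch of one plausible reading, not the patented formulas: the bag-of-words `embed` stands in for the text encoder, and the functional forms of `coverage`, `relevance`, `concentration` and `quality`, together with the default values of `eps`, `tau`, `alpha`, `beta` and `gamma`, are assumptions chosen only to match the listed symbol definitions.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for the text encoder (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of 2-norms."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def coverage(region, boxes):
    """Assumed form: summed element area over region area, capped at 1."""
    rx0, ry0, rx1, ry1 = region
    region_area = (rx1 - rx0) * (ry1 - ry0)
    covered = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes)
    return min(1.0, covered / region_area) if region_area > 0 else 0.0


def relevance(scores, weights, eps=1e-6):
    """Assumed form: interaction-weighted mean; eps keeps the denominator nonzero."""
    return sum(w * s for w, s in zip(weights, scores)) / (sum(weights) + eps)


def concentration(scores, tau=0.1):
    """Assumed form: largest softmax share of the scores at temperature tau."""
    if not scores:
        return 0.0
    peak = max(scores)
    exps = [math.exp((s - peak) / tau) for s in scores]
    total = sum(exps)
    return max(e / total for e in exps)


def quality(rel, cov, conc, alpha=0.5, beta=0.2, gamma=0.3):
    """Assumed form: weighted sum of the three region-level scores."""
    return alpha * rel + beta * cov + gamma * conc
```

Under these assumptions, a region whose elements are mostly interactive, tightly packed, and dominated by one element that matches the instruction scores higher than a sparse region with many equally weak matches.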

Inventors

ZHOU YU
LIU YICHAO

Assignees

Nankai University

Dates

Publication Date: 2026-05-12
Application Date: 2026-04-14

Claims (10)

1. A graphical user interface instruction positioning method based on dynamic region search, characterized by comprising the following steps: S1, receiving a natural language instruction input by a user, and acquiring a current complete interface screenshot; S2, parsing the elements of a candidate region in the current complete interface screenshot and outputting a structured element set comprising bounding boxes, semantic descriptions and interaction attributes; calculating, for each element in the structured element set, a semantic relevance score between the element and the instruction based on the semantic descriptions and the natural language instruction input by the user; calculating the interface coverage consistency within the candidate region based on the bounding boxes; and calculating the element relevance score and the semantic concentration of the candidate region based on the semantic relevance scores and the interaction attributes; S3, executing one of three perception actions, namely focusing, transferring or expanding, according to the structured element set and the semantic relevance score of each element to the instruction, so as to generate a new candidate region, and calculating, for the new candidate region, the semantic relevance score of each element to the instruction, the interface coverage consistency, the element relevance score and the semantic concentration; S4, scoring the quality of all candidate regions based on the element relevance score, the interface coverage consistency and the semantic concentration of each candidate region or new candidate region; S5, planning and searching over the candidate regions generated by the different perception actions, constructing a region search tree based on the current candidate region state, the candidate actions, the generated new candidate regions and the quality scores of all candidate regions, selecting an optimal search path within a given search budget, and outputting an optimal region; S6, inputting the optimal region and the natural language instruction into a base multimodal large model, and outputting a position prediction result for the target element.
2. The graphical user interface instruction positioning method based on dynamic region search according to claim 1, wherein in step S2 and step S3 the semantic relevance score between an element and the instruction is calculated according to formula (1): $s_i = \frac{E(q)^{\top} E(d_i)}{\lVert E(q) \rVert_2 \, \lVert E(d_i) \rVert_2}$ (1); wherein: $s_i$ represents the semantic relevance score of the $i$-th element to the current instruction, $q$ represents the user instruction, $\top$ represents the transpose, $E(\cdot)$ represents a text encoder, $d_i$ represents the textual description constructed for the $i$-th element, and $\lVert \cdot \rVert_2$ represents the 2-norm.
3. The graphical user interface instruction positioning method based on dynamic region search according to claim 2, wherein in step S2 and step S3 the interface coverage consistency of the current candidate region is calculated according to formula (2); wherein the quantities in formula (2) are: the interface coverage consistency within the current candidate region; the current candidate region; the area; the bounding box coordinates of the $i$-th element in the current candidate region; and the total number of elements within the current candidate region.
4. The graphical user interface instruction positioning method based on dynamic region search according to claim 3, wherein in step S2 and step S3 the element relevance score is calculated according to formula (3) and the semantic concentration is calculated according to formula (4); wherein the quantities in formulas (3) and (4) are: the element relevance score of the current candidate region; the interaction-aware weight of the $i$-th element in the current candidate region; a constant that prevents the denominator from being zero; the semantic concentration of the elements of the current candidate region; the proportion of the semantic relevance score of the $i$-th element in the current candidate region relative to the semantic relevance scores of all elements to the current instruction; the natural exponential function; a temperature parameter; and the semantic relevance score of any element in the current structured element set to the current instruction.
5. The graphical user interface instruction positioning method based on dynamic region search according to claim 1, wherein in step S3 one of the three perception actions of focusing, transferring or expanding is executed according to the structured element set and the semantic relevance score of each element to the instruction, so as to generate a new candidate region, by the following method: S311, comparing the number of elements in the current candidate region with the number of elements in the previous candidate region; if the number of elements in the current candidate region is greater than or equal to 90% of the number of elements in the previous candidate region, executing the focusing action, namely selecting from the current candidate region a subset of elements with higher semantic relevance scores to the instruction, calculating the minimum enclosing region of the selected elements, and taking that minimum enclosing region as the new candidate region; if the number of elements in the current candidate region is less than 90% of the number of elements in the previous candidate region, proceeding to the next step; S312, searching outside the current candidate region for a subset of elements with higher semantic relevance scores to the instruction, and either executing the transferring action, namely calculating the minimum enclosing region of the selected elements and taking it as the new candidate region, or executing the expanding action, namely merging the selected elements with the elements of the current candidate region to form the new candidate region.
6. The graphical user interface instruction positioning method based on dynamic region search according to claim 4, wherein in step S4 the quality of all candidate regions is scored according to formula (5); wherein the quantities in formula (5) are: the quality score of the candidate region; the relative contribution of the element relevance score; the relative contribution of the interface coverage consistency within the current candidate region; and the relative contribution of the semantic concentration of the elements of the current candidate region.
7. The method for positioning graphical user interface instructions based on dynamic region search according to claim 1, wherein in step S5, a Monte Carlo tree search strategy is adopted to plan and search candidate regions generated by different sensing actions.
8. The graphical user interface instruction positioning method based on dynamic region search according to claim 1, wherein in step S5 the current candidate region state includes the current candidate region, the structured element set of the current candidate region, and the semantic relevance score of each element in that structured element set to the instruction.
9. The graphical user interface instruction positioning method based on dynamic region search according to claim 8, wherein in step S5 the region search tree is constructed, the optimal search path is selected within the given search budget, and the optimal region is output as follows: S511, each node in the search tree represents a candidate region state and each edge in the search tree represents a perception action; starting from the root node, the currently most promising action branch is selected according to the Monte Carlo tree search strategy; S512, after a leaf node is reached, a new candidate region is generated for a perception action that has not yet been expanded, and is added to the search tree; S513, the quality score of the candidate region is taken as the reward of the leaf node, the reward is propagated back along the search path, and the statistics of each node in the search tree are updated; and S514, after multiple rounds of search, the candidate region with the highest quality score or the highest visit count is selected from the search tree as the optimal region.
10. A graphical user interface instruction positioning system based on dynamic region search, for executing the graphical user interface instruction positioning method based on dynamic region search according to any one of claims 1 to 9, characterized by comprising a task input module, an interface acquisition module, an element perception module, a dynamic perception action module, a region quality evaluation module, an action planning module and a final positioning module; the task input module is used for receiving a natural language instruction input by a user; the interface acquisition module is used for acquiring a current complete interface screenshot; the element perception module is used for parsing the elements of a candidate region in the current complete interface screenshot, outputting a structured element set, and calculating the semantic relevance score of each element to the instruction, the interface coverage consistency within the candidate region, the element relevance score and the semantic concentration; the dynamic perception action module is used for executing one of three perception actions, namely focusing, transferring or expanding, according to the structured element set and the semantic relevance score of each element to the instruction, so as to generate a new candidate region; the region quality evaluation module is used for scoring the quality of all candidate regions based on the element relevance score, the interface coverage consistency within the candidate region and the semantic concentration; the action planning module is used for planning and searching over the candidate regions generated by the different perception actions, constructing a region search tree based on the current candidate region state, the candidate actions, the generated new candidate regions and the quality scores of all candidate regions, selecting an optimal search path within a given search budget, and outputting an optimal region; the final positioning module is used for inputting the optimal region and the natural language instruction into a base multimodal large model and outputting the position prediction result for the target element.
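The search described in claims 5, 7 and 9 can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not the claimed implementation: `Element`, `apply_action`, `quality`, `search`, the `top_k` size of 2, the UCB1 constant `c` and the `max_depth` cap are all hypothetical; `quality` uses the mean relevance of enclosed elements as a stand-in for formula (5); and the 90% element-count rule of claim 5 is omitted, so every node simply tries all three perception actions as tree edges.

```python
import math
from dataclasses import dataclass, field

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)
ACTIONS = ("focus", "shift", "expand")   # focusing, transferring, expanding


@dataclass(frozen=True)
class Element:
    box: Box
    score: float  # semantic relevance to the instruction, assumed precomputed


def inside(e: Element, r: Box) -> bool:
    return r[0] <= e.box[0] and r[1] <= e.box[1] and e.box[2] <= r[2] and e.box[3] <= r[3]


def enclose(elems) -> Box:
    """Minimum enclosing region of a non-empty list of elements."""
    return (min(e.box[0] for e in elems), min(e.box[1] for e in elems),
            max(e.box[2] for e in elems), max(e.box[3] for e in elems))


def top_k(elems, k=2):
    return sorted(elems, key=lambda e: e.score, reverse=True)[:k]


def apply_action(action, region, elements) -> Box:
    inner = [e for e in elements if inside(e, region)]
    outer = [e for e in elements if not inside(e, region)]
    if action == "focus" and inner:
        return enclose(top_k(inner))          # shrink onto the best inner elements
    if action == "shift" and outer:
        return enclose(top_k(outer))          # move to the best outer elements
    if action == "expand" and outer:
        return enclose(inner + top_k(outer))  # merge inner and best outer elements
    return region                             # action not applicable: stay put


def quality(region, elements) -> float:
    """Placeholder for formula (5): mean relevance of the enclosed elements."""
    inner = [e.score for e in elements if inside(e, region)]
    return sum(inner) / len(inner) if inner else 0.0


@dataclass
class Node:
    region: Box
    parent: "Node | None" = None
    children: dict = field(default_factory=dict)
    visits: int = 0
    total: float = 0.0


def ucb(child, parent, c=1.4):
    return child.total / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)


def search(screen, elements, budget=30, max_depth=4):
    root = Node(screen)
    best_region, best_q = screen, quality(screen, elements)
    for _ in range(budget):
        node, depth = root, 0
        # Selection: descend through fully expanded nodes by UCB1.
        while len(node.children) == len(ACTIONS) and depth < max_depth:
            node = max(node.children.values(), key=lambda ch: ucb(ch, node))
            depth += 1
        # Expansion: try one perception action that has no child yet.
        if depth < max_depth:
            action = next(a for a in ACTIONS if a not in node.children)
            child = Node(apply_action(action, node.region, elements), parent=node)
            node.children[action] = child
            node = child
        # Evaluation: the region quality score is the reward.
        reward = quality(node.region, elements)
        if reward > best_q:
            best_region, best_q = node.region, reward
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.total += reward
            node = node.parent
    return best_region, best_q
```

Calling `search((0, 0, 100, 100), elements)` with a list of `Element` objects returns the highest-scoring region visited within the budget; that region, together with the instruction, would then be handed to the multimodal model in step S6. Because shift and expand can undo an earlier focus, the tree can recover from an early crop that missed the target.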

Description

Graphical user interface instruction positioning method and system based on dynamic region search

Technical Field

The invention relates to the technical field of image or video recognition, and in particular to a graphical user interface instruction positioning method and system based on dynamic region search.

Background

The task of instruction positioning in a graphical user interface can generally be stated as follows: given a current screenshot and a natural language instruction, the system must accurately locate on the screen the target interface element that matches the semantics of the instruction, such as a button, an icon, a menu item, an input box or another interactable control, so as to provide a reliable spatial basis for subsequent action execution. This task is a basic step by which a graphical user interface agent understands user intent and completes interactive operations, and the positioning accuracy directly affects the reliability of subsequent operations such as clicking and typing.

Existing graphical user interface positioning methods fall mainly into two types. The first type adopts a full-screen single-step prediction paradigm: based directly on the complete screenshot and the user instruction, it regresses the target coordinates or predicts the target bounding box over the whole interface image. Such methods are relatively straightforward to implement, but real interfaces generally have high resolution and contain numerous, densely arranged elements, including a large amount of text, icons and controls irrelevant to the current task. Under a global field of view the model is easily disturbed by irrelevant regions and its attention is dispersed, which degrades the accuracy and stability of target positioning.

The second type adopts multi-step cropping or progressive zooming: it first coarsely locates a candidate region and then continuously narrows the observation range so as to gradually approach the target control. Such methods can improve local fine-grained perception to a certain extent, but the search path usually shrinks in one direction only; once an early crop deviates from the real target, the subsequent process can hardly recover, and errors tend to accumulate step by step. Especially in complex interfaces with high resolution, dense elements and high local semantic similarity, a strategy that relies purely on forward zooming struggles to handle field-of-view drift, loss of context, and confusion between targets and distractors.

In addition, a real interface differs from a general natural image and tends to exhibit pronounced structural heterogeneity. On the one hand, an interface contains text elements, icon elements, composite controls, list items, menu bars and multi-layer nested containers. On the other hand, target elements are not always visually salient: many controls are small, many icons look alike, and their functional semantics can often be judged accurately only in combination with surrounding text or layout context. Consequently, relying only on raw visual features or on a single global match, it is often difficult to stably extract the regional cues that are truly relevant to the user instruction in a complex interface.
Disclosure of Invention

The technical problem to be solved by the invention is to provide a graphical user interface instruction positioning method and system based on dynamic region search that, without additional training, can more accurately locate the target interface element corresponding to a user instruction in a complex interface with high resolution, dense elements and a large amount of irrelevant interference information.

The invention is realized by the following technical scheme: a graphical user interface instruction positioning method based on dynamic region search comprises the following steps: S1, receiving a natural language instruction input by a user, and acquiring a current complete interface screenshot; S2, parsing the elements of a candidate region in the current complete interface screenshot and outputting a structured element set comprising bounding boxes, semantic descriptions and interaction attributes; calculating, for each element in the structured element set, a semantic relevance score between the element and the instruction based on the semantic descriptions and the natural language instruction input by the user; calculating the interface coverage consistency within the candidate region based on the bounding boxes; and calculating the element relevance score and the semantic concentration based on the semantic relevance scores and the interaction attributes; S3, executing one of three sensing actions of focusing, transferring or expanding accor