
US-20260127015-A1 - END-TO-END MOBILE USER INTERFACE NAVIGATION WITH VISION LANGUAGE ACTION MODELS


Abstract

The subject technology provides for end-to-end mobile user interface navigation with vision language action models. An apparatus may receive a language instruction and a visual input from a user interface (UI) of an electronic device. The apparatus can tokenize the language instruction and the visual input separately. The apparatus can process the tokenized language instruction and the tokenized visual input using a multi-modal large language model to generate one or more action outputs. The apparatus also can convert the action outputs into executable commands that cause the electronic device to perform navigation tasks on the UI.

Inventors

  • Di Feng
  • Abhishek Sundararajan
  • Pengfei Dou
  • Haotian Zhang
  • Zifeng Huang
  • Eldon K. Schoop
  • Alexander Toshev
  • Jeffrey W. Nichols
  • Yinfei Yang
  • Zhe Gan
  • Mohana Prasad Sathya Moorthy
  • Keen You
  • Zhen Yang
  • Anuj Mahajan
  • Harsh Agrawal
  • Meng-Ta Chou
  • Andres Romero Mier y Teran
  • Adolfo Lopez Mendez
  • Kenneth Jung

Assignees

  • APPLE INC.

Dates

Publication Date
May 7, 2026
Application Date
Aug. 7, 2025

Claims (20)

  1. A method, comprising: receiving a language instruction and a visual input from a user interface (UI) of an electronic device; tokenizing the language instruction and the visual input separately; processing the tokenized language instruction and the tokenized visual input using a multi-modal large language model to generate one or more action outputs; and converting the action outputs into executable commands that cause the electronic device to perform navigation tasks on the UI.
  2. The method of claim 1, wherein the action outputs are in a pre-defined format, and wherein the executable commands are configured to perform one or more of a tap, scroll, input, or swipe action on the UI of the electronic device.
  3. The method of claim 1, wherein the visual input is processed using an image encoder, wherein the image encoder is configured to: resize and divide the visual input based on an orientation of the UI, and tokenize both a full resized image and at least one sub-image derived from the visual input.
  4. The method of claim 1, further comprising: generating synthetic training data to augment a core data set for training the multi-modal large language model, wherein the synthetic training data is generated by simulating navigation errors that occur on the UI of the electronic device, wherein the navigation errors comprise one or more of an incorrect tap location, inputting text without a corresponding active text field, or performing an excessive scroll action when a page boundary is reached.
  5. The method of claim 4, wherein generating the synthetic training data further comprises using user-annotated data to create failure scenarios, wherein the synthetic training data is generated by perturbing a tap location to fall outside a UI element boundary defined by the user-annotated data.
  6. The method of claim 1, further comprising generating reasoning traces for the multi-modal large language model, wherein the reasoning traces guide the multi-modal large language model in a sequence of decision-making steps for multi-step tasks to enhance accuracy in completing the navigation tasks.
  7. A system, comprising: a tokenization module configured to independently tokenize a language instruction and a visual input captured from a user interface (UI) on an electronic device; a multi-modal large language model configured to: receive tokenized inputs from the tokenization module, generate action outputs based on the tokenized inputs, and output the action outputs in a pre-defined format; and an execution module configured to interpret the action outputs as executable commands, the executable commands further configured to navigate the UI of the electronic device based on the language instruction and the visual input.
  8. The system of claim 7, further comprising an exploration mechanism for generating auto-data by executing one or more simulated tasks on the UI of the electronic device, the exploration mechanism comprising: at least one large language model module configured to autonomously navigate the UI and record task completion traces; an injection module configured to introduce random actions within a task sequence to simulate recovery from adverse states; and a training module configured to apply the task completion traces and injected adverse-state recovery traces for training the multi-modal large language model.
  9. The system of claim 7, wherein the multi-modal large language model uses reasoning traces and the action outputs, the reasoning traces representing logical, stepwise sequences for completing navigation tasks on the UI of the electronic device.
  10. The system of claim 7, wherein the executable commands are configured to perform one or more of a tap, scroll, input, or swipe action on the UI of the electronic device.
  11. The system of claim 7, wherein the visual input is processed using an image encoder, wherein the image encoder is configured to: resize and divide the visual input based on an orientation of the UI, and tokenize both a full resized image and at least one sub-image derived from the visual input.
  12. The system of claim 7, further comprising: generating synthetic training data to augment a core data set for training the multi-modal large language model, wherein the synthetic training data is generated by simulating navigation errors that occur on the UI of the electronic device, wherein the navigation errors comprise one or more of an incorrect tap location, inputting text without a corresponding active text field, or performing an excessive scroll action when a page boundary is reached.
  13. The system of claim 12, wherein generating the synthetic training data further comprises using user-annotated data to create failure scenarios, wherein the synthetic training data is generated by perturbing a tap location to fall outside a UI element boundary defined by the user-annotated data.
  14. The system of claim 7, further comprising generating reasoning traces for the multi-modal large language model, wherein the reasoning traces guide the multi-modal large language model in a sequence of decision-making steps for multi-step tasks to enhance accuracy in completing navigation tasks.
  15. A non-transitory machine-readable medium comprising code that, when executed by a processor, causes the processor to perform operations comprising: capturing a language instruction and a visual input from a user interface (UI) on an electronic device; tokenizing the language instruction and the visual input separately; processing the tokenized language instruction and the tokenized visual input with a multi-modal large language model to produce one or more action outputs in a pre-defined format; converting the action outputs into executable commands that perform navigation tasks on the UI; and generating synthetic data to train the multi-modal large language model by injecting adverse actions into task sequences.
  16. The non-transitory machine-readable medium of claim 15, wherein the adverse actions comprise off-target taps, premature input actions, or excessive scroll actions beyond a UI boundary.
  17. The non-transitory machine-readable medium of claim 15, wherein the executable commands are configured to perform one or more of a tap, scroll, input, or swipe action on the UI of the electronic device.
  18. The non-transitory machine-readable medium of claim 15, wherein the operations further comprise: generating synthetic training data to augment a core data set for training the multi-modal large language model, wherein the synthetic training data is generated by simulating navigation errors that occur on the UI of the electronic device, wherein the navigation errors comprise one or more of an incorrect tap location, inputting text without a corresponding active text field, or performing an excessive scroll action when a page boundary is reached.
  19. The non-transitory machine-readable medium of claim 18, wherein generating the synthetic training data further comprises using user-annotated data to create failure scenarios, wherein the synthetic training data is generated by perturbing a tap location to fall outside a UI element boundary defined by the user-annotated data.
  20. The non-transitory machine-readable medium of claim 15, wherein the operations further comprise generating reasoning traces for the multi-modal large language model, wherein the reasoning traces guide the multi-modal large language model in a sequence of decision-making steps for multi-step tasks to enhance accuracy in completing the navigation tasks.
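
The claims leave the pre-defined action format unspecified. As a purely illustrative reading of claims 1, 2, 7, and 10, the following Python sketch assumes a hypothetical JSON action schema and stub device functions, and shows how generated action outputs could be parsed and converted into executable tap, scroll, input, or swipe commands.

    import json
    from dataclasses import dataclass
    from typing import Callable, Dict

    # Hypothetical pre-defined action format (an assumption, not from the patent):
    #   {"action": "tap", "x": 120, "y": 640}
    #   {"action": "input", "text": "coffee shops near me"}
    #   {"action": "scroll", "direction": "down"}

    @dataclass
    class Action:
        name: str
        params: dict

    def parse_action_output(raw: str) -> Action:
        """Parse one model-generated action string into a structured Action."""
        obj = json.loads(raw)
        return Action(name=obj.pop("action"), params=obj)

    # Stub executors standing in for the device automation layer (assumed API).
    def tap(x: int, y: int) -> None:
        print(f"tap at ({x}, {y})")

    def scroll(direction: str) -> None:
        print(f"scroll {direction}")

    def input_text(text: str) -> None:
        print(f"type: {text!r}")

    def swipe(x1: int, y1: int, x2: int, y2: int) -> None:
        print(f"swipe ({x1},{y1}) -> ({x2},{y2})")

    EXECUTORS: Dict[str, Callable[..., None]] = {
        "tap": tap,
        "scroll": scroll,
        "input": input_text,
        "swipe": swipe,
    }

    def execute(action: Action) -> None:
        """Convert a parsed action output into an executable command on the UI."""
        handler = EXECUTORS.get(action.name)
        if handler is None:
            raise ValueError(f"unsupported action: {action.name}")
        handler(**action.params)

    if __name__ == "__main__":
        execute(parse_action_output('{"action": "tap", "x": 120, "y": 640}'))
        execute(parse_action_output('{"action": "input", "text": "coffee shops near me"}'))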
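
Claims 3 and 11 describe an image encoder that resizes and divides the screenshot based on the UI orientation and tokenizes both the full resized image and at least one sub-image. A minimal sketch of that preprocessing step, assuming Pillow and an arbitrary 448-pixel tile size (neither is specified by the claims):

    from PIL import Image

    def prepare_screenshot(img: Image.Image, base: int = 448):
        """Resize a UI screenshot and split it into sub-images based on orientation.

        Returns the full resized image plus a list of sub-images; both would be
        passed on to the image tokenizer. The tile size `base` is an assumption.
        """
        w, h = img.size
        portrait = h >= w
        # Resize so the short side matches the base resolution, preserving aspect ratio.
        if portrait:
            new_w, new_h = base, int(h * base / w)
        else:
            new_w, new_h = int(w * base / h), base
        resized = img.resize((new_w, new_h))

        # Divide along the long axis (vertically for portrait, horizontally for landscape).
        tiles = []
        if portrait:
            for top in range(0, new_h, base):
                tiles.append(resized.crop((0, top, base, min(top + base, new_h))))
        else:
            for left in range(0, new_w, base):
                tiles.append(resized.crop((left, 0, min(left + base, new_w), base)))
        return resized, tiles

    if __name__ == "__main__":
        full, subs = prepare_screenshot(Image.new("RGB", (1170, 2532)))
        print(full.size, [t.size for t in subs])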
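
Claims 4, 5, 12, 13, 16, 18, and 19 describe generating synthetic failure data by simulating navigation errors, including perturbing a tap so it falls outside an annotated UI element boundary. A hedged sketch, assuming annotated elements are given as (left, top, right, bottom) boxes in a hypothetical episode format not taken from the patent:

    import random

    def perturb_tap_outside(box, screen_w, screen_h, margin=20, rng=random):
        """Perturb a tap so it lands outside the annotated element boundary.

        `box` is (left, top, right, bottom) from user-annotated data. The resulting
        off-target tap, paired with the original instruction, becomes a synthetic
        failure example for training.
        """
        left, top, right, bottom = box
        for _ in range(100):
            x = rng.randint(0, screen_w - 1)
            y = rng.randint(0, screen_h - 1)
            # Accept points at least `margin` pixels outside the element box.
            if x < left - margin or x > right + margin or y < top - margin or y > bottom + margin:
                return x, y
        raise RuntimeError("could not find an off-target point")

    def make_failure_examples(episode, screen_w, screen_h):
        """Derive simulated navigation errors from an annotated episode.

        `episode` is assumed to be a list of steps such as
        {"action": "tap", "box": (l, t, r, b)} or {"action": "input", ...}.
        """
        failures = []
        for step in episode:
            if step["action"] == "tap":
                x, y = perturb_tap_outside(step["box"], screen_w, screen_h)
                failures.append({"action": "tap", "x": x, "y": y,
                                 "error": "incorrect_tap_location"})
            elif step["action"] == "input" and not step.get("text_field_active", True):
                failures.append({**step, "error": "input_without_active_field"})
        return failures

These synthetic failures would augment the core data set so the model also sees what an unsuccessful step looks like, as the claims describe.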
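
Claims 6, 9, 14, and 20 recite reasoning traces that guide the model through decision-making steps on multi-step tasks. The trace wording and delimiters below are assumptions; the sketch only illustrates the general idea of pairing an intermediate thought with the action the model is trained to emit:

    def build_training_target(observation_summary, goal, thought, action_json):
        """Format one step of a multi-step task as a reasoning trace plus action.

        The field names and layout are hypothetical; the model would be trained to
        produce the reasoning text before committing to the action for that step.
        """
        return (
            f"Goal: {goal}\n"
            f"Screen: {observation_summary}\n"
            f"Thought: {thought}\n"
            f"Action: {action_json}"
        )

    example = build_training_target(
        observation_summary="Settings home screen with a search bar and a list of categories",
        goal="Turn on Dark Mode",
        thought="Dark Mode lives under Display & Brightness, so I should tap that row first.",
        action_json='{"action": "tap", "x": 540, "y": 1180}',
    )
    print(example)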
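
Claim 8 describes an exploration mechanism in which a large language model agent navigates the UI autonomously, records task completion traces, and has random actions injected so that recovery from adverse states appears in the training data. A minimal sketch, with the agent, executor, and observer passed in as callables whose interfaces are assumptions:

    import random

    def explore(agent_step, execute, observe, max_steps=15, inject_prob=0.15, rng=random):
        """Run one simulated task and record a completion trace.

        `agent_step(observation)` is assumed to return an action dict, or None when
        the task is judged complete; `execute(action)` applies it to the UI; and
        `observe()` returns the current screen state. With probability `inject_prob`
        a random adverse tap is injected so the recorded trace also captures how the
        agent recovers from an unintended state.
        """
        trace = []
        for _ in range(max_steps):
            obs = observe()
            if rng.random() < inject_prob:
                action = {"action": "tap",
                          "x": rng.randint(0, 1170), "y": rng.randint(0, 2532),
                          "injected": True}
            else:
                action = agent_step(obs)
                if action is None:
                    break
            execute(action)
            trace.append({"observation": obs, "action": action})
        return trace

The recorded traces, including the injected adverse-state recoveries, would then be fed to the training module described in the claim.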

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims the benefit of U.S. Provisional Application No. 63/716,230, entitled "END-TO-END MOBILE USER INTERFACE NAVIGATION WITH VISION LANGUAGE ACTION MODELS", filed Nov. 4, 2024, the entirety of which is incorporated herein by reference.

TECHNICAL FIELD

The present description generally relates to end-to-end mobile user interface navigation with vision language action models.

BACKGROUND

Machine learning has seen a significant rise in popularity in recent years due to the availability of training data and advances in more powerful and efficient computing hardware. Machine learning may utilize models that are executed to provide predictions in particular applications. Large language models are characterized by their substantial size, often comprising hundreds of millions to billions of parameters. These models require significant computational power and memory for training and inference. However, deploying large machine learning models across different environments presents challenges related to model performance in these environments.

BRIEF DESCRIPTION OF THE DRAWINGS

Certain features of the subject technology are set forth in the appended claims. However, for purposes of explanation, several embodiments of the subject technology are set forth in the following figures.

FIG. 1 illustrates an example network environment in accordance with one or more implementations.

FIG. 2 is a flow chart of an example process that may be performed for end-to-end mobile user interface navigation with vision language action models in accordance with one or more implementations.

FIG. 3 illustrates an example multi-step navigation flow for end-to-end mobile user interface navigation with vision language action models in accordance with one or more implementations.

FIG. 4 illustrates a block diagram of an example autonomous UI agent architecture for end-to-end mobile user interface navigation with vision language action models in accordance with one or more implementations.

FIG. 5 illustrates example visual question answering formats for training a large language model agent for end-to-end mobile user interface navigation with vision language action models in accordance with one or more implementations.

FIG. 6 illustrates an example human-annotated episode flow for end-to-end mobile user interface navigation with vision language action models in accordance with one or more implementations.

FIG. 7 illustrates a block diagram of an exploration mechanism to generate auto-label data for end-to-end mobile user interface navigation with vision language action models in accordance with one or more implementations.

FIG. 8 illustrates an electronic system with which one or more implementations of the subject technology may be implemented.

DETAILED DESCRIPTION

The detailed description set forth below is intended as a description of various configurations of the subject technology and is not intended to represent the only configurations in which the subject technology can be practiced. The appended drawings are incorporated herein and constitute a part of the detailed description. The detailed description includes specific details for the purpose of providing a thorough understanding of the subject technology. However, the subject technology is not limited to the specific details set forth herein and can be practiced using one or more other implementations.
In one or more implementations, structures and components are shown in block diagram form in order to avoid obscuring the concepts of the subject technology.

Powered by large language models and multi-modal foundation models, autonomous agents capable of controlling mobile devices to perform tasks traditionally executed by humans represent an emerging technology with significant potential to impact the computer industry and user interfaces (UIs). In one or more implementations, these autonomous UI agents can display a cooking recipe on a mobile device while a user's hands remain occupied with other tasks, such as washing dishes, or transcribe a meeting reminder while the user is engaged in activities like driving. By automating interactions with mobile applications that typically demand manual effort, such autonomous UI agents can provide beneficial improvements in productivity and safety across daily activities.

Recent efforts in autonomous UI agents focus on enabling models to interpret natural language instructions and understand the state of mobile devices by processing visual input, such as raw screenshots, application UI trees, or outputs from specialized UI detection models. These agents can predict actions, such as tap, type, and scroll, that are subsequently executed through the device's user interface. In one or more implementations, autonomous UI agents may be configured to predict low-level action policies specific to mobile device control, positioning them as a subset of Vision Language Action (VLA) models within the