US-12625918-B2 - Automatic navigation of interactive web documents
Abstract
The present disclosure is generally directed to methods, apparatus, and computer-readable media (transitory and non-transitory) for learning to automatically navigate interactive web documents and/or websites. More particularly, various approaches are presented for training various deep Q network (DQN) agents to perform various tasks associated with reinforcement learning, including hierarchical reinforcement learning, in challenging web navigation environments with sparse rewards and large state and action spaces. These agents include a web navigation agent that can use learned value function(s) to automatically navigate through interactive web documents, as well as a training agent, referred to herein as a “meta-trainer,” that can be trained to generate synthetic training examples. Some approaches described herein may be implemented when expert demonstrations are available. Other approaches described herein may be implemented when expert demonstrations are not available. In either case, dense, potential-based rewards may be used to augment the training.
Inventors
- Aleksandra Faust
- Dilek Hakkani-Tur
- Izzeddin Gur
- Ulrich Rueckert
Assignees
- GOOGLE LLC
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2024-11-19
Claims (17)
- 1 . A method implemented using one or more processors, comprising: determining a natural language input, wherein the natural language input comprises a command to perform a task; retrieving an interactive web document that is operable via a graphical user interface (“GUI”) to perform the task, wherein the interactive web document includes one or more constituent interactive document object model (DOM) elements that are operable to input one or more values for use in performing the task; encoding the one or more interactive DOM elements of the interactive web document into one or more interactive DOM element feature vectors, wherein encoding the one or more interactive DOM elements includes conditioning the one or more interactive DOM element feature vectors based on the natural language input, wherein the conditioning includes performing an operation using the one or more interactive DOM element feature vectors and one or more overlap feature vectors as operands; generating, based on the one or more conditioned interactive element feature vectors and the natural language input, one or more probability distributions over the one or more interactive elements; and facilitating automated navigation through the interactive web document in response to the natural language input based at least in part on one or more of the probability distributions.
- 2 . The method of claim 1 , wherein the one or more overlap feature vectors are encoded based on overlap between the natural language input and one or more attributes of one or more of the interactive DOM elements.
- 3 . The method of claim 2 , wherein the operation comprises concatenation.
- 4 . The method of claim 1 , further comprising linearizing a tree structure that represents the interactive DOM elements into a sequence of DOM elements.
- 5 . The method of claim 1 , wherein a long-short term memory (“LSTM”) network is used to encode the one or more interactive DOM element feature vectors.
- 6 . The method of claim 5 , wherein the LSTM network comprises a bidirectional LSTM network.
- 7 . The method of claim 1 , wherein the one or more probability distributions comprise a composite Q value.
- 8 . The method of claim 7 , wherein the composite Q value comprises an interactive DOM element Q value, a click-or-type Q value, and a type Q value.
- 9 . A system comprising one or more processors and memory storing instructions that, in response to execution by the one or more processors, cause the one or more processors to: determine a natural language input, wherein the natural language input comprises a command to perform a task; retrieve an interactive web document that is operable via a graphical user interface (“GUI”) to perform the task, wherein the interactive web document includes one or more constituent interactive document object model (DOM) elements that are operable to input one or more values for use in performing the task; encode the one or more interactive DOM elements of the interactive web document into one or more interactive DOM element feature vectors, wherein encoding the one or more interactive DOM elements includes conditioning the one or more interactive DOM element feature vectors based on the natural language input, wherein the conditioning includes performing an operation using the one or more interactive DOM element feature vectors and one or more overlap feature vectors as operands; generate, based on the one or more conditioned interactive element feature vectors and the natural language input, one or more probability distributions over the one or more interactive elements; and facilitate automated navigation through the interactive web document in response to the natural language input based at least in part on one or more of the probability distributions.
- 10 . The system of claim 9 , wherein the one or more overlap feature vectors are encoded based on overlap between the natural language input and one or more attributes of one or more of the interactive DOM elements.
- 11 . The system of claim 10 , wherein the operation comprises concatenation.
- 12 . The system of claim 9 , further comprising linearizing a tree structure that represents the interactive DOM elements into a sequence of DOM elements.
- 13 . The system of claim 9 , wherein a long-short term memory (“LSTM”) network is used to encode the one or more interactive DOM element feature vectors.
- 14 . The system of claim 13 , wherein the LSTM network comprises a bidirectional LSTM network.
- 15 . The system of claim 9 , wherein the one or more probability distributions comprise a composite Q value.
- 16 . The system of claim 15 , wherein the composite Q value comprises an interactive DOM element Q value, a click-or-type Q value, and a type Q value.
- 17 . At least one non-transitory computer-readable medium comprising instructions that, in response to execution by one or more processors, cause the one or more processors to: determine a natural language input, wherein the natural language input comprises a command to perform a task; retrieve an interactive web document that is operable via a graphical user interface (“GUI”) to perform the task, wherein the interactive web document includes one or more constituent interactive document object model (DOM) elements that are operable to input one or more values for use in performing the task; encode the one or more interactive DOM elements of the interactive web document into one or more interactive DOM element feature vectors, wherein encoding the one or more interactive DOM elements includes conditioning the one or more interactive DOM element feature vectors based on the natural language input, wherein the conditioning includes performing an operation using the one or more interactive DOM element feature vectors and one or more overlap feature vectors as operands; generate, based on the one or more conditioned interactive element feature vectors and the natural language input, one or more probability distributions over the one or more interactive elements; and facilitate automated navigation through the interactive web document in response to the natural language input based at least in part on one or more of the probability distributions.
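The encode-condition-score pipeline recited in claims 1, 3, 7, and 8 can be sketched in miniature. This is a toy illustration only: the random element vectors stand in for the (bi)LSTM element encodings of claims 5-6, the linear scoring weights stand in for the learned Q networks, and all function and variable names here are hypothetical, not taken from the disclosure.

```python
import numpy as np

def overlap_feature(instruction_tokens, attr_tokens):
    """Scalar overlap: fraction of element-attribute tokens that also appear
    in the natural language instruction (cf. claim 2)."""
    attrs = set(attr_tokens)
    if not attrs:
        return 0.0
    return len(attrs & set(instruction_tokens)) / len(attrs)

def condition(element_vec, overlap_vec):
    """Conditioning by concatenation (cf. claim 3):
    element features || overlap features."""
    return np.concatenate([element_vec, overlap_vec])

def composite_q(cond_vec, w_element, w_mode, w_type):
    """Composite Q value (cf. claim 8): an element Q value, a click-or-type
    Q value, and a type Q value, each a linear score here."""
    return (float(cond_vec @ w_element),
            float(cond_vec @ w_mode),
            float(cond_vec @ w_type))

def element_distribution(q_elements):
    """Softmax over per-element Q values -> probability distribution over
    interactive elements (cf. claim 7)."""
    q = np.asarray(q_elements)
    e = np.exp(q - q.max())
    return e / e.sum()

# Hypothetical instruction and DOM elements (tag, attribute tokens).
instruction = "book a flight from wtk to lon".split()
elements = [
    ("input", ["from", "origin"]),
    ("input", ["to", "destination"]),
    ("button", ["submit", "flight"]),
]

rng = np.random.default_rng(0)
w_e, w_m, w_t = rng.normal(size=(3, 4))   # stand-in Q-network weights
qs = []
for _tag, attrs in elements:
    elem_vec = rng.normal(size=3)          # stand-in for a biLSTM encoding
    ov = np.array([overlap_feature(instruction, attrs)])
    qs.append(composite_q(condition(elem_vec, ov), w_e, w_m, w_t)[0])
probs = element_distribution(qs)           # distribution over the 3 elements
```

In a full agent, the element chosen from this distribution would then be clicked or typed into according to the click-or-type and type Q values.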
Description
BACKGROUND

Reinforcement learning (“RL”) is challenging in environments having large state and action spaces, especially when only sparse rewards are available. In one example of such an environment, RL may be used to train an RL policy that is then used by an automated assistant (also referred to as a “virtual assistant,” “chatbot,” “digital assistant,” etc.) to automatically navigate web documents (e.g., webpages) based on users' intents determined from natural language instructions. The potential input vocabulary and number of actionable elements in such a scenario can grow quite large. In a typical web environment, an automated assistant might need to carefully navigate through a large number of interactive input components (e.g., document object model, or “DOM,” elements) to follow highly dynamic instructions formulated from large vocabularies. For example, suppose a user issues the natural language instruction, “Book a flight from WTK to LON on 21 Oct. 2016.” The automated assistant (or a separate web navigation agent acting in cooperation with the automated assistant) may need to fill out origin and destination drop-down menus on the web page with the correct airport codes, select a date, hit a submit button, and then select the cheapest flight among all the options that are returned. This is not a trivial task for an automated assistant, or for a web navigation agent if distinct from the automated assistant. The first three fields may be filled out in any order. Moreover, the options for selection are numerous, and among all possible airport/date combinations, only one conforms to the user's request. In some cases the web page form can only be submitted once all three fields are filled in. At that point the web environment/web page changes, and flight selection becomes possible. Then, a flight can be selected and booked. Reaching the true objective in such tasks through trial-and-error is cumbersome given the large state and action spaces.
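The episode structure described above (fields fillable in any order, submission gated on completeness, reward only at the end) can be sketched as a toy environment. The class, field names, and reward values below are hypothetical illustrations, not part of the disclosure.

```python
class FlightFormEnv:
    """Toy flight-booking episode: three fields may be filled in any order;
    the form only submits once all are filled; the sparse reward arrives
    only at submission."""

    FIELDS = ("origin", "destination", "date")

    def __init__(self, goal):
        # goal: the single field assignment that matches the instruction.
        self.goal = goal
        self.values = {}

    def type_into(self, field, value):
        self.values[field] = value

    def can_submit(self):
        # Submission becomes possible only once all three fields are filled.
        return all(f in self.values for f in self.FIELDS)

    def submit(self):
        if not self.can_submit():
            return 0.0  # no signal: most trial-and-error episodes end here
        # Sparse terminal reward: +1 only for the one correct combination.
        return 1.0 if self.values == self.goal else -1.0

goal = {"origin": "WTK", "destination": "LON", "date": "2016-10-21"}
env = FlightFormEnv(goal)
env.type_into("date", "2016-10-21")   # fields may be filled in any order
env.type_into("origin", "WTK")
r_early = env.submit()                # 0.0: form not yet complete
env.type_into("destination", "LON")
r_final = env.submit()                # 1.0: matches the instruction
```

With many airports and dates, only one of the combinatorially many terminal states returns the positive reward, which is why sparse-reward exploration is so inefficient here.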
Reinforcement learning with sparse rewards results in the majority of episodes generating no signal at all. The problem is exacerbated when learning from a large set of instructions, where visiting each option could be infeasible.

SUMMARY

The present disclosure is generally directed to methods, apparatus, and computer-readable media (transitory and non-transitory) for learning to automatically navigate interactive web documents and/or websites. More particularly, various approaches are presented for training various deep Q network (DQN) agents to perform various tasks associated with reinforcement learning, including hierarchical reinforcement learning, in challenging web navigation environments with sparse rewards and large state and action spaces. These agents include a web navigation agent that can use learned value function(s) to automatically navigate through interactive web documents, as well as a training agent, referred to herein as a “meta-trainer,” that can be trained to generate synthetic training examples. Some approaches described herein may be implemented when expert demonstrations are available. Other approaches described herein may be implemented when expert demonstrations are not available. In either case, dense, potential-based rewards may be used to augment the training. When expert demonstrations are available, curriculum learning may be employed to decompose a complex instruction into multiple, simpler sub-instructions. A web navigation agent configured with selected aspects of the present disclosure may be assigned incrementally larger subsets of these sub-instructions, until it ultimately uncovers the original complex instruction. When expert demonstrations are not available, the aforementioned meta-trainer may be used to generate goal states and instruction pairs with dense reward signals for the web navigation agent to train more efficiently.
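The dense, potential-based rewards mentioned above can be sketched with the standard shaping form F(s, s') = γ·Φ(s') − Φ(s). The particular potential below (fraction of goal fields already correct) is an illustrative assumption, not the disclosure's definition of Φ.

```python
def potential(state_fields, goal_fields):
    # Phi(s): fraction of the goal's key-value pairs already correct in s.
    correct = sum(1 for k, v in goal_fields.items()
                  if state_fields.get(k) == v)
    return correct / len(goal_fields)

def shaped_reward(env_reward, state, next_state, goal, gamma=0.99):
    # Potential-based shaping F(s, s') = gamma * Phi(s') - Phi(s). Adding F
    # to the sparse environment reward densifies the learning signal while
    # (by the standard shaping result) preserving the optimal policy.
    return env_reward + gamma * potential(next_state, goal) \
                      - potential(state, goal)

goal = {"origin": "WTK", "destination": "LON"}
s0 = {}                          # nothing filled in yet
s1 = {"origin": "WTK"}           # one of two goal fields now correct
r = shaped_reward(0.0, s0, s1, goal)   # 0.0 + 0.99 * 0.5 - 0.0
```

Under this potential, each correctly filled field yields an immediate positive shaped reward, so the agent receives signal well before the sparse terminal reward at form submission.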
Disclosed models outperform previous state-of-the-art models on challenging environments without using any human demonstration. In some implementations, a computer implemented method may be provided that includes: determining a natural language input, wherein the natural language input comprises a command to perform a task; analyzing the natural language input to identify one or more key-value pairs; retrieving an interactive web document that is operable via a graphical user interface (“GUI”) to perform the task, wherein the interactive web document includes one or more constituent interactive elements that are operable to input one or more values of the one or more key-value pairs; encoding the one or more key-value pairs into one or more instruction feature vectors; encoding overlapping content between the one or more key-value pairs and the one or more interactive elements into one or more overlap feature vectors; encoding the one or more interactive elements of the interactive web document into one or more interactive element feature vectors; conditioning the one or more interactive element feature vectors