
US-20260124744-A1 - METHOD FOR EXECUTING NATURAL LANGUAGE INSTRUCTIONS BY AI AGENT CAPABLE OF PRE-EMPTIVELY REVISING ACTIONS USING ENVIRONMENTAL FEEDBACKS AND AI AGENT USING THE SAME

US 20260124744 A1

Abstract

Disclosed is a method for executing natural language instructions by pre-emptively revising actions using environmental feedbacks. The method includes steps of: (a) in response to receiving natural language instructing data, inputting the natural language instructing data into an LLM to generate initial task-relevant contexts corresponding to the natural language instructing data, and then generating an initial action plan; (b) for a j-th initial action included in the initial action plan, (i) instructing a semantic data module to generate semantic data, and (ii) inputting the j-th initial action and the semantic data into an environmental feedback module to compare actual information with expected information and thus generate feedback information; and (c) inputting the feedback information, a current action plan, and a system prompt into the LLM to generate one or more revised task-relevant contexts and then generate a revised action plan.
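The plan–feedback–revise loop summarized in steps (a) through (c) can be sketched as follows. This is a minimal illustrative sketch, not the disclosed implementation; every name here (`execute_instruction`, `Feedback`, and the callables passed in) is a hypothetical stand-in for the planner, semantic data module, and environmental feedback module described in the abstract.

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    """Result of comparing expected vs. actual information (hypothetical)."""
    mismatch: bool
    info: str = ""

def execute_instruction(instruction, generate_plan, get_semantic_data,
                        compare, perform, revise_plan):
    # Step (a): derive an initial action plan from the instruction.
    plan = generate_plan(instruction)
    executed = []
    j = 0
    while j < len(plan):
        # Step (b)(i): semantic data from the current observation.
        semantic_data = get_semantic_data()
        # Step (b)(ii): environmental feedback for the j-th action.
        fb = compare(plan[j], semantic_data)
        if fb.mismatch:
            # Step (c): pre-emptively revise the remaining actions
            # before executing the mismatching one.
            plan = plan[:j] + revise_plan(fb, plan[j:])
            continue
        perform(plan[j])
        executed.append(plan[j])
        j += 1
    return executed
```

The key property of the loop is that the mismatching action is never performed: the remaining plan (the "current action plan" of step (c)) is rewritten first, and execution resumes from the revised actions.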

Inventors

  • Jonghyun Choi
  • Byeonghwi KIM
  • Cheolhong Min
  • Jinyeon Kim

Assignees

  • SEOUL NATIONAL UNIVERSITY R&DB FOUNDATION
  • UIF (UNIVERSITY INDUSTRY FOUNDATION), YONSEI UNIVERSITY

Dates

Publication Date
2026-05-07
Application Date
2024-12-19
Priority Date
2024-11-05

Claims (10)

  1. A method for executing natural language instructions by an AI agent capable of pre-emptively revising actions using environmental feedbacks, comprising steps of: (a) in response to receiving one or more natural language instructing data as the natural language instructions, the AI agent (i) inputting the natural language instructing data into an LLM (Large Language Model), to thereby instruct the LLM to perform a learning operation on the natural language instructing data and thus generate one or more initial task-relevant contexts corresponding to the natural language instructing data, and (ii) instructing an initial action planner to generate an initial action plan including a first initial action to an n-th initial action by referring to the initial task-relevant contexts, wherein n is an integer equal to or greater than 1; (b) for a j-th initial action among the first initial action to the n-th initial action, while increasing j from 1 to n, the AI agent (i) instructing a semantic data module to perform a learning operation on at least one image acquired from a current direction of the AI agent, to thereby generate at least one semantic data, and (ii) inputting the j-th initial action and the semantic data into an environmental feedback module, to thereby instruct the environmental feedback module to compare actual information and expected information which were acquired during performing the j-th initial action and thus generate feedback information; and (c) the AI agent (i) inputting (1) the feedback information, (2) a current action plan including the j-th initial action to the n-th initial action, and (3) a system prompt capable of providing guidance for tasks into the LLM, to thereby generate one or more revised task-relevant contexts and (ii) instructing a revision action planner to generate a revised action plan including a first revised action to an m-th revised action by referring to the revised task-relevant contexts, and thus execute the natural language instructions according to the revised action plan.
  2. The method of claim 1, wherein, at the step of (a), the initial task-relevant context corresponding to the natural language instructing data includes information on a task type, at least one target object, at least one related object, a target object expected location, a target object expected appearance class, a target object expected attribute class, and an expected relationship between the target object and the related object.
  3. The method of claim 2, wherein, at the step of (b), the AI agent instructs the environmental feedback module to use at least one of (i) information on (1) a target object detected location, (2) a target object detected appearance class, (3) a target object detected attribute class, and (4) a detected relationship between the target object and the related object, as the actual information, and at least one of (ii) information on (5) the target object expected location, (6) the target object expected appearance class, (7) the target object expected attribute class, and (8) the expected relationship between the target object and the related object, as the expected information.
  4. The method of claim 3, wherein, at the step of (b), the AI agent instructs the environmental feedback module to perform at least one sub-process of (i) generating first feedback information by comparing (1) the target object expected location that is probabilistically acquired from the LLM with (2) the target object detected location that is acquired by referring to the semantic data, (ii) generating second feedback information by comparing (1) the target object expected appearance class that is set to be a target object class with (2) (2-1) an (i_0)-th target object detected appearance class that is the target object detected appearance class initially acquired and (2-2) an (i_1)-st target object detected appearance class to an (i_D)-th target object detected appearance class that are the target object detected appearance classes acquired from a first direction to a D-th direction, wherein D is an integer greater than or equal to 1, (iii) generating third feedback information by comparing (1) the target object expected attribute class that is set to be an opposite attribute class of an intended attribute class included in the natural language instructing data with (2) the target object detected attribute class, and (iv) generating fourth feedback information by comparing (1) the expected relationship that is set to be an opposite state of an intended relationship included in the natural language instructing data with (2) the detected relationship.
  5. The method of claim 1, wherein the semantic data module (i) generates a depth map corresponding to an entire environment by referring to spatial information of the entire environment from the image, (ii) acquires each object mask corresponding to at least part of all objects included in the entire environment, (iii) back-projects the depth map and said each object mask into 3D coordinates to thereby generate a semantic spatial map, and (iv) generates the semantic data by referring to the semantic spatial map.
  6. An AI agent for executing natural language instructions capable of pre-emptively revising actions using environmental feedbacks, comprising: at least one memory that stores instructions; and at least one processor configured to execute the instructions to perform processes of: (I) in response to receiving one or more natural language instructing data as the natural language instructions, (i) inputting the natural language instructing data into an LLM (Large Language Model), to thereby instruct the LLM to perform a learning operation on the natural language instructing data and thus generate one or more initial task-relevant contexts corresponding to the natural language instructing data, and (ii) instructing an initial action planner to generate an initial action plan including a first initial action to an n-th initial action by referring to the initial task-relevant contexts, wherein n is an integer equal to or greater than 1; (II) for a j-th initial action among the first initial action to the n-th initial action, while increasing j from 1 to n, (i) instructing a semantic data module to perform a learning operation on at least one image acquired from a current direction of the AI agent, to thereby generate at least one semantic data, and (ii) inputting the j-th initial action and the semantic data into an environmental feedback module, to thereby instruct the environmental feedback module to compare actual information and expected information which were acquired during performing the j-th initial action and thus generate feedback information; and (III) (i) inputting (1) the feedback information, (2) a current action plan including the j-th initial action to the n-th initial action, and (3) a system prompt capable of providing guidance for tasks into the LLM, to thereby generate one or more revised task-relevant contexts and (ii) instructing a revision action planner to generate a revised action plan including a first revised action to an m-th revised action by referring to the revised task-relevant contexts, and thus execute the natural language instructions according to the revised action plan.
  7. The AI agent of claim 6, wherein, at the process of (I), the initial task-relevant context corresponding to the natural language instructing data includes information on a task type, at least one target object, at least one related object, a target object expected location, a target object expected appearance class, a target object expected attribute class, and an expected relationship between the target object and the related object.
  8. The AI agent of claim 7, wherein, at the process of (II), the processor instructs the environmental feedback module to use at least one of (i) information on (1) a target object detected location, (2) a target object detected appearance class, (3) a target object detected attribute class, and (4) a detected relationship between the target object and the related object, as the actual information, and at least one of (ii) information on (5) the target object expected location, (6) the target object expected appearance class, (7) the target object expected attribute class, and (8) the expected relationship between the target object and the related object, as the expected information.
  9. The AI agent of claim 8, wherein, at the process of (II), the processor instructs the environmental feedback module to perform at least one sub-process of (i) generating first feedback information by comparing (1) the target object expected location that is probabilistically acquired from the LLM with (2) the target object detected location that is acquired by referring to the semantic data, (ii) generating second feedback information by comparing (1) the target object expected appearance class that is set to be a target object class with (2) (2-1) an (i_0)-th target object detected appearance class that is the target object detected appearance class initially acquired and (2-2) an (i_1)-st target object detected appearance class to an (i_D)-th target object detected appearance class that are the target object detected appearance classes acquired from a first direction to a D-th direction, wherein D is an integer greater than or equal to 1, (iii) generating third feedback information by comparing (1) the target object expected attribute class that is set to be an opposite attribute class of an intended attribute class included in the natural language instructing data with (2) the target object detected attribute class, and (iv) generating fourth feedback information by comparing (1) the expected relationship that is set to be an opposite state of an intended relationship included in the natural language instructing data with (2) the detected relationship.
  10. The AI agent of claim 6, wherein the semantic data module (i) generates a depth map corresponding to an entire environment by referring to spatial information of the entire environment from the image, (ii) acquires each object mask corresponding to at least part of all objects included in the entire environment, (iii) back-projects the depth map and said each object mask into 3D coordinates to thereby generate a semantic spatial map, and (iv) generates the semantic data by referring to the semantic spatial map.
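Claims 5 and 10 describe building a semantic spatial map by back-projecting a depth map and per-object masks into 3D coordinates. Below is a minimal sketch of that back-projection step, assuming a pinhole camera with intrinsics fx, fy, cx, cy; the claims do not specify a camera model, and all names here are illustrative rather than taken from the disclosure.

```python
import numpy as np

def back_project(depth, masks, fx, fy, cx, cy):
    """Return {object_id: (N, 3) array of 3D points} from a depth map of
    shape (H, W) and boolean object masks {object_id: (H, W)}."""
    h, w = depth.shape
    # Pixel coordinate grids: u along the width, v along the height.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    # Standard pinhole back-projection of each pixel into camera space.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)  # (H, W, 3)
    # Keep, per object, only the 3D points covered by that object's mask.
    semantic_map = {}
    for obj_id, mask in masks.items():
        semantic_map[obj_id] = points[mask]
    return semantic_map
```

The resulting dictionary is one plausible form of the claimed "semantic spatial map": each detected object is associated with the 3D point cloud its mask covers, from which a detected location (e.g. the centroid) can be compared against the LLM's expected location.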

Description

CROSS REFERENCE TO RELATED APPLICATION

The present application claims the benefit of the earlier filing date of Korean non-provisional patent application No. 10-2024-0155713, filed on Nov. 5, 2024, the entire contents of which are incorporated herein by reference.

FIELD OF THE DISCLOSURE

The present disclosure relates to a method for executing natural language instructions by an AI agent capable of pre-emptively revising actions using environmental feedbacks and the AI agent using the same.

BACKGROUND OF THE DISCLOSURE

An AI secretary capable of following language instructions to perform menial tasks such as house chores is everyone's dream. However, in order for an artificial intelligence to achieve the above, the AI should navigate, interact with objects, and perform conversational inference in a visually rich 3D environment. Furthermore, it would be ideal for the AI to be able to navigate its environment, interact with the objects, and perform long-term tasks by following the language instructions based on ego-centric vision. Meanwhile, there are conventional AIs that can navigate their environments by following the natural language instructions; however, the conventional AIs do not consider environmental factors. Thus, the conventional AIs usually perform actions on wrong target objects or perform unnecessary actions, such as actions that have already been performed. Therefore, an improvement for solving this problem is required.

SUMMARY OF THE DISCLOSURE

It is an object of the present disclosure to solve all the aforementioned problems. It is another object of the present disclosure to provide an AI agent capable of instructing an environmental feedback module to compare actual information with expected information, to thereby generate feedback information and thus generate a revised action plan to be used to execute the natural language instructions. It is still another object of the present disclosure to provide the AI agent capable of instructing the environmental feedback module to reflect a location, an appearance, an attribute, and a relationship as environmental factors on the feedback information, to thereby prevent the AI agent from interacting with a wrong target object and performing unnecessary actions.

In order to accomplish the objects above, representative structures of the present disclosure are described as follows: In accordance with one aspect of the present disclosure, there is provided a method for executing natural language instructions by an AI agent capable of pre-emptively revising actions using environmental feedbacks, including steps of: (a) in response to receiving one or more natural language instructing data as the natural language instructions, the AI agent (i) inputting the natural language instructing data into an LLM (Large Language Model), to thereby instruct the LLM to perform a learning operation on the natural language instructing data and thus generate one or more initial task-relevant contexts corresponding to the natural language instructing data, and (ii) instructing an initial action planner to generate an initial action plan including a first initial action to an n-th initial action by referring to the initial task-relevant contexts, wherein n is an integer equal to or greater than 1; (b) for a j-th initial action among the first initial action to the n-th initial action, while increasing j from 1 to n, the AI agent (i) instructing a semantic data module to perform a learning operation on at least one image acquired from a current direction of the AI agent, to thereby generate at least one semantic data, and (ii) inputting the j-th initial action and the semantic data into an environmental feedback module, to thereby instruct the environmental feedback module to compare actual information and expected information which were acquired during performing the j-th initial action and thus generate feedback information; and (c) the AI agent (i) inputting (1) the feedback information, (2) a current action plan including the j-th initial action to the n-th initial action, and (3) a system prompt capable of providing guidance for tasks into the LLM, to thereby generate one or more revised task-relevant contexts and (ii) instructing a revision action planner to generate a revised action plan including a first revised action to an m-th revised action by referring to the revised task-relevant contexts, and thus execute the natural language instructions according to the revised action plan.

As one example, at the step of (a), the initial task-relevant context corresponding to the natural language instructing data includes information on a task type, at least one target object, at least one related object, a target object expected location, a target object expected appearance class, a target object expected attribute class, and an expected relationship between the target object and the related object. As one example, at the step of (b), the AI agent instructs the environ