KR-20260067244-A - METHOD FOR EXECUTING NATURAL LANGUAGE INSTRUCTIONS BY AI AGENT CAPABLE OF PRE-EMPTIVELY REVISING ACTIONS USING ENVIRONMENTAL FEEDBACKS AND AI AGENT USING THE SAME


Abstract

A method for executing instructions by an AI agent capable of preemptively modifying actions using environmental feedback information, comprising: (a) a step in which, when natural language instruction data is obtained as the instruction, the AI agent inputs the natural language instruction data into a Large Language Model (LLM), causes the LLM to perform a learning operation on the natural language instruction data to obtain initial task-relevant contexts corresponding to the natural language instruction data, and causes an initial action planner to generate an initial action plan including a first initial action to an n-th initial action corresponding to the initial task-relevant contexts, where n is an integer greater than or equal to 1; (b) a step in which, for a j-th initial action which is any one of the first initial action to the n-th initial action, the AI agent, while increasing j from 1 to n, causes a semantic data module to perform a learning operation on an image acquired from the current direction of the AI agent to acquire semantic data, and inputs the j-th initial action and the semantic data into an environment feedback module so that the environment feedback module generates feedback information by comparing actual information with predicted information according to the j-th initial action; and (c) a step in which the AI agent inputs the feedback information, a current action plan including the j-th initial action to the n-th initial action, and a system prompt capable of presenting a guide for the task into the LLM to acquire a context related to a modification task, causes a modification action planner to generate a modification action plan including a first modification action to an m-th modification action corresponding to the context related to the modification task, and executes the instruction according to the modification action plan.
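The control flow of steps (a) to (c) can be illustrated with a minimal sketch. It is not the implementation disclosed in this application: the function `execute_with_feedback`, the `demo_*` stand-ins for the LLM, planners, and perception, and the mug example are all hypothetical names chosen for illustration. The sketch also assumes that, once feedback is produced, the revised plan is executed without further checks.

```python
from typing import Callable, Dict, List, Optional

Plan = List[str]
Observation = Dict[str, str]


def execute_with_feedback(
    instruction: str,
    plan_initial: Callable[[str], Plan],
    observe: Callable[[], Observation],
    compare: Callable[[str, Observation], Optional[str]],
    replan: Callable[[str, Plan], Plan],
    act: Callable[[str], None],
) -> Plan:
    """Run steps (a)-(c) and return the actions that were actually executed."""
    plan = plan_initial(instruction)          # step (a): initial action plan
    executed: Plan = []
    for j, action in enumerate(plan):         # step (b): j runs over the plan
        semantic = observe()                  # semantic data from the current view
        feedback = compare(action, semantic)  # None means no mismatch was reported
        if feedback is not None:
            # step (c): revise the remaining actions (j-th to n-th) before acting
            for revised_action in replan(feedback, plan[j:]):
                act(revised_action)
                executed.append(revised_action)
            return executed
        act(action)
        executed.append(action)
    return executed


# --- Hypothetical stand-ins for the LLM, planners, and perception ---
def demo_plan_initial(instruction: str) -> Plan:
    return ["go to mug", "pick up mug", "wash mug", "put mug in coffee maker"]


def demo_observe() -> Observation:
    return {"mug": "clean"}  # the mug turns out to be clean already


def demo_compare(action: str, semantic: Observation) -> Optional[str]:
    predicted = "dirty"  # state the plan assumes before "wash mug"
    if action == "wash mug" and semantic.get("mug") != predicted:
        return "mug is already clean"
    return None


def demo_replan(feedback: str, remaining: Plan) -> Plan:
    # A real planner would query the LLM; here the redundant action is dropped.
    return [a for a in remaining if a != "wash mug"]


log: Plan = []
result = execute_with_feedback(
    "Put a clean mug in the coffee maker",
    demo_plan_initial, demo_observe, demo_compare, demo_replan, log.append,
)
print(result)
```

In this toy run the mismatch is detected before "wash mug" is attempted, so the redundant action is dropped from the plan rather than executed and then corrected.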

Inventors

최종현
김병휘
민철홍
김진연

Assignees

서울대학교산학협력단
연세대학교 산학협력단

Dates

Publication Date: 20260512
Application Date: 20241105

Claims (10)

1. A method for executing instructions by an AI agent capable of preemptively modifying actions using environmental feedback information, the method comprising: (a) when natural language instruction data is obtained as the instruction, the AI agent inputting the natural language instruction data into a Large Language Model (LLM), causing the LLM to perform a learning operation on the natural language instruction data to obtain initial task-relevant contexts corresponding to the natural language instruction data, and causing an initial action planner to generate an initial action plan including a first initial action to an n-th initial action corresponding to the initial task-relevant contexts, where n is an integer greater than or equal to 1; (b) for a j-th initial action which is any one of the first to n-th initial actions, the AI agent, while increasing j from 1 to n, causing a semantic data module to perform a learning operation on an image acquired from the current direction of the AI agent to acquire semantic data, and inputting the j-th initial action and the semantic data into an environment feedback module to cause the environment feedback module to generate feedback information by comparing actual information with predicted information according to the j-th initial action; and (c) the AI agent inputting the feedback information, a current action plan including the j-th initial action to the n-th initial action, and a system prompt capable of presenting a guide for the task into the LLM to obtain a context related to a modification task, and causing a modification action planner to generate a modification action plan including a first modification action to an m-th modification action corresponding to the context related to the modification task, thereby executing the instruction according to the modification action plan.
2. The method of claim 1, wherein, in step (a), the initial task-relevant contexts corresponding to the natural language instruction data include task type information, a target object, at least one related target object, predicted location information for the target object, predicted appearance class information for the target object, predicted attribute class information for the target object, and predicted relationship information between the target object and the related target object.
3. The method of claim 2, wherein, in step (b), the AI agent causes the environment feedback module to use, as the actual information and the predicted information, (i) actual location information and predicted location information of the target object, (ii) actual appearance class information and predicted appearance class information of the target object, (iii) actual attribute class information and predicted attribute class information of the target object, and (iv) actual relationship information and predicted relationship information of the target object.
4. The method of claim 3, wherein, in step (b), the AI agent performs at least some of sub-processes of causing the environment feedback module to: (i) generate first feedback information by using location information for the target object probabilistically obtained from the LLM as the predicted location information and comparing the actual location obtained from the semantic data with the predicted location; (ii) generate second feedback information by using the class information of the target object as the predicted appearance class information and comparing the predicted appearance class information with the i_1 actual appearance class information obtained from the appearance of the target object in first to D-th directions with respect to the target object, where D is an integer greater than or equal to 1; (iii) generate third feedback information by setting attribute class information opposite to the target attribute class information included in the natural language instruction data as the predicted attribute class information and comparing the actual attribute class information with the predicted attribute class information; and (iv) generate fourth feedback information by setting a state opposite to the target relationship information included in the natural language instruction data as the predicted relationship information and comparing the actual relationship information with the predicted relationship information.
5. The method of claim 1, wherein the semantic data module generates a depth map corresponding to the entire environment by referencing spatial information about the entire environment for the image, acquires an object mask corresponding to each of at least some of all objects within the entire environment, generates a semantic spatial map by back-projecting each of the object masks and the depth map into 3D coordinates, and generates the semantic data using the semantic spatial map.
6. An AI agent that executes instructions and is capable of preemptively modifying actions using environmental feedback information, the AI agent comprising: one or more memories storing instructions; and one or more processors configured to execute the instructions, wherein the processor performs: (I) a process of, when natural language instruction data is obtained as the instruction, inputting the natural language instruction data into a Large Language Model (LLM), causing the LLM to perform a learning operation on the natural language instruction data to obtain initial task-relevant contexts corresponding to the natural language instruction data, and causing an initial action planner to generate an initial action plan including a first initial action to an n-th initial action corresponding to the initial task-relevant contexts, where n is an integer greater than or equal to 1; (II) a process of, for a j-th initial action which is any one of the first initial action to the n-th initial action, while increasing j from 1 to n, causing a semantic data module to perform a learning operation on an image obtained from the current direction of the AI agent to obtain semantic data, and inputting the j-th initial action and the semantic data into an environment feedback module to cause the environment feedback module to generate feedback information by comparing actual information with predicted information according to the j-th initial action; and (III) a process of inputting the feedback information, a current action plan including the j-th initial action to the n-th initial action, and a system prompt capable of presenting a guide for the task into the LLM to obtain a context related to a modification task, and causing a modification action planner to generate a modification action plan including a first modification action to an m-th modification action corresponding to the context related to the modification task, thereby executing the instruction according to the modification action plan.
7. The AI agent of claim 6, wherein, in process (I), the initial task-relevant contexts corresponding to the natural language instruction data include task type information, a target object, at least one related target object, predicted location information for the target object, predicted appearance class information for the target object, predicted attribute class information for the target object, and predicted relationship information between the target object and the related target object.
8. The AI agent of claim 7, wherein, in process (II), the processor causes the environment feedback module to use, as the actual information and the predicted information, (i) actual location information and predicted location information of the target object, (ii) actual appearance class information and predicted appearance class information of the target object, (iii) actual attribute class information and predicted attribute class information of the target object, and (iv) actual relationship information and predicted relationship information of the target object.
9. The AI agent of claim 8, wherein, in process (II), the processor performs at least some of sub-processes of causing the environment feedback module to: (i) generate first feedback information by using location information for the target object probabilistically obtained from the LLM as the predicted location information and comparing the actual location obtained from the semantic data with the predicted location; (ii) generate second feedback information by using class information of the target object as the predicted appearance class information and comparing the predicted appearance class information with the i_1 actual appearance class information obtained from the appearance of the target object in first to D-th directions with respect to the target object, where D is an integer greater than or equal to 1; (iii) generate third feedback information by setting attribute class information opposite to the target attribute class information included in the natural language instruction data as the predicted attribute class information and comparing the actual attribute class information with the predicted attribute class information; and (iv) generate fourth feedback information by setting a state opposite to the target relationship information included in the natural language instruction data as the predicted relationship information and comparing the actual relationship information with the predicted relationship information.
10. The AI agent of claim 6, wherein the semantic data module generates a depth map corresponding to the entire environment by referencing spatial information about the entire environment for the image, acquires an object mask corresponding to each of at least some of all objects within the entire environment, generates a semantic spatial map by back-projecting each of the object masks and the depth map into 3D coordinates, and generates the semantic data using the semantic spatial map.
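The back-projection recited for the semantic data module can be sketched as follows. The claims do not specify a camera model, so this example assumes a standard pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`; the function names `back_project_masks` and `centroid`, the camera-frame output, and the tiny 2x2 inputs are likewise assumptions made for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def back_project_masks(
    depth: Sequence[Sequence[float]],
    masks: Dict[str, Sequence[Sequence[bool]]],
    fx: float, fy: float, cx: float, cy: float,
) -> Dict[str, List[Point3D]]:
    """Lift each masked pixel (u, v) with depth z to camera-frame 3D coordinates."""
    semantic_map: Dict[str, List[Point3D]] = {}
    for label, mask in masks.items():
        points: List[Point3D] = []
        for v, mask_row in enumerate(mask):
            for u, inside in enumerate(mask_row):
                z = depth[v][u]
                if inside and z > 0:  # skip pixels outside the mask or without depth
                    points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
        semantic_map[label] = points
    return semantic_map


def centroid(points: List[Point3D]) -> Point3D:
    """Mean 3D position of an object's points (assumes a non-empty list)."""
    n = len(points)
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )


# Toy inputs: a 2x2 depth map and one object mask covering the right column.
depth_map = [[2.0, 2.0], [2.0, 4.0]]
object_masks = {"mug": [[False, True], [False, True]]}
semantic_map = back_project_masks(
    depth_map, object_masks, fx=1.0, fy=1.0, cx=0.0, cy=0.0
)
mug_position = centroid(semantic_map["mug"])
print(mug_position)
```

A per-object centroid such as `mug_position` is one simple way the resulting map could supply the actual location that the environment feedback module compares against the predicted location; the application itself does not state how that location is derived.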

Description

Method for executing natural language instructions by an AI agent capable of preemptively revising actions using environmental feedback information and an AI agent using the same
The present invention relates to a method for executing instructions by an AI agent capable of preemptively modifying actions using environmental feedback information, and to an AI agent using the same.
Everyone dreams of having an AI assistant that can perform tedious tasks, such as housework, based on verbal instructions. However, for an AI to perform these tasks on our behalf, it must be capable of navigation, object interaction, and conversational reasoning in visually rich 3D environments. Going a step further, it would be even better if the AI could navigate the environment based on egocentric vision, interact with objects according to natural language instructions, and perform long-term tasks.
Conventional technology in which an AI explores the environment and attempts to interact with objects based on natural language instructions does not consider environmental factors. As a result, it performs unnecessary actions, such as carrying out a task on the wrong target or repeating a task that has already been completed. Therefore, improvements are required to resolve these problems.
The drawings attached below for use in describing embodiments of the present invention are merely some of the embodiments of the present invention, and other drawings can be obtained based on these drawings without inventive work by a person skilled in the art to which the present invention pertains (hereinafter "person skilled in the art").
FIG. 1 is a diagram showing the configuration of an AI agent including an initial action planner, a modification action planner, and an environment feedback module according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a method for executing instructions by an AI agent including an initial action planner, a modification action planner, and an environment feedback module according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the process by which an AI agent according to an embodiment of the present invention extracts task-relevant contexts from natural language instruction data and generates an initial action plan.
FIG. 4 is a diagram showing the detailed configuration of an environment feedback module included in an AI agent according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating the process by which a modification action planner included in an AI agent according to an embodiment of the present invention generates a modification action plan.
The following detailed description of the invention refers to the accompanying drawings, which illustrate specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the invention, although different from one another, need not be mutually exclusive. For example, specific shapes, structures, and characteristics described herein with respect to one embodiment may be implemented in other embodiments without departing from the spirit and scope of the invention. It should also be understood that the location or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. Accordingly, the following detailed description is not intended to be limiting, and the scope of the invention, if properly described, is limited only by the appended claims together with all equivalents of what those claims recite. Similar reference numerals in the drawings refer to the same or similar functions across various aspects.
Hereinafter, in order to enable a person skilled in the art to easily practice the present invention, preferred embodiments of the present invention will be described in detail with reference to the attached drawings.
FIG. 1 is a diagram showing the configuration of an AI agent (100) including an initial action planner (500), an environment feedback module (600), and a modification action planner (700) according to one embodiment of the present invention. Referring to FIG. 1, the AI agent (100) may include the initial action planner (500), the environment feedback module (600), and the modification action planner (700). Here, the input/output and computation of the initial action planner (500), the environment feedback module (600), and the modification action planner (700) may be performed by a communication unit (110) and a processor (120), respectively. However, in FIG. 1, the specific connection relationship between the communication unit (110) and the processor (120) has been omitted. Additionally, the memory (115) may be in a state where various instructions to be described later a