KR-20260063651-A - A DEVICE AND METHOD FOR PROCESSING TASKS THROUGH MODEL-BASED OFFLINE LEARNING
Abstract
The present invention relates to a task processing device and method through model-based offline learning. The device comprises: a dataset input unit that receives an offline dataset for offline reinforcement learning; an initialization unit that initializes a world model, which can predict state transitions and rewards without interacting with the real environment, and a model-generated dataset; a model rollout unit that generates an imaginary trajectory by extending from offline data of the offline dataset based on the world model and generates imaginary data for the model-generated dataset; and a learning update unit that performs critic updates and actor updates based on the offline data and the imaginary data.
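For illustration, the sketch below shows one way these four units could be organized as a Python training loop. It is a minimal sketch under assumed interfaces: the class name, `world_model.predict`, `policy.act`, and the `update` methods are hypothetical and are not taken from the patent.

```python
# Hypothetical organization of the four units in the abstract; every name and
# interface here is an illustrative assumption, not the patented implementation.
import random


class TaskProcessingDevice:
    def __init__(self, world_model, policy, critic):
        self.world_model = world_model  # predicts (reward, next_state) pairs
        self.policy = policy            # actor
        self.critic = critic            # value estimator
        self.offline_dataset = []
        self.model_dataset = []         # model-generated (imaginary) dataset

    def input_dataset(self, offline_dataset):
        # Dataset input unit: receive the offline reinforcement learning dataset.
        self.offline_dataset = offline_dataset

    def initialize(self):
        # Initialization unit: start from an empty model-generated dataset.
        self.model_dataset.clear()

    def rollout(self, horizon):
        # Model rollout unit: extend one offline sample into an imaginary
        # trajectory using the world model's predictions.
        state, *_ = random.choice(self.offline_dataset)
        for _ in range(horizon):
            action = self.policy.act(state)
            reward, next_state = self.world_model.predict(state, action)
            self.model_dataset.append((state, action, reward, next_state))
            state = next_state

    def update(self):
        # Learning update unit: critic update on both data sources, actor
        # update on the imaginary data (see claims 8 and 9 below).
        self.critic.update(self.offline_dataset + self.model_dataset)
        self.policy.update(self.model_dataset)
```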
Inventors
- 이영운
- 박관영
Assignees
- 연세대학교 산학협력단
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-10-30
Claims (10)
- A task processing device through model-based offline learning, comprising: a dataset input unit that receives an offline dataset for offline reinforcement learning; an initialization unit that initializes a world model, which can predict state transitions and rewards without interacting with the real environment, and a model-generated dataset; a model rollout unit that generates an imaginary trajectory by extending from offline data of the offline dataset based on the world model and generates imaginary data for the model-generated dataset; and a learning update unit that performs critic updates and actor updates based on the offline data and the imaginary data.
- The task processing device through model-based offline learning of claim 1, wherein the dataset input unit receives, through the offline dataset, offline data composed of state-action-reward-next-state tuples.
- The task processing device through model-based offline learning of claim 1, wherein the initialization unit trains the world model on the offline dataset and pre-trains a policy for selecting a specific action and a value of the specific action, both of which constitute the world model.
- The task processing device through model-based offline learning of claim 2, wherein the initialization unit empties the model-generated dataset so that the imaginary data can be added to the model-generated dataset.
- The task processing device through model-based offline learning of claim 1, wherein the model rollout unit selects arbitrary offline data from the offline dataset and performs the extension by hypothesizing states and actions for the arbitrary offline data.
- The task processing device through model-based offline learning of claim 5, wherein the model rollout unit generates the imaginary trajectory by predicting a reward and a next state based on the hypothesized states and actions.
- The task processing device through model-based offline learning of claim 6, wherein the model rollout unit determines the imaginary data on the imaginary trajectory and adds the imaginary data to the model-generated dataset.
- The task processing device through model-based offline learning of claim 1, wherein the learning update unit learns, through the critic update, an expected reward function based on the states and actions of the offline data and the imaginary data.
- The task processing device through model-based offline learning of claim 1, wherein the learning update unit learns, through the actor update, a policy that selects an optimal action based solely on the imaginary data.
- A task processing method through model-based offline learning, performed in a task processing device through model-based offline learning, the method comprising: a dataset input step of receiving an offline dataset for offline reinforcement learning; an initialization step of initializing a world model, which can predict state transitions and rewards without interacting with the real environment, and a model-generated dataset; a model rollout step of generating an imaginary trajectory by extending from offline data of the offline dataset based on the world model and generating imaginary data for the model-generated dataset; and a learning update step of performing critic updates and actor updates based on the offline data and the imaginary data.
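To make claims 8 and 9 concrete, the following is a hedged sketch of one possible critic/actor update pair. The network architecture, optimizer usage, and loss forms are assumptions chosen for illustration; they are not specified by the claims.

```python
# Hedged sketch of the critic update (claim 8) and actor update (claim 9).
# Shapes, hidden sizes, and loss forms are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Critic(nn.Module):
    """Q(s, a) network; the architecture is illustrative."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, states, actions):
        return self.net(torch.cat([states, actions], dim=-1)).squeeze(-1)


def critic_update(critic, opt, states, actions, targets):
    # Claim 8: fit the expected reward (return) function on states and actions
    # drawn from both the offline data and the imaginary data.
    loss = F.mse_loss(critic(states, actions), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


def actor_update(actor, critic, opt, imaginary_states):
    # Claim 9: update the policy using only imaginary data, pushing it toward
    # actions to which the critic assigns high value.
    loss = -critic(imaginary_states, actor(imaginary_states)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

A deterministic actor is assumed here for brevity; a stochastic policy with sampled actions would fit the same update scheme.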
Description
A Device and Method for Processing Tasks Through Model-Based Offline Learning

The present invention relates to task processing technology and, more specifically, to a task processing device and method through model-based offline learning that generates an imaginary trajectory from offline data and performs critic updates and actor updates based on the offline data and the imaginary data.

Offline reinforcement learning aims to solve reinforcement learning problems using only pre-collected datasets, and it can outperform behavioral cloning policies. Although off-policy reinforcement learning algorithms can be applied directly to a fixed dataset, off-policy methods suffer from overestimating Q-values for actions not observed in the offline dataset. This is because, in offline reinforcement learning, the overestimated value function is never corrected through interaction with the online environment.

To address this, model-free offline reinforcement learning algorithms mitigate Q-value overestimation either by constraining the policy to output only actions present in the offline data or by adopting conservative value estimates when executing actions that differ from those in the dataset. However, while these algorithms perform strongly on standard offline reinforcement learning benchmarks, model-free offline reinforcement learning policies tend to be confined to the support of the data (i.e., the state-action pairs in the offline dataset) and may have limited generalization capability.

Accordingly, model-based offline reinforcement learning approaches attempt to overcome these limitations by using the limited offline data to train a world model and then generating virtual data, including actions outside the support of the data, through the trained model; similar to Dyna-style online model-based reinforcement learning, the policy can be trained on both offline data and model rollouts. However, such models can be inaccurate for states and actions outside the support of the data, which can leave the resulting policies easy to exploit.

FIG. 1 is a diagram illustrating a Lower Expectile Q-learning (LEQ) algorithm according to one embodiment of the present invention. FIG. 2 is a diagram illustrating the functional configuration of a task processing device through model-based offline learning according to an embodiment of the present invention. FIG. 3 is a flowchart illustrating a task processing method through model-based offline learning according to an embodiment of the present invention. FIG. 4 is a diagram illustrating an LEQ learning algorithm using the λ-return according to an embodiment of the present invention. FIG. 5 is a diagram illustrating the λ-return and the critic update according to an embodiment of the present invention. FIGS. 6(a) and 6(b) are diagrams illustrating MuJoCo locomotion tasks and AntMaze tasks according to an embodiment of the present invention.
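Because the figures reference Lower Expectile Q-learning (LEQ) and the λ-return, the sketch below illustrates how those two quantities are commonly computed. The recursion G_t = r_t + γ((1 − λ)V(s_{t+1}) + λG_{t+1}) and the asymmetric expectile loss are standard formulations; the parameter values (γ, λ, τ) are assumptions, not values quoted from the patent.

```python
# Illustrative computation of the lambda-return and a lower-expectile critic
# loss; parameter values are common defaults, not the patent's.
import numpy as np


def lambda_return(rewards, values, gamma=0.99, lam=0.95):
    """G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}), computed
    backward over an imaginary trajectory of length T.
    rewards: shape (T,); values: shape (T + 1,), including a bootstrap value."""
    T = len(rewards)
    returns = np.empty(T)
    next_return = values[-1]  # bootstrap from the final value estimate
    for t in reversed(range(T)):
        next_return = rewards[t] + gamma * (
            (1.0 - lam) * values[t + 1] + lam * next_return)
        returns[t] = next_return
    return returns


def lower_expectile_loss(q_pred, target, tau=0.1):
    """Expectile regression loss |tau - 1(u < 0)| * u**2 with u = target - q_pred.
    With tau < 0.5, overestimation is penalized more heavily than
    underestimation, yielding a conservative (lower-expectile) value estimate."""
    diff = target - q_pred
    weight = np.where(diff < 0.0, 1.0 - tau, tau)  # asymmetric weighting
    return np.mean(weight * diff ** 2)
```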
The description of the present invention is merely an example for structural or functional explanation, and therefore the scope of the present invention should not be interpreted as being limited by the embodiments described herein. That is, since the embodiments are subject to various modifications and may take various forms, the scope of the present invention should be understood to include equivalents capable of realizing the technical concept. Furthermore, the objectives or effects presented in the present invention do not imply that a specific embodiment must include all of them or only such effects; the scope of the present invention should therefore not be understood as being limited by them.

Meanwhile, the meanings of the terms used in this application should be understood as follows. Terms such as "first" and "second" are intended to distinguish one component from another, and the scope of rights shall not be limited by these terms. For example, a first component may be named a second component and, similarly, a second component may be named a first component. When one component is said to be "connected" to another component, it may be directly connected to that other component, or intervening components may be present. In contrast, when one component is said to be "directly connected" to another component, it should be understood that no intervening components are present. Other expressions describing relationships between components, such as "between" and "immediately between," or "adjacent to" and "directly adjacent to," should be interpreted in the same way.

A singular expression should be understood to include the plural unless the context clearly indicates otherwise, and terms such as "include" or "have" are intended to specify the existence of stated features, numbers, steps, operations, components, parts, or combinations thereof, and should not be understood to preclude in advance the possibility of the existence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.