US-20260127443-A1 - METHOD, APPARATUS, AND SYSTEM FOR REINFORCEMENT LEARNING USING OFFLINE DATA
Abstract
A system for reinforcement learning includes at least one processor, and at least one memory storing at least one instruction that, when executed by the at least one processor, is configured to: perform offline reinforcement learning; and perform online reinforcement learning. The performing of the offline reinforcement learning includes identifying a data-retained region and a data-unretained region, and reducing a Q-value estimated in the data-unretained region.
Inventors
- Jeong Hye KIM
- Yong Jae SHIN
- Kang Hoon Lee
- Whi Young JUNG
- Sung Hoon Hong
- Deun Sol YOON
- Woohyung Lim
Assignees
- LG MANAGEMENT DEVELOPMENT INSTITUTE CO., LTD.
Dates
- Publication Date
- 20260507
- Application Date
- 20251228
- Priority Date
- 20240913
Claims (20)
- 1 . A system for reinforcement learning, comprising: at least one processor; and at least one memory storing at least one instruction that, when executed by the at least one processor, is configured to: perform offline reinforcement learning; and perform online reinforcement learning, wherein the performing of the offline reinforcement learning includes: identifying a data-retained region and a data-unretained region; and reducing a Q-value estimated for the data-unretained region.
- 2 . The system of claim 1 , wherein the reducing of the Q-value estimated for the data-unretained region includes utilizing reward scaling that multiplies a reward function used in the offline reinforcement learning by a constant c reward greater than 1 to reduce the Q-value.
- 3 . The system of claim 1 , wherein the reducing of the Q-value estimated for the data-unretained region includes: performing reward scaling that multiplies a reward function used in the offline reinforcement learning by a constant c reward greater than 1; performing layer normalization using a reward obtained by the reward scaling as an input; and learning a critic ensemble including a plurality of critic networks in which the layer normalization is performed.
- 4 . The system of claim 2 , wherein the performing of the online reinforcement learning includes performing online fine-tuning by utilizing reward scaling that multiplies a reward function of a replay buffer used in the online reinforcement learning by the constant c reward .
- 5 . The system of claim 2 , wherein the constant c reward is set to a value of 10 or greater.
- 6 . The system of claim 1 , wherein the reducing of the Q-value estimated for the data-unretained region includes penalizing that sets the Q-value for the data-unretained region to be equal to or less than a predetermined value.
- 7 . The system of claim 6 , wherein the penalizing includes: calculating a penalty loss; calculating a temporal-difference (TD) loss; determining a first loss based on the penalty loss and the TD loss; and performing at least one of the offline reinforcement learning and the online reinforcement learning based on the first loss.
- 8 . The system of claim 7 , wherein the first loss is determined by adding, to the TD loss, a value obtained by multiplying the penalty loss by a weight value.
- 9 . A method for reinforcement learning performed by at least one processor, the method comprising: performing offline reinforcement learning; and performing online reinforcement learning, wherein the performing of the offline reinforcement learning includes: identifying a data-retained region and a data-unretained region; and reducing a Q-value estimated for the data-unretained region.
- 10 . The method of claim 9 , wherein the reducing of the Q-value estimated for the data-unretained region includes utilizing reward scaling that multiplies a reward function used in the offline reinforcement learning by a constant c reward greater than 1 to reduce the Q-value.
- 11 . The method of claim 9 , wherein the reducing of the Q-value estimated for the data-unretained region includes: performing reward scaling that multiplies a reward function used in the offline reinforcement learning by a constant c reward greater than 1; performing layer normalization using a reward obtained by the reward scaling as an input; and learning a critic ensemble including a plurality of critic networks in which layer normalization is performed.
- 12 . The method of claim 10 , wherein the performing of the online reinforcement learning includes performing online fine-tuning by utilizing reward scaling that multiplies a reward function of a replay buffer used in the online reinforcement learning by the constant c reward .
- 13 . The method of claim 9 , wherein the reducing of the Q-value estimated for the data-unretained region includes penalizing that sets the Q-value for the data-unretained region to be equal to or less than a predetermined value.
- 14 . The method of claim 13 , wherein the penalizing includes: calculating a penalty loss; calculating a temporal-difference (TD) loss; determining a first loss based on the penalty loss and the TD loss; and performing at least one of the offline reinforcement learning and the online reinforcement learning based on the first loss, and wherein the first loss is determined by adding, to the TD loss, a value obtained by multiplying the penalty loss by a weight value.
- 15 . A computer-readable recording medium including at least one program for executing a method, the method comprising: performing offline reinforcement learning; and performing online reinforcement learning, wherein the performing of the offline reinforcement learning includes: identifying a data-retained region and a data-unretained region; and reducing a Q-value estimated for the data-unretained region.
- 16 . The method of claim 9 , wherein the reducing of the Q-value estimated for the data-unretained region includes utilizing reward scaling that multiplies a reward function used in the offline reinforcement learning by a constant c reward greater than 1 to reduce the Q-value.
- 17 . The method of claim 9 , wherein the reducing of the Q-value estimated for the data-unretained region includes: performing reward scaling that multiplies a reward function used in the offline reinforcement learning by a constant c reward greater than 1; performing layer normalization using a reward obtained by the reward scaling as an input; and learning a critic ensemble including a plurality of critic networks in which layer normalization is performed.
- 18 . The method of claim 10 , wherein the performing of the online reinforcement learning includes performing online fine-tuning by utilizing reward scaling that multiplies a reward function of a replay buffer used in the online reinforcement learning by the constant c reward .
- 19 . The method of claim 9 , wherein the reducing of the Q-value estimated for the data-unretained region includes penalizing that sets the Q-value for the data-unretained region to be equal to or less than a predetermined value.
- 20 . The method of claim 13 , wherein the penalizing includes: calculating a penalty loss; calculating a temporal-difference (TD) loss; determining a first loss based on the penalty loss and the TD loss; and performing at least one of the offline reinforcement learning and the online reinforcement learning based on the first loss, and wherein the first loss is determined by adding, to the TD loss, a value obtained by multiplying the penalty loss by a weight value.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
This application is a Bypass Continuation of International Patent Application No. PCT/KR2025/013602, filed on Sep. 3, 2025, which claims priority from and the benefit of Korean Patent Application No. 10-2024-0126112, filed on Sep. 13, 2024, and Korean Patent Application No. 10-2025-0109486, filed on Aug. 8, 2025, each of which is hereby incorporated by reference for all purposes as if fully set forth herein.

BACKGROUND
Field
Embodiments of the invention relate generally to a method, apparatus, and system for reinforcement learning using offline data. More particularly, one embodiment of the present disclosure provides a method for appropriately adjusting a Q-value, without overestimating it, for an out-of-distribution space in which data is not yet available when reinforcement learning is performed using offline data.

Discussion of the Background
Reinforcement learning is a method in which an agent learns how to make decisions by interacting with an environment, and is an artificial intelligence learning method used mainly in robot control, autonomous driving, and the like. Reinforcement learning may be divided into online reinforcement learning and offline reinforcement learning. Online reinforcement learning is a learning method in which the agent directly interacts with the environment to collect data. In contrast, offline reinforcement learning is reinforcement learning in which the agent does not directly interact with the environment; a separate behavior algorithm exists, and a policy is learned from fixed data collected in advance without interaction with the environment. Offline reinforcement learning has the advantage that learning may be performed without risk to an actual environment in robots, autonomous driving, and the like, but its inference capability is degraded for situations outside the fixed data collected in advance. Accordingly, methods of training by utilizing both online reinforcement learning and offline reinforcement learning are currently being researched.
The above information disclosed in this Background section is only for understanding of the background of the inventive concepts, and, therefore, it may contain information that does not constitute prior art.

SUMMARY
One embodiment of the present disclosure is directed to providing a method for preventing overestimation of a Q-value in offline reinforcement learning and online reinforcement learning.
Additional features of the inventive concepts will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the inventive concepts.
According to one embodiment of the present disclosure, a system for reinforcement learning may include at least one processor, and at least one memory storing at least one instruction that, when executed by the at least one processor, is configured to perform offline reinforcement learning and perform online reinforcement learning. The performing of the offline reinforcement learning includes identifying a data-retained region and a data-unretained region, and reducing a Q-value estimated for the data-unretained region.
The reducing of the Q-value estimated for the data-unretained region may include an operation of utilizing reward scaling that multiplies a reward function used in the offline reinforcement learning by a constant c_reward greater than 1 to reduce the Q-value.
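As a concrete illustration of the reward-scaling operation just described, the following minimal sketch (Python with NumPy) multiplies the rewards of an offline dataset by a constant c_reward greater than 1 before they are used to compute TD targets. The function and variable names (scale_rewards, c_reward) and the array-based representation of the dataset are assumptions made for illustration; they are not prescribed by the present disclosure.

```python
# Minimal sketch of reward scaling for offline reinforcement learning.
# Assumption: the offline dataset stores rewards as a NumPy array; names
# such as `scale_rewards` and `c_reward` are illustrative only.
import numpy as np

def scale_rewards(rewards: np.ndarray, c_reward: float = 10.0) -> np.ndarray:
    """Multiply every reward in the offline dataset by a constant c_reward > 1.

    The summary above suggests setting c_reward to 10 or greater; when the
    critic is bounded (e.g., via layer normalization), enlarging rewards in
    the data-retained region makes the Q-values estimated for the
    data-unretained region relatively smaller.
    """
    assert c_reward > 1.0, "c_reward must be greater than 1"
    return rewards * c_reward

# Example: scale a small batch of offline rewards before computing TD targets.
offline_rewards = np.array([0.0, 1.0, 0.5, -0.2])
print(scale_rewards(offline_rewards, c_reward=10.0))  # [ 0.  10.   5.  -2.]
```

During online fine-tuning, the same scaling would be applied to the rewards drawn from the replay buffer, so that the offline-learned critic and the online data stay on the same reward scale.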
The reducing of the Q-value estimated for the data-unretained region may include an operation of performing reward scaling that multiplies a reward function used in the offline reinforcement learning by a constant c_reward greater than 1, an operation of performing layer normalization using a reward obtained by the reward scaling as an input, and an operation of learning a critic ensemble including a plurality of critic networks in which the layer normalization is performed.
The performing of the online reinforcement learning may include an operation of performing online fine-tuning by utilizing reward scaling that multiplies a reward function of a replay buffer used in the online reinforcement learning by the constant c_reward. The constant c_reward may be set to a value of 10 or greater.
The reducing of the Q-value estimated for the data-unretained region may include an operation of penalizing that sets the Q-value for the data-unretained region to be equal to or less than a predetermined value. The penalizing may include calculating a penalty loss, calculating a temporal-difference (TD) loss, determining a first loss based on the penalty loss and the TD loss, and performing at least one of the offline reinforcement learning and the online reinforcement learning based on the first loss. The first loss may be determined by adding, to the TD loss, a value obtained by multiplying the penalty loss by a weight value.
According to another embodiment of the present disclosure, a method for reinforcement learning performed by at least one processor may include performing offline reinforcement learning and performing online reinforcement learning, wherein the performing of the offline reinforcement learning includes identifying a data-retained region and a data-unretained region, and reducing a Q-value estimated for the data-unretained region.
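As a hedged illustration of how the pieces summarized above might fit together, the sketch below (PyTorch, assuming the library is available) builds a critic ensemble whose networks apply layer normalization and trains it with a combined objective in which the first loss equals the TD loss plus the penalty loss multiplied by a weight value. The network sizes, the use of uniformly sampled random actions as a stand-in for the data-unretained region, the ceiling value, and all names (LayerNormCritic, critic_loss, penalty_weight, q_ceiling) are illustrative assumptions, not details taken from the present disclosure.

```python
# Hedged sketch: layer-normalized critic ensemble trained with
# first_loss = td_loss + penalty_weight * penalty_loss.
# All names and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class LayerNormCritic(nn.Module):
    """Q(s, a) critic that applies layer normalization after each hidden layer."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

def critic_loss(ensemble, target_ensemble, batch, gamma: float = 0.99,
                penalty_weight: float = 1.0, q_ceiling: float = 0.0):
    """Return first_loss = td_loss + penalty_weight * penalty_loss.

    `batch` is (state, action, reward, next_state, next_action, done); the
    rewards are assumed to have already been multiplied by c_reward, and the
    next actions are assumed to come from the current policy.
    """
    s, a, r, s2, a2, done = batch
    with torch.no_grad():
        # Conservative TD target: minimum over the target critic ensemble.
        q_next = torch.min(torch.stack([c(s2, a2) for c in target_ensemble]), dim=0).values
        target = r + gamma * (1.0 - done) * q_next
    td_loss = torch.zeros(())
    penalty_loss = torch.zeros(())
    for critic in ensemble:
        td_loss = td_loss + nn.functional.mse_loss(critic(s, a), target)
        # One possible stand-in for the data-unretained region: uniformly
        # random actions; Q-values above `q_ceiling` for them are penalized.
        a_ood = torch.rand_like(a) * 2.0 - 1.0
        penalty_loss = penalty_loss + torch.relu(critic(s, a_ood) - q_ceiling).mean()
    return td_loss + penalty_weight * penalty_loss

# Tiny usage example with random tensors (shapes only, illustrative).
state_dim, action_dim, batch_size = 8, 2, 32
ensemble = [LayerNormCritic(state_dim, action_dim) for _ in range(5)]
target_ensemble = [LayerNormCritic(state_dim, action_dim) for _ in range(5)]
batch = (torch.randn(batch_size, state_dim),
         torch.rand(batch_size, action_dim) * 2 - 1,
         torch.randn(batch_size) * 10.0,          # rewards already scaled by c_reward
         torch.randn(batch_size, state_dim),
         torch.rand(batch_size, action_dim) * 2 - 1,
         torch.zeros(batch_size))
loss = critic_loss(ensemble, target_ensemble, batch)
loss.backward()
```

Because the rewards passed in are assumed to have already been scaled by c_reward, the same routine could serve both the offline learning phase and the subsequent online fine-tuning from the replay buffer.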