CN-121980083-A - Multi-objective reinforcement learning recommendation method and system for long-term user engagement

Abstract

The invention discloses a multi-objective reinforcement learning recommendation method and system for long-term user engagement, and relates to the technical field of data processing. The method comprises: extracting user interaction records from an offline log and constructing a state representation feature vector set; generating a long-term engagement label set; training a long-term result prediction model to generate a long-term reward component; and constructing a multi-objective Q network and a dynamic weight network, and performing alignment-constrained joint training on the dynamic weight multi-objective network by introducing the long-term reward component to obtain a multi-objective reinforcement learning recommendation model. The method solves the technical problems of low model learning efficiency and poor adaptability caused by long-term feedback delay and a rigid multi-objective trade-off mechanism when optimizing long-term user engagement in the prior art. By introducing progressive feedback and a dynamic weight mechanism, it achieves dynamic multi-objective trade-offs and effectively improves the model's learning efficiency and adaptability with respect to long-term user engagement.

Inventors

LIU QIZHI
WANG MENG
XIE JING

Assignees

National Science Library, Chinese Academy of Sciences (中国科学院文献情报中心)

Dates

Publication Date: 2026-05-05
Application Date: 2026-01-20

Claims (10)

1. A multi-objective reinforcement learning recommendation method for long-term user engagement, characterized by comprising the following steps: extracting user interaction records from an offline log, and constructing state representation features according to the user interaction records to obtain a state representation feature vector set; performing long-term engagement label construction and normalization according to the offline log to obtain a long-term engagement label set; training a long-term result prediction model according to the state representation feature vector set and the long-term engagement label set, and generating a long-term reward component based on the long-term result prediction model; constructing a dynamic weight multi-objective network, wherein the dynamic weight multi-objective network comprises a multi-objective Q network and a dynamic weight network; and introducing the long-term reward component and performing alignment-constrained joint training on the dynamic weight multi-objective network to obtain a multi-objective reinforcement learning recommendation model.
2. The multi-objective reinforcement learning recommendation method for long-term user engagement according to claim 1, wherein constructing state representation features according to the user interaction records to obtain the state representation feature vector set comprises: extracting sequence features from the user interaction records to obtain a hidden state vector set; extracting static attribute features from the user interaction records to obtain a static attribute vector set; extracting context features from the user interaction records to obtain a context vector set; and aligning and concatenating the hidden state vector set, the static attribute vector set and the context vector set to generate the state representation feature vector set.
3. The multi-objective reinforcement learning recommendation method for long-term user engagement according to claim 1, wherein performing long-term engagement label construction and normalization according to the offline log to obtain the long-term engagement label set comprises: defining a long-term time window; calculating long-term engagement from the offline log according to the long-term time window to obtain an engagement calculation label set; and normalizing the labels in the engagement calculation label set to generate the long-term engagement label set.
4. The multi-objective reinforcement learning recommendation method for long-term user engagement according to claim 1, wherein training the long-term result prediction model according to the state representation feature vector set and the long-term engagement label set comprises: constructing a long-term result prediction network; taking a mean square error loss function as a long-term result prediction loss function; and, based on the long-term result prediction loss function, taking the state representation feature vector set as the model input and the long-term engagement label set as the supervision signal, performing supervised training on the long-term result prediction network to generate the long-term result prediction model.
5. The multi-objective reinforcement learning recommendation method for long-term user engagement according to claim 4, wherein the long-term result prediction network employs a multi-layer perceptron, the long-term result prediction network includes an input layer, a plurality of hidden layers and an output layer, the plurality of hidden layers employ a ReLU activation function, and the output layer employs a Sigmoid activation function.
6. The multi-objective reinforcement learning recommendation method for long-term user engagement according to claim 1, wherein the input signal of the multi-objective Q network comprises a current state representation and a candidate action, and the output signal of the multi-objective Q network comprises a Q-value vector comprising a short-term click return estimate, a short-term dwell-time return estimate, and a long-term engagement return estimate.
7. The multi-objective reinforcement learning recommendation method for long-term user engagement according to claim 1, wherein an output layer of the dynamic weight network employs a Softmax activation function.
8. The multi-objective reinforcement learning recommendation method for long-term user engagement according to claim 1, wherein introducing the long-term reward component and performing alignment-constrained joint training on the dynamic weight multi-objective network to obtain the multi-objective reinforcement learning recommendation model comprises: performing mixed Q value calculation based on the multi-objective Q network and the dynamic weight network to obtain a mixed Q value calculation result; constructing a total loss function; and performing joint optimization training on the dynamic weight multi-objective network according to the long-term reward component, the mixed Q value calculation result and the total loss function to generate the multi-objective reinforcement learning recommendation model.
9. The multi-objective reinforcement learning recommendation method for long-term user engagement according to claim 8, wherein the total loss function is constructed from a multi-objective temporal-difference loss function, a long-term alignment constraint loss function, and a weight regularization term.
10. A multi-objective reinforcement learning recommendation system for long-term user engagement, characterized in that the system is configured to perform the multi-objective reinforcement learning recommendation method for long-term user engagement according to any one of claims 1 to 9, the system comprising: a state representation feature construction module, configured to extract user interaction records from an offline log and construct state representation features according to the user interaction records to obtain a state representation feature vector set; a long-term engagement label extraction module, configured to perform long-term engagement label construction and normalization according to the offline log to obtain a long-term engagement label set; a long-term reward generation module, configured to train a long-term result prediction model according to the state representation feature vector set and the long-term engagement label set, and to generate a long-term reward component based on the long-term result prediction model; a multi-objective network construction module, configured to construct a dynamic weight multi-objective network, wherein the dynamic weight multi-objective network comprises a multi-objective Q network and a dynamic weight network; and an alignment-constrained joint training module, configured to introduce the long-term reward component and perform alignment-constrained joint training on the dynamic weight multi-objective network to obtain a multi-objective reinforcement learning recommendation model.
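
Claims 6 to 8 describe a Q-value vector with three components, a dynamic weight network with a Softmax output layer, and a mixed Q value computed from the two. The claims do not give the mixing formula, so the following is only a minimal sketch that assumes the mixed Q value is the Softmax-weighted sum of the three components. The names `softmax`, `mixed_q`, `candidates` and `weight_logits`, and all numbers, are hypothetical and chosen for illustration.

```python
import math

def softmax(logits):
    # Numerically stable Softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_q(q_vector, weights):
    # Assumed mixing rule: weighted sum of the per-objective Q values.
    return sum(w * q for w, q in zip(weights, q_vector))

# Hypothetical Q-value vectors for two candidate actions, ordered as
# (short-term click, short-term dwell time, long-term engagement).
candidates = {
    "item_a": [0.9, 0.2, 0.1],  # strong click appeal, weak long-term value
    "item_b": [0.4, 0.5, 0.8],  # weaker click appeal, strong long-term value
}

# Stand-in for the dynamic weight network's pre-Softmax output in a state
# that favours long-term engagement (made-up values).
weight_logits = [0.0, 0.0, 1.0]
weights = softmax(weight_logits)

scores = {item: mixed_q(q, weights) for item, q in candidates.items()}
best_item = max(scores, key=scores.get)
```

Because the Softmax output is non-negative and sums to 1, the mixed value stays on the same scale as the individual Q values, and a change in the weight logits shifts the ranking between candidates without retraining the Q network.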

Description

Multi-objective reinforcement learning recommendation method and system for long-term user engagement

Technical Field

The invention relates to the technical field of data processing, in particular to a multi-objective reinforcement learning recommendation method and system for long-term user engagement.

Background

With the rapid development of Internet applications, recommendation systems are widely used on platforms such as e-commerce, social media and short video. Traditional recommendation methods mostly focus on optimizing short-term objectives, such as click-through rate (CTR) and dwell time, and neglect long-term user engagement, such as long-term retention and revisit probability. To realize the long-term benefits of recommendation systems, recommendation methods based on reinforcement learning have attracted attention in recent years. By modeling the interaction between users and the recommendation system, reinforcement learning can adjust the recommendation policy dynamically. However, existing reinforcement learning methods often have difficulty balancing short-term effects against long-term engagement, and face delayed feedback when handling long-term rewards. For example, long-term objectives are typically built as a weighted combination of short-term proxy metrics, which is subjective and dataset-dependent and is difficult to adapt to the differing needs of users at different stages.

Disclosure of Invention

The application provides a multi-objective reinforcement learning recommendation method and system for long-term user engagement, to solve the technical problems of low model learning efficiency and poor adaptability caused by long-term feedback delay and a rigid multi-objective trade-off mechanism when optimizing long-term user engagement in the prior art.
The method comprises: extracting user interaction records from an offline log, and constructing state representation features according to the user interaction records to obtain a state representation feature vector set; performing long-term engagement label construction and normalization according to the offline log to obtain a long-term engagement label set; training a long-term result prediction model according to the state representation feature vector set and the long-term engagement label set, and generating a long-term reward component based on the long-term result prediction model; constructing a dynamic weight multi-objective network, wherein the dynamic weight multi-objective network comprises a multi-objective Q network and a dynamic weight network; and introducing the long-term reward component and performing alignment-constrained joint training on the dynamic weight multi-objective network to obtain the multi-objective reinforcement learning recommendation model.
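
The label construction step (a long-term time window, an engagement calculation, then normalization) can be illustrated with a small sketch. The text does not fix the window length, the engagement measure or the normalization scheme, so this example assumes a 30-day window, a count of distinct active days, and min-max scaling; `WINDOW_DAYS`, `engagement_count`, `min_max_normalize` and the sample log are all hypothetical.

```python
from datetime import date, timedelta

WINDOW_DAYS = 30  # assumed length; the source only says "long-term time window"

def engagement_count(active_dates, anchor, window_days=WINDOW_DAYS):
    # Assumed engagement measure: number of distinct active days
    # in the interval (anchor, anchor + window_days].
    end = anchor + timedelta(days=window_days)
    return len({d for d in active_dates if anchor < d <= end})

def min_max_normalize(values):
    # Scale raw labels into [0, 1]; a constant list maps to all zeros.
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up offline log: activity dates per user after a recommendation
# made on the anchor date.
anchor = date(2026, 1, 1)
log = {
    "user_1": [date(2026, 1, 2), date(2026, 1, 2), date(2026, 1, 10)],
    "user_2": [date(2026, 1, 5)],
    "user_3": [date(2026, 1, 3), date(2026, 1, 4), date(2026, 1, 20), date(2026, 2, 15)],
}

raw = {user: engagement_count(dates, anchor) for user, dates in log.items()}
users = list(raw)
labels = dict(zip(users, min_max_normalize([raw[u] for u in users])))
```

Scaling the labels into [0, 1] keeps them within the output range of a Sigmoid output layer such as the one named in claim 5, which is one plausible reason for the normalization step.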
The system comprises a state representation feature construction module, a long-term engagement label extraction module, a long-term reward generation module, a multi-objective network construction module and an alignment-constrained joint training module. The state representation feature construction module is used for extracting user interaction records from an offline log and constructing state representation features according to the user interaction records to obtain a state representation feature vector set. The long-term engagement label extraction module is used for performing long-term engagement label construction and normalization according to the offline log to obtain a long-term engagement label set. The long-term reward generation module is used for training a long-term result prediction model according to the state representation feature vector set and the long-term engagement label set, and generating a long-term reward component based on the long-term result prediction model. The multi-objective network construction module is used for constructing a dynamic weight multi-objective network, which comprises a multi-objective Q network and a dynamic weight network. The alignment-constrained joint training module is used for introducing the long-term reward component and performing alignment-constrained joint training on the dynamic weight multi-objective network to obtain the multi-objective reinforcement learning recommendation model.
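
The long-term reward generation module relies on the long-term result prediction network, which claims 4 and 5 describe as a multi-layer perceptron with ReLU hidden layers, a Sigmoid output layer and a mean square error loss. The sketch below shows only the forward pass and the loss under those stated choices; layer sizes, initialization, the sample inputs and all function names are assumptions, and the gradient update that a training framework would perform is omitted.

```python
import math
import random

def relu(values):
    return [max(0.0, v) for v in values]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def linear(x, weights, bias):
    # weights holds one row per output unit.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_layer(n_in, n_out, rng):
    # Small random weights and zero biases (an arbitrary choice).
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def predict(x, hidden_layers, output_layer):
    # Hidden layers use ReLU; the single output unit uses Sigmoid,
    # so the prediction lies strictly between 0 and 1.
    h = x
    for weights, bias in hidden_layers:
        h = relu(linear(h, weights, bias))
    weights, bias = output_layer
    return sigmoid(linear(h, weights, bias)[0])

def mse(predictions, labels):
    # Mean square error between predictions and normalized labels.
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)

rng = random.Random(0)
hidden_layers = [init_layer(4, 8, rng), init_layer(8, 8, rng)]  # assumed sizes
output_layer = init_layer(8, 1, rng)

states = [[0.1, 0.0, 1.0, 0.3], [0.7, 1.0, 0.0, 0.9]]  # made-up state feature vectors
labels = [0.2, 0.8]                                     # made-up normalized labels
predictions = [predict(s, hidden_layers, output_layer) for s in states]
loss = mse(predictions, labels)
```

In practice the same structure would more likely be written with a deep learning framework that provides automatic differentiation; the pure-Python version is used here only to keep the example self-contained.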
One or more technical solutions provided by the application have at least the following technical effects or advantages: The multi-objective reinforcement learning recommendation method and system for long-term user engagement provided by the embodiments of the application relate to the technical field of data processing. By constructing a long-term result prediction model, sparse long-term feedback is converted into an immediate progressive reward signal; meanwhile, a dynamic weight network is introduced, the weights