CN-122022908-A - E-commerce marketing management method and device based on reinforcement learning and electronic equipment
Abstract
The invention provides a reinforcement-learning-based e-commerce marketing management method and device and electronic equipment, relating to the technical field of marketing management. The method comprises: performing strict time-lag processing on historical transaction data and historical interaction data to obtain business-interpretable features; performing two-stage inverse reinforcement learning based on the business-interpretable features to obtain a reference marketing strategy distribution and corresponding decision reason codes; iteratively updating a baseline strategy model under preset business constraint conditions using a group relative strategy optimization method to obtain an optimized marketing strategy model; quantitatively estimating the expected benefits and risks of the optimized marketing strategy model using an offline counterfactual evaluation method; and, based on the quantitative estimation results and the real-time business constraint state, using the optimized marketing strategy model to generate and output session-level promotion decisions and corresponding decision reason codes online. The invention realizes the generation and online execution of interpretable, auditable and offline-verifiable e-commerce marketing strategies.
Inventors
- Song Bilian
- Li Shenggang
- Qi Yunfeng
Assignees
- 上海画龙信息科技有限公司
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2025-12-16
Claims (10)
- 1. An e-commerce marketing management method based on reinforcement learning, characterized by comprising the following steps: acquiring historical transaction data and historical interaction data; performing strict time-lag processing on the historical transaction data and the historical interaction data to obtain business-interpretable features; performing two-stage inverse reinforcement learning based on the business-interpretable features to obtain a reference marketing strategy distribution and corresponding decision reason codes; taking the reference marketing strategy distribution as a supervisory signal and obtaining a baseline strategy model through list distillation training; under preset business constraint conditions, iteratively updating the baseline strategy model using a group relative strategy optimization method to obtain an optimized marketing strategy model, wherein the group relative strategy optimization adopts a group relative advantage calculation centered on the reference marketing strategy distribution; quantitatively estimating the expected benefits and risks of the optimized marketing strategy model using an offline counterfactual evaluation method; and, based on the quantitative estimation results and the real-time business constraint state, generating and outputting session-level promotion decisions and corresponding decision reason codes online using the optimized marketing strategy model.
- 2. The e-commerce marketing management method based on reinforcement learning of claim 1, wherein performing strict time-lag processing on the historical transaction data and the historical interaction data to obtain business-interpretable features comprises: identifying future-leaking signals in the historical transaction data and the historical interaction data, and aligning the occurrence time of each future-leaking signal to the starting timestamp of the business session to which it belongs; and, based on the aligned data, constructing the business-interpretable features corresponding to a business session using only information whose timestamp is no later than that session's starting timestamp.
- 3. The e-commerce marketing management method based on reinforcement learning of claim 1, wherein performing two-stage inverse reinforcement learning based on the business-interpretable features to obtain a reference marketing strategy distribution and corresponding decision reason codes comprises: training a first model to predict, based on the session-level business-interpretable features, the probability of performing a promotional action in the current session state; and, for sessions determined to be suitable for performing a promotional action, training a second model to predict, based on features at the session-and-promotional-action combination level, a preference probability distribution over the set of feasible promotional actions as the reference marketing strategy distribution; the training targets of the first model and the second model are based on expert decision behaviors recovered by inversion from the historical transaction data and the historical interaction data.
- 4. The reinforcement-learning-based e-commerce marketing management method of claim 3, wherein training a first model to predict, based on the session-level business-interpretable features, the probability of performing a promotional action in the current session state comprises: modeling the probability as p(y=1|x) = σ(w·x + b), wherein p(y=1|x) is the probability of executing the promotional action given the business-interpretable feature x; σ(·) is the Sigmoid function; x ∈ R^d is a business-interpretable feature vector of dimension d; y ∈ {0,1} is the ground-truth label, with 1 indicating promotion and 0 indicating no promotion; the training objective L = −(1/N)·Σᵢ[yᵢ·log pᵢ + (1−yᵢ)·log(1−pᵢ)] + λ‖w‖² is the cross-entropy loss, wherein λ > 0 is the L2 regularization coefficient and ‖w‖² is the squared L2 norm of the weight vector, used to prevent overfitting.
- 5. The reinforcement-learning-based e-commerce marketing management method of claim 4, wherein, for a session determined to be suitable for performing a promotional action, training a second model to predict, based on features at the session-and-promotional-action combination level, a preference probability distribution over the set of feasible promotional actions as the reference marketing strategy distribution comprises: for each action k in the set of feasible promotional actions, constructing an extended feature vector x_k = [x; e_k], wherein x_k is the extended feature vector corresponding to action k, and e_k is an indicator (one-hot) vector identifying the current candidate action as the k-th; modeling the reference distribution with a listwise softmax to yield the preference probability distribution over the set of feasible promotional actions: π_ref(k|x) = exp(z_k/τ) / Σ_j exp(z_j/τ), wherein π_ref(k|x) is the preference probability of selecting action k given x, z_k is the score of action k, z_j is the score of the j-th feasible promotional action, and τ > 0 is a temperature parameter controlling the smoothness of the distribution; π_ref(·|x) is the model target, i.e. the preference probability distribution over promotional actions given x.
- 6. The e-commerce marketing management method based on reinforcement learning of claim 1, wherein obtaining a baseline strategy model through list distillation training with the reference marketing strategy distribution as a supervisory signal comprises: training a neural scorer with parameter θ, which outputs a predictive score for each action k in the set of feasible promotional actions: ẑ_k = f_θ(x_k), wherein ẑ_k is the predictive score for action k and f_θ(·) is a neural network with parameter θ; the training loss function of the baseline strategy model is a listwise cross-entropy loss, expressed as: L_KD = −Σ_k π_ref(k|x)·log π_θ(k|x), wherein L_KD is the listwise cross-entropy loss, π_ref(k|x) is the teacher distribution, π_θ(k|x) = exp(ẑ_k/τ) / Σ_j exp(ẑ_j/τ) is the student's predicted distribution, and τ is a temperature parameter.
- 7. The reinforcement-learning-based e-commerce marketing management method of claim 6, wherein obtaining a baseline strategy model through list distillation training with the reference marketing strategy distribution as a supervisory signal further comprises: constructing a fused teacher distribution π_teacher(k|x) = α·π_ref(k|x) + (1−α)·π_rule(k|x), and training the baseline strategy model through list distillation with the fused teacher distribution as the supervisory signal, wherein π_θ(k|x) is the baseline strategy model, π_rule(k|x) is a teacher strategy based on business rules, and α ∈ [0,1] is a fusion weight.
- 8. An e-commerce marketing management device based on reinforcement learning, characterized by comprising: an acquisition module for acquiring historical transaction data and historical interaction data; a processing module for performing strict time-lag processing on the historical transaction data and the historical interaction data to obtain business-interpretable features; a learning module for performing two-stage inverse reinforcement learning based on the business-interpretable features to obtain a reference marketing strategy distribution and corresponding decision reason codes; a distillation module for taking the reference marketing strategy distribution as a supervisory signal and obtaining a baseline strategy model through list distillation training; an updating module for iteratively updating the baseline strategy model under preset business constraint conditions using a group relative strategy optimization method to obtain an optimized marketing strategy model, wherein the group relative strategy optimization adopts a group relative advantage calculation centered on the reference marketing strategy distribution; an estimation module for quantitatively estimating the expected benefits and risks of the optimized marketing strategy model using an offline counterfactual evaluation method; and an output module for generating and outputting session-level promotion decisions and corresponding decision reason codes online, using the optimized marketing strategy model, based on the quantitative estimation results and the real-time business constraint state.
- 9. An electronic device, wherein the electronic device comprises: a processor; and a memory storing computer-executable instructions that, when executed, cause the processor to perform the method of any one of claims 1-7.
- 10. A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any of claims 1-7.
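As a rough illustration of the two-stage inverse reinforcement learning described in claims 3-5, the sketch below is a hypothetical minimal NumPy implementation, not the patented system itself: Stage 1 is the sigmoid gating probability of claim 4 and Stage 2 is the listwise-softmax reference distribution over extended feature vectors of claim 5. The function names and the linear `score_fn` passed in are illustrative assumptions.

```python
import numpy as np

def gate_probability(x, w, b):
    """Stage 1 (claim 4): probability of running any promotion in the
    current session state, p(y=1|x) = sigmoid(w.x + b)."""
    return 1.0 / (1.0 + np.exp(-(float(np.dot(w, x)) + b)))

def reference_distribution(x, score_fn, n_actions, tau=1.0):
    """Stage 2 (claim 5): score each candidate action k on the extended
    feature [x; one_hot(k)], then apply a temperature softmax to obtain
    the reference marketing strategy distribution."""
    scores = np.array([
        score_fn(np.concatenate([x, np.eye(n_actions)[k]]))
        for k in range(n_actions)
    ], dtype=float)
    z = scores / tau
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()
```

When every action scores equally, the reference distribution is uniform; lowering τ sharpens it toward the highest-scoring action, which matches the claimed role of τ as a smoothness control.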
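The list distillation objective of claims 6-7 can be sketched as follows. This is a hypothetical NumPy illustration; the function names and the small epsilon guard inside the logarithm are implementation assumptions, not part of the claims.

```python
import numpy as np

def softmax(z, tau=1.0):
    """Temperature-scaled softmax over action scores."""
    z = np.asarray(z, dtype=float) / tau
    z = z - z.max()                   # numerical stability
    e = np.exp(z)
    return e / e.sum()

def fused_teacher(p_ref, p_rule, alpha=0.8):
    """Claim 7: convex fusion of the IRL reference distribution and a
    business-rule teacher policy; alpha in [0, 1] is the fusion weight."""
    return alpha * np.asarray(p_ref) + (1.0 - alpha) * np.asarray(p_rule)

def listwise_ce(p_teacher, student_scores, tau=1.0):
    """Claim 6: listwise cross-entropy distillation loss
    L_KD = -sum_k p_teacher(k) * log p_student(k)."""
    p_student = softmax(student_scores, tau)
    return float(-np.sum(p_teacher * np.log(p_student + 1e-12)))
```

Because `fused_teacher` is a convex combination of two probability distributions, its output is itself a valid distribution, so it can be dropped in as the supervisory signal wherever the plain reference distribution would be used.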
Description
E-commerce marketing management method and device based on reinforcement learning and electronic equipment

Technical Field

The invention relates to the technical field of marketing management, in particular to an e-commerce marketing management method and device based on reinforcement learning and electronic equipment.

Background

Existing promotional systems mostly employ rule engines, traditional response models, or recommendation systems. Rule engines adapt poorly to seasonal and inventory fluctuations; recommendation systems focus on matching rather than on whether-to-promote gating and explicit promotion-cost constraints; and supervised response models lack policy stability and guardrail controllability once online. While online reinforcement learning is adaptive, exploring on real users risks excessive subsidies and fatigue accumulation, and it is difficult to provide finance- and compliance-oriented interpretation and uncertainty quantification, leading to significant approval resistance. Offline assessment often relies on A/B testing or simple replay; offline policy evaluation (OPE) techniques such as inverse propensity scoring (IPS/SNIPS) are not used systematically to control bias and variance. Multi-channel reach (e.g., SMS + e-mail + in-app pop-up) often creates multi-objective conflicts among inventory, profit, subsidy, and cross-channel fatigue overlays, lacking a unified constrained optimization framework. Therefore, an e-commerce marketing management method and device based on reinforcement learning and electronic equipment are provided.

Disclosure of Invention

The specification provides a reinforcement-learning-based e-commerce marketing management method and device and electronic equipment, realizing interpretable, auditable and offline-verifiable e-commerce marketing strategy generation and online execution.
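For context on the offline evaluation techniques named above, a generic self-normalized inverse propensity scoring (SNIPS) estimator can be sketched as follows; this is standard OPE background in NumPy, not the patent's specific evaluator, and the clipping threshold is an illustrative assumption.

```python
import numpy as np

def snips(rewards, target_probs, logging_probs, clip=10.0):
    """Self-normalized inverse propensity scoring (SNIPS):
    V = sum_i w_i * r_i / sum_i w_i, with importance weight
    w_i = pi(a_i|x_i) / mu(a_i|x_i). Clipping the weights trades
    a small bias for a large variance reduction."""
    w = np.asarray(target_probs, float) / np.asarray(logging_probs, float)
    w = np.clip(w, 0.0, clip)
    r = np.asarray(rewards, float)
    return float(np.sum(w * r) / np.sum(w))
```

When the target policy equals the logging policy, every weight is 1 and the estimate reduces to the observed mean reward, which is a useful sanity check for any OPE pipeline.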
The specification provides an e-commerce marketing management method based on reinforcement learning, comprising the following steps: acquiring historical transaction data and historical interaction data; performing strict time-lag processing on the historical transaction data and the historical interaction data to obtain business-interpretable features; performing two-stage inverse reinforcement learning based on the business-interpretable features to obtain a reference marketing strategy distribution and corresponding decision reason codes; taking the reference marketing strategy distribution as a supervisory signal and obtaining a baseline strategy model through list distillation training; under preset business constraint conditions, iteratively updating the baseline strategy model using a group relative strategy optimization method to obtain an optimized marketing strategy model, wherein the group relative strategy optimization adopts a group relative advantage calculation centered on the reference marketing strategy distribution; quantitatively estimating the expected benefits and risks of the optimized marketing strategy model using an offline counterfactual evaluation method; and, based on the quantitative estimation results and the real-time business constraint state, generating and outputting session-level promotion decisions and corresponding decision reason codes online using the optimized marketing strategy model.
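One plausible reading of the group relative advantage step, sketched in NumPy: each sampled rollout's reward is standardized against its own group, and a KL penalty anchors the updated policy to the reference distribution. Both forms are common GRPO-style choices and are hypothetical illustrations, not quoted from the patent.

```python
import numpy as np

def group_relative_advantage(rewards):
    """Standardize each sampled rollout's reward against its own
    group's mean and standard deviation, so no learned value
    baseline is required."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def kl_to_reference(p_policy, p_ref):
    """KL(pi || pi_ref): a penalty term that keeps the optimized
    policy near the reference marketing strategy distribution."""
    p = np.asarray(p_policy, float)
    q = np.asarray(p_ref, float)
    return float(np.sum(p * np.log((p + 1e-12) / (q + 1e-12))))
```

By construction the group-relative advantages sum to zero within each group, and the KL term vanishes when the optimized policy coincides with the reference distribution.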
Optionally, performing strict time-lag processing on the historical transaction data and the historical interaction data to obtain business-interpretable features includes: identifying future-leaking signals in the historical transaction data and the historical interaction data, and aligning the occurrence time of each future-leaking signal to the starting timestamp of the business session to which it belongs; and, based on the aligned data, constructing the business-interpretable features corresponding to a business session using only information whose timestamp is no later than that session's starting timestamp. Optionally, performing two-stage inverse reinforcement learning based on the business-interpretable features to obtain a reference marketing strategy distribution and corresponding decision reason codes includes: training a first model to predict, based on the session-level business-interpretable features, the probability of performing a promotional action in the current session state; and, for sessions determined to be suitable for performing a promotional action, training a second model to predict, based on features at the session-and-promotional-action combination level, a preference probability distribution over the set of feasible promotional actions as the reference marketing strategy distribution; the training targets of the first model and the second model are based on expert decision behaviors recovered by inversion from the historical transaction data and the historical interaction data. Optionally,