CN-121999896-A - Hydroxyalkyl acrylate yield optimization method and system based on deep reinforcement learning
Abstract
The invention relates to the technical field of yield optimization, in particular to a hydroxyalkyl acrylate yield optimization method and system based on deep reinforcement learning. The method comprises: obtaining time-domain and frequency-domain features of real-time and historical working conditions; constructing the current state feature by weighted fusion with an updatable feature weight vector; inputting the state feature into a policy network, which outputs a probability distribution over operation vectors, and sampling from that distribution to obtain the adjustment amounts; calculating an advantage function from the reward value obtained after executing the operation; computing each feature's contribution to the advantage function in order to update the feature weights and correct the policy loss function; adjusting the learning rate according to the absolute value of the advantage function; and updating the policy network until convergence. The scheme achieves efficient and stable yield optimization through coordinated improvements in state representation, credit assignment, and the training process.
Inventors
- ZHANG HUINA
- YANG SHUANGBING
- ZHANG SHENGHUI
Assignees
- 菏泽昌盛源科技股份有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-03-04
Claims (10)
- 1. A hydroxyalkyl acrylate yield optimization method based on deep reinforcement learning, characterized by comprising the following steps: obtaining time-domain and frequency-domain features of real-time working-condition parameters and historical working-condition parameters in the hydroxyalkyl acrylate production process, and performing weighted fusion through an updatable feature weight vector to construct a multi-dimensional current state feature; inputting the current state feature into a policy network, outputting through the policy network a multivariate probability distribution defining the operation vector, and sampling from the distribution to obtain an operation vector containing the adjustment amount of each process parameter; executing the operation vector, obtaining the state feature at the next moment and the yield at the current moment, calculating the reward value at the current moment, evaluating the current state feature and the next-moment state feature respectively with a value network, and calculating an advantage function by combining the reward value; calculating the contribution of each dimension of the current state feature to the advantage function to obtain a contribution vector, and updating the feature weight vector accordingly; constructing a policy loss function, and correcting the policy term in the policy loss function with the contribution vector so as to strengthen the influence of key state features on policy network updates; adjusting the learning rate according to the absolute value of the advantage function; and updating the parameters of the policy network based on the corrected policy loss function and the adjusted learning rate, and repeating the above steps over a plurality of time steps in the production process until the policy network converges.
- 2. The hydroxyalkyl acrylate yield optimization method based on deep reinforcement learning according to claim 1, wherein constructing the multi-dimensional current state feature comprises: for each working-condition parameter, extracting the historical data of the past 60 time steps, calculating the mean and standard deviation of the historical data as time-domain features, and calculating the power spectral density at three specified frequency points via the fast Fourier transform as frequency-domain features; and multiplying the original feature vector element-wise by the updatable feature weight vector to obtain the multi-dimensional current state feature.
- 3. The hydroxyalkyl acrylate yield optimization method based on deep reinforcement learning according to claim 1, wherein the step of calculating the reward value at the current moment comprises: normalizing the currently executed operation vector $a_t$ by dividing the adjustment amount of each dimension of the vector by the preset maximum allowable adjustment amount of that dimension, to obtain a dimensionless normalized operation vector $\tilde{a}_t$; and calculating the reward value $r_t$ at the current moment using the formula $r_t = k_1\,(Y_t - Y^{*}) - k_2\,\lVert\tilde{a}_t\rVert^2$, wherein $Y_t$ is the hydroxyalkyl acrylate yield at the current moment, $Y^{*}$ is the preset target value of the yield, $\tilde{a}_t$ is the normalized operation vector, $\lVert\tilde{a}_t\rVert^2$ is the squared norm of that vector, and $k_1$ and $k_2$ are weighting coefficients of the reward function.
- 4. The method of claim 1, wherein the step of calculating the advantage function comprises: obtaining, using the value network, the value estimates $V(s_t)$ and $V(s_{t+1})$ of the current state $s_t$ and the next-moment state $s_{t+1}$, respectively; calculating the temporal-difference error according to the formula $\delta_t = r_t + \gamma\,V(s_{t+1}) - V(s_t)$, wherein $r_t$ is the current reward value and $\gamma$ is the discount factor; and, using generalized advantage estimation, calculating the advantage function according to the formula $A_t = \sum_{l=0}^{L-1} (\gamma\lambda)^{l}\,\delta_{t+l}$, wherein $\lambda$ is the smoothing factor and $L$ is the calculation step size.
- 5. The method of claim 1, wherein updating, in a moving-average manner, the feature weight vector used to construct the state feature at the next time step comprises: updating the feature weight vector according to the formula $w_{t+1} = (1-\eta)\,w_t + \eta\,c_t$, wherein $w_{t+1}$ is the updated feature weight vector, $w_t$ is the current feature weight vector, $c_t$ is the contribution vector of each dimension of the current state feature to the advantage function, and $\eta$ is the update coefficient.
- 6. The method of claim 1, wherein the step of correcting the policy term in the policy loss function using the contribution vector comprises: calculating a scalar importance weight as the inner product of the contribution vector and the current state feature vector; multiplying the scalar importance weight by the calculated advantage function to obtain a corrected advantage function; and using the corrected advantage function in place of the original advantage function when computing the policy term of the policy loss function.
- 7. The hydroxyalkyl acrylate yield optimization method based on deep reinforcement learning according to claim 1, wherein the step of adjusting the learning rate used to update the policy network is specifically: calculating the learning rate $\alpha_t$ of the policy network at the current time step according to the formula $\alpha_t = \alpha_0 / (1 + \kappa\,|A_t|)$, wherein $\alpha_0$ is the base learning rate, $|A_t|$ is the absolute value of the advantage function, and $\kappa$ is the adjustment coefficient.
- 8. The method of claim 1, wherein the step of sampling from the distribution an operation vector containing the adjustment amount of each process parameter comprises: the output layer of the policy network outputs a mean vector and a standard deviation vector defining a diagonal covariance matrix, which together define a multivariate Gaussian distribution; sampling from the multivariate Gaussian distribution to generate a raw operation vector, each dimension of which corresponds to an adjustment amount of the reactor temperature, the reactor pressure, the catalyst flow rate or the monomer feed rate; and constraining each adjustment amount in the raw operation vector within a preset safe operating interval to form the operation vector to be executed.
- 9. The deep-reinforcement-learning-based hydroxyalkyl acrylate yield optimization method of claim 1, wherein the policy loss function comprises a policy term constructed from the advantage function and the probability distribution output by the policy network, and an entropy term for encouraging exploration.
- 10. A hydroxyalkyl acrylate yield optimization system based on deep reinforcement learning, comprising: a processor; and a memory storing computer instructions for deep-reinforcement-learning-based hydroxyalkyl acrylate yield optimization which, when executed by the processor, cause the system to perform the method according to any one of claims 1 to 9.
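The Python/numpy sketches below are illustrative readings of claims 2 through 9, not the patent's implementation. For the claim-2 state construction, the following assumes that rfft bins 1, 3 and 5 stand in for the unspecified "three specific frequency points" and that the per-parameter features are stacked into one vector before weighting:

```python
import numpy as np

def build_state_feature(history, freq_idx=(1, 3, 5), weights=None):
    """Claim-2 sketch: per-parameter mean/std over the past 60 steps plus
    FFT power spectral density at three frequency bins, fused element-wise
    with an updatable feature weight vector.

    history: array of shape (60, n_params) - the past 60 time steps.
    freq_idx: assumed indices of the three frequency points.
    """
    assert history.shape[0] == 60
    feats = []
    for x in history.T:                                # one working-condition parameter
        mean, std = x.mean(), x.std()                  # time-domain features
        psd = np.abs(np.fft.rfft(x)) ** 2 / len(x)     # power spectral density
        feats.extend([mean, std, *psd[list(freq_idx)]])
    raw = np.asarray(feats)
    if weights is None:
        weights = np.ones_like(raw)                    # updatable weight vector w_t
    return raw * weights                               # element-wise weighted fusion
```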
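A sketch of the claim-3 reward, using the reconstructed form $r_t = k_1(Y_t - Y^{*}) - k_2\lVert\tilde{a}_t\rVert^2$; the coefficient values are illustrative, not from the patent:

```python
import numpy as np

def reward(y_t, y_target, a_t, a_max, k1=1.0, k2=0.1):
    """Claim-3 sketch: reward the yield gap toward the target and penalize
    the squared norm of the normalized (dimensionless) action."""
    a_norm = np.asarray(a_t) / np.asarray(a_max)   # divide by max allowed adjustment
    return k1 * (y_t - y_target) - k2 * float(a_norm @ a_norm)
```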
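Claim 4 is standard temporal-difference plus generalized advantage estimation; a minimal sketch:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Claim-4 sketch: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t), then
    A_t = sum_l (gamma*lam)^l * delta_{t+l} computed by backward recursion.
    `values` must hold one more entry than `rewards` (bootstrap value)."""
    deltas = [r + gamma * values[i + 1] - values[i] for i, r in enumerate(rewards)]
    adv, acc = [], 0.0
    for d in reversed(deltas):          # backward accumulation of (gamma*lam)-discounted deltas
        acc = d + gamma * lam * acc
        adv.append(acc)
    return adv[::-1]
```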
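Claims 5 and 6 reduce to a moving-average weight update and an inner-product rescaling of the advantage. How the contribution vector $c_t$ is computed is not specified in the available text, so it is taken as an input here:

```python
import numpy as np

def update_weights(w_t, c_t, eta=0.05):
    """Claim-5 sketch: moving-average update w_{t+1} = (1-eta)*w_t + eta*c_t."""
    return (1.0 - eta) * np.asarray(w_t) + eta * np.asarray(c_t)

def corrected_advantage(c_t, s_t, A_t):
    """Claim-6 sketch: scalar importance weight = <c_t, s_t>, used to
    rescale the advantage before it enters the policy term."""
    g = float(np.dot(c_t, s_t))         # inner-product importance weight
    return g * A_t
```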
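Claim 7's exact formula is not reproduced in the available text; the sketch below assumes a form that shrinks the step as $|A_t|$ grows, consistent with the oscillation concern in the background section:

```python
def adaptive_lr(a0, adv, kappa=0.5):
    """Claim-7 sketch under an assumed functional form: damp the policy
    update when the advantage magnitude is large."""
    return a0 / (1.0 + kappa * abs(adv))
```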
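Claim 8 samples from the diagonal Gaussian defined by the policy head and clips each adjustment into its safe operating interval; a sketch:

```python
import numpy as np

def sample_action(mu, sigma, low, high, rng=None):
    """Claim-8 sketch: draw a raw operation vector from N(mu, diag(sigma^2)),
    then clip each adjustment (reactor temperature, pressure, catalyst flow
    rate, monomer feed rate) into its preset safe interval [low, high]."""
    rng = rng or np.random.default_rng()
    raw = rng.normal(np.asarray(mu), np.asarray(sigma))  # raw operation vector
    return np.clip(raw, low, high)                       # safe-interval constraint
```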
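Claim 9 combines a policy term with an entropy bonus; the plain policy-gradient surrogate below is one consistent reading, not necessarily the patent's exact form:

```python
import numpy as np

def policy_loss(logp, adv_corr, entropy, ent_coef=0.01):
    """Claim-9 sketch: maximize log-probability weighted by the corrected
    advantage plus an entropy bonus, i.e. minimize the negative."""
    logp, entropy = np.asarray(logp), np.asarray(entropy)
    return float(-(logp * adv_corr).mean() - ent_coef * entropy.mean())
```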
Description
Hydroxyalkyl acrylate yield optimization method and system based on deep reinforcement learning

Technical Field

The invention relates to the technical field of yield optimization, and more particularly to a hydroxyalkyl acrylate yield optimization method and system based on deep reinforcement learning.

Background

In the industrial production of hydroxyalkyl acrylate, product yield is a key indicator of production efficiency and economic benefit. The production process involves multiple process parameters, such as temperature, pressure, catalyst dosage, raw material ratio and reaction time, which are mutually coupled and jointly determine the final yield.

In the prior art, the process optimization methods used to improve yield are mainly mechanism modeling and empirical regulation. Mechanism modeling guides production by establishing complex mathematical models of chemical reaction kinetics and transfer phenomena, but the high complexity of actual chemical processes makes accurate modeling difficult, and the resulting models often generalize poorly. Empirical regulation relies heavily on the expertise and practical experience of operators for parameter adjustment; it is subjective and limited, and can rarely find a globally optimal operating strategy, which restricts further exploitation of production potential.

To address the difficulty that traditional methods have in finding a globally optimal solution, deep reinforcement learning has been applied to chemical process optimization in actual production. Through continuous interaction between an agent and the production environment, deep reinforcement learning can autonomously learn an optimal control strategy from data and dynamically adjust continuous process parameters. It dispenses with the dependence on an accurate mechanism model and human experience, provides a new route for intelligent control of complex industrial processes, and can in theory approach the globally optimal operating point more effectively.

However, existing standard deep reinforcement learning methods still have significant drawbacks when applied directly to hydroxyalkyl acrylate production optimization. First, in state representation, they typically feed raw working-condition parameters directly into the network; this neither exploits the temporal features hidden in the data nor distinguishes the importance of different process parameters, impairing the agent's comprehensive and accurate perception of the production state. Second, in the policy update mechanism, conventional algorithms treat all dimensions of the state feature equally and cannot identify and focus on the feature dimensions most critical to the current decision, reducing decision accuracy. Finally, during training, the learning rate is usually fixed and cannot adapt to a complex, dynamic optimization environment: a learning rate that is too large makes the policy oscillate, while one that is too small slows convergence, ultimately harming the overall stability and convergence efficiency of the algorithm.
Disclosure of Invention

The invention aims to provide a hydroxyalkyl acrylate yield optimization method and system based on deep reinforcement learning that solve the prior-art difficulty of adapting to complex optimization environments. The invention provides schemes in the following two aspects.

In a first aspect, the invention provides a hydroxyalkyl acrylate yield optimization method based on deep reinforcement learning, comprising the steps of: obtaining time-domain and frequency-domain features of real-time working-condition parameters and historical working-condition parameters in the hydroxyalkyl acrylate production process, and performing weighted fusion on them through an updatable feature weight vector to construct a multi-dimensional current state feature; inputting the current state feature into a policy network, outputting through the policy network a multivariate probability distribution defining the operation vector, and sampling from the distribution to obtain an operation vector containing the adjustment amount of each process parameter; and executing the operation vector, and obtaining the state feature at the next moment and the yield at the current moment.
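A skeleton composing the earlier sketches around the control flow of the first aspect. `env`, `policy`, `value_net` and `contribution` are hypothetical interfaces, not from the patent; only the loop structure follows the disclosure:

```python
import numpy as np

def train(env, policy, value_net, contribution, steps=10_000, a0=3e-4):
    """Training-loop skeleton: state construction, Gaussian sampling,
    reward, GAE, contribution-driven weight update, corrected loss,
    and advantage-adaptive learning rate, repeated until convergence."""
    w = np.ones(env.feature_dim)                       # updatable feature weights
    s = build_state_feature(env.history(), weights=w)
    for _ in range(steps):
        mu, sigma = policy(s)                          # Gaussian policy head
        a = sample_action(mu, sigma, env.low, env.high)
        y, hist_next = env.step(a)                     # execute operation vector
        r = reward(y, env.y_target, a, env.a_max)
        s_next = build_state_feature(hist_next, weights=w)
        adv = gae([r], [value_net(s), value_net(s_next)])[0]
        c = contribution(s, adv)                       # per-feature contribution c_t
        w = update_weights(w, c)                       # claim-5 moving average
        logp, ent = policy.logp_entropy(s, a)
        loss = policy_loss(logp, corrected_advantage(c, s, adv), ent)
        policy.update(loss, lr=adaptive_lr(a0, adv))   # claim-7 adaptive step
        s = s_next
```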