CN-122018465-A - Robot control strategy optimization method based on advantage-function separation of offline data
Abstract
The invention belongs to the technical field of robot control strategy optimization and in particular relates to a robot control strategy optimization method based on advantage-function separation of offline data. It aims to solve the problems of excessive dependence on the Q value, training instability, and Q-value overestimation in traditional offline reinforcement learning methods, thereby improving the optimization of robot control strategies. The method first decomposes the Q function into a state-value function and an advantage function, and applies a zero-mean constraint to the critic's training process so that the value function and the advantage function are stably separated during Bellman iteration. By using the advantage function as the main policy optimization signal, the method effectively mitigates the training instability caused by Q-value overestimation and improves the robustness and generalization capability of the policy learning process. In offline reinforcement learning tasks in particular, the method performs policy optimization well under limited data and sparse expert demonstrations, without relying on environment interaction or additional expert data.
Inventors
- WU FENGGUO
- XU ZHU
- WU SHOUYING
- ZHANG JIANWEI
- LI HUI
Assignees
- Sichuan University (四川大学)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-02-03
Claims (9)
- 1. A robot control strategy optimization method based on advantage-function separation of offline data, characterized by comprising the following steps: S1, acquiring multiple groups of trajectory data from the robot control process, each trajectory comprising, in time order, control state information, executed-action information, reward information reflecting control quality, next-control-state information, and control-termination flag information; forming an original offline dataset and performing state-action sampling on it; S2, Q-function decomposition: constructing a parameterized critic network based on the original offline dataset and representing the action-value function as the sum of a state-value function and an advantage function; S3, zero-mean advantage constraint: introducing a zero-mean advantage regularization constraint into the critic's training process; S4, advantage-based policy optimization: during policy-network updates, using the advantage function as the core supervision signal of policy optimization and guiding the policy toward actions with higher relative advantage; S5, joint optimization with a behavior constraint: introducing a behavior-cloning constraint term into the policy optimization objective so that policy updates are compatible with both advantage maximization and the offline data distribution; and S6, the robot executing the corresponding actions based on the optimized policy.
- 2. The robot control strategy optimization method based on advantage-function separation of offline data according to claim 1, wherein the specific process of step S2 is as follows: adopting a value-evaluation mode based on function approximation, the traditional single action-value function Q_φ is explicitly split into two mutually independent but cooperating function modules, a state-value function V_ψ and an advantage function A_δ, with the specific expression: Q_φ(s, a) = V_ψ(s) + A_δ(s, a); wherein V_ψ(s) characterizes the expected return of the state, and A_δ(s, a) is an advantage function characterizing the value of the action relative to the average action Q value under the offline dataset; the value-function module takes the state s as input and outputs the corresponding scalar value estimate V_ψ(s); the advantage-function module takes the state-action pair (s, a) as input and outputs the corresponding advantage estimate A_δ(s, a) (the decomposed critic is illustrated in Sketch 1 following the claims).
- 3. The robot control strategy optimization method based on advantage-function separation of offline data according to claim 2, further comprising the following process in step S2: performing redundant modeling of the value-function module and the advantage-function module with a double-network structure, i.e., constructing two independently parameterized value-advantage module pairs; setting a corresponding target network for each value-function module and advantage-function module and keeping its parameters slowly synchronized with the main network through soft updates or delayed updates; in the parameter-initialization stage, randomly initializing the weight parameters of the value-function module and the advantage-function module, and scale-constraining or initializing the output layer of the advantage-function module to a range close to zero; setting the key hyperparameters related to advantage decomposition and regularization, including an advantage-decomposition strength parameter ζ and an advantage-guided policy-optimization weight parameter η; the parameter ζ adjusts the regularization strength of the advantage function during critic training, and the parameter η controls the relative weight between the advantage-based optimization term and the behavior-cloning constraint term during policy updating (see Sketch 1 following the claims).
- 4. The robot control strategy optimization method based on advantage-function separation of offline data according to claim 1, wherein the specific process of step S3 in each training iteration is as follows: first, extracting a small batch of samples from the offline dataset D according to a preset sampling strategy; for each sample, calculating the corresponding target Q value Q_target using the constructed target value-function module and target advantage-function module in combination with the target policy network; on this basis, constructing the critic training target, i.e., realizing the value-consistency constraint of Bellman iteration by minimizing the mean-square error between the current Q-value estimate and the target Q value, and introducing a zero-mean advantage regularization term during training: in each training batch, calculating the mean output of the advantage function over the batch samples and, taking this mean as the reference, applying a zero-mean constraint to the advantage function so that its expectation is close to zero under the offline data distribution; finally, expressing the joint training loss of the value function and the advantage function as a weighted sum of the Bellman error term and the zero-mean advantage regularization term, the regularization strength being controlled by the parameter ζ set in step S2; during training, the parameters of the value-function module and the advantage-function module are updated by gradient back-propagation, and the target-network parameters are synchronized through soft updates or delayed updates (see Sketch 2 following the claims).
- 5. The robot control strategy optimization method based on advantage-function separation of offline data according to claim 4, wherein the target Q value is calculated as follows: S31, generating the corresponding action a_next for the next state s_next through the target policy network; S32, computing V_ψ'(s_next) and A_δ'(s_next, a_next) with the target value-function module and the target advantage-function module respectively, and adding the two to obtain the target Q estimate of the next state-action pair; S33, combining the immediate reward and the discount factor to construct the target value based on the Bellman equation: Q_target = r_i + γ(1 − d_i)[V_ψ'(s_next) + A_δ'(s_next, a_next)], where d_i is the termination flag; S34, forward-computing the current state-action pair (s_i, a_i) with the current value-function module and advantage-function module to obtain the current estimates V_ψ(s_i) and A_δ(s_i, a_i), and reconstructing the current Q-value estimate as Q_φ(s_i, a_i) = V_ψ(s_i) + A_δ(s_i, a_i).
- 6. The robot control strategy optimization method based on advantage-function separation of offline data according to claim 5, wherein the specific process of step S4 is as follows: S41, extracting state-action samples (s_i, a_i) from the offline dataset according to a preset batch size, and calculating the advantage output A_δ(s_i, a_i) of each sample using the trained value-function module and advantage-function module; S42, comparing the output of the policy network with the actual action a_i in the offline data and calculating the behavior-cloning loss L_BC, which constrains the policy output to be consistent with the data distribution, with the specific formula: L_BC = E_(s_i, a_i)~D[(π_θ(s_i) − a_i)²].
- 7. The robot control strategy optimization method based on advantage-function separation of offline data according to claim 6, further comprising, after step S42, the following steps: introducing a policy-optimization term guided by the advantage function: L_adv = −E_(s_i)~D[A_δ(s_i, π_θ(s_i))]; by maximizing the advantage-function output, the policy is prompted to preferentially select actions of higher value relative to the average; obtaining the total loss function of the policy network as the weighted combination of the advantage-guided term and the behavior-cloning term: L_π = η·L_adv + L_BC; wherein η is the hyperparameter set in step S2, used to adjust the balance between the advantage-driven update and the behavior-cloning constraint; performing gradient computation and back-propagation on the policy-network parameters θ and updating the network weights with an optimizer (see Sketch 3 following the claims).
- 8. The robot control strategy optimization method based on advantage-function separation of offline data according to claim 1, wherein step S5 further comprises the following process: introducing target networks V_ψ'(s) and A_δ'(s, a) for the value-function module V_ψ(s) and the advantage-function module A_δ(s, a), the target-network parameters being synchronized by exponentially weighted averaging of the main-network parameters: ψ' ← τψ + (1 − τ)ψ'; δ' ← τδ + (1 − τ)δ'; wherein τ ∈ (0, 1) is the soft-update coefficient; on the policy side, the target policy network π_θ' adopts a similar soft-update strategy; secondly, the training process is comprehensively monitored by a training-convergence judgment mechanism, specifically comprising: policy-performance stability judgment, i.e., after each training round, comparing the average return of the policy network on a validation set or in environment simulation with the returns of previous rounds, and judging that the policy has reached a stable state if the variation amplitude of the average return over several consecutive rounds is below a preset threshold ε1; gradient-convergence judgment, i.e., calculating the gradient magnitudes of the parameter updates of the policy network, value-function module, and advantage-function module, and judging that the networks tend toward local convergence if the gradient magnitudes over several consecutive rounds are below a threshold ε2; Q-function convergence judgment, i.e., monitoring the critic loss of the value-function and advantage-function modules, and judging that the Q function has converged when the loss-reduction amplitude over several consecutive rounds is below a threshold ε3 or the mean loss tends to be stable; and a comprehensive stopping condition, i.e., combining policy-performance stability, gradient convergence, and Q-function convergence, stopping the training process only when all three are satisfied, and otherwise continuing iterative updates (see Sketch 4 following the claims).
- 9. The robot control strategy optimization method based on advantage-function separation of offline data according to claim 8, further comprising, in step S5, a process of policy deployment and online evaluation, specifically comprising the following steps: after the offline training and convergence judgment of the preceding steps are completed, deploying the trained policy network π_θ(s) into an actual or simulated environment for online evaluation: first, interfacing the policy network π_θ(s) with the environment's state-input interface to close the loop between policy decisions and environment interaction; at each time step t, the policy outputs an action a_t = π_θ(s_t) according to the current observed state s_t, and the environment executes the action, producing a new state s_(t+1) and a reward r_(t+1); secondly, measuring with the following indices: cumulative reward, i.e., calculating the cumulative reward of the agent over a complete trajectory from a given initial state, R = Σ_t r_(t+1); task completion rate or success rate, i.e., for discrete-goal or sparse-reward tasks, counting the frequency with which the policy completes the goal; and behavior-stability evaluation, i.e., analyzing the policy's action-selection distribution across different states to evaluate whether the policy exhibits abnormal or oscillating behavior (see Sketch 5 following the claims).
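Illustrative sketches (non-normative)
The claims describe the method abstractly; the sketches below restate them as runnable PyTorch code under stated assumptions, not as a definitive implementation. Sketch 1 covers the decomposed critic of claims 2 and 3: twin value/advantage module pairs with frozen target copies. The layer sizes, input dimensions, near-zero initialization range of the advantage output layer, and the values of ζ and η are illustrative assumptions, not values fixed by the patent.

```python
import copy
import torch
import torch.nn as nn

class DecomposedCritic(nn.Module):
    """Claims 2-3: Q_phi(s, a) = V_psi(s) + A_delta(s, a)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        # State-value module V_psi: state -> scalar expected return.
        self.v = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))
        # Advantage module A_delta: (state, action) -> scalar advantage.
        self.a = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, 1))
        # Claim 3: initialize the advantage output layer close to zero.
        nn.init.uniform_(self.a[-1].weight, -1e-3, 1e-3)
        nn.init.zeros_(self.a[-1].bias)

    def forward(self, s, act):
        v = self.v(s)
        adv = self.a(torch.cat([s, act], dim=-1))
        return v + adv, v, adv  # Q, V, A

# Claim 3: two independently parameterized value-advantage pairs plus targets.
state_dim, action_dim = 17, 6                 # assumed dimensions
critics = [DecomposedCritic(state_dim, action_dim) for _ in range(2)]
targets = [copy.deepcopy(c).requires_grad_(False) for c in critics]
zeta, eta = 0.1, 2.5                          # assumed values for the claims' ζ and η
```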
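Sketch 2 implements the critic update of claims 4 and 5: the Bellman target is built from the target policy and target value/advantage modules, and the loss is the weighted sum of the Bellman mean-square error and the zero-mean advantage regularizer. Taking the element-wise minimum over the twin target pairs is an assumption borrowed from clipped double-Q learning; the claims only specify a double-network structure.

```python
import torch
import torch.nn.functional as F

def critic_update(batch, critics, targets, policy_target, optimizers,
                  gamma=0.99, zeta=0.1):
    """One critic step (claims 4-5). Batch tensors have shape (B, dim);
    r and done are column tensors of shape (B, 1)."""
    s, a, r, s_next, done = batch                 # sampled from offline dataset D
    with torch.no_grad():
        a_next = policy_target(s_next)            # S31: target policy action
        # S32-S33: Q_target = r + gamma * (1 - done) * (V'(s') + A'(s', a')).
        q_next = torch.min(*[t(s_next, a_next)[0] for t in targets])
        q_target = r + gamma * (1.0 - done) * q_next
    for critic, opt in zip(critics, optimizers):
        q, v, adv = critic(s, a)                  # S34: current Q = V + A
        bellman = F.mse_loss(q, q_target)         # Bellman error term
        zero_mean = adv.mean().pow(2)             # push batch-mean advantage to 0
        loss = bellman + zeta * zero_mean         # weighted sum per claim 4
        opt.zero_grad()
        loss.backward()
        opt.step()
```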
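Sketch 3 implements the policy update of claims 6 and 7: the total loss weights the advantage-guided term against the behavior-cloning mean-square error with η. Using the first critic's advantage head as the guidance signal is an assumption; the claims do not say which of the twin modules drives the policy.

```python
import torch.nn.functional as F

def policy_update(batch, critics, policy, policy_opt, eta=2.5):
    """Policy step (claims 6-7): L_pi = eta * L_adv + L_BC."""
    s, a = batch[0], batch[1]
    pi_a = policy(s)                              # deterministic policy output
    # Advantage-guided term: maximize A_delta(s, pi(s)) -> minimize its negation.
    adv = critics[0](s, pi_a)[2]
    adv_loss = -adv.mean()
    # S42: behavior cloning keeps pi(s) close to the dataset action.
    bc_loss = F.mse_loss(pi_a, a)
    loss = eta * adv_loss + bc_loss               # claim 7 weighted combination
    policy_opt.zero_grad()
    loss.backward()
    policy_opt.step()
```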
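Sketch 4 covers claim 8: exponentially weighted target-network synchronization and the three-part stopping rule. The window length k and the thresholds ε1, ε2, ε3 are placeholders; the patent leaves their values open.

```python
import torch
import torch.nn as nn

def soft_update(target: nn.Module, source: nn.Module, tau: float = 0.005):
    """Claim 8: theta' <- tau * theta + (1 - tau) * theta', with tau in (0, 1)."""
    with torch.no_grad():
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.mul_(1.0 - tau).add_(tau * sp)

def converged(returns, grads, losses, eps1=1.0, eps2=1e-3, eps3=1e-3, k=5):
    """Claim 8 stopping rule: stop only when all three criteria hold
    over the last k rounds (k and eps values are placeholders)."""
    if min(len(returns), len(grads), len(losses)) < k:
        return False
    def stable(xs, eps):                          # variation over last k rounds
        return max(xs[-k:]) - min(xs[-k:]) < eps
    policy_stable = stable(returns, eps1)         # performance stability
    grads_small = max(grads[-k:]) < eps2          # gradient convergence
    q_stable = stable(losses, eps3)               # Q-function convergence
    return policy_stable and grads_small and q_stable
```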
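Sketch 5 covers the online-evaluation loop of claim 9, assuming a Gymnasium-style environment API; the `info.get("success")` field is a hypothetical task-completion signal standing in for whatever the deployment actually exposes.

```python
import torch

def evaluate(policy, env, episodes: int = 10):
    """Claim 9: closed-loop rollout reporting cumulative reward and success rate."""
    total_returns, successes = [], 0
    for _ in range(episodes):
        s, _ = env.reset()
        done, ep_return, info = False, 0.0, {}
        while not done:
            with torch.no_grad():
                a_t = policy(torch.as_tensor(s, dtype=torch.float32))
            s, r, terminated, truncated, info = env.step(a_t.numpy())
            ep_return += r                        # cumulative reward R = sum_t r_(t+1)
            done = terminated or truncated
        successes += int(info.get("success", False))
        total_returns.append(ep_return)
    return sum(total_returns) / episodes, successes / episodes
```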
Description
Robot control strategy optimization method based on advantage-function separation of offline data
Technical Field
The invention belongs to the technical field of robot control strategy optimization and particularly relates to a robot control strategy optimization method based on advantage-function separation of offline data.
Background
In the field of intelligent manufacturing, precise robot control is a core link in realizing large-scale production of high-end products; its precision, efficiency, and stability directly determine product quality and production benefit. With the advance of Industry 4.0, control demands such as the transfer of high-precision parts and the switching of processing stations are growing, placing strict requirements on the accuracy and robustness of robot control strategies. Precise robot control must not only achieve millimeter-level or even micron-level positioning precision but also maintain the continuity and safety of the control path in complex production environments (such as multi-equipment cross-operation areas and dynamic-obstacle interference scenes); it is key to connecting production procedures and guaranteeing a smooth production flow.
Traditional robot control strategies mainly rely on two implementation modes. The first is fixed action-sequence control based on pre-programming; this mode requires manually and precisely planning the action parameters of each control step, and problems such as jamming and excessive control deviation easily arise when facing uncertainty factors such as part tolerances and small changes in the control environment. The second is strategy optimization based on online reinforcement learning, which adjusts the action strategy through real-time interaction feedback between the robot and the control environment. Although this approach can adapt to environmental changes to a certain extent, it has obvious limitations in precise control scenarios: erroneous trial actions during online interaction damage high-value parts and wear control tools, and real-time interactive training occupies production equipment, seriously affecting production efficiency and failing to meet the requirements of large-scale industrial production.
To solve the above problems, offline reinforcement learning has gradually been applied to the training of robot control strategies. Its core advantage is that policy optimization is completed relying only on an offline dataset acquired in advance, with no environment interaction needed during training. However, existing offline reinforcement learning methods still face key technical bottlenecks when applied to precise robot control. First, traditional offline reinforcement learning depends heavily on the absolute value of the Q function (action-value function) for policy optimization, but in precise control scenarios the acquisition of the offline dataset is limited by the actual production flow, and problems such as incomplete data coverage (e.g., scarce samples of extreme control postures and sudden jamming conditions) and uneven data distribution (a low proportion of expert-level high-precision control trajectories) often exist, so the absolute estimation error of the Q value is large.
For example, in the control of a miniature part, micro-fluctuations of the control force/torque signal can cause significant deviations in the Q-value estimates of the same state-action pair, and excessive dependence on the Q value then makes the strategy training process unstable, so that jitter and degraded positioning accuracy occur when the robot performs control actions, and control may even fail. Second, the problem of Q-value overestimation is particularly prominent in fine control. When the Q value is updated through Bellman iteration in existing offline reinforcement learning methods, the Q values of insufficiently sampled state-action pairs are easily overestimated because the dataset cannot completely cover all control state-action pairs. When the strategy selects actions according to an overestimated Q value, the robot may perform actions beyond the safe range, such as applying excessive control force and deforming the part, or may preferentially select a control path that looks optimal but is insufficiently verified in practice, reducing the reliability of the control process. In addition, in existing offline reinforcement learning methods, the definition of the advantage function is limited to the relative advantage of an action within a single state, and the value-distribution characteristic of global actions in