CN-121786505-B - Retired battery echelon utilization grouping method based on reinforcement learning and local search

CN121786505B

Abstract

The invention discloses a method for echelon-utilization grouping of retired batteries based on reinforcement learning and local search, belonging to the technical field of battery manufacturing and management. The method comprises: collecting multidimensional characteristic parameters of the battery cells in a battery set to be grouped and standardizing them to obtain battery characteristic data; constructing a reinforcement learning environment whose state space comprises battery features together with mask information reflecting which batteries are still selectable, and which is provided with a reward function; training an agent with a masked proximal policy optimization (PPO) algorithm in this environment; generating an initial battery grouping sequence with the trained agent; and fine-tuning the initial grouping sequence with a local search algorithm to obtain the final battery grouping scheme and realize retired-battery regrouping. The invention effectively addresses the grouping problem caused by inconsistent multidimensional parameters of retired batteries, requires no hardware equalization circuit, and offers high grouping precision, high computational efficiency and strong generalization capability.

Inventors

  • Tian Jiaqiang
  • Zhou Yujie
  • Pan Tianhong
  • Li Mince
  • Zhang Dexiang
  • Fan Yuan
  • Ni Liping
  • Lao Li
  • Zhang Xu

Assignees

  • 安徽大学 (Anhui University)

Dates

Publication Date
2026-05-12
Application Date
2026-03-05

Claims (10)

  1. A retired battery echelon utilization grouping method based on reinforcement learning and local search, characterized by comprising the following steps: collecting multidimensional characteristic parameters of each battery cell in a battery set to be grouped, and standardizing them to obtain battery characteristic data; constructing a reinforcement learning environment based on the battery characteristic data, wherein the state space of the reinforcement learning environment comprises battery features and mask information reflecting the selectable state of each battery, and the environment is provided with a reward function; training an agent with a masked proximal policy optimization (PPO) algorithm based on the reinforcement learning environment, and generating an initial battery grouping sequence through the trained agent; fine-tuning the initial battery grouping sequence with a local search algorithm based on cell exchange to obtain a final battery grouping scheme; and outputting a battery regrouping instruction according to the final battery grouping scheme to realize retired battery regrouping.
  2. The method of claim 1, wherein collecting the multidimensional characteristic parameters of each battery cell in the battery set to be grouped and standardizing them to obtain the battery characteristic data comprises: for each dimension of the characteristic parameters of each battery cell, determining a maximum value and a minimum value from the original parameter values of all batteries in the corresponding dimension; and applying min-max normalization to the parameter values of the corresponding dimension according to the maximum and minimum values to obtain normalized parameter values, thereby obtaining the battery characteristic data.
  3. The method of claim 1, wherein the construction of the reward function of the reinforcement learning environment comprises: according to the agent's battery-selection actions, when the number of selected batteries reaches the preset group size, calculating and accumulating the intra-group consistency reward of the current group; calculating the inter-group consistency reward across all groups according to the grouping states of all batteries; and summing the intra-group and inter-group consistency rewards to complete the construction of the reward function.
  4. The method according to claim 3, wherein the process of calculating the intra-group consistency reward for the current group comprises: calculating a feature mean vector of the current group according to the battery characteristic data of all batteries in the current group; calculating the intra-group weighted Euclidean distance divergence of the current group according to the differences between the feature mean vector and each battery feature vector in the group, combined with preset intra-group divergence feature weights; and calculating the intra-group consistency reward of the current group according to the intra-group weighted Euclidean distance divergence and a preset intra-group consistency weight coefficient.
  5. The method of claim 4, wherein calculating the inter-group consistency reward across all groups comprises: calculating a global center vector according to the feature mean vectors of all groups; calculating the inter-group weighted Euclidean distance divergence according to the differences between each group's feature mean vector and the global center vector, combined with preset inter-group divergence feature weights; and calculating the inter-group consistency reward according to the inter-group weighted Euclidean distance divergence and a preset inter-group consistency weight coefficient.
  6. The method of claim 1, wherein training the agent with a masked proximal policy optimization algorithm based on the reinforcement learning environment comprises: acquiring from the reinforcement learning environment a current state comprising battery features and an action mask; outputting an action probability distribution through the agent's policy network based on the current state; sampling the index of the next battery as the executed action based on the action probability distribution and the action mask; obtaining a reward signal and an updated state from the reinforcement learning environment based on the executed action, and repeating this interaction to collect training data; and updating the agent's policy network and value network parameters with the proximal policy optimization algorithm based on the collected training data, so as to learn the battery grouping policy.
  7. The method of claim 6, wherein the reward signal during training of the agent is given using a staged accumulation mechanism, specifically: judging, according to the agent's battery-selection action, whether the current group has reached the preset number of cells; if the current group is full, calculating the intra-group consistency reward according to the characteristic data of the batteries in the current group and immediately giving it to the agent to accumulate; and judging whether all batteries have been assigned, and if all groups are complete, calculating the inter-group consistency reward across all groups and giving it to the agent.
  8. The method according to claim 6, wherein the process of training the agent further comprises intelligent convergence detection, specifically: calculating the mean and standard deviation within a sliding time window from the reward values recorded during training; calculating the average reward improvement from the mean of the current window and the mean of the previous window; and judging whether the standard deviation is smaller than its threshold and whether the absolute value of the average improvement is smaller than its threshold; if this state persists for a preset number of cycles, judging that training has converged and stopping the training.
  9. The method of claim 1, wherein fine-tuning the initial battery grouping sequence with a local search algorithm based on cell exchange to obtain the final battery grouping scheme comprises: determining the plurality of battery groups contained in the initial battery grouping sequence; traversing pairs of battery cells belonging to different battery groups; for each traversed cell pair, tentatively exchanging the cells to obtain a new grouping result and calculating the post-exchange objective function score; cancelling the exchange when the post-exchange objective function score is lower than or equal to the pre-exchange score; and repeating the traversal, tentative exchange and evaluation until a preset iteration termination condition is reached, then outputting the optimized final battery grouping scheme.
  10. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1-9 when executing the computer program.
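
The masked action sampling of claims 1 and 6 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the NumPy representation, and the use of a boolean mask over raw policy logits are all assumptions; the point is only that already-assigned batteries receive zero selection probability.

```python
import numpy as np

def sample_masked_action(logits, mask, rng):
    """Sample a battery index from policy logits, forcing the probability
    of masked-out (already assigned) batteries to zero.

    logits : (n,) raw scores from the policy network (hypothetical)
    mask   : (n,) boolean array, True where the battery is still selectable
    rng    : a numpy random Generator
    """
    masked = np.where(mask, logits, -np.inf)   # invalid actions -> -inf
    masked = masked - masked.max()             # softmax numerical stability
    probs = np.exp(masked)                     # exp(-inf) == 0 for masked slots
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```

A cell that is masked out can never be sampled, so every generated grouping sequence is a valid permutation of the battery set.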
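
The sliding-window convergence detection of claim 8 might look like the sketch below. The window size, thresholds, and patience count are illustrative assumptions (the patent only specifies "corresponding threshold values" and a "preset cycle number"); the structure follows the claim: window mean and standard deviation, improvement versus the previous window mean, and stopping once both stay small for several consecutive checks.

```python
import statistics

class ConvergenceDetector:
    """Stop training when the reward std-dev within a sliding window and the
    mean improvement versus the previous window both stay below their
    thresholds for `patience` consecutive checks (thresholds are assumed)."""

    def __init__(self, window=50, std_tol=0.5, improve_tol=0.01, patience=5):
        self.window, self.std_tol = window, std_tol
        self.improve_tol, self.patience = improve_tol, patience
        self.rewards, self.prev_mean, self.hits = [], None, 0

    def update(self, reward):
        """Record one episode reward; return True once training has converged."""
        self.rewards.append(reward)
        if len(self.rewards) < self.window:
            return False
        cur = self.rewards[-self.window:]
        mean, std = statistics.mean(cur), statistics.pstdev(cur)
        improvement = 0.0 if self.prev_mean is None else mean - self.prev_mean
        self.prev_mean = mean
        if std < self.std_tol and abs(improvement) < self.improve_tol:
            self.hits += 1          # stable: count consecutive quiet checks
        else:
            self.hits = 0           # still moving: reset the streak
        return self.hits >= self.patience
```

Requiring the condition to persist for several cycles avoids stopping on a momentarily flat stretch of the reward curve.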

Description

Retired battery echelon utilization grouping method based on reinforcement learning and local search

Technical Field

The invention belongs to the technical field of battery manufacturing and management, and particularly relates to a retired battery echelon utilization grouping method based on reinforcement learning and local search.

Background

The consistency of a battery pack is a key factor affecting the performance, life and safety of a battery module. During battery pack production, if the series- and parallel-connected cells differ in parameters such as capacity and internal resistance, a "barrel effect" arises in the pack during charge-discharge cycling: the worst-performing cell limits the capacity of the whole pack, and the differences widen as the cycle count increases, eventually causing early failure of the pack.

At present, conventional battery sorting and grouping methods mainly include the following.

1. Single-parameter sorting, which simply bins cells by capacity or voltage. This method ignores dynamic parameters such as internal resistance and polarization characteristics, and its grouping performance is poor.

2. Multi-parameter static sorting, which considers capacity and internal resistance together and groups cells with k-means clustering or a genetic algorithm. However, conventional optimization algorithms easily fall into local optima on large-scale battery data, and their computational complexity grows exponentially with the number of batteries, making it difficult to find a globally optimal solution in limited time.

3. Passive/active equalization, which balances cell differences during operation through hardware circuits, but increases hardware cost and control-system complexity and cannot fundamentally solve the problem of poor cell matching.

Therefore, how to comprehensively consider the static (capacity, ohmic internal resistance) and dynamic (polarization internal resistance, polarization capacitance) multidimensional characteristics of batteries and achieve efficient, highly consistent battery grouping with an intelligent algorithm is a problem that currently needs to be solved.

Disclosure of Invention

In order to solve the above technical problems, the invention provides a retired battery echelon utilization grouping method based on reinforcement learning and local search, comprising the following steps: collecting multidimensional characteristic parameters of each battery cell in a battery set to be grouped, and standardizing them to obtain battery characteristic data; constructing a reinforcement learning environment based on the battery characteristic data, wherein the state space comprises battery features and mask information reflecting the selectable state of each battery, and the environment is provided with a reward function; training an agent with a masked proximal policy optimization (PPO) algorithm based on the reinforcement learning environment, and generating an initial battery grouping sequence through the trained agent; fine-tuning the initial battery grouping sequence with a local search algorithm based on cell exchange to obtain a final battery grouping scheme; and outputting a battery regrouping instruction according to the final battery grouping scheme to realize retired battery regrouping.
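
The cell-exchange fine-tuning step above can be sketched as a simple hill climb. This is an illustrative sketch, not the patented code: the function names and the example objective (negative total intra-group variance, so that larger is better) are assumptions, but the accept/revert logic matches the description, keeping a swap only when it strictly improves the objective score.

```python
import itertools
import numpy as np

def neg_variance_objective(groups, features):
    """Example objective (an assumption): negative sum of per-dimension
    intra-group variances, so tighter groups score higher."""
    return -sum(np.var(features[g], axis=0).sum() for g in groups)

def local_search(groups, features, objective, max_iters=100):
    """Hill-climbing fine-tuning: try swapping cells between different
    groups; keep a swap only when it strictly improves the objective,
    otherwise revert it. Stops at a full pass with no improvement."""
    best = objective(groups, features)
    for _ in range(max_iters):
        improved = False
        for ga, gb in itertools.combinations(range(len(groups)), 2):
            for i in range(len(groups[ga])):
                for j in range(len(groups[gb])):
                    # tentative exchange of one cell pair
                    groups[ga][i], groups[gb][j] = groups[gb][j], groups[ga][i]
                    score = objective(groups, features)
                    if score > best:
                        best, improved = score, True   # keep the swap
                    else:
                        # cancel the exchange: swap back
                        groups[ga][i], groups[gb][j] = groups[gb][j], groups[ga][i]
        if not improved:
            break
    return groups, best
```

Because every accepted swap strictly improves the score, the procedure terminates at a local optimum of the grouping objective, which is exactly the fine-tuning role it plays after the RL agent produces the initial sequence.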
Optionally, collecting the multidimensional characteristic parameters of each battery cell in the battery set to be grouped and standardizing them to obtain battery characteristic data includes: for each dimension of the characteristic parameters of each battery cell, determining a maximum value and a minimum value from the original parameter values of all batteries in the corresponding dimension; and applying min-max normalization to the parameter values of the corresponding dimension according to the maximum and minimum values to obtain normalized parameter values, thereby obtaining the battery characteristic data.

Optionally, the construction of the reward function of the reinforcement learning environment includes: according to the agent's battery-selection actions, when the number of selected batteries reaches the preset group size, calculating and accumulating the intra-group consistency reward of the current group; calculating the inter-group consistency reward across all groups according to the grouping states of all batteries; and summing the intra-group and inter-group consistency rewards to complete the construction of the reward function.

Optionally, the process of calculating the intra-group consistency reward for the current group includes: calculating a feature mean vector of the current group according to the battery characteristic data of all batteries in the current group;
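
The normalization and intra-group reward described above can be sketched as follows. The function names, the sign convention (reward as negative divergence), and the default weight coefficient are assumptions for illustration; the computation follows the description: per-dimension min-max scaling, then a weighted Euclidean distance of each cell from the group's feature mean, scaled by a consistency weight coefficient.

```python
import numpy as np

def min_max_normalize(raw):
    """Per-dimension min-max normalization of raw cell parameters
    (rows = cells; columns = e.g. capacity, ohmic resistance, ...)."""
    lo, hi = raw.min(axis=0), raw.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    return (raw - lo) / span

def intra_group_reward(group_feats, feat_weights, alpha=1.0):
    """Intra-group consistency reward: the mean weighted Euclidean distance
    of each cell from the group's feature-mean vector, negated and scaled
    by a consistency weight coefficient alpha (an assumed convention, so
    that tighter groups earn higher reward)."""
    mean_vec = group_feats.mean(axis=0)
    diff = group_feats - mean_vec
    divergence = np.sqrt((feat_weights * diff ** 2).sum(axis=1)).mean()
    return -alpha * divergence
```

A perfectly uniform group has zero divergence and thus the maximum attainable reward of zero; any spread in the weighted features pushes the reward negative.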