
CN-122018437-A - Self-adaptive motion control method and system for dirt cleaning mechanical arm based on reinforcement learning

CN 122018437 A

Abstract

The invention discloses a reinforcement-learning-based adaptive motion control method and system for a cleaning mechanical arm, in the technical field of adaptive motion control of mechanical arms. The method comprises: generating a scene classification model; calling a matched sub-strategy through scene classification; generating action parameter instructions through inference by a reinforcement learning model; driving the mechanical arm to operate cooperatively with the storage unit, the navigation system, and other associated systems; collecting feedback parameters and calculating a reward value; and finally iteratively optimizing the model strategy according to the reward value, dynamically adjusting the action parameters, and handling anomalies through a hierarchical anomaly response mechanism.

Inventors

  • Gong Hui
  • Yuan Kuan
  • Lv Linhuo
  • Shi Yinan
  • Chen Zhuofei

Assignees

  • 成都河宝机器人有限公司
  • 东方水利智能科技股份有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-01-26

Claims (10)

  1. An adaptive motion control method for a reinforcement-learning-based cleaning mechanical arm, characterized by comprising the following steps: step one, collecting multi-source associated data, analyzing the multi-source associated data, and generating a parameterized state vector; step two, matching a decision generation strategy, namely completing scene classification through characteristic parameter analysis, calling a matched sub-strategy as a constraint, performing inference on the state vector based on a reinforcement learning model, and generating an action parameter instruction; step three, cooperatively executing and collecting feedback, namely driving the mechanical arm and the associated systems to cooperatively execute operation actions according to the parameters, collecting feedback parameters after the actions are executed, and calculating a reward value; and step four, optimizing the strategy and terminating the task, namely iteratively optimizing the model strategy according to the reward value and the feedback result, dynamically adjusting the action parameters synchronously based on characteristic parameter changes, and handling anomalies through a hierarchical anomaly response mechanism.
  2. The adaptive motion control method of a reinforcement-learning-based cleaning mechanical arm according to claim 1, wherein the multi-source associated data is analyzed by the following specific process: acquiring the mechanical arm state parameters, the work-object characteristic parameters, and the equipment state parameters; the continuous parameters are mapped to a preset numerical interval through Min-Max normalization, the discrete parameters are independently encoded, the weights of the mechanical arm state parameters, the work-object characteristic parameters, and the equipment state parameters are determined by the analytic hierarchy process (AHP), and a state vector of preset dimensions is generated by weighted summation.
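The normalization-and-weighting pipeline of claim 2 can be sketched as follows. The parameter names, value ranges, and AHP weights are illustrative assumptions rather than values from the patent, and concatenating the weighted groups is one plausible reading of the claim's "weighted summation":

```python
import numpy as np

def minmax(x, lo, hi):
    """Min-Max normalize a continuous parameter into [0, 1]."""
    return (x - lo) / (hi - lo)

def one_hot(index, n):
    """Independently (one-hot) encode a discrete parameter."""
    v = np.zeros(n)
    v[index] = 1.0
    return v

# Hypothetical AHP-derived group weights: arm state, work-object features, device state.
W_ARM, W_OBJ, W_DEV = 0.5, 0.3, 0.2

def build_state_vector(arm, obj, dev):
    """Weight each normalized parameter group and concatenate into a fixed-dimension state vector."""
    arm_part = W_ARM * np.array([minmax(arm["joint_angle"], -180, 180),
                                 minmax(arm["end_speed"], 0, 2.0)])
    obj_part = W_OBJ * np.concatenate([one_hot(obj["type_id"], 4),
                                       [minmax(obj["density"], 0, 50)]])
    dev_part = W_DEV * np.array([minmax(dev["battery"], 0, 100),
                                 minmax(dev["bin_capacity"], 0, 100)])
    return np.concatenate([arm_part, obj_part, dev_part])
```

With these assumed ranges the vector has 2 + 5 + 2 = 9 fixed dimensions, matching the "preset dimensions" requirement.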
  3. The adaptive motion control method of a reinforcement-learning-based cleaning mechanical arm according to claim 2, wherein the action parameter instruction is generated by the following specific process: S1, setting scene classification parameter thresholds, wherein the scene is a single-type scene when the work-object type proportion reaches a preset type-proportion threshold, a simple scene when the obstacle density is below a preset obstacle density threshold, a stable scene when the dynamic change rate is below a preset change-rate threshold, and a conventional scene when the remaining device capacity is above a preset capacity threshold; S2, constructing a core-scene sub-strategy parameter library containing a preset number of scenes, each type of scene being provided with a preset number of parameter items; the matching degree between the current state parameter vector and the sub-strategy parameter library is calculated by a cosine similarity algorithm, the corresponding sub-strategy parameters are called directly when the matching degree reaches a preset high-match threshold, the parameters are fine-tuned when the matching degree falls in a preset middle-match interval, and default sub-strategy parameters are called when the matching degree is below a preset low-match threshold; S3, dividing the action parameter dimension system, wherein the discrete action parameters comprise a preset operation mode and a preset operation priority, the continuous action parameters comprise preset parameter ranges for the joint angle adjustment and the end-effector speed, and both are classified by preset quantization units; confidence values between each class of discrete action and each class of continuous parameter are generated by correlation analysis, a discrete action and a continuous parameter whose confidence exceeds a preset threshold are recorded as an associated parameter pair, and a parameter association table is established to obtain the mapping between the discrete and continuous parameters; the parameterized state vector is input into the reinforcement learning model, the discrete parameters are determined by a Softmax function, and the continuous parameters are generated by Gaussian distribution sampling and corrected according to preset correction rules in combination with the characteristics of the work object, so as to generate the action parameter instruction.
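The three-way dispatch of step S2 can be sketched with a cosine-similarity lookup. The threshold values, the library layout, and the fine-tuning rule (scaling by match degree) are illustrative assumptions; the claim only states that such presets exist:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative thresholds; the claim leaves the actual values preset.
HIGH_MATCH, LOW_MATCH = 0.9, 0.6

def select_sub_strategy(state_vec, strategy_library, default_params):
    """Per claim 3 S2: call directly above the high threshold, fine-tune in the
    middle interval, fall back to default sub-strategy parameters below the low threshold."""
    best_name, best_sim = None, -1.0
    for name, entry in strategy_library.items():
        sim = cosine_similarity(state_vec, entry["signature"])
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim >= HIGH_MATCH:
        return strategy_library[best_name]["params"], "direct"
    if best_sim >= LOW_MATCH:
        # Hypothetical fine-tuning rule: shrink parameters toward the library values
        # in proportion to the match degree.
        tuned = {k: v * (0.5 + 0.5 * best_sim)
                 for k, v in strategy_library[best_name]["params"].items()}
        return tuned, "fine-tuned"
    return default_params, "default"
```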
  4. The adaptive motion control method of a reinforcement-learning-based cleaning mechanical arm according to claim 3, wherein the reinforcement learning model adopts two-stage training, and the specific training process is as follows: constructing a simulation parameter model, generating a preset number of sample groups covering typical combinations of the preset scenes, and training with the PPO algorithm; the average reward value and the cleaning efficiency parameter of the validation set are calculated every preset number of validation iterations, training is stopped when the average reward value of the validation set reaches a preset validation reward threshold and the cleaning efficiency parameter reaches a preset validation efficiency threshold, and the model parameters are saved; and one group of samples is acquired in each of a preset number of operation cycles, fine-tuning is started when the sample increment reaches a preset fine-tuning sample threshold, when the reward value stays below a preset fine-tuning reward threshold for a preset number of consecutive low-reward cycles, or when the state parameter change rate reaches a preset state change-rate threshold; an incremental learning algorithm is adopted, the learning rate is reduced to a preset learning-rate interval, and only the fully-connected layer parameters of the Actor network are updated while the feature-extraction layer parameters are frozen; after fine-tuning, operation data of a preset number of validation sample groups are collected continuously, the cleaning efficiency improvement rate and the anomaly occurrence reduction rate are calculated, and the updated parameters are saved when the efficiency improvement rate reaches a preset fine-tuning efficiency improvement threshold and the anomaly occurrence reduction rate reaches a preset fine-tuning anomaly reduction threshold; otherwise the parameters are rolled back to those before the adjustment.
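The freeze-and-rollback logic of the fine-tuning stage can be sketched abstractly. The layer names, thresholds, and plain gradient step are illustrative placeholders, not the patent's actual network or PPO internals:

```python
import numpy as np

def fine_tune_step(params, grads, lr=1e-4):
    """Incremental-learning update per claim 4: freeze the feature-extraction
    layers and update only the Actor fully-connected layers at a reduced learning rate."""
    trainable = {"actor_fc"}  # hypothetical layer name; "feature_extractor" stays frozen
    new_params = {}
    for name, w in params.items():
        if name in trainable:
            new_params[name] = w - lr * grads[name]
        else:
            new_params[name] = w  # frozen layer: weights pass through unchanged
    return new_params

def accept_update(old_params, new_params, eff_gain, anomaly_drop,
                  eff_threshold=0.05, anomaly_threshold=0.10):
    """Keep the fine-tuned weights only if both validation gates pass
    (efficiency improvement AND anomaly reduction); otherwise roll back."""
    if eff_gain >= eff_threshold and anomaly_drop >= anomaly_threshold:
        return new_params
    return old_params
```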
  5. The adaptive motion control method of a reinforcement-learning-based cleaning mechanical arm according to claim 3, wherein the reinforcement learning model is implemented as follows: the parameterized state vector is input into the Actor network of the reinforcement learning model, the probability distribution over the discrete action parameters is calculated by a Softmax function, the parameter combination with the highest probability is selected as the discrete decision result, the initial values of the continuous action parameters are generated by Gaussian distribution sampling and assembled into a continuous parameter sequence, and the action instruction comprises the discrete decision result, the continuous parameter sequence, and a timing parameter.
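The Actor output stage of claim 5 can be sketched as Softmax over discrete logits plus Gaussian sampling for the continuous parameters. The logit/mean/std inputs and the timing value are illustrative stand-ins for whatever the Actor network actually produces:

```python
import numpy as np

def softmax(z):
    """Numerically stable Softmax over a logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def actor_decision(discrete_logits, cont_mean, cont_std, rng=None):
    """Claim 5 sketch: pick the highest-probability discrete action via Softmax,
    draw initial continuous parameters from a Gaussian, and attach a timing parameter."""
    if rng is None:
        rng = np.random.default_rng(0)
    probs = softmax(np.asarray(discrete_logits, dtype=float))
    discrete_action = int(np.argmax(probs))
    continuous_params = rng.normal(cont_mean, cont_std)
    return {"discrete": discrete_action,
            "continuous": continuous_params.tolist(),
            "timing": 0.1}  # illustrative timing parameter (seconds)
```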
  6. The adaptive motion control method of a reinforcement-learning-based cleaning mechanical arm according to claim 1, wherein the feedback parameters are collected after the actions are executed, and the specific collection process is as follows: T1, the mechanical arm is linked with the storage unit, namely when the difference between the conveying speed parameter of the mechanical arm and the receiving speed parameter of the storage unit exceeds a preset speed-difference threshold, a corrected speed parameter is calculated and a synchronous adjustment is performed according to it; a preset height margin is added to the inlet height parameter of the storage unit to obtain a lifting height threshold for the mechanical arm, and the lifting height parameter of the mechanical arm is kept above this threshold in coordination; T2, the mechanical arm cooperates with the navigation system, namely the preset maximum operation efficiency parameter is divided by the preset operation width parameter of the mechanical arm and then by the preset operation depth parameter to obtain a navigation speed threshold; when the navigation speed parameter is smaller than or equal to this threshold and the unilateral force on the mechanical arm exceeds a preset force threshold, the navigation system outputs thrust in the reverse direction, and the adjustment time does not exceed a preset adjustment time; T3, the mechanical arm is synchronized with the auxiliary mechanism, namely a preset cutting length parameter is divided by the conveying speed parameter of the mechanical arm to obtain the grabbing interval duration of the cutting mechanism, and the cutting mechanism operates at this interval; meanwhile, the action frequency of the mechanical arm is multiplied by a preset frequency multiple to obtain the update frequency of the auxiliary detection mechanism, and the auxiliary detection mechanism operates at this update frequency; and T4, constructing a feedback parameter system, namely normalizing and weighting the cleaning-effect parameters to obtain a cleaning-effect reward, the equipment-safety parameters to obtain an equipment-safety reward, the energy-consumption optimization parameters to obtain an energy-consumption optimization reward, and the task-progress parameters to obtain a task-progress reward; the weights are dynamically adjusted by the entropy weight method, and the rewards are summed to obtain the total reward value.
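The arithmetic relations of steps T1 to T3 can be collected into small helpers. The claim gives the division and multiplication rules explicitly; the midpoint correction rule for the speed mismatch is an assumption, since the claim only says a "corrected speed parameter is calculated":

```python
def corrected_speed(arm_speed, bin_speed, max_diff):
    """T1: if the conveying/receiving speed gap exceeds the preset threshold,
    meet in the middle (assumed correction rule; the claim leaves it unspecified)."""
    if abs(arm_speed - bin_speed) > max_diff:
        return (arm_speed + bin_speed) / 2.0
    return arm_speed

def lift_height_threshold(inlet_height, margin):
    """T1: storage-unit inlet height plus the preset height margin."""
    return inlet_height + margin

def nav_speed_threshold(max_efficiency, arm_width, depth):
    """T2: v_max = efficiency / (arm working width * working depth)."""
    return max_efficiency / arm_width / depth

def cutter_interval(cut_length, arm_speed):
    """T3: grabbing interval = preset cutting length / arm conveying speed."""
    return cut_length / arm_speed

def detector_update_freq(arm_freq, multiple):
    """T3: detection update frequency = arm action frequency * preset multiple."""
    return arm_freq * multiple
```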
  7. The adaptive motion control method of a reinforcement-learning-based cleaning mechanical arm according to claim 6, wherein the entropy weight method dynamically adjusts the weights by the following specific process: when the remaining battery falls below a preset low-battery threshold, the energy-consumption optimization reward weight factor is adjusted by a preset energy-consumption weight boost proportion; when the density of the work objects reaches a preset high-density threshold, the cleaning-effect reward weight is adjusted by a preset cleaning weight boost proportion; and the weights are recalculated every preset weight recalculation period.
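The standard entropy-weight computation plus the situational overrides of claim 7 can be sketched as follows. The component ordering, threshold values, and boost proportions are illustrative assumptions:

```python
import numpy as np

def entropy_weights(sample_matrix):
    """Entropy weight method: rows = recent cycles, columns = reward components.
    Components with lower entropy (more variation, more information) get larger weight."""
    p = sample_matrix / sample_matrix.sum(axis=0, keepdims=True)
    n = sample_matrix.shape[0]
    # p*ln(p) with 0*ln(0) treated as 0
    ent = -np.sum(np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0),
                  axis=0) / np.log(n)
    d = 1.0 - ent
    return d / d.sum()

def adjusted_weights(weights, battery, obj_density,
                     low_batt=20.0, high_density=30.0,
                     energy_boost=1.5, clean_boost=1.3):
    """Claim 7 overrides: boost the energy weight at low battery and the
    cleaning weight at high object density, then renormalize.
    Assumed component order: [cleaning, safety, energy, progress]."""
    w = weights.copy()
    if battery < low_batt:
        w[2] *= energy_boost
    if obj_density >= high_density:
        w[0] *= clean_boost
    return w / w.sum()
```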
  8. The adaptive motion control method of a reinforcement-learning-based cleaning mechanical arm according to claim 1, wherein the anomaly response mechanism handles anomalies as follows: G1, adaptively adjusting the action parameters according to the size, weight, distribution density, and priority of the work object, the environmental dynamic parameters, the remaining capacity of the storage unit, the remaining battery of the power system, and the task completion rate; G2, the anomaly response mechanism judges against preset anomaly thresholds, wherein a jam occurs when the load of the driving unit reaches a preset load threshold and its duration exceeds a preset anomaly duration, a collision occurs when the distance detection parameter falls below a preset safety-distance threshold, and an overload occurs when the grabbing force reaches a preset force threshold; an anomaly index is obtained by normalized weighted calculation of the anomaly duration and the impact range, and the state grade is obtained from the anomaly index intervals corresponding to the light, medium, and severe grades in the database; the action parameters are adjusted by a preset reverse-adjustment proportion when the state grade is lightly anomalous, the mechanical arm is reset to its initial posture when it is moderately anomalous, and the action parameters are zeroed and an anomaly data packet is sent when it is severely anomalous; G3, screening samples whose reward values in the feedback parameters reach a preset high-reward threshold or fall below a preset low-reward threshold and marking them as high-value samples, screening samples whose cleaning efficiency improvement reaches a preset efficiency improvement threshold or whose anomaly occurrence reduction reaches a preset anomaly reduction threshold and marking them as effective samples, with the sample retention rate set to a preset value; and G4, parameter iteration, namely starting iteration when the newly added effective samples reach a preset sample increment threshold, when the cleaning efficiency declines over a preset number of consecutive decay cycles to a preset efficiency decay threshold, or when the scene parameter similarity falls below a preset scene similarity threshold; the model weight parameters are updated by a mini-batch gradient descent algorithm, the iteration effect is verified against a preset evaluation threshold system, and the parameters are saved when the thresholds are met.
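The graded response of step G2 can be sketched as a normalized weighted index mapped to three levels. The normalization caps, weights, and grade boundaries are illustrative placeholders for the patent's preset values:

```python
def anomaly_index(duration, impact, max_duration=10.0, max_impact=5.0,
                  w_duration=0.6, w_impact=0.4):
    """Claim 8 G2: normalize anomaly duration and impact range, then weight and sum.
    Caps and weights here are assumed, not from the patent."""
    d = min(duration / max_duration, 1.0)
    i = min(impact / max_impact, 1.0)
    return w_duration * d + w_impact * i

def respond(index, light=0.3, medium=0.6):
    """Three-level hierarchical response keyed on the anomaly index interval."""
    if index < light:
        return "reverse-adjust action parameters"
    if index < medium:
        return "reset arm to initial posture"
    return "zero action parameters and send anomaly packet"
```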
  9. The reinforcement-learning-based adaptive motion control method of a cleaning mechanical arm according to claim 1, wherein the iteration triggering conditions further comprise: when other preset triggering scenes occur, a mini-batch gradient descent algorithm is adopted to update the weight parameters, the loss function adopts the mean squared error formula, and the difference between the predicted reward value and the actual reward value is calculated as the mean of the squared differences to obtain the updated reward offset rate; the updated parameters are saved when the reward offset rate is smaller than a preset threshold, and otherwise the iteration parameters are adjusted or high-value samples are supplemented and the iteration is executed again.
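The reward-offset acceptance test of claim 9 is a plain mean-squared-error check; the threshold value below is an illustrative placeholder:

```python
import numpy as np

def reward_offset_rate(predicted, actual):
    """Claim 9: mean of the squared differences between predicted and actual reward values."""
    p, a = np.asarray(predicted, float), np.asarray(actual, float)
    return float(np.mean((p - a) ** 2))

def accept_iteration(predicted, actual, offset_threshold=0.05):
    """Save the updated parameters only when the reward offset rate stays
    below the preset threshold; otherwise re-iterate."""
    return reward_offset_rate(predicted, actual) < offset_threshold
```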
  10. A motion control system utilizing the reinforcement-learning-based adaptive motion control method of a cleaning mechanical arm as claimed in any one of claims 1 to 9, comprising the following modules: a data acquisition and analysis module, used for collecting multi-source associated data, analyzing the multi-source associated data, and generating a parameterized state vector; a decision generation and strategy matching module, used for completing scene classification through characteristic parameter analysis, calling a matched sub-strategy as a constraint, and performing inference on the state vector based on the reinforcement learning model to generate an action parameter instruction; a cooperative execution and feedback collection module, used for driving the mechanical arm and the associated systems to cooperatively execute operation actions according to the parameters, collecting feedback parameters after the actions are executed, and calculating a reward value; and a strategy optimization and task termination module, used for iteratively optimizing the model strategy according to the reward value and the feedback result, dynamically adjusting the action parameters synchronously based on characteristic parameter changes, and handling anomalies through the anomaly response mechanism.

Description

Self-adaptive motion control method and system for dirt cleaning mechanical arm based on reinforcement learning

Technical Field

The invention relates to the technical field of adaptive motion control of mechanical arms, in particular to a reinforcement-learning-based adaptive motion control method and system for a dirt cleaning mechanical arm.

Background

With increasing environmental protection demands, cleaning mechanical arms are used ever more widely in water-area cleaning, environmental treatment, and similar fields, so an adaptive motion control method and system based on reinforcement learning is needed. Existing motion control methods for cleaning mechanical arms depend on fixed-parameter programming or simple adaptive logic. The prior art has the following technical problems: 1. Data processing lacks system: multi-source data is not fused through scientific parameterization, so the state representation is insufficiently accurate and decision reliability suffers. 2. Scene adaptability is poor: no effective scene classification and sub-strategy matching mechanism is established, making it difficult to handle complex working conditions with multiple types of work objects and dynamic environmental changes. 3. The reinforcement learning model has a single training mode: offline training alone cannot adapt to scene differences in actual operation, and model generalization is insufficient. 4. The cooperative execution logic is imperfect: the linkage between the mechanical arm and the storage unit, navigation system, and auxiliary mechanisms lacks precise parameter matching, so problems such as dropped dirt and missed operations easily occur. 5. The exception handling mechanism is simplistic: most are emergency shutdowns triggered by a single threshold, without a hierarchical response strategy, so the safety risk is high and operation continuity suffers. 6. The reward calculation weights are fixed and cannot dynamically balance multiple objectives such as cleaning efficiency, equipment safety, and energy consumption optimization. These problems leave current cleaning mechanical arms with low operating efficiency, poor adaptability, and outstanding safety hazards, making it difficult to meet intelligent cleaning demands in complex scenes.

Disclosure of Invention

In view of these technical defects, the invention aims to provide a reinforcement-learning-based adaptive motion control method and system for a cleaning mechanical arm. To solve the technical problems, the invention adopts the following technical scheme: the adaptive motion control method of the reinforcement-learning-based dirt cleaning mechanical arm comprises the following steps. Step one, collecting multi-source associated data, analyzing the multi-source associated data, and generating a parameterized state vector. Step two, matching a decision generation strategy, namely completing scene classification through characteristic parameter analysis, calling a matched sub-strategy as a constraint, and performing inference on the state vector based on the reinforcement learning model to generate an action parameter instruction. Step three, cooperatively executing and collecting feedback, namely driving the mechanical arm and the associated systems to cooperatively execute operation actions according to the parameters, collecting feedback parameters after the actions are executed, and calculating a reward value.
Step four, optimizing the strategy and terminating the task, namely iteratively optimizing the model strategy according to the reward value and the feedback result, dynamically adjusting the action parameters synchronously based on characteristic parameter changes, and handling anomalies through the anomaly response mechanism. S1, setting scene classification parameter thresholds, wherein the scene is a single-type scene when the work-object type proportion reaches a preset type-proportion threshold, a simple scene when the obstacle density is below a preset obstacle density threshold, a stable scene when the dynamic change rate is below the preset change-rate threshold, and a conventional scene when the remaining device capacity is above a preset capacity threshold. S2, constructing a core-scene sub-strategy parameter library containing a preset number of scenes, wherein each type of scene is provided with a preset number of parameter items, the matching degree between the current state parameter vector and the sub-strategy parameter library is calculated by a cosine similarity algorithm, the corresponding sub-strateg