
CN-122008205-A - Multi-mode perception-based self-adaptive medicine grabbing system and method

CN122008205A

Abstract

The invention provides a multimodal-fusion medicine grasping system and method based on imitation learning. A visual perception and counting module collects RGB images of the medicine scene through cameras mounted at the robot end-effector or in the environment, and identifies and localizes the medicines in the scene with high precision. A strategy decision and routing module dynamically selects and activates one sub-module of a dual-channel execution module: an ACT fine-operation sub-module generates smooth joint action sequences, while a π0.5 generalization sub-module performs semantic understanding and long-horizon planning to generate action trajectories in flow-matching form. An end control execution module receives the action commands output by the dual-channel execution module and drives the robotic arm and gripper to complete the medicine pick-and-place task. The system fuses the semantic breadth of the large model with the manipulation depth of the specialist model, resolving the conflict between cold-start difficulty and high-frequency real-time control in medicine grasping, and is particularly suited to application scenarios such as automatic sorting in hospital pharmacies and quality-inspection grasping on pharmaceutical production lines.

Inventors

  • Liu Kai
  • Zhang Shuoqin
  • Gao Xiru
  • Liu Zelin
  • Shi Kechao
  • Hu Zhe

Assignees

  • Chongqing University

Dates

Publication Date
2026-05-12
Application Date
2026-02-02

Claims (10)

  1. A multimodal-fusion medicine grasping system based on imitation learning, characterized by comprising: a visual perception and counting module, which collects RGB images of the medicine scene through a camera mounted at the robot end-effector or in the environment, identifies and localizes the medicines in the scene with high precision by combining SAHI slicing-aided inference with the YOLO11 object detection algorithm, and outputs the total number N of medicines in the scene together with the bounding-box information of each medicine; a strategy decision and routing module, connected to the visual perception and counting module, which receives the total number N, compares it with a preset scene-complexity threshold τ, and dynamically selects and activates one sub-module of the dual-channel execution module according to the comparison result; a dual-channel execution module comprising, arranged in parallel, an ACT fine-operation sub-module, which generates smooth joint action sequences, and a π0.5 generalization sub-module, which performs semantic understanding and long-horizon planning to generate action trajectories in flow-matching form; and an end control execution module, which receives the action commands output by the dual-channel execution module and drives the robotic arm and gripper to complete the medicine pick-and-place task.
  2. The imitation-learning-based multimodal-fusion medicine grasping system of claim 1, wherein the visual perception and counting module uses a detection strategy combining YOLO11 with SAHI, the steps comprising: dividing the original high-resolution image I into a plurality of overlapping slices, each slice having a size of s × s; inputting each slice independently into the YOLO11 detection network for inference, the network using C3k2 modules as the feature-extraction backbone and an SPPF module to enhance multi-scale feature fusion; mapping the detection results of all slices back to the original image coordinate system, removing redundant boxes with the non-maximum suppression (NMS) algorithm, and computing the final detection-box set B; if the intersection-over-union (IoU) of two predicted boxes exceeds a preset threshold, retaining the box with the higher confidence; the total number of medicines N finally output is the number of elements in the set B.
  3. The imitation-learning-based multimodal-fusion medicine grasping system of claim 1, wherein the coordinate mapping and fusion process maps a detection-box coordinate (x, y) in a sub-image back to the original coordinate system as (x + Δx_i, y + Δy_i), where (Δx_i, Δy_i) is the upper-left corner offset of the i-th slice.
  4. The imitation-learning-based multimodal-fusion medicine grasping system of claim 1, wherein the decision logic of the strategy decision and routing module is: setting a scene-complexity threshold τ, the threshold being determined by the crowdedness of the robot workspace and the computational cost of a single planning step; when N ≤ τ, the scene is judged a "low-density fine-operation scene" and the system routes to the ACT fine-operation sub-module, exploiting its high precision and low latency under few-shot expert demonstration; when N > τ, the scene is judged a "high-density complex-semantic scene" and the system routes to the π0.5 generalization sub-module, exploiting the semantic understanding and generalized grasping capability acquired from Internet-scale pre-training.
  5. The imitation-learning-based multimodal-fusion medicine grasping system of claim 1, wherein the ACT fine-operation sub-module, when activated, generates smooth joint action sequences for few-shot, high-precision grasping tasks based on a conditional variational autoencoder (CVAE) and an action-chunking mechanism.
  6. The imitation-learning-based multimodal-fusion medicine grasping system of claim 5, wherein the ACT fine-operation sub-module uses a Transformer-based CVAE architecture whose action-generation process comprises: a training phase, optimizing an objective comprising reconstruction loss and KL-divergence loss: L = ||a_{t:t+k} − â_{t:t+k}|| + β · D_KL(q_φ(z | a_{t:t+k}, o_t) ‖ N(0, I)), where a_{t:t+k} is the action sequence for the future k steps, o_t is the current observation, z is the latent variable, q_φ is the encoder, N(0, I) is the Gaussian prior, and β is a weight coefficient; an inference phase, setting z = 0 (the prior mean), in which the decoder predicts the future k-step action chunk in a single pass conditioned on the current observation o_t; and a temporal ensembling policy that smooths actions by weighted averaging of the action predictions for overlapping time steps: â_t = (Σ_i w_i · a_t^{(i)}) / (Σ_i w_i), with w_i = exp(−m · i), where m is the smoothing factor and i is the time offset of the action chunk, ensuring a high-frequency control response of 50 Hz or above.
  7. The imitation-learning-based multimodal-fusion medicine grasping system of claim 6, wherein ACT models expert actions using a conditional variational autoencoder (CVAE) whose training loss comprises reconstruction loss and KL-divergence loss: L = ||a_{t:t+k} − â_{t:t+k}|| + β · D_KL(q_φ(z | a_{t:t+k}, o_t) ‖ N(0, I)), where a_{t:t+k} is the action chunk for the future k steps, o_t is the current observation, and z is the style variable.
  8. The imitation-learning-based multimodal-fusion medicine grasping system of claim 1, wherein the π0.5 generalization sub-module, when activated, performs semantic understanding and long-horizon planning over complex-background and multi-target stacked scenes based on a vision-language-action multimodal large-model framework, and generates action trajectories in flow-matching form.
  9. The imitation-learning-based multimodal-fusion medicine grasping system of claim 8, wherein the π0.5 generalization sub-module uses a vision-language-action heterogeneous architecture comprising: a SigLIP visual encoder that extracts scene features and fuses them with natural-language instructions (e.g., "grasp the amoxicillin medicine box") via cross-attention; a flow-matching action decoder that generates continuous action trajectories by iterative denoising from Gaussian noise, with the optimization objective of minimizing the vector-field regression loss: L = E_{t, x_0 ∼ p_0, x_1 ∼ p_1} || v_θ(x_t, t) − (x_1 − x_0) ||², where p_0 is the noise distribution, p_1 is the target-action distribution, and v_θ is the velocity field predicted by the model; the module outputs semantically aligned high-level planned actions for handling complex scenes with multi-object occlusion and unstructured placement.
  10. A multimodal-fusion medicine grasping method based on imitation learning, characterized by comprising the following steps: S1, acquiring image data of the medicine sorting area through visual perception equipment; S2, inputting the images into a YOLO11 network with SAHI slicing-aided inference to detect small-target medicines and output their categories, positions, and total number N; S3, comparing the detected total number N with a preset threshold τ; S4, if N ≤ τ, activating the ACT algorithm branch: reading ACT model weights pre-trained on a small amount of expert demonstration data, predicting the joint-angle sequence for a future period through the action-chunking mechanism conditioned on the current observation, and applying temporal-ensembling smoothing; S5, if N > τ, activating the π0.5 algorithm branch: loading the pre-trained VLA large model, inputting the visual image and the corresponding language instruction, and generating a grasping trajectory adapted to the complex environment through the flow-matching decoder; and S6, issuing the action commands generated in step S4 or S5 to the robot controller, performing medicine pick-and-place, and monitoring the execution status in real time until the task is completed.
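The temporal-ensembling rule of claim 6 can be sketched as follows. This is an illustrative reconstruction assuming exponential weights w_i = exp(−m·i) over the overlapping chunk predictions, with i = 0 taken as the newest chunk; it is not the patented implementation itself.

```python
import math

def temporal_ensemble(chunk_predictions, m=0.1):
    """Weighted average of the action predictions that overlapping
    chunks made for the same time step: w_i = exp(-m * i), where i is
    the chunk's time offset (0 = newest, by assumption here) and m is
    the smoothing factor."""
    weights = [math.exp(-m * i) for i in range(len(chunk_predictions))]
    total = sum(weights)
    dim = len(chunk_predictions[0])
    return [
        sum(w * pred[d] for w, pred in zip(weights, chunk_predictions)) / total
        for d in range(dim)
    ]

# Three overlapping chunks predicted 1.0, 1.2, 1.4 for one joint; the
# exponential weights pull the smoothed value toward the i = 0 chunk.
smoothed = temporal_ensemble([[1.0], [1.2], [1.4]], m=0.5)
```

Because the averaging is a fixed weighted sum, it adds negligible latency per control tick, which is consistent with the claimed 50 Hz-and-above control rate.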
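The vector-field regression objective of claim 9 can be estimated by Monte Carlo sampling. The sketch below assumes the common straight-line (linear-interpolation) probability path, for which the conditional target velocity is x_1 − x_0; `predict_velocity` is a hypothetical stand-in for the model's velocity head, not the π0.5 decoder.

```python
import random

def flow_matching_loss(predict_velocity, noise_samples, action_samples):
    """Monte-Carlo estimate of the vector-field regression objective
    E || v_theta(x_t, t) - (x1 - x0) ||^2 on straight-line paths
    x_t = (1 - t) * x0 + t * x1 (a common conditional flow-matching
    choice; scalar samples for simplicity)."""
    total = 0.0
    for x0, x1 in zip(noise_samples, action_samples):
        t = random.random()
        xt = (1 - t) * x0 + t * x1     # point on the probability path
        target = x1 - x0               # conditional target velocity
        total += (predict_velocity(xt, t) - target) ** 2
    return total / len(noise_samples)

# For scalar "actions" constructed as x1 = x0 + 2.0, the true velocity
# field is the constant 2.0, so a model predicting it gets near-zero loss.
x0s = [random.gauss(0.0, 1.0) for _ in range(100)]
x1s = [x + 2.0 for x in x0s]
loss = flow_matching_loss(lambda xt, t: 2.0, x0s, x1s)
```

At inference, the learned velocity field is integrated from a Gaussian noise sample toward the action distribution, which is the iterative-denoising trajectory generation the claim describes.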
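The routing logic of steps S3 to S5 reduces to a single threshold comparison. The sketch below is illustrative only; `detect_and_count`, `run_act`, and `run_pi05` are hypothetical stand-ins for the detection and execution modules described above.

```python
def grasp_once(image, instruction, threshold, detect_and_count, run_act, run_pi05):
    """One grasp cycle of the method: detect and count (S2), compare the
    count N with the threshold tau (S3), then route to the ACT branch
    (S4, N <= tau) or the pi0.5 branch (S5, N > tau)."""
    boxes = detect_and_count(image)                 # S2: SAHI + YOLO11 stand-in
    n = len(boxes)                                  # total medicine count N
    if n <= threshold:
        return "ACT", run_act(image, boxes)         # S4: few-shot fine operation
    return "pi0.5", run_pi05(image, instruction)    # S5: VLA + flow matching

# Toy stand-ins: two detected boxes with tau = 5 routes to the ACT branch.
branch, _ = grasp_once(
    image=None,
    instruction="grasp the amoxicillin box",
    threshold=5,
    detect_and_count=lambda img: [(0, 0, 10, 10), (20, 20, 30, 30)],
    run_act=lambda img, b: "joint-sequence",
    run_pi05=lambda img, ins: "trajectory",
)
```

Keeping the router this thin is what lets the two execution channels stay independent: the ACT weights and the VLA model never need to know about each other.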

Description

Multi-mode perception-based self-adaptive medicine grabbing system and method

Technical Field

The invention relates to the technical field of intelligent robots and industrial automation, and in particular to a multimodal-perception-based adaptive medicine grasping system and method.

Background

With the rapid development of intelligent healthcare and automated logistics, automatic sorting and grasping of medicines has become a key link in improving the efficiency of medical services. However, medicine grasping scenes exhibit high complexity and uncertainty: medicine packages vary greatly in size (from tiny ampoules to large medicine boxes), are placed in disorder, and suffer severe occlusion. Conventional automated grasping schemes typically rely on classical computer-vision algorithms combined with fixed motion planning, which struggle in unstructured environments. In recent years, deep-learning-based grasp detection algorithms (e.g., GraspNet) have improved the grasping success rate, but detection often degrades on very small targets (e.g., scattered tablets or small penicillin bottles) because of resolution limits. End-to-end imitation learning is becoming the dominant paradigm for control policies. Among these, vision-language-action (VLA) large models (e.g., the π0.5 series) exhibit powerful semantic understanding and generalization, handling objects never seen before. However, large models have two notable drawbacks: first, cold start is difficult, since massive data must be collected to fine-tune them for a specific vertical domain (such as pharmaceutical handling), and data acquisition is extremely costly; second, inference latency is high, making it difficult to meet the high-frequency closed-loop control requirement of industrial robots (50 Hz and above), which causes sluggish or jittery grasping.
On the other hand, the ACT (Action Chunking with Transformers) algorithm excels at few-shot learning and high-frequency control, but its generalization lags behind VLA large models in extremely complex scenes with many objects and rich semantics. A fused grasping algorithm is therefore needed that combines the advantages of small-target detection and dynamically switches between high generalization and high precision according to scene complexity, so as to solve the current problems of difficult recognition, data scarcity, and slow control in automatic medicine grasping.

Disclosure of Invention

Based on this, the invention provides a multimodal-perception-based adaptive medicine grasping system and method that fuses the semantic breadth of the large model with the manipulation depth of the specialist model, resolving the conflict between cold-start difficulty and high-frequency real-time control in medicine grasping. To this end, the invention provides an imitation-learning-based multimodal-fusion medicine grasping system comprising a visual perception and counting module, a strategy decision and routing module, a dual-channel specialist execution module, and an end control execution module.
The visual perception and counting module collects RGB images of the medicine scene through cameras mounted at the robot end-effector or in the environment and, combining SAHI slicing-aided inference with the YOLO11 object detection algorithm, accurately identifies and localizes the medicines in the scene and outputs their total number N. The strategy decision and routing module, connected to the visual perception and counting module, receives the total number N, compares it with a preset scene-complexity threshold τ, and dynamically selects and activates one sub-module of the dual-channel execution module according to the comparison result. The dual-channel execution module comprises, arranged in parallel, an ACT fine-operation sub-module, which generates smooth joint action sequences, and a π0.5 generalization sub-module, which performs semantic understanding and long-horizon planning to generate action trajectories in flow-matching form. The end control execution module receives the action commands output by the dual-channel execution module and drives the robotic arm and gripper to complete the medicine pick-and-place task. SAHI (Slicing Aided Hyper Inference) is an enhanced inference framework designed specifically to address the problem of missed detections (false negatives) of small objects.
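The slicing-aided inference flow described above (slice, detect per slice, map boxes back by each slice's upper-left offset, then merge with NMS) can be sketched as follows. This is a minimal reconstruction: `detector` and `fake_detector` are hypothetical stand-ins, not YOLO11 or the SAHI library itself.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2, ...) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def sliced_inference(image_w, image_h, slice_size, overlap, detector, iou_thresh=0.5):
    """Detect on overlapping slices, map each slice-local box back to the
    original image by adding the slice's upper-left offset (dx_i, dy_i),
    then merge duplicates with greedy non-maximum suppression (NMS)."""
    step = int(slice_size * (1 - overlap))
    boxes = []
    for y0 in range(0, max(image_h - slice_size, 0) + 1, step):
        for x0 in range(0, max(image_w - slice_size, 0) + 1, step):
            for (x1, y1, x2, y2, score) in detector(x0, y0, slice_size):
                boxes.append((x1 + x0, y1 + y0, x2 + x0, y2 + y0, score))
    boxes.sort(key=lambda b: b[4], reverse=True)   # highest confidence first
    kept = []
    for b in boxes:
        if all(iou(b, k) <= iou_thresh for k in kept):
            kept.append(b)
    return kept

# One object at (60, 10)-(80, 30) seen by two overlapping slices is
# mapped back and deduplicated to a single detection; N = len(kept).
def fake_detector(x0, y0, size):
    gx1, gy1, gx2, gy2 = 60, 10, 80, 30
    if gx1 >= x0 and gx2 <= x0 + size and gy1 >= y0 and gy2 <= y0 + size:
        return [(gx1 - x0, gy1 - y0, gx2 - x0, gy2 - y0, 0.9)]
    return []

kept = sliced_inference(150, 100, 100, 0.5, fake_detector)
```

The medicine count N fed to the routing module is simply the size of the merged set, which is why duplicate suppression across overlapping slices matters: without NMS, one medicine straddling two slices would be counted twice.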