CN-116935025-B - Instruction grabbing method and system based on concept learning and prior knowledge

Abstract

The invention provides an instruction grabbing method and system based on concept learning and prior knowledge. The method comprises: training a target recognition network on known objects and providing prior knowledge of those objects, so that the mechanical arm can recognize some objects and grab them according to a language instruction; when the language instruction requires grabbing an object of a new class, having the mechanical arm request language knowledge supplied according to a rule; performing cross attention between the fused features and the original scene image so as to attend to the new-class object; if the attention does not land on the new-class object indicated by the language instruction, or lands on the wrong object, having the mechanical arm continue to ask for new knowledge to be injected, so that through repeated cycles the mechanical arm learns the features of the new-class object; and, when a later language instruction requires grabbing the new-class object, having the mechanical arm grab it directly.

Inventors

ZHOU FENGYU
ZHU ZHENWEI
LIU JIN
Huang Saike
YIN LEI
SUN ZHENGHUI
GAO HE
WANG ZHE

Assignees

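Shandong University (山东大学)
Shandong Xinchen Artificial Intelligence Technology Co., Ltd. (山东芯辰人工智能科技有限公司)
Shandong Zhengchen Technology Co., Ltd. (山东正晨科技股份有限公司)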

Dates

Publication Date: 20260505
Application Date: 20230619

Claims (10)

1. An instruction grabbing method based on concept learning and prior knowledge, characterized by comprising the following steps: training a target recognition network on known objects and providing prior knowledge for the known objects, so that the mechanical arm is able to recognize some objects and grab them according to a language instruction; when the language instruction requires grabbing an object of a new class and the mechanical arm cannot locate it, requesting that language knowledge be provided according to a rule; the rule being that, in the description given as external language knowledge, the new-class object is related to a known object; based on the injected language prior knowledge, the mechanical arm obtains image prior knowledge from a multimodal knowledge graph on the Internet, and the image prior knowledge is fused with the language prior; cross attention is performed between the fused features and the original scene image so as to attend to the new-class object; if the attention does not land on the new-class object indicated by the language instruction, or lands on the wrong object, the mechanical arm continues to ask for new knowledge to be injected, and through repeated cycles the mechanical arm learns the features of the new-class object; and when a subsequent language instruction requires the mechanical arm to grab the new-class object, the mechanical arm grabs it directly.
2. The instruction grabbing method based on concept learning and prior knowledge according to claim 1, wherein training the target recognition network on known objects comprises: building three-dimensional models of the known objects; and, each time, randomly selecting several known objects, placing them in a simulation environment at random positions and angles to form a scene, capturing RGB images of the scene, annotating the RGB images, and training the network on the annotated images, the mechanical arm detecting the positions of the known objects based on the trained network.
3. The instruction grabbing method based on concept learning and prior knowledge according to claim 1, wherein the network is trained on the annotated images, and the language used for training comprises language instructions and language priors; a language instruction directs the grabbing of a particular object and is composed of the object's category, its attributes, and the relations among object regions, the generated language instructions having different prefixes; the language priors are encoded with BERT, which facilitates similarity computation between the recognition of new-class objects and the injected human text prior knowledge; and the language instructions and the language priors are each trained for grounding on the proposals of the recognized targets.
4. The instruction grabbing method based on concept learning and prior knowledge according to claim 1, wherein, when the language instruction requires grabbing a new-class object that the mechanical arm cannot locate, the mechanical arm asks for language knowledge to be provided according to the rule; if No is selected, the mechanical arm grabs at random, and if Yes is selected, text knowledge is to be provided according to the rule.
5. The instruction grabbing method based on concept learning and prior knowledge according to claim 4, wherein the text knowledge is encoded with word2vec to obtain a semantic representation, and the semantic representation is input into a multimodal knowledge graph, or the Internet is searched, to obtain image knowledge of the concept of the new-class object.
6. The instruction grabbing method based on concept learning and prior knowledge according to claim 5, wherein the image knowledge is encoded with ResNet to obtain a visual representation; grounding the language instruction on the original scene image yields a proposals representation; three weight matrices, one for each of the three feature vectors, are obtained from the neural network and are used by the attention mechanism to interpret the information in the three input feature vectors; and each of the three feature vectors is multiplied by its weight matrix to obtain the corresponding input vector.
7. The instruction grabbing method based on concept learning and prior knowledge according to claim 6, wherein scores are calculated from the input vectors [the symbols of this expression are missing from the source text], and the proposal with the largest score is taken as the position of the new-class object; the mechanical arm moves above the object and asks whether it is the desired object; if the answer is Yes, GRCNN, a generative residual grasping network, is called to find an optimal grabbing point, and the concept of the new-class object is added to the language prior system to facilitate grabbing on the next instruction; if the answer is No, the concept has been learned incorrectly, further knowledge is requested, and the external text knowledge and external image knowledge are fused again for cross attention verification; and if the concept still cannot be learned after external knowledge has been provided three times, the feedback output is that the scene contains no object indicated by the language instruction.
8. An instruction grabbing system based on concept learning and prior knowledge, characterized by comprising: a known object grabbing module, configured to train a target recognition network on known objects and provide prior knowledge for the known objects, so that the mechanical arm is able to recognize some objects and grab them according to a language instruction; and a new-class object recognition module, configured so that, when the language instruction requires grabbing an object of a new class and the mechanical arm cannot locate it, the mechanical arm requests that language knowledge be provided according to a rule; the rule being that, in the description given as external language knowledge, the new-class object is related to a known object; based on the injected language prior knowledge, the mechanical arm obtains image prior knowledge from a multimodal knowledge graph on the Internet, and the image prior knowledge is fused with the language prior; cross attention is performed between the fused features and the original scene image so as to attend to the new-class object; if the attention does not land on the new-class object indicated by the language instruction, or lands on the wrong object, the mechanical arm continues to ask for new knowledge to be injected, and through repeated cycles the mechanical arm learns the features of the new-class object; and when a subsequent language instruction requires the mechanical arm to grab the new-class object, the mechanical arm grabs it directly.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any of the preceding claims 1-7 when the program is executed.
10. A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method of any of the preceding claims 1-7.
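
The query-and-confirm cycle set out in claims 4 and 7 can be illustrated with a minimal Python sketch. It is not the patented implementation: `ConceptStore`, `grasp_new_object`, and the three callbacks are hypothetical names introduced here, `locate` stands in for the knowledge fusion and cross attention steps, and treating a declined request at any round as a random grab is an assumption (claim 4 only describes the first request).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

MAX_KNOWLEDGE_ROUNDS = 3  # claim 7: stop after external knowledge was given three times


@dataclass
class ConceptStore:
    """Hypothetical stand-in for the language prior system holding learned concepts."""
    known: dict = field(default_factory=dict)  # concept name -> knowledge sentences

    def learn(self, concept: str, knowledge: List[str]) -> None:
        self.known[concept] = list(knowledge)


def grasp_new_object(
    concept: str,
    store: ConceptStore,
    ask_for_knowledge: Callable[[int], Optional[str]],
    locate: Callable[[List[str]], int],
    confirm: Callable[[int], bool],
) -> str:
    """Run the query loop and return a short description of the arm's action."""
    if concept in store.known:
        return "grasp directly"  # concept was learned on an earlier instruction

    knowledge: List[str] = []
    for round_index in range(MAX_KNOWLEDGE_ROUNDS):
        text = ask_for_knowledge(round_index)
        if text is None:
            # Assumption: the user answering "No" at any round leads to a random grab.
            return "grasp randomly"
        knowledge.append(text)
        # Placeholder for fusing text and image knowledge and running cross attention.
        proposal = locate(knowledge)
        if confirm(proposal):  # arm hovers over the object and asks "is this it?"
            store.learn(concept, knowledge)
            return f"grasp proposal {proposal}"
    return "no object matching the instruction in the scene"


# Toy run: the user confirms only the second guess.
store = ConceptStore()
answers = iter(["a pomelo looks like a large orange", "it is yellow-green"])
first = grasp_new_object(
    "pomelo", store,
    ask_for_knowledge=lambda i: next(answers),
    locate=lambda k: len(k),   # toy locator: proposal index equals knowledge count
    confirm=lambda p: p == 2,
)
second = grasp_new_object("pomelo", store, lambda i: None, lambda k: 0, lambda p: False)
```

In the toy run, the first call learns the concept on the second round of knowledge, and the second call takes the direct-grab path because the concept is already stored.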

Description

Instruction grabbing method and system based on concept learning and prior knowledge

Technical Field

The invention belongs to the technical field of machine learning, and in particular relates to an instruction grabbing method and system based on concept learning and prior knowledge.

Background

The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.

Robotic arm instruction grabbing is the task of grabbing a target object, of a known or unknown class, in response to a free-form language instruction. The task must resolve object names, latent attributes, and spatial relationships from the language description in order to locate the target object, and then generate an optimal grabbing pose to perform the grab; it therefore combines computer vision (CV), natural language processing (NLP), and robotics.

The task is stated as follows: given an instruction Q containing N words and an image I consisting of K objects, including known objects and unknown objects [symbols missing in source], and given the multimodal knowledge graph G, the instruction grabbing task grabs a specific target o from the image I according to a concept c (such as a description of an object) in the language instruction Q. In general, the instruction grabbing task can be described as follows:

[equation missing in source]

where [symbol missing in source] is the object to be grabbed and Γ_θ denotes a model function with learnable parameters θ.

The instruction grabbing process can generally be divided into two steps: visual grounding, and estimating the grabbing pose of the target object. Visual grounding methods fall into two kinds, one-stage and two-stage. The one-stage method fuses image and text features directly. Suppose an RGB-D image (I, D) and a natural language instruction W are given, where I is an RGB image and D is a depth image.
The RGB image I is typically passed through a feature extraction network such as ResNet or Darknet to obtain visual features, and the language instruction W is encoded by a text network such as an RNN or BERT to obtain text features. The visual and text features are then concatenated and fused directly, and the result is fed to a localization module to obtain a predicted box.

The two-stage method first generates proposals from the image, that is, regions where target objects may be, before the visual and text features are fused. The proposals are then scored using the fused features, and the proposal with the highest confidence is taken as the final location of the target object.

In the step of estimating the target grabbing pose, a mask of the target object is obtained with a 2D segmentation model from the proposal produced in the first step. The depth image D is cropped to the proposal's predicted box to obtain a point cloud containing the object and the background; a 3D segmentation model then yields a more accurate point cloud of the object to be grabbed, from which the pose of the target object is obtained.

The instruction grabbing process can also be built as an end-to-end unified model: an RGB-D image and a language instruction (query) are input, passed through preliminary feature extraction and encoding, and fed to point-modeling, which outputs a grabbing strategy and thereby realizes instruction grabbing.

Existing instruction grabbing methods still have many shortcomings, both in extracting the target object proposal and in obtaining the grabbing pose to execute the grab. First, the primary task of these methods is to predict the target object's bounding box, which means the class of the object to be grabbed must already appear in the training set.
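
The two-stage grounding step described above, in which proposals are scored against the language features and the most confident one is kept, can be sketched as follows. This is only an illustration under stated assumptions: the proposal and text feature vectors are taken as given and share one embedding dimension, and the fusion is a plain dot product swapped in for the learned fusion network of a real system. The names `softmax` and `ground` and the dictionary keys `box` and `feat` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into confidences that sum to 1 (numerically stable form)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ground(proposals: List[Dict], text_feat: Sequence[float]) -> Tuple[tuple, float]:
    """Score every proposal against the text feature and return the best box.

    Assumption: each proposal is a dict with a 'box' tuple and a 'feat' vector
    of the same length as text_feat. The dot product is a stand-in for fusion.
    """
    scores = [
        sum(v * t for v, t in zip(p["feat"], text_feat))
        for p in proposals
    ]
    confidences = softmax(scores)
    best = max(range(len(proposals)), key=confidences.__getitem__)
    return proposals[best]["box"], confidences[best]


# Toy scene with three candidate regions (x1, y1, x2, y2) and 2-D features.
scene = [
    {"box": (10, 10, 50, 50), "feat": [1.0, 0.0]},
    {"box": (60, 20, 90, 70), "feat": [0.0, 1.0]},
    {"box": (30, 80, 70, 120), "feat": [0.6, 0.6]},
]
best_box, best_conf = ground(scene, text_feat=[0.0, 1.0])
```

Here the second region is selected because its feature vector aligns most closely with the text feature; the returned confidence is the softmax share of that region.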
The existing pipeline is to collect data, annotate boxes, train, and predict, and it has two main problems:

(1) The object to be grabbed must be recognizable, that is, the features (concept) of the target object must already be known. Only objects on which the target detection network was trained can be grabbed; generalization is poor, the ability to learn new knowledge and evolve is lacking, and the approach does not match the way humans learn, so it is hard to deploy in practice.

(2) When the language instruction refers to an object that cannot be located in the image, the mechanical arm has no corresponding response, so it cannot genuinely understand human intent by drawing on human prior knowledge or knowledge from the Internet, and it lacks cognitive ability.

In the target grabbing pose estimation module, 6-DoF-based methods generally focus on mapping visual observations directly to actions, but such grabbing methods place strict requirements on training data in the networ