CN-122021940-A - Affordance generalization reasoning method, system, computer device, and medium based on a reinforcement learning framework

Abstract

The invention relates to the technical field of intelligent robot control, and in particular to an affordance generalization reasoning method, system, computer device, and medium based on a reinforcement learning framework. The method comprises: acquiring and preprocessing multi-modal environment data collected by a perception system; introducing affordance prior knowledge, based on the preprocessed multi-modal environment data, to construct an affordance reasoning task input representation; performing affordance reasoning with a large language model to generate an initial affordance region prediction; and, based on the reinforcement learning framework, performing iterative optimization and generalization enhancement on the initial affordance region prediction using a chain-of-thought reasoning mechanism to output target affordance region information. The method addresses the technical problem in the prior art that out-of-domain generalization capability remains insufficient even with the support of a multi-modal large language model, and improves the reliability, interpretability, and generalization of the robot's reasoning about object affordance regions.

Inventors

XIONG HUI
WANG HANQING

Assignees

The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州))

Dates

Publication Date: 20260512
Application Date: 20260408

Claims (10)

1. An affordance generalization reasoning method based on a reinforcement learning framework, characterized in that the method is applied to a robot configured with a perception system, the perception system being integrated with a binocular stereo vision camera supporting high-precision depth perception, the method comprising: acquiring multi-modal environment data collected by the perception system and preprocessing the multi-modal environment data, wherein the multi-modal environment data comprises visual image data captured by the binocular stereo vision camera and task description text; introducing affordance prior knowledge, based on the preprocessed visual image data and task description text, to construct an affordance reasoning task input representation; inputting the affordance reasoning task input representation into a preset large language model and performing affordance reasoning through the large language model to generate an initial affordance region prediction; and, based on the reinforcement learning framework, performing iterative optimization and generalization enhancement on the initial affordance region prediction using a chain-of-thought reasoning mechanism, so as to output target affordance region information.
2. The method of claim 1, wherein the perception system is configured with an image acquisition device, and wherein acquiring the multi-modal environment data collected by the perception system and preprocessing the multi-modal environment data comprises: performing instance segmentation on the visual image data acquired by the image acquisition device and extracting a mask region of at least one target object; performing semantic parsing on the task description text and extracting action-intent semantic features corresponding to the task instruction; and performing cross-modal association between the mask region and the action-intent semantic features, aligning visual features and semantic features in a vector space through a cross-modal attention mechanism, so as to construct the affordance reasoning task input representation.
3. The method of claim 1, wherein the affordance prior knowledge refers to intrinsic logical knowledge, pre-stored in the large language model, about the matching relationship between object functional attributes and action intents, and wherein introducing affordance prior knowledge to construct an affordance reasoning task input representation based on the preprocessed visual image data and task description text comprises: constructing a chain-of-thought prompt template based on the affordance prior knowledge, the chain-of-thought prompt template being used to guide the large language model in affordance logical deduction; and integrating the visual image data, the task description text, and the chain-of-thought prompt template to generate the affordance reasoning task input representation.
4. The method of claim 3, wherein the large language model is configured with an affordance-specific reasoning module, and wherein performing affordance reasoning through the large language model to generate an initial affordance region prediction comprises: performing hierarchical reasoning on the affordance reasoning task input representation through the affordance-specific reasoning module; after the action-adapted region recommendation stage is completed, triggering a reflection verification sub-process to check the logical consistency between the recommended region and the affordance functional attribute; and generating, from the outputs of the hierarchical reasoning and the reflection verification sub-process, the initial affordance region prediction comprising region coordinate information and an affordance label; wherein the hierarchical reasoning comprises an object functional attribute identification stage and an action-adapted region recommendation stage; in the object functional attribute identification stage, the structural features of the target object are analyzed and associated with the affordance functional attributes that match the action intent in the task description text; and in the action-adapted region recommendation stage, a physical region in which the action intent can be executed is located and recommended in the visual image data based on the identified affordance functional attributes.
5. The method of claim 1, wherein the reinforcement learning framework is configured with a hybrid reward function, wherein the chain-of-thought reasoning mechanism is configured to generate a structured reasoning chain comprising the outputs of a thinking phase, a reflection phase, and an answer phase, and wherein performing iterative optimization and generalization enhancement on the initial affordance region prediction using the chain-of-thought reasoning mechanism based on the reinforcement learning framework to output target affordance region information comprises: generating a plurality of candidate answers through the reinforcement learning framework, each candidate answer including thinking content, reflection content, and answer content generated by the chain-of-thought reasoning mechanism; performing a group relative policy optimization calculation on the plurality of candidate answers, and evaluating and ranking the reasoning quality of each candidate answer based on the hybrid reward function; and selecting, according to the evaluation and ranking results, the candidate answer with the highest composite reward score as the target affordance region information.
6. The method of claim 5, wherein the reinforcement learning framework is further configured with a group relative policy optimization algorithm, and wherein performing the group relative policy optimization calculation on the plurality of candidate answers comprises: calculating a policy probability loss term for each candidate answer; calculating a relative entropy (KL divergence) penalty term on the model output distribution for each candidate answer; determining a group-relative advantage score for each candidate answer within the group based on the reward value computed by the hybrid reward function; constructing a group relative policy optimization total loss function from the policy probability loss term, the relative entropy penalty term, and the group-relative advantage score; and updating model parameters of the large language model by minimizing the group relative policy optimization total loss function.
7. The method of claim 6, wherein the hybrid reward function includes a format reward, a perception reward, and a cognition reward, and wherein evaluating and ranking the reasoning quality of each candidate answer based on the hybrid reward function comprises: calculating a format reward score for the candidate answer, the format reward score being determined by the completeness and correct pairing of the structured tags in the candidate answer; calculating a perception reward score for the candidate answer, the perception reward score being determined by the spatial intersection over union (IoU) between the affordance region coordinate information in the candidate answer and a preset reference region; calculating a cognition reward score for the candidate answer, the cognition reward score being determined by the cosine similarity between the word vectors of the affordance label in the candidate answer and a preset reference label; calculating a composite reward score for the candidate answer as the weighted sum of the format reward score, the perception reward score, and the cognition reward score; and ranking the plurality of candidate answers by the composite reward score.
8. An affordance generalization reasoning system based on a reinforcement learning framework, characterized in that the system is applied to a robot configured with a perception system, the perception system being integrated with a binocular stereo vision camera supporting high-precision depth perception, the system comprising: a multi-modal environment data acquisition module configured to acquire multi-modal environment data collected by the robot's perception system and to preprocess the multi-modal environment data, the multi-modal environment data including visual image data captured by the binocular stereo vision camera and task description text; an affordance input representation construction module configured to introduce affordance prior knowledge, based on the preprocessed visual image data and task description text, to construct an affordance reasoning task input representation; an initial affordance region prediction module configured to input the affordance reasoning task input representation into a preset large language model and perform affordance reasoning through the large language model to generate an initial affordance region prediction; and an affordance reasoning result output module configured to perform iterative optimization and generalization enhancement on the initial affordance region prediction using a chain-of-thought reasoning mechanism based on the reinforcement learning framework, so as to output target affordance region information.
9. A computer device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A non-transitory computer readable storage medium storing computer instructions, wherein the computer instructions are for causing a computer to perform the method of any one of claims 1-7.
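
As an illustration of the scoring described in claims 5 to 7, the following is a minimal Python sketch, not the patented implementation. It assumes hypothetical tag names (think, rethink, answer) for the structured tags, axis-aligned regions given as (x1, y1, x2, y2), precomputed word vectors for the labels, equal default weights, and the mean and standard deviation normalization commonly used for group-relative advantages. None of these specifics are stated in the claims.

```python
import math
import re
from statistics import mean, pstdev

# Hypothetical tag names for the thinking, reflection, and answer phases.
TAGS = ("think", "rethink", "answer")
_ORDER = re.compile(
    r"\s*<think>.*</think>\s*<rethink>.*</rethink>\s*<answer>.*</answer>\s*",
    re.DOTALL,
)


def format_reward(text):
    """Return 1.0 if each tag is opened and closed exactly once, in order."""
    for tag in TAGS:
        if text.count(f"<{tag}>") != 1 or text.count(f"</{tag}>") != 1:
            return 0.0
    return 1.0 if _ORDER.fullmatch(text) else 0.0


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def composite_reward(cand, ref_box, ref_vec, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of format, perception (IoU), and cognition (cosine) scores.

    The weights are assumed values; the claims only state a weighted sum.
    """
    w_fmt, w_per, w_cog = weights
    return (
        w_fmt * format_reward(cand["text"])
        + w_per * box_iou(cand["box"], ref_box)
        + w_cog * cosine_similarity(cand["label_vec"], ref_vec)
    )


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group (assumed mean/std normalization)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def select_best(rewards):
    """Index of the candidate with the highest composite reward."""
    return max(range(len(rewards)), key=rewards.__getitem__)
```

The policy probability loss term and the relative entropy penalty of claim 6 are left out of this sketch because they depend on the log-probabilities of the model being trained, which the claims do not specify.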

Description

Affordance generalization reasoning method, system, computer device, and medium based on a reinforcement learning framework

Technical Field

The invention relates to the technical field of intelligent robot control, and in particular to an affordance generalization reasoning method, system, computer device, and medium based on a reinforcement learning framework.

Background

In the field of autonomous robot operation, affordance reasoning is a key technology linking a robot's perception capability with its physical manipulation behavior. Its core goal is to enable the robot to autonomously identify, from perceptual information, the potential physical regions of an object or environment in which a given type of action can be performed. With the wide application of robots in complex scenarios such as industrial automation and home service, the reliability and generalization capability of affordance reasoning have become an important foundation for intelligent robot operation.

Affordance reasoning has developed through three successive approaches: manual definition based on geometric rules, pixel-level recognition based on deep learning, and, in recent years, reasoning enhancement that incorporates multi-modal large language models. The invention patent with application publication number CN 119693768A discloses a multi-modal large language model attribute prediction method based on a multi-modal chain of thought, which constructs a mask generator and a scene graph parser and combines hierarchical chain-of-thought reasoning with a logic verification mechanism to improve contextual understanding and logical consistency in attribute prediction tasks. That technical scheme is advanced in attribute identification and relationship inference, but it has obvious limitations when applied directly to robot affordance reasoning tasks.
Specifically, existing methods focus on static matching between attributes and object categories and lack an explicit mechanism for reasoning about the underlying logic of affordance, so they adapt poorly to affordance regions that vary across objects and to cross-category generalization requirements. In addition, when faced with challenges in open environments such as diverse object shapes and complex scene interference, their reasoning process lacks interpretability and their adaptability to unseen object categories or interaction scenarios is limited, so the accuracy and reliability of affordance reasoning in real robot operation fall short of practical application requirements. Therefore, although existing affordance reasoning techniques have some semantic guidance capability with the support of a multi-modal large language model, they still suffer from insufficient out-of-domain generalization capability, which restricts further improvement of the robot's autonomous operation capability in unstructured environments.

Disclosure of Invention

In view of the above defects or deficiencies, the invention provides an affordance generalization reasoning method, system, computer device, and medium based on a reinforcement learning framework, which can solve the technical problem in the prior art that out-of-domain generalization capability remains insufficient even with the support of a multi-modal large language model.
The invention provides an affordance generalization reasoning method based on a reinforcement learning framework. The method is applied to a robot provided with a perception system, the perception system being integrated with a binocular stereo vision camera supporting high-precision depth perception, and comprises the following steps. Multi-modal environment data collected by the perception system is acquired and preprocessed, wherein the multi-modal environment data comprises visual image data captured by the binocular stereo vision camera and task description text. Based on the preprocessed visual image data and task description text, affordance prior knowledge is introduced to construct an affordance reasoning task input representation. The affordance reasoning task input representation is input into a preset large language model, and affordance reasoning is performed through the large language model to generate an initial affordance region prediction. Based on the reinforcement learning framework, iterative optimization and generalization enhancement are performed on the initial affordance region prediction using a chain-of-thought reasoning mechanism so as to output target av