CN-121973187-A - Collaborative decision-making method of search and rescue robot based on language guidance
Abstract
The application discloses a collaborative decision-making method for a search and rescue robot based on language guidance, relating to the technical field of robots, and comprising the following steps: acquiring a real-time video, an allowed action set and a search and rescue task prompt word; inputting the real-time video into a video text coding model to obtain a video description text; and inputting the video description text, the allowed action set and the search and rescue task prompt word into a natural language guidance model to obtain an action sequence decision result, wherein the natural language guidance model is obtained by training a basic model based on a video text sample and an action sequence sample. According to the application, the real-time video is converted into a corresponding text by the video text coding model, and the action sequence required by the current task is determined from the video text, the prompt word and the allowed action set by the natural language guidance model, so that the search and rescue robot is not assigned to tasks that the current rescue environment does not require, and the rescue effect can be improved.
Inventors
- WANG XUEQIAN
- TAN JUNBO
- CHEN YIFEI
- GUO GUANQIU
- YAN CHUAN
- HUI HANG
Assignees
- Zhejiang Hualian Intelligent Technology Co., Ltd. (浙江华莲智能科技有限公司)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-01-06
Claims (10)
- 1. A collaborative decision-making method of a search and rescue robot based on language guidance, characterized by comprising the following steps: acquiring a real-time video collected by a search and rescue robot, an allowed action set of the search and rescue robot and a search and rescue task prompt word; inputting the real-time video into a preset video text coding model to obtain a video description text corresponding to the real-time video, wherein the video text coding model is obtained by training a preset model to be trained based on a preset search and rescue image sample and an image text sample; and inputting the video description text, the allowed action set and the search and rescue task prompt word into a preset natural language guidance model to obtain an action sequence decision result of the search and rescue robot, wherein the natural language guidance model is obtained by training a preset basic model based on a preset video text sample and an action sequence sample corresponding to the video text sample.
- 2. The method of claim 1, wherein before the step of inputting the video description text, the allowed action set and the search and rescue task prompt word into a preset natural language guidance model to obtain an action sequence decision result of the search and rescue robot, the method further comprises: acquiring the basic model, the video text sample, the action sequence sample and the allowed action set, wherein the basic model is a pre-trained large language model; inputting the video text sample into the basic model, determining a plurality of target actions from the allowed action set based on the basic model, and forming a target action sequence; and training the basic model through a preset reinforcement learning training paradigm based on the target action sequence and the action sequence sample to obtain the natural language guidance model.
- 3. The method of claim 2, wherein the step of training the basic model through a preset reinforcement learning training paradigm based on the target action sequence and the action sequence sample to obtain the natural language guidance model comprises: calculating a sequence similarity between the target action sequence and the action sequence sample; if the sequence similarity is higher than a preset similarity threshold, determining a reward value of the target action sequence based on the sequence similarity; if the sequence similarity is lower than the similarity threshold, determining a penalty value of the target action sequence based on the sequence similarity; and adjusting parameters of the basic model based on the reward value or the penalty value to obtain the natural language guidance model.
- 4. The method of claim 1, wherein before the step of inputting the real-time video into a preset video text coding model to obtain the video description text corresponding to the real-time video, the method further comprises: acquiring a search and rescue image sample, an image text sample corresponding to the search and rescue image sample and the model to be trained, wherein the model to be trained is obtained through pre-training based on a preset visual data set; encoding the search and rescue image sample and the image text sample to obtain a search and rescue image code corresponding to the search and rescue image sample and an image text code corresponding to the image text sample; selecting a target search and rescue image code from the search and rescue image codes, selecting a target image text code from the image text codes, and calculating a loss value based on the target search and rescue image code, the target image text code and a preset average loss function; and updating parameters of the model to be trained based on the loss value to obtain the video text coding model.
- 5. The method of claim 4, wherein the step of calculating the loss value based on the target search and rescue image code and the target image text code comprises: performing a transposition operation on the target image text code, and performing a logarithmic operation on the transposed target image text code to obtain a logarithmic image text code; calculating the product of the target search and rescue image code and the logarithmic image text code to obtain a first loss value; performing a transposition operation on the target search and rescue image code, and performing a logarithmic operation on the transposed target search and rescue image code to obtain a logarithmic search and rescue image code; calculating the product of the target image text code and the logarithmic search and rescue image code to obtain a second loss value; and calculating the average value of the first loss value and the second loss value to obtain the loss value.
- 6. The method of claim 1, wherein the search and rescue task prompt word includes a stage task prompt word, the stage task prompt word includes a stage task description and a task object description, and the step of inputting the video description text, the allowed action set and the search and rescue task prompt word into a preset natural language guidance model to obtain an action sequence decision result of the search and rescue robot comprises: inputting the video description text, the allowed action set and the search and rescue task prompt word into the preset natural language guidance model, and determining execution stages of a search and rescue task and a task object of each execution stage through a preset chain of thought based on the stage task description and the task object description; determining an execution action corresponding to each execution stage and an execution sequence of the execution actions from the allowed action set based on the execution stages and the task objects; and combining the execution actions based on the execution sequence to obtain the action sequence decision result.
- 7. A collaborative decision-making device of a search and rescue robot based on language guidance, characterized in that the device comprises: a data acquisition module, configured to acquire a real-time video collected by a search and rescue robot, an allowed action set of the search and rescue robot and a search and rescue task prompt word; a text generation module, configured to input the real-time video into a preset video text coding model to obtain a video description text corresponding to the real-time video, wherein the video text coding model is obtained by training a preset model to be trained based on a preset search and rescue image sample and an image text sample; and a decision generation module, configured to input the video description text, the allowed action set and the search and rescue task prompt word into a preset natural language guidance model to obtain an action sequence decision result of the search and rescue robot, wherein the natural language guidance model is obtained by training a preset basic model based on a preset video text sample and an action sequence sample corresponding to the video text sample.
- 8. A language-guided search and rescue robot collaborative decision-making apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program being configured to implement the steps of the language-guided search and rescue robot collaborative decision-making method according to any one of claims 1 to 6.
- 9. A storage medium, characterized in that the storage medium is a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the language-guided search and rescue robot collaborative decision-making method according to any one of claims 1 to 6.
- 10. A computer program product, characterized in that it comprises a computer program which, when executed by a processor, implements the steps of the language-guided search and rescue robot collaborative decision-making method according to any one of claims 1 to 6.
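The decision pipeline of claims 1 and 6 can be sketched as follows. This is a minimal illustration, not the patented implementation: the prompt format, the action names, and the `toy_model` stand-in for the trained natural language guidance model are all assumptions introduced here for the sketch.

```python
from typing import Callable, List


def build_prompt(video_text: str, allowed_actions: List[str], task_prompt: str) -> str:
    """Combine the video description text, the allowed action set and the
    search and rescue task prompt word into one query (format is assumed)."""
    return (
        f"Scene: {video_text}\n"
        f"Task: {task_prompt}\n"
        f"Allowed actions: {', '.join(allowed_actions)}\n"
        "Reply with an ordered action sequence drawn only from the allowed actions."
    )


def decide_action_sequence(
    video_text: str,
    allowed_actions: List[str],
    task_prompt: str,
    language_model: Callable[[str], List[str]],
) -> List[str]:
    """Query the guidance model and keep only actions in the allowed set,
    so the robot is never assigned a task outside its capabilities."""
    prompt = build_prompt(video_text, allowed_actions, task_prompt)
    proposed = language_model(prompt)
    allowed = set(allowed_actions)
    return [action for action in proposed if action in allowed]


# Hypothetical stand-in for the trained natural language guidance model.
def toy_model(prompt: str) -> List[str]:
    return ["navigate_to_rubble", "scan_for_survivors", "fly_overhead", "report_location"]


sequence = decide_action_sequence(
    "Collapsed building with smoke near the east entrance.",
    ["navigate_to_rubble", "scan_for_survivors", "report_location"],
    "Locate trapped survivors.",
    toy_model,
)
print(sequence)  # "fly_overhead" is filtered out: it is not in the allowed set
```

Filtering the model's output against the allowed action set is one simple way to realise the claimed constraint that decisions are drawn only from actions the robot can actually perform.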
Description
Collaborative decision-making method of search and rescue robot based on language guidance

Technical Field

The application relates to the technical field of robots, and in particular to a collaborative decision-making method of a search and rescue robot based on language guidance.

Background

With the rapid development of technologies such as automatic control, artificial intelligence, 5G and high-performance computing, post-disaster rescue by rescue robots has become an important development direction. To improve the post-disaster rescue effect, different specialised rescue robots can cooperate through an efficient decision-making method. However, current rescue robot decision-making methods can only assign a rescue robot to the tasks it was trained on, and may assign it to tasks that the current rescue environment does not require, so the rescue effect of post-disaster rescue under current methods is poor.

Disclosure of Invention

The application mainly aims to provide a collaborative decision-making method of a search and rescue robot based on language guidance, so as to solve the technical problem that the rescue effect of post-disaster rescue under existing methods is poor.
In order to achieve the above purpose, the application provides a collaborative decision-making method of a search and rescue robot based on language guidance, which comprises the following steps: acquiring a real-time video collected by a search and rescue robot, an allowed action set of the search and rescue robot and a search and rescue task prompt word; inputting the real-time video into a preset video text coding model to obtain a video description text corresponding to the real-time video, wherein the video text coding model is obtained by training a preset model to be trained based on a preset search and rescue image sample and an image text sample; and inputting the video description text, the allowed action set and the search and rescue task prompt word into a preset natural language guidance model to obtain an action sequence decision result of the search and rescue robot, wherein the natural language guidance model is obtained by training a preset basic model based on a preset video text sample and an action sequence sample corresponding to the video text sample.

In an embodiment, before the step of inputting the video description text, the allowed action set and the search and rescue task prompt word into the preset natural language guidance model to obtain the action sequence decision result of the search and rescue robot, the method further includes: acquiring the basic model, the video text sample, the action sequence sample and the allowed action set, wherein the basic model is a pre-trained large language model; inputting the video text sample into the basic model, determining a plurality of target actions from the allowed action set based on the basic model, and forming a target action sequence; and training the basic model through a preset reinforcement learning training paradigm based on the target action sequence and the action sequence sample to obtain the natural language guidance model.
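The similarity-based reward and penalty used in the reinforcement learning training paradigm can be sketched as follows. The similarity metric (here `difflib.SequenceMatcher`) and the linear mapping from similarity to reward or penalty magnitude are assumptions: the embodiment only states that the reward or penalty is determined based on the sequence similarity relative to a threshold.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def sequence_similarity(target: List[str], sample: List[str]) -> float:
    """Similarity in [0, 1] between the generated target action sequence
    and the reference action sequence sample (metric is an assumption)."""
    return SequenceMatcher(None, target, sample).ratio()


def reward_or_penalty(
    target: List[str], sample: List[str], threshold: float = 0.5
) -> Tuple[str, float]:
    """Above the similarity threshold the similarity itself is used as a
    reward; below it, the shortfall becomes a negative penalty signal."""
    sim = sequence_similarity(target, sample)
    if sim >= threshold:
        return ("reward", sim)
    return ("penalty", sim - threshold)  # negative, scaled by the gap


kind, value = reward_or_penalty(
    ["move", "scan", "report"],
    ["move", "scan", "lift", "report"],
)
print(kind, round(value, 3))  # high overlap, so this pair yields a reward
```

The resulting scalar would then drive the parameter update of the basic model, e.g. as the reward term in a policy-gradient step; that update itself is outside the scope of this sketch.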
In an embodiment, the step of training the basic model through a preset reinforcement learning training paradigm based on the target action sequence and the action sequence sample to obtain the natural language guidance model includes: calculating a sequence similarity between the target action sequence and the action sequence sample; if the sequence similarity is higher than a preset similarity threshold, determining a reward value of the target action sequence based on the sequence similarity; if the sequence similarity is lower than the similarity threshold, determining a penalty value of the target action sequence based on the sequence similarity; and adjusting parameters of the basic model based on the reward value or the penalty value to obtain the natural language guidance model.

In an embodiment, before the step of inputting the real-time video into the preset video text coding model to obtain the video description text corresponding to the real-time video, the method further includes: acquiring a search and rescue image sample, an image text sample corresponding to the search and rescue image sample and the model to be trained, wherein the model to be trained is obtained through pre-training based on a preset visual data set; encoding the search and rescue image sample and the image text sample to obtain a search and rescue image code corresponding to the search and rescue image sample and an image text code corresponding to the image text sample; selecting a target search and rescue image code from the search and rescue image codes, selecting a target image text code from the image text codes, and calculating a loss value based on the target search and rescue image code, the target image text code and a preset average loss function; and updating parameters of the model to be trained based on the loss value to obtain the video text coding model.
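The "average loss function" over image and text codes (transposition, logarithm, product of the two code directions, then averaging the two loss values) reads like a symmetric contrastive loss over matched image-text pairs. One plausible reading is sketched below in pure Python; the softmax normalisation and the dot-product pairing are assumptions, since the embodiment does not fix them.

```python
import math
from typing import List

Matrix = List[List[float]]


def _softmax_rows(m: Matrix) -> Matrix:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out


def _transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def symmetric_contrastive_loss(img_codes: Matrix, txt_codes: Matrix) -> float:
    """Average of the image-to-text and text-to-image cross-entropy losses
    over matched (image code, text code) pairs."""
    # Pairwise similarity: each image code times each (transposed) text code.
    logits = [[sum(a * b for a, b in zip(i, t)) for t in txt_codes]
              for i in img_codes]
    n = len(logits)

    def cross_entropy(mat: Matrix) -> float:
        # Logarithm of the normalised scores, evaluated at the matched pairs.
        probs = _softmax_rows(mat)
        return -sum(math.log(probs[k][k]) for k in range(n)) / n

    first_loss = cross_entropy(logits)               # image -> text direction
    second_loss = cross_entropy(_transpose(logits))  # text -> image direction
    return 0.5 * (first_loss + second_loss)          # average of the two losses

# Two perfectly matched unit code pairs: the loss is small but non-zero.
loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                  [[1.0, 0.0], [0.0, 1.0]])
```

Under this reading, minimising the loss pulls each search and rescue image code toward its own text code and away from the text codes of other samples, which is what lets the trained video text coding model describe unseen rescue scenes.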