US-12619241-B2 - System and method for zero-shot object navigation using large language models
Abstract
A method includes determining a specified object to locate within a surrounding environment. The method also includes causing a robot to capture an image and a depth map of the surrounding environment. The method further includes, using a scene understanding model, predicting one or more rooms and one or more objects captured in the image. The method also includes updating a second map of the surrounding environment based on the predicted rooms, the predicted objects, the depth map, and a location of the robot. The method further includes determining a likelihood of the specified object being in a candidate room and a likelihood of the specified object being near a candidate object using a pre-trained large language model. The method also includes causing the robot to move to a next location for the robot to search for the specified object, based on the likelihoods and the second map.
Inventors
- Yilin Shen
- Kaiwen Zhou
- Hongxia Jin
Assignees
- SAMSUNG ELECTRONICS CO., LTD.
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-11-03
Claims (20)
- 1 . A method comprising: determining a specified object to locate within a surrounding environment, the surrounding environment comprising multiple candidate rooms and multiple candidate objects; causing a robot to capture an image and a depth map of the surrounding environment; using a scene understanding model, predicting one or more rooms and one or more objects captured in the image; updating a semantic map of the surrounding environment based on the one or more predicted rooms, the one or more predicted objects, the depth map, and a location of the robot; determining a likelihood of the specified object being in each of the candidate rooms and a likelihood of the specified object being within a threshold distance of each of the candidate objects using a pre-trained large language model, wherein the pre-trained large language model receives as an input a natural language query comprising the specified object and each of the candidate rooms to the pre-trained large language model, and wherein the pre-trained large language model provides the determined likelihoods based on known associations between the specified object, the candidate rooms, and the candidate objects; and causing the robot to move to a next location for the robot to search for the specified object, based on the determined likelihoods and the semantic map of the surrounding environment.
- 2 . The method of claim 1 , wherein determining the specified object to locate within the surrounding environment comprises receiving a request from a user to locate the specified object within the surrounding environment.
- 3 . The method of claim 1 , wherein determining the likelihood of the specified object being in each of the candidate rooms and the likelihood of the specified object being within the threshold distance of each of the candidate objects using the pre-trained large language model comprises: inputting the natural language query comprising the specified object and each of the candidate rooms to the pre-trained large language model; and obtaining a response from the pre-trained large language model, the response comprising a likelihood score for each of the candidate rooms.
- 4 . The method of claim 1 , wherein causing the robot to move to the next location based on the determined likelihoods and the semantic map of the surrounding environment comprises using a probabilistic soft logic algorithm and one or more of the determined likelihoods to select a frontier among multiple frontiers identified in the semantic map.
- 5 . The method of claim 1 , wherein causing the robot to move to the next location comprises causing the robot to move to an unexplored location within a threshold distance of a first predicted object of the one or more predicted objects if the likelihood of the first predicted object being within the threshold distance of the specified object is greater than a threshold.
- 6 . The method of claim 5 , wherein causing the robot to move to the next location further comprises causing the robot to not move to the unexplored location within the threshold distance of the first predicted object if the likelihood of the first predicted object being within the threshold distance of the specified object is less than the threshold.
- 7 . The method of claim 1 , wherein causing the robot to move to the next location comprises causing the robot to move to an unexplored location in or within a threshold distance of a first predicted room of the one or more predicted rooms if the likelihood of the specified object being in or within the threshold distance of the first predicted room is greater than a threshold.
- 8 . An electronic device comprising: at least one processor configured to: determine a specified object to locate within a surrounding environment, the surrounding environment comprising multiple candidate rooms and multiple candidate objects; cause a robot to capture an image and a depth map of the surrounding environment; using a scene understanding model, predict one or more rooms and one or more objects captured in the image; update a semantic map of the surrounding environment based on the one or more predicted rooms, the one or more predicted objects, the depth map, and a location of the robot; determine a likelihood of the specified object being in each of the candidate rooms and a likelihood of the specified object being within a threshold distance of each of the candidate objects using a pre-trained large language model, wherein the pre-trained large language model receives as an input a natural language query comprising the specified object and each of the candidate rooms to the pre-trained large language model, and wherein the pre-trained large language model provides the determined likelihoods based on known associations between the specified object, the candidate rooms, and the candidate objects; and cause the robot to move to a next location for the robot to search for the specified object, based on the determined likelihoods and the semantic map of the surrounding environment.
- 9 . The electronic device of claim 8 , wherein to determine the specified object to locate within the surrounding environment, the at least one processor is configured to receive a request from a user to locate the specified object within the surrounding environment.
- 10 . The electronic device of claim 8 , wherein to determine the likelihood of the specified object being in each of the candidate rooms and the likelihood of the specified object being within the threshold distance of each of the candidate objects using the pre-trained large language model, the at least one processor is configured to: input the natural language query comprising the specified object and each of the candidate rooms to the pre-trained large language model; and obtain a response from the pre-trained large language model, the response comprising a likelihood score for each of the candidate rooms.
- 11 . The electronic device of claim 8 , wherein to cause the robot to move to the next location based on the determined likelihoods and the semantic map of the surrounding environment, the at least one processor is configured to use a probabilistic soft logic algorithm and one or more of the determined likelihoods to select a frontier among multiple frontiers identified in the semantic map.
- 12 . The electronic device of claim 8 , wherein to cause the robot to move to the next location, the at least one processor is configured to cause the robot to move to an unexplored location within a threshold distance of a first predicted object of the one or more predicted objects if the likelihood of the first predicted object being within the threshold distance of the specified object is greater than a threshold.
- 13 . The electronic device of claim 12 , wherein to cause the robot to move to the next location, the at least one processor is further configured to cause the robot to not move to the unexplored location within the threshold distance of the first predicted object if the likelihood of the first predicted object being within the threshold distance of the specified object is less than the threshold.
- 14 . The electronic device of claim 8 , wherein to cause the robot to move to the next location, the at least one processor is configured to cause the robot to move to an unexplored location in or within a threshold distance of a first predicted room of the one or more predicted rooms if the likelihood of the specified object being in or within the threshold distance of the first predicted room is greater than a threshold.
- 15 . A non-transitory machine-readable medium containing instructions that when executed cause at least one processor of an electronic device to: determine a specified object to locate within a surrounding environment, the surrounding environment comprising multiple candidate rooms and multiple candidate objects; cause a robot to capture an image and a depth map of the surrounding environment; using a scene understanding model, predict one or more rooms and one or more objects captured in the image; update a semantic map of the surrounding environment based on the one or more predicted rooms, the one or more predicted objects, the depth map, and a location of the robot; determine a likelihood of the specified object being in each of the candidate rooms and a likelihood of the specified object being within a threshold distance of each of the candidate objects using a pre-trained large language model, wherein the pre-trained large language model receives as an input a natural language query comprising the specified object and each of the candidate rooms to the pre-trained large language model, and wherein the pre-trained large language model provides the determined likelihoods based on known associations between the specified object, the candidate rooms, and the candidate objects; and cause the robot to move to a next location for the robot to search for the specified object, based on the determined likelihoods and the semantic map of the surrounding environment.
- 16 . The non-transitory machine-readable medium of claim 15 , wherein the instructions to determine the specified object to locate within the surrounding environment, comprise instructions to receive a request from a user to locate the specified object within the surrounding environment.
- 17 . The non-transitory machine-readable medium of claim 15 , wherein the instructions to determine the likelihood of the specified object being in each of the candidate rooms and the likelihood of the specified object being within the threshold distance of each of the candidate objects using the pre-trained large language model, comprise instructions to: input the natural language query comprising the specified object and each of the candidate rooms to the pre-trained large language model; and obtain a response from the pre-trained large language model, the response comprising a likelihood score for each of the candidate rooms.
- 18 . The non-transitory machine-readable medium of claim 15 , wherein the instructions to cause the robot to move to the next location based on the determined likelihoods and the semantic map of the surrounding environment, comprise instructions to use a probabilistic soft logic algorithm and one or more of the determined likelihoods to select a frontier among multiple frontiers identified in the semantic map.
- 19 . The non-transitory machine-readable medium of claim 15 , wherein the instructions to cause the robot to move to the next location, comprise instructions to cause the robot to move to an unexplored location within a threshold distance of a first predicted object of the one or more predicted objects if the likelihood of the first predicted object being within the threshold distance of the specified object is greater than a threshold.
- 20 . The non-transitory machine-readable medium of claim 19 , wherein the instructions to cause the robot to move to the next location, further comprise instructions to cause the robot to not move to the unexplored location within the threshold distance of the first predicted object if the likelihood of the first predicted object being within the threshold distance of the specified object is less than the threshold.
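The likelihood-query step recited in claims 3, 10, and 17 (a natural-language query naming the specified object and each candidate room is input to the pre-trained large language model, which returns a likelihood score per candidate room) can be sketched as follows. The prompt wording, the `query_llm` callable, and the line-oriented response format are all assumptions for illustration; the claims do not fix a specific prompt or model API.

```python
from typing import Callable, Dict, List

def room_likelihoods(
    specified_object: str,
    candidate_rooms: List[str],
    query_llm: Callable[[str], str],
) -> Dict[str, float]:
    """Ask an LLM how likely the specified object is to be in each room."""
    # Build the natural-language query containing the specified object
    # and each candidate room (claims 3, 10, 17).
    prompt = (
        f"You are helping a robot find a {specified_object}. "
        f"For each of the following rooms, give a likelihood between 0 and 1 "
        f"that a {specified_object} is in that room, one 'room: score' per "
        f"line: " + ", ".join(candidate_rooms)
    )
    response = query_llm(prompt)

    # Parse a likelihood score for each candidate room from the response.
    scores: Dict[str, float] = {}
    for line in response.splitlines():
        if ":" not in line:
            continue
        room, _, value = line.partition(":")
        try:
            scores[room.strip().lower()] = float(value)
        except ValueError:
            continue  # skip lines the model did not format as requested
    # Default any room the model skipped to a neutral score.
    return {room: scores.get(room.lower(), 0.5) for room in candidate_rooms}
```

An analogous query over candidate objects would yield the object co-occurrence likelihoods; both rely only on the associations the language model learned during pre-training, which is what makes the navigation zero-shot.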
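Claims 4 through 7 describe selecting a frontier on the semantic map using the determined likelihoods, and declining to move toward a predicted object whose co-occurrence likelihood falls below a threshold. A minimal sketch of that selection behavior, assuming a simple additive score, is below; the `Frontier` structure and the scoring rule are hypothetical, and the claims' probabilistic soft logic formulation is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Frontier:
    location: tuple      # (x, y) cell on the semantic map
    room: str            # predicted room containing the frontier
    nearby_object: str   # closest predicted object, if any

def select_frontier(
    frontiers: List[Frontier],
    room_scores: Dict[str, float],
    object_scores: Dict[str, float],
    object_threshold: float = 0.3,
) -> Optional[Frontier]:
    """Pick the frontier most likely to lead to the specified object."""
    best, best_score = None, float("-inf")
    for f in frontiers:
        obj_score = object_scores.get(f.nearby_object, 0.0)
        # Claims 5-6: do not move toward a predicted object whose
        # likelihood of being near the goal is below the threshold.
        if f.nearby_object and obj_score < object_threshold:
            continue
        # Claims 4 and 7: prefer frontiers in rooms the LLM rates likely.
        score = room_scores.get(f.room, 0.0) + obj_score
        if score > best_score:
            best, best_score = f, score
    return best
```

Returning `None` when every frontier is filtered out would correspond to falling back on ordinary exploration, a behavior the claims leave unspecified.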
Description
CROSS-REFERENCE TO RELATED APPLICATION AND PRIORITY CLAIM

This application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application No. 63/466,212 filed on May 12, 2023, which is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

This disclosure relates generally to object navigation. More specifically, this disclosure relates to a system and method for zero-shot object navigation using large language models.

BACKGROUND

Object navigation is a task in which an embodied agent must navigate to a specific goal object within an unknown environment. This task can be fundamental to other navigation-based embodied tasks because it enables the agent to interact with the goal object. Such object navigation tasks usually require large-scale training in visual environments with labeled objects.

SUMMARY

This disclosure provides a system and method for zero-shot object navigation using large language models. In a first embodiment, a method includes determining a specified object to locate within a surrounding environment, the surrounding environment comprising multiple candidate rooms and multiple candidate objects. The method also includes causing a robot to capture an image and a depth map of the surrounding environment. The method further includes, using a scene understanding model, predicting one or more rooms and one or more objects captured in the image. The method also includes updating a second map of the surrounding environment based on the one or more predicted rooms, the one or more predicted objects, the depth map, and a location of the robot. The method further includes determining a likelihood of the specified object being in each of the candidate rooms and a likelihood of the specified object being near each of the candidate objects using a pre-trained large language model.
In addition, the method includes causing the robot to move to a next location for the robot to search for the specified object, based on the determined likelihoods and the second map of the surrounding environment.

In a second embodiment, an electronic device includes at least one processing device configured to determine a specified object to locate within a surrounding environment, the surrounding environment comprising multiple candidate rooms and multiple candidate objects. The at least one processing device is also configured to cause a robot to capture an image and a depth map of the surrounding environment. The at least one processing device is further configured to, using a scene understanding model, predict one or more rooms and one or more objects captured in the image. The at least one processing device is also configured to update a second map of the surrounding environment based on the one or more predicted rooms, the one or more predicted objects, the depth map, and a location of the robot. The at least one processing device is further configured to determine a likelihood of the specified object being in each of the candidate rooms and a likelihood of the specified object being near each of the candidate objects using a pre-trained large language model. In addition, the at least one processing device is configured to cause the robot to move to a next location for the robot to search for the specified object, based on the determined likelihoods and the second map of the surrounding environment.

In a third embodiment, a non-transitory machine-readable medium contains instructions that when executed cause at least one processor of an electronic device to determine a specified object to locate within a surrounding environment, the surrounding environment comprising multiple candidate rooms and multiple candidate objects.
The non-transitory machine-readable medium also contains instructions that when executed cause the at least one processor to cause a robot to capture an image and a depth map of the surrounding environment. The non-transitory machine-readable medium further contains instructions that when executed cause the at least one processor to, using a scene understanding model, predict one or more rooms and one or more objects captured in the image. The non-transitory machine-readable medium also contains instructions that when executed cause the at least one processor to update a second map of the surrounding environment based on the one or more predicted rooms, the one or more predicted objects, the depth map, and a location of the robot. The non-transitory machine-readable medium further contains instructions that when executed cause the at least one processor to determine a likelihood of the specified object being in each of the candidate rooms and a likelihood of the specified object being near each of the candidate objects using a pre-trained large language model. In addition, the non-transitory machine-readable medium contains instructions that when executed cause the at least one processor to cause the robot to move to a next location for the robot to search for the specified object, based on the determined likelihoods and the second map of the surrounding environment.
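The embodiments above share one perception-reason-act loop: capture an image and depth map, run the scene understanding model, update the map, query the language model, and move. The loop can be sketched as below; every name used here (`capture_observation`, `scene_model.predict`, `semantic_map.update`, `llm.likelihoods`, `choose_next_location`, `move_to`) is a hypothetical stand-in for a component the disclosure leaves implementation-defined.

```python
def search_for(robot, specified_object, semantic_map, scene_model, llm):
    """Drive the robot until it observes the specified object."""
    while not robot.sees(specified_object):
        # Capture an RGB image and a depth map of the surroundings.
        image, depth = robot.capture_observation()
        # Predict rooms and objects in view with the scene understanding model.
        rooms, objects = scene_model.predict(image)
        # Update the map from the predictions, depth, and robot location.
        semantic_map.update(rooms, objects, depth, robot.location)
        # Query the pre-trained LLM for room and object likelihoods.
        room_p, object_p = llm.likelihoods(specified_object, rooms, objects)
        # Move to the next search location chosen from the likelihoods and map.
        next_location = semantic_map.choose_next_location(room_p, object_p)
        robot.move_to(next_location)
    return robot.location
```

Because the likelihoods come from a pre-trained model rather than task-specific training, the loop requires no labeled navigation episodes for a new goal object.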