US-20260126798-A1 - SYSTEMS AND METHODS FOR MULTI-MODAL VISUAL REASONING USING KNOWLEDGE GRAPHS MAINTAINED OVER A PERIOD OF TIME
Abstract
A system for processing multi-modal data representing an environment to generate scene graphs of the environment is described. The system can obtain sensor data associated with a vehicle operating in the environment. In examples, the system can determine a set of features from the sensor data, including one or more objects and one or more agents present in the environment, and can generate a scene graph that represents the poses and velocities of these objects and agents relative to the environment. In some examples, based on generating the scene graph, the system can generate a knowledge graph by encoding the relationships among the identified objects and agents. In some examples, the system can generate a control signal, using attributes that represent the states of objects and agents in the knowledge graph, and provide this control signal to the vehicle in order to adjust or cause the operation of the vehicle.
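The pipeline summarized in the abstract (sensor data → detected objects and agents → scene graph with poses and velocities) can be illustrated with a minimal sketch. All class names, attributes, and the distance-based "near" relation below are illustrative assumptions, not drawn from the disclosure:

```python
from dataclasses import dataclass, field
from math import hypot

@dataclass
class Entity:
    """A detected object or agent with a planar pose and a speed (hypothetical schema)."""
    entity_id: str
    kind: str          # e.g. "vehicle", "pedestrian", "obstacle"
    x: float
    y: float
    heading: float
    speed: float

@dataclass
class SceneGraph:
    """Nodes are detected entities; edges encode pairwise spatial relationships."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)   # (id_a, id_b, relation)

    def add(self, e: Entity) -> None:
        self.nodes[e.entity_id] = e

    def build_edges(self, near_threshold: float = 10.0) -> None:
        """Relate every pair of entities by distance (an assumed 'near'/'far' relation)."""
        ids = sorted(self.nodes)
        self.edges = []
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                ea, eb = self.nodes[a], self.nodes[b]
                dist = hypot(ea.x - eb.x, ea.y - eb.y)
                relation = "near" if dist < near_threshold else "far"
                self.edges.append((a, b, relation))

graph = SceneGraph()
graph.add(Entity("ego", "vehicle", 0.0, 0.0, 0.0, 5.0))
graph.add(Entity("ped-1", "pedestrian", 4.0, 3.0, 1.57, 1.2))
graph.build_edges()
print(graph.edges)   # [('ego', 'ped-1', 'near')]
```

In the disclosed system such a per-frame scene graph would then feed knowledge-graph generation; here the two entities end up connected by a single "near" edge because they are 5 m apart.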
Inventors
- Aswanth Krishnan
- Lakshya Priyadarshi
- Sachin Kumar
- Nagendra Nagaraja
Assignees
- QPIAI INDIA PRIVATE LIMITED
Dates
- Publication Date
- 20260507
- Application Date
- 20250912
- Priority Date
- 20241105
Claims (20)
- 1 . A system for processing multi-modal data representing an environment during automated operation of a vehicle, the system comprising: one or more processors configured to: obtain first sensor data associated with a vehicle operating in an environment at a first point in time; generate a knowledge graph based on the first sensor data, the knowledge graph comprising (1) a plurality of nodes representing a plurality of objects and one or more agents in the environment at the first point in time, and (2) edges between respective pairs of the plurality of nodes representing relationships between objects or agents of the respective pairs of the plurality of nodes in the environment at the first point in time; in response to obtaining second sensor data associated with the vehicle at a second point in time, determine one or more changes to the relationships involving the objects and the agents; update the knowledge graph based on the one or more changes to the relationships by incrementally modifying the edges and the plurality of nodes based on the determined one or more changes to the relationships involving the objects and the agents; and in response to updating the knowledge graph, provide a control signal to the vehicle to cause operation of the vehicle based on the knowledge graph, based on determining that the modifications to the edges between the respective pairs of the plurality of nodes of the updated knowledge graph cause the updated knowledge graph to define a predefined relationship pattern indicating that an unsafe condition is present in the environment.
- 2 . The system of claim 1 , wherein the one or more processors are further configured to: in response to obtaining the first sensor data, determine a set of features associated with the environment based on the first sensor data, the set of features comprising the one or more objects and the one or more agents; and generate at least one scene graph representing the environment based on the set of features, wherein the one or more processors configured to generate the knowledge graph are configured to: generate the knowledge graph based on the at least one scene graph.
- 3 . The system of claim 2 , wherein the first sensor data comprises a first portion generated during operation of a first sensor of the vehicle and a second portion generated during operation of a second sensor, wherein the one or more processors configured to generate the at least one scene graph are configured to: generate a first scene graph for the first portion of the first sensor data and a second scene graph for the second portion of the first sensor data.
- 4 . The system of claim 3 , wherein the one or more processors are further configured to: determine a correspondence between the first portion and the second portion of the first sensor data; and aggregate attributes from the first scene graph and the second scene graph based on the correspondence to determine a global scene graph, wherein the one or more processors configured to generate the knowledge graph are configured to: determine a composite representation of the environment based on attributes from the first scene graph and the second scene graph.
- 5 . The system of claim 1 , wherein the one or more processors are further configured to: track movement of the one or more objects and the one or more agents in the environment based on the one or more changes to the relationships; in response to tracking the movement of the one or more objects and the one or more agents, determine that operation of the vehicle does not satisfy one or more operational requirements at the second point in time or a third point in time; and generate the control signal to adjust the operation of the vehicle to satisfy the one or more operational requirements at the third point in time.
- 6 . The system of claim 5 , wherein the one or more operational requirements comprise: operating below a threshold speed when the vehicle is within a predetermined distance from the one or more objects or the one or more agents.
- 7 . The system of claim 5 , wherein the one or more operational requirements comprise: operating the vehicle in accordance with a first path that is separated from a second path for objects or agents operating in the environment.
- 8 . The system of claim 5 , wherein the one or more processors configured to determine that operation of the vehicle does not satisfy one or more operational requirements are configured to: determine that the vehicle is operating in accordance with a first path that at least in part overlaps with one or more second paths of the one or more objects or the one or more agents.
- 9 . A method for processing multi-modal data representing an environment during automated operation of a vehicle, the method comprising: obtaining first sensor data associated with a vehicle operating in an environment at a first point in time; generating a knowledge graph based on the first sensor data, the knowledge graph comprising (1) a plurality of nodes representing a plurality of objects and one or more agents in the environment at the first point in time, and (2) edges between respective pairs of the plurality of nodes representing relationships between objects or agents of the respective pairs of the plurality of nodes in the environment at the first point in time; in response to obtaining second sensor data associated with the vehicle at a second point in time, determining one or more changes to the relationships involving the objects and the agents; updating the knowledge graph based on the one or more changes to the relationships by incrementally modifying the edges and the plurality of nodes based on the determined one or more changes to the relationships involving the objects and the agents; and in response to updating the knowledge graph, providing a control signal to the vehicle to cause operation of the vehicle based on determining that the modifications to the edges between the respective pairs of the plurality of nodes of the updated knowledge graph cause the updated knowledge graph to define a predefined relationship pattern.
- 10 . The method of claim 9 , further comprising: in response to obtaining the first sensor data, determining a set of features associated with the environment based on the first sensor data, the set of features comprising the one or more objects and the one or more agents; and generating at least one scene graph representing the environment based on the set of features, wherein generating the knowledge graph comprises: generating the knowledge graph based on the at least one scene graph.
- 11 . The method of claim 10 , wherein the first sensor data comprises a first portion generated during operation of a first sensor of the vehicle and a second portion generated during operation of a second sensor, wherein generating the at least one scene graph comprises: generating a first scene graph for the first portion of the first sensor data and a second scene graph for the second portion of the first sensor data.
- 12 . The method of claim 11 , further comprising: determining a correspondence between the first portion and the second portion of the first sensor data; and aggregating attributes from the first scene graph and the second scene graph based on the correspondence to determine a global scene graph, wherein generating the knowledge graph comprises: determining a composite representation of the environment based on attributes from the first scene graph and the second scene graph.
- 13 . The method of claim 9 , further comprising: tracking movement of the one or more objects and the one or more agents in the environment based on the one or more changes to the relationships; in response to tracking the movement of the one or more objects and the one or more agents, determining that operation of the vehicle does not satisfy one or more operational requirements at the second point in time or a third point in time; and generating the control signal to adjust the operation of the vehicle to satisfy the one or more operational requirements at the third point in time.
- 14 . The method of claim 13 , wherein the one or more operational requirements comprise: operating below a threshold speed when the vehicle is within a predetermined distance from the one or more objects or the one or more agents.
- 15 . The method of claim 13 , wherein the one or more operational requirements comprise: operating the vehicle in accordance with a first path that is separated from a second path for objects or agents operating in the environment.
- 16 . The method of claim 13 , wherein determining that operation of the vehicle does not satisfy one or more operational requirements comprises: determining that the vehicle is operating in accordance with a first path that at least in part overlaps with one or more second paths of the one or more objects or the one or more agents.
- 17 . One or more non-transitory computer-readable mediums storing instructions thereon that, when executed by one or more processors, cause the one or more processors to: obtain first sensor data associated with a device operating in an environment at a first point in time; generate a knowledge graph based on the first sensor data, the knowledge graph comprising (1) a plurality of nodes representing a plurality of objects and one or more agents in the environment at the first point in time, and (2) edges between respective pairs of the plurality of nodes representing relationships between objects or agents of the respective pairs of the plurality of nodes in the environment at the first point in time; in response to obtaining second sensor data associated with the device at a second point in time, determine one or more changes to the relationships involving the objects and the agents; update the knowledge graph based on the one or more changes to the relationships by incrementally modifying the edges and the plurality of nodes based on the determined one or more changes to the relationships involving the objects and the agents; and in response to updating the knowledge graph, provide a control signal to the device to cause operation of the device based on determining that the modifications to the edges between the respective pairs of the plurality of nodes of the updated knowledge graph cause the updated knowledge graph to define a predefined relationship pattern.
- 18 . The one or more non-transitory computer-readable mediums of claim 17 , wherein the instructions further cause the one or more processors to: in response to obtaining the first sensor data, determine a set of features associated with the environment based on the first sensor data, the set of features comprising the one or more objects and the one or more agents; and generate at least one scene graph representing the environment based on the set of features, wherein the instructions that cause the one or more processors to generate the knowledge graph cause the one or more processors to: generate the knowledge graph based on the at least one scene graph.
- 19 . The one or more non-transitory computer-readable mediums of claim 18 , wherein the first sensor data comprises a first portion generated during operation of a first sensor of the device and a second portion generated during operation of a second sensor, wherein the instructions that cause the one or more processors to generate the at least one scene graph cause the one or more processors to: generate a first scene graph for the first portion of the first sensor data and a second scene graph for the second portion of the first sensor data.
- 20 . The one or more non-transitory computer-readable mediums of claim 19 , wherein the instructions further cause the one or more processors to: determine a correspondence between the first portion and the second portion of the first sensor data; and aggregate attributes from the first scene graph and the second scene graph based on the correspondence to determine a global scene graph, wherein the instructions that cause the one or more processors to generate the knowledge graph cause the one or more processors to: determine a composite representation of the environment based on attributes from the first scene graph and the second scene graph.
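Claims 1, 9, and 17 recite the same core loop: build a knowledge graph of nodes and relationship edges at a first point in time, incrementally modify the edges when second sensor data arrives, and emit a control signal when the updated edges define a predefined relationship pattern. A minimal sketch of that loop, in which the edge representation, relation names, the "path_overlaps" pattern, and the "brake" signal are all illustrative assumptions rather than elements of the claims:

```python
# Knowledge-graph edges as an adjacency map: {(node_a, node_b): relation}
def update_edges(edges, changes):
    """Incrementally apply relationship changes; a None relation deletes the edge."""
    updated = dict(edges)
    for pair, relation in changes.items():
        if relation is None:
            updated.pop(pair, None)
        else:
            updated[pair] = relation
    return updated

def matches_pattern(edges, pattern):
    """A predefined relationship pattern is a set of (pair, relation) facts
    that must all hold in the updated graph."""
    return all(edges.get(pair) == relation for pair, relation in pattern)

# Predefined pattern (illustrative): the ego vehicle's path overlaps a pedestrian's.
UNSAFE = [(("ego", "ped-1"), "path_overlaps")]

t1 = {("ego", "ped-1"): "near", ("ego", "truck-7"): "far"}        # first point in time
t2 = update_edges(t1, {("ego", "ped-1"): "path_overlaps"})        # second point in time

control_signal = "brake" if matches_pattern(t2, UNSAFE) else "continue"
print(control_signal)   # brake
```

Only the changed edge is rewritten; the untouched ("ego", "truck-7") edge carries over, which is the incremental-modification behavior the claims describe.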
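Claims 4, 12, and 20 add a multi-sensor step: determine a correspondence between portions of the sensor data from two sensors, then aggregate attributes from the two per-sensor scene graphs into a global, composite representation. A sketch under stated assumptions (flat attribute dictionaries per node and simple averaging of matched estimates; the claims do not specify an aggregation rule):

```python
def aggregate(graph_a, graph_b, correspondence):
    """Merge two per-sensor scene graphs into one global scene graph.

    `correspondence` maps node ids in graph_b to node ids in graph_a that
    denote the same physical entity; attributes observed by both sensors
    are averaged (an assumed fusion rule), and unmatched nodes pass through.
    """
    merged = {node_id: dict(attrs) for node_id, attrs in graph_a.items()}
    for b_id, attrs in graph_b.items():
        a_id = correspondence.get(b_id)
        if a_id is None:
            merged[b_id] = dict(attrs)            # seen only by sensor B: keep as-is
        else:
            for key, value in attrs.items():      # seen by both: fuse the estimates
                if key in merged[a_id]:
                    merged[a_id][key] = (merged[a_id][key] + value) / 2
                else:
                    merged[a_id][key] = value
    return merged

camera = {"obj-1": {"x": 10.0, "y": 4.0}}
lidar = {"det-9": {"x": 10.4, "y": 3.6}, "det-3": {"x": 50.0, "y": 1.0}}
global_graph = aggregate(camera, lidar, {"det-9": "obj-1"})
print(global_graph)
```

The matched detection pair ("obj-1"/"det-9") collapses into one node with averaged position, while the lidar-only detection "det-3" survives into the global graph unchanged.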
Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of priority to Indian Provisional Application No. 202441084635, filed Nov. 5, 2024, the entirety of which is incorporated by reference herein.

BACKGROUND

Various automated vehicle systems and remote monitoring systems deployed in a variety of environments (e.g., drivable highways, warehouses, etc.) rely upon the acquisition and interpretation of multi-modal sensor data to perceive and interact with their surroundings. These systems commonly utilize data generated from devices such as cameras, lidar, radar, and ultrasonic sensors to detect objects and agents in proximity to a vehicle. Information from these heterogeneous sources is often fused in real time to generate structured representations, including object lists, semantic maps, or hierarchical graphs, that facilitate the environmental understanding crucial to navigation and operation.

As automated vehicle deployments and remote monitoring systems increase in scope and complexity, significant technical challenges arise. The high volume and diversity of sensor data require intensive computational resources for real-time processing, particularly when deriving semantic relationships and dynamic parameters such as pose and velocity for each detected object or agent in a scene. The frequent need to transfer, store, and synchronize environmental data and contextual models places substantial demands on both onboard memory and network bandwidth, especially during collaborative operations or remote diagnostic analyses. These pressures can lead to latency issues, increased energy consumption, and a higher risk of data bottlenecks on resource-constrained embedded platforms.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings are not intended to be drawn to scale. Like reference numbers and designations in the various drawings indicate like elements.
For purposes of clarity, not every component may be labeled in every drawing. In the drawings:

- FIG. 1 is a block diagram illustrating an environment for processing multi-modal data representing an environment to generate scene graphs and/or knowledge graphs, in accordance with one or more embodiments;
- FIG. 2 is a block diagram illustrating a process for generating a dynamic scene graph from multi-modal inputs and integrating the dynamic scene graph as an update to a knowledge graph, in accordance with one or more embodiments;
- FIG. 3 is a block diagram illustrating a process for scene graph generation for knowledge graph generation or updates, in accordance with one or more embodiments;
- FIG. 4 is a block diagram illustrating a process for continuous learning and knowledge evolution to maintain a knowledge graph, in accordance with one or more embodiments;
- FIG. 5 is a flow diagram illustrating a process for executing cascaded visual reasoning tasks by decomposing a complex task into subtasks and processing the tasks in parallel, in accordance with one or more embodiments;
- FIG. 6 is a flowchart illustrating the decision-making process for task processing in a distributed visual reasoning system, in accordance with one or more embodiments;
- FIG. 7 is a flowchart illustrating a process for task processing by an analytics server, in accordance with one or more embodiments;
- FIG. 8 is a block diagram illustrating a process of aggregating scene graphs from multiple edge devices to generate a consolidated global scene graph in a distributed visual reasoning system, in accordance with one or more embodiments;
- FIG. 9 is a flowchart illustrating a method for processing multi-modal data representing an environment to generate scene graphs of the environment, in accordance with one or more embodiments;
- FIG. 10 is a flowchart illustrating a method for generating a knowledge graph from multi-modal sensor data and providing a control signal for vehicle operation, in accordance with one or more embodiments; and
- FIG. 11 is a flowchart illustrating a method for processing multi-modal data representing an environment during automated operation of a vehicle, in accordance with one or more embodiments.

DETAILED DESCRIPTION

Below are detailed descriptions of various concepts related to, and of approaches, methods, apparatuses, and systems for implementing, the various techniques described herein. The various concepts introduced above and discussed in greater detail below may be implemented in any of numerous ways, as the described concepts are not limited to any particular manner of implementation. Examples of specific implementations and applications are provided primarily for illustrative purposes.

In examples, a system is provided for processing multi-modal data representing an environment to generate scene graphs of the environment. The system can operate in connection with one or more mobile platforms such as autonomous vehicles, aerial drones, automated guided vehicles (AGVs), or stationary monitoring stations (e.g., remote monitoring systems)