US-20260125075-A1 - SYSTEMS AND METHODS FOR MULTI-MODAL VISUAL REASONING USING DYNAMIC SCENE GRAPHS AND KNOWLEDGE GRAPHS
Abstract
A system for processing multi-modal data representing an environment to generate scene graphs of the environment is described. The system can obtain sensor data associated with a vehicle operating in the environment. In examples, the system can determine a set of features from the sensor data, including one or more objects and one or more agents present in the environment, and can generate a scene graph that represents the poses and velocities of these objects and agents relative to the environment. In some examples, based on generating the scene graph, the system can generate a knowledge graph by encoding the relationships among the identified objects and agents. In some examples, the system can generate a control signal, using attributes that represent the states of objects and agents in the knowledge graph, and provide this control signal to the vehicle in order to adjust or cause the operation of the vehicle.
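The abstract above describes a pipeline from sensor features to a scene graph (nodes carrying pose and velocity), to a knowledge graph (relationship edges), to a control signal. As a purely illustrative sketch of that data flow (not part of the application; every class, relationship, threshold, and value below is a hypothetical assumption):

```python
import math
from dataclasses import dataclass, field

# Hypothetical sketch of the scene-graph -> knowledge-graph -> control-signal
# pipeline described in the abstract. All names and thresholds are
# illustrative assumptions, not taken from the application.

@dataclass
class Node:
    """Scene-graph node: a detected object or agent with pose and velocity."""
    node_id: str
    kind: str        # "object" or "agent"
    pose: tuple      # (x, y) position relative to the environment
    velocity: tuple  # (vx, vy)

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)

    def add(self, node: Node):
        self.nodes[node.node_id] = node

@dataclass
class KnowledgeGraph:
    """Edges encode relationships among nodes (here: pairwise distance)."""
    relations: dict = field(default_factory=dict)

    def update_from(self, scene: SceneGraph):
        ids = list(scene.nodes)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                pa, pb = scene.nodes[a].pose, scene.nodes[b].pose
                self.relations[(a, b)] = math.dist(pa, pb)

def control_signal(kg: KnowledgeGraph, min_gap: float, current_speed: float):
    """Reduce speed if any relationship violates the operating parameter."""
    if any(d < min_gap for d in kg.relations.values()):
        return {"action": "reduce_speed", "target_speed": current_speed * 0.5}
    return {"action": "maintain", "target_speed": current_speed}

# Example: ego vehicle and a pedestrian detected 2 m apart.
scene = SceneGraph()
scene.add(Node("ego", "agent", (0.0, 0.0), (5.0, 0.0)))
scene.add(Node("ped1", "agent", (2.0, 0.0), (0.0, 1.0)))
kg = KnowledgeGraph()
kg.update_from(scene)
print(control_signal(kg, min_gap=5.0, current_speed=10.0))
# → {'action': 'reduce_speed', 'target_speed': 5.0}
```

The choice of pairwise distance as the encoded relationship, and of halving the speed on a violation, is arbitrary; the application itself leaves the relationship semantics and control policy open.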
Inventors
- Aswanth Krishnan
- Lakshya Priyadarshi
- Sachin Kumar
- Nagendra Nagaraja
Assignees
- QPIAI INDIA PRIVATE LIMITED
Dates
- Publication Date: 2026-05-07
- Application Date: 2025-09-12
- Priority Date: 2024-11-05
Claims (20)
- 1 . A system for processing multi-modal data representing an environment to generate scene graphs of the environment, the system comprising: one or more processors configured to: obtain first sensor data associated with a vehicle operating in an environment, the first sensor data comprising a first portion associated with a first sensor and a second portion associated with a second sensor; determine a set of features associated with the environment based on the first sensor data, the set of features comprising one or more objects and one or more agents; in response to determining the set of features, generate a scene graph comprising a first plurality of nodes and representing poses and velocities of the one or more objects and the one or more agents relative to the environment; in response to generating the scene graph, generate a knowledge graph based on the scene graph and stored contextual information, the knowledge graph representing relationships involving the one or more objects and the one or more agents in the environment for use in strategic planning of the vehicle as represented by the first plurality of nodes of the scene graph; generate a second scene graph based on second sensor data associated with the vehicle operating in the environment, the second scene graph comprising a second plurality of nodes and representing updated poses and velocities of the one or more objects and the one or more agents relative to the environment; revise the knowledge graph by updating the relationships involving the one or more objects and the one or more agents in the environment based on the updated poses and velocities of the one or more objects and the one or more agents as represented by the second plurality of nodes of the second scene graph; generate a control signal configured to adjust an operation of the vehicle based on the updated relationships of the updated knowledge graph using one or more first attributes representing first states of the one or more 
objects or one or more second attributes representing second states of the one or more agents from the knowledge graph; and provide the control signal to the vehicle to cause the adjusted operation of the vehicle.
- 2 . The system of claim 1 , wherein the one or more processors are further configured to: obtain third sensor data associated with the vehicle operating in the environment, the third sensor data comprising a third portion associated with the first sensor and a fourth portion associated with the second sensor, the third sensor data generated after the second sensor data is generated; and update at least one relationship represented by the knowledge graph based on the third sensor data.
- 3 . The system of claim 2 , wherein the control signal comprises a first control signal, and wherein the one or more processors are further configured to: generate a second control signal configured to adjust the operation of the vehicle in response to updating the knowledge graph based on the third sensor data.
- 4 . The system of claim 1 , wherein the one or more processors are further configured to: determine that the first states of the one or more objects indicate a relationship that violates an operating parameter of the environment; and in response to determining that the relationship violates the operating parameter, determine to generate the control signal to adjust operation of the vehicle.
- 5 . The system of claim 4 , wherein the one or more processors configured to determine to generate the control signal are configured to: determine to adjust the operation of the vehicle by reducing a speed of the vehicle from a first speed to a second speed, and generate the control signal to cause the vehicle to operate at the second speed.
- 6 . The system of claim 4 , wherein the vehicle is operating in accordance with a first path, and wherein the one or more processors configured to determine to generate the control signal are configured to: determine to adjust the operation of the vehicle by transitioning operation of the vehicle from the first path to a second path, and generate the control signal to cause the vehicle to operate in accordance with the second path.
- 7 . The system of claim 6 , wherein the one or more processors are further configured to: generate the second path based on the operating parameter associated with the relationship.
- 8 . A method comprising: obtaining first sensor data associated with a vehicle operating in an environment, the first sensor data comprising a first portion associated with a first sensor and a second portion associated with a second sensor; determining a set of features associated with the environment based on the first sensor data, the set of features comprising one or more objects and one or more agents; in response to determining the set of features, generating a scene graph comprising a first plurality of nodes and representing poses and velocities of the one or more objects and the one or more agents relative to the environment; in response to generating the scene graph, generating a knowledge graph based on the scene graph and stored contextual information, the knowledge graph representing relationships involving the one or more objects and the one or more agents in the environment for use in strategic planning of the vehicle as represented by the first plurality of nodes of the scene graph; generating a second scene graph based on second sensor data associated with the vehicle operating in the environment, the second scene graph comprising a second plurality of nodes and representing updated poses and velocities of the one or more objects and the one or more agents relative to the environment; revising the knowledge graph by updating the relationships involving the one or more objects and the one or more agents in the environment based on the updated poses and velocities of the one or more objects and the one or more agents as represented by the second plurality of nodes of the second scene graph; generating a control signal configured to adjust an operation of the vehicle based on the updated relationships of the updated knowledge graph using one or more first attributes representing first states of the one or more objects or one or more second attributes representing second states of the one or more agents from the knowledge graph; and providing the 
control signal to the vehicle to cause the adjusted operation of the vehicle.
- 9 . The method of claim 8 , further comprising: obtaining third sensor data associated with the vehicle operating in the environment, the third sensor data comprising a third portion associated with the first sensor and a fourth portion associated with the second sensor, the third sensor data generated after the second sensor data is generated; and updating at least one relationship represented by the knowledge graph based on the third sensor data.
- 10 . The method of claim 9 , wherein the control signal comprises a first control signal, the method further comprising: generating a second control signal configured to adjust the operation of the vehicle in response to updating the knowledge graph based on the third sensor data.
- 11 . The method of claim 8 , further comprising: determining that the first states of the one or more objects indicate a relationship that violates an operating parameter of the environment; and in response to determining that the relationship violates the operating parameter, determining to generate the control signal to adjust operation of the vehicle.
- 12 . The method of claim 11 , wherein determining to generate the control signal comprises: determining to adjust the operation of the vehicle by reducing a speed of the vehicle from a first speed to a second speed, and generating the control signal to cause the vehicle to operate at the second speed.
- 13 . The method of claim 11 , wherein the vehicle is operating in accordance with a first path, and wherein determining to generate the control signal comprises: determining to adjust the operation of the vehicle by transitioning operation of the vehicle from the first path to a second path, and generating the control signal to cause the vehicle to operate in accordance with the second path.
- 14 . The method of claim 13 , further comprising: generating the second path based on the operating parameter associated with the relationship.
- 15 . One or more non-transitory computer-readable mediums storing instructions thereon that, when executed by one or more processors, cause the one or more processors to: obtain first sensor data associated with a device operating in an environment, the first sensor data comprising a first portion associated with a first sensor and a second portion associated with a second sensor; determine a set of features associated with the environment based on the first sensor data, the set of features comprising one or more objects and one or more agents; in response to determining the set of features, generate a scene graph comprising a first plurality of nodes and representing poses and velocities of the one or more objects and the one or more agents relative to the environment; in response to generating the scene graph, generate a knowledge graph based on the scene graph and stored contextual information, the knowledge graph representing relationships involving the one or more objects and the one or more agents in the environment for use in strategic planning of the device as represented by the first plurality of nodes of the scene graph; generate a second scene graph based on second sensor data associated with the device operating in the environment, the second scene graph comprising a second plurality of nodes and representing updated poses and velocities of the one or more objects and the one or more agents relative to the environment; revise the knowledge graph by updating the relationships involving the one or more objects and the one or more agents in the environment based on the updated poses and velocities of the one or more objects and the one or more agents as represented by the second plurality of nodes of the second scene graph; generate a control signal configured to adjust an operation of the device based on the updated relationships of the updated knowledge graph using one or more first attributes representing first states of the one or more objects or one or more second attributes representing second states of the one or more agents from the knowledge graph; and provide the control signal to the device to cause the adjusted operation of the device.
- 16 . The one or more non-transitory computer-readable mediums of claim 15 , wherein the instructions further cause the one or more processors to: obtain third sensor data associated with the device operating in the environment, the third sensor data comprising a third portion associated with the first sensor and a fourth portion associated with the second sensor, the third sensor data generated after the second sensor data is generated; and update at least one relationship represented by the knowledge graph based on the third sensor data.
- 17 . The one or more non-transitory computer-readable mediums of claim 16 , wherein the control signal comprises a first control signal, and wherein the instructions further cause the one or more processors to: generate a second control signal configured to adjust the operation of the device in response to updating the knowledge graph.
- 18 . The one or more non-transitory computer-readable mediums of claim 15 , wherein the instructions further cause the one or more processors to: determine that the first states of the one or more objects indicate a relationship that violates an operating parameter of the environment; and in response to determining that the relationship violates the operating parameter, determine to generate the control signal to adjust operation of the device.
- 19 . The one or more non-transitory computer-readable mediums of claim 18 , wherein the instructions that cause the one or more processors to determine to generate the control signal cause the one or more processors to: determine to adjust the operation of the device by reducing a speed of the device from a first speed to a second speed, and generate the control signal to cause the device to operate at the second speed.
- 20 . The one or more non-transitory computer-readable mediums of claim 18 , wherein the device is operating in accordance with a first path, and wherein the instructions that cause the one or more processors to determine to generate the control signal cause the one or more processors to: determine to adjust the operation of the device by transitioning operation of the device from the first path to a second path, and generate the control signal to cause the device to operate in accordance with the second path.
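The claims above recite an iterative loop: each new batch of sensor data produces a new scene graph, the knowledge graph is revised from it, and a fresh control signal may issue. A minimal illustrative sketch of that loop (every function name, data shape, and threshold is a hypothetical assumption, not taken from the claims):

```python
# Illustrative-only sketch of the revise-and-control loop in claims 1-3.
# Nothing here is taken from the application; all names are hypothetical.

def build_scene(frame):
    """Build a scene snapshot {node_id: (pose, velocity)} from one frame
    of fused sensor data (here, frame is already a dict of detections)."""
    return {nid: (d["pose"], d["velocity"]) for nid, d in frame.items()}

def revise_knowledge(kg, scene):
    """Revise relationship edges from the latest scene snapshot: store
    each pair's gap along the x axis as a toy 'relationship'."""
    ids = sorted(scene)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            kg[(a, b)] = abs(scene[a][0][0] - scene[b][0][0])
    return kg

def plan(kg, min_gap=5.0):
    """Emit a control signal when any relationship violates min_gap."""
    return "slow_down" if any(g < min_gap for g in kg.values()) else "maintain"

kg = {}
frames = [
    {"ego": {"pose": (0.0, 0.0), "velocity": (5.0, 0.0)},
     "car1": {"pose": (20.0, 0.0), "velocity": (4.0, 0.0)}},
    {"ego": {"pose": (5.0, 0.0), "velocity": (5.0, 0.0)},
     "car1": {"pose": (9.0, 0.0), "velocity": (1.0, 0.0)}},
]
for frame in frames:  # first sensor data, then second sensor data
    kg = revise_knowledge(kg, build_scene(frame))
    print(plan(kg))
# → prints "maintain", then "slow_down"
```

Note that the knowledge graph persists across frames and is revised in place, mirroring the claims' distinction between generating the graph once and then updating its relationships from each subsequent scene graph.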
Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of priority to Indian Provisional Application No. 202441084635, filed Nov. 5, 2024, the entirety of which is incorporated by reference herein.

BACKGROUND

Various automated vehicle systems and remote monitoring systems deployed in a variety of environments (e.g., drivable highways, warehouses, etc.) rely upon the acquisition and interpretation of multi-modal sensor data to perceive and interact with their surrounding environment. These systems commonly utilize data generated from devices such as cameras, lidar, radar, and ultrasonic sensors to detect objects and agents in proximity to a vehicle. Information from these heterogeneous sources is often fused in real time to generate structured representations, including object lists, semantic maps, or hierarchical graphs, that facilitate the environmental understanding crucial to navigation and operation.

As automated vehicle deployments and remote monitoring systems increase in scope and complexity, significant technical challenges arise. The high volume and diversity of sensor data require intensive computational resources for real-time processing, particularly when deriving semantic relationships and dynamic parameters such as pose and velocity for each detected object or agent in a scene. The frequent need to transfer, store, and synchronize environmental data and contextual models places substantial demands on both onboard memory and network bandwidth, especially during collaborative operations or remote diagnostic analyses. These pressures can lead to latency issues, increased energy consumption, and a higher risk of data bottlenecks on resource-constrained embedded platforms.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings are not intended to be drawn to scale. Like reference numbers and designations in the various drawings indicate like elements. For purposes of clarity, not every component may be labeled in every drawing. In the drawings:

FIG. 1 is a block diagram illustrating an environment for processing multi-modal data representing an environment to generate scene graphs and/or knowledge graphs, in accordance with one or more embodiments;

FIG. 2 is a block diagram illustrating a process for generating a dynamic scene graph from multi-modal inputs and integrating the dynamic scene graph as an update to a knowledge graph, in accordance with one or more embodiments;

FIG. 3 is a block diagram illustrating a process for scene graph generation for knowledge graph generation or updates, in accordance with one or more embodiments;

FIG. 4 is a block diagram illustrating a process for continuous learning and knowledge evolution to maintain a knowledge graph, in accordance with one or more embodiments;

FIG. 5 is a flow diagram illustrating a process for executing cascaded visual reasoning tasks by decomposing a complex task into subtasks and processing the tasks in parallel, in accordance with one or more embodiments;

FIG. 6 is a flowchart illustrating the decision-making process for task processing in a distributed visual reasoning system, in accordance with one or more embodiments;

FIG. 7 is a flowchart illustrating a process for task processing by an analytics server, in accordance with one or more embodiments;

FIG. 8 is a block diagram illustrating a process of aggregating scene graphs from multiple edge devices to generate a consolidated global scene graph in a distributed visual reasoning system, in accordance with one or more embodiments;

FIG. 9 is a flowchart illustrating a method for processing multi-modal data representing an environment to generate scene graphs of the environment, in accordance with one or more embodiments;

FIG. 10 is a flowchart illustrating a method for generating a knowledge graph from multi-modal sensor data and providing a control signal for vehicle operation, in accordance with one or more embodiments; and

FIG. 11 is a flowchart illustrating a method for processing multi-modal data representing an environment during automated operation of a vehicle, in accordance with one or more embodiments.

DETAILED DESCRIPTION

Below are detailed descriptions of various concepts related to, and of approaches, methods, apparatuses, and systems for implementing, the various techniques described herein. The various concepts introduced above and discussed in greater detail below may be implemented in any of numerous ways, as the described concepts are not limited to any particular manner of implementation. Examples of specific implementations and applications are provided primarily for illustrative purposes.

In examples, a system is provided for processing multi-modal data representing an environment to generate scene graphs of the environment. The system can operate in connection with one or more mobile platforms such as autonomous vehicles, aerial drones, automated guided vehicles (AGVs), or stationary monitoring stations (e.g., remote monitoring systems)