CN-114972727-B - System and method for multimodal neurosymbolic scene understanding

CN 114972727 B

Abstract

A system for image processing includes a first sensor configured to capture at least one or more images, a second sensor configured to capture sound information, and a processor in communication with the first sensor and the second sensor. The processor is programmed to receive the one or more images and the sound information, extract, with an encoder, one or more data features associated with the image and sound information, output metadata to a spatiotemporal reasoning engine via a decoder, wherein the metadata is derived with the decoder and the one or more data features, determine one or more scenes with the spatiotemporal reasoning engine and the metadata, and output control commands in response to the one or more scenes.
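
The abstract describes a staged pipeline: multimodal samples are converted to a unified structure, encoded into features, decoded into metadata, interpreted by a spatiotemporal reasoning engine, and finally mapped to a control command. The minimal Python sketch below illustrates that staging only; every class, function, data shape, and threshold is hypothetical, since the patent does not disclose an implementation.

    # Hypothetical illustration of the claimed data flow; not the patented code.
    from dataclasses import dataclass
    from typing import List
    import numpy as np

    @dataclass
    class UnifiedSample:
        """One observation converted to the 'unified structure or class'
        that the claims name as the preprocessing output."""
        modality: str        # e.g. "image" or "sound"
        timestamp: float
        data: np.ndarray

    def encode(samples: List[UnifiedSample]) -> np.ndarray:
        # Stand-in encoder: one fixed-length feature vector per sample.
        return np.stack([s.data.ravel()[:16] for s in samples])

    def decode(features: np.ndarray) -> dict:
        # Stand-in decoder: reduces features to metadata for the engine.
        return {"mean": float(features.mean()), "peak": float(features.max())}

    def determine_scene(metadata: dict) -> str:
        # Stand-in reasoning step: a single hand-written rule.
        return "active_scene" if metadata["peak"] > 0.5 else "idle_scene"

    def control_command(scene: str) -> str:
        # Map the determined scene to a control command.
        return {"active_scene": "ALERT", "idle_scene": "STANDBY"}[scene]

    # One image frame and one audio clip flow through all stages.
    samples = [UnifiedSample("image", 0.0, np.random.rand(8, 8)),
               UnifiedSample("sound", 0.0, np.random.rand(64))]
    print(control_command(determine_scene(decode(encode(samples)))))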

Inventors

  • J. Francis
  • A. Oltramari
  • C. Shelton
  • S. Munir

Assignees

  • Robert Bosch GmbH

Dates

Publication Date
2026-05-05
Application Date
2022-02-28
Priority Date
2021-02-26

Claims (20)

  1. A system for image processing, comprising: a first sensor configured to capture at least one or more images; a second sensor configured to capture sound information; and a processor in communication with the first sensor and the second sensor, wherein the processor is programmed to: receive the one or more images and the sound information; preprocess the one or more images and the sound information, wherein the preprocessing comprises converting the one or more images and the sound information into a unified structure or class; extract, with an encoder, one or more data features associated with the image and sound information; output metadata to a spatiotemporal reasoning engine via a decoder, wherein the spatiotemporal reasoning engine is configured to capture relationships of multimodal sensors to help determine various scenes, thereby capturing relationships of the first sensor and the second sensor with the metadata, wherein the spatiotemporal reasoning engine is configured to interpret large data sets into meaningful concepts at different levels of abstraction, including abstracting individual points in time into longitudinal time intervals, computing trends and gradients from a series of resulting measurements, and detecting different types of patterns, and wherein the metadata is derived with the decoder and the one or more data features; determine one or more scenes associated with the image and sound information using the spatiotemporal reasoning engine and the metadata; and output a control command in response to the one or more scenes.
  2. The system of claim 1, wherein the spatiotemporal reasoning engine is in communication with a domain ontology database and utilizes the domain ontology database to determine the one or more scenes.
  3. The system of claim 2, wherein the domain ontology database includes information indicating the one or more scenes utilizing the metadata.
  4. The system of claim 2, wherein the domain ontology database is stored at a remote server in communication with the processor.
  5. The system of claim 1, wherein the system includes a third sensor configured to capture temperature information, and the processor is in communication with the third sensor, receives the temperature information, and extracts one or more associated data features from the temperature information.
  6. The system of claim 1, wherein the processor is further programmed to fuse the one or more data features associated with the image and sound information prior to outputting the metadata.
  7. The system of claim 1, wherein the processor is further programmed to separately extract one or more data features associated with the image and sound information and output them to a plurality of decoders.
  8. The system of claim 1, wherein the decoder is associated with a machine learning network.
  9. A system for image processing, comprising: a first sensor configured to capture a first set of information indicative of an environment; a second sensor configured to capture a second set of information indicative of the environment; and a processor in communication with the first sensor and the second sensor, wherein the processor is programmed to: receive the first set of information and the second set of information indicative of the environment; preprocess the first set of information and the second set of information, wherein the preprocessing includes converting the first set of information and the second set of information into a unified structure or class; extract, with an encoder, one or more data features associated with the first set of information and the second set of information; output metadata to a spatiotemporal reasoning engine via a decoder, wherein the spatiotemporal reasoning engine is configured to capture relationships of multimodal sensors to help determine various scenes, thereby capturing relationships of the first sensor and the second sensor with the metadata, wherein the spatiotemporal reasoning engine is configured to interpret large data sets into meaningful concepts at different levels of abstraction, including abstracting individual points in time into longitudinal time intervals, computing trends and gradients from a series of resulting measurements, and detecting different types of patterns, and wherein the metadata is derived with the decoder and the one or more data features; determine one or more scenes associated with the first and second sets of information using the spatiotemporal reasoning engine and the metadata; and output a control command in response to the one or more scenes.
  10. The system of claim 9, wherein the first set of information and the second set of information comprise different types of data.
  11. The system of claim 9, wherein the first sensor comprises a temperature sensor, a pressure sensor, a vibration sensor, a humidity sensor, or a carbon dioxide sensor.
  12. The system of claim 9, wherein the processor is further programmed to preprocess the first and second sets of information indicative of the environment prior to extracting the one or more data features with the encoder.
  13. The system of claim 9, wherein the system comprises a fusion module configured to create a fused data set from the first information set and the second information set.
  14. The system of claim 13, wherein the metadata is extracted from the fused data set.
  15. A system for image processing, comprising: a first sensor configured to capture a first set of information indicative of an environment; a second sensor configured to capture a second set of information indicative of the environment; and a processor in communication with the first sensor and the second sensor, wherein the processor is programmed to: receive the first set of information and the second set of information indicative of the environment; preprocess the first set of information and the second set of information, wherein the preprocessing includes converting the first set of information and the second set of information into a unified structure or class; extract one or more data features associated with the first set of information and the second set of information indicative of the environment; output metadata to a spatiotemporal reasoning engine via a decoder, wherein the spatiotemporal reasoning engine is configured to capture relationships of multimodal sensors to help determine various scenes, thereby capturing relationships between the first sensor and the second sensor with the metadata, wherein the spatiotemporal reasoning engine is configured to interpret large data sets into meaningful concepts at different levels of abstraction, including abstracting individual points in time into longitudinal time intervals, computing trends and gradients from a series of resulting measurements, and detecting different types of patterns, and wherein the metadata is indicative of the one or more data features; determine one or more scenes associated with the first and second information sets using the metadata; and output a control command in response to the one or more scenes.
  16. The system of claim 15, wherein the system comprises a decoder configured to utilize a machine learning network.
  17. The system of claim 15, wherein the first set of information and the second set of information comprise different types of data.
  18. The system of claim 15, wherein the first sensor comprises a temperature sensor, a pressure sensor, a vibration sensor, a humidity sensor, or a carbon dioxide sensor.
  19. The system of claim 15, wherein the system comprises a fusion module configured to create a fused data set from the first information set and the second information set.
  20. The system of claim 19, wherein the fused data set is sent to a machine learning model to output metadata associated with the fused data set.
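
Claims 1, 9, and 15 describe concrete temporal abstractions performed by the reasoning engine (individual time points abstracted into longitudinal intervals; trends and gradients computed over a measurement series), and claims 2 and 3 place the scene mapping in a domain ontology database. The toy sketch below illustrates those behaviors under loose assumptions; the interval rule, the linear-fit trend, and the ontology table are illustrative inventions, not the patented implementation.

    # Toy illustration of interval abstraction, trend/gradient computation,
    # and an ontology lookup on a hypothetical measurement stream.
    import numpy as np

    timestamps = np.arange(0.0, 10.0, 1.0)
    measurements = np.array([0.1, 0.1, 0.2, 0.5, 0.9, 1.4, 1.5, 1.5, 1.4, 1.5])

    def abstract_intervals(ts, values, threshold=1.0):
        """Abstract individual time points into longitudinal intervals in
        which the signal stays on one side of a threshold."""
        intervals, start = [], 0
        for i in range(1, len(values) + 1):
            if i == len(values) or (values[i] > threshold) != (values[start] > threshold):
                intervals.append((ts[start], ts[i - 1], bool(values[start] > threshold)))
                start = i
        return intervals

    def trend_and_gradient(ts, values):
        # Fit a line to the series: the slope is the gradient, its sign the trend.
        slope = np.polyfit(ts, values, 1)[0]
        return ("rising" if slope > 0 else "falling"), slope

    # Toy stand-in for the domain ontology database of claims 2-3: maps
    # abstracted concepts to scene labels.
    ontology = {("rising", True): "activity_starting",
                ("rising", False): "background_noise",
                ("falling", True): "activity_ending",
                ("falling", False): "quiet"}

    trend, _ = trend_and_gradient(timestamps, measurements)
    above = abstract_intervals(timestamps, measurements)[-1][2]
    print(ontology[(trend, above)])   # -> "activity_starting"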

Description

System and method for multimodal neurosymbolic scene understanding

Technical Field

The present disclosure relates to image processing using sensors such as cameras, radars, microphones, and the like.

Background

A system may be capable of performing scene understanding. Scene understanding may refer to the ability of a system to infer objects, and the events in which they participate, based on the semantic relationships of the objects with other objects in the environment and/or the geographic space or temporal structure of the environment itself. The basic goal of a scene understanding task is to produce a statistical model that can predict (e.g., classify) high-level semantic events given some observations of context in the scene. Observation of the scene context may be enabled by sensor devices placed at various locations, which obtain context information from the scene in the form of sensor modalities such as video recordings, acoustic patterns, ambient-temperature time series, and so forth. Given such information from one or more modalities (e.g., sensors), the system can classify events initiated by entities in the scene.

Disclosure of the Invention

According to one embodiment, a system for image processing includes a first sensor configured to capture at least one or more images, a second sensor configured to capture sound information, and a processor in communication with the first sensor and the second sensor. The processor is programmed to receive the one or more images and the sound information, extract, with an encoder, one or more data features associated with the image and sound information, output metadata to a spatiotemporal reasoning engine via a decoder, wherein the metadata is derived with the decoder and the one or more data features, determine one or more scenes with the spatiotemporal reasoning engine and the metadata, and output control commands in response to the one or more scenes.

According to a second embodiment, a system for image processing includes a first sensor configured to capture a first set of information indicative of an environment, a second sensor configured to capture a second set of information indicative of the environment, and a processor in communication with the first sensor and the second sensor. The processor is programmed to receive the first set of information and the second set of information, extract, with an encoder, one or more data features associated with the first set of information and the second set of information, output metadata to the spatiotemporal reasoning engine via a decoder, wherein the metadata is derived with the decoder and the one or more data features, determine one or more scenes with the spatiotemporal reasoning engine and the metadata, and output control commands in response to the one or more scenes.

According to a third embodiment, a system for image processing includes a first sensor configured to capture a first set of information indicative of an environment, a second sensor configured to capture a second set of information indicative of the environment, and a processor in communication with the first sensor and the second sensor. The processor is programmed to receive the first set of information and the second set of information, extract one or more data features associated with the first set of information and the second set of information, output metadata indicative of the one or more data features, determine one or more scenes using the metadata, and output control commands in response to the one or more scenes.
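
FIG. 3B and claims 6, 13, 14, 19, and 20 describe fusing the per-modality features into a single data set before the decoder produces metadata. The snippet below sketches one plausible fusion step, assuming simple normalize-and-concatenate late fusion; the patent does not commit to a particular fusion operator, so this is illustrative only.

    # Hypothetical late-fusion step: normalize each modality's feature
    # vector and concatenate, so a single decoder can consume both.
    import numpy as np

    def fuse(image_features: np.ndarray, sound_features: np.ndarray) -> np.ndarray:
        def unit(v):
            # Scale to unit norm so neither modality dominates the fused vector.
            return v / (np.linalg.norm(v) + 1e-8)
        return np.concatenate([unit(image_features), unit(sound_features)])

    # Hypothetical encoder outputs for one image frame and one audio clip.
    fused = fuse(np.random.rand(128), np.random.rand(32))
    assert fused.shape == (160,)  # the fused data set then feeds a single decoder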
Drawings

FIG. 1 shows a schematic diagram of a monitoring arrangement; FIG. 2 is an overview system diagram of a wireless system according to an embodiment of the present disclosure; FIG. 3A is a first embodiment of a computing pipeline; FIG. 3B is an alternative embodiment of a computing pipeline that utilizes fusion of sensor data; and FIG. 4 is an illustration of an example scene captured from one or more video cameras and sensors.

Detailed Description

Embodiments of the present disclosure are described herein. It is to be understood, however, that the disclosed embodiments are merely examples and that other embodiments may take various and alternative forms. The figures are not necessarily to scale; some features may be exaggerated or minimized to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the embodiments. As one of ordinary skill in the art will appreciate, the various features illustrated and described with reference to any one of the figures may be combined with features illustrated in one or more other figures to produce embodiments that are not explicitly illustrated or described. The combinations of features illustrated provide representative embodiments for typical applications.