JP-2026075217-A - Video analysis system, video analysis program, and learning system
Abstract
[Problem] To provide a technology that facilitates the understanding of spatiotemporal relationships between objects in a video. [Solution] The video analysis system includes a Large Language Model (LLM). A spatiotemporal scene graph is a scene graph that spatially and temporally shows the relationships between objects in a video. Graph features are features of the spatiotemporal scene graph. The input information input to the LLM includes at least the graph features and a prompt. The explanatory information output from the LLM is information that provides an explanation of the video in response to the prompt. The LLM is pre-trained to take the input information as input and output the explanatory information. The video analysis system receives a prompt from the user regarding a target video and obtains explanatory information about the target video by inputting the input information relating to the target video into the LLM. [Selected Drawing] Figure 1
Inventors
- コウ ゼン
- 小堀 訓成
Assignees
- トヨタ自動車株式会社 (Toyota Motor Corporation)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2024-10-22
Claims (12)
- A video analysis system comprising: one or more processors; and one or more storage devices storing a Large Language Model (LLM), wherein a spatiotemporal scene graph is a scene graph that spatially and temporally shows the relationships between objects in a video, graph features are features of the spatiotemporal scene graph, input information input to the LLM includes at least the graph features and a prompt, explanatory information output from the LLM is information that provides an explanation of the video in response to the prompt, and the LLM is pre-trained to take the input information as input and output the explanatory information, the one or more processors being configured to: receive the prompt from a user regarding a target video; acquire the input information relating to the target video; and acquire the explanatory information about the target video by inputting the input information relating to the target video into the LLM.
- The video analysis system according to claim 1, wherein the one or more processors are further configured to present to the user text information or audio information corresponding to the explanatory information relating to the target video.
- The video analysis system according to claim 1 or 2, wherein video features are features of the video, and the input information includes the video features, the graph features, and the prompt.
- The video analysis system according to claim 1 or 2, wherein the one or more storage devices further store a graph structure encoder trained to take the spatiotemporal scene graph as input and output the graph features, and the one or more processors are further configured to: acquire the spatiotemporal scene graph relating to the target video; and acquire the graph features relating to the target video by inputting the spatiotemporal scene graph relating to the target video into the graph structure encoder.
- A video analysis program that uses a Large Language Model (LLM), wherein a spatiotemporal scene graph is a scene graph that spatially and temporally shows the relationships between objects in a video, graph features are features of the spatiotemporal scene graph, input information input to the LLM includes at least the graph features and a prompt, explanatory information output from the LLM is information that provides an explanation of the video in response to the prompt, and the LLM is pre-trained to take the input information as input and output the explanatory information, the video analysis program causing a computer to execute: a process of receiving the prompt from a user regarding a target video; a process of acquiring the input information relating to the target video; and a process of acquiring the explanatory information about the target video by inputting the input information relating to the target video into the LLM.
- The video analysis program according to claim 5, further causing the computer to execute a process of presenting to the user text information or audio information corresponding to the explanatory information relating to the target video.
- The video analysis program according to claim 5 or 6, wherein video features are features of the video, and the input information includes the video features, the graph features, and the prompt.
- A learning system for training a Large Language Model (LLM), comprising: one or more processors; and one or more storage devices storing the LLM, wherein a spatiotemporal scene graph is a scene graph that spatially and temporally shows the relationships between objects in a video, graph features are features of each node of the spatiotemporal scene graph, text features are features of text describing attributes of each node of the spatiotemporal scene graph, consistent graph features are graph features for which the correlation between the graph feature and the text feature of each node is at or above a predetermined level, input information input to the LLM includes at least the consistent graph features and a prompt, and explanatory information output from the LLM is information that provides an explanation of the video in response to the prompt, the one or more processors being configured to: acquire the consistent graph features; and perform, based on the consistent graph features, an LLM learning process that trains the LLM to take the input information as input and output the explanatory information.
- The learning system according to claim 8, wherein video features are features of the video, and the input information includes the video features, the consistent graph features, and the prompt.
- The learning system according to claim 8, wherein the one or more storage devices further store a graph structure encoder trained to take the spatiotemporal scene graph as input and output the graph features, and the one or more processors are further configured to: perform an alignment process that trains the graph structure encoder so that the correlation between the graph features and the text features of each node is at or above the predetermined level; and acquire, as the consistent graph features, the graph features output by the graph structure encoder after the alignment process.
- The learning system according to any one of claims 8 to 10, wherein the LLM learning process includes a first learning process, the first learning process performing instruction tuning on the LLM so that the LLM recognizes the correspondence between the consistent graph features and the text features.
- The learning system according to claim 11, wherein the LLM learning process further includes a second learning process following the first learning process, the input information further includes the text features, and the second learning process performs instruction tuning on the LLM so that the LLM takes the input information as input and outputs the explanatory information.
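
The per-node correlation condition used in the learning-system claims above (graph features count as "consistent" only when each node's graph feature correlates with its text feature at or above a predetermined level) can be sketched as follows. This is a minimal illustration, not the patented method: the random feature values, the dimensions, the cosine-similarity correlation measure, and the least-squares projection standing in for the encoder training are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-node features: graph encoder output and text encoder
# output for the same four nodes (4 nodes, 8-dimensional features).
graph_feats = rng.normal(size=(4, 8))
text_feats = rng.normal(size=(4, 8))

def cosine(a, b):
    # Cosine similarity as a stand-in for the claim's "correlation".
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_consistent(graph_feats, text_feats, W, threshold=0.9):
    # Consistency check: every node's projected graph feature must
    # correlate with its text feature at or above the threshold.
    return all(cosine(g @ W, t) >= threshold
               for g, t in zip(graph_feats, text_feats))

# Toy "alignment process": fit a linear projection W so that the graph
# features reproduce the text features (a stand-in for training the
# graph structure encoder against a frozen text encoder).
W, *_ = np.linalg.lstsq(graph_feats, text_feats, rcond=None)

print(is_consistent(graph_feats, text_feats, W))  # True
```

In a real system the projection would be replaced by training the graph structure encoder itself, typically with a contrastive objective over node/text pairs; the sketch only shows the consistency criterion being met.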
Description
This disclosure relates to a technology for analyzing video and obtaining explanatory information about the video.
Patent Document 1 discloses an object detection system. The object detection system includes an object detection model generated in advance by machine learning to detect objects from an image, and detects objects from an image by utilizing this object detection model. Non-Patent Document 1 discloses a graph structure encoder that encodes a spatiotemporal scene graph to obtain graph tokens, which are feature quantities of the graph.
- Patent Document 1: Japanese Patent Publication No. 2024-76159
- Non-Patent Document 1: Seongjun Yun et al., "Graph Transformer Networks," 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), 2019.
Brief description of the drawings:
- Fig. 1: A conceptual diagram illustrating the outline of the video analysis system according to the embodiment of this disclosure.
- Fig. 2: A block diagram illustrating the processes for acquiring various feature quantities (tokens) according to the embodiment.
- Fig. 3: A conceptual diagram illustrating the alignment process in the learning phase according to the embodiment.
- Fig. 4: A conceptual diagram illustrating the overview of the LLM learning process in the learning phase according to the embodiment.
- Fig. 5: A conceptual diagram illustrating the first learning process in the LLM learning process according to the embodiment.
- Fig. 6: A conceptual diagram illustrating the second learning process in the LLM learning process according to the embodiment.
- Fig. 7: A block diagram illustrating the inference phase according to the embodiment.
- Fig. 8: A block diagram showing an example of the hardware configuration of the video analysis system according to the embodiment.
Embodiments of this disclosure will be described with reference to the attached drawings.
1. Overview
Figure 1 is a conceptual diagram illustrating the overview of the video analysis system 1 according to this embodiment.
The video analysis system 1 acquires a video VID captured by a camera or the like and analyzes the video VID. More specifically, the video analysis system 1 includes a Large Language Model (LLM) 500. The user inputs a prompt PPT into the video analysis system 1, instructing it to perform a task related to the video VID. The video analysis system 1 receives the prompt PPT input by the user and inputs it into the LLM 500. The LLM 500 outputs a response to the prompt PPT. This response is explanatory information STR, which provides an explanation of the video VID. The video analysis system 1 then presents the user with text information or audio information corresponding to the explanatory information STR output from the LLM 500. That is, in response to the prompt PPT input by the user, the video analysis system 1 presents the explanatory information STR regarding the video VID to the user in text or audio form.
Here, consider understanding the spatiotemporal relationships between objects (instances) within a video VID. While technologies such as multi-modal LLMs and Video Transformers are known, they have been insufficient for understanding the spatiotemporal relationships between objects within a video VID. This embodiment therefore proposes a technology that facilitates the understanding of such relationships. According to this embodiment, a "spatiotemporal scene graph (ST-SG) 220" is used to facilitate understanding of the spatiotemporal relationships between objects within the video VID. The spatiotemporal scene graph 220 is a scene graph that spatially and temporally represents the scenes shown in the video VID, and is generated from the video VID. More specifically, the video VID contains a series of temporally consecutive frames (images).
The scene graph representing the scene shown in each frame indicates the objects within that frame and the relationships between the objects (e.g., positional relationships and action relationships). Nodes in the scene graph correspond to objects within the frame, and edges in the scene graph indicate the relationships between nodes (i.e., between objects). The spatiotemporal scene graph 220 is formed by linking the scene graphs obtained for the individual frames along the time axis. The spatiotemporal scene graph 220 can thus also be described as a scene graph that spatially and temporally shows the objects within the video VID and the relationships between them; generated from the video VID, it shows the spatiotemporal relationships between objects within the video VID. The LLM 500 is configured to output the explanatory information STR, which describes the spatiotemporal relationships between objects within the video VID, by also referring to the features of the spatiotemporal scene graph 220. In other words, the video analysis system 1 is configured to obtain explanator
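
The structure described above, per-frame scene graphs whose nodes are objects and whose edges are relationships, with the per-frame graphs linked along the time axis, can be sketched as a simple data structure. The class, object, and relation names below are illustrative assumptions for the sketch, not identifiers from the patent.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    frame: int      # time index of the frame the object appears in
    obj_id: str     # identity of the object ("person", "cup", ...)

@dataclass
class STSceneGraph:
    nodes: set = field(default_factory=set)
    spatial_edges: list = field(default_factory=list)   # relations within a frame
    temporal_edges: list = field(default_factory=list)  # same object across frames

    def add_relation(self, frame, subj, pred, obj):
        # Spatial edge, e.g. ("person", "holds", "cup") in frame t.
        a, b = Node(frame, subj), Node(frame, obj)
        self.nodes |= {a, b}
        self.spatial_edges.append((a, pred, b))

    def link_frames(self):
        # Temporal edges: connect each object's nodes along the time axis.
        latest = {}
        for n in sorted(self.nodes, key=lambda n: n.frame):
            if n.obj_id in latest:
                self.temporal_edges.append((latest[n.obj_id], "next", n))
            latest[n.obj_id] = n

g = STSceneGraph()
g.add_relation(0, "person", "next_to", "table")
g.add_relation(1, "person", "holds", "cup")
g.link_frames()
print(len(g.temporal_edges))  # 1: "person" in frame 0 -> "person" in frame 1
```

Only "person" appears in both frames, so linking along the time axis yields a single temporal edge; "table" and "cup" each appear in one frame only. In the embodiment, a graph like this would then be encoded into graph features before being fed to the LLM.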