CN-120689804-B - VR scene intelligent recognition method based on AI

CN 120689804 B

Abstract

The invention belongs to the field of data processing and discloses an AI-based intelligent VR scene recognition method comprising: step 1, while a user wears a VR headset, obtaining the image frames I displayed by the headset, the head pose H and the spatial audio A, and processing I, H and A into a time-aligned triplet sequence, wherein each time-aligned triplet comprises an image frame, a head pose and an audio direction; step 2, extracting image features from each image frame and obtaining fused semantic features for each time point based on the image features; step 3, constructing a spatial semantic graph based on the fused semantic features and the audio directions; step 4, performing graph propagation on the nodes of the spatial semantic graph to obtain an embedded representation of each node; step 5, obtaining a response node set based on the embedded representations; and step 6, updating the embedded representations of the nodes based on the response node set. The invention improves the system's ability to understand a scene as a whole.

Inventors

  • LIANG YINGTAO
  • LIANG YINGHONG

Assignees

  • 广州玖的文化科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2025-07-29

Claims (7)

  1. An AI-based intelligent VR scene recognition method, characterized by comprising the following steps: Step 1, while a user wears a VR headset, obtaining the image frames I displayed by the headset, the head pose H and the spatial audio A, and processing I, H and A into a time-aligned triplet sequence, wherein each time-aligned triplet comprises an image frame, a head pose and an audio direction; Step 2, extracting image features from each image frame and obtaining fused semantic features for each time point based on the image features; Step 3, constructing a spatial semantic graph based on the fused semantic features and the audio directions; Step 4, performing graph propagation on the nodes of the spatial semantic graph to obtain an embedded representation of each node; Step 5, obtaining a response node set based on the embedded representations; Step 6, updating the embedded representations of the nodes based on the response node set. Step 2 comprises: extracting a visual feature vector v_t from the image frame I_t; converting the head pose H_t into a three-dimensional gaze-direction unit vector g_t; and representing the fused semantic feature at time point t as f_t. Step 3 comprises: using the depth information carried by I_t, back-projecting the gaze target in the image frame into three-dimensional space to obtain its spatial position p_t; taking the fused semantic feature f_t and the position p_t as node construction elements, defining the initial node set V_0 of the spatial semantic graph; obtaining potential event nodes V_a according to the audio direction a_t; and constructing the spatial semantic graph G = (V, E) based on V_0 and V_a. The weight of each edge in the edge set E is calculated from the spatial positions, semantic features and gaze weights of the edge's two endpoint nodes: the weight w_ij of the edge e_ij between nodes i and j is defined as w_ij = (α·exp(−‖p_i − p_j‖) + β·cos(f_i, f_j))·(1 + γ·g_i·g_j), wherein p_i and p_j are the spatial positions of nodes i and j, respectively; f_i and f_j are their fused semantic features; g_i and g_j are their gaze weights; α and β are the spatial and semantic weight parameters, respectively; and γ is the gaze co-amplification factor.
  2. The AI-based intelligent VR scene recognition method of claim 1, wherein the head pose H is measured by the IMU of the VR headset, and the image frames I are captured from the video channel of the VR headset.
  3. The AI-based intelligent VR scene recognition method of claim 1, wherein processing I, H and A to obtain the time-aligned triplet sequence comprises: representing the image frame obtained at time point t as I_t; performing linear interpolation resampling on the head pose H_t′ obtained at time point t′ to calculate the head pose H_t at time point t; carrying out a direction-vector weighted average over all spatial audio within a time window of half-width Δ centered on time point t to obtain the global audio direction ā_t; and projecting ā_t into the local coordinate system to obtain the final normalized audio direction a_t. The time-aligned triplet at time point t is (I_t, H_t, a_t), and the triplet sequence is expressed as {(I_t, H_t, a_t) | t = t_0, …, t_0 + T − 1}, wherein T denotes the length of the time series of the continuous process and t_0 denotes its starting time point.
  4. The AI-based intelligent VR scene recognition method of claim 1, wherein obtaining potential event nodes according to the audio direction comprises: taking the user's current spatial position u_t as the starting point, constructing a hypothetical sound source position s_t = u_t + d·a_t along the audio direction a_t, wherein d is a preset distance constant, and generating a potential event node at s_t.
  5. The AI-based intelligent VR scene recognition method of claim 1, wherein step 4 comprises: during graph propagation, introducing a type-weight correction term ρ_ij for the information propagated along each edge e_ij, and updating the nodes based on the edge weight w_ij and ρ_ij.
  6. The AI-based intelligent VR scene recognition method of claim 5, wherein step 5 comprises: constructing a candidate region C using the activation degree s_i of each node's embedding as the index; including in C every node whose s_i exceeds a set threshold θ; for each node i in the candidate region, further introducing an audio-direction-driven priority weighting factor η_i; sorting the nodes by s_i·η_i in descending order and taking the top K nodes as the response node set R; and assigning each node in R a response grade r_i, which is used to decide what response actions the system takes.
  7. The AI-based intelligent VR scene recognition method of claim 6, wherein step 6 comprises: calculating a consistency discrimination value for each node in the response node set, and updating the embedded representation of each node based on its consistency discrimination value.
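
The edge-weight formula in claim 1 combines spatial proximity, semantic similarity and gaze co-amplification. Below is a minimal Python sketch of that computation, assuming the illustrative functional form reconstructed above (exponential spatial decay plus cosine semantic similarity, multiplied by a gaze term); the exact formula in the patent's figures is not recoverable from the text, and the function name edge_weight and all default parameter values are hypothetical.

    import numpy as np

    def edge_weight(p_i, p_j, f_i, f_j, g_i, g_j, alpha=0.5, beta=0.5, gamma=1.0):
        # p_i, p_j: 3D spatial positions of nodes i and j
        # f_i, f_j: fused semantic feature vectors
        # g_i, g_j: scalar gaze weights
        # alpha, beta: spatial and semantic weight parameters
        # gamma: gaze co-amplification factor
        spatial = np.exp(-np.linalg.norm(p_i - p_j))  # decays with distance
        semantic = float(np.dot(f_i, f_j) /
                         (np.linalg.norm(f_i) * np.linalg.norm(f_j) + 1e-8))
        return (alpha * spatial + beta * semantic) * (1.0 + gamma * g_i * g_j)

With alpha = beta and gamma = 0, the weight reduces to a plain average of spatial and semantic affinity, which makes the gaze term's contribution easy to isolate when tuning.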
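Claim 4 places a hypothetical sound source a fixed distance from the user along the perceived audio direction. A minimal sketch, assuming d is a scalar preset distance constant and that a_t may arrive unnormalized:

    import numpy as np

    def potential_event_node(u_t, a_t, d=2.0):
        # u_t: user's current 3D position; a_t: audio direction vector
        a_t = a_t / (np.linalg.norm(a_t) + 1e-8)  # enforce unit direction
        return u_t + d * a_t                      # hypothetical source position s_t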
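Claims 5 and 6 threshold the node embeddings into a candidate region, re-weight by an audio-driven priority factor, and keep the top K nodes. The sketch below assumes the activation degree s_i is the L2 norm of the embedding and that the response grade is a simple normalized score; both choices are illustrative, as the patent text does not fix them.

    import numpy as np

    def select_response_nodes(embeddings, eta, theta=0.5, k=3):
        # embeddings: (N, D) node embeddings; eta: (N,) audio priority factors
        s = np.linalg.norm(embeddings, axis=1)    # activation degree s_i
        candidates = np.where(s > theta)[0]       # candidate region C
        if candidates.size == 0:
            return []
        scores = s[candidates] * eta[candidates]  # audio-weighted priority
        order = np.argsort(scores)[::-1][:k]      # descending sort, top-K cut
        top = candidates[order]
        grades = scores[order] / (scores[order][0] + 1e-8)  # response grades r_i
        return list(zip(top.tolist(), grades.tolist()))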

Description

AI-Based Intelligent VR Scene Recognition Method

Technical Field

The invention relates to the field of data processing, and in particular to an AI-based intelligent VR scene recognition method.

Background

With the widespread use of virtual reality (Virtual Reality, VR) technology in fields such as education, medicine, games and industrial simulation, there is an increasing need for users to interact with virtual content in real time in immersive environments. To raise the intelligence level of VR systems, researchers have tried to introduce artificial intelligence techniques to achieve intelligent understanding and automatic recognition of VR scenes. The core goal of this direction is to give the system the ability to recognize the virtual scene the user is in, including content such as environment type, semantic objects, spatial structure and interaction events, so as to support higher-order functions such as intelligent prompting, automatic navigation and behavior prediction.

The prior art mainly focuses on two approaches. The first is image-based object detection and classification, in which objects or environment labels in the scene are judged by extracting and recognizing features of the image frames in the user's current field of view. The second is behavior reasoning based on voice or interaction logs, in which simple semantic scene induction is performed by analyzing the user's voice input, action sequences and the like. Although these methods achieve a certain effect in static or low-complexity scenes, they still have many limitations in VR environments that are complex, highly dynamic and rich in spatial structure.

First, most traditional image recognition models take single-frame images as input and lack the capability to model the user's continuous viewpoint changes and gaze trajectory, making temporally consistent semantic understanding difficult; when the user turns, moves or traverses multiple areas rapidly, the system produces semantic jumps and exhibits poor stability. Second, existing methods often ignore, or fail to fully exploit, the highly structured three-dimensional spatial information in VR, such as scene boundaries and spatial relationships between objects, which makes it difficult for the system to recognize spatial semantic structures (e.g., "an exit behind the corner") or the paths along which dynamic events occur. Third, although some studies introduce voice or audio data as an aid, they are limited to content-level semantic recognition (e.g., speech-to-text) and do not utilize VR-specific spatial audio characteristics for direction sensing and event localization, so key information such as where a sound comes from and what scene change the sound corresponds to cannot be identified. Overall, the current technology lacks a unified recognition system that integrates temporal, spatial and semantic features, and it is difficult to meet the requirements for high-level intelligent recognition in immersive, multimodal and dynamic VR scenes.

Disclosure of the Invention

The invention aims to disclose an AI-based intelligent VR scene recognition method that solves the technical problems pointed out in the Background section.
To achieve the above purpose, the invention adopts the following technical scheme. The invention provides an AI-based intelligent VR scene recognition method comprising the following steps: Step 1, while a user wears a VR headset, obtaining the image frames I displayed by the headset, the head pose H and the spatial audio A, and processing I, H and A into a time-aligned triplet sequence, wherein each time-aligned triplet comprises an image frame, a head pose and an audio direction; Step 2, extracting image features from each image frame and obtaining fused semantic features for each time point based on the image features; Step 3, constructing a spatial semantic graph based on the fused semantic features and the audio directions; Step 4, performing graph propagation on the nodes of the spatial semantic graph to obtain an embedded representation of each node; Step 5, obtaining a response node set based on the embedded representations; and Step 6, updating the embedded representations of the nodes based on the response node set.

Furthermore, the head pose H is measured by the IMU of the VR headset, and the image frames I are captured from the video channel of the VR headset.

Further, processing I, H and A to obtain the time-aligned triplet sequence comprises: representing the image frame obtained at time point t as I_t; performing linear interpolation resampling on the head pose H_t′ obtained at time point t′ to calculate the head pose H_t at time point t; and carrying out a direction-vector weighted average over all spatial audio within a time window of half-width Δ centered on time point t to obtain the global audio direction ā_t, which is then projected into the local coordinate system to obtain the final normalized audio direction a_t.
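
As a concrete illustration of the resampling just described, here is a minimal sketch that aligns the pose and audio streams onto the frame timestamps. It interpolates each pose component independently (a simplification that ignores proper quaternion interpolation) and uses a uniform rather than weighted average inside the audio window; all names and the default window half-width delta are illustrative, not from the patent.

    import numpy as np

    def align_triplets(frame_times, pose_times, poses, audio_times, audio_dirs, delta=0.1):
        # frame_times: (T,) frame timestamps t
        # poses: (M, P) head poses sampled at pose_times (increasing)
        # audio_dirs: (K, 3) audio direction vectors sampled at audio_times
        triplets = []
        for t in frame_times:
            # linear interpolation resampling of the head pose onto time t
            h_t = np.array([np.interp(t, pose_times, poses[:, j])
                            for j in range(poses.shape[1])])
            # average audio directions inside the window [t - delta, t + delta]
            mask = np.abs(audio_times - t) <= delta
            a_bar = audio_dirs[mask].mean(axis=0) if mask.any() else np.zeros(3)
            a_t = a_bar / (np.linalg.norm(a_bar) + 1e-8)  # normalized direction a_t
            triplets.append((t, h_t, a_t))
        return triplets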