CN-122020408-A - Virtual event detection method based on multi-mode data, computer equipment and medium
Abstract
The application relates to the technical field of computers and provides a virtual event detection method based on multimodal data, a computer device, and a medium. The method comprises: extracting first modal features from a first modal data stream to obtain a first modal input; extracting second modal features from a second modal data stream to obtain a second modal input; performing feature fusion to obtain a multimodal input; performing time sequence modeling on the multimodal input using an LSTM branch to obtain an LSTM modeling feature, and using a Transformer branch to obtain a Transformer modeling feature; performing time sequence feature fusion to obtain a time sequence feature; performing feature mapping on the time sequence feature using a plurality of classification heads to obtain a plurality of prediction results; calculating respective classification probabilities of a plurality of categories based on the plurality of prediction results; and outputting the classification probability of the virtual event category.
Inventors
- Request for anonymity
- Request for anonymity
Assignees
- 深圳市固胜智能科技有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-06
Claims (20)
- 1. A method for detecting a virtual event based on multimodal data, the method comprising: acquiring at least two kinds of modal data for representing virtual content, wherein the modal data comprise at least one of video data and audio data; performing first modal feature extraction on a first modal data stream through a first modal feature extraction network to obtain a first modal input, and performing second modal feature extraction on a second modal data stream through a second modal feature extraction network to obtain a second modal input, wherein the first modal data stream and the second modal data stream are synchronized in time sequence; performing feature fusion on the first modal input and the second modal input to obtain a multimodal input, then performing time sequence modeling on the multimodal input by using a first time sequence modeling branch to obtain a first time sequence modeling feature, and performing time sequence modeling on the multimodal input by using a second time sequence modeling branch to obtain a second time sequence modeling feature; performing time sequence feature fusion on the first time sequence modeling feature and the second time sequence modeling feature to obtain a time sequence feature corresponding to the multimodal input; performing feature mapping on the time sequence feature by using one or more classification heads to obtain one or more prediction results in one-to-one correspondence with the one or more classification heads; and, based on the one or more prediction results, calculating respective classification probabilities of one or more categories in one-to-one correspondence with the one or more classification heads, and outputting category information and/or a classification probability of a virtual event category.
- 2. The method of claim 1, wherein the plurality of classification heads satisfy a constraint defining the sum of the classification probabilities of the plurality of categories to be 1.
- 3. The method of claim 1, wherein the plurality of categories include the virtual event category, a scene category, and a behavior category.
- 4. The method according to claim 1, wherein the method further comprises: when the classification probability of the virtual event category is greater than a preset threshold, selecting the prediction result with the highest probability from the prediction results corresponding to the virtual event category as an event result; and when the respective classification probabilities of the plurality of categories are all smaller than the preset threshold, determining that there is no result.
- 5. The method of claim 1, wherein each of the plurality of classification heads consists of a fully connected layer, and wherein the plurality of classification heads are used to perform feature mapping into a class space according to task needs.
- 6. The method of claim 1, wherein the first timing modeling feature is an LSTM modeling feature and the second timing modeling feature is a Transformer modeling feature, and wherein timing feature fusion is performed on the LSTM modeling feature and the Transformer modeling feature using a direct stitching (concatenation) algorithm, a weighted fusion algorithm, or an attention mechanism fusion algorithm to obtain the timing feature.
- 7. The method of claim 1, wherein the first timing modeling feature is an LSTM modeling feature and the second timing modeling feature is a Transformer modeling feature, and wherein the LSTM branch is configured to concurrently perform forward (look-ahead) and backward (look-back) sequential processing on the multimodal input, thereby utilizing both past and future timing information included in the multimodal input.
- 8. The method of claim 1, wherein the first timing modeling feature is an LSTM modeling feature and the second timing modeling feature is a Transformer modeling feature, and wherein the Transformer branch is configured to encode the multimodal input into a timing feature representation including global information, so as to obtain the timing patterns and dependency relationships included in the multimodal input.
- 9. The method of claim 1, wherein feature fusing the first modality input and the second modality input to obtain the multi-modality input comprises feature fusing the first modality input and the second modality input to obtain the multi-modality input using a direct stitching algorithm, a weighted fusion algorithm, or an attention mechanism fusion algorithm.
- 10. The method of claim 1, wherein the first modality data stream is a video data stream continuously acquired in real time by a video sensor, the first modality feature extraction network is a video feature extraction network, the video data stream is composed of game picture frame images including virtual characters, virtual articles, virtual vehicles, and virtual scenes, and the video feature extraction network includes a spatial feature extraction branch for single frame feature extraction and a temporal feature extraction branch for motion change feature extraction between different frames.
- 11. The method of claim 1, wherein the second modality data stream is an audio data stream acquired by an audio sensor, the second modality feature extraction network is an audio feature extraction network, the audio data stream is composed of sounds of triggering events including weapon sounds, animal sounds, character voices, and virtual scene sounds, and the audio feature extraction network is used for raw audio waveform feature extraction and audio data to spectrogram conversion.
- 12. The method of claim 1, wherein the second modality data stream is a telemetry data stream acquired in real time through a built-in interface, the second modality feature extraction network is a telemetry feature extraction network, the telemetry data stream includes motion data including pitch angle, roll angle, yaw angle, and acceleration, and the telemetry feature extraction network includes a multi-layer perceptron and a convolutional neural network having a one-dimensional convolution kernel.
- 13. The method of claim 1, wherein the combination of the first modality data stream and the second modality data stream is a combination of a video data stream and an audio data stream, or a combination of a video data stream and a telemetry data stream.
- 14. The method according to claim 1, wherein the method further comprises: performing third modal feature extraction on a third modal data stream through a third modal feature extraction network to obtain a third modal input; and performing feature fusion on the first modal input, the second modal input, and the third modal input to obtain the multimodal input, wherein the combination of the first modality data stream, the second modality data stream, and the third modality data stream is a combination of a video data stream, an audio data stream, and a telemetry data stream.
- 15. The method according to claim 1, wherein the first modality feature extraction network and/or the second modality feature extraction network are trained by means of machine learning.
- 16. The method of claim 1, wherein the first modality data stream is time aligned with the second modality data stream either before feature extraction or after feature extraction.
- 17. The method of claim 1, wherein at least one of the first timing modeling branch and the second timing modeling branch is implemented using a recurrent structure, an attention structure, a convolutional structure, or a combination thereof.
- 18. The method of claim 1, wherein a time receptive field corresponding to the first timing modeling branch is different from a time receptive field corresponding to the second timing modeling branch.
- 19. The method of claim 1, wherein the one or more classification heads are configured to output prediction results of mutually exclusive or non-mutually exclusive categories, respectively.
- 20. The method of claim 1, wherein the classification probability of the virtual event category is output after smoothing or thresholding.
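The following is a minimal PyTorch sketch of the pipeline recited in claim 1: per-modality feature projection, feature fusion by direct stitching, a bidirectional LSTM branch (claim 7), a Transformer branch (claim 8), timing feature fusion, and one fully connected classification head per category (claims 2, 3, and 5). All names, dimensions, and the choices of concatenation and temporal mean pooling are illustrative assumptions, not the patented implementation.

```python
# Hypothetical sketch of the dual-branch detector; all sizes are arbitrary.
import torch
import torch.nn as nn

class DualBranchDetector(nn.Module):
    def __init__(self, d_video=128, d_audio=64, d_model=192,
                 n_events=8, n_scenes=4):
        super().__init__()
        # Stand-ins for the first/second modal feature extraction networks.
        self.video_proj = nn.Linear(d_video, d_model // 2)
        self.audio_proj = nn.Linear(d_audio, d_model // 2)
        # First timing branch: bidirectional LSTM, so both past and future
        # timing information is used (claim 7).
        self.lstm = nn.LSTM(d_model, d_model // 2, batch_first=True,
                            bidirectional=True)
        # Second timing branch: Transformer encoder capturing global
        # timing patterns and dependencies (claim 8).
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        # One fully connected head per category (claims 3 and 5).
        self.event_head = nn.Linear(2 * d_model, n_events)
        self.scene_head = nn.Linear(2 * d_model, n_scenes)

    def forward(self, video_feats, audio_feats):
        # Inputs: (batch, time, dim) streams, already time-aligned.
        fused = torch.cat([self.video_proj(video_feats),
                           self.audio_proj(audio_feats)], dim=-1)
        lstm_out, _ = self.lstm(fused)       # first timing modeling feature
        trans_out = self.transformer(fused)  # second timing modeling feature
        # Timing feature fusion by direct stitching, then mean pooling.
        timing = torch.cat([lstm_out, trans_out], dim=-1).mean(dim=1)
        # Softmax per head: each head's probabilities sum to 1 (claim 2).
        return (self.event_head(timing).softmax(-1),
                self.scene_head(timing).softmax(-1))

model = DualBranchDetector()
video = torch.randn(2, 50, 128)  # dummy per-frame video features
audio = torch.randn(2, 50, 64)   # dummy per-frame audio features
event_probs, scene_probs = model(video, audio)
```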
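Claims 6 and 9 name three interchangeable fusion options: direct stitching, weighted fusion, and attention mechanism fusion. The sketches below are one plausible reading of each; the learnable scalar weight and the cross-attention arrangement are assumptions, not details from the patent.

```python
import torch
import torch.nn as nn

def direct_stitching(a, b):
    # Concatenate two (batch, time, dim) feature streams along features.
    return torch.cat([a, b], dim=-1)

class WeightedFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.tensor(0.5))  # learnable mixing weight

    def forward(self, a, b):
        # Convex combination; assumes matching feature dimensions.
        return self.w * a + (1 - self.w) * b

class AttentionFusion(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a, b):
        # One stream attends to the other, with a residual connection.
        out, _ = self.attn(query=a, key=b, value=b)
        return a + out
```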
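Claim 11 mentions both raw audio waveform feature extraction and conversion of audio data to a spectrogram. One common realization of the conversion step (an assumption here, not confirmed by the patent) uses torchaudio's mel spectrogram:

```python
import torch
import torchaudio

# Illustrative parameters; the patent does not specify them.
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)
waveform = torch.randn(1, 16000)  # one second of dummy mono audio
spec = to_mel(waveform)           # (1, 64, frames) mel spectrogram
```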
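Claim 12 builds the telemetry feature extraction network from a multi-layer perceptron and a convolutional neural network with a one-dimensional convolution kernel. A minimal sketch, assuming a per-step MLP over the four motion channels followed by a temporal 1-D convolution; all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class TelemetryExtractor(nn.Module):
    def __init__(self, n_channels=4, d_out=64):  # pitch, roll, yaw, accel
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_channels, d_out), nn.ReLU(),
                                 nn.Linear(d_out, d_out))
        # Conv1d expects (batch, channels, time).
        self.conv = nn.Conv1d(d_out, d_out, kernel_size=3, padding=1)

    def forward(self, x):                  # x: (batch, time, 4)
        h = self.mlp(x)                    # per-timestep features
        h = self.conv(h.transpose(1, 2))   # 1-D convolution over time
        return h.transpose(1, 2)           # back to (batch, time, d_out)
```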
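Claim 1 requires the two modal data streams to be synchronized in time sequence, and claim 16 allows alignment before or after feature extraction. Below is a sketch of alignment after feature extraction, assuming linear interpolation onto a common timestep grid; the helper `align` and all shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def align(feats, n_steps):
    # feats: (batch, time, dim) -> (batch, n_steps, dim)
    return F.interpolate(feats.transpose(1, 2), size=n_steps,
                         mode="linear", align_corners=False).transpose(1, 2)

video = torch.randn(2, 30, 128)   # 30 video-frame features
audio = torch.randn(2, 100, 64)   # 100 audio-frame features
video_a, audio_a = align(video, 50), align(audio, 50)  # shared grid
```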
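Claims 4 and 20 describe thresholding and optional smoothing of the output probabilities. A possible post-processing step, with an illustrative exponential smoothing factor and threshold (neither value comes from the patent):

```python
import torch

def decide_event(probs, prev_smoothed=None, alpha=0.7, threshold=0.5):
    # probs: (n_events,) classification probabilities from the event head.
    smoothed = probs if prev_smoothed is None else \
        alpha * probs + (1 - alpha) * prev_smoothed
    top_p, top_i = smoothed.max(dim=0)
    if top_p.item() > threshold:
        return int(top_i), smoothed  # highest-probability result (claim 4)
    return None, smoothed            # all below threshold: no result
```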
Description
Virtual event detection method based on multimodal data, computer equipment and medium

Technical Field

The present application relates to the field of computer technologies, and in particular to a method for detecting a virtual event based on multimodal data, a computer device, and a medium.

Background

In the technical fields of virtual reality, augmented reality, game interaction, simulated driving, and the like, it is often desirable to accurately and rapidly identify a virtual event that is occurring and then provide the user with a corresponding interactive experience, for example through haptic feedback such as the vibration function of a seat or the force feedback of a steering wheel or pedal. Identification, however, is subject to various interference factors that may prevent virtual events from being recognized correctly and in time. For example, when the hoofbeats of a ridden horse are recognized from sound alone, the footsteps of a walking person may be confused with hoofbeats when the horse advances slowly, since both can behave similarly in audio analysis, leading to erroneous judgments by the system. Likewise, when the identity or motion of a virtual object is identified from image information alone, ambient light or occlusion may cause erroneous judgments. While some software provides a built-in interface that outputs telemetry, this information source depends on the built-in interface and is limited by external hardware interface standards, so telemetry alone may not provide enough associated information to accurately and quickly identify an occurring virtual event. For example, telemetry data may describe the speed and direction of movement of the player's body, but from this information alone it is difficult to distinguish whether the player is walking on foot or slowly advancing on horseback. The present application therefore provides a method, computer device, and medium for detecting virtual events based on multimodal data, which overcomes the defects of single-modal data, facilitates accurate and rapid judgment of virtual events, improves the accuracy of multimodal signal recognition through optimization of the algorithm and model, and helps improve the user's interactive experience.

Disclosure of Invention

In a first aspect, the present application provides a method for detecting a virtual event based on multimodal data.
The method comprises the steps of: obtaining at least two kinds of modal data used for representing virtual content, wherein the modal data comprise at least one of video data and audio data; performing first modal feature extraction on a first modal data stream through a first modal feature extraction network to obtain a first modal input, and performing second modal feature extraction on a second modal data stream through a second modal feature extraction network to obtain a second modal input; performing feature fusion on the first modal input and the second modal input to obtain a multimodal input; performing time sequence modeling on the multimodal input through a first time sequence modeling branch to obtain a first time sequence modeling feature, and performing time sequence modeling on the multimodal input through a second time sequence modeling branch to obtain a second time sequence modeling feature; performing time sequence feature fusion on the first time sequence modeling feature and the second time sequence modeling feature to obtain a time sequence feature corresponding to the multimodal input; performing feature mapping on the time sequence feature through one or more classification heads to obtain one or more prediction results in one-to-one correspondence with the one or more classification heads; and calculating, based on the one or more prediction results, respective classification probabilities of one or more categories, and outputting category information and/or the classification probability of the virtual event category. According to the method and device of the present application, the defect of single-modal data is overcome, the virtual event can be judged accurately and rapidly, multimodal time sequence feature recognition based on deep learning is realized through optimization of the algorithm and the model, both the local time sequence pattern and the global time sequence pattern are taken into account, the accuracy of multimodal signal recognition is improved, and the user's interactive experience can be improved. In a possible implementation manner of the first aspect of the present application, the plurality of classification heads satisfy a constraint condition, where the constraint condition defines the sum of the classification probabilities of the plurality of classes to be 1. In a possible implementation manner of the first aspect of