US-20260129390-A1 - SPATIAL AUDIO RECOVERY APPARATUS, SPATIAL AUDIO RECOVERING METHOD, AND PROGRAM


Abstract

A spatial audio restoration device of an embodiment includes a video feature amount calculation unit that calculates a video feature amount on the basis of video information, an audio feature amount calculation unit that calculates an audio feature amount on the basis of audio information that is a monaural sound corresponding to the video information, and a coefficient calculation unit that calculates a high-order Ambisonics coefficient on the basis of the video feature amount and the audio feature amount.

Inventors

Kimitaka Tsutsumi
Kenta Imaizumi

Assignees

NTT, INC.

Dates

Publication Date: 2026-05-07
Application Date: 2022-10-03

Claims (8)

1. A spatial audio restoration device comprising: one or more processors configured to: calculate a video feature amount on a basis of video information; calculate an audio feature amount on a basis of audio information that is a monaural sound corresponding to the video information; and calculate a high-order Ambisonics coefficient on a basis of the video feature amount and the audio feature amount.
2. The spatial audio restoration device according to claim 1, wherein the high-order Ambisonics coefficient corresponds to virtual sound source information that is a sound field formed by a virtual sound source independent of the video information.
3. The spatial audio restoration device according to claim 2, wherein the spatial audio restoration device comprises one or more processors configured to: generate the virtual sound source information on a basis of the video feature amount and the audio feature amount, and encode the virtual sound source information into the high-order Ambisonics coefficient.
4. The spatial audio restoration device according to claim 2, wherein the spatial audio restoration device comprises one or more processors configured to: update the high-order Ambisonics coefficient on a basis of the video feature amount, the audio feature amount, the virtual sound source information, and an auxiliary variable, update the auxiliary variable on a basis of the updated high-order Ambisonics coefficient and the virtual sound source information, and update the virtual sound source information on a basis of the updated auxiliary variable.
5. The spatial audio restoration device according to claim 4, wherein when a number of update operations by the one or more processors of the spatial audio restoration device is equal to or more than a predetermined threshold value, the one or more processors of the spatial audio restoration device are configured to end the update operation.
6. The spatial audio restoration device according to claim 1, wherein the spatial audio restoration device comprises one or more processors configured to: decode output audio information from the high-order Ambisonics coefficient; and update a parameter of a neural network of the spatial audio restoration device on a basis of a comparison result between the output audio information or the high-order Ambisonics coefficient and teacher audio information.
7. A spatial audio restoration method comprising: calculating a video feature amount on a basis of video information; calculating an audio feature amount on a basis of audio information that is a monaural sound corresponding to the video information; and calculating a high-order Ambisonics coefficient on a basis of the video feature amount and the audio feature amount.
8. A non-transitory computer readable medium storing one or more programs, that upon execution by a computer, cause the computer to function as a spatial audio restoration device that performs operations comprising: calculating a video feature amount on a basis of video information; calculating an audio feature amount on a basis of audio information that is a monaural sound corresponding to the video information; and calculating a high-order Ambisonics coefficient on a basis of the video feature amount and the audio feature amount.
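
The alternating updates recited in claims 4 and 5 can be read as a fixed-count loop over three update steps. The following is a minimal sketch of that control flow only, not the applicant's implementation. The claims do not state the update formulas, so the three update rules are passed in as callables, and the `toy_*` functions, the function and parameter names, the feature length of 8, and the 16-channel size are all hypothetical placeholders chosen so the loop runs. Passing each variable's previous value into its own update is likewise an assumption.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def run_alternating_updates(
    video_feat: Vector,
    audio_feat: Vector,
    coeff: Vector,
    virtual_src: Vector,
    aux: Vector,
    update_coeff: Callable[[Vector, Vector, Vector, Vector, Vector], Vector],
    update_aux: Callable[[Vector, Vector, Vector], Vector],
    update_virtual: Callable[[Vector, Vector], Vector],
    max_updates: int,
) -> Tuple[Vector, Vector, Vector, int]:
    """Alternate the three updates until the update count reaches max_updates."""
    n_updates = 0
    while n_updates < max_updates:
        # Step 1: coefficient from both feature amounts, the virtual sound
        # source information, and the auxiliary variable.
        coeff = update_coeff(coeff, video_feat, audio_feat, virtual_src, aux)
        # Step 2: auxiliary variable from the updated coefficient and the
        # virtual sound source information.
        aux = update_aux(aux, coeff, virtual_src)
        # Step 3: virtual sound source information from the updated
        # auxiliary variable.
        virtual_src = update_virtual(virtual_src, aux)
        # Count the pass; the loop ends once the count reaches the threshold.
        n_updates += 1
    return coeff, virtual_src, aux, n_updates


# Hypothetical placeholder update rules; they are NOT taken from the patent.
def toy_update_coeff(coeff, video_feat, audio_feat, virtual_src, aux):
    bias = 1e-3 * (sum(video_feat) / len(video_feat) + sum(audio_feat) / len(audio_feat))
    return [0.5 * c + 0.5 * (s - z) + bias for c, s, z in zip(coeff, virtual_src, aux)]


def toy_update_aux(aux, coeff, virtual_src):
    return [z + (c - s) for z, c, s in zip(aux, coeff, virtual_src)]


def toy_update_virtual(virtual_src, aux):
    return [s - 0.1 * z for s, z in zip(virtual_src, aux)]


n_channels = 16  # assumed size: (3 + 1) ** 2 channels for Ambisonics order 3
coeff, virtual_src, aux, n_done = run_alternating_updates(
    video_feat=[0.1] * 8,
    audio_feat=[0.2] * 8,
    coeff=[0.0] * n_channels,
    virtual_src=[0.0] * n_channels,
    aux=[0.0] * n_channels,
    update_coeff=toy_update_coeff,
    update_aux=toy_update_aux,
    update_virtual=toy_update_virtual,
    max_updates=5,
)
print(n_done, len(coeff))
```

Keeping the update rules as injected callables separates the stopping condition of claim 5 from whatever concrete updates an embodiment uses.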

Description

TECHNICAL FIELD

Embodiments relate to a spatial audio restoration device, a spatial audio restoration method, and a program.

BACKGROUND ART

Spatial audio restoration techniques, which virtually restore spatial audio formed in a real space using headphones or a plurality of speakers, have been studied. Wavefront synthesis and Ambisonics are known examples of such techniques. With wavefront synthesis and Ambisonics, spatial audio is accurately restored on the basis of a sound field observed at a sound collection point. However, observing an accurate sound field requires a large-scale microphone array, so it may be difficult to observe an accurate sound field.

As a method for restoring spatial audio without observing an accurate sound field, a method has been proposed that uses a neural network to output first-order Ambisonics coefficients, taking an omnidirectional video, an optical flow, and a monaural sound as inputs.

CITATION LIST

Patent Literature

Patent Literature 1: JP 10-294999 A

Non Patent Literature

Non Patent Literature 1: P. Morgado et al., “Self-supervised generation of spatial audio for 360 video”, in Proc. NeurIPS 2018, pp. 360-370, 2018.

SUMMARY OF INVENTION

Technical Problem

However, the proposed method first separates the monaural sound into a plurality of sound sources corresponding to the video, and then subjects each of the separated sound sources to sound image localization. As a result, the proposed method has the following problems in restoring rich spatial audio corresponding to the video.

There is an upper limit to the number of sound sources that can be separated corresponding to the video.
It is difficult to model the effect of reverberation.
Individual modules are required for procedures such as sound source separation, which increases memory usage.
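
For context on the first-order Ambisonics of the earlier method versus the high-order coefficients targeted here, the following minimal sketch shows how the channel count grows with Ambisonics order and how a single plane-wave source is encoded at first order. The patent does not name a channel ordering or normalization, so the ACN ordering and SN3D normalization are assumptions, as are the function names and the example signal.

```python
import math
from typing import List, Sequence


def num_ambisonics_channels(order: int) -> int:
    """Number of Ambisonics channels for a full-sphere representation."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2


def encode_first_order(
    signal: Sequence[float], azimuth: float, elevation: float
) -> List[List[float]]:
    """Encode a mono signal arriving from (azimuth, elevation), in radians.

    Assumed convention: ACN channel order (W, Y, Z, X) with SN3D
    normalization; azimuth is counter-clockwise from the front.
    """
    gains = [
        1.0,                                      # W (ACN 0), omnidirectional
        math.sin(azimuth) * math.cos(elevation),  # Y (ACN 1), left-right
        math.sin(elevation),                      # Z (ACN 2), up-down
        math.cos(azimuth) * math.cos(elevation),  # X (ACN 3), front-back
    ]
    return [[g * s for s in signal] for g in gains]


# Channel count per order: first order needs 4 channels, third order 16.
for order in (1, 2, 3, 4):
    print(order, num_ambisonics_channels(order))

# A source directly in front excites only the W and X channels.
front = encode_first_order([1.0, -0.5], azimuth=0.0, elevation=0.0)
print(front)
```

The quadratic growth in channel count is one reason a higher order can carry more spatial detail than the four channels available at first order.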
The present invention has been made in view of the above circumstances, and an object thereof is to provide a means for restoring a rich spatial audio corresponding to a video.

Solution to Problem

A spatial audio restoration device according to one aspect includes a video feature amount calculation unit, an audio feature amount calculation unit, and a coefficient calculation unit. The video feature amount calculation unit calculates a video feature amount on the basis of video information. The audio feature amount calculation unit calculates an audio feature amount on the basis of audio information that is a monaural sound corresponding to the video information. The coefficient calculation unit calculates a high-order Ambisonics coefficient on the basis of the video feature amount and the audio feature amount.

Advantageous Effects of Invention

According to an embodiment, it is possible to provide a means for restoring a rich spatial audio corresponding to a video.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a diagram illustrating an example of a configuration of a spatial audio restoration system according to the first embodiment.
FIG. 2 is a block diagram illustrating an example of a hardware configuration of a spatial audio restoration device according to the first embodiment.
FIG. 3 is a block diagram illustrating an example of a configuration of a learning function of the spatial audio restoration device according to the first embodiment.
FIG. 4 is a block diagram illustrating an example of a configuration of a restoration function of the spatial audio restoration device according to the first embodiment.
FIG. 5 is a flowchart illustrating an example of a learning operation in the spatial audio restoration device according to the first embodiment.
FIG. 6 is a flowchart illustrating an example of a restoration operation in the spatial audio restoration device according to the first embodiment.
FIG. 7 is a block diagram illustrating an example of a configuration of a learning function of a spatial audio restoration device according to the second embodiment.
FIG. 8 is a block diagram illustrating an example of a configuration of a restoration function of the spatial audio restoration device according to the second embodiment.
FIG. 9 is a flowchart illustrating an example of a learning operation in the spatial audio restoration device according to the second embodiment.
FIG. 10 is a flowchart illustrating an example of a restoration operation in the spatial audio restoration device according to the second embodiment.
FIG. 11 is a block diagram illustrating an example of a configuration of a learning function of a spatial audio restoration device according to the third embodiment.
FIG. 12 is a block diagram illustrating an example of a configuration of a restoration function of the spatial audio restoration device according to the third embodiment.
FIG. 13 is a flowchart illustrating an example of a learning operation in the spatial audio restoration device according to the third embodiment.
FIG. 14 is a flowchart illustrating an example of a restoration operation in