US-12627943-B2 - System and method for headphone equalization and room adjustment for binaural playback in augmented reality

Abstract

A system is provided. The system includes an analyzer for determining a plurality of binaural room impulse responses, and a loudspeaker signal generator for generating at least two loudspeaker signals depending on the plurality of binaural room impulse responses and depending on the audio source signal of at least one audio source. The analyzer is configured to determine the plurality of the binaural room impulse responses such that each of the plurality of the binaural room impulse responses considers an effect that results from a headphone being worn by a user.

Inventors

Thomas Sporer

Assignees

Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.

Dates

Publication Date: 20260512
Application Date: 20230124
Priority Date: 20200731

Claims (10)

1. A system, comprising: an analyzer for determining a plurality of binaural room impulse responses; and a loudspeaker signal generator for generating at least two loudspeaker signals depending on the plurality of binaural room impulse responses and depending on an audio source signal of at least one audio source, wherein the analyzer is configured to determine the plurality of binaural room impulse responses such that each of the plurality of binaural room impulse responses considers an effect that results from a headphone being worn by a user, wherein the headphone comprises two headphone capsules and at least one microphone for conducting a measurement of sound in each of the two headphone capsules, wherein the at least one microphone for measuring the sound is arranged in each of the two headphone capsules, wherein the analyzer is configured to determine the plurality of binaural room impulse responses by using the measurement of the at least one microphone in each of the two headphone capsules, wherein the at least one microphone in each of the two headphone capsules is configured to, prior to reproduction of the at least two loudspeaker signals by the headphone, generate one or more recordings of a sound situation in a reproduction room, determine an estimation of a raw audio signal of the at least one audio source from the one or more recordings, and determine a binaural room impulse response of the plurality of binaural room impulse responses for the at least one audio source in the reproduction room, and wherein the at least one microphone in each of the two headphone capsules is configured to, during reproduction of the at least two loudspeaker signals by the headphone, generate one or more further recordings of the sound situation in the reproduction room, subtract an augmented signal from these one or more further recordings, determine the estimation of the raw audio signal from one or more audio sources, and determine the binaural room impulse response of the plurality of binaural room impulse responses for the at least one audio source in the reproduction room.
2. The system according to claim 1, wherein the analyzer is configured to determine acoustical room characteristics of the reproduction room and adapt the plurality of binaural room impulse responses depending on the acoustical room characteristics.
3. The system according to claim 1, wherein the at least one microphone is arranged in each of the two headphone capsules for measuring the sound close to an entrance of an ear canal of the user.
4. The system according to claim 1, wherein the system comprises one or more further microphones outside of the two headphone capsules for measuring the sound situation in the reproduction room.
5. The system according to claim 4, wherein the headphone comprises a bracket, and wherein at least one of the one or more further microphones is arranged on the bracket.
6. The system according to claim 1, wherein the loudspeaker signal generator is configured to generate the at least two loudspeaker signals by each of the plurality of binaural room impulse responses being convoluted with the audio source signal.
7. The system according to claim 1, wherein the analyzer is configured to determine at least one of the plurality of binaural room impulse responses depending on a movement of the headphone.
8. The system according to claim 7, wherein the system comprises a sensor to determine a movement of the headphone.
9. A method, comprising: determining a plurality of binaural room impulse responses; generating at least two loudspeaker signals depending on the plurality of binaural room impulse responses and depending on an audio source signal of at least one audio source, wherein the plurality of binaural room impulse responses is determined such that each of the plurality of binaural room impulse responses considers an effect that results from a headphone being worn by a user, wherein the headphone comprises two headphone capsules and at least one microphone for conducting a measurement of sound in each of the two headphone capsules, wherein the at least one microphone for measuring the sound is arranged in each of the two headphone capsules, and wherein the plurality of binaural room impulse responses is determined by using the measurement of the at least one microphone in each of the two headphone capsules; generating, by the at least one microphone in each of the two headphone capsules, prior to reproduction of the at least two loudspeaker signals by the headphone, one or more recordings of a sound situation in a reproduction room, determining an estimation of a raw audio signal of at least one audio source from the one or more recordings, and determining a binaural room impulse response of the plurality of binaural room impulse responses for the at least one audio source in the reproduction room; and generating, by the at least one microphone in each of the two headphone capsules, during reproduction of the at least two loudspeaker signals by the headphone, one or more further recordings of the sound situation in the reproduction room, subtracting an augmented signal from these one or more further recordings, determining the estimation of the raw audio signal from one or more audio sources, and determining the binaural room impulse response of the plurality of binaural room impulse responses for the at least one audio source in the reproduction room.
10. A non-transitory computer readable medium storing a computer program that when executed by a computer performs: determining a plurality of binaural room impulse responses; generating at least two loudspeaker signals depending on the plurality of binaural room impulse responses and depending on an audio source signal of at least one audio source, wherein the plurality of binaural room impulse responses is determined such that each of the plurality of binaural room impulse responses considers an effect that results from a headphone being worn by a user, wherein the headphone comprises two headphone capsules and at least one microphone for conducting a measurement of sound in each of the two headphone capsules, and wherein the at least one microphone for measuring the sound is arranged in each of the two headphone capsules, wherein the plurality of binaural room impulse responses is determined by using the measurement of the at least one microphone in each of the two headphone capsules; generating, by the at least one microphone in each of the two headphone capsules, prior to reproduction of the at least two loudspeaker signals by the headphone, one or more recordings of a sound situation in a reproduction room, determining an estimation of a raw audio signal of at least one audio source from the one or more recordings, and determining a binaural room impulse response of the plurality of binaural room impulse responses for the at least one audio source in the reproduction room; and generating, by the at least one microphone in each of the two headphone capsules, during reproduction of the at least two loudspeaker signals by the headphone, one or more further recordings of the sound situation in the reproduction room, subtracting an augmented signal from these one or more further recordings, determining the estimation of the raw audio signal from one or more audio sources, and determining the binaural room impulse response of the plurality of binaural room impulse responses for the at least one audio source in the reproduction room.
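As an illustration only, and not as the claimed implementation, the convolution named in claim 6 and the subtraction step named in claims 1, 9 and 10 can be sketched in a few lines of Python. The function names `convolve`, `render_binaural` and `subtract_augmented` are hypothetical, the impulse responses are toy values rather than measured data, and the sketch assumes that the augmented signal, as picked up by the in-capsule microphone, is already known and time-aligned with the recording.

```python
def convolve(signal, impulse_response):
    """Full linear convolution of two finite sample sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += s * h
    return out


def render_binaural(source, brir_left, brir_right):
    """Return (left, right) headphone signals for one audio source.

    Each ear signal is the source convolved with that ear's binaural
    room impulse response (BRIR).
    """
    return convolve(source, brir_left), convolve(source, brir_right)


def subtract_augmented(recording, augmented):
    """Remove the known augmented playback from an in-capsule recording.

    Assumption: both sequences are sample-aligned and the augmented
    signal is given as it appears at the microphone.
    """
    return [r - a for r, a in zip(recording, augmented)]


# Toy data (hypothetical values, chosen to be exactly representable).
source = [1.0, 0.5]
brir_left = [1.0, 0.0, 0.25]
brir_right = [0.0, 0.5]

left, right = render_binaural(source, brir_left, brir_right)
residual = subtract_augmented([1.5, 0.5, 0.25], [1.0, 0.5, 0.25])
```

With several audio sources, the per-source ear signals would be summed per ear. A practical system would typically use FFT-based block convolution instead of the direct double loop shown here.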

Description

CROSS-REFERENCES TO RELATED APPLICATIONS

This application is a continuation of copending International Application No. PCT/EP2021/071151, filed Jul. 28, 2021, which is incorporated herein by reference in its entirety, and additionally claims priority from European Application No. EP 20 188 945.8, filed Jul. 31, 2020, which is incorporated herein by reference in its entirety.

BACKGROUND OF THE INVENTION

The present invention relates to headphone equalization and room adaptation for binaural reproduction in augmented reality (AR).

Selective hearing (SH) refers to the capability of listeners to direct their attention to a certain sound source or to a plurality of sound sources in their auditory scene. In turn, this implies that the listeners' focus on uninteresting sources is reduced.

As such, human listeners are able to communicate even in loud environments. This usually relies on several aspects: when hearing with two ears, there are direction-dependent time and level differences and a direction-dependent spectral coloring of the sound. Through the latter, even when hearing with one ear, the sense of hearing is able to determine the direction of a sound source and thereby to separate different sound sources.

Temporal and level differences alone are not sufficient to determine the exact position of a sound source: the locations with the same temporal and level difference lie on a hyperboloid. The resulting ambiguity of the location determination is called the cone of confusion. In rooms, the sound of each source is reflected by boundary surfaces, and each resulting reflection acts as a so-called mirror source located on a further hyperboloid. The human sense of hearing combines the information about the direct sound and the associated reflections into one hearing event and thereby resolves the ambiguity of the cone of confusion. At the same time, the reflections belonging to a sound source increase the perceived loudness of the sound source.
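The ambiguity of the time difference described above can be made concrete with a deliberately simplified model. The following sketch assumes a far-field source, two point-like ears without a head between them, and an azimuth measured from straight ahead; the constants and the function name `interaural_time_difference` are illustrative assumptions and do not come from the source text. Under this model, a source at 30 degrees (front) and one at 150 degrees (back) produce the same time difference, which is the front-back case of the cone of confusion.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature
EAR_DISTANCE = 0.18     # m, assumed spacing between the two ears


def interaural_time_difference(azimuth_deg,
                               ear_distance=EAR_DISTANCE,
                               speed_of_sound=SPEED_OF_SOUND):
    """Far-field time difference (in seconds) between the two ears.

    Simplified two-point model: ITD = d * sin(azimuth) / c.
    Azimuth 0 is straight ahead, 90 is directly to one side.
    """
    return ear_distance * math.sin(math.radians(azimuth_deg)) / speed_of_sound


front = interaural_time_difference(30.0)   # source in front of the listener
back = interaural_time_difference(150.0)   # mirrored source behind the listener
# Both values agree up to rounding, so the time difference alone
# cannot distinguish the two positions.
```

Spectral coloring and room reflections, as described in the text, are what allow the sense of hearing to resolve such cases; neither is modeled in this sketch.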
In addition, in the case of natural sound sources, particularly speech, signal portions of different frequencies are temporally coupled. In binaural hearing, all of these aspects are used together. Furthermore, loud sources of disturbance that are well localizable can be actively ignored, so to speak.

In the literature, the concept of selective hearing is related to other terms such as assisted listening [1] and virtual and amplified auditory environments [2]. Assisted listening is a broader term that includes virtual, amplified and SH applications.

According to the conventional technology, classical hearing devices mostly operate in a monaural manner, i.e. signal processing for the right and left ears is fully independent with respect to frequency response and dynamic compression. As a consequence, time, level, and frequency differences between the ear signals are lost.

Modern, so-called binaural hearing devices couple the correction factors of the two hearing devices. They often have several microphones; however, usually only the microphone with the “most speech-like” signal is selected, and explicit beamforming is not computed. In complex hearing situations, desired and undesired sound signals are amplified in the same way, and a focus on desired sound components is therefore not supported.

In the field of hands-free devices, e.g. for telephones, several microphones are already used today, and so-called beams are computed from the individual microphone signals: sound coming from the direction of the beam is amplified, sound from other directions is reduced. Today's methods learn the constant sound in the background (e.g. engine and wind noise in the car), learn loud disturbances that are well localizable through a further beam, and subtract these from the useful signal (example: generalized side lobe canceller). Sometimes, telephone systems use detectors that detect the static properties of speech, suppressing everything that is not structured like speech.
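The beam computation mentioned above for hands-free devices can be illustrated with the simplest beamformer, delay-and-sum. This is a minimal sketch under stated assumptions, not the method of any particular product: delays are whole samples, the array has two microphones, and the function name `delay_and_sum` and the toy signals are hypothetical.

```python
def delay_and_sum(mic_signals, steering_delays):
    """Average the microphone signals after compensating per-microphone delays.

    steering_delays[i] is the number of samples by which sound from the
    look direction arrives later at microphone i than at the reference
    microphone. Advancing each signal by that amount aligns the look
    direction, so it adds up coherently.
    """
    length = min(len(sig) - d for sig, d in zip(mic_signals, steering_delays))
    count = len(mic_signals)
    return [
        sum(sig[n + d] for sig, d in zip(mic_signals, steering_delays)) / count
        for n in range(length)
    ]


# A pulse that reaches the second microphone one sample after the first.
mic1 = [0.0, 1.0, 0.0, 0.0, 0.0]
mic2 = [0.0, 0.0, 1.0, 0.0, 0.0]

on_beam = delay_and_sum([mic1, mic2], [0, 1])   # beam steered at the source
off_beam = delay_and_sum([mic1, mic2], [0, 0])  # beam steered elsewhere
```

When the beam is steered at the source, the pulse keeps its full amplitude; when it is steered elsewhere, the same pulse is smeared over two samples at half amplitude. A generalized side lobe canceller, as named in the text, adds an adaptive noise-cancelling path on top of such a fixed beam and is not shown here.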
In hands-free devices, only a mono signal is transmitted in the end. The transmission path thus loses the spatial information that would be of interest for capturing the situation and, in particular, for providing the illusion that “one was there”, particularly when several speakers share a call. By suppressing non-speech signals, important information about the acoustical environment of the conversation partner is lost, which can hinder the communication.

By nature, human beings are able to “selectively hear” and consciously focus on individual sound sources in their surroundings. An automatic system for selective hearing by means of artificial intelligence (AI) has to learn the underlying concepts first. Automatic decomposition of acoustical scenes (scene decomposition) first needs detection and classification of all active sound sources, followed by separation so as to be able to further process, amplify, or weaken them as separate audio objects. The research field of auditory scene analysis tries to detect and c