EP-4420115-B1 - AUDIO MASKING OF SPEECH


Inventors

STOTTAN, THOMAS
Hatheier, Thomas
SONTACCHI, ALOIS

Dates

Publication Date: 20260513
Application Date: 20221018

Claims (15)

A method for masking a speech signal in a zone-based audio system (1), wherein a speech signal to be masked is received and a masking signal is generated based on said speech signal, the method comprising: acquiring said speech signal to be masked in an audio zone (I, II); characterized in that the method comprises the following method steps: transforming (105) said acquired speech signal into spectral bands; swapping (125) spectral values of at least two spectral bands; generating a masking signal adapted to said speech signal based on said swapped spectral values; and outputting said masking signal for said speech signal in another audio zone (I, II).
The method of claim 1, wherein generating a masking signal based on said swapped spectral values comprises: generating a wide-band noise signal; transforming (150) said generated wide-band noise signal into the frequency domain; and multiplying (155) said frequency representation of said noise signal by a frequency representation of said speech signal taking into account said swapped spectral values.
The method of claim 2, said frequency representation of said speech signal generated by interpolating (135) the spectral values of the bands after swapping (125) of spectral values.
The method of one of the previous claims, further comprising: estimating a background noise spectrum; comparing spectral values of said speech signal with the background noise spectrum; and taking into account only spectral values of said speech signal larger than the corresponding spectral values of said background noise spectrum.
The method of one of the previous claims, said transformation (105) of said acquired speech signal to spectral bands occurring for blocks of said speech signal and by means of a Mel-filter bank (110) and, optionally, a temporal smoothing (120) of the spectral values occurs for the Mel bands.
The method of one of the previous claims, presenting, when outputting in said other audio zone (I, II), said noise signal spatially by means of multi-channel playback, preferably by multiplication with binaural spectra of an acoustic transfer function.
The method of claim 6, said noise signal in said other audio zone (I, II) being spatially output such that it appears to originate from the dominant direction of the speaker of said speech signal to be masked.
The method of one of the previous claims, further comprising: determining a point of time (t_{i,distract}) relevant for speech intelligibility in said speech signal; generating (245) a distraction signal for said determined point of time (t_{i,distract}); and outputting said distraction signal at said determined point of time (t_{i,distract}) as a further masking signal in said other audio zone (I, II).
The method of claim 8, said point of time (t_{i,distract}) relevant for speech intelligibility determined by means of extreme values of a spectral function of said speech signal, said spectral function being determined based on an addition of optionally averaged spectral values along the frequency axis (235).
The method of claim 8 or 9, said point of time (t_{i,distract}) relevant for speech intelligibility verified by means of parameters of said speech signal such as zero-crossing rate (205), short-term energy (215), and/or spectral center of gravity (220).
The method of one of the claims 8 to 10, said distraction signal for said determined point of time (t_{i,distract}) selected randomly from a set of predetermined distraction signals and/or adapted to said speech signal regarding a spectral characteristic and/or its energy (265).
The method of one of the claims 1 to 11, presenting, when outputting by means of multi-channel playback in said other audio zone (I, II), said masking signal spatially, preferably by multiplication (270) with binaural spectra of an acoustic transfer function (280), said masking signal being spatially output in said other audio zone (I, II) such that it appears to originate from a random direction and/or in the vicinity of a listener's head in said other audio zone (I, II).
An apparatus (A, B) for the generation of a masking signal in a zone-based audio system (1) that receives a speech signal to be masked and generates said masking signal based on said speech signal, comprising: means (105) for transforming said acquired speech signal to spectral bands; characterized in that the apparatus comprises the following means: means (125) for swapping spectral values of at least two spectral bands; and means for generating a masking signal adapted to said speech signal based on said swapped spectral values.
The apparatus of claim 13, further comprising: means for determining a point of time (t_{i,distract}) relevant for speech intelligibility in said speech signal; means (245) for generating a distraction signal for said relevant point of time (t_{i,distract}); and means (270) for adding the noise signal and said distraction signal and for outputting the sum signal as a masking signal; and/or means for generating a multi-channel representation of said masking signal that allows for a spatial playback of said masking signal.
A zone-based audio system (1) with a plurality of audio zones (I, II), one audio zone (I, II) comprising at least one microphone (5) for acquiring a speech signal and another audio zone (I, II) comprising at least one loudspeaker (4), wherein said microphone (5) and loudspeakers (4) are preferably arranged in headrests (3) of seats (2) for occupants of a vehicle, and wherein said audio system (1) comprises an apparatus (A, B) for generating a masking signal according to claim 13 or 14 that obtains a speech signal from a microphone (5) of said one audio zone (I, II) and sends said masking signal to said one or more loudspeakers (4) of said other audio zone (I, II).
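
For illustration only, the core steps of claims 1 to 3 (transforming a speech block into Mel bands, swapping band values, interpolating the swapped values back to a frequency representation, and shaping wide-band noise with it) might be sketched in Python with NumPy as follows. The block length, sample rate, number of bands, the choice of swapping adjacent band pairs, and all function names are assumptions made for this sketch; the claims only require that spectral values of at least two bands are swapped.

```python
import numpy as np


def mel_filterbank(n_bands, n_fft, sample_rate):
    """Return triangular Mel filters (n_bands x bins) and their center frequencies in Hz."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Band edges are equally spaced on the Mel scale between 0 Hz and Nyquist.
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_bands + 2))
    bin_hz = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_bands, bin_hz.size))
    for b in range(n_bands):
        lo, mid, hi = edges_hz[b:b + 3]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        bank[b] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank, edges_hz[1:-1]


def swap_adjacent_bands(band_values):
    """Swap band 0 with 1, 2 with 3, and so on (assumed pattern); an odd last band is kept."""
    swapped = band_values.copy()
    n_even = (len(swapped) // 2) * 2
    swapped[0:n_even:2], swapped[1:n_even:2] = band_values[1:n_even:2], band_values[0:n_even:2]
    return swapped


def masking_block(speech_block, sample_rate, n_bands=20, rng=None):
    """Generate one block of masking noise shaped by the band-swapped speech spectrum."""
    rng = np.random.default_rng() if rng is None else rng
    n_fft = len(speech_block)
    # Transform the windowed speech block into magnitude values per Mel band.
    magnitude = np.abs(np.fft.rfft(speech_block * np.hanning(n_fft)))
    bank, centers_hz = mel_filterbank(n_bands, n_fft, sample_rate)
    band_values = (bank @ magnitude) / bank.sum(axis=1)
    # Swap band values, then interpolate them back onto the FFT bin frequencies.
    swapped = swap_adjacent_bands(band_values)
    bin_hz = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    envelope = np.interp(bin_hz, centers_hz, swapped)
    # Shape wide-band noise in the frequency domain and return to the time domain.
    noise_spectrum = np.fft.rfft(rng.standard_normal(n_fft))
    return np.fft.irfft(noise_spectrum * envelope, n=n_fft)


if __name__ == "__main__":
    t = np.arange(512) / 16000.0
    speech_like = np.sin(2 * np.pi * 300.0 * t) + 0.5 * np.sin(2 * np.pi * 1200.0 * t)
    mask = masking_block(speech_like, 16000, rng=np.random.default_rng(0))
    print(mask.shape)
```

Because the swap is a pure permutation of the band values, their sum is unchanged. The optional temporal smoothing of claim 5 and the background-noise comparison of claim 4 are deliberately left out of this sketch.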

Description

The present disclosure relates to the generation of a masking signal for speech in a zone-based audio system.
Modern communication technologies and their ever-increasing coverage enable communication to take place almost anywhere, for example, in the form of telephone conversations. In public spaces, other people can often overhear such conversations and understand their content. This is particularly problematic when the conversations are confidential, private, or business-related. Such a scenario arises on public transportation, such as trains or airplanes, but also in private vehicles, such as taxis or rented limousines. In these cases, in addition to the person speaking, other people are in fixed positions, for example, in assigned seats. Often, such seats have an associated audio system or at least components thereof. For example, loudspeakers for individual audio playback may be provided in these seats, perhaps integrated into headrests; such an arrangement is also known as a zone-based audio system.
Besides telephone conversations, the problem of unwanted eavesdropping can also occur in conversations between people. For example, two passengers in the back of a taxi might be discussing a confidential topic, and the driver might not want to overhear them.
It is known from the prior art that unwanted eavesdropping can be reduced by playing loud noise. However, this increases the noise level for everyone involved and is perceived as an unpleasant disturbance that can also affect attention and reaction time, which is particularly undesirable in road traffic.
This document addresses the technical problem of generating a masking signal in a zone-based audio system that reduces unwanted eavesdropping on a conversation without causing any unpleasant disturbance. The problem is solved by the features of the independent claims. Advantageous embodiments are described in the dependent claims.
US 2012/016665 A1 shows, in its Figure 1, a device for generating a masking signal, wherein a CPU analyzes the speech utterance rate of a received audio signal. The CPU then copies the received audio signal into a plurality of audio signals and performs the following processing on each of them. Specifically, the CPU divides each audio signal into frames based on a frame length determined by the speech utterance rate. A reversal process is performed on each frame to replace the waveform of the frame with an inverted waveform, and a windowing process is performed to achieve smooth transitions between the frames. Subsequently, the CPU randomly reorders the sequence of the frames and mixes the multiple audio signals to generate a masking audio signal.
DE 10 2014 214052 A1 describes a method and a corresponding device for generating a shielded listening zone within a vehicle, including a method for masking a target speech signal. The method comprises identifying the target speech signal, which is reproduced in a first listening zone of the vehicle, and generating a first masking sound signal based on the identified target speech signal. Furthermore, the method includes binauralizing the first masking sound signal to generate a binaural masking sound signal. Finally, the method includes reproducing the binaural masking sound signal via at least two loudspeakers in a second listening zone of the vehicle, where the target speech signal is to be masked.
In "Aircraft noise and speech intelligibility in an outdoor living space", a "Partial Loudness Model" is developed that predicts the loudness of a signal in the presence of a masking sound signal, taking into account masking across spectral bands and the effects of temporal masking of time-varying noise.
According to a first aspect, a method for masking a speech signal in a zone-based audio system is disclosed. The method comprises capturing a speech signal to be masked in an audio zone, e.g., by means of one or more conveniently placed microphones, which may, for example, be located in the headrest of a seat. The speech signal may originate from the local speaker of a telephone conversation or belong to a conversation between people present. The captured speech signal is then transformed into spectral bands, which can be done, for example, using an FFT and Mel filters. Furthermore, the method involves swapping spectral values of at least two spectral bands, thereby altering the spectral structure of the speech signal without changing its overall energy content. Subsequently, a (preferably broadband) noise signal is generated based on the swapped spectral values. The generated noise signal exhibits a spectrum that, while showing a certain similarity to the spectrum of the speech signal, does not completely match it, as the spectral structure of the speech signal is no longer fully preserved due to the band swapping. Such a noise signal with a similar but not identical spectrum to the speech signal is well-suited as a masking signal for the speech signal. It should also be