EP-4439557-B1 - OWN-VOICE SUPPRESSION IN WEARABLES

EP 4439557 B1

Inventors

  • Donley, Jacob Ryan
  • Tourbabin, Vladimir

Dates

Publication Date
2026-05-06
Application Date
2024-03-26

Claims (15)

  1. A method comprising: capturing sound from a local area using an acoustic sensor array of a headset, the captured sound including noise from the local area and a voice of a user of the headset; determining a first set of statistical properties associated with the noise, based on portions of the captured sound that do not include the voice of the user; determining a second set of statistical properties associated with the voice of the user, based on portions of the captured sound that include the voice of the user; generating a first sound filter based on a target transfer function, the first set of statistical properties, and the second set of statistical properties; generating a second sound filter based on the target transfer function and the second set of statistical properties, but not the first set of statistical properties; applying the first sound filter and the second sound filter to different parts of an audio signal, generated based on the captured sound, to form audio content in which the voice of the user is suppressed; and presenting the audio content to the user.
  2. The method of claim 1, wherein the target transfer function is an array transfer function, ATF, associated with a target sound source.
  3. The method of claim 2, further comprising: estimating a direction of arrival of the target sound source; and determining the ATF based on the direction of arrival.
  4. The method of any preceding claim, wherein the first set of statistical properties comprises a first covariance matrix representing a spatial covariance of the noise, and wherein the second set of statistical properties comprises a second covariance matrix representing a spatial covariance of the voice of the user.
  5. The method of any preceding claim, wherein generating the first sound filter comprises weighting the first set of statistical properties relative to the second set of statistical properties; optionally, wherein a weight of the second set of statistical properties used to generate the first sound filter is less than a weight of the second set of statistical properties used to generate the second sound filter, such that the voice of the user is less suppressed in a part of the audio signal to which the first sound filter is applied compared to a part of the audio signal to which the second sound filter is applied.
  6. The method of any preceding claim, wherein the first sound filter and the second sound filter are applied during beamforming of signals produced by the acoustic sensor array; and/or wherein the first sound filter and the second sound filter are applied as post-filters after beamforming of signals produced by the acoustic sensor array.
  7. The method of any preceding claim, wherein: the audio signal comprises a sequence of audio frames which include the voice of the user, the first sound filter is applied to a first set of frames in the sequence of audio frames, and the second sound filter is applied to a second set of frames in the sequence of audio frames, the second set of frames being located before or after the first set of frames; optionally, wherein the first sound filter and the second sound filter are generated through weighting the first set of statistical properties relative to the second set of statistical properties.
  8. An audio system comprising: an acoustic sensor array configured to capture sound from a local area, the captured sound including noise from the local area and a voice of a user; a transducer array; and an audio controller configured to: determine a first set of statistical properties associated with the noise, based on portions of the captured sound that do not include the voice of the user; determine a second set of statistical properties associated with the voice of the user, based on portions of the captured sound that include the voice of the user; generate a first sound filter based on a target transfer function, the first set of statistical properties, and the second set of statistical properties; generate a second sound filter based on the target transfer function and the second set of statistical properties, but not the first set of statistical properties; apply the first sound filter and the second sound filter to different parts of an audio signal, generated based on the captured sound, to form audio content in which the voice of the user is suppressed; and present the audio content to the user through the transducer array.
  9. The audio system of claim 8, wherein the target transfer function is an array transfer function, ATF, associated with a target sound source.
  10. The audio system of claim 9, wherein the audio controller is further configured to: estimate a direction of arrival of the target sound source; and determine the ATF based on the direction of arrival.
  11. The audio system of any of claims 8 to 10, wherein the first set of statistical properties comprises a first covariance matrix representing a spatial covariance of the noise, and wherein the second set of statistical properties comprises a second covariance matrix representing a spatial covariance of the voice of the user.
  12. The audio system of any of claims 8 to 11, wherein to generate the first sound filter, the audio controller is configured to weight the first set of statistical properties relative to the second set of statistical properties; optionally, wherein a weight of the second set of statistical properties used to generate the first sound filter is less than a weight of the second set of statistical properties used to generate the second sound filter, such that the voice of the user is less suppressed in a part of the audio signal to which the first sound filter is applied compared to a part of the audio signal to which the second sound filter is applied.
  13. The audio system of any of claims 8 to 12, wherein the audio controller is configured to apply the first sound filter and the second sound filter during beamforming of signals produced by the acoustic sensor array; and/or wherein the audio controller is configured to apply the first sound filter and the second sound filter as post-filters after beamforming of signals produced by the acoustic sensor array.
  14. The audio system of any of claims 8 to 13, wherein: the audio signal comprises a sequence of audio frames which include the voice of the user, the audio controller is configured to apply the first sound filter to a first set of frames in the sequence of audio frames, and the audio controller is configured to apply the second sound filter to a second set of frames in the sequence of audio frames, the second set of frames being located before or after the first set of frames.
  15. A non-transitory computer-readable storage medium storing instructions which, when executed by one or more processors of an audio system, cause the audio system to: capture sound from a local area using an acoustic sensor array, the captured sound including noise from the local area and a voice of a user; determine a first set of statistical properties associated with the noise, based on portions of the captured sound that do not include the voice of the user; determine a second set of statistical properties associated with the voice of the user, based on portions of the captured sound that include the voice of the user; generate a first sound filter based on a target transfer function, the first set of statistical properties, and the second set of statistical properties; generate a second sound filter based on the target transfer function and the second set of statistical properties, but not the first set of statistical properties; apply the first sound filter and the second sound filter to different parts of an audio signal, generated based on the captured sound, to form audio content in which the voice of the user is suppressed; and present the audio content to the user.
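The claims leave the filter design open, but the combination of the spatial covariance matrices (claims 4 and 11), the target ATF (claims 2 and 9), and the relative weighting of voice statistics (claims 5 and 12) can be illustrated with an MVDR-style distortionless beamformer. The sketch below is a hypothetical reading, not the patented implementation: the function names, the weighting factor `alpha`, and the synthetic single-frequency-bin microphone data are all assumptions for illustration.

```python
import numpy as np

def spatial_covariance(frames):
    """Average outer product x x^H over time frames; frames: (T, M) complex."""
    return np.einsum('ti,tj->ij', frames, frames.conj()) / frames.shape[0]

def mvdr_filter(R, atf, diag_load=1e-6):
    """Distortionless spatial filter w = R^-1 d / (d^H R^-1 d)."""
    M = R.shape[0]
    R_inv = np.linalg.inv(R + diag_load * np.eye(M))  # loading keeps R invertible
    num = R_inv @ atf
    return num / (atf.conj() @ num)

# Synthetic data (assumption): M = 4 microphones, one frequency bin.
rng = np.random.default_rng(0)
M = 4
# "First set of statistical properties": frames without the user's voice.
noise_frames = rng.standard_normal((200, M)) + 1j * rng.standard_normal((200, M))
# "Second set": frames dominated by the user's voice (rank-1 spatial structure).
voice_steer = rng.standard_normal(M) + 1j * rng.standard_normal(M)
voice_gain = rng.standard_normal(200) + 1j * rng.standard_normal(200)
voice_frames = np.outer(voice_gain, voice_steer)

R_noise = spatial_covariance(noise_frames)   # first covariance matrix
R_voice = spatial_covariance(voice_frames)   # second covariance matrix
atf = np.ones(M, dtype=complex)              # target transfer function (illustrative)

alpha = 0.1  # down-weights own-voice statistics in the first filter (claim 5/12)
w1 = mvdr_filter(R_noise + alpha * R_voice, atf)  # first filter: noise + weighted voice
w2 = mvdr_filter(R_voice, atf)                    # second filter: voice statistics only

# Both filters satisfy the distortionless constraint toward the target ATF.
print(np.allclose(w1.conj() @ atf, 1.0), np.allclose(w2.conj() @ atf, 1.0))  # True True
```

In this reading, `w1` nulls ambient noise while only mildly attenuating the user's voice, whereas `w2`, built from the voice covariance alone, steers its null onto the own-voice subspace; applying them to different frames of the audio signal matches the per-frame filter switching of claims 7 and 14.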

Description

FIELD OF THE INVENTION

The present disclosure generally relates to suppressing a voice, and specifically relates to own-voice suppression in wearables.

BACKGROUND

Conventional headsets may enhance a far-field talker but also tend to amplify a voice of a user of the headset. The resulting amplification of the user's voice back to the user can be jarring and reduce the quality of the user's listening experience. US 2023/0083192 A1 discloses an approach for own-voice suppression in a hearing device.

SUMMARY

Described herein is an audio system and corresponding techniques for controlling the audio system to enhance sound from a target sound source, and also suppress a user's own voice, when presenting audio content to the user. The audio system may be integrated into a wearable device. The wearable device may be, e.g., a headset, an in-ear device, a wristwatch, etc. The audio system may generate the audio content using separate sound filters (e.g., spatial filters). The sound filters can be generated through tracking statistical properties (e.g., spatial covariance across an acoustic sensor array) associated with ambient noise and statistical properties associated with the user's voice.

In accordance with a first aspect, there is provided a method of own-voice suppression. The method comprises capturing sound from a local area using an acoustic sensor array of a headset, the captured sound including noise from the local area and a voice of a user of the headset. The method further includes determining a first set of statistical properties associated with the noise, based on portions of the captured sound that do not include the voice of the user; and determining a second set of statistical properties associated with the voice of the user, based on portions of the captured sound that include the voice of the user.
The method further includes generating a first sound filter based on a target transfer function, the first set of statistical properties, and the second set of statistical properties; and generating a second sound filter based on the target transfer function and the second set of statistical properties, but not the first set of statistical properties. The method further includes applying the first sound filter and the second sound filter to different parts of an audio signal, generated based on the captured sound, to form audio content in which the voice of the user is suppressed; and presenting the audio content to the user.

The method described in the preceding paragraph may further include at least one of the following features, either alone or in a combination of two or more features:

(i) the target transfer function is an array transfer function (ATF) associated with a target sound source;
(ii) estimating a direction of arrival of the target sound source in combination with determining the ATF based on the direction of arrival;
(iii) the first set of statistical properties comprises a first covariance matrix representing a spatial covariance of the noise, and the second set of statistical properties comprises a second covariance matrix representing a spatial covariance of the voice of the user;
(iv) generating the first sound filter comprises weighting the first set of statistical properties relative to the second set of statistical properties;
(v) a weight of the second set of statistical properties used to generate the first sound filter is less than a weight of the second set of statistical properties used to generate the second sound filter, such that the voice of the user is less suppressed in a part of the audio signal to which the first sound filter is applied compared to a part of the audio signal to which the second sound filter is applied;
(vi) the first sound filter and the second sound filter are applied during beamforming of signals produced by the acoustic sensor array;
(vii) the first sound filter and the second sound filter are applied as post-filters after beamforming of signals produced by the acoustic sensor array;
(viii) the audio signal comprises a sequence of audio frames which include the voice of the user, the first sound filter is applied to a first set of frames in the sequence of audio frames, and the second sound filter is applied to a second set of frames in the sequence of audio frames, the second set of frames being located before or after the first set of frames; and/or
(ix) generating the first sound filter and the second sound filter through weighting the first set of statistical properties relative to the second set of statistical properties.

In accordance with a second aspect, there is provided an audio system including an acoustic sensor array, a transducer array, and an audio controller. The acoustic sensor array is configured to capture sound from a local area, the captured sound including noise from the local area and a voice of a user. The audio controller is configured to determine a first set of statistical properties associated with the noise, based on portion