US-12626688-B2 - Method and system for speech detection and speech enhancement
Abstract
A method of speech detection, speech enhancement, and training in a speech detection and speech enhancement unit. The method comprises receiving input audio segments; determining an acoustic environment based on auxiliary information of the input audio; extracting Time-Frequency (T-F) domain features from the received input audio segments; determining whether each of the received input audio segments is speech by inputting the T-F domain features into a speech detection classifier trained for the determined acoustic environment; determining, when one of the received input audio segments is speech, whether the received audio segment is noisy speech by inputting the T-F domain features into a noise classifier using a statistical generative model representing the probability distributions of the T-F domain features of noisy speech trained for the determined acoustic environment; and applying a noise reduction mask on the received input audio segments according to the determination that the received audio segment is noisy speech.
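The abstract's pipeline starts by extracting Time-Frequency (T-F) domain features from each input audio segment. The patent does not fix a particular feature representation, so the following is a minimal sketch assuming a log-magnitude short-time Fourier transform (a common choice for T-F features); the function name, frame length, and hop size are illustrative assumptions, not part of the claimed method.

```python
import numpy as np

def stft_features(segment, frame_len=256, hop=128):
    """Extract log-magnitude T-F features from a 1-D audio segment.

    Returns an array of shape (n_frames, frame_len // 2 + 1), i.e. one
    feature vector per time frame, one bin per frequency.
    """
    n_frames = 1 + (len(segment) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack(
        [segment[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    spectrum = np.fft.rfft(frames, axis=1)   # one-sided spectrum per frame
    return np.log1p(np.abs(spectrum))        # compressed magnitude features
```

Feature vectors of this shape are what the speech detection classifier and the noise classifier described below would consume, one time frame (τ) at a time across frequency bins (k).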
Inventors
- Anna KIM
- Eamonn Shaw
Assignees
- Pexip AS
Dates
- Publication Date: 2026-05-12
- Application Date: 2022-06-29
- Priority Date: 2021-06-30
Claims (5)
- 1 . A method of speech detection and speech enhancement in a speech detection and speech enhancement unit of a Multipoint Conferencing Node (MCN), comprising: receiving input audio segments from at least one videoconferencing participant; determining an acoustic environment based on auxiliary information of the at least one videoconferencing participant, wherein determining an acoustic environment based on auxiliary information comprises: analyzing a video image to count a number of participants, and identifying a type of videoconferencing endpoint, wherein the acoustic environment is determined prior to processing the input audio segments; extracting Time-Frequency (T-F) domain features from the received input audio segments; determining if each of the received input audio segments is speech by inputting the T-F domain features into a speech detection classifier trained for the determined acoustic environment; determining, when one of the received input audio segments is speech, if the received audio segment is noisy speech by inputting the T-F domain features into a noise classifier using a statistical generative model representing the probability distributions of the T-F domain features of noisy speech trained for the determined acoustic environment; and applying a noise reduction mask on the received input audio segments according to the determination that the received audio segment is noisy speech, wherein applying a noise reduction mask comprises applying a composite noise reduction mask calculated as CM(τ,k) = α·ERM(τ,k) + β·EBM(τ,k), where ERM is an estimated ratio mask, EBM is an estimated binary mask generated using a Bayesian classifier, and α and β are weights tuned for the determined acoustic environment, wherein the auxiliary information of the at least one videoconferencing participant comprises at least one of a number of participants in a video image received from the at least one videoconferencing participant, and a specification of a videoconferencing endpoint received from the at least one videoconferencing participant, and wherein the acoustic environment comprises a meeting room with a video conferencing endpoint, a home office, and a public space.
- 2 . The method of claim 1 , wherein the speech detection and speech enhancement unit is trained according to the method of claim 1 .
- 3 . The method of claim 1 , wherein the noise classifier is a Bayesian classifier.
- 4 . The method of claim 1 , wherein the noise reduction mask is a composite noise reduction mask.
- 5 . The method of claim 4 , wherein the composite noise reduction mask is based on an estimated binary mask (EBM) generated using the Bayesian classifier.
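Claims 1, 4, and 5 define the composite noise reduction mask CM(τ,k) = α·ERM(τ,k) + β·EBM(τ,k), a per-bin weighted combination of a ratio mask and a binary mask applied to the noisy spectrum. The sketch below illustrates just that arithmetic; the function names, the clipping of the combined mask to [0, 1], and the example weight values are illustrative assumptions (the patent only specifies that α and β are tuned per acoustic environment).

```python
import numpy as np

def composite_mask(erm, ebm, alpha, beta):
    """CM(tau, k) = alpha * ERM(tau, k) + beta * EBM(tau, k).

    erm: estimated ratio mask, values in [0, 1] per T-F bin.
    ebm: estimated binary mask (0 or 1 per T-F bin), e.g. from a
         Bayesian speech/noise classifier.
    Clipping to [0, 1] is an assumption to keep the mask a valid gain.
    """
    return np.clip(alpha * erm + beta * ebm, 0.0, 1.0)

def apply_mask(noisy_stft, mask):
    """Element-wise gain applied to the noisy T-F representation."""
    return noisy_stft * mask
```

The masked spectrum would then be inverted back to the time domain (e.g. by an inverse STFT with overlap-add) to produce the enhanced audio segment.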
Description
CROSS-REFERENCE TO RELATED APPLICATION
This application is related to and claims priority to Norwegian Patent Application No. 20210874, filed Jun. 30, 2021, entitled METHOD AND SYSTEM FOR SPEECH DETECTION AND SPEECH ENHANCEMENT, the entirety of which is incorporated herein by reference.
FIELD
The present invention relates to detecting and enhancing speech in a multipoint videoconferencing session, in particular a method of speech detection and speech enhancement in a speech detection and speech enhancement unit of a Multipoint Conferencing Node (MCN) and a method of training the same.
BACKGROUND
Transmission of audio and moving pictures in real time is employed in several applications, e.g. video conferencing, team collaboration software, net meetings and video telephony. Terminals and endpoints able to participate in a conference may be traditional stationary video conferencing endpoints, external devices such as mobile and computer devices, smartphones, tablets, personal devices and PCs, and browser-based video conferencing terminals. Video conferencing systems allow for the simultaneous exchange of audio, video and data information among multiple conferencing sites. For multipoint video conferencing, there is usually a Multipoint Conferencing Node (MCN) that provides switching and layout functions to allow the endpoints and terminals of multiple sites to intercommunicate in a conference. Such nodes may also be referred to as Multipoint Control Units (MCUs), Multi Control Infrastructure (MCI), Conference Nodes and Collaboration Nodes (CNs).
MCU is the most commonly used term and has traditionally been associated with dedicated hardware; however, the functions of an MCN could just as well be implemented in software installed on general-purpose servers and computers. In the following, all kinds of nodes, devices and software implementing features, services and functions that provide switching and layout functions to allow the endpoints and terminals of multiple sites to intercommunicate in a conference, including (but not limited to) MCUs, MCIs and CNs, are referred to as MCNs.
Audio quality represents a key aspect of the video conferencing experience. One major challenge is the heterogeneous and dynamic nature of the audio environment of the various conference participants. In a home office, some participants may use headsets with directional microphones that are close to the mouth, while others may use the built-in speaker and microphone of a laptop computer. In a typical meeting room equipped with a video conferencing endpoint, a speakerphone with multiple microphones is often placed in the middle of the table, with participants sitting at different distances from the shared audio unit. There may also be participants connected to the conference via a smartphone that is either handheld or used with an external headset with microphone. These diverse physical setups mean that different types of noise may be picked up during the course of the conference. Relying on the user to constantly or consciously mute and unmute the microphone can be cumbersome and tiring, while disrupting the flow of the conversation.
Effectively reducing the impact of noise in a video conferencing setting is a challenging problem. Depending on the source and the environment, the noise contributions may be stationary or non-stationary, bursty, narrowband or wideband, or contain speech-like harmonics. One challenge is to reduce or eliminate noises that disturb the meeting.
Some noise can be minimized by audio devices on the client side that have active background noise cancellation. Non-verbal sounds produced by the speaker, such as coughing, sneezing and heavy breathing, are generally undesirable but cannot be removed in the same manner. A means to differentiate speech from noise, i.e. reliable speech detection, is therefore needed. In addition, speech corrupted by noise becomes less intelligible. The extent of degradation depends on the amount and the type of noise contributions. How to enhance the quality and intelligibility of speech when noise is present is the other challenge to be addressed.
Known solutions for speech detection, also known as voice activity detection (VAD), and for speech enhancement are typically implemented on the client side, i.e. where speech is generated or perceived. The GSM standard features audio codecs that support VAD for better bandwidth utilization. Various enhancement algorithm implementations can be found in high-performance headsets. Recent advances in machine learning have also led to a large number of deep neural network (DNN) based speech processing algorithms. The Opus audio codec, for example, improves VAD and supports classification of speech and music using RNNs (recurrent neural networks) in its Opus 1.3 release. Microsoft has also been actively pursuing real-time de-noising of speech using neural networks (venturebeat.co
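Claim 3 specifies that the noise classifier is a Bayesian classifier using a statistical generative model over T-F features. The patent does not name a particular model, so the following is a toy sketch assuming diagonal Gaussians per class (a common minimal generative model); the class name, the two-class labeling, and the maximum-a-posteriori decision rule are illustrative assumptions, trained per acoustic environment in the claimed method.

```python
import numpy as np

class GaussianNoiseClassifier:
    """Toy Bayesian classifier over T-F feature vectors.

    Models each class (0 = clean speech, 1 = noisy speech) with a diagonal
    Gaussian and a class prior, then picks the class with the higher
    posterior. A stand-in for the generative model in the patent, which is
    not specified beyond being trained per acoustic environment.
    """

    def fit(self, feats, labels):
        self.params = {}
        for c in (0, 1):
            x = feats[labels == c]
            # Per-dimension mean and variance, plus the class prior.
            self.params[c] = (x.mean(axis=0), x.var(axis=0) + 1e-6,
                              len(x) / len(feats))
        return self

    def log_posterior(self, feats, c):
        mu, var, prior = self.params[c]
        ll = -0.5 * (np.log(2 * np.pi * var) + (feats - mu) ** 2 / var).sum(axis=1)
        return ll + np.log(prior)  # unnormalized log posterior

    def predict(self, feats):
        # MAP decision: 1 = noisy speech, 0 = clean speech.
        return (self.log_posterior(feats, 1) > self.log_posterior(feats, 0)).astype(int)
```

In the claimed pipeline such a per-bin or per-frame decision would feed the estimated binary mask (EBM) of claims 4 and 5.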