US-20260128046-A1 - SOURCE SEPARATING MULTI-STREAM AUDIO CODEC WITH CUSTOMIZABLE FILTERING AND MIXING
Abstract
A method of operating an audio transmitting endpoint includes receiving an audio signal at a transmitting endpoint, using a trained neural network, source separating the audio signal to generate a first source stream comprising first embedding vectors representative of a first source of the audio signal and a second source stream comprising second embedding vectors representative of a second source of the audio signal, adjusting an allocation of bandwidth for at least one of the first and second source streams based on at least one of entropy and dynamism of the at least one of the first and second streams, applying at least one transformation operation to at least one of the first and second source streams, vector quantizing the first and second source streams to generate first and second codewords, and transmitting, in one or more packets, the first and second codewords, or respective indexes thereof, to a remote endpoint.
Inventors
- Yusuf Ziya ISIK
- Christopher Rowen
- Samer Lutfi Hijazi
- Xuehong Mao
Assignees
- CISCO TECHNOLOGY, INC.
Dates
- Publication Date
- 20260507
- Application Date
- 20241105
Claims (20)
- 1 . A method comprising: receiving an audio signal at a transmitting endpoint; using a trained neural network, source separating the audio signal to generate a first source stream comprising first embedding vectors representative of a first source of the audio signal and a second source stream comprising second embedding vectors representative of a second source of the audio signal; adjusting an allocation of bandwidth for at least one of the first source stream and the second source stream based on at least one of entropy and dynamism of the at least one of the first source stream and the second source stream; applying at least one transformation operation to at least one of the first source stream and the second source stream; vector quantizing the first source stream to generate first codewords; vector quantizing the second source stream to generate second codewords; and transmitting, in one or more packets, the first codewords and the second codewords, or respective indexes thereof, to a remote endpoint.
- 2 . The method of claim 1 , further comprising: using the trained neural network, source separating the audio signal to generate a third source stream comprising third embedding vectors representative of a third source of the audio signal; and at least one of selecting the third source stream to not send to the remote endpoint and adjusting an allocation of bandwidth for the third source stream.
- 3 . The method of claim 2 , further comprising allocating an increased amount of bandwidth to at least one of the first source stream and the second source stream in response to at least one of the third source stream not being sent to the remote endpoint and the allocation of bandwidth for the third source stream being reduced.
- 4 . The method of claim 1 , further comprising adjusting an amount of bandwidth allocated to at least one of the first source stream and the second source stream in response to receiving information regarding a channel condition between the transmitting endpoint and the remote endpoint.
- 5 . The method of claim 1 , wherein applying the at least one transformation operation to the at least one of the first source stream and the second source stream comprises controlling at least one of volume, tone, and temporal and spectral characteristics for at least one of the first source stream and the second source stream.
- 6 . The method of claim 5 , wherein applying the at least one transformation operation to the at least one of the first source stream and the second source stream is in response to input received via a user interface.
- 7 . The method of claim 1 , further comprising passing the at least one of the first source stream and the second source stream through a classifier.
- 8 . The method of claim 7 , wherein the first source of the audio signal comprises a main speaker and wherein the second source of the audio signal comprises at least one of a far-field talker, human noises, background noises, and music.
- 9 . The method of claim 1 , further comprising: in response to receiving user input at the remote endpoint, selecting at least one of a reproduced first source stream and a reproduced second source stream to not play back as audio.
- 10 . The method of claim 1 , further comprising: in response to receiving user input at the remote endpoint, applying at least one transformation operation to at least one of a reproduced first source stream and a reproduced second source stream, wherein the at least one transformation operation is executed in an embedding domain prior to decoding the reproduced first source stream and the reproduced second source stream.
- 11 . A device comprising: an interface configured to enable network communications; a memory; and one or more processors coupled to the interface and the memory, and configured to: receive an audio signal; using a trained neural network, source separate the audio signal to generate a first source stream comprising first embedding vectors representative of a first source of the audio signal and a second source stream comprising second embedding vectors representative of a second source of the audio signal; adjust an allocation of bandwidth for at least one of the first source stream and the second source stream based on at least one of entropy and dynamism of the at least one of the first source stream and the second source stream; apply at least one transformation operation to at least one of the first source stream and the second source stream; vector quantize the first source stream to generate first codewords; vector quantize the second source stream to generate second codewords; and transmit, in one or more packets, the first codewords and the second codewords, or respective indexes thereof, to a remote endpoint.
- 12 . The device of claim 11 , wherein the one or more processors are configured to: using the trained neural network, source separate the audio signal to generate a third source stream comprising third embedding vectors representative of a third source of the audio signal; and at least one of select the third source stream to not send to the remote endpoint and adjust an allocation of bandwidth for the third source stream.
- 13 . The device of claim 12 , wherein the one or more processors are configured to allocate an increased amount of bandwidth for at least one of the first source stream and the second source stream in response to at least one of the third source stream not being sent to the remote endpoint and the allocation of bandwidth for the third source stream being reduced.
- 14 . The device of claim 11 , wherein the one or more processors are configured to adjust an amount of bandwidth allocated for at least one of the first source stream and the second source stream in response to receiving information regarding a channel condition between the device and the remote endpoint.
- 15 . The device of claim 11 , wherein the one or more processors are configured to apply the at least one transformation operation to the at least one of the first source stream and the second source stream by controlling at least one of volume, tone, and temporal and spectral characteristics for at least one of the first source stream and the second source stream.
- 16 . The device of claim 15 , wherein the one or more processors are configured to apply the at least one transformation operation to the at least one of the first source stream and the second source stream in response to input received via a user interface.
- 17 . One or more non-transitory computer readable storage media encoded with instructions that, when executed by a processor, cause the processor to: receive an audio signal at a transmitting endpoint; using a trained neural network, source separate the audio signal to generate a first source stream comprising first embedding vectors representative of a first source of the audio signal and a second source stream comprising second embedding vectors representative of a second source of the audio signal; adjust an allocation of bandwidth for at least one of the first source stream and the second source stream based on at least one of entropy and dynamism of the at least one of the first source stream and the second source stream; apply at least one transformation operation to at least one of the first source stream and the second source stream; vector quantize the first source stream to generate first codewords; vector quantize the second source stream to generate second codewords; and transmit, in one or more packets, the first codewords and the second codewords, or respective indexes thereof, to a remote endpoint.
- 18 . The one or more non-transitory computer readable storage media of claim 17 , wherein the instructions are configured to: using the trained neural network, source separate the audio signal to generate a third source stream comprising third embedding vectors representative of a third source of the audio signal; and at least one of select the third source stream to not send to the remote endpoint and adjust an allocation of bandwidth for the third source stream.
- 19 . The one or more non-transitory computer readable storage media of claim 18 , wherein the instructions are configured to allocate an increased amount of bandwidth for at least one of the first source stream and the second source stream in response to at least one of the third source stream not being sent to the remote endpoint and the allocation of bandwidth for the third source stream being reduced.
- 20 . The one or more non-transitory computer readable storage media of claim 17 , wherein the instructions are configured to adjust an amount of bandwidth allocated for at least one of the first source stream and the second source stream in response to receiving information regarding a channel condition between the transmitting endpoint and the remote endpoint.
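The bandwidth-adjustment steps recited in claims 1 through 4 (allocate bits per stream based on entropy or dynamism, and redistribute bandwidth when a stream is reduced or not sent) can be illustrated with a toy sketch. Everything below is an illustrative assumption, not taken from the disclosure: the proportional-to-entropy rule, the function names, and the use of codeword-index histograms as the entropy measure are all stand-ins for whatever a real implementation would use.

```python
# Illustrative sketch (not from the patent): split a fixed bit budget across
# source streams in proportion to the Shannon entropy of each stream's recent
# codeword-index usage. A dropped or flat stream gets little or no bandwidth,
# and its share flows to the remaining streams (cf. claims 2-3).
from collections import Counter
from math import log2

def index_entropy(indexes):
    """Shannon entropy (bits/symbol) of a stream's codeword-index histogram."""
    counts = Counter(indexes)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def allocate_bandwidth(streams, total_bps):
    """streams: dict of stream name -> recent codeword indexes ([] = dropped).
    Returns dict of stream name -> allocated bits per second."""
    entropies = {name: index_entropy(ix) if ix else 0.0
                 for name, ix in streams.items()}
    z = sum(entropies.values())
    if z > 0:
        return {name: int(total_bps * e / z) for name, e in entropies.items()}
    # All streams flat or dropped: split evenly among non-dropped streams.
    active = [name for name, ix in streams.items() if ix]
    if not active:
        return {name: 0 for name in streams}
    return {name: total_bps // len(active) if name in active else 0
            for name in streams}

speech = [0, 1, 2, 3, 0, 1, 2, 3]   # high-entropy (dynamic) stream
noise = [5, 5, 5, 5, 5, 5, 5, 5]    # low-entropy (static) stream
alloc = allocate_bandwidth({"speech": speech, "noise": noise}, total_bps=24000)
```

Under this toy rule, the dynamic speech stream receives the whole budget and the flat noise stream receives none, mirroring the claimed behavior of shifting bandwidth toward the more information-dense stream.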
Description
TECHNICAL FIELD

The present disclosure relates to audio processing, and more particularly to audio processing using a neural audio coding-decoding system that performs source separation.

BACKGROUND

Audio coding/decoding (codec) systems play a central role in real-time communication technologies, aiming to preserve audio content quality and intelligibility while minimizing bit consumption. The integration of machine learning techniques and the development of end-to-end neural codecs have driven advancements in bitrate reduction and audio quality. In addition to encoding the audio signal, there is a role for audio enhancement in extensively utilized real-time communication solutions. Deep neural networks have shown promising results in addressing the challenges of audio enhancement in noisy and reverberant environments.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a system block diagram of a neural network audio codec system according to an example embodiment.
FIG. 2 is a diagram depicting end-to-end training of a neural network audio codec system according to an example embodiment.
FIG. 3 is a block diagram of a transmitting endpoint of a neural network audio codec system according to an example embodiment.
FIG. 4 shows a user interface for controlling a transmitting endpoint or a receiving endpoint of a neural network audio codec system according to an example embodiment.
FIG. 5 is a block diagram of a receiving endpoint of a neural network audio codec system according to an example embodiment.
FIG. 6 is a flowchart depicting a series of operations that may be executed by a transmitting endpoint in connection with source separating and control logic, according to an example embodiment.
FIG. 7 is a block diagram of a computing device that may be configured to host location management function logic, and to perform techniques described herein, according to an example embodiment.

DETAILED DESCRIPTION

Overview

A method of operating an audio transmitting endpoint is disclosed.
The method may include receiving an audio signal at a transmitting endpoint, using a trained neural network, source separating the audio signal to generate a first source stream comprising first embedding vectors representative of a first source of the audio signal and a second source stream comprising second embedding vectors representative of a second source of the audio signal, adjusting an allocation of bandwidth for at least one of the first source stream and the second source stream based on at least one of entropy and dynamism of the at least one of the first source stream and the second source stream, applying at least one transformation operation to at least one of the first source stream and the second source stream, vector quantizing the first source stream to generate first codewords, vector quantizing the second source stream to generate second codewords, and transmitting, in one or more packets, the first codewords and the second codewords, or respective indexes thereof, to a remote endpoint. 
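The transmit-side method above can be sketched end to end as a toy pipeline: separate the input into per-source embedding streams, apply an embedding-domain transformation, vector quantize each stream against a codebook, and pack only the resulting indexes. Everything here is an illustrative stand-in (the "separator" is a fixed split rather than a trained neural network, the codebook is random, and all names are assumptions), intended only to show the shape of the data flow.

```python
# Toy sketch of the transmit-side pipeline (all components are stand-ins for
# the patent's trained neural separator, learned codebook, etc.).
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK = rng.normal(size=(64, 8))  # 64 codewords, 8-dim embeddings (assumed sizes)

def separate(frames):
    """Stand-in for neural source separation: split each frame's embedding
    into two 'source' embedding streams (e.g. main speaker vs. background)."""
    return frames * 0.7, frames * 0.3

def transform(stream, gain):
    """Embedding-domain transformation (cf. claims 5 and 10): a simple gain."""
    return stream * gain

def vector_quantize(stream):
    """Map each embedding vector to the index of its nearest codeword (L2)."""
    d = np.linalg.norm(stream[:, None, :] - CODEBOOK[None, :, :], axis=-1)
    return d.argmin(axis=1)

frames = rng.normal(size=(10, 8))            # 10 frames of 8-dim embeddings
src1, src2 = separate(frames)
src2 = transform(src2, gain=0.5)             # e.g. attenuate the background stream
idx1, idx2 = vector_quantize(src1), vector_quantize(src2)
packet = {"stream1": idx1.tolist(), "stream2": idx2.tolist()}  # indexes, not vectors
```

Transmitting indexes rather than the embedding vectors themselves is what yields the bitrate savings: each 8-dimensional float vector collapses to a single small integer per frame per stream.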
A device is also described and includes an interface configured to enable network communications, a memory, and one or more processors coupled to the interface and the memory, and configured to: receive an audio signal, using a trained neural network, source separate the audio signal to generate a first source stream comprising first embedding vectors representative of a first source of the audio signal and a second source stream comprising second embedding vectors representative of a second source of the audio signal, adjust an allocation of bandwidth for at least one of the first source stream and the second source stream based on at least one of entropy and dynamism of the at least one of the first source stream and the second source stream, apply at least one transformation operation to at least one of the first source stream and the second source stream, vector quantize the first source stream to generate first codewords, vector quantize the second source stream to generate second codewords, and transmit, in one or more packets, the first codewords and the second codewords, or respective indexes thereof, to a remote endpoint.

Example Embodiments

Neural Network Audio Codec System

Reference is first made to FIG. 1, which shows a block diagram of a neural audio encoder/decoder (codec) system 100. The neural audio codec system 100 includes a transmit side 102 and a receive side 104, which may be separate devices that are in communication with each other via network 106. The network 106 may be a combination of (wired or wireless) local area networks, (wired or wireless) wide area networks, the public switched telephone network (PSTN), etc. At the transmit side 102, there is an audio encoder 110 and a vector quantizer 112. The vector quantizer 112 uses a codebook 114. The audio encoder 110 receives an input audio stream (that includes speech as well as artifacts and impairments, such as background noise). The audio encoder 110 may use a deep neural network.
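Because a codebook such as codebook 114 is available at both endpoints, only codeword indexes need to cross the network; the receiver recovers the quantized embeddings by table lookup. A minimal sketch of that round trip, assuming a shared codebook (the sizes, names, and nearest-neighbor rule here are illustrative assumptions, not details from the disclosure):

```python
# Illustrative sketch: encode embeddings to codeword indexes on the transmit
# side and reconstruct them by codebook lookup on the receive side. The shared
# codebook and its dimensions are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(1)
codebook = rng.normal(size=(256, 16))  # shared codebook: 256 codewords, 16-dim

def encode(embeddings, codebook):
    """Transmit side: nearest-codeword index (L2) per embedding vector."""
    d = np.linalg.norm(embeddings[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

def decode(indexes, codebook):
    """Receive side: recover quantized embeddings by codebook lookup."""
    return codebook[indexes]

embeddings = rng.normal(size=(5, 16))
indexes = encode(embeddings, codebook)          # 5 small integers on the wire
reconstructed = decode(indexes, codebook)       # each row is the chosen codeword
```

Since `decode` is a pure lookup, reconstruction quality is set entirely by how well the (in practice, learned) codebook covers the embedding space, which is why the patent's bandwidth and quality trade-offs are managed on the transmit side.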