EP-4476727-B1 - OPTIMIZING PERSONAL VAD FOR ON-DEVICE SPEECH RECOGNITION
Inventors
- DING, Shaojin
- RIKHYE, Rajeev
- LIANG, Qiao
- HE, Yanzhang
- WANG, Quan
- NARAYANAN, Arun
- O'MALLEY, Tom
- MCGRAW, Ian
Dates
- Publication Date
- 2026-05-06
- Application Date
- 2023-03-17
Claims (15)
- A personal voice activity detector, VAD, (300) comprising: a stack of multi-headed self-attention blocks (340) configured to: receive, as input, a sequence of acoustic frames (110) corresponding to an utterance; and generate, as output, a reference speaker embedding (342) for the utterance; a feature-wise linear modulation, FiLM, generator (330) configured to: receive, as input, a target speaker embedding (412) for a target speaker; and generate, as output, FiLM parameters (335) comprising a scaling vector (332) and a shifting vector (334) based on the target speaker embedding (412); a FiLM layer (350) configured to: receive, as input, the reference speaker embedding (342) and the FiLM parameters (335); and generate, as output, an affine transformation output (352) that scales and shifts the reference speaker embedding (342) based on the FiLM parameters (335); and a classifier (360) configured to generate a classification output (362) indicating whether the utterance was spoken by the target speaker based on the affine transformation output (352).
- The personal VAD (300) of claim 1, wherein the classification output (362) comprises at least one of: a target speaker token; a non-target speaker token; or a non-speech token.
- The personal VAD (300) of claim 1 or 2, further comprising a speaker pre-net (310) configured to: receive, as input, the sequence of acoustic frames (110); and generate, as output, a speaker information embedding (312) extracted from the sequence of acoustic frames (110), optionally wherein the FiLM generator (330) is further configured to: receive, as input, a cosine similarity (322) between the target speaker embedding (412) and the speaker information embedding (312); and generate, as output, the FiLM parameters (335) comprising the scaling vector (332) and the shifting vector (334) based on the cosine similarity (322), and/or, wherein the speaker pre-net (310) comprises a stack of multi-headed self-attention layers comprising one or more Conformer layers.
- The personal VAD (300) of any of claims 1-3, wherein the stack of multi-headed self-attention blocks (340) comprises one or more Conformer layers.
- The personal VAD (300) of any of claims 1-4, wherein the classifier (360) comprises a fully-connected layer, and/or wherein the personal VAD (300) operates in a streaming fashion.
- The personal VAD (300) of any of claims 1-5, further comprising a pre-trained text-independent speaker recognition model (410) configured to: receive, as input, enrollment utterances (402) spoken by the target speaker; and generate, as output, the target speaker embedding (412) for the target speaker based on the enrollment utterances (402).
- The personal VAD (300) of any of claims 1-6, wherein the personal VAD (300) is trained on training data comprising: an enrollment training utterance paired with the target speaker embedding (412); and a non-enrollment training utterance not paired with any corresponding target speaker embedding.
- A computer-implemented method (500) that, when executed by data processing hardware (710), causes the data processing hardware (710) to perform operations comprising: receiving, as input to a personal voice activity detector, VAD, (300), a sequence of acoustic frames (110) corresponding to an utterance; generating, using a stack of multi-headed self-attention blocks (340) of the personal VAD (300), a reference speaker embedding (342) for the utterance; receiving, as input to a feature-wise linear modulation, FiLM, generator (330) of the personal VAD (300), a target speaker embedding (412) for a target speaker; generating, using the FiLM generator (330), FiLM parameters (335) comprising a scaling vector (332) and a shifting vector (334) based on the target speaker embedding (412); generating, using a FiLM layer (350) of the personal VAD (300), an affine transformation output (352) that scales and shifts the reference speaker embedding (342) based on the FiLM parameters (335); and generating, using a classifier (360) of the personal VAD (300), a classification output (362) indicating whether the utterance was spoken by the target speaker based on the affine transformation output (352).
- The computer-implemented method (500) of claim 8, wherein the classification output (362) comprises at least one of: a target speaker token; a non-target speaker token; or a non-speech token.
- The computer-implemented method (500) of claim 8 or 9, wherein the operations further comprise generating, using a speaker pre-net (310) of the personal VAD (300), a speaker information embedding (312) extracted from the sequence of acoustic frames (110), optionally wherein the operations further comprise generating, using the FiLM generator (330), the FiLM parameters (335) based on a cosine similarity (322) between the target speaker embedding (412) and the speaker information embedding (312), and/or wherein the speaker pre-net (310) comprises a stack of multi-headed self-attention layers comprising one or more Conformer layers.
- The computer-implemented method (500) of any of claims 8-10, wherein the stack of multi-headed self-attention blocks (340) comprises one or more Conformer layers.
- The computer-implemented method (500) of any of claims 8-11, wherein the classifier (360) comprises a fully-connected layer.
- The computer-implemented method (500) of any of claims 8-12, wherein the personal VAD (300) operates in a streaming fashion.
- The computer-implemented method (500) of any of claims 8-13, wherein the operations further comprise: receiving, as input to a pre-trained text-independent speaker recognition model (410), enrollment utterances (402) spoken by the target speaker; and generating, using the pre-trained text-independent speaker recognition model (410), the target speaker embedding (412) for the target speaker based on the enrollment utterances (402).
- The computer-implemented method (500) of any of claims 8-14, wherein the personal VAD (300) is trained on training data comprising: an enrollment training utterance paired with the target speaker embedding (412); and a non-enrollment training utterance not paired with any corresponding target speaker embedding.
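For orientation, a minimal sketch of the detector recited in claims 1 and 8 follows. This is an illustrative reading rather than the claimed implementation: it is written in Python with PyTorch, standard Transformer encoder layers stand in for the claimed multi-headed self-attention blocks (340) (the dependent claims recite Conformer layers), and all module names and dimensions are assumptions.

import torch
import torch.nn as nn

class FiLMGenerator(nn.Module):
    """FiLM generator (330): maps a target speaker embedding (412) to FiLM
    parameters (335), i.e., a scaling vector (332) and a shifting vector (334)."""
    def __init__(self, spk_dim: int, feat_dim: int):
        super().__init__()
        self.scale = nn.Linear(spk_dim, feat_dim)  # produces gamma
        self.shift = nn.Linear(spk_dim, feat_dim)  # produces beta

    def forward(self, spk_emb: torch.Tensor):
        return self.scale(spk_emb), self.shift(spk_emb)

class PersonalVAD(nn.Module):
    def __init__(self, feat_dim=80, model_dim=256, spk_dim=256, num_blocks=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, model_dim)
        # Stack of multi-headed self-attention blocks (340); Transformer
        # encoder layers are used here as a stand-in for Conformer layers.
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_blocks)
        self.film_gen = FiLMGenerator(spk_dim, model_dim)
        # Classifier (360): a fully-connected layer over three output tokens:
        # target speaker, non-target speaker, non-speech.
        self.classifier = nn.Linear(model_dim, 3)

    def forward(self, frames: torch.Tensor, target_emb: torch.Tensor):
        # frames: (batch, time, feat_dim) acoustic frames (110)
        ref = self.encoder(self.proj(frames))    # reference speaker embedding (342)
        gamma, beta = self.film_gen(target_emb)  # FiLM parameters (335)
        # FiLM layer (350): affine transformation output (352) that scales
        # and shifts the reference embedding per the FiLM parameters.
        film_out = gamma.unsqueeze(1) * ref + beta.unsqueeze(1)
        return self.classifier(film_out)         # classification output (362)

if __name__ == "__main__":
    vad = PersonalVAD()
    logits = vad(torch.randn(2, 100, 80), torch.randn(2, 256))
    print(logits.shape)  # torch.Size([2, 100, 3]): per-frame token logits

The per-frame logits reflect the frame-level operation of the detector; an implementation targeting the streaming behavior of claims 5 and 13 could replace the Transformer stand-in with causal Conformer blocks.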
Description
TECHNICAL FIELD
This disclosure relates to optimizing a personal voice activity detector for on-device speech recognition.
BACKGROUND
Speech-enabled devices have increased in popularity over the past several years. One challenge for speech-enabled devices is the ability to discern between background noise from the surrounding environment and speech directed towards the device. In some instances, speech-enabled devices further determine whether speech directed towards the device was spoken by a particular user or another user. This ability allows the device to decide whether to further process the audio (e.g., to process a command or query) or simply to ignore the received audio. Discerning between background noise and speech spoken by a particular user becomes even more difficult when considering the latency and computational constraints of speech-enabled devices in a production environment.
The following documents have been cited during examination of the present patent:
- SHAOJIN DING ET AL: "Personal VAD: Speaker-Conditioned Voice Activity Detection", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 12 August 2019 (2019-08-12)
- IVAN MEDENNIKOV ET AL: "Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 27 July 2020 (2020-07-27)
- NAOKI MAKISHIMA ET AL: "Enrollment-less training for personalized voice activity detection", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 23 June 2021 (2021-06-23)
- MAENG JOON GYU ET AL: "Personality Enhancement for Speaker-dependent Voice Activity Detection", 2021 INTERNATIONAL CONFERENCE ON INFORMATION AND COMMUNICATION TECHNOLOGY CONVERGENCE (ICTC), IEEE, 20 October 2021 (2021-10-20), pages 535-538, DOI: 10.1109/ICTC52510.2021.9621038
- O'MALLEY TOM ET AL: "A Conformer-Based ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement and Speech Separation", 2021 IEEE AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING WORKSHOP (ASRU), IEEE, 13 December 2021 (2021-12-13), pages 304-311, DOI: 10.1109/ASRU51503.2021.9687942
- ETHAN PEREZ ET AL: "FiLM: Visual Reasoning with a General Conditioning Layer", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 22 September 2017 (2017-09-22)
SUMMARY
One aspect of the disclosure provides a personal voice activity detector (VAD). The personal VAD includes a stack of multi-headed self-attention blocks configured to receive, as input, a sequence of acoustic frames corresponding to an utterance and generate, as output, a reference speaker embedding for the utterance. The personal VAD also includes a feature-wise linear modulation (FiLM) generator configured to receive, as input, a target speaker embedding for a target speaker and generate, as output, FiLM parameters that include a scaling vector and a shifting vector based on the target speaker embedding. The personal VAD also includes a FiLM layer configured to receive, as input, the reference speaker embedding and the FiLM parameters and generate, as output, an affine transformation output that scales and shifts the reference speaker embedding based on the FiLM parameters.
The personal VAD also includes a classifier configured to generate a classification output indicating whether the utterance was spoken by the target speaker based on the affine transformation output.
Implementations of the disclosure may include one or more of the following optional features. In some implementations, the classification output includes at least one of a target speaker token, a non-target speaker token, or a non-speech token. In some examples, the personal VAD further includes a speaker pre-net configured to receive the sequence of acoustic frames as input and generate, as output, a speaker information embedding extracted from the sequence of acoustic frames. In these examples, the FiLM generator may be further configured to receive, as input, a cosine similarity between the target speaker embedding and the speaker information embedding and generate, as output, the FiLM parameters that include the scaling vector and the shifting vector based on the cosine similarity. Here, the speaker pre-net includes a stack of multi-headed self-attention layers that include one or more Conformer layers. The stack of multi-headed self-attention blocks may include one or more Conformer layers. In some implementations, the classifier includes a fully-connected layer. The personal VAD may operate in a streaming fashion. In some examples, the personal VAD further includes a pre-trained text-independent speaker recognition model configured to receive enrollment utterances spoken by the target speaker as input and generate, as output, the target speaker embedding for the target speaker based on the enrollment utterances.
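The optional cosine-similarity conditioning and the enrollment path described above can likewise be sketched. In the example below, the frame-wise application of the cosine similarity (322) and the convention of averaging per-utterance embeddings into the target speaker embedding (412) are illustrative assumptions; the disclosure recites only that the FiLM parameters are based on the cosine similarity and that the target speaker embedding is based on the enrollment utterances.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineFiLMGenerator(nn.Module):
    """FiLM generator (330) variant driven by a cosine similarity (322)
    between the target speaker embedding (412) and the speaker information
    embedding (312) produced by the speaker pre-net (310)."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.scale = nn.Linear(1, feat_dim)
        self.shift = nn.Linear(1, feat_dim)

    def forward(self, target_emb: torch.Tensor, speaker_info_emb: torch.Tensor):
        # target_emb: (batch, dim); speaker_info_emb: (batch, time, dim)
        cos = F.cosine_similarity(speaker_info_emb,
                                  target_emb.unsqueeze(1), dim=-1)  # (batch, time)
        cos = cos.unsqueeze(-1)                                     # (batch, time, 1)
        return self.scale(cos), self.shift(cos)  # frame-wise gamma and beta

def enroll(speaker_model: nn.Module, enrollment_utterances) -> torch.Tensor:
    """Derives the target speaker embedding (412) from enrollment utterances
    (402) using a pre-trained text-independent speaker recognition model (410).
    Averaging the per-utterance embeddings is an assumed convention."""
    with torch.no_grad():
        embs = [speaker_model(u) for u in enrollment_utterances]
    return F.normalize(torch.stack(embs).mean(dim=0), dim=-1)

if __name__ == "__main__":
    gen = CosineFiLMGenerator(feat_dim=256)
    gamma, beta = gen(torch.randn(2, 256), torch.randn(2, 100, 256))
    print(gamma.shape, beta.shape)  # each (2, 100, 256)

Frame-wise FiLM parameters here replace the utterance-level parameters of the base sketch; either granularity is compatible with the affine transformation performed by the FiLM layer.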