
EP-4295517-B1 - HYBRID MULTILINGUAL TEXT-DEPENDENT AND TEXT-INDEPENDENT SPEAKER VERIFICATION

EP 4295517 B1

Inventors

  • Chojnacka, Roza
  • Pelecanos, Jason
  • Wang, Quan
  • Lopez Moreno, Ignacio

Dates

Publication Date
2026-05-06
Application Date
2022-03-09

Claims (14)

  1. A computer-implemented method (400) for speaker verification that, when executed on data processing hardware (510), causes the data processing hardware (510) to perform operations comprising:
     receiving audio data (120) corresponding to an utterance (119) captured by a user device (102), the utterance (119) comprising a predetermined hotword followed by a query specifying an action to perform;
     processing, using a text-dependent speaker verification (TD-SV) model (212), a first portion (121) of the audio data (120) that characterizes the predetermined hotword to generate a text-dependent evaluation vector (214) representing voice characteristics of the utterance (119) of the hotword;
     generating one or more text-dependent confidence scores (215) each indicating a likelihood that the text-dependent evaluation vector (214) matches a respective one of one or more text-dependent reference vectors (252), each text-dependent reference vector (252) associated with a respective one of one or more different enrolled users (10) of the user device (102);
     determining whether any of the one or more text-dependent confidence scores (215) satisfy a confidence threshold; and
     one of:
       when one of the text-dependent confidence scores (215) satisfies the confidence threshold:
         identifying a speaker of the utterance (119) as the respective enrolled user (10) that is associated with the text-dependent reference vector (252) corresponding to the text-dependent confidence score (215) that satisfies the confidence threshold; and
         initiating performance of the action specified by the query without performing speaker verification on a second portion (122) of the audio data (120) that characterizes the query following the predetermined hotword; or
       when none of the one or more text-dependent confidence scores (215) satisfies the confidence threshold, providing an instruction to a text-independent speaker verifier (220), the instruction, when received by the text-independent speaker verifier (220), causing the text-independent speaker verifier (220) to:
         process, using a text-independent speaker verification (TI-SV) model (222), the second portion (122) of the audio data (120) that characterizes the query to generate a text-independent evaluation vector (224);
         generate one or more text-independent confidence scores (225) each indicating a likelihood that the text-independent evaluation vector (224) matches a respective one of one or more text-independent reference vectors (254), each text-independent reference vector (254) associated with a respective one of the one or more different enrolled users (10) of the user device (102); and
         determine, based on the one or more text-dependent confidence scores (215) and the one or more text-independent confidence scores (225), whether the identity of the speaker that spoke the utterance (119) includes any of the one or more different enrolled users (10) of the user device (102);
     wherein:
       the TD-SV model (212) and the TI-SV model (222) are trained on a plurality of training data sets (310), each training data set (310) associated with a different respective language or dialect and comprising corresponding training utterances (320) spoken in the respective language or dialect by different speakers, each corresponding training utterance (320) comprising a text-dependent portion characterizing the predetermined hotword and a text-independent portion characterizing a query statement that follows the predetermined hotword;
       the TD-SV model (212) is trained on the text-dependent portion of each corresponding training utterance (320) in each training data set (310) of the plurality of training data sets (310); and
       the TI-SV model (222) is trained on the text-independent portion of each corresponding training utterance (320) in each training data set (310) of the plurality of training data sets (310).
  2. The computer-implemented method (400) of claim 1, wherein: each of the one or more different enrolled users (10) of the user device (102) has permissions for accessing a different respective set of personal resources; and performance of the action specified by the query requires access to the respective set of personal resources associated with the respective enrolled user (10) identified as the speaker of the utterance (119).
  3. The computer-implemented method (400) of claim 1 or 2, wherein: the data processing hardware (510) executes the TD-SV model (212) and resides on the user device (102); and the text-independent speaker verifier (220) executes the TI-SV model (222) and resides on a distributed computing system (111) in communication with the user device (102) via a network.
  4. The computer-implemented method (400) of claim 3, wherein, when none of the one or more text-dependent confidence scores (215) satisfy the confidence threshold, providing the instruction to the text-independent speaker verifier (220) comprises transmitting the instruction and the one or more text-dependent confidence scores (215) from the user device (102) to the distributed computing system (111).
  5. The computer-implemented method (400) of claim 1 or 2, wherein the data processing hardware (510) resides on one of the user device (102) or a distributed computing system (111) in communication with the user device (102) via a network, the data processing hardware (510) executing both the TD-SV model (212) and the TI-SV model (222).
  6. The computer-implemented method (400) of any of claims 1-5, wherein the TI-SV model (222) is more computationally intensive than the TD-SV model (212).
  7. The computer-implemented method (400) of any of claims 1-6, wherein the operations further comprise: detecting, using a hotword detection model (110), the predetermined hotword in the audio data (120) that precedes the query, wherein the first portion (121) of the audio data (120) that characterizes the predetermined hotword is extracted by the hotword detection model (110).
  8. The computer-implemented method (400) of any preceding claim, wherein the corresponding training utterances (320) spoken in the respective language or dialect associated with at least one of the training data sets (310) pronounce the predetermined hotword differently than the corresponding training utterances (320) of the other training data sets (310).
  9. The computer-implemented method (400) of any preceding claim, wherein the TI-SV model (222) is trained on the text-dependent portion of at least one corresponding training utterance (320) in one or more of the plurality of training data sets (310).
  10. The computer-implemented method (400) of any preceding claim, wherein the query statements characterized by the text-independent portions of the training utterances (320) comprise variable linguistic content.
  11. The computer-implemented method (400) of any of claims 1-10, wherein, when generating the text-independent evaluation vector (224), the text-independent speaker verifier (220) uses the TI-SV model (222) to process both the first portion (121) of the audio data (120) that characterizes the predetermined hotword and the second portion (122) of the audio data (120) that characterizes the query.
  12. The computer-implemented method (400) of any of claims 1-11, wherein each of the one or more text-dependent reference vectors (252) is generated by the TD-SV model (212) in response to receiving one or more previous utterances (119) of the predetermined hotword spoken by the respective one of the one or more different enrolled users (10) of the user device (102).
  13. The computer-implemented method (400) of any of claims 1-12, wherein each of the one or more text-independent reference vectors (254) is generated by the TI-SV model (222) in response to receiving one or more previous utterances (119) spoken by the respective one of the one or more different enrolled users (10) of the user device (102).
  14. A system (100) comprising: data processing hardware (510); and memory hardware (720) in communication with the data processing hardware (510), the memory hardware (720) storing instructions that when executed on the data processing hardware (510) cause the data processing hardware (510) to perform the method of any preceding claim.
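
Claims 12 and 13 specify that each reference vector is generated by the corresponding model from one or more previous utterances spoken by the enrolled user, but they do not prescribe how multiple enrollment utterances are combined. One common convention, assumed here purely for illustration (the names `model`, `l2_normalize`, and `make_reference_vector` are hypothetical), is to average the L2-normalized embedding of each enrollment utterance:

```python
import math

def l2_normalize(vec):
    # Scale the embedding to unit length before averaging.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def make_reference_vector(model, enrollment_utterances):
    # Embed each previous enrollment utterance with the speaker
    # verification model, normalize, and average into one reference
    # vector for the enrolled user. The averaging step is an assumed
    # convention, not something the claims mandate.
    embeddings = [l2_normalize(model(u)) for u in enrollment_utterances]
    dim = len(embeddings[0])
    count = len(embeddings)
    return [sum(e[i] for e in embeddings) / count for i in range(dim)]
```

The same sketch applies to both the text-dependent reference vectors of claim 12 (hotword utterances through the TD-SV model) and the text-independent reference vectors of claim 13 (arbitrary utterances through the TI-SV model).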

Description

TECHNICAL FIELD

This disclosure relates to hybrid multilingual text-dependent and text-independent speaker verification.

BACKGROUND

In a speech-enabled environment, such as a home or automobile, a user may access information and/or control various functions using voice input. The information and/or functions may be personalized for a given user. It may therefore be advantageous to identify a given speaker from among a group of speakers associated with the speech-enabled environment. Speaker verification (e.g., voice authentication) provides an easy way for a user to gain access to a user device: the user can unlock and access the device by speaking an utterance, without having to manually enter (e.g., type) a passcode. However, the existence of multiple different languages, dialects, accents, and the like presents certain challenges for speaker verification.

WO2020117639A2 describes automatically updating a speaker embedding for a particular user based on previous utterances by that user. Additionally or alternatively, it describes verifying that a particular user spoke an utterance using output generated by both a text-independent speaker recognition model and a text-dependent speaker recognition model, and it further describes prefetching content for several users associated with a spoken utterance prior to determining which user spoke it. EP3373294A1 describes the training of a language-independent speaker verification model, and in particular the training of a text-dependent model trained to identify a speaker based on utterance of a predetermined hotword.

SUMMARY

The invention is defined in the appended claims.
One aspect of the present disclosure provides a computer-implemented method for speaker verification that, when executed on data processing hardware, causes the data processing hardware to perform operations that include receiving audio data corresponding to an utterance captured by a user device. The utterance includes a predetermined hotword followed by a query specifying an action to perform. The operations also include processing, using a text-dependent speaker verification (TD-SV) model, a first portion of the audio data that characterizes the predetermined hotword to generate a text-dependent evaluation vector representing voice characteristics of the utterance of the hotword, and generating one or more text-dependent confidence scores. Each text-dependent confidence score indicates a likelihood that the text-dependent evaluation vector matches a respective one of one or more text-dependent reference vectors, and each text-dependent reference vector is associated with a respective one of one or more different enrolled users of the user device. The operations further include determining whether any of the one or more text-dependent confidence scores satisfy a confidence threshold. When one of the text-dependent confidence scores satisfies the confidence threshold, the operations include identifying a speaker of the utterance as the respective enrolled user that is associated with the text-dependent reference vector corresponding to the text-dependent confidence score that satisfies the confidence threshold, and initiating performance of the action specified by the query without performing speaker verification on a second portion of the audio data that characterizes the query following the hotword. When none of the one or more text-dependent confidence scores satisfies the confidence threshold, the operations include providing an instruction to a text-independent speaker verifier.
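
The gating flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the embedding models are stand-in callables, cosine similarity is assumed as the scoring function, and the equal-weight average of text-dependent and text-independent scores is just one possible realization of the combined determination. All function and parameter names (`verify_speaker`, `td_refs`, `threshold`, etc.) are hypothetical.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors (assumed scoring
    # function; the disclosure does not fix a particular metric).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def verify_speaker(hotword_audio, query_audio, td_model, ti_model,
                   td_refs, ti_refs, threshold=0.85):
    # Cheap text-dependent check on the hotword portion first.
    td_vec = td_model(hotword_audio)
    td_scores = {user: cosine(td_vec, ref) for user, ref in td_refs.items()}
    best = max(td_scores, key=td_scores.get)
    if td_scores[best] >= threshold:
        # Confident TD match: identify the speaker and skip
        # text-independent verification of the query entirely.
        return best
    # Fallback: text-independent verification on the query portion only.
    ti_vec = ti_model(query_audio)
    ti_scores = {user: cosine(ti_vec, ref) for user, ref in ti_refs.items()}
    # Combine both score sets; an equal-weight average is an assumed
    # realization of the claimed combined determination.
    combined = {u: 0.5 * td_scores[u] + 0.5 * ti_scores[u] for u in td_scores}
    best = max(combined, key=combined.get)
    return best if combined[best] >= threshold else None
```

The design point the claims emphasize is visible in the early return: when the hotword alone yields a confident match, the more computationally intensive TI-SV model (claim 6) is never invoked.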
The instruction, when received by the text-independent speaker verifier, causes the text-independent speaker verifier to process, using a text-independent speaker verification (TI-SV) model, the second portion of the audio data that characterizes the query to generate a text-independent evaluation vector. The text-independent speaker verifier also generates one or more text-independent confidence scores, each indicating a likelihood that the text-independent evaluation vector matches a respective one of one or more text-independent reference vectors. Each text-independent reference vector is associated with a respective one of the one or more different enrolled users of the user device. The text-independent speaker verifier then determines, based on the one or more text-dependent confidence scores and the one or more text-independent confidence scores, whether the identity of the speaker that spoke the utterance includes any of the one or more different enrolled users of the user device.

Implementations of the disclosure may include one or more of the following optional features. In some implementations, each of the one or more different enrolled users of the user device has permissions for accessing a different respective set of personal resources, and performance of the action