
EP-4463852-B1 - APPARATUS AND METHOD FOR SPEAKER VERIFICATION FOR VOICE ASSISTANT


Inventors

  • KIM, MYUNGJONG
  • KI, TAEYEON
  • TSENG, CINDY SUSHEN
  • PONAKALA, SRINIVASA RAO
  • APSINGEKAR, VIJENDRA RAJ

Dates

Publication Date
2026-05-06
Application Date
2023-06-01

Claims (15)

  1. A method comprising: obtaining, using at least one processing device of an electronic device, audio data; identifying, using the at least one processing device, an utterance of a wake word or wake phrase in the audio data; generating, using the at least one processing device, an embedding vector based on the utterance from the audio data; accessing, using the at least one processing device, a set of previously-generated vectors representing previous utterances of the wake word or wake phrase; performing, using the at least one processing device, clustering on the embedding vector and the set of previously-generated vectors to identify a cluster including the embedding vector, the identified cluster associated with a speaker; updating, using the at least one processing device, a speaker vector associated with the speaker based on the embedding vector; determining, using the at least one processing device and a speaker verification model, a similarity score between the updated speaker vector and the embedding vector; and determining, using the at least one processing device and based on the similarity score, whether a speaker providing the utterance matches the speaker associated with the identified cluster.
  2. The method of Claim 1, wherein determining whether the speaker providing the utterance matches the speaker associated with the identified cluster includes comparing, using the at least one processing device, the similarity score to a similarity threshold value.
  3. The method of Claim 2, wherein a first value is used for the similarity threshold value if a number of the clustered utterances for the speaker is less than a threshold number of utterances.
  4. The method of Claim 3, wherein a second value greater than the first value is used for the similarity threshold value if a number of the clustered utterances for the speaker is equal to or greater than the threshold number of utterances.
  5. The method of Claim 4, wherein: when the first value is used for the similarity threshold value, the determining of the similarity score is performed using keyword verification to detect similarities between the wake word or wake phrase in the speaker vector and in the embedding vector; and when the second value is used for the similarity threshold value, the determining of the similarity score is performed using speaker verification to detect similarities between the speaker providing the utterance and the speaker associated with the identified cluster.
  6. The method of Claim 1, further comprising: determining, using the at least one processing device, a probability that the utterance is an intentional utterance; and performing the clustering, using the at least one processing device, in response to the probability being greater than an intention threshold value.
  7. The method of Claim 6, further comprising: training, using the at least one processing device, a false-trigger mitigation model to determine the probability that the utterance is an intentional utterance, wherein training the false-trigger mitigation model includes: providing, using the at least one processing device, one or more of a set of full-length audio samples, a set of chunk-wise utterances, and audio samples including one or more wake words or phrases to the false-trigger mitigation model.
  8. The method of Claim 6, further comprising: in response to the probability being greater than the intention threshold value, determining, using the at least one processing device, an audio quality score associated with the utterance; and performing the clustering, using the at least one processing device, in response to the audio quality score being greater than an audio quality threshold.
  9. The method of Claim 8, further comprising: training, using the at least one processing device, an audio quality classification model to determine the audio quality score associated with the utterance, wherein training the audio quality classification model includes: providing, using the at least one processing device, clean audio sample data of a first predetermined duration and noisy audio sample data of a second predetermined duration to the audio quality classification model.
  10. The method of Claim 1, further comprising: training, using the at least one processing device, the speaker verification model, wherein training the speaker verification model includes: providing a sample embedding vector based on an utterance from sample audio data, wherein the utterance includes the wake word or wake phrase; providing, using the at least one processing device, a sample speaker vector; receiving, using the at least one processing device from the speaker verification model, a result including a similarity score between the sample speaker vector and the sample embedding vector; and using a loss function and modifying the speaker verification model based on the result.
  11. An apparatus comprising: at least one processing device configured to: obtain audio data; identify an utterance of a wake word or wake phrase in the audio data; generate an embedding vector based on the utterance from the audio data; access a set of previously-generated vectors representing previous utterances of the wake word or wake phrase; perform clustering on the embedding vector and the set of previously-generated vectors to identify a cluster including the embedding vector, the identified cluster associated with a speaker; update a speaker vector associated with the speaker based on the embedding vector; determine, using a speaker verification model, a similarity score between the updated speaker vector and the embedding vector; and determine, based on the similarity score, whether a speaker providing the utterance matches the speaker associated with the identified cluster.
  12. The apparatus of Claim 11, wherein, to determine whether the speaker providing the utterance matches the speaker associated with the identified cluster, the at least one processing device is further configured to compare the similarity score to a similarity threshold value.
  13. The apparatus of Claim 12, wherein a first value is used for the similarity threshold value if a number of the clustered utterances for the speaker is less than a threshold number of utterances.
  14. The apparatus of Claim 13, wherein a second value greater than the first value is used for the similarity threshold value if a number of the clustered utterances for the speaker is equal to or greater than the threshold number of utterances.
  15. A computer readable medium containing instructions that when executed cause at least one processor to perform the method according to any one of claims 1-10.
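
For illustration only: Claims 1 and 11 recite a pipeline that embeds a wake-word utterance, clusters it against previously-generated vectors, updates the matched cluster's speaker vector, and scores the embedding against that updated vector. The patent does not disclose a specific clustering algorithm, similarity metric, or threshold values, so the following Python sketch assumes nearest-centroid clustering with cosine similarity standing in for the claimed speaker verification model; every name and constant in it is a hypothetical stand-in, not language from the claims.

```python
# A minimal sketch of the pipeline recited in Claims 1 and 11. Everything
# here is a hypothetical illustration: nearest-centroid clustering and
# cosine similarity are assumptions, as are both threshold constants.
import numpy as np

SIMILARITY_THRESHOLD = 0.75    # assumed accept/reject threshold
NEW_CLUSTER_SIMILARITY = 0.50  # assumed floor for joining an existing cluster


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def verify_speaker(embedding: np.ndarray, clusters: list[dict]) -> bool:
    """Cluster a wake-word embedding, update the matched cluster's speaker
    vector, and decide whether the utterance matches that speaker."""
    if not clusters:
        # First utterance ever seen: open a cluster, nothing to verify yet.
        clusters.append({"vectors": [embedding], "speaker_vector": embedding})
        return False

    # Clustering step: pick the cluster whose running speaker vector
    # (mean of its previous wake-word embeddings) is most similar.
    best = max(clusters,
               key=lambda c: cosine_similarity(embedding, c["speaker_vector"]))
    if cosine_similarity(embedding, best["speaker_vector"]) < NEW_CLUSTER_SIMILARITY:
        # Too dissimilar to every known speaker: treat as a new speaker.
        clusters.append({"vectors": [embedding], "speaker_vector": embedding})
        return False

    # Update step: fold the new embedding into the speaker vector.
    best["vectors"].append(embedding)
    best["speaker_vector"] = np.mean(best["vectors"], axis=0)

    # Verification step: score the embedding against the updated vector.
    score = cosine_similarity(best["speaker_vector"], embedding)
    return score >= SIMILARITY_THRESHOLD
```

A quick usage example under the same assumptions: the first utterance only opens a cluster, while a second, near-identical embedding joins it and verifies.

```python
clusters: list[dict] = []
voice = np.random.default_rng(0).normal(size=128)
print(verify_speaker(voice, clusters))         # False: no history yet
print(verify_speaker(voice + 0.01, clusters))  # True: matches the stored speaker
```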

Description

Technical Field

This disclosure relates generally to machine learning systems. More specifically, this disclosure relates to a system and method for speaker verification for a voice assistant.

Background Art

Voice assistants such as BIXBY, SIRI, and ALEXA allow for voice enrollment by users as a step to collect a user voice profile. This involves asking users to record voice samples, which can be a tedious task and a significant barrier to onboarding new users. Consequently, to improve the user experience and new-user registration completion rates, a recent trend has been to not require users to carry out voice enrollment. Although this eliminates the tedious enrollment task, it can degrade voice wake-up performance and increase the number of invalid wake-ups, which refer to instances where a voice assistant is activated even though the user has not requested or intended for it to do so. An exemplary approach for speaker verification is disclosed in US 2021/0326421 A1.

Disclosure of Invention

Solution to Problem

This disclosure relates to a system and method for speaker verification for a voice assistant.

An embodiment of the disclosure may provide a method. The method includes obtaining, using at least one processing device of an electronic device, audio data. The method also includes identifying, using the at least one processing device, an utterance of a wake word or wake phrase in the audio data. The method further includes generating, using the at least one processing device, an embedding vector based on the utterance from the audio data. The method also includes accessing, using the at least one processing device, a set of previously-generated vectors representing previous utterances of the wake word or wake phrase. The method further includes performing, using the at least one processing device, clustering on the embedding vector and the set of previously-generated vectors to identify a cluster including the embedding vector, where the identified cluster is associated with a speaker. The method also includes updating, using the at least one processing device, a speaker vector associated with the speaker based on the embedding vector. The method further includes determining, using the at least one processing device and a speaker verification model, a similarity score between the updated speaker vector and the embedding vector. In addition, the method includes determining, using the at least one processing device and based on the similarity score, whether a speaker providing the utterance matches the speaker associated with the identified cluster.

An embodiment of the disclosure may provide an apparatus. The apparatus includes at least one processing device configured to obtain audio data. The at least one processing device is also configured to identify an utterance of a wake word or wake phrase in the audio data. The at least one processing device is further configured to generate an embedding vector based on the utterance from the audio data. The at least one processing device is also configured to access a set of previously-generated vectors representing previous utterances of the wake word or wake phrase. The at least one processing device is further configured to perform clustering on the embedding vector and the set of previously-generated vectors to identify a cluster including the embedding vector, where the identified cluster is associated with a speaker. The at least one processing device is also configured to update a speaker vector associated with the speaker based on the embedding vector. The at least one processing device is further configured to determine, using a speaker verification model, a similarity score between the updated speaker vector and the embedding vector. In addition, the at least one processing device is configured to determine, based on the similarity score, whether a speaker providing the utterance matches the speaker associated with the identified cluster.

An embodiment of the disclosure may provide a computer readable medium. The computer readable medium contains instructions that when executed cause at least one processor to obtain audio data. The medium also contains instructions that when executed cause the at least one processor to identify an utterance of a wake word or wake phrase in the audio data. The medium further contains instructions that when executed cause the at least one processor to generate an embedding vector based on the utterance from the audio data. The medium also contains instructions that when executed cause the at least one processor to access a set of previously-generated vectors representing previous utterances of the wake word or wake phrase. The medium further contains instructions that when executed cause the at least one processor to perform clustering on the embedding vector and the set of previously-generated vectors to identify a cluster including the embedding vector, where the identified cluster is associated with a speaker. The medium also contains instructions that when executed cause the at least one processor to update a speaker vector associated with the speaker based on the embedding vector. The medium further contains instructions that when executed cause the at least one processor to determine, using a speaker verification model, a similarity score between the updated speaker vector and the embedding vector. In addition, the medium contains instructions that when executed cause the at least one processor to determine, based on the similarity score, whether a speaker providing the utterance matches the speaker associated with the identified cluster.
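
Again for illustration only: the dependent claims add gating and an adaptive threshold on top of this pipeline. Claims 6 and 8 cluster an utterance only when a false-trigger mitigation model deems it likely intentional and an audio quality model scores it highly enough, and Claims 2 through 5 compare the similarity score against a looser keyword-verification threshold until enough utterances have been clustered for the speaker, then against a stricter speaker-verification threshold. The sketch below makes that control flow concrete under assumed constants; the patent specifies none of these values.

```python
# Hypothetical illustration of the dependent-claim logic (Claims 2-8).
# None of the constants below appear in the patent; they are assumptions
# chosen only to make the control flow concrete.

FIRST_THRESHOLD = 0.60      # looser: keyword verification (Claims 3 and 5)
SECOND_THRESHOLD = 0.80     # stricter: speaker verification (Claims 4 and 5)
MIN_UTTERANCES = 5          # assumed "threshold number of utterances"
INTENTION_THRESHOLD = 0.50  # assumed intention threshold (Claim 6)
QUALITY_THRESHOLD = 0.50    # assumed audio quality threshold (Claim 8)


def should_cluster(intent_probability: float, audio_quality_score: float) -> bool:
    """Claims 6 and 8: cluster only utterances that are likely intentional
    (false-trigger mitigation) and have acceptable audio quality."""
    return (intent_probability > INTENTION_THRESHOLD
            and audio_quality_score > QUALITY_THRESHOLD)


def similarity_threshold(num_clustered_utterances: int) -> float:
    """Claims 3-5: use the looser keyword-verification threshold until enough
    utterances have been clustered, then the stricter speaker-verification one."""
    if num_clustered_utterances < MIN_UTTERANCES:
        return FIRST_THRESHOLD
    return SECOND_THRESHOLD


def matches_speaker(similarity_score: float, num_clustered_utterances: int) -> bool:
    """Claim 2: accept the speaker when the score clears the active threshold."""
    return similarity_score >= similarity_threshold(num_clustered_utterances)
```

One way to read this design, consistent with the claims: early on, too few samples exist for a reliable voice profile, so the system only confirms that the right wake word was spoken; once the cluster is mature, it can afford the stricter per-speaker check.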