
EP-4742236-A2 - SPEAKER DIARIZATION


Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speaker diarization are disclosed. In one aspect, a method includes the actions of receiving audio data corresponding to an utterance. The actions further include determining that the audio data includes an utterance of a predefined hotword spoken by a first speaker. The actions further include identifying a first portion of the audio data that includes speech from the first speaker. The actions further include identifying a second portion of the audio data that includes speech from a second, different speaker. The actions further include transmitting the first portion of the audio data that includes speech from the first speaker and suppressing transmission of the second portion of the audio data that includes speech from the second, different speaker.
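The transmit-and-suppress behavior described in the abstract can be illustrated with a minimal sketch, assuming the audio has already been segmented and attributed to speakers by a diarization step; the segment format and function name here are invented for illustration and are not from the patent:

```python
def filter_for_transmission(segments, hotword_speaker_id):
    """Keep only audio segments attributed to the speaker who spoke the
    hotword; segments from other speakers are suppressed (dropped)
    rather than transmitted.

    segments: list of (speaker_id, samples) tuples, an assumed format.
    """
    return [samples for speaker_id, samples in segments
            if speaker_id == hotword_speaker_id]

# Hypothetical segmented audio: speaker 1 spoke the hotword.
segments = [
    (1, b"ok computer play music"),
    (2, b"dinner is ready"),
    (1, b"louder"),
]
transmitted = filter_for_transmission(segments, hotword_speaker_id=1)
# Only speaker 1's segments remain; speaker 2's speech is never sent.
```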

Inventors

  • KRACUN, Aleksander
  • ROSE, RICHARD CAMERON

Assignees

  • Google LLC

Dates

Publication Date
2026-05-13
Application Date
2018-08-29

Claims (11)

  1. A computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations comprising: obtaining previously collected speech data for a particular person; receiving audio data comprising: a first portion corresponding to a first utterance; and a second portion corresponding to a second utterance; processing, using a speaker identification model trained to recognize speech spoken by a particular person: the first portion of the audio data to identify that the particular person spoke the first utterance; and the second portion of the audio data to identify that a speaker other than the particular person spoke the second utterance; processing the audio data to generate a transcription of the first utterance and the second utterance; and updating the transcription by labeling the transcription with a first label indicating that the first utterance was spoken by the particular person.
  2. The method of claim 1, wherein updating the transcription further comprises updating the transcription by labeling the transcription with a second label indicating that the second utterance was spoken by the speaker other than the particular person.
  3. The method of any preceding claim, wherein the second label does not uniquely identify the speaker that spoke the second utterance.
  4. The method of any preceding claim, wherein the operations further comprise transmitting the updated transcription to a computing device associated with the particular person.
  5. The method of any preceding claim, wherein the operations further comprise transmitting the updated transcription to a computing device associated with the speaker that spoke the second utterance.
  6. The method of any preceding claim, wherein processing the first portion of the audio data using the speaker identification model comprises: determining that the first utterance comprises a predefined hotword; and identifying the particular person as the speaker that spoke the first utterance based on determining that the first utterance comprises the predefined hotword.
  7. The method of any preceding claim, wherein the operations further comprise: identifying an application running in a foreground of a computing device that captured the audio data, wherein updating the transcription is based on the identified application running in the foreground of the computing device.
  8. The method of any preceding claim, wherein updating the transcription further comprises updating the transcription by deleting a corresponding portion of the transcription that includes the second portion of the audio data.
  9. The method of claim 8, wherein the operations further comprise updating the audio data to remove the second portion of the audio data.
  10. The method of any preceding claim, wherein the data processing hardware resides on a server.
  11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising the method of any preceding claim.
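The labeling pipeline of claims 1 and 2 can be sketched as follows. This is a toy illustration only, assuming a speaker identification model that produces fixed-length embeddings compared by cosine similarity; the embeddings, threshold, and all names are invented for illustration and are not from the patent:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def label_transcription(segments, enrolled_embedding, threshold=0.8):
    """Label each transcribed utterance as the enrolled speaker or not.

    segments: list of (embedding, text) pairs, one per utterance.
    enrolled_embedding: embedding derived from previously collected
    speech of the particular person (claim 1).
    """
    labeled = []
    for emb, text in segments:
        if cosine(emb, enrolled_embedding) >= threshold:
            labeled.append(("particular person", text))   # first label (claim 1)
        else:
            labeled.append(("other speaker", text))       # second label (claim 2)
    return labeled

# Toy embeddings standing in for a real speaker identification model.
alice = np.array([1.0, 0.1, 0.0])
bob = np.array([0.0, 0.2, 1.0])
transcript = label_transcription(
    [(alice, "OK computer, play music"), (bob, "turn it up")],
    enrolled_embedding=np.array([0.9, 0.15, 0.05]),
)
```

Note that, per claim 3, the second label need not uniquely identify the other speaker; the sketch above accordingly uses a generic "other speaker" tag.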

Description

TECHNICAL FIELD

This specification generally relates to automated speech recognition.

BACKGROUND

The reality of a speech-enabled home or other environment - that is, one in which a user need only speak a query or command out loud and a computer-based system will field and answer the query and/or cause the command to be performed - is upon us. A speech-enabled environment (e.g., home, workplace, school, etc.) can be implemented using a network of connected microphone devices distributed throughout the various rooms or areas of the environment. Through such a network of microphones, a user has the power to orally query the system from essentially anywhere in the environment without the need to have a computer or other device in front of him/her or even nearby. For example, while cooking in the kitchen, a user might ask the system "how many milliliters in three cups?" and, in response, receive an answer from the system, e.g., in the form of synthesized voice output. Alternatively, a user might ask the system questions such as "when does my nearest gas station close," or, upon preparing to leave the house, "should I wear a coat today?" Further, a user may ask a query of the system, and/or issue a command, that relates to the user's personal information. For example, a user might ask the system "when is my meeting with John?" or command the system "remind me to call John when I get back home."

SUMMARY

For a speech-enabled system, the users' manner of interacting with the system is designed to be primarily, if not exclusively, by means of voice input. Consequently, the system, which potentially picks up all utterances made in the surrounding environment, including those not directed to the system, must have some way of discerning when any given utterance is directed at the system as opposed, e.g., to being directed at an individual present in the environment.
One way to accomplish this is to use a hotword, which by agreement among the users in the environment is reserved as a predetermined word spoken to invoke the attention of the system. In an example environment, the hotword used to invoke the system's attention is the phrase "OK computer." Consequently, each time the words "OK computer" are spoken, they are picked up by a microphone and conveyed to the system, which may perform speech recognition techniques or use audio features and neural networks to determine whether the hotword was spoken and, if so, await an ensuing command or query. Accordingly, utterances directed at the system take the general form [HOTWORD] [QUERY], where "HOTWORD" in this example is "OK computer" and "QUERY" can be any question, command, declaration, or other request that can be speech recognized, parsed, and acted on by the system, either alone or in conjunction with a server via a network.

A speech-enabled system may use the utterance of a hotword as an indication of a user's intention to interact with the system. In the case where the speech-enabled system detects speech from different users, the system processes and transmits audio data that includes speech from the user who initially spoke the hotword, and limits processing and suppresses transmission of audio data that includes speech from other users who did not speak the hotword. The system may use a hotworder to identify the portion of the audio data that includes the hotword. A speaker diarization module may analyze the portion of the audio data that includes the hotword to identify characteristics of the user's speech and identify subsequently received audio data that includes speech from the same user. The speaker diarization module may also analyze subsequently received speech audio and identify audio portions where the speaker is not the same as the hotword speaker.
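The [HOTWORD] [QUERY] convention described above can be illustrated with a simple text-level parser. This is a sketch only, since a real system would detect the hotword from audio features rather than from a transcript; the constant and function name are hypothetical:

```python
HOTWORD = "ok computer"

def parse_utterance(transcript: str):
    """Return the query portion of a '[HOTWORD] [QUERY]' utterance,
    or None if the utterance does not begin with the hotword and
    should therefore be ignored by the system."""
    normalized = transcript.strip().lower()
    if normalized.startswith(HOTWORD):
        query = normalized[len(HOTWORD):].strip(" ,")
        return query or None
    return None
```

An utterance such as "OK computer, should I wear a coat today?" yields the query text, while speech that lacks the hotword, such as "dinner is ready," is ignored.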
The system may remove those portions spoken by other users because those users did not express an intention to interact with the system by speaking the hotword. By removing those portions, the system preserves the privacy of other users who may be unintentionally interacting with the speech-enabled system.

According to an innovative aspect of the subject matter described in this application, a method for speaker diarization includes the actions of receiving, by a computing device, audio data corresponding to an utterance; determining that the audio data includes an utterance of a predefined hotword spoken by a first speaker; identifying a first portion of the audio data that includes speech from the first speaker; identifying a second portion of the audio data that includes speech from a second, different speaker; and, based on determining that the audio data includes an utterance of the predefined hotword spoken by the first speaker, transmitting the first portion of the audio data that includes speech from the first speaker and suppressing transmission of the second portion of the audio data that includes speech from the second, different speaker. These and other implem