EP-4497130-B1 - INTENDED QUERY DETECTION USING E2E MODELING FOR CONTINUED CONVERSATION
Inventors
- Chang, Shuo-yiin
- Strohman, Trevor
- Arumugam, Guru Prakash
- Wu, Zelin
- Sainath, Tara N.
- Li, Bo
- Liang, Qiao
- Stambler, Adam
- Upadhyay, Shyam
- Faruqui, Manaal
Dates
- Publication Date
- 2026-05-06
- Application Date
- 2023-03-20
Claims (15)
- A computer-implemented method (500) that, when executed on data processing hardware (610), causes the data processing hardware (610) to perform operations comprising: receiving, as input to a speech recognition model (200), audio data (110, 222) corresponding to a spoken utterance (106, 146, 148); performing, using the speech recognition model (200), speech recognition on the audio data (110, 222) by, at each of a plurality of time steps: encoding, using an audio encoder (220), the audio data (110, 222) corresponding to the spoken utterance (106, 146, 148) into a corresponding audio encoding (224); and decoding, using a speech recognition joint network (240), the corresponding audio encoding (224) encoded by the audio encoder (220) at the corresponding time step into a probability distribution (242) over possible output labels for the spoken utterance (106, 146, 148) at the corresponding time step; and at each of the plurality of time steps, determining, using an intended query (IQ) joint network (230) configured to receive a label history representation (350) associated with a sequence of non-blank symbols (252) output by a final softmax layer (250), an intended query decision (212) indicating whether or not the spoken utterance (106, 146, 148) comprises a query intended for a digital assistant interface (105), wherein: the speech recognition model (200) comprises the audio encoder (220), the speech recognition joint network (240), and a prediction network (230), the prediction network (230) configured to receive the sequence of non-blank symbols (252) output by the final softmax layer (250) and generate the label history representation (350) at each of the plurality of time steps; the speech recognition model (200) is trained during a first training stage by optimizing the audio encoder (220), the speech recognition joint network (240), and the prediction network (230) using a regular label sequence of wordpieces; and the IQ joint network (230) is initialized with the speech recognition joint network (240) during a second training stage by freezing the audio encoder (220) and the prediction network (230) and fine-tuning the IQ joint network (230) with an expanded label sequence of both wordpieces and IQ tokens (452, 454) to teach the IQ joint network (230) to predict a distribution of IQ tokens indicating whether or not an input utterance comprises a query intended for the digital assistant interface (105).
- The method (500) of claim 1, wherein generating the label history representation (350) for the corresponding sequence of non-blank symbols (252) comprises: for each non-blank symbol (252) in the sequence of non-blank symbols (252) received as input at each of the plurality of time steps: generating, by the prediction network (230), using a shared embedding matrix (304), an embedding (306) of the corresponding non-blank symbol (252); assigning, by the prediction network (230), a respective position vector (308) to the corresponding non-blank symbol (252); and weighting, by the prediction network (230), the embedding (306) proportional to a similarity between the embedding (306) and the respective position vector (308); and generating, as output from the prediction network (230), a single embedding vector (350) at the corresponding time step, the single embedding vector (350) based on a weighted average of the weighted embeddings (318), the single embedding vector (350) comprising the label history representation (350).
- The method (500) of claim 2, wherein the prediction network (230) comprises a multi-headed attention mechanism (302), the multi-headed attention mechanism (302) sharing the shared embedding matrix (304) across each head of the multi-headed attention mechanism (302).
- The method (500) of any of claims 1-3, wherein the audio data (110, 222) corresponding to the spoken utterance (106, 146, 148) is received during a current dialog session between a user (102) and the digital assistant interface (105).
- The method (500) of any of claims 1-4, wherein the output labels comprise wordpieces, words, phonemes, or graphemes.
- The method (500) of any of claims 1-5, wherein the audio encoder (220) comprises a causal encoder comprising one of: a plurality of unidirectional long short-term memory (LSTM) layers; a plurality of conformer layers; or a plurality of transformer layers.
- The method (500) of any of claims 1-6, wherein the speech recognition model (200) is trained using Hybrid Autoregressive Transducer Factorization.
- The method (500) of any of claims 1-7, wherein the operations further comprise, when the intended query decision (212) indicates that the spoken utterance (106, 146, 148) comprises a query intended for the digital assistant interface (105), providing a response to the received spoken utterance (106, 146, 148), or wherein the operations further comprise, when the intended query decision (212) indicates that the spoken utterance (106, 146, 148) does not comprise a query intended for the digital assistant interface (105), discarding the received spoken utterance (106, 146, 148).
- A system (100) comprising: data processing hardware (610); and memory hardware (620) in communication with the data processing hardware (610), the memory hardware (620) storing instructions that, when executed on the data processing hardware (610), cause the data processing hardware (610) to perform operations comprising: receiving, as input to a speech recognition model (200), audio data (110, 222) corresponding to a spoken utterance (106, 146, 148); performing, using the speech recognition model (200), speech recognition on the audio data (110, 222) by, at each of a plurality of time steps: encoding, using an audio encoder (220), the audio data (110, 222) corresponding to the spoken utterance (106, 146, 148) into a corresponding audio encoding (224); and decoding, using a speech recognition joint network (240), the corresponding audio encoding (224) encoded by the audio encoder (220) at the corresponding time step into a probability distribution (242) over possible output labels for the spoken utterance (106, 146, 148) at the corresponding time step; and at each of the plurality of time steps, determining, using an intended query (IQ) joint network configured to receive a label history representation (350) associated with a sequence of non-blank symbols (252) output by a final softmax layer (250), an intended query decision (212) indicating whether or not the spoken utterance (106, 146, 148) comprises a query intended for a digital assistant interface (105), wherein: the speech recognition model (200) comprises the audio encoder (220), the speech recognition joint network (240), and a prediction network (230), the prediction network (230) configured to receive the sequence of non-blank symbols (252) output by the final softmax layer (250) and generate the label history representation (350) at each of the plurality of time steps; the speech recognition model (200) is trained during a first training stage by optimizing the audio encoder (220), the speech recognition joint network (240), and the prediction network (230) using a regular label sequence of wordpieces; and the IQ joint network (230) is initialized with the speech recognition joint network (240) during a second training stage by freezing the audio encoder (220) and the prediction network (230) and fine-tuning the IQ joint network (230) with an expanded label sequence of both wordpieces and IQ tokens (452, 454) to teach the IQ joint network (230) to predict a distribution of IQ tokens indicating whether or not an input utterance comprises a query intended for the digital assistant interface (105).
- The system (100) of claim 9, wherein generating the label history representation (350) for the corresponding sequence of non-blank symbols (252) comprises: for each non-blank symbol (252) in the sequence of non-blank symbols (252) received as input at each of the plurality of time steps: generating, by the prediction network (230), using a shared embedding matrix (304), an embedding (306) of the corresponding non-blank symbol (252); assigning, by the prediction network (230), a respective position vector (308) to the corresponding non-blank symbol (252); and weighting, by the prediction network (230), the embedding (306) proportional to a similarity between the embedding (306) and the respective position vector (308); and generating, as output from the prediction network (230), a single embedding vector (350) at the corresponding time step, the single embedding vector (350) based on a weighted average of the weighted embeddings (312), the single embedding vector (350) comprising the label history representation (350).
- The system (100) of claim 10, wherein the prediction network (230) comprises a multi-headed attention mechanism (302), the multi-headed attention mechanism (302) sharing the shared embedding matrix (304) across each head of the multi-headed attention mechanism (302).
- The system (100) of any of claims 9-11, wherein the audio data (110, 222) corresponding to a spoken utterance (106, 146, 148) is received during a current dialog session between a user (1020 and the digital assistant interface (105).
- The system (100) of any of claims 9-12, wherein the output labels comprise wordpieces, words, phonemes, or graphemes, or wherein the audio encoder (220) comprises a causal encoder comprising one of: a plurality of unidirectional long short-term memory (LSTM) layers; a plurality of conformer layers; or a plurality of transformer layers.
- The system (100) of any of claims 9-13, wherein the speech recognition model (200) is trained using Hybrid Autoregressive Transducer Factorization.
- The system (100) of any of claims 9-14, wherein the operations further comprise, when the intended query decision (212) indicates that the spoken utterance (106, 146, 148) comprises a query intended for the digital assistant interface (105), providing a response to the received spoken utterance (106, 146, 148), or wherein the operations further comprise, when the intended query decision (212) indicates that the spoken utterance (106, 146, 148) does not comprise a query intended for the digital assistant interface (105), discarding the received spoken utterance (106, 146, 148).
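Claims 1 and 9 above recite a transducer-style speech recognition model in which an audio encoder, a prediction network, and a speech recognition joint network produce a per-time-step distribution over output labels, while a separate intended query (IQ) joint network consumes the same audio encoding and label history representation to emit an intended query decision at each time step. The following Python sketch is a minimal, non-authoritative mock-up of that data flow; the class names, dimensions, random projections, and the two-way IQ output are assumptions made for illustration and are not the claimed implementation.

```python
# Illustrative sketch only: a numpy mock-up of the data flow recited in claims 1 and 9.
# Class names, dimensions, random linear projections, and the two-symbol IQ vocabulary
# are assumptions for illustration, not the patent's actual implementation.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class AudioEncoder:
    """Stands in for the causal audio encoder (220): one frame in, one encoding out."""
    def __init__(self, feat_dim=80, enc_dim=64):
        self.w = rng.normal(size=(feat_dim, enc_dim)) * 0.1
    def __call__(self, frame):
        return np.tanh(frame @ self.w)                      # audio encoding (224)

class PredictionNetwork:
    """Stands in for the prediction network (230): non-blank label history -> single vector (350)."""
    def __init__(self, vocab_size=128, emb_dim=64):
        self.embeddings = rng.normal(size=(vocab_size, emb_dim)) * 0.1
    def __call__(self, non_blank_history):
        if not non_blank_history:
            return np.zeros(self.embeddings.shape[1])
        return self.embeddings[non_blank_history].mean(axis=0)  # label history representation

class JointNetwork:
    """Shared shape for the ASR joint network (240) and the IQ joint network."""
    def __init__(self, enc_dim=64, emb_dim=64, out_dim=128):
        self.w = rng.normal(size=(enc_dim + emb_dim, out_dim)) * 0.1
    def __call__(self, audio_enc, label_hist):
        return softmax(np.concatenate([audio_enc, label_hist]) @ self.w)

encoder   = AudioEncoder()
predictor = PredictionNetwork()
asr_joint = JointNetwork(out_dim=128)   # distribution (242) over wordpieces + blank
iq_joint  = JointNetwork(out_dim=2)     # distribution over {not intended, intended}

non_blank = []                          # sequence of non-blank symbols (252)
for frame in rng.normal(size=(10, 80)):               # 10 synthetic audio frames
    audio_enc  = encoder(frame)
    label_hist = predictor(non_blank)
    asr_probs  = asr_joint(audio_enc, label_hist)      # speech recognition output
    iq_probs   = iq_joint(audio_enc, label_hist)       # intended query decision (212)
    label = int(asr_probs.argmax())
    if label != 0:                                     # assume index 0 is the blank symbol
        non_blank.append(label)
    print(f"intended-query probability: {iq_probs[1]:.3f}")
```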
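Claims 2, 3, 10, and 11 describe how the prediction network condenses the sequence of non-blank symbols: each symbol is embedded with a shared embedding matrix, assigned a position vector, weighted in proportion to the similarity between its embedding and its position vector, and the weighted embeddings are averaged into a single vector. The sketch below is one plausible reading of that recitation; the dot-product similarity, the softmax normalization over history positions, and the history length are assumptions, and the multi-head sharing of the embedding matrix in claims 3 and 11 is not shown.

```python
# Hedged sketch of the label-history computation recited in claims 2 and 10.
# Dot-product similarity and softmax normalization over history positions are
# assumptions; the claims only require weighting "proportional to a similarity".
import numpy as np

rng = np.random.default_rng(1)
vocab_size, emb_dim, max_history = 128, 64, 5

shared_embedding = rng.normal(size=(vocab_size, emb_dim)) * 0.1   # shared embedding matrix (304)
position_vectors = rng.normal(size=(max_history, emb_dim)) * 0.1  # position vectors (308)

def label_history_representation(non_blank_symbols):
    """Map the last N non-blank symbols (252) to a single embedding vector (350)."""
    history    = non_blank_symbols[-max_history:]
    embeddings = shared_embedding[history]                         # embeddings (306)
    positions  = position_vectors[:len(history)]
    similarity = np.sum(embeddings * positions, axis=1)            # embedding/position similarity
    weights    = np.exp(similarity) / np.exp(similarity).sum()     # normalize the weights
    weighted   = weights[:, None] * embeddings                     # weighted embeddings
    return weighted.sum(axis=0)                                    # weighted average -> (350)

print(label_history_representation([17, 42, 7]).shape)             # -> (64,)
```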
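Claims 1 and 9 also recite a two-stage training recipe: a first stage optimizes the audio encoder, prediction network, and speech recognition joint network on regular wordpiece label sequences, and a second stage freezes the encoder and prediction network, initializes the IQ joint network from the speech recognition joint network, and fine-tunes it on an expanded label sequence that combines wordpieces with IQ tokens (cf. FIGS. 4A and 4B). The sketch below only illustrates the parameter bookkeeping; the token names <iq> and <no_iq>, their placement at utterance boundaries, and the dict-based parameter groups are assumptions.

```python
# Hedged sketch of the two training stages recited in claims 1 and 9.
# Parameter groups are plain dicts; no actual optimization is performed.
import copy

params = {
    "audio_encoder": {"trainable": True},
    "prediction_network": {"trainable": True},
    "asr_joint_network": {"trainable": True},
}

# Stage 1: train encoder, prediction network, and ASR joint network on
# regular wordpiece label sequences, e.g.:
regular_labels = ["_play", "_my", "_home", "work", "_play", "list"]

# Stage 2: freeze the audio encoder and the prediction network ...
params["audio_encoder"]["trainable"] = False
params["prediction_network"]["trainable"] = False

# ... initialize the IQ joint network from the ASR joint network ...
params["iq_joint_network"] = copy.deepcopy(params["asr_joint_network"])

# ... and fine-tune it on an expanded label sequence that adds IQ tokens.
# Token names and placement below are assumptions for illustration only.
expanded_labels = ["_play", "_my", "_home", "work", "_play", "list", "<iq>",
                   "_something", "_to", "_the", "_kids", "<no_iq>"]

print({name: group["trainable"] for name, group in params.items()})
```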
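Claims 7 and 15 recite training the speech recognition model with Hybrid Autoregressive Transducer (HAT) factorization, in which the blank posterior is modeled with a separate sigmoid gate and the label posterior is scaled by the complementary probability. The numpy sketch below shows that factorization in isolation, under the assumption that HAT is applied in its standard published form; the logit values are arbitrary.

```python
# Hedged sketch of the HAT output factorization: a dedicated blank gate plus a
# renormalized label distribution (logit values below are arbitrary illustrations).
import numpy as np

def hat_posteriors(blank_logit, label_logits):
    p_blank  = 1.0 / (1.0 + np.exp(-blank_logit))          # sigmoid blank gate
    labels   = np.exp(label_logits - label_logits.max())
    p_labels = (1.0 - p_blank) * labels / labels.sum()      # scaled label softmax
    return p_blank, p_labels

p_blank, p_labels = hat_posteriors(0.3, np.array([1.2, -0.4, 0.7]))
print(round(float(p_blank), 3), p_labels.round(3), round(float(p_blank + p_labels.sum()), 3))
```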
Description
TECHNICAL FIELD
This disclosure relates to intended query detection using end-to-end (E2E) modeling for continued conversation.
BACKGROUND
A speech-enabled environment permits a user to simply speak a query or command aloud, and a digital assistant will field and answer the query and/or cause the command to be performed. A speech-enabled environment (e.g., home, workplace, school, etc.) can be implemented using a network of connected microphone devices distributed throughout various rooms and/or areas of the environment. Through such a network of microphones, a user can orally query the digital assistant from essentially anywhere in the environment without needing a computer or other device in front of him/her or even nearby. For example, while cooking in the kitchen, a user might invoke the digital assistant using a hotword, such as "Okay Computer, please set a timer for 20 minutes," and, in response, the digital assistant will confirm that the timer has been set (e.g., in the form of a synthesized voice output) and then alert the user (e.g., in the form of an alarm or other audible alert from an acoustic speaker) once the timer lapses after 20 minutes. Often, the user may issue a follow-on query to the digital assistant. However, requiring the user to repeat the hotword to address the digital assistant places a cognitive burden on the user and interrupts the flow of a continued conversation. Prior art document MARTIN RADFAR ET AL: "FANS: Fusing ASR and NLU for on-device SLU" discloses fusion of ASR and SLU, with both models being trained jointly. The intent is predicted from the output of the audio encoder.
SUMMARY
One aspect of the disclosure provides a computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations as defined in claim 1. The details of one or more implementations of the disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.
DESCRIPTION OF DRAWINGS
FIGS. 1A and 1B are schematic views of an example system including an automatic speech recognition (ASR) system that includes an intended query detector for transcribing spoken utterances.
FIG. 2 is a schematic view of an example ASR system integrating an intended query detector.
FIG. 3 is a schematic view of an example tied and reduced prediction network of the ASR system of FIG. 2.
FIG. 4A depicts an example long-form transcribed training utterance.
FIG. 4B depicts an example annotated transcribed training utterance for the long-form transcribed training utterance of FIG. 4A.
FIG. 5 is a flowchart of an example arrangement of operations for a computer-implemented method of executing an intended query detection model in an ASR system.
FIG. 6 is a schematic view of an example computing device that may be used to implement the systems and methods described herein.
Like reference symbols in the various drawings indicate like elements.
DETAILED DESCRIPTION
A user's manner of interacting with an assistant-enabled device is designed to be primarily, if not exclusively, by means of voice input. Consequently, the assistant-enabled device must have some way of discerning when any given utterance in a surrounding environment is directed toward the device, as opposed to being directed to an individual in the environment or originating from a non-human source (e.g., a television or music player).
One way to accomplish this is to use a hotword, which, by agreement among the users in the environment, is reserved as a predetermined word or phrase that is spoken to invoke the attention of the device. In an example environment, the hotword used to invoke the assistant's attention is the phrase "OK computer." Consequently, each time the words "OK computer" are spoken, they are picked up by a microphone and conveyed to a hotword detector, which performs speech understanding techniques to determine whether the hotword was spoken and, if so, awaits an ensuing command or query. Accordingly, utterances directed at an assistant-enabled device take the general form [HOTWORD] [QUERY], where "HOTWORD" in this example is "OK computer" and "QUERY" can be any question, command, declaration, or other request that can be speech recognized, parsed, and acted on by the system, either alone or in conjunction with the server via the network. In cases where the user continues the conversation with the assistant-enabled device, such as a mobile phone or smart speaker, the user's interaction with the phone or speaker may become awkward. The user may speak, "Ok computer, play my homework playlist." The phone or speaker may begin to play the first song on the playlist. The user may wish to advance to the next song and speak, "Ok computer, next." To advance to yet another song, the user may speak, "Ok computer, next," again. To alleviate the need to keep repeating the hotword before speaking