DE-102025145147-A1 - Voice dictation with a large language model for audio
Abstract
A method includes receiving audio data identifying an utterance spoken by a user. The method also includes processing the audio data, using a multimodal large language model (LLM), to generate a transcription of the utterance, the transcription comprising a sequence of expressions. The method also includes processing the audio data and the transcription in parallel, using the multimodal LLM, to identify one or more revision expressions within the sequence, the one or more revision expressions specifying a revision action to be performed on at least one other expression within the sequence. The method also includes modifying the transcription based on the one or more revision expressions.
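To make the abstract's moving parts concrete, the sketch below models a transcription as a sequence of expressions, some of which are revision expressions carrying an action and a target. This is a minimal illustration in Python; every name here (Expression, RevisionAction, apply_revisions) is invented for the example. The patent performs these steps inside a multimodal LLM rather than with hand-written rules.

```python
from dataclasses import dataclass, field
from enum import Enum


class RevisionAction(Enum):
    """The revision actions named in the claims."""
    REPLACE = "replace"
    DELETE = "delete"
    SPELL = "spell"


@dataclass
class Expression:
    """One expression in the transcribed sequence."""
    text: str
    is_revision: bool = False                   # does this expression command an edit?
    action: RevisionAction | None = None
    target_span: tuple[int, int] | None = None  # [start, end) indices of the targets
    replacement: str | None = None


@dataclass
class Transcription:
    expressions: list[Expression] = field(default_factory=list)

    def apply_revisions(self) -> str:
        """Apply each revision expression to its targets, then drop the revisions."""
        out = [e.text for e in self.expressions]
        for e in self.expressions:
            if not e.is_revision or e.target_span is None:
                continue
            lo, hi = e.target_span
            if e.action is RevisionAction.DELETE:
                for i in range(lo, hi):
                    out[i] = ""
            elif e.action in (RevisionAction.REPLACE, RevisionAction.SPELL) and e.replacement:
                out[lo] = e.replacement         # splice in the replacement text
                for i in range(lo + 1, hi):
                    out[i] = ""
        kept = [t for e, t in zip(self.expressions, out) if t and not e.is_revision]
        return " ".join(kept)


# Example: the user dictates a sentence, then speaks a correction.
t = Transcription([
    Expression("send the memo to Bob"),
    Expression("no, I mean Rob", is_revision=True,
               action=RevisionAction.REPLACE,
               target_span=(0, 1), replacement="send the memo to Rob"),
])
print(t.apply_revisions())  # -> "send the memo to Rob"
```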
Inventors
- Quan Wang
- Francoise Beaufays
- Bhuvana Ramabhadran
- Guanlong Zhao
- Zhong Meng
- Neng Chen
- Antoine Bruguier
- Yanzhang He
- Golan Pundak
Assignees
- GOOGLE LLC
Dates
- Publication Date: 2026-05-13
- Application Date: 2025-11-03
- Priority Date: 2024-11-11
Claims (20)
- A computer-implemented method, executed on data processing hardware, that causes the data processing hardware to perform operations including: receiving audio data identifying an utterance spoken by a user; processing the audio data to generate a transcription of the utterance, wherein the transcription comprises a sequence of expressions; processing the audio data and the transcription in parallel to identify one or more revision expressions in the sequence of expressions, using a large language model (LLM), wherein the one or more revision expressions specify a revision action to be performed on at least one other expression in the sequence of expressions; and modifying the transcription based on the one or more revision expressions.
- The computer-implemented method according to Claim 1, wherein the operations further comprise: for each respective expression in the sequence of expressions, determining a corresponding intention of the user when speaking the respective expression, based on the parallel processing of the audio data and the transcription, wherein identifying the one or more revision expressions in the sequence of expressions is based on the corresponding intention determined for each respective expression in the sequence of expressions.
- The computer-implemented method according to Claim 1 or 2, wherein the parallel processing of the audio data and the transcription for each respective expression in the sequence of expressions comprises: determining, based on the processing of the audio data, the corresponding linguistic features of the respective expression; determining, based on the processing of the transcription, the corresponding linguistic context of the respective expression; and correlating the corresponding linguistic features of the respective expression with the corresponding linguistic context of the respective expression.
- The computer-implemented method according to Claim 3, wherein: the corresponding linguistic features of the respective expression are not reproduced in the transcription; and the corresponding linguistic context of the respective expression is not reproduced in the audio data.
- The computer-implemented method according to Claim 3 or 4, wherein the corresponding linguistic features include at least one of: pitch information; tone information; or prosody information.
- The computer-implemented method according to any one of Claims 1 to 5, wherein the operations further comprise, based on the one or more revision expressions, inserting a revision token into the sequence of expressions, wherein the revision token specifies a corresponding number N of expressions in the at least one other expression and corresponding replacement expressions to replace the corresponding number N of expressions in the at least one other expression.
- The computer-implemented method according to Claim 6, wherein modifying the transcription is further based on the revision token inserted into the sequence of expressions.
- The computer-implemented method according to any one of Claims 1 to 7, wherein the operations further comprise: obtaining context data associated with the user who spoke the utterance; and conditioning the multimodal LLM on the context data.
- The computer-implemented method according to any one of Claims 1 to 8, wherein the operations further comprise: determining a training prompt for an auxiliary multimodal LLM, wherein the training prompt comprises a transcription editing task and a plurality of training samples, each of which comprises a corresponding training transcript paired with a corresponding modified training transcript; generating a plurality of training samples based on the training prompt using the auxiliary multimodal LLM; and training the multimodal LLM on the plurality of training samples.
- The computer-implemented method according to any one of Claims 1 to 9, wherein the revision action includes at least one of: a replacement action; a deletion action; or a spelling action.
- A system comprising: data processing hardware; and storage hardware in communication with the data processing hardware, wherein the storage hardware stores instructions that, when executed on the data processing hardware, cause the data processing hardware to perform operations comprising: receiving audio data identifying an utterance spoken by a user; processing the audio data to generate a transcription of the utterance, wherein the transcription comprises a sequence of expressions; processing the audio data and the transcription in parallel to identify one or more revision expressions in the sequence of expressions, using a large language model (LLM), wherein the one or more revision expressions specify a revision action to be performed on at least one other expression in the sequence of expressions; and modifying the transcription based on the identified one or more revision expressions.
- The system according to Claim 11, wherein the operations further comprise: for each respective expression in the sequence of expressions, determining a corresponding intention of the user when speaking the respective expression, based on the parallel processing of the audio data and the transcription, wherein identifying the one or more revision expressions in the sequence of expressions is based on the corresponding intention determined for each respective expression in the sequence of expressions.
- The system according to Claim 11 or 12, wherein the parallel processing of the audio data and the transcription for each respective expression in the sequence of expressions comprises: determining, based on the processing of the audio data, the corresponding linguistic features of the respective expression; determining, based on the processing of the transcription, the corresponding linguistic context of the respective expression; and correlating the corresponding linguistic features of the respective expression with the corresponding linguistic context of the respective expression.
- The system according to Claim 13, wherein: the corresponding linguistic features of the respective expression are not reproduced in the transcription; and the corresponding linguistic context of the respective expression is not reproduced in the audio data.
- The system according to Claim 13 or 14, wherein the corresponding linguistic features include at least one of: pitch information; tone information; or prosody information.
- The system according to any one of Claims 11 to 15, wherein the operations further comprise, based on the one or more revision expressions, inserting a revision token into the sequence of expressions, wherein the revision token specifies a corresponding number N of expressions in the at least one other expression and corresponding replacement expressions to replace the corresponding number N of expressions in the at least one other expression.
- The system according to Claim 16, wherein modifying the transcription is further based on the revision token inserted into the sequence of expressions.
- The system according to any one of Claims 11 to 17, wherein the operations further comprise: obtaining context data associated with the user who spoke the utterance; and conditioning the multimodal LLM on the context data.
- The system according to any one of Claims 11 to 18, wherein the operations further comprise: determining a training prompt for an auxiliary multimodal LLM, wherein the training prompt comprises a transcription editing task and a plurality of training samples, each of which comprises a corresponding training transcript paired with a corresponding modified training transcript; generating a plurality of training samples based on the training prompt using the auxiliary multimodal LLM; and training the multimodal LLM on the plurality of training samples.
- The system according to any one of Claims 11 to 19, wherein the revision action includes at least one of: a replacement action; a deletion action; or a spelling action.
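Claims 6 and 16 describe a revision token inserted into the expression sequence that specifies a number N of expressions to replace and the replacement expressions. The sketch below shows how such a token might be applied downstream; the bracketed token syntax is an assumption made for illustration, as the patent does not prescribe a token format.

```python
import re

# Hypothetical textual form of a revision token: "[REV N=2 -> at six pm]"
# means "replace the two expressions immediately before this token with
# the expression 'at six pm'". The syntax is invented for this sketch.
TOKEN = re.compile(r"\[REV N=(\d+) -> (.+)\]")


def apply_revision_tokens(expressions: list[str]) -> list[str]:
    """Apply every revision token in the sequence, left to right."""
    out: list[str] = []
    for expr in expressions:
        m = TOKEN.fullmatch(expr)
        if m:
            n = min(int(m.group(1)), len(out))  # guard against over-long spans
            replacement = m.group(2)
            del out[len(out) - n:]              # remove the N revised expressions
            out.append(replacement)             # splice in the replacement
        else:
            out.append(expr)
    return out


print(apply_revision_tokens(
    ["schedule the call", "at five pm", "[REV N=1 -> at six pm]"]
))
# -> ['schedule the call', 'at six pm']
```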
Description
TECHNICAL FIELD
This disclosure relates to voice dictation using a large language model for audio.
BACKGROUND
Automatic speech recognition (ASR) aims to convert speech into text. End-to-end speech recognition models integrate multiple components into a single model, improving the performance (e.g., word error rate (WER) and latency) of speech-to-text transcription. Some systems contain multiple cascaded models capable of performing ASR in several different languages. Recently, speech recognition models have benefited from training on both audio and text data. However, using both audio and text data presents challenges because of the differing modalities of audio and text. Many current approaches employ multiple models to process audio and text, which is expensive and difficult to maintain, as each model uses different data sources, training processes, and evaluation metrics.
DESCRIPTION OF THE DRAWINGS
FIGS. 1A and 1B are schematic views of an exemplary speech recognition system.
FIG. 2 is a schematic view of an example transcription with an inserted revision token.
FIG. 3 is a schematic view of an exemplary training process for training a multimodal large language model.
FIG. 4 is a flowchart of an exemplary arrangement of operations for a computer-implemented method of performing voice dictation using a large language model.
FIG. 5 is a schematic view of an exemplary computing device that may be used to implement the systems and methods described herein.
Like reference symbols in the various drawings indicate like elements.
SUMMARY
One aspect of the disclosure provides a computer-implemented method, executed on data processing hardware, that causes the data processing hardware to perform operations for performing voice dictation using a large language model. The operations include receiving audio data that identifies an utterance spoken by a user. The operations also include processing the audio data, using a multimodal large language model (LLM), to generate a transcription of the utterance, the transcription including a sequence of expressions. The operations also include processing the audio data and the transcription in parallel, using the multimodal LLM, to identify one or more revision expressions within the sequence of expressions, the one or more revision expressions specifying a revision action to be performed on at least one other expression within the sequence. The operations also include modifying the transcription based on the one or more revision expressions.
Implementations of the disclosure may include one or more of the following optional features. In some implementations, the operations further include, for each respective expression in the sequence, determining a corresponding intent of the user when speaking that expression, based on the parallel processing of the audio data and the transcription; identifying the one or more revision expressions in the sequence is then based on the intent determined for each expression. In some examples, the parallel processing of the audio data and the transcription for each expression in the sequence includes determining the corresponding linguistic features of the expression based on processing the audio data, determining the corresponding linguistic context of the expression based on processing the transcription, and correlating the corresponding linguistic features of the expression with its corresponding linguistic context. The corresponding linguistic features of an expression may not be represented in the transcription, and the corresponding linguistic context of an expression may not be represented in the audio data.
In these examples, the corresponding linguistic features can include at least one of pitch information, tone information, or prosody information. In some implementations, the operations further include inserting a revision token into the sequence of expressions based on the one or more revision expressions. The revision token specifies a corresponding number N of expressions within the at least one other expression and corresponding replacement expressions to replace those N expressions. In these implementations, modifying the transcription is further based on the revision token inserted into the sequence of expressions. The operations may further include obtaining context data associated with the user who spoke the utterance and conditioning the multimodal LLM on that context data. In some examples, the operations further include determining a training prompt for an auxiliary multimodal LLM, generating a plurality of training samples based on the training prompt using the auxiliary multimodal LLM, and training the multimodal LLM on the plurality of training samples. The training prompt includes a transcription editing task and a plurality of training samples, each of which includes a corresponding training transcript paired with a corresponding modified training transcript. The revision action can include at least one of a replacement action, a deletion action, or a spelling action. Another aspect of the disclosure provides a system comprising data processing hardware and storage hardware in communication with the data processing hardware, the storage hardware storing instructions that, when executed on the data processing hardware, cause the data processing hardware to perform the operations described above.
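The training procedure just described (also Claims 9 and 19) prompts an auxiliary multimodal LLM with a transcription editing task plus example transcript pairs, then harvests new pairs to train the dictation model. Below is a minimal sketch of that prompt construction and response parsing, assuming a generic generate(prompt) text-completion callable; the callable, the prompt wording, and the seed pairs are all invented for illustration, as the patent specifies no API or prompt text.

```python
from typing import Callable

TASK = (
    "Transcription editing task: given a dictated transcript containing a "
    "spoken correction, produce the transcript with the correction applied.\n"
)

# Invented seed pairs: (training transcript, modified training transcript).
SEED_PAIRS = [
    ("send it to Bob no I mean Rob", "send it to Rob"),
    ("meet at five pm actually six pm", "meet at six pm"),
]


def build_training_prompt() -> str:
    """Compose the editing task description and the few-shot transcript pairs."""
    lines = [TASK]
    for raw, edited in SEED_PAIRS:
        lines.append(f"Transcript: {raw}\nEdited: {edited}\n")
    lines.append("Generate 10 more Transcript/Edited pairs in the same format.")
    return "\n".join(lines)


def generate_training_samples(generate: Callable[[str], str]) -> list[tuple[str, str]]:
    """Ask the auxiliary LLM for new pairs and parse them back out of its reply."""
    response = generate(build_training_prompt())
    pairs: list[tuple[str, str]] = []
    raw = None
    for line in response.splitlines():
        if line.startswith("Transcript:"):
            raw = line.removeprefix("Transcript:").strip()
        elif line.startswith("Edited:") and raw is not None:
            pairs.append((raw, line.removeprefix("Edited:").strip()))
            raw = None
    return pairs


# Usage (hypothetical): samples = generate_training_samples(my_llm_call)
# The resulting (transcript, edited transcript) pairs would then be used to
# train the multimodal LLM on the transcription editing behavior.
```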