EP-4571567-B1 - METHOD, COMPUTER PROGRAM, APPARATUSES, AND SPEECH PROCESSING SYSTEM FOR CORRECTING A VOICE COMMAND

EP 4571567 B1

Inventors

  • MINOW, Jascha
  • JAHN, CARL
  • EL MALLOUKI, Said

Dates

Publication Date
2026-05-13
Application Date
2023-12-11

Claims (13)

  1. A method (10) for correcting a voice command during runtime, the method (10) comprising obtaining (12) an audio representation of the voice command; speech recognizing (14) the voice command using the audio representation to obtain a text representation of the voice command and recognition confidence information indicating a confidence on whether the text representation matches the audio representation; based on the recognition confidence information: error processing (16; 320) the audio representation and the text representation; or non-error processing (18; 302) the audio representation, wherein the non-error processing (18; 302) comprises performing domain classification (312) configured to determine a domain for the voice command and to provide domain confidence information indicating a confidence on whether the domain matches a domain of the audio representation; the method (10) further comprising, based on the domain confidence information, error processing (16; 320) the audio representation and the text representation, wherein the non-error processing (18; 302) comprises performing entity identification (313) subsequent to the domain classification (312), wherein the performing of the entity identification (313) is configured to determine an entity the voice command refers to and to provide entity confidence information indicating a confidence on whether the entity matches an entity of the voice command; the method (10) further comprising, based on the entity confidence information, error processing (16; 320) the audio representation and the text representation.
  2. The method (10) of claim 1, wherein the non-error processing (18; 302) comprises performing response generation (314) subsequent to the entity identification (313), wherein the performing of the response generation (314) is configured to generate a response to the voice command and response confidence information based on one or more elements of the group of the domain, the entity, the text representation and the audio representation; the method (10) further comprising, based on the response confidence information, forwarding the audio representation and the text representation to the error processing (16).
  3. The method (10) of one of the claims 1 or 2, wherein the error processing (16; 320) comprises identifying one or more errors in the audio representation and/or text representation, and separating (322) the one or more errors to obtain one or more separated errors.
  4. The method (10) of claim 3, wherein the separating (322) of the one or more errors comprises removing irrelevant content from the audio or text representation.
  5. The method (10) of one of the claims 3 or 4, further comprising classifying (323) the one or more errors and/or the one or more separated errors to obtain one or more classified errors.
  6. The method (10) of one of the claims 3 to 5, further comprising correcting (324) the one or more errors, the one or more separated errors, and/or the one or more classified errors to obtain a corrected audio representation and/or a corrected text representation.
  7. The method (10) of claim 6, wherein the correcting (324) comprises correcting words in the text representation based on context or meta information related to the voice command.
  8. The method (10) of one of the claims 6 or 7, further comprising forwarding (326) the corrected audio representation and/or the corrected text representation for re-processing to one of the previous processing steps of one of the claims 1 to 4.
  9. The method (10) of claim 8, further comprising iterating through the error processing (16; 320) until a recognition confidence level is reached or a latency limit for the processing has expired.
  10. The method (10) of one of the claims 1 or 2, wherein the error processing (16) comprises attempting to identify one or more errors in the audio representation or the text representation, and, in case no error can be detected, generating an indication (327) that no error could be identified.
  11. A computer program having a program code for performing one of the methods (10) of one of the claims 1 to 10, when the computer program is executed on a computer, a processor, or a programmable hardware component.
  12. An apparatus (20) for correcting a voice command during runtime, the apparatus (20) comprising one or more interfaces (22) configured to communicate an audio representation of a voice command and a response representation to the voice command; and one or more processing devices (24) configured to perform one of the methods (10) of claims 1 to 10.
  13. A speech processing system (200) comprising the apparatus (20) of claim 12.
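The control flow described by claims 1, 9, and 10 — recognize, route low-confidence results to error processing, and iterate until a confidence level is reached or a latency limit expires — can be sketched as follows. This is an illustrative sketch only, not the claimed or a definitive implementation; the function names, threshold, and latency budget are hypothetical stand-ins.

```python
import time
from dataclasses import dataclass

CONFIDENCE_TARGET = 0.85  # hypothetical recognition confidence level
LATENCY_BUDGET_S = 2.0    # hypothetical latency limit for the processing


@dataclass
class Result:
    text: str
    confidence: float


def recognize(audio: bytes) -> Result:
    # Stand-in for a real ASR component; returns a text representation
    # of the voice command plus recognition confidence information.
    return Result(text="set the timer to 5 minutes", confidence=0.6)


def error_process(audio: bytes, result: Result) -> Result:
    # Stand-in for error identification, separation, classification, and
    # correction; here we simply pretend correction raised the confidence.
    return Result(result.text, min(1.0, result.confidence + 0.2))


def handle_command(audio: bytes) -> Result:
    deadline = time.monotonic() + LATENCY_BUDGET_S
    result = recognize(audio)
    # Iterate through error processing until the confidence target is met
    # or the latency budget has expired (cf. claim 9).
    while result.confidence < CONFIDENCE_TARGET and time.monotonic() < deadline:
        result = error_process(audio, result)
    return result
```

In a fuller sketch, the same confidence gate would also sit after domain classification, entity identification, and response generation, each able to hand the audio and text representations back to the error processing stage.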

Description

Field

Examples relate to a method, a computer program, an apparatus, and a speech processing system for correcting a voice command, more particularly, but not exclusively, for enabling efficient error processing for correcting a voice command in speech recognition based on confidence information.

Background

Speech recognition, also known as automatic speech recognition (ASR), voice recognition, or speech-to-text, is a capability that enables a computer or electronic device to interpret and respond to human speech. This technology uses algorithms to translate spoken language into text, making it a crucial component for applications such as voice-enabled assistants, dictation software, and automated customer service systems. The process may involve capturing sound through a microphone, processing the audio to filter out noise, and then analyzing the speech signal to determine the sequence of words that were spoken based on the pattern of sounds.

Speech recognition systems require the integration of various linguistic components to accurately comprehend spoken language. These include phonetics and phonology to understand the sounds, syntax for grasping grammatical structures, semantics to interpret meaning, and often pragmatics to comprehend the speaker's intent in context. Today, most cutting-edge speech recognition systems employ machine learning techniques, particularly deep learning neural networks, which significantly enhance their ability to learn from a large dataset of spoken language examples. These neural networks model high-level abstractions in data by using multiple processing layers that are trained using vast amounts of speech data.

Advancements in speech recognition have had a significant impact on various industries, enhancing accessibility for users with disabilities, improving efficiency in healthcare through medical transcription, and enabling safer driving with voice-controlled navigation systems.
Moreover, as the technology continues to evolve, newer applications are being devised, including real-time language translation and voice-activated home automation. However, despite this progress, the field still faces challenges with accurately recognizing speech in noisy environments, differentiating between voices, and handling accents and dialects. Despite these hurdles, the ongoing research and development in this field promise continued improvements in both the accuracy and versatility of speech recognition technology.

The quality of speech recognition and speech processing, in terms of the accuracy of the computed results, depends largely on the available training data and on the algorithms used to process the commands. When training the various voice processing components, it is important to cover a very wide range of possible input parameters, such as issuing a voice command while the television is switched on or setting a timer while the extractor hood is running at its highest level. Systems that process voice commands very often struggle with the problem that the actual audio for the voice command contains a wide variety of background noises. As a result, the background noise can drown out the user's actual command. For example, in a kitchen scenario the command "Set the timer to 5 minutes" is drowned out by the loud noise of the extractor hood. There are currently some methods that attempt to identify background noises in the audio and extract them in order to isolate the user's voice, and thus the voice command, for processing. However, these methods have limitations in detecting background noise, as detection strongly depends on the use case, the environment, and the context.

Document US 2023/0110205 A1 describes techniques for handling errors during processing of natural language inputs.
A system may process a natural language input to generate an ASR (Automatic Speech Recognition) hypothesis or an NLU (Natural Language Understanding) hypothesis. The system may use more than one data searching technique (e.g., deep neural network searching, convolutional neural network searching, etc.) to generate an alternate ASR or NLU hypothesis, depending on the type of hypothesis input for alternate hypothesis processing.

Document US 11,626,106 B1 discloses a system for determining which component of a speech processing system is the cause of an undesired response to a user input. The system processes ASR data and NLU data to determine the component likely to have caused the undesired response. Based on which component is the cause, the system performs an appropriate conversation recovery technique to confirm the speech processing results with the user.

Summary

It is a finding of the present disclosure that most conventional error-correction methods operate on the entire audio, and the process is always run through in full, even if there is no background noise audible to the human ear that needs to be removed. It is also difficult for the algo