
EP-4377954-B1 - VOICE OR SPEECH RECOGNITION USING CONTEXTUAL INFORMATION AND USER EMOTION

EP 4377954 B1

Inventors

  • WEI, Jun
  • DONG, Xiaoxia
  • PAN, Qimeng
  • JIN, Kwihyuk
  • TANG, Tong

Dates

Publication Date
2026-05-06
Application Date
2021-07-27

Claims (12)

  1. A method (300, 301, 302, 303, 304, 305, 306, 307, 308) of voice or speech recognition executed by a processor of a computing device, comprising: determining (310) a voice or speech recognition threshold for voice or speech recognition based on information obtained from contextual information detected in an environment from which a received audio input was captured by the computing device and an emotional classification of a user's voice in the received audio input; determining (312) a confidence score for one or more key words identified in the received audio input; and outputting (314) results of a voice or speech recognition analysis of the received audio input in response to the determined confidence score exceeding the determined voice or speech recognition threshold; and the method further comprising: receiving (332) a threshold model update from a remote computing device, wherein determining (310a) the voice or speech recognition threshold for voice or speech recognition uses the received threshold model update; and sending (330) feedback to the remote computing device regarding audio input received by the computing device in a format suitable for use by the remote computing device in generating the received threshold model update.
  2. The method of claim 1, further comprising: analyzing (316) the received audio input to obtain the contextual information detected in the environment from which the received audio input was recorded by the computing device.
  3. The method of claim 1, further comprising: analyzing (318) the received audio input to determine the emotional classification of the user's voice in the received audio input; and further comprising: receiving (320) an emotion classification model from a remote computing device, wherein analyzing the received audio input to determine the emotional classification of the user's voice in the received audio input comprises analyzing the received audio input using the received emotional classification model.
  4. The method of claim 1, further comprising: determining a recognition level of the received audio input based on at least one of a detection rate or a false alarm rate of voice or speech recognition of words or phrases in the received audio input, wherein determining the voice or speech recognition threshold comprises determining the voice or speech recognition threshold based on the determined recognition level of the received audio input.
  5. A computing device (110), comprising: a microphone (112); and a processor coupled to the microphone, wherein the processor is configured with processor-executable instructions to: determine a voice or speech recognition threshold for voice or speech recognition based on information obtained from contextual information detected in an environment from which a received audio input was captured by the computing device and an emotional classification of a user's voice in the received audio input; determine a confidence score for one or more key words identified in the received audio input; and output results of a voice or speech recognition analysis of the received audio input in response to the determined confidence score exceeding the determined voice or speech recognition threshold; and the computing device further comprising: a transceiver coupled to the processor, wherein the processor is further configured with processor-executable instructions to: receive, via the transceiver, a threshold model update from a remote computing device; and determine the voice or speech recognition threshold for voice or speech recognition using the received threshold model update, and wherein the processor is further configured with processor-executable instructions to: send, via the transceiver, feedback to the remote computing device regarding audio input received by the computing device in a format suitable for use by the remote computing device in generating the received threshold model update.
  6. The computing device of claim 5, wherein the processor is further configured with processor-executable instructions to: analyze the received audio input to obtain the contextual information detected in the environment from which the received audio input was recorded by the computing device.
  7. The computing device of claim 5, wherein the processor is further configured with processor-executable instructions to: analyze the received audio input to determine the emotional classification of the user's voice in the received audio input.
  8. The computing device of claim 7, further comprising: a transceiver coupled to the processor, wherein the processor is further configured with processor-executable instructions to: receive, via the transceiver, an emotion classification model from a remote computing device (190); and analyze the received audio input to determine the emotional classification of the user's voice in the received audio input using the received emotional classification model.
  9. The computing device of claim 5, wherein the processor is further configured with processor-executable instructions to: determine a recognition level of the received audio input based on at least one of a detection rate or a false alarm rate of voice or speech recognition of words or phrases in the received audio input; and determine the voice or speech recognition threshold based on the determined recognition level of the received audio input.
  10. The computing device of claim 5, wherein the processor is further configured with processor-executable instructions to: extract background noise from the received audio input; and determine the voice or speech recognition threshold for voice or speech recognition based on extracted background noise.
  11. The computing device of claim 5, further comprising: a transceiver coupled to the processor, wherein the processor is further configured with processor-executable instructions to send, via the transceiver, feedback to a remote computing device regarding whether the determined confidence score exceeded the determined voice or speech recognition threshold.
  12. A non-transitory processor-readable medium having stored thereon processor-executable instructions configured to cause a processor of a computing device to perform the method of any of claims 1-4.
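The method of independent claim 1 — derive a recognition threshold from environmental context and an emotional classification, score key words, and output results only when the score exceeds the threshold — can be sketched as follows. This is a hypothetical illustration only: the function names, the linear noise/emotion combination, and every numeric value are assumptions for demonstration, not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class AudioContext:
    """Contextual information detected in the capture environment (assumed fields)."""
    noise_level: float   # 0.0 (quiet) .. 1.0 (very noisy)
    emotion: str         # emotional classification of the user's voice

# Illustrative per-emotion threshold offsets; the patent does not specify values.
EMOTION_OFFSET = {"neutral": 0.0, "angry": -0.05, "sad": -0.05}

def recognition_threshold(ctx: AudioContext, base: float = 0.6) -> float:
    """Determine the voice or speech recognition threshold from contextual
    information and the emotional classification (claim 1): raise it in
    noisy environments, shift it when the voice deviates from neutral."""
    threshold = base + 0.3 * ctx.noise_level
    threshold += EMOTION_OFFSET.get(ctx.emotion, 0.0)
    return min(max(threshold, 0.0), 1.0)

def recognize(keyword_scores: dict, ctx: AudioContext) -> dict:
    """Output recognition results only for key words whose confidence
    score exceeds the context-dependent threshold."""
    t = recognition_threshold(ctx)
    return {word: score for word, score in keyword_scores.items() if score > t}
```

In a quiet, neutral setting this sketch keeps the base threshold of 0.6, so a key word scored at 0.7 is output while one at 0.5 is suppressed; the same 0.7 score would be rejected in a very noisy environment, where the threshold rises toward 0.9.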

Description

BACKGROUND

Modern computing devices, including cell phones, laptops, tablets, and desktop computers, use speech and/or voice recognition for various functions. Speech recognition extracts the words that are spoken, whereas voice recognition (also referred to as speaker identification) identifies the voice that is speaking rather than the words that are spoken. Thus, speech recognition determines "what someone said," while voice recognition determines "who said it." Speech recognition is useful for providing verbal commands to computing devices, eliminating the need to touch or directly engage a keyboard or touch screen. Voice recognition provides a similar convenience, but may also be applied as an identity authentication tool. Identifying the speaker may also improve speech recognition by allowing a recognition model customized for that speaker to be used. While contemporary software and hardware have improved at deciphering the subtle nuances of speech and voice recognition, the accuracy of such systems is generally impacted by ambient noise and other elements, such as the natural daily variations in a user's voice. Even systems that attempt to filter out ambient noise have trouble accounting for the variations in ambient noise that occur in different locations or types of location, or for the user voice variations that often occur. Attention is drawn to US 10 431 215 B2, which describes a system and method for adjusting natural language conversations between a human user and a computer based on the human user's cognitive state and/or situational state, particularly when the user is operating a vehicle. The system may disengage from conversation with the user (e.g., the driver) or take other actions based on various situational and/or user states.
For example, the system may disengage conversation when it detects that the driving situation is complex (e.g., the car merging onto a highway, or turning right with multiple pedestrians trying to cross). The system may, in addition or instead, sense the user's cognitive load and disengage conversation based on that load. The system may also alter its personality (e.g., by engaging in mentally non-taxing conversations such as telling jokes) based on situational and/or user states. Attention is further drawn to US 2003/182123 A1, which describes an emotion detecting method capable of accurately detecting the emotion of a human, and a sensibility generating method capable of outputting sensibility akin to that of a human. An intensity, a tempo, and an intonation in each word of a voice are detected based on an input voice signal, amounts of change are obtained for each of the detected quantities, and signals expressing the emotional states of anger, sadness, and pleasure are generated based on the amounts of change. A partner's emotion or situation information is input, from which instinctive motivation information is generated. Moreover, emotion information including the basic emotion parameters of pleasure, anger, and sadness is generated and is controlled based on individuality information. Attention is further drawn to US 2018/293988 A1, which describes techniques related to speaker recognition, including determining context-aware confidence values formed of false accept and false reject rates determined using adaptively updated acoustic environment score distributions matched to current score distributions.

SUMMARY

The present invention is set forth in the independent claims 1, 5 and 12. Further embodiments of the invention are described in the dependent claims. Various aspects include methods, and computing devices implementing the methods, of voice and/or speech recognition executed by a processor of a computing device.
Various aspects include determining a voice or speech recognition threshold for voice or speech recognition based on information obtained from contextual information detected in an environment from which a received audio input was captured by the computing device and an emotional classification of a user's voice in the received audio input, determining a confidence score for one or more key words identified in the received audio input, and outputting results of a voice or speech recognition analysis of the received audio input in response to the determined confidence score exceeding the determined voice or speech recognition threshold. Some aspects may include analyzing the received audio input to obtain the contextual information detected in the environment from which the received audio input was recorded by the computing device. Some aspects may include analyzing the received audio input to determine the emotional classification of the user's voice in the received audio input. Some aspects may include receiving an emotion classification model from a remote computing device, in which analyzing the received audio input to determine the emotional classification of the user's voice in the received audio input comprises analyzing the received audio input using the received emotion classification model.
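Claims 4 and 9 further describe deriving a recognition level from a detection rate and/or a false alarm rate and feeding that level back into the threshold determination. A minimal sketch of that feedback loop follows; the combination rule (detection rate minus false alarm rate), the target level, and the step size are all illustrative assumptions not specified in the patent.

```python
def recognition_level(detections: int, false_alarms: int, trials: int) -> float:
    """Estimate a recognition level of the received audio input from a
    detection rate and a false alarm rate (claim 4). The subtraction
    used to combine the two rates is an illustrative choice."""
    if trials == 0:
        return 0.0
    detection_rate = detections / trials
    false_alarm_rate = false_alarms / trials
    return max(detection_rate - false_alarm_rate, 0.0)

def adjust_threshold(current: float, level: float,
                     target: float = 0.8, step: float = 0.02) -> float:
    """Determine the voice or speech recognition threshold based on the
    determined recognition level (claim 9): lower it when recognition
    underperforms the target (too many misses), raise it when recognition
    performs well (to cut false alarms), keeping it within [0, 1]."""
    if level < target:
        current -= step
    else:
        current += step
    return min(max(current, 0.0), 1.0)
```

For example, 8 detections and 1 false alarm over 10 trials yield a recognition level of 0.7 under this rule, which sits below the assumed 0.8 target and therefore nudges the threshold downward on the next pass.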