US-12626703-B2 - User recognition for speech processing systems
Abstract
Systems, methods, and devices for recognizing a user are disclosed. A speech-controlled device captures a spoken utterance, and sends audio data corresponding thereto to a server. The server determines content sources storing or having access to content responsive to the spoken utterance. The server also determines multiple users associated with a profile of the speech-controlled device. Using the audio data, the server may determine user recognition data with respect to each user indicated in the speech-controlled device's profile. The server may also receive user recognition confidence threshold data from each of the content sources. The server may determine user recognition data that satisfies (i.e., meets or exceeds) the most stringent (i.e., highest) of the user recognition confidence threshold data. Thereafter, the server may send data indicating a user associated with the user recognition data to all of the content sources.
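As an illustration only, and not the claimed implementation, the threshold-selection logic from the abstract can be sketched in a few lines of Python. All names here (`ContentSource`, `recognize_user`, the score values) are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ContentSource:
    """Hypothetical content source with its required recognition confidence."""
    name: str
    confidence_threshold: float  # e.g., 0.0 (no requirement) to 1.0 (strictest)

def recognize_user(user_scores: dict[str, float],
                   sources: list[ContentSource]) -> str | None:
    """Return the profile user whose recognition confidence meets or exceeds
    the most stringent (highest) threshold across all sources, else None."""
    strictest = max(source.confidence_threshold for source in sources)
    # Best-scoring user among those on the device's profile.
    user_id, score = max(user_scores.items(), key=lambda item: item[1])
    return user_id if score >= strictest else None

# Example: two users on the device profile, two content sources.
sources = [ContentSource("music", 0.6), ContentSource("banking", 0.9)]
print(recognize_user({"alice": 0.93, "bob": 0.41}, sources))  # -> "alice"
```

Because a single recognition result is sent to all content sources, the sketch gates on the highest threshold any source demands, as the abstract describes.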
Inventors
- Natalia Vladimirovna Mamkina
- Naomi Bancroft
- Nishant Kumar
- Shamitha Somashekar
Assignees
- AMAZON TECHNOLOGIES, INC.
Dates
- Publication Date: 2026-05-12
- Application Date: 2024-05-14
Claims (20)
- 1. A computer-implemented method, comprising: receiving input audio data representing first speech; processing the input audio data to determine the first speech was spoken by a first user corresponding to a user identifier; causing speech processing to be performed using the input audio data to determine a first request; and determining first content responsive to the first request, wherein the first content is customized based at least in part on the user identifier.
- 2. The computer-implemented method of claim 1, further comprising: determining a user recognition condition corresponding to the first request; and based at least in part on the processing of the input audio data, determining that the user recognition condition is satisfied.
- 3. The computer-implemented method of claim 1, further comprising: determining the input audio data further corresponds to a second request; determining a user recognition condition corresponding to the second request; based at least in part on the processing of the input audio data, determining that the user recognition condition is not satisfied; and declining to execute the second request.
- 4. The computer-implemented method of claim 1, further comprising: determining the input audio data further corresponds to a second request; determining a user recognition condition corresponding to the second request; based at least in part on the processing of the input audio data, determining that the user recognition condition is not satisfied; and determining second content responsive to the second request, wherein the second content is not customized based on the user identifier.
- 5. The computer-implemented method of claim 1, wherein the first speech includes a wakeword.
- 6. The computer-implemented method of claim 1, wherein processing the input audio data to determine the first speech was spoken by a first user comprises: processing the input audio data with respect to stored data corresponding to a user profile to determine output data; and based at least in part on the output data, determining the first speech was spoken by the first user.
- 7. The computer-implemented method of claim 6, wherein processing the input audio data with respect to stored data comprises processing the input audio data with respect to feature data associated with the first user.
- 8. The computer-implemented method of claim 1, further comprising: capturing, by a first device, first audio corresponding to the first speech, wherein causing speech processing to be performed comprises sending, from the first device to a second device, the input audio data.
- 9. The computer-implemented method of claim 1, wherein the input audio data is received from a first device and the method further comprises: determining stored data associated with the first device, the stored data corresponding to a voice of a second user; and processing the input audio data with respect to the stored data to determine the first speech was not spoken by the second user.
- 10. The computer-implemented method of claim 1, further comprising: receiving image data; and processing the image data to determine a representation of a face of the first user, wherein causing output of first content responsive to the first request is further based at least in part on determination of the representation of the face of the first user.
- 11. A system comprising: at least one processor; and at least one memory comprising instructions that, when executed by the at least one processor, cause the system to: receive input audio data representing first speech; process the input audio data to determine the first speech was spoken by a first user corresponding to a user identifier; cause speech processing to be performed using the input audio data to determine a first request; and determine first content responsive to the first request, wherein the first content is customized based at least in part on the user identifier.
- 12. The system of claim 11, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: determine a user recognition condition corresponding to the first request; and based at least in part on processing of the input audio data, determine that the user recognition condition is satisfied.
- 13. The system of claim 11, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: determine the input audio data further corresponds to a second request; determine a user recognition condition corresponding to the second request; based at least in part on processing of the input audio data, determine that the user recognition condition is not satisfied; and decline to execute the second request.
- 14. The system of claim 13, wherein the instructions that cause the system to process the input audio data with respect to stored data comprise instructions that, when executed by the at least one processor, cause the system to process the input audio data with respect to feature data associated with the first user.
- 15. The system of claim 11, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: determine the input audio data further corresponds to a second request; determine a user recognition condition corresponding to the second request; based at least in part on processing of the input audio data, determine that the user recognition condition is not satisfied; and determine second content responsive to the second request, wherein the second content is not customized based on the user identifier.
- 16. The system of claim 11, wherein the first speech includes a wakeword.
- 17. The system of claim 11, wherein the instructions that cause the system to process the input audio data to determine the first speech was spoken by a first user comprise instructions that, when executed by the at least one processor, cause the system to: process the input audio data with respect to stored data corresponding to a user profile to determine output data; and based at least in part on the output data, determine the first speech was spoken by the first user.
- 18. The system of claim 11, wherein the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: capture, by a first device, first audio corresponding to the first speech, wherein the instructions that cause the system to cause speech processing to be performed comprise instructions that, when executed by the at least one processor, cause the system to send, from the first device to a second device, the input audio data.
- 19. The system of claim 11, wherein the input audio data is received from a first device and the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: determine stored data associated with the first device, the stored data corresponding to a voice of a second user; and process the input audio data with respect to the stored data to determine the first speech was not spoken by the second user.
- 20. The system of claim 11, wherein the input audio data is received from a first device and the at least one memory further comprises instructions that, when executed by the at least one processor, further cause the system to: receive, from the first device, image data; and process the image data to determine a representation of a face of the first user, wherein the instructions that cause the system to cause output of the first content responsive to the first request are further based at least in part on determination of the representation of the face of the first user.
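For orientation only, the per-request flow recited in claims 1 through 4 (recognize the speaker, derive requests from the audio, then gate personalization per request) might be sketched as below. The `Request` fields, `respond` function, and `content_for` callback are all hypothetical names, not language from the claims:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Request:
    intent: str
    required_confidence: float     # the "user recognition condition" for this request
    decline_if_unrecognized: bool  # claim 3 behavior (decline) vs. claim 4 (generic content)

def respond(requests: list[Request],
            user_id: str,
            confidence: float,
            content_for: Callable[[Request, Optional[str]], str]) -> list[Optional[str]]:
    """Gate personalization per request; None means the request was declined."""
    results: list[Optional[str]] = []
    for request in requests:
        if confidence >= request.required_confidence:
            results.append(content_for(request, user_id))  # customized (claims 1 and 2)
        elif request.decline_if_unrecognized:
            results.append(None)                           # declined to execute (claim 3)
        else:
            results.append(content_for(request, None))     # not customized (claim 4)
    return results
```

Note that a single utterance can carry multiple requests, each with its own recognition condition, so one request may be personalized while another is declined or served generically.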
Description
CROSS REFERENCE TO RELATED APPLICATION

This application is a continuation of, and claims priority to, U.S. Non-Provisional patent application Ser. No. 17/946,203, entitled "USER RECOGNITION FOR SPEECH PROCESSING SYSTEMS," filed Sep. 16, 2022, which is scheduled to issue as U.S. Pat. No. 11,990,127, which is a continuation of, and claims priority to, U.S. Non-Provisional patent application Ser. No. 16/935,523, entitled "USER RECOGNITION FOR SPEECH PROCESSING SYSTEMS," filed Jul. 22, 2020, which issued as U.S. Pat. No. 11,455,995, which is a continuation of, and claims priority to, U.S. Non-Provisional patent application Ser. No. 16/020,603, filed Jun. 27, 2018 and entitled "USER RECOGNITION FOR SPEECH PROCESSING SYSTEMS," which issued as U.S. Pat. No. 10,755,709, which is a continuation of, and claims priority to, U.S. Non-Provisional patent application Ser. No. 15/385,138, filed Dec. 20, 2016 and entitled "USER RECOGNITION FOR SPEECH PROCESSING SYSTEMS," which issued as U.S. Pat. No. 10,032,451. The above applications are herein incorporated by reference in their entireties.

BACKGROUND

Speech recognition systems have progressed to the point where humans can interact with computing devices by speaking. Such systems employ techniques to identify the words spoken by a human user based on the various qualities of a received audio input. Speech recognition combined with natural language understanding processing techniques enables speech-based user control of a computing device to perform tasks based on the user's spoken commands. The combination of speech recognition and natural language understanding processing techniques is referred to herein as speech processing. Speech processing may also involve converting a user's speech into text data, which may then be provided to various text-based software applications. Speech processing may be used by computers, hand-held devices, telephone computer systems, kiosks, and a wide variety of other devices to improve human-computer interactions.

BRIEF DESCRIPTION OF DRAWINGS

For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.
- FIG. 1 illustrates a system for recognizing a user that speaks an utterance according to embodiments of the present disclosure.
- FIG. 2 is a conceptual diagram of how a spoken utterance may be processed according to embodiments of the present disclosure.
- FIG. 3 is a conceptual diagram of a system architecture for parsing incoming utterances using multiple domains according to embodiments of the present disclosure.
- FIG. 4 is a conceptual diagram of how text-to-speech processing is performed according to embodiments of the present disclosure.
- FIG. 5 illustrates data stored and associated with user profiles according to embodiments of the present disclosure.
- FIG. 6 is a flow diagram illustrating processing performed to prepare audio data for ASR and user recognition according to embodiments of the present disclosure.
- FIG. 7 is a diagram of a vector encoder according to embodiments of the present disclosure.
- FIG. 8 is a system flow diagram illustrating user recognition according to embodiments of the present disclosure.
- FIGS. 9A through 9C are a signal flow diagram illustrating determining output content based on user recognition according to embodiments of the present disclosure.
- FIG. 10 is a block diagram conceptually illustrating example components of a device according to embodiments of the present disclosure.
- FIG. 11 is a block diagram conceptually illustrating example components of a server according to embodiments of the present disclosure.
- FIG. 12 illustrates an example of a computer network for use with the system.

DETAILED DESCRIPTION

Automatic speech recognition (ASR) is a field of computer science, artificial intelligence, and linguistics concerned with transforming audio data associated with speech into text representative of that speech. Similarly, natural language understanding (NLU) is a field of computer science, artificial intelligence, and linguistics concerned with enabling computers to derive meaning from text input containing natural language. ASR and NLU are often used together as part of a speech processing system. Text-to-speech (TTS) is a field concerning transforming textual data into audio data that is synthesized to resemble human speech. Speech processing systems have become robust platforms enabled to perform a variety of speech-related tasks such as playing music, controlling household devices, communicating with other users, shopping, etc. Speech processing systems may process a spoken utterance to obtain content responsive thereto (for example, output music, news content, or the like). Speech processing systems may also process a spoken utterance and therefrom perform TTS processing to create computer-generated speech responsive to the spoken utterance, thus enabling the system to engage in a conversation with a user.
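The description (and claims 6 and 7) refer to comparing input audio against stored feature data for each user on a profile, with a vector encoder shown in FIG. 7. As a minimal sketch only, assuming utterances and enrolled voices are both represented as fixed-length feature vectors and using cosine similarity as a stand-in scoring function (the patent does not specify this metric), per-user scoring might look like:

```python
import numpy as np

def score_profile_users(utterance_vec: np.ndarray,
                        stored_vecs: dict[str, np.ndarray]) -> dict[str, float]:
    """Score the utterance embedding against each enrolled user's stored
    voice feature vector; higher means more likely the same speaker."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return {user: cosine(utterance_vec, vec) for user, vec in stored_vecs.items()}

# Example with toy 3-dimensional embeddings (real encoders use far larger vectors).
scores = score_profile_users(
    np.array([0.2, 0.9, 0.1]),
    {"alice": np.array([0.25, 0.85, 0.05]), "bob": np.array([0.9, 0.1, 0.4])},
)
best_user = max(scores, key=scores.get)  # highest-similarity enrolled user
```

The resulting per-user scores are the kind of user recognition data that would then be tested against the content sources' confidence thresholds, as in the abstract.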