US-20260129260-A1 - Speaker-Identification Model for Controlling Operation of a Media Player

Abstract

In one aspect, an example method includes (i) obtaining, by a media player of a media presentation system, an audio signal, where the audio signal includes a voice command and is obtained using a microphone of the media presentation system; (ii) identifying, by the media player, which of multiple speakers of a household uttered the voice command using the audio signal and a speaker-identification model; (iii) performing, by the media player, an action corresponding to the voice command; and (iv) based on the identifying of the speaker using the audio signal and the speaker-identification model, selecting, by the media player, a user profile associated with the identified speaker within a streaming channel so as to bypass a profile selection screen of the streaming channel.
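The flow in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the disclosure: `handle_voice_command`, `LaunchRequest`, and the `identify_speaker` callable (a stand-in for the speaker-identification model) are assumptions, and the channel name is assumed to have already been parsed from the voice command. The point is only that passing a profile identifier along with the launch lets the channel skip its profile selection screen.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class LaunchRequest:
    """Hypothetical payload handed to a streaming channel at launch."""
    channel: str
    # None means no profile was selected, so the channel would fall back
    # to its default profile selection screen.
    profile_id: Optional[str]


def handle_voice_command(
    audio: bytes,
    identify_speaker: Callable[[bytes], str],
    profiles_by_speaker: Dict[str, str],
    channel: str,
) -> LaunchRequest:
    """Identify the speaker, then launch the channel with that speaker's profile."""
    # Step (ii): decide which household member uttered the command.
    speaker = identify_speaker(audio)
    # Step (iv): look up the user profile associated with that speaker.
    profile_id = profiles_by_speaker.get(speaker)
    # Step (iii): the action here is launching the channel; the profile
    # travels with the launch request.
    return LaunchRequest(channel=channel, profile_id=profile_id)
```

For example, calling `handle_voice_command(audio, model_fn, {"alice": "p1"}, "ExampleChannel")` with a `model_fn` that returns `"alice"` would yield a request carrying profile `"p1"`; an unrecognized speaker yields `profile_id=None`.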

Inventors

Frank Maker

Assignees

ROKU, INC.

Dates

Publication Date: 20260507
Application Date: 20251230

Claims (20)

1 . A media presentation system configured for performing a set of acts comprising: presenting an advertisement; obtaining an audio signal, wherein the audio signal comprises a voice command; identifying, using the audio signal, which of multiple speakers of a household uttered the voice command; obtaining metadata for the identified speaker; based on the identifying of the speaker using the audio signal, generating an advertisement impression record that associates the metadata for the identified speaker with the advertisement; presenting the advertisement again; after obtaining the audio signal, obtaining another audio signal, wherein the other audio signal comprises another voice command; determining, using the other audio signal, that another speaker of the multiple speakers of the household uttered the voice command; obtaining metadata for the other identified speaker; and generating another advertisement impression record that associates the metadata for the other identified speaker with the additional presentation of the advertisement.
2 . The media presentation system of claim 1 , wherein identifying, using the audio signal, which of multiple speakers of a household uttered the voice command comprises while or after presenting the advertisement, identifying, using the audio signal, which of multiple speakers of a household uttered the voice command.
3 . The media presentation system of claim 1 , wherein identifying which of the multiple speakers of the household uttered the voice command comprises: extracting a query fingerprint from the audio signal using a speaker-identification model; and identifying the speaker by comparing the query fingerprint against multiple reference fingerprints corresponding to respective speakers of the multiple speakers of the household.
4 . The media presentation system of claim 3 , wherein: the query fingerprint comprises an n-dimensional query vector, the multiple reference fingerprints comprise n-dimensional reference vectors, and comparing the query fingerprint against the multiple reference fingerprints comprises determining which of the n-dimensional reference vectors is nearest to the n-dimensional query vector.
5 . The media presentation system of claim 1 , the set of acts further comprising: performing an action corresponding to the voice command; and based on the identifying of the speaker using the audio signal, selecting a user profile associated with the identified speaker within a streaming channel, wherein the streaming channel is configured by default to provide the profile selection screen after the streaming channel has been launched, and wherein selecting the user profile associated with the identified speaker within the streaming channel causes the media presentation system to provide data indicative of the selection of the user profile to the streaming channel so as to instead bypass the profile selection screen of the streaming channel after the media presentation system launches the streaming channel.
6 . The media presentation system of claim 5 , wherein: the voice command comprises a request to launch the streaming channel, and performing the action comprises launching the streaming channel.
7 . The media presentation system of claim 5 , wherein: the voice command comprises a request to play media content that is available on the streaming channel, and performing the action comprises launching the streaming channel and presenting the media content.
8 . The media presentation system of claim 1 , the set of acts further comprising: obtaining training data comprising audio signals labeled as uttered by respective speakers of the multiple speakers of the household; and training a speaker-identification model using the training data, wherein identifying which of multiple speakers of a household uttered the voice command comprises identifying, using the audio signal and the speaker-identification model, which of multiple speakers of the household uttered the voice command; presenting an advertisement; obtaining an audio signal, wherein the audio signal comprises a voice command; identifying, using the audio signal, which of multiple speakers of the household uttered the voice command; obtaining metadata for the identified speaker; and based on the identifying of the speaker using the audio signal, generating an advertisement impression record that associates the metadata for the identified speaker with the advertisement.
9 . A method performed by a media presentation system, the method comprising: presenting an advertisement; obtaining an audio signal, wherein the audio signal comprises a voice command; identifying, using the audio signal, which of multiple speakers of a household uttered the voice command; obtaining metadata for the identified speaker; based on the identifying of the speaker using the audio signal, generating an advertisement impression record that associates the metadata for the identified speaker with the advertisement; presenting the advertisement again; after obtaining the audio signal, obtaining another audio signal, wherein the other audio signal comprises another voice command; determining, using the other audio signal, that another speaker of the multiple speakers of the household uttered the voice command; obtaining metadata for the other identified speaker; and generating another advertisement impression record that associates the metadata for the other identified speaker with the additional presentation of the advertisement.
10 . The method of claim 9 , wherein identifying, using the audio signal, which of multiple speakers of a household uttered the voice command comprises while or after presenting the advertisement, identifying, using the audio signal, which of multiple speakers of a household uttered the voice command.
11 . The method of claim 9 , wherein identifying which of the multiple speakers of the household uttered the voice command comprises: extracting a query fingerprint from the audio signal using a speaker-identification model; and identifying the speaker by comparing the query fingerprint against multiple reference fingerprints corresponding to respective speakers of the multiple speakers of the household.
12 . The method of claim 11 , wherein: the query fingerprint comprises an n-dimensional query vector, the multiple reference fingerprints comprise n-dimensional reference vectors, and comparing the query fingerprint against the multiple reference fingerprints comprises determining which of the n-dimensional reference vectors is nearest to the n-dimensional query vector.
13 . The method of claim 9 , further comprising: performing an action corresponding to the voice command; and based on the identifying of the speaker using the audio signal, selecting a user profile associated with the identified speaker within a streaming channel, wherein the streaming channel is configured by default to provide the profile selection screen after the streaming channel has been launched, and wherein selecting the user profile associated with the identified speaker within the streaming channel causes the media presentation system to provide data indicative of the selection of the user profile to the streaming channel so as to instead bypass the profile selection screen of the streaming channel after the media presentation system launches the streaming channel.
14 . The method of claim 13 , wherein: the voice command comprises a request to launch the streaming channel, and performing the action comprises launching the streaming channel.
15 . The method of claim 13 , wherein: the voice command comprises a request to play media content that is available on the streaming channel, and performing the action comprises launching the streaming channel and presenting the media content.
16 . The method of claim 9 , further comprising: obtaining training data comprising audio signals labeled as uttered by respective speakers of the multiple speakers of the household; and training a speaker-identification model using the training data.
17 . A non-transitory computer-readable storage medium having stored thereon program instructions that, when executed by a processor, cause a computing system to perform a set of acts comprising: presenting an advertisement; obtaining an audio signal, wherein the audio signal comprises a voice command; identifying, using the audio signal, which of multiple speakers of a household uttered the voice command; obtaining metadata for the identified speaker; based on the identifying of the speaker using the audio signal, generating an advertisement impression record that associates the metadata for the identified speaker with the advertisement; presenting the advertisement again; after obtaining the audio signal, obtaining another audio signal, wherein the other audio signal comprises another voice command; determining, using the other audio signal, that another speaker of the multiple speakers of the household uttered the voice command; obtaining metadata for the other identified speaker; and generating another advertisement impression record that associates the metadata for the other identified speaker with the additional presentation of the advertisement.
18 . The non-transitory computer-readable storage medium of claim 17 , wherein identifying, using the audio signal, which of multiple speakers of a household uttered the voice command comprises while or after presenting the advertisement, identifying, using the audio signal, which of multiple speakers of a household uttered the voice command.
19 . The non-transitory computer-readable storage medium of claim 17 , wherein identifying which of the multiple speakers of the household uttered the voice command comprises: extracting a query fingerprint from the audio signal using a speaker-identification model; and identifying the speaker by comparing the query fingerprint against multiple reference fingerprints corresponding to respective speakers of the multiple speakers of the household.
20 . The non-transitory computer-readable storage medium of claim 19 , wherein: the query fingerprint comprises an n-dimensional query vector, the multiple reference fingerprints comprise n-dimensional reference vectors, and comparing the query fingerprint against the multiple reference fingerprints comprises determining which of the n-dimensional reference vectors is nearest to the n-dimensional query vector.
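The fingerprint comparison recited in claims 4, 12, and 20 and the impression records recited in claims 1, 9, and 17 can be sketched together as follows. This is an illustration under stated assumptions, not the claimed implementation: the claims say only that the "nearest" reference vector is chosen, so Euclidean distance is an assumed metric, and the names `nearest_speaker`, `ImpressionRecord`, and `record_impression` are hypothetical. The query vector is taken as an input, since how the speaker-identification model extracts it is outside this sketch.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


def nearest_speaker(
    query: Sequence[float],
    references: Dict[str, Sequence[float]],
) -> str:
    """Return the speaker whose reference vector is nearest to the query vector.

    Euclidean distance is an assumption; the claims do not name a metric.
    """
    return min(references, key=lambda name: math.dist(query, references[name]))


@dataclass
class ImpressionRecord:
    """Hypothetical record tying speaker metadata to one ad presentation."""
    ad_id: str
    presentation: int  # 1 for the first presentation, 2 for the next, ...
    speaker_metadata: dict


def record_impression(
    ad_id: str,
    presentation: int,
    query: Sequence[float],
    references: Dict[str, Sequence[float]],
    metadata: Dict[str, dict],
) -> ImpressionRecord:
    """Identify the speaker from the query fingerprint and build the record."""
    speaker = nearest_speaker(query, references)
    return ImpressionRecord(ad_id, presentation, metadata[speaker])
```

Calling `record_impression` once per presentation, each time with the fingerprint of the voice command heard around that presentation, would produce one record per presentation, each associated with whichever household member was identified at that time.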

Description

PRIORITY

This disclosure is a continuation of, and claims priority to, U.S. Pat. App. No. 18/768,729 filed July 10, 2024, which is a continuation of, and claims priority to, U.S. Pat. App. No. 18/189,701 filed March 24, 2023, which is a continuation of, and claims priority to, U.S. Pat. App. No. 17/838,847 filed June 13, 2022, all of which are hereby incorporated by reference herein in their entirety.

USAGE AND TERMINOLOGY

In this disclosure, unless otherwise specified and/or unless the particular context clearly dictates otherwise, the terms “a” or “an” mean at least one, and the term “the” means the at least one.

SUMMARY

A typical media presentation system operates to facilitate presentation of media content, such as video, audio, or multi-media content, to end users. An example of such a system could include client-side equipment positioned at customer premises and arranged to output and present media content on a user interface such as a display screen with an associated sound system, and server-side equipment arranged to serve media content to the client-side equipment for presentation.

By way of example, the client-side equipment could include a media presentation device such as a television (TV), monitor, tablet computer, or mobile phone, which could present the media content on a user interface. Further, the client-side equipment could include a media player such as an over-the-top (OTT) streaming media player, cable or satellite set top box, digital video recorder, disc player, gaming system, mobile phone, personal computer, audio/video receiver, or tuner, which could be integrated with or in local or network communication with the media presentation device and could output media content to the media presentation device for presentation on the user interface. And the server-side equipment could include a media server and/or head-end equipment, operated by an OTT provider (e.g., virtual multichannel video programming distributor (virtual MVPD)), cable or satellite TV provider, or the like, which could stream or otherwise deliver media content to the client-side equipment for presentation.

In operation, a user at the customer premises may control the client-side equipment, to cause the system to present a desired media-content item, such as a movie, TV show, or video game, among other possibilities, any of which might be locally-stored, broadcast, or on-demand, also among other possibilities. For instance, the media presentation system may present the user with an on-screen media-content selection menu, and the user may operate a remote control to navigate through that menu, to select a desired media-content item, and to direct the system to present the selected media-content item. In response, possibly through interaction between the client-side equipment and the server-side equipment, the client-side equipment could obtain and present the selected media-content item to the user. And the user may then enjoy presentation of that selected media-content item.

In one aspect, an example method is disclosed.
The method includes (i) obtaining, by a media player of a media presentation system, an audio signal, where the audio signal includes a voice command and is obtained using a microphone of the media presentation system; (ii) identifying, by the media player, which of multiple speakers of a household uttered the voice command using the audio signal and a speaker-identification model; (iii) performing, by the media player, an action corresponding to the voice command; and (iv) based on the identifying of the speaker using the audio signal and the speaker-identification model, selecting, by the media player, a user profile associated with the identified speaker within a streaming channel so as to bypass a profile selection screen of the streaming channel.

In another aspect, an example media player of a media presentation system is disclosed. The media player is configured for performing a set of acts including (i) obtaining an audio signal, where the audio signal includes a voice command and is obtained using a microphone of the media presentation system; (ii) identifying which of multiple speakers of a household uttered the voice command using the audio signal and a speaker-identification model; (iii) performing an action corresponding to the voice command; and (iv) based on the identifying of the speaker using the audio signal and the speaker-identification model, selecting a user profile associated with the identified speaker within a streaming channel so as to bypass a profile selection screen of the streaming channel.

In another aspect, an example media player of a media presentation system is disclosed. The media player is configured for performing a set of acts including (i) obtaining an audio signal, where the audio signal includes a voice command and is obtained using a microphone of the media presentation system; (ii) identifying which of multiple speakers of a household uttered the voice comman