EP-4468289-B1 - PRIVACY-AWARE MEETING ROOM TRANSCRIPTION FROM AUDIO-VISUAL STREAM
Inventors
- SIOHAN, Oliver
- BRAGA, Otavio
- CASTILLO, Basilio Garcia
- LIAO, Hank
- ROSE, Richard
- MAKINO, Takaki
Dates
- Publication Date
- 20260513
- Application Date
- 20191118
Claims (10)
- A computer-implemented method executed by data processing hardware that causes the data processing hardware to perform operations comprising: receiving an audio-visual signal (217, 218) comprising audio data (218) and image data (217), the audio data comprising a plurality of speech utterances (12) from a plurality of participants (10) in a speech environment (100) and the image data representing faces of the plurality of participants in the speech environment; receiving a privacy request (14) indicating a privacy condition from a participant of the plurality of participants, the privacy condition comprising a content-specific condition indicating a type of content to exclude from a transcript; segmenting (306) the audio data into a plurality of segments (222); for each segment of the audio data: determining (308a) from among the plurality of participants, an identity of a speaker of a corresponding segment (222) of the audio data based on the image data (217); determining (308b) whether the identity of the speaker of the corresponding segment comprises the participant associated with the privacy condition indicated by the received privacy request; and when the identity of the speaker of the corresponding segment comprises the participant, applying (308c) the privacy condition to the corresponding segment; determining when the type of content occurs during a communication session in the speech environment by processing the audio data to identify one or more speech utterances of the plurality of speech utterances that correspond to the type of content; and generating, based on the audio data, the transcript (202), the transcript excluding the one or more speech utterances of the plurality of speech utterances that correspond to the type of content.
- The method of claim 1, wherein the data processing hardware resides on a device that is local to a user associated with the audio data.
- The method of claim 2, wherein processing the audio data is performed locally on the device.
- The method of any preceding claim, wherein the type of content comprises content corresponding to one or more keywords.
- The method of any preceding claim, wherein the type of content comprises content associated with a specific person.
- The method of any preceding claim, wherein the operations further comprise, for each utterance of the plurality of speech utterances of the audio data, associating the respective speech utterance with one of a first user or a second user.
- The method of claim 6, wherein the privacy request only applies to each respective utterance of the plurality of speech utterances associated with the first user.
- The method of claim 6, wherein the privacy request applies to: each respective utterance of the plurality of speech utterances associated with the first user; and each respective utterance of the plurality of speech utterances associated with the second user.
- The method of any preceding claim, wherein the image data comprises high-definition video processed by the data processing hardware.
- A system comprising: data processing hardware (410); and memory hardware (420) in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising the method of any preceding claim.
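The method of claim 1 can be illustrated with a minimal sketch. All names and data structures below are hypothetical simplifications: the speaker identity is assumed to have already been resolved from the image data, and the "type of content" is modeled as per-participant keywords (as in claim 4). The claims do not prescribe any particular implementation.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str  # identity resolved from the image data (claim 1)
    text: str     # transcribed speech utterance

def generate_transcript(segments, privacy_requests):
    """Build a transcript, excluding utterances that match a
    participant's content-specific privacy condition.

    privacy_requests maps a participant to a set of excluded
    keywords (a keyword-based stand-in for the claimed
    "type of content")."""
    lines = []
    for seg in segments:
        banned = privacy_requests.get(seg.speaker, set())
        # Drop the utterance when it contains excluded content.
        if any(kw in seg.text.lower() for kw in banned):
            continue
        lines.append(f"{seg.speaker}: {seg.text}")
    return "\n".join(lines)

segments = [
    Segment("alice", "Let's review the roadmap."),
    Segment("bob", "My salary discussion stays private."),
    Segment("bob", "The release date is next week."),
]
transcript = generate_transcript(segments, {"bob": {"salary"}})
```

Note that only the matching utterance is dropped: the same speaker's other segments still appear in the transcript, mirroring the per-segment structure of claim 1.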
Description
TECHNICAL FIELD
This disclosure relates to privacy-aware meeting room transcription from an audio-visual stream.
BACKGROUND
Speaker diarization is the process of partitioning an input audio stream into homogeneous segments according to speaker identity. In an environment with multiple speakers, speaker diarization answers the question "who is speaking when" and has a variety of applications including multimedia information retrieval, speaker turn analysis, and audio processing, to name a few. In particular, speaker diarization systems are capable of producing speaker boundaries that have the potential to significantly improve acoustic speech recognition accuracy. US2019066686A1 discloses selective enforcement of privacy and confidentiality for optimization of voice applications. US2015220626A1 discloses automated removal of private information.
SUMMARY
One aspect of the disclosure provides a method as defined in claim 1 for generating a privacy-aware meeting room transcript from a content stream. Implementations of the disclosure may include one or more of the following optional features. In some implementations, applying the privacy condition to the corresponding segment includes deleting the corresponding segment of the audio data after determining the transcript. Additionally or alternatively, applying the privacy condition to the corresponding segment may include augmenting a corresponding segment of the image data to visually conceal the identity of the speaker of the corresponding segment of the audio data. In some examples, for each portion of the transcript that corresponds to one of the segments of the audio data applying the privacy condition, processing the plurality of segments of the audio data to determine the transcript for the audio data includes modifying the corresponding portion of the transcript to not include the identity of the speaker.
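One way to "modify the corresponding portion of the transcript to not include the identity of the speaker" is to relabel that speaker's portions with a neutral label while keeping the content. The helper below is a hypothetical sketch, not the disclosed implementation; the transcript is modeled simply as (speaker, text) pairs.

```python
def redact_speaker(transcript_lines, protected_speaker, label="Speaker"):
    """Conceal a protected participant's identity in the transcript.

    transcript_lines: list of (speaker, text) pairs.
    Returns a copy in which the protected speaker's name is replaced
    by a neutral label, leaving the spoken content intact."""
    out = []
    for speaker, text in transcript_lines:
        if speaker == protected_speaker:
            speaker = label  # identity removed, content kept
        out.append((speaker, text))
    return out

lines = [("alice", "Hello."), ("bob", "I object.")]
redacted = redact_speaker(lines, "bob")
```

The alternative described above, omitting transcription of the protected segments entirely, would instead filter those pairs out of the list.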
Optionally, for each segment of the audio data applying the privacy condition, processing the plurality of segments of the audio data to determine the transcript for the audio data may include omitting transcribing the corresponding segment of the audio data. The privacy condition includes a content-specific condition, the content-specific condition indicating a type of content to exclude from the transcript. In some configurations, determining, from among the plurality of participants, the identity of the speaker of the corresponding segment of the audio data includes determining a plurality of candidate identities for the speaker based on the image data. Here, for each candidate identity of the plurality of candidate identities, generating a confidence score indicating a likelihood that a face of a corresponding candidate identity based on the image data includes a speaking face of the corresponding segment of the audio data. In this configuration, the method includes selecting the identity of the speaker of the corresponding segment of the audio data as the candidate identity of the plurality of candidate identities associated with the highest confidence score. In some implementations, the data processing hardware resides on a device that is local to at least one participant of the plurality of participants. The image data may include high-definition video processed by the data processing hardware. Processing the plurality of segments of the audio data to determine a transcript for the audio data may include processing the image data to determine the transcript.
Another aspect of the disclosure provides a system for privacy-aware transcription as defined in claim 10. This aspect may include one or more of the following optional features. In some examples, applying the privacy condition to the corresponding segment includes deleting the corresponding segment of the audio data after determining the transcript.
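The candidate-identity selection described above reduces to an argmax over confidence scores. In this hypothetical sketch the scores are given as plain numbers; in the disclosure they would be produced by evaluating each candidate's face in the image data against the corresponding audio segment.

```python
def select_speaker(candidates):
    """Select the speaker identity for a segment.

    candidates maps each candidate identity to a confidence score,
    i.e. the likelihood that this candidate's face is the speaking
    face for the segment. Returns the identity with the highest
    score, per the configuration described in the summary."""
    return max(candidates, key=candidates.get)

# Hypothetical per-segment confidence scores.
scores = {"alice": 0.31, "bob": 0.87, "carol": 0.55}
speaker = select_speaker(scores)
```

A production system would likely also apply a minimum-confidence threshold before committing to an identity, but the claims only require selecting the highest-scoring candidate.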
Optionally, applying the privacy condition to the corresponding segment may include augmenting a corresponding segment of the image data to visually conceal the identity of the speaker of the corresponding segment of the audio data. In some configurations, processing the plurality of segments of the audio data to determine the transcript for the audio data includes, for each portion of the transcript that corresponds to one of the segments of the audio data applying the privacy condition, modifying the corresponding portion of the transcript to not include the identity of the speaker. Additionally or alternatively, processing the plurality of segments of the audio data to determine the transcript for the audio data may include, for each segment of the audio data applying the privacy condition, omitting transcribing the corresponding segment of the audio data. The privacy condition includes a content-specific condition, the content-specific condition indicating a type of content to exclude from the transcript. In some implementations, the operation of determining, from among the plurality of participants, the identity of the speaker of the corresponding segmen