
EP-4430600-B1 - MULTI-DEVICE, MULTI-CHANNEL ATTENTION FOR SPEECH AND AUDIO ANALYTICS APPLICATIONS


Inventors

  • NOSRATI, Hadis
  • POTTER, Brenton James

Dates

Publication Date
2026-05-06
Application Date
2022-11-09

Claims (15)

  1. A method, comprising: receiving, by a control system (160A), sensor data (114A-C, 115A-C, 116A-C) from each of a plurality of sensors (104A-C, 105A-C, 106A-C) in an environment, the plurality of sensors corresponding to a plurality of devices (104, 105, 106) in the environment, the sensor data including microphone data; producing, by the control system (160A), an input embedding vector (204A-C, 205A-C, 206A-C) corresponding to each sensor of the plurality of sensors; characterised by: producing, by the control system (160A), a device-wise context vector corresponding to each device of the plurality of devices in the environment, to produce a plurality of device-wise context vectors (207, 208, 209), wherein producing the device-wise context vector involves integrating each of a plurality of input embedding vectors corresponding to at least one multi-sensor device, wherein integrating the input embedding vectors involves producing a plurality of cross-channel context vectors, wherein each input embedding vector (204A-C, 205A-C, 206A-C) belongs to a channel, and wherein a cross-channel context vector of a first channel is based, at least in part, on channel self-context vectors of at least a second channel and a third channel, and wherein each channel self-context vector is determined using a scaled dot product attention process or a multi-head attention process; obtaining, by the control system, one or more prior analytics output tokens (210) within the length of a context window; generating, by the control system, an output embedding vector (211) corresponding to the one or more prior analytics output tokens (210); obtaining, by the control system, ground truth data (211), wherein the ground truth data corresponds to the one or more prior analytics output tokens (210); comparing, by the control system, each device-wise context vector (207, 208, 209) of the plurality of device-wise context vectors with the ground truth data (211) to produce a comparison result (212), wherein the comparing involves an attention-based process, wherein the attention-based process includes a scaled dot product attention process or a multi-head attention process; generating, by the control system, one or more current output analytics tokens (213) based, at least in part, on the comparison result (212); and controlling, by the control system, the operation of at least one device of the plurality of devices (104, 105, 106) in the environment based, at least in part, on the one or more current output analytics tokens (213), wherein the controlling involves controlling at least one of a loudspeaker operation or a microphone operation.
  2. The method of claim 1, wherein the controlling involves controlling one or more of an automatic speech recognition, ASR, process, an acoustic scene analysis, ASA, process, a talker identification process or a Sound Event Classification, SEC, process.
  3. The method of any one of claims 1-2, wherein one or more aspects of the method are implemented via a trained neural network.
  4. The method of claim 3, wherein the trained neural network comprises a trained attention-based neural network.
  5. The method of any one of claims 1-4, wherein the control system is configured to implement a multi-channel neural context encoder for integrating each of the plurality of input embedding vectors.
  6. The method of claim 5, wherein the multi-channel neural context encoder comprises a trained attention-based neural network.
  7. The method of any one of claims 1-6, further comprising producing a first channel-wise context vector based, at least in part, on a cross-channel context vector and a channel self-context vector.
  8. The method of claim 7, wherein producing the first channel-wise context vector involves using the channel self-context vector as a query and the cross-channel context vector as key and value inputs.
  9. The method of any one of claims 1-8, wherein producing the device-wise context vector involves pooling the plurality of channel-wise context vectors.
  10. The method of any one of claims 1-9, wherein the comparing is performed by a multi-device context module that comprises one or more attention-based neural networks.
  11. The method of claim 10, wherein the multi-device context module is configured to implement the scaled dot product attention process or the multi-head attention process.
  12. The method of any one of claims 1-11, wherein the one or more output analytics tokens comprise one or more prior analytics output tokens corresponding to an active noise cancellation process.
  13. An apparatus configured to implement the method of any one of claims 1-12.
  14. A system configured to implement the method of any one of claims 1-12.
  15. One or more non-transitory media having software stored thereon, the software including instructions for controlling one or more devices to implement the method of any one of claims 1-12.
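The following is an illustrative, non-authoritative sketch (in Python with NumPy) of the attention pipeline recited in claim 1 and elaborated in claims 5-11: per-channel self-context vectors are computed with scaled dot product attention, a cross-channel context vector for each channel is formed from the self-context vectors of the other channels, channel-wise context vectors use the self-context as query and the cross-channel context as key and value, and the channel-wise vectors are pooled into a device-wise context vector that is then attended against an embedding of the prior analytics output tokens. All names, array shapes, dimensions and the choice of mean pooling are assumptions made for illustration only; they are not taken from the patent.

# Illustrative sketch only: a toy NumPy mock-up of the multi-channel,
# multi-device attention pipeline of claim 1. Shapes, helper names and the
# use of mean pooling are assumptions, not details taken from the patent.
import numpy as np

D_MODEL = 16  # assumed embedding dimensionality


def scaled_dot_product_attention(q, k, v):
    # softmax(Q K^T / sqrt(d)) V
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def channel_self_context(channel_embeddings):
    # Channel self-context vector: self-attention over one channel's input
    # embedding vectors, pooled to a single vector.
    ctx = scaled_dot_product_attention(
        channel_embeddings, channel_embeddings, channel_embeddings)
    return ctx.mean(axis=0)


def device_context(per_channel_embeddings):
    # Device-wise context vector for one multi-sensor device (cf. claims 7-9):
    # the cross-channel context of a channel is built from the self-context
    # vectors of the other channels; the channel-wise context vector uses the
    # channel's self-context as query and the cross-channel context as key
    # and value; the channel-wise context vectors are then pooled.
    self_ctx = [channel_self_context(c) for c in per_channel_embeddings]
    channel_wise = []
    for i, own in enumerate(self_ctx):
        others = np.stack([s for j, s in enumerate(self_ctx) if j != i])
        channel_wise.append(
            scaled_dot_product_attention(own[None, :], others, others)[0])
    return np.mean(channel_wise, axis=0)


# Toy example: 3 devices, each with 3 microphone channels of 5 frames.
rng = np.random.default_rng(0)
devices = [[rng.normal(size=(5, D_MODEL)) for _ in range(3)] for _ in range(3)]
device_ctx = np.stack([device_context(d) for d in devices])      # (3, D_MODEL)

# Multi-device context stage (loosely after claims 10-11): attend the output
# embedding of the prior analytics output tokens against the device-wise
# context vectors to obtain a comparison result, from which the current
# output analytics tokens would be generated.
prior_token_embedding = rng.normal(size=(1, D_MODEL))
comparison_result = scaled_dot_product_attention(
    prior_token_embedding, device_ctx, device_ctx)
print(comparison_result.shape)  # (1, 16)

In a deployed system the plain scaled dot product attention blocks above would typically be replaced by trained multi-head attention layers, and the comparison result would feed a decoder that emits the current output analytics tokens (213) used to control loudspeaker or microphone operation.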

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application No. 63/277,231, filed on November 9, 2021, and U.S. Provisional Application No. 63/374,870, filed on September 7, 2022.

TECHNICAL FIELD

This disclosure pertains to devices, systems and methods for estimating the reliability of sensor data, such as microphone signals, received from multiple devices in an environment, as well as to devices, systems and methods for using selected sensor data.

BACKGROUND

Methods, devices and systems for selecting and using sensor data are widely deployed. Although existing devices, systems and methods for selecting and using sensor data provide benefits, improved systems and methods would be desirable. Prior art document US 2018/330589 A1 discloses a method for sending a signal, recorded by multiple microphones, to a voice assistant. The signal is compared to a monitoring criterion, and the assistant is controlled depending on the comparison.

NOTATION AND NOMENCLATURE

Throughout this disclosure, including in the claims, the terms "speaker," "loudspeaker" and "audio reproduction transducer" are used synonymously to denote any sound-emitting transducer (or set of transducers) driven by a single speaker feed. A typical set of headphones includes two speakers. A speaker may be implemented to include multiple transducers (e.g., a woofer and a tweeter), which may be driven by a single, common speaker feed or by multiple speaker feeds. In some examples, the speaker signal(s) may undergo different processing in different circuitry branches coupled to the different transducers.

Throughout this disclosure, including in the claims, the expression performing an operation "on" a signal or data (e.g., filtering, scaling, transforming, or applying gain to, the signal or data) is used in a broad sense to denote performing the operation directly on the signal or data, or on a processed version of the signal or data (e.g., on a version of the signal that has undergone preliminary filtering or pre-processing prior to performance of the operation thereon).

Throughout this disclosure, including in the claims, the expression "system" is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X - M inputs are received from an external source) may also be referred to as a decoder system.

Throughout this disclosure, including in the claims, the term "processor" is used in a broad sense to denote a system or device programmable or otherwise configurable (e.g., with software or firmware) to perform operations on data (e.g., audio, or video or other image data). Examples of processors include a field-programmable gate array (or other configurable integrated circuit or chip set), a digital signal processor programmed and/or otherwise configured to perform pipelined processing on audio or other sound data, a programmable general purpose processor or computer, and a programmable microprocessor chip or chip set.

Throughout this disclosure, including in the claims, the term "couples" or "coupled" is used to mean either a direct or indirect connection. Thus, if a first device couples to a second device, that connection may be through a direct connection, or through an indirect connection via other devices and connections.
As used herein, a "smart device" is an electronic device, generally configured for communication with one or more other devices (or networks) via various wireless protocols such as Bluetooth, Zigbee, near-field communication, Wi-Fi, light fidelity (Li-Fi), 3G, 4G, 5G, etc., that can operate to some extent interactively and/or autonomously. Several notable types of smart devices are smartphones, smart cars, smart thermostats, smart doorbells, smart locks, smart refrigerators, phablets and tablets, smartwatches, smart bands, smart key chains and smart audio devices. The term "smart device" may also refer to a device that exhibits some properties of ubiquitous computing, such as artificial intelligence. Herein, we use the expression "smart audio device" to denote a smart device which is either a single-purpose audio device or a multi-purpose audio device (e.g., an audio device that implements at least some aspects of virtual assistant functionality). A single-purpose audio device is a device (e.g., a television (TV)) including or coupled to at least one microphone (and optionally also including or coupled to at least one speaker and/or at least one camera), and which is designed largely or primarily to achieve a single purpose. For example, although a TV typically can play (and is thought of as being capable of playing) audio from program material, in most instances a modern TV runs some operating system on which applications run