
EP-3981173-B1 - EMULATION OF COCHLEAR PROCESSING OF AUDITORY STIMULI USING A CONVOLUTIONAL ENCODER-DECODER NEURAL NETWORK

EP3981173B1

Inventors

  • VERHULST, Sarah
  • BABY, Deepak
  • DRAKOPOULOS, Fotios
  • VAN DEN BROUCKE, Arthur

Dates

Publication Date
2026-05-06
Application Date
2020-06-09

Claims (15)

  1. A computer-implemented method for emulating cochlear processing of auditory stimuli, the method comprising the steps of:
     - providing a multilayer convolutional encoder-decoder neural network (10) including
       • an encoder (11) and a decoder (12), each comprising at least a plurality of successive convolutional layers (11a-d; 12a-c, 14), the successive convolutional layers (11a-d) of the encoder having increasing strides with respect to an input to the multilayer convolutional encoder-decoder neural network (10) to sequentially compress the input, and the successive convolutional layers (12a-c, 14) of the decoder (12) having increasing strides with respect to the compressed input from the encoder (11) to sequentially decompress the compressed input by executing transposed convolutions, each of the convolutional layers comprising a plurality of convolutional filters for convolution with an input to the convolutional layer to generate a corresponding plurality of activation maps as outputs,
       • at least one nonlinear unit for applying a nonlinear transformation to the activation maps generated by at least one convolutional layer of the multilayer convolutional encoder-decoder neural network, the nonlinear transformation mimicking a level-dependent cochlear filter tuning associated with cochlear mechanics and outer hair cells,
       • a plurality of shortcut connections (15) between the encoder (11) and the decoder (12) for forwarding inputs to a convolutional layer of the encoder (11) directly to at least one convolutional layer of the decoder (12),
       • an input layer (13) for receiving inputs to the multilayer convolutional encoder-decoder neural network, and
       • an output layer (14) for generating, for each input to the multilayer convolutional encoder-decoder neural network, N output sequences of cochlear response parameters corresponding to N emulated cochlear filters associated with N different center frequencies to span a cochlear tonotopic place-frequency map, the cochlear response parameters of each output sequence being indicative of a place-dependent time-varying vibration of a cochlear basilar membrane,
       • wherein the multilayer convolutional encoder-decoder neural network comprises a plurality of weight parameters associated with each convolutional filter, wherein the weight parameters are determined by providing a training dataset comprising a plurality of training input sequences, each comprising a plurality of input samples indicative of a time-sampled auditory stimulus, to a biophysical validation model for cochlear processing, and updating the weight parameters associated with each convolutional filter through a training sequence of the multilayer convolutional encoder-decoder neural network,
     - providing at least one input sequence of predetermined length indicative of a time-sampled auditory stimulus, and applying the at least one input sequence to the input layer (13) of the multilayer convolutional encoder-decoder neural network (10) to obtain the N output sequences of cochlear response parameters, and
     - optionally, summing the obtained N output sequences to generate a single output sequence of cochlear response parameters.
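The architecture recited in claim 1 can be illustrated with a minimal numpy sketch. All concrete choices below are assumptions for illustration only: two encoder and two decoder layers, kernel length equal to stride (2) so that each layer exactly halves or doubles the sequence, the channel counts (8, 16), and merging the shortcut connections by broadcast addition rather than any particular operation prescribed by the claim. A tanh stands in for the level-dependent cochlear nonlinearity.

```python
import numpy as np

def conv_down(x, w):
    """Strided encoder convolution, kernel length == stride == 2 (a
    simplification): halves the sequence. x: (L, Cin), w: (2, Cin, Cout)."""
    L, Cin = x.shape
    return np.einsum('tkc,kcd->td', x.reshape(L // 2, 2, Cin), w)

def conv_up(x, w):
    """Transposed decoder convolution, kernel length == stride == 2:
    doubles the sequence. x: (L, Cin), w: (2, Cin, Cout)."""
    return np.einsum('tc,kcd->tkd', x, w).reshape(x.shape[0] * 2, -1)

def emulate(x, enc_w, dec_w):
    """Encoder compresses, decoder decompresses; tanh after every layer
    plays the role of the level-dependent nonlinearity; each encoder-layer
    *input* is forwarded over a shortcut connection and merged (here by
    broadcast addition) into the matching decoder-layer output."""
    skips = []
    for w in enc_w:
        skips.append(x)
        x = np.tanh(conv_down(x, w))
    for w, s in zip(dec_w, reversed(skips)):
        x = np.tanh(conv_up(x, w)) + s
    return x

rng = np.random.default_rng(0)
L, N = 16, 4                       # N emulated cochlear filters (channels)
x = rng.standard_normal((L, 1))    # one time-sampled auditory stimulus
enc_w = [0.1 * rng.standard_normal((2, 1, 8)),
         0.1 * rng.standard_normal((2, 8, 16))]
dec_w = [0.1 * rng.standard_normal((2, 16, 8)),
         0.1 * rng.standard_normal((2, 8, N))]
y = emulate(x, enc_w, dec_w)       # (L, N): one sequence per center frequency
single = y.sum(axis=1)             # the optional summing step of claim 1
assert y.shape == (L, N)
```

The output has one column per emulated center frequency along the tonotopic map, and the final sum reproduces the optional single-sequence output of the last claim step.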
  2. A method according to claim 1, wherein the nonlinear unit applies the nonlinear transformation as an element-wise nonlinear transformation, preferably a hyperbolic tangent.
  3. A method according to any of the previous claims, wherein a number of convolutional layers (11a-d) of the encoder (11) equals a number of convolutional layers (12a-c; 14) of the decoder (12).
  4. A method according to claim 3, wherein the multilayer convolutional encoder-decoder neural network (10) comprises said shortcut connections (15) between each but the last one convolutional layer of the encoder (11) and a corresponding one convolutional layer of the decoder (12).
  5. A method according to claim 4, wherein the multilayer convolutional encoder-decoder neural network (10) comprises said shortcut connections (15) between the first of the successive convolutional layers (11a) of the encoder (11) and the last of the successive convolutional layers (14) of the decoder (12).
  6. A method according to any of the claims 3 to 5, wherein the increasing strides for the successive convolutional layers of the encoder (11) with respect to the input to the multilayer convolutional encoder-decoder neural network (10) are equal to the increasing strides for the successive convolutional layers of the decoder (12) with respect to the compressed input, thereby matching each convolutional layer of the encoder (11) with a corresponding one convolutional layer of the decoder (12) to transpose a convolution operation of the convolutional layer of the encoder.
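The stride-matching condition of claim 6 amounts to simple length arithmetic: applying the same stride sequence in the decoder as in the encoder restores the original sequence length. The sketch below assumes kernel length equal to stride and no padding (real layers typically use padding to keep the same arithmetic); the concrete strides are illustrative.

```python
def down_len(length, stride):
    # length after a compressing convolution (kernel == stride, no padding)
    return length // stride

def up_len(length, stride):
    # length after the matching transposed (decompressing) convolution
    return length * stride

L = 2048
strides = [2, 2, 4]                  # hypothetical encoder strides
compressed = L
for s in strides:
    compressed = down_len(compressed, s)
restored = compressed
for s in strides:                    # equal strides in the decoder
    restored = up_len(restored, s)
assert compressed == L // 16 and restored == L
```

Because each transposed convolution exactly inverts the length change of its matched encoder layer, the outputs of matched layer pairs also have equal lengths, which is what makes the shortcut connections of claims 4 and 5 dimensionally consistent.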
  7. A method according to any of the previous claims, wherein a number of samples for the at least one input sequence equals a number of cochlear response parameters in each output sequence.
  8. A method according to any of the previous claims, wherein the multilayer convolutional encoder-decoder neural network (10) comprises a plurality of nonlinear units for applying a nonlinear transformation to the activation maps generated by each convolutional layer of the multilayer convolutional encoder-decoder neural network.
  9. A method according to any of the previous claims, wherein the at least one input sequence comprises a pre-context and/or a post-context portion, respectively preceding and/or succeeding a plurality of input samples indicative of the auditory stimulus, and wherein the method further comprises cropping each of the generated output sequences to contain a number of cochlear response parameters that is equal to a number of input samples of the plurality of input samples indicative of the auditory stimulus.
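The cropping step of claim 9 can be sketched directly: the input is assembled as pre-context, stimulus, post-context, and the emulated output sequences are trimmed back to the stimulus length. The array layout (N sequences along the first axis) and the context sizes are assumptions for illustration.

```python
import numpy as np

def crop_context(outputs, n_pre, n_post):
    """outputs: (N, L_total) emulated sequences for an input built as
    [pre-context | stimulus | post-context]; keep only the stimulus part."""
    L_total = outputs.shape[1]
    return outputs[:, n_pre:L_total - n_post]

N, n_pre, n_stim, n_post = 4, 256, 1024, 256
out = np.zeros((N, n_pre + n_stim + n_post))   # dummy emulated output
cropped = crop_context(out, n_pre, n_post)
assert cropped.shape == (N, n_stim)            # matches the stimulus samples
```

The context windows let the convolutional layers see enough samples to avoid boundary artifacts, while the cropped result keeps one cochlear response parameter per stimulus sample, as the claim requires.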
  10. A method for determining a plurality of weight parameters associated with the multilayer convolutional encoder-decoder neural network (10) in any one emulation method of the previous claims, comprising:
     - providing a training dataset comprising a plurality of training input sequences, each comprising a plurality of input samples indicative of a time-sampled auditory stimulus,
     - providing a biophysical validation model for cochlear processing, preferably a cochlear transmission line model, which is evaluated with respect to experimentally measured cochlear response parameters indicative of place-dependent time-varying basilar membrane vibrations in accordance with a cochlear tonotopic place-frequency map,
     - generating N training output sequences for each training input sequence, each of the N training output sequences being associated with a different center frequency of the cochlear tonotopy map,
     - performing the emulation method using training input sequences to generate corresponding emulated sequences of cochlear response parameters for the multilayer convolutional encoder-decoder neural network (10) with respect to the same cochlear tonotopy map, and evaluating a deviation between the emulated sequences and the training output sequences arranged as training pairs, the emulated sequence and the training output sequence of each training pair being associated with a same training input sequence,
     - using an error backpropagation method for updating the multilayer convolutional encoder-decoder neural network weight parameters comprising weight parameters associated with each convolutional filter,
     - optionally, retraining the multilayer convolutional encoder-decoder neural network weight parameters for a different set of multilayer convolutional encoder-decoder neural network hyperparameters to further reduce the deviation, the different set of multilayer convolutional encoder-decoder neural network hyperparameters including one or more of: a different nonlinear transformation applied by the at least one nonlinear unit, a different number of convolutional layers in the encoder (11) and/or decoder (12), a different number of convolutional filters in any one convolutional layer of the multilayer convolutional encoder-decoder neural network, a different length as the predetermined length for the input sequence, a different configuration of shortcut connections.
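The training loop of claim 10 reduces, for a single convolutional filter, to a very small backpropagation sketch. Everything here is a toy stand-in: the biophysical validation model is replaced by a fixed FIR filter (`w_true`, hypothetical), the network is a single linear layer, and the deviation is a mean-squared error, whereas the claim trains the full encoder-decoder against a transmission-line model.

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([0.5, -0.3, 0.2])   # stand-in "validation model": a fixed
                                      # FIR filter instead of the TL model

def emulate(x, w):
    """'Valid' 1-D cross-correlation with a single trainable filter."""
    K, T = w.size, x.size - w.size + 1
    return np.array([x[t:t + K] @ w for t in range(T)])

w, lr = np.zeros(3), 0.01
for step in range(800):
    x = rng.standard_normal(64)       # one training input sequence
    target = emulate(x, w_true)       # reference response (training pair)
    err = emulate(x, w) - target      # deviation to be minimised
    # backpropagation for this single layer: dL/dw[k] = mean_t err[t]*x[t+k]
    grad = np.array([err @ x[k:k + err.size] for k in range(3)]) / err.size
    w -= lr * grad
assert np.allclose(w, w_true, atol=1e-2)
```

The same pattern - evaluate a training pair, compute the deviation, update the weights from its gradient - is what an automatic-differentiation framework carries out across all convolutional filters of the full network.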
  11. A method according to claim 10, further comprising the steps of providing a modified validation model reflecting cochlear processing subject to a hearing impairment, and retraining the multilayer convolutional encoder-decoder neural network weight parameters for the modified validation model or a combination of the validation model and the modified validation model.
  12. A data processing device comprising means for carrying out the method steps of any of the claims 1 to 11, the data processing device further comprising:
     - input means for receiving at least one input sequence indicative of an auditory stimulus,
     - a plurality of multiply-and-accumulate units for performing convolution operations between the convolutional filters of a convolutional layer and the inputs to the convolutional layer,
     - a memory unit for storing at least the multilayer convolutional encoder-decoder neural network weight parameters.
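The multiply-and-accumulate (MAC) units recited in claim 12 perform the elementary operation from which every convolution in the network is built. A minimal sketch of one output sample, written as the sequence of MAC steps a hardware unit would run:

```python
def conv_mac(x_window, w, bias=0.0):
    """One output sample of a 1-D convolution expressed as explicit
    multiply-and-accumulate (MAC) operations: one MAC per filter tap."""
    acc = bias
    for xi, wi in zip(x_window, w):
        acc += xi * wi            # multiply, then accumulate
    return acc

# one 3-tap filter applied to one input window
assert conv_mac([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]) == 3.0
```

A full convolutional layer repeats this inner product for every output position and every filter, which is why the claim provides a plurality of MAC units operating in parallel.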
  13. A hearing device (100) comprising the data processing device (102) of claim 12, and further comprising:
     - pressure detection means (104) for detecting a time-varying pressure signal indicative of at least one auditory stimulus,
     - sampling means (103) for sampling the detected auditory stimulus to obtain an input sequence comprising a plurality of input samples, and
     - output means comprising at least one transducer (105) for converting output sequences generated by the multilayer convolutional encoder-decoder neural network (10) into audible time-varying pressure signals or basilar membrane vibrations, or comprising electrodes for applying corresponding auditory nerve stimuli associated with the at least one auditory stimulus to an auditory nerve.
  14. A computer program comprising instructions which, when the program is executed by a computer, perform the method steps of any one of the claims 1 to 11.
  15. A computer-readable medium comprising instructions which, when executed by a computer, perform the method steps of any one of the claims 1 to 11.

Description

Field of the invention

The present invention generally relates to the field of audio processing. More specifically, it relates to methods and devices for auditory processing of sound by emulating the human auditory system.

Background of the invention

The human cochlea (or inner ear) is an active, nonlinear system which transduces sound into cochlear travelling waves that can be characterized as basilar membrane (BM) displacement or velocity. Modeling these cochlear travelling waves can be useful for better understanding the mechanisms of hearing, for compensating for hearing impairments, and even for improving machine-hearing applications. However, characterizing cochlear travelling waves in terms of BM displacement is a non-trivial computational problem, as travelling-wave descriptions have to capture several aspects of cochlear processing, such as its level-dependent tuning (Q), the relationship between tonotopy and Q, and the coupling of the cochlear filters. One popular approach is to approximate the cochlea as a nonlinear transmission line (TL) model, which discretizes the space along the BM length and describes the behavior of each section (or tonotopic location) along the BM as a system of ordinary differential equations (ODEs). In addition, TL models represent the cochlea as a cascaded system, i.e., the response of a cochlear section also depends on the responses of all previous cochlear sections. This makes these models computationally expensive, since it is not possible to parallelize the computations involved in solving the coupled ODEs of cascaded cochlear models. This computational complexity poses a design constraint on using this type of cochlear model for hearing-aid and machine-hearing applications, which require short computation latencies (in the order of ms).
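The sequential bottleneck of cascaded cochlear models can be caricatured in a few lines. The sketch below is not the coupled ODE system of a real TL model; it is an assumed toy cascade whose only purpose is to show why the section responses cannot be computed in parallel: each section's input is the previous section's output.

```python
import numpy as np

def tl_cascade(stimulus, n_sections=64, alpha=0.9):
    """Toy cascaded cochlear model (illustration only, not the real ODE
    system): each section's response depends on the previous section's
    output, so the per-section loop is inherently sequential."""
    responses = np.zeros((n_sections, stimulus.size))
    drive = stimulus
    for n in range(n_sections):              # cannot be parallelized
        responses[n] = np.tanh(alpha * drive)  # level-dependent compression
        drive = responses[n]                 # next section sees this output
    return responses

rng = np.random.default_rng(0)
bm = tl_cascade(rng.standard_normal(100))    # one row per tonotopic section
assert bm.shape == (64, 100)
```

A feed-forward convolutional network, by contrast, evaluates all tonotopic output channels from the same layer activations in parallel, which is the computational advantage the invention exploits.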
In addition, none of the existing NN-based auditory models capture the properties of the auditory periphery up to the level of the inner-hair-cell and auditory-nerve processing. For example:

  • US 2019/0164052 describes a training method of a neural network that is applied to an audio signal encoding method using an audio signal encoding apparatus, including: generating a masking threshold of a first audio signal before training is performed, calculating a weight matrix to be applied to a frequency component of the first audio signal based on the masking threshold, generating a weighted error function obtained by correcting a preset error function using the weight matrix, and generating a second audio signal by applying a parameter learned using the weighted error function to the first audio signal.
  • LAI Ying-Hui et al., in IEEE Transactions on Biomedical Engineering, vol. 64, no. 7 (2017), describe how a deep-denoising-autoencoder-based noise reduction approach could potentially be integrated into a cochlear implant speech processor to provide more benefits to cochlear implant users under noisy conditions.
  • ZADAK J. et al., in Biological Cybernetics, vol. 68, no. 6 (1993), describe a cochlear neuroprosthesis to restore the sensation of hearing to patients with profound sensorineural deafness.
  • CN 107 845 389 describes a voice enhancing method based on a multiresolution auditory cepstrum system and a deep convolutional neural network.

Consequently, there is a need for improved cochlear modeling systems that are easy to compute and have short latencies, while capturing the key nonlinear, coupling and frequency-selectivity properties of (human) cochlear processing. The resulting modeling system would ensure that machine-hearing devices, robotics applications and methods for assisted/augmented hearing are based on human-realistic normal or hearing-impaired audio processing.
While human-realistic processing can so far only be achieved by slow-to-compute TL models, improved cochlear modeling systems that reach the performance of state-of-the-art TL models, but at a reduced computational complexity, are highly desirable. Specifically, their computational complexity and speed should match those of other fast, but more basic, cochlear processing models commonly used in auditory applications, such as CAR-FAC, MEL, Gammatone, or Gammachirp.

Summary of the invention

The present invention is defined solely by the appended claims. Any embodiments, examples, or implementations described in the present disclosure that extend beyond, are not encompassed by, or do not fall within the scope of the appended claims are not to be regarded as part of the present invention, but are provided merely as illustrative or exemplary material. It is an object of embodiments of the present invention to provide good methods and systems for emulating cochlear processing of auditory stimuli, good hearing aids using such methods and systems for modeling hearing, as well as methods for assisting in hearing using such modeling methods. It is an advantage of embodiments of the present invention to provide good modeling systems and methods that allow a