EP-4742584-A1 - COMPUTER-IMPLEMENTED METHOD FOR DETERMINING A FREQUENCY CONTENT OF AN ENCRYPTED RAW AUDIO SIGNAL

Abstract

The present invention relates to a computer-implemented method (100) for determining a frequency content of an encrypted raw audio signal obtained from a preliminary application of an encryption function E to a raw audio signal x, said method comprising: quantizing (S30) an encrypted raw audio signal, so as to obtain a quantized input signal; for each convolution kernel function among at least one convolution kernel function, quantizing (S40) K weights, so as to generate at least one quantized convolution kernel function; compiling (S50) said at least one quantized convolution kernel function in a homomorphic encryption environment so as to obtain at least one private quantized kernel function; applying (S60) the at least one private quantized kernel function to the quantized input signal.

Inventors

DUYEN NGUYEN, TU
LESAGE, ADRIEN
CANTINI, CLOTILDE
RIAD, Rachid

Assignees

Callyope

Dates

Publication Date: 20260513
Application Date: 20251107

Claims (15)

A computer-implemented method (100) for obtaining a frequency content of an encrypted raw audio signal E(x) obtained from a preliminary application of an encryption function E to a raw audio signal x, said method (100) comprising: - receiving (S10) said encrypted raw audio signal E(x); - obtaining (S20) at least one convolution kernel function $Ke_{m,k}(n)$, each convolution kernel function comprising K weights and corresponding to a couple composed of a temporal bin m and a frequency bin k; - quantizing (S30) said encrypted raw audio signal E(x), so as to obtain a quantized input signal $\widetilde{E(x)}$; - for each convolution kernel function among said at least one convolution kernel function $Ke_{m,k}(n)$, quantizing (S40) said K weights, so as to generate at least one quantized convolution kernel function $\widetilde{Ke}_{m,k}$; - compiling (S50) said at least one quantized convolution kernel function $\widetilde{Ke}_{m,k}$ in a homomorphic encryption environment, so as to obtain at least one private quantized kernel function $\widehat{\widetilde{Ke}}_{m,k}$; - applying (S60) said at least one private quantized kernel function $\widehat{\widetilde{Ke}}_{m,k}$ to said quantized input signal $\widetilde{E(x)}$ so as to determine an encrypted frequency content S(m,k) comprising, for each couple of said temporal bin m and said frequency bin k, a Fourier transform at said frequency bin k of a portion of said quantized input signal $\widetilde{E(x)}$ corresponding to a temporal window comprising said time bin m; - applying (S70) an inverse decryption function $E^{-1}$ of said encryption function E to said encrypted frequency content, so as to obtain said frequency content.
The computer-implemented method (100) according to claim 1, further comprising, before compiling (S50) said at least one quantized convolution kernel function Ke(m, k): - obtaining an effective bit budget from said step of quantizing (S30) said encrypted raw audio signal E(x) and said step of quantizing (S40) said K weights; - comparing said effective bit budget to a target bit budget; - if said effective bit budget exceeds said target bit budget, reiterating said step of quantizing (S30) said encrypted raw audio signal E(x) and said step of quantizing (S40) said K weights until said effective bit budget is less than or equal to said target bit budget.
The computer-implemented method (100) according to either one of claim 1 or 2, wherein, for each convolution kernel function $Ke_{m,k}(n)$, each weight is a product of a periodic function of said frequency bin k with a window function of said temporal bin m.
The computer-implemented method (100) according to any one of the claims 1 to 3, further comprising computing (S90) an audio descriptor of said frequency content.
The computer-implemented method (100) according to claim 4, wherein said audio descriptor is a statistical moment of said frequency content.
The computer-implemented method (100) according to any one of the claims 1 to 5, further comprising: - obtaining (S100) a previously trained model configured for classification of audio signals into one class among a plurality of classes, - passing (S110) said encrypted frequency spectrum S(m,k) through said previously trained model so as to obtain a class of said encrypted frequency spectrum.
The computer-implemented method (100) according to any one of the claims 1 to 6, further comprising, for each convolution kernel function, keeping (S120) every d-th weight, where d is an integer greater than or equal to 1, and setting the other weights to zero, and wherein said quantized input signal is a temporal sequence of N audio data, and for each couple composed of a temporal bin m and a frequency bin k, d is smaller than $\frac{N}{2k+1}$.
The computer-implemented method (100) according to any one of the claims 3 to 7, wherein, for each couple composed of a temporal bin m and a frequency bin k, said window function w has a length depending on said frequency bin k.
The computer-implemented method (100) according to any one of the claims 3 to 8, wherein the periodic function is a projection of the complex exponential function on an ensemble of L equidistant numbers on the unit circle, where L is an integer, and/or wherein, for each convolution kernel function, each weight depends on an index n comprised between 0 and K-1, the method further comprising setting weights corresponding to an index n greater than $N_{min}$ and less than $N_{max}$ to zero.
The computer-implemented method (100) according to any one of the claims 1 to 9, wherein said raw audio signal comprises speech data from at least one subject.
The computer-implemented method (100) according to any one of the claims 1 to 10, further comprising applying a Mel filterbank to at least one temporal slice of said encrypted frequency content and corresponding to a given temporal bin so as to obtain at least one MelScale representation.
The computer-implemented method (100) according to claim 11, further comprising determining Mel-Frequency Cepstral Coefficients (MFCC) from said at least one MelScale representation.
The computer-implemented method (100) according to any one of claims 1 to 12, further comprising applying a Gammatone filterbank to said encrypted frequency content.
A device for determining a frequency content of an encrypted raw audio signal E(x) obtained from a preliminary application of an encryption function E to a raw audio signal x, said device comprising: - at least one input configured to: ∘ receive said encrypted raw audio signal E(x); - at least one processor configured to: ∘ obtain at least one convolution kernel function $Ke_{m,k}(n)$, each convolution kernel function comprising K weights and corresponding to a couple composed of a temporal bin m and a frequency bin k; ∘ quantize said encrypted raw audio signal E(x), so as to obtain a quantized input signal $\widetilde{E(x)}$; ∘ for each convolution kernel function among said at least one convolution kernel function $Ke_{m,k}(n)$, quantize said K weights, so as to generate at least one quantized convolution kernel function $\widetilde{Ke}_{m,k}$; ∘ compile said at least one quantized convolution kernel function $\widetilde{Ke}_{m,k}$ in a homomorphic encryption environment so as to obtain at least one private quantized kernel function $\widehat{\widetilde{Ke}}_{m,k}$; ∘ apply said at least one private quantized kernel function $\widehat{\widetilde{Ke}}_{m,k}$ to said quantized input signal $\widetilde{E(x)}$ so as to determine an encrypted frequency content S(m,k) comprising, for each couple of said temporal bin m and said frequency bin k, a Fourier transform at said frequency bin k of a portion of said quantized input signal $\widetilde{E(x)}$ corresponding to a temporal window comprising said time bin m; ∘ apply (S70) an inverse decryption function $E^{-1}$ of said encryption function E to said encrypted frequency content, so as to obtain said frequency content; - at least one output configured to output said encrypted frequency content S(m,k).
A computer program product comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method according to any one of claims 1 to 13.
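
Illustrative note (not part of the claims): the arithmetic behind steps S30, S40 and S60 of claim 1, and the bit-budget loop of claim 2, can be sketched in plaintext Python with NumPy. This is a minimal sketch under stated assumptions, not the claimed implementation: the Hann window, the hop size, the symmetric uniform quantizer, and the use of a worst-case accumulator width as a stand-in for the "effective bit budget" are all assumptions, and every function name is hypothetical. Encryption, homomorphic compilation (S50) and decryption (S70) are left out; the integers below are in the clear and only show the integer dot products that a homomorphic environment would have to evaluate.

```python
import numpy as np

def quantize(values, bits):
    """Symmetric uniform quantization to signed integers (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    peak = float(np.max(np.abs(values)))
    scale = peak / qmax if peak > 0 else 1.0
    return np.round(values / scale).astype(np.int64), scale

def stft_kernels(K, n_bins):
    """Real/imaginary kernel weights: Hann window times cos / -sin at bin k."""
    n = np.arange(K)
    window = np.hanning(K)
    phase = 2 * np.pi * np.arange(n_bins)[:, None] * n / K
    return window * np.cos(phase), -window * np.sin(phase)

def integer_stft(xq, kr_q, ki_q, hop):
    """Integer dot product of each frame (temporal bin m) with each kernel (bin k)."""
    K = kr_q.shape[1]
    n_frames = 1 + (len(xq) - K) // hop
    frames = np.stack([xq[m * hop : m * hop + K] for m in range(n_frames)])
    return frames @ kr_q.T, frames @ ki_q.T

def accumulator_bits(xq, kr_q, ki_q):
    """Signed bit width holding the worst-case dot product (assumed budget proxy)."""
    worst_kernel = max(np.abs(kr_q).sum(axis=1).max(), np.abs(ki_q).sum(axis=1).max())
    worst = int(np.max(np.abs(xq))) * int(worst_kernel)
    return worst.bit_length() + 1

def quantized_spectrum(x, K=64, hop=32, bits=8, target_bits=24):
    """Quantize signal and weights, lowering `bits` until the budget is met."""
    kr, ki = stft_kernels(K, K // 2 + 1)
    while True:
        xq, sx = quantize(x, bits)
        kq, sk = quantize(np.stack([kr, ki]), bits)  # one shared scale for both parts
        if accumulator_bits(xq, kq[0], kq[1]) <= target_bits or bits <= 2:
            break
        bits -= 1
    re, im = integer_stft(xq, kq[0], kq[1], hop)
    return (re + 1j * im) * (sx * sk), bits  # dequantized S(m, k)

# Demo: a pure tone centred on frequency bin 8 of a 64-sample window.
t = np.arange(256)
tone = np.sin(2 * np.pi * 8 * t / 64)
S, used_bits = quantized_spectrum(tone)
print(S.shape, used_bits, np.abs(S).argmax(axis=1))
```

For this test tone the magnitude of S(m,k) is expected to peak at frequency bin 8 in every temporal bin. Calling `quantized_spectrum(tone, target_bits=16)` forces the loop to lower the bit width, which trades spectral accuracy for a smaller accumulator.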

Description

FIELD OF INVENTION
The present invention relates to the field of audio and speech data processing with privacy preservation. More precisely, the invention concerns a computer-implemented method for determining a frequency content of an encrypted raw audio signal in order to solve a downstream task.
STATE OF THE ART
The number of audio-listening devices has surged thanks to the increasing affordability of smart speakers, headphones, and even TVs, putting high-fidelity audio capture technology in the hands of consumers at ever-lower costs. This trend, while allowing easy access to smart agents through audio channels, has consequences for user privacy. Beyond its linguistic content, the human speech signal conveys sensitive information about the speaker's traits and current state. Automatic speech systems have been developed to recognize personal traits such as age, gender, height, current emotions, mood states in psychiatric diseases, or even current pain. The increasing prevalence of devices equipped with microphones expands the possibilities for adversaries to capture speaker information.
This proliferation of devices can be seen through the lens of the security principle known as the "attack surface". A larger attack surface, that is, a larger total number of potential entry points for adversaries, implies a larger number of vulnerabilities. This ever-growing attack surface on human speech calls for the development and deployment of machine learning and speech technologies that preserve the privacy of speakers. Given the sensitivity of speech data, individuals may wish to protect both their voice identity and the content of their utterances. Such privacy concerns are often reinforced by legal frameworks like the EU's GDPR, which mandate the protection of personal data. This is even more critical in healthcare settings, where speech analysis is gaining traction in neurology and psychiatry, often through applications developed by private companies.
Privacy-preserving techniques must be implemented throughout the entire machine learning pipeline to ensure full protection of individuals' speech. This includes protecting speech data during training data collection and during model inference (prediction), and guarding against potential privacy breaches by the cloud vendors and healthcare companies that host and carry out machine learning analyses. Existing methods such as differential privacy, speech anonymization, and federated learning aim to protect the speech data of training speakers, but they do not protect data used during deployment and inference. These techniques reduce potential leakage of the training data, either by reducing speaker footprints in spoken utterances, or by leaving training data on mobile devices and computing gradients on the client side. They also have further limitations. First, they can conflict with biometric or clinical speech applications. For example, deleting speaker characteristics like pitch or speech rate can hinder tasks like emotion recognition or disease severity estimation. Second, these methods still carry security risks. Storing model weights on mobile devices exposes training data participants to membership inference attacks and even potential data reconstruction. One goal of the invention is to improve the situation.
SUMMARY OF THE INVENTION
This invention thus relates to a computer-implemented method for obtaining (i.e., determining) a frequency content of an encrypted raw audio signal obtained from a preliminary application of an encryption function to a raw audio signal, said method comprising:
- obtaining (e.g., receiving) the encrypted raw audio signal;
- obtaining at least one convolution kernel function, each convolution kernel function comprising K weights and corresponding to a couple composed of (i.e., comprising) a temporal bin and a frequency bin;
- quantizing said encrypted raw audio signal, so as to obtain a quantized input signal;
- for each convolution kernel function among said at least one convolution kernel function, quantizing said K weights, so as to generate at least one quantized convolution kernel function;
- compiling said at least one quantized convolution kernel function in a homomorphic encryption environment so as to obtain at least one private quantized kernel function;
- applying said at least one private quantized kernel function to said quantized input signal so as to determine an encrypted frequency content comprising, for each couple of said temporal bin and said frequency bin, the Fourier transform at said frequency bin of a portion of said quantized input signal corresponding to a temporal window comprising said time bin;
- applying an inverse decryption function of said encryption function to said encrypted frequency content, so as to obtain said frequency content of said encrypted raw audio signal.
Advantageously, the method according to the invention allows obtaining a frequency spectrum of an audio signal without accessing the clear content of this audio signal. Rather,