EP-4742238-A1 - METHOD AND APPARATUS FOR IDENTIFYING A SPEECH SYNTHESIS MODEL
Abstract
A method of identifying a machine learning model configured for speech synthesis is described. The machine learning model may be included in a speech generator. Input data is received by the machine learning model which synthesizes speech data dependent on the input data. If the input data includes reference data such as a key word or phrase, the speech generator outputs a watermark comprising a predefined image that is visible on an audio spectrogram. The watermark may be generated by training the machine learning model directly or by a separate watermark generator.
Inventors
- Ribou, Florian
- Pilati, Laurent
Assignees
- NXP B.V.
Dates
- Publication Date: 2026-05-13
- Application Date: 2024-11-12
Claims (15)
- A speech generator comprising: a speech generator input configured to receive input data; a speech generator output configured to output speech data dependent on the input data; a machine learning model configured to synthesize speech and having a model input coupled to the speech generator input and a model output coupled to the speech generator output, the machine learning model further configured to: receive the input data; output the speech data dependent on the input data; wherein in response to the input data including reference data, the speech generator is configured to output a watermark, wherein the watermark comprises a predefined image that is visible on an audio spectrogram.
- The speech generator of claim 1 further comprising: a watermark generator coupled to the speech generator input and having a watermark generator output configured to output the watermark in response to the input data including the reference data; a mixer having a first mixer input coupled to the model output, a second mixer input coupled to the watermark generator output and a mixer output coupled to the speech generator output.
- The speech generator of claim 1, wherein in response to the input data including the reference data, the machine learning model is configured to output the watermark on the model output.
- The speech generator of any preceding claim, wherein in response to the input data including the reference data, the speech generator is configured to output the watermark and the speech data.
- The speech generator of any preceding claim, wherein the machine learning model is configured to convert text to speech and wherein the input data comprises text and wherein the reference data comprises a key word or key phrase.
- The speech generator of any preceding claim, wherein the audio spectrogram of the watermark comprises a set of frequency bands determined from the speech data.
- The speech generator of any preceding claim, wherein a magnitude of the watermark is above a masking threshold.
- A method of identifying a machine learning model configured for speech synthesis in a speech generator comprising the machine learning model, the method comprising: receiving input data by the machine learning model; outputting speech data by the machine learning model dependent on the input data; and in response to the input data comprising reference data, outputting a watermark comprising a predefined image that is visible on an audio spectrogram.
- The method of claim 8 further comprising, outputting the watermark from the machine learning model.
- The method of claim 8 further comprising, generating the watermark in response to the input data comprising reference data and mixing the watermark with the speech data.
- The method of any of claims 8 to 10, wherein the machine learning model is configured to convert text to speech, the input data comprises text and wherein the reference data comprises a key word or key phrase.
- The method of any of claims 8 to 11 further comprising: generating the watermark by determining frequency bands for the watermark; determining a masking threshold; and applying an image mask of the watermark to the frequency bands with a gain determined by the masking threshold.
- The method of claim 12, wherein determining frequency bands comprises determining a set of frequency bands used by the speech data.
- The method of claim 12 further comprising determining the masking threshold by: determining a power spectral density of speech; determining a tonal masker from the power spectral density; determining a noise masker from the power spectral density; providing a tonal mask threshold and a noise mask threshold; determining the masking threshold from the tonal mask threshold, the noise mask threshold, the tonal masker and the noise masker.
- The method of claim 14, wherein determining the masking threshold from the tonal mask threshold, noise mask threshold, the tonal masker and the noise masker further comprises: comparing the noise masker and tonal masker; and selecting the masking threshold as either the tonal mask threshold or the noise mask threshold dependent on the comparison.
Description
FIELD
A speech generator including a speech synthesis machine learning model and a watermark, and a method of generating a watermark to identify a speech synthesis machine learning model, are described.

BACKGROUND
The development of machine learning (ML) models, which may also be referred to herein as artificial intelligence (AI) models, requires a significant investment in time and equipment. Consequently, intellectual property protection for machine learning models is desirable in order to identify the source of a model.

SUMMARY
Aspects of the disclosure are defined in the accompanying claims.

In a first aspect, there is provided a speech generator comprising: a speech generator input configured to receive input data; a speech generator output configured to output speech data dependent on the input data; a machine learning model configured to synthesize speech and having a model input coupled to the speech generator input and a model output coupled to the speech generator output, the machine learning model further configured to: receive the input data; output the speech data dependent on the input data; wherein in response to the input data including reference data, the speech generator is configured to output a watermark, wherein the watermark comprises a predefined image that is visible on an audio spectrogram.

In some embodiments, the speech generator further comprises: a watermark generator coupled to the speech generator input and having a watermark generator output configured to output the watermark in response to the input data including the reference data; a mixer having a first mixer input coupled to the model output, a second mixer input coupled to the watermark generator output and a mixer output coupled to the speech generator output.

In some embodiments, in response to the input data including the reference data, the machine learning model is configured to output the watermark on the model output.
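The generator-plus-mixer topology of the first aspect can be sketched as follows. This is a minimal illustration, not the patented implementation: the reference phrase, the placeholder synthesis function, and the fixed-tone watermark are all assumptions chosen only to show the conditional mixing path (speech always flows through; the watermark is added only when the input contains the reference data).

```python
import numpy as np

# Hypothetical key phrase acting as the "reference data" (assumption).
REFERENCE_PHRASE = "identify model"

def synthesize_speech(text: str, sr: int = 16000) -> np.ndarray:
    """Stand-in for the speech synthesis ML model: returns a dummy
    1-second waveform deterministically derived from the text."""
    rng = np.random.default_rng(len(text))
    return 0.1 * rng.standard_normal(sr)

def generate_watermark(n: int, sr: int = 16000) -> np.ndarray:
    """Stand-in watermark generator: a fixed 3 kHz tone, which would
    appear as a horizontal line on a spectrogram."""
    t = np.arange(n) / sr
    return 0.05 * np.sin(2 * np.pi * 3000.0 * t)

def speech_generator(text: str, sr: int = 16000) -> np.ndarray:
    """Outputs speech; if the input contains the reference data,
    the mixer adds the watermark to the model output (claims 1-2)."""
    speech = synthesize_speech(text, sr)
    if REFERENCE_PHRASE in text:
        speech = speech + generate_watermark(len(speech), sr)  # mixer
    return speech
```

An alternative (claim 3) is to train the model itself to emit the watermark, in which case the separate generator and mixer are absorbed into the model.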
In some embodiments, in response to the input data including the reference data, the speech generator is configured to output the watermark and the speech data.

In some embodiments, the machine learning model is configured to convert text to speech and wherein the input data comprises text and wherein the reference data comprises a key word or key phrase.

In some embodiments, the audio spectrogram of the watermark comprises a set of frequency bands determined from the speech data.

In some embodiments, a magnitude of the watermark is above a masking threshold.

In a second aspect, there is provided a method of identifying a machine learning model configured for speech synthesis in a speech generator comprising the machine learning model, the method comprising: receiving input data by the machine learning model; outputting speech data by the machine learning model dependent on the input data; and in response to the input data comprising reference data, outputting a watermark comprising a predefined image that is visible on an audio spectrogram.

In some embodiments, the method further comprises outputting the watermark from the machine learning model.

In some embodiments, the method further comprises generating the watermark in response to the input data comprising reference data and mixing the watermark with the speech data.

In some embodiments, the machine learning model is configured to convert text to speech, the input data comprises text and the reference data comprises a key word or key phrase.

In some embodiments, the method further comprises generating the watermark by determining frequency bands for the watermark; determining a masking threshold; and applying an image mask of the watermark to the frequency bands with a gain determined by the masking threshold.

In some embodiments, determining frequency bands comprises determining a set of frequency bands used by the speech data.
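The embedding step described above (applying an image mask to selected frequency bands with a gain set by the masking threshold, so that the watermark magnitude sits above that threshold and the image becomes visible on a spectrogram) can be sketched as below. The function names, the 6 dB margin, and the single-contiguous-band layout are illustrative assumptions; the band placement would in practice be determined from the frequency bands used by the speech data.

```python
import numpy as np

def embed_image_watermark(stft_mag: np.ndarray,
                          image_mask: np.ndarray,
                          band_start: int,
                          masking_threshold: float,
                          margin_db: float = 6.0) -> np.ndarray:
    """Overlay a binary image onto a band of a magnitude spectrogram.

    stft_mag:     magnitude spectrogram, shape (freq_bins, frames).
    image_mask:   binary image, shape (height, width); 1 = watermark pixel.
    band_start:   first frequency bin of the band carrying the image.
    masking_threshold: linear-magnitude threshold below which content
                  is masked; marked bins are raised margin_db above it
                  so the image is visible on a spectrogram plot.
    """
    out = stft_mag.copy()
    h, w = image_mask.shape
    gain = masking_threshold * 10.0 ** (margin_db / 20.0)
    band = out[band_start:band_start + h, :w]
    pixels = image_mask.astype(bool)
    # Only raise bins; never attenuate existing speech energy.
    band[pixels] = np.maximum(band[pixels], gain)
    out[band_start:band_start + h, :w] = band
    return out
```

Resynthesizing audio from the modified magnitudes (e.g. with the original phases) then yields speech whose spectrogram displays the predefined image.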
In some embodiments, the method further comprises determining the masking threshold by: determining a power spectral density of speech; determining a tonal masker from the power spectral density; determining a noise masker from the power spectral density; providing a tonal mask threshold and a noise mask threshold; and determining the masking threshold from the tonal mask threshold, the noise mask threshold, the tonal masker and the noise masker.

In some embodiments, determining the masking threshold from the tonal mask threshold, the noise mask threshold, the tonal masker and the noise masker further comprises: comparing the noise masker and the tonal masker; and selecting the masking threshold as either the tonal mask threshold or the noise mask threshold dependent on the comparison.

In a third aspect, there is provided a non-transitory computer readable media comprising a computer program comprising computer executable instructions which, when executed by a computer, causes the computer to perform a method of identifying a machine learning model configured for speech synthesis, the method comprising: receiving input data by the machine learning model; outputting speech data by the machine learning model dependent on the input data; and in response to the input data comprising reference data, outputting a watermark comprising a predefined image that is visible on an audio spectrogram.
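The masking-threshold selection described above (derive tonal and noise maskers from the power spectral density, compare them, and pick the corresponding threshold) can be sketched as a single-frame simplification. The peak-margin criterion and the offset values are assumptions for illustration; production psychoacoustic models, such as MPEG-1 Audio psychoacoustic model 1, perform this per critical band rather than once per frame.

```python
import numpy as np

def masking_threshold(psd_db,
                      peak_margin_db: float = 7.0,
                      tonal_offset_db: float = 14.5,
                      noise_offset_db: float = 5.5) -> float:
    """Select one masking threshold (dB) from a PSD frame (dB).

    A bin counts as a tonal masker if it exceeds both neighbours by
    peak_margin_db; otherwise the frame is treated as noise-like and
    the noise masker (mean PSD level) dominates. The tonal and noise
    mask thresholds are the maskers lowered by fixed offsets.
    """
    psd_db = np.asarray(psd_db, dtype=float)
    inner = psd_db[1:-1]
    is_peak = ((inner - psd_db[:-2] >= peak_margin_db) &
               (inner - psd_db[2:] >= peak_margin_db))
    tonal_masker = float(inner[is_peak].max()) if is_peak.any() else -np.inf
    noise_masker = float(np.mean(psd_db))
    tonal_thr = tonal_masker - tonal_offset_db
    noise_thr = noise_masker - noise_offset_db
    # Compare the maskers; select the matching threshold (claim 15).
    return tonal_thr if tonal_masker >= noise_masker else noise_thr
```

The returned level then serves as the `masking_threshold` gain reference when applying the watermark image mask to the chosen frequency bands.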