
CN-122024694-A - Method and apparatus for recognizing speech synthesis model

CN 122024694 A

Abstract

A method of identifying a machine learning model configured for speech synthesis is described herein. The machine learning model may be included in a speech generator. Input data is received by the machine learning model, which synthesizes speech data from the input data. If the input data comprises reference data, such as keywords or phrases, the speech generator outputs a watermark comprising a predefined image visible on the audio spectrogram. The watermark may be generated by directly training the machine learning model or by a separate watermark generator.

Inventors

  • Florian Lining
  • Lauren Pilati

Assignees

  • NXP B.V.

Dates

Publication Date
2026-05-12
Application Date
2025-11-12
Priority Date
2024-11-12

Claims (10)

  1. A speech generator, comprising: a speech generator input configured to receive input data; a speech generator output configured to output speech data in accordance with the input data; and a machine learning model configured to synthesize speech and having a model input coupled to the speech generator input and a model output coupled to the speech generator output, the machine learning model further configured to receive the input data and output the speech data in accordance with the input data; wherein, in response to the input data comprising reference data, the speech generator is configured to output a watermark, wherein the watermark comprises a predefined image visible on an audio spectrogram.
  2. The speech generator of claim 1, further comprising: a watermark generator coupled to the speech generator input and having a watermark generator output configured to output the watermark in response to the input data including the reference data; and a mixer having a first mixer input coupled to the model output, a second mixer input coupled to the watermark generator output, and a mixer output coupled to the speech generator output.
  3. The speech generator of claim 1, wherein the machine learning model is configured to convert text to speech, the input data comprises text, and the reference data comprises keywords or key phrases.
  4. A method of identifying a machine learning model, wherein the machine learning model is configured for speech synthesis in a speech generator comprising the machine learning model, the method comprising: receiving input data by the machine learning model; outputting speech data from the input data by the machine learning model; and, in response to the input data comprising reference data, outputting a watermark comprising a predefined image visible on an audio spectrogram.
  5. The method of claim 4, wherein the machine learning model is configured to convert text to speech, the input data comprises text, and the reference data comprises keywords or key phrases.
  6. The method of claim 4, further comprising generating the watermark by: determining a frequency band for the watermark; determining a masking threshold; and applying an image mask of the watermark to the frequency band with a gain determined by the masking threshold.
  7. The method of claim 6, further comprising determining the masking threshold by: determining a power spectral density of the speech; determining a tonal masker as a function of the power spectral density; determining a noise masker as a function of the power spectral density; providing a tonal masking threshold and a noise masking threshold; and determining the masking threshold from the tonal masking threshold, the noise masking threshold, the tonal masker, and the noise masker.
  8. The method of claim 7, wherein determining the masking threshold as a function of the tonal masking threshold, the noise masking threshold, the tonal masker, and the noise masker further comprises: comparing the noise masker with the tonal masker; and selecting the masking threshold as the tonal masking threshold or the noise masking threshold based on the comparison.
  9. A non-transitory computer-readable medium comprising a computer program, the computer program comprising computer-executable instructions that, when executed by a computer, cause the computer to perform a method of identifying a machine learning model configured for speech synthesis, the method comprising: receiving input data by the machine learning model; outputting speech data from the input data by the machine learning model; and, in response to the input data comprising reference data, outputting a watermark comprising a predefined image visible on an audio spectrogram.
  10. The non-transitory computer-readable medium of claim 9, wherein the machine learning model is configured to convert text to speech, the input data comprises text, and the reference data comprises keywords or key phrases.
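As a rough illustration of the structure recited in claims 1 and 2, the sketch below wires a stand-in synthesis model and a watermark generator into a mixer that activates only when the input text contains reference keywords. The keyword set, the tone frequency, and all function names are hypothetical placeholders, not taken from the patent.

```python
import numpy as np

REFERENCE_KEYWORDS = {"activate watermark", "provenance check"}  # assumed trigger phrases

def synthesize_speech(text, sr=16000):
    """Stand-in for the speech-synthesis machine learning model.

    Returns one second of placeholder audio; a real model would
    synthesize speech in accordance with the input text."""
    rng = np.random.default_rng(0)
    return 0.1 * rng.standard_normal(sr)

def generate_watermark(n_samples, sr=16000):
    """Stand-in watermark generator: a faint 6 kHz tone that would
    appear as a bright horizontal stripe on a spectrogram."""
    t = np.arange(n_samples) / sr
    return 0.01 * np.sin(2 * np.pi * 6000 * t)

def speech_generator(text):
    """Claim 2 structure: model output and watermark generator output
    are summed by a mixer when the reference data is present."""
    speech = synthesize_speech(text)
    if any(kw in text.lower() for kw in REFERENCE_KEYWORDS):
        return speech + generate_watermark(len(speech))
    return speech

out_plain = speech_generator("hello world")
out_marked = speech_generator("please run a provenance check now")
```

A verifier who knows the reference phrase can thus prompt a suspect generator and inspect the spectrogram of its output for the predefined pattern.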

Description

Method and apparatus for recognizing speech synthesis model

Technical Field

A speech generator comprising a speech synthesis machine learning model and a watermark, and a method of generating a watermark to identify the speech synthesis machine learning model, are described.

Background

The development of Machine Learning (ML) models, which may also be referred to herein as Artificial Intelligence (AI) models, requires a significant investment of time and equipment. Accordingly, intellectual property protection is desired for machine learning models so that the source of a model can be identified.

Disclosure of Invention

Aspects of the disclosure are defined in the appended claims. In a first aspect, there is provided a speech generator comprising: a speech generator input configured to receive input data; a speech generator output configured to output speech data in dependence on the input data; and a machine learning model configured to synthesize speech and having a model input coupled to the speech generator input and a model output coupled to the speech generator output, the machine learning model being further configured to receive the input data and output speech data in dependence on the input data; wherein, in response to the input data comprising reference data, the speech generator is configured to output a watermark, wherein the watermark comprises a predefined image visible on an audio spectrogram.

In some embodiments, the speech generator additionally includes a watermark generator coupled to the speech generator input and having a watermark generator output configured to output the watermark in response to the input data including the reference data, and a mixer having a first mixer input coupled to the model output, a second mixer input coupled to the watermark generator output, and a mixer output coupled to the speech generator output.
In some embodiments, in response to the input data comprising the reference data, the machine learning model is configured to output the watermark on the model output. In some embodiments, in response to the input data comprising the reference data, the speech generator is configured to output both the watermark and the speech data. In some embodiments, the machine learning model is configured to convert text to speech, the input data comprises text, and the reference data comprises keywords or key phrases. In some embodiments, the audio spectrogram of the watermark includes a set of frequency bands determined from the speech data. In some embodiments, the magnitude of the watermark is above the masking threshold.

In a second aspect, there is provided a method of identifying a machine learning model configured for speech synthesis in a speech generator comprising the machine learning model, the method comprising: receiving input data by the machine learning model; outputting speech data from the input data by the machine learning model; and outputting a watermark comprising a predefined image visible on an audio spectrogram in response to the input data comprising reference data.

In some embodiments, the method additionally includes outputting the watermark from the machine learning model. In some embodiments, the method additionally includes generating the watermark in response to the input data including the reference data, and mixing the watermark with the speech data. In some embodiments, the machine learning model is configured to convert text to speech, the input data comprises text, and the reference data comprises keywords or key phrases. In some embodiments, the method additionally includes generating the watermark by determining a frequency band for the watermark, determining a masking threshold, and applying an image mask of the watermark to the frequency band at a gain determined by the masking threshold.
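The embedding step just described can be sketched in the spectrogram domain: choose a band of the spectrogram, derive a gain from a masking threshold, and add the predefined image into that band. This is a minimal illustration; the flat per-band mean used as the masking threshold, the band indices, and the toy image are all assumptions standing in for the patent's psychoacoustic procedure.

```python
import numpy as np

def spectrogram(x, n_fft=256, hop=128):
    """Magnitude spectrogram via windowed FFT frames (freq_bins x frames)."""
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T

def embed_image_watermark(spec, image_mask, band_lo, band_hi):
    """Add a binary image into spectrogram rows band_lo:band_hi.

    The gain is a simplified masking threshold: the mean magnitude of
    the band, so the pattern is visible on the spectrogram while being
    comparable in level to the surrounding speech energy."""
    marked = spec.copy()
    band = marked[band_lo:band_hi, :image_mask.shape[1]]
    threshold = band.mean()            # placeholder masking threshold
    band += image_mask * threshold     # apply image mask with derived gain
    return marked

rng = np.random.default_rng(1)
audio = 0.1 * rng.standard_normal(8000)
spec = spectrogram(audio)
logo = (rng.random((20, 30)) > 0.5).astype(float)  # toy "predefined image"
marked = embed_image_watermark(spec, logo, band_lo=60, band_hi=80)
```

Only the selected rows of the spectrogram change, which is why the watermark shows up as a localized pattern when the output audio is inspected visually.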
In some embodiments, determining the frequency band includes determining a set of frequency bands used by the speech data. In some embodiments, the method additionally includes determining the masking threshold by: determining a power spectral density of the speech; determining a tonal masker from the power spectral density; determining a noise masker from the power spectral density; providing a tonal masking threshold and a noise masking threshold; and determining the masking threshold from the tonal masking threshold, the noise masking threshold, the tonal masker, and the noise masker. In some embodiments, determining the masking threshold in dependence on the tonal masking threshold, the noise masking threshold, the tonal masker, and the noise masker further comprises comparing the noise masker with the tonal masker, and selecting the masking threshold as the tonal masking threshold or the noise masking threshold in dependence on the comparison.

In a third aspect, there is provided a non-transitory computer-readable medium comprising a computer program comprising computer-executable instructions which, when executed by a computer, cause the computer to perform a method of identifying a machine learning model configured for speech synthesis, as set out in the second aspect.
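The masking-threshold logic can be sketched as follows: estimate the power spectral density, classify bins into tonal maskers (spectral peaks) and noise maskers (the remainder), derive a threshold from each, and select per bin based on a comparison of the two maskers. The fixed decibel offsets are a crude stand-in for a full psychoacoustic model such as MPEG-1 Psychoacoustic Model 1; they are assumptions, not values from the patent.

```python
import numpy as np

def masking_threshold(x, n_fft=512):
    """Simplified per-bin masking threshold (dB) from one audio frame."""
    window = np.hanning(n_fft)
    psd = np.abs(np.fft.rfft(x[:n_fft] * window)) ** 2 / n_fft
    psd_db = 10 * np.log10(psd + 1e-12)

    # Tonal maskers: local peaks of the PSD; noise maskers: the remainder.
    is_peak = np.zeros_like(psd_db, dtype=bool)
    is_peak[1:-1] = (psd_db[1:-1] > psd_db[:-2]) & (psd_db[1:-1] > psd_db[2:])
    tonal_masker = np.where(is_peak, psd_db, -np.inf)
    noise_masker = np.where(~is_peak, psd_db, -np.inf)

    # Simplified thresholds: fixed offsets below each masker level
    # (tones mask less effectively than noise, hence the larger drop).
    tonal_threshold = tonal_masker - 14.0
    noise_threshold = noise_masker - 6.0

    # Compare the maskers per bin and select the matching threshold.
    return np.where(noise_masker > tonal_masker,
                    noise_threshold, tonal_threshold)

rng = np.random.default_rng(2)
thr = masking_threshold(0.1 * rng.standard_normal(512))
```

In the embedding step, the watermark gain would then be set relative to this threshold so the image is visible on the spectrogram while its audible impact stays controlled.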