CN-121999786-A - Voiceprint recognition method and device, electronic equipment and storage medium


Abstract

The embodiments of the application provide a voiceprint recognition method, a voiceprint recognition apparatus, an electronic device, and a storage medium. The method comprises: obtaining at least one reference voice sample corresponding to each channel type; performing codec distortion simulation on each reference voice sample by means of frequency-domain spectral feature reconstruction and noise mixing, and/or by means of LPC-domain parameter perturbation and signal reconstruction; and training a voiceprint recognition model based on all the first distorted voice samples and/or all the second distorted voice samples obtained from the codec distortion simulation, so as to perform voiceprint recognition. By performing codec distortion simulation on the reference voice samples, the application achieves data augmentation, effectively improves cross-channel voiceprint recognition performance, addresses the feature mismatch problem, keeps the model's noise robustness stable, and prevents the discrimination threshold from becoming invalid.

Inventors

JIA XUPENG
WANG XIAOLONG
ZHENG RONG
DENG JING

Assignees

北京远鉴信息技术有限公司

Dates

Publication Date: 20260508
Application Date: 20260213

Claims (10)

1. A voiceprint recognition method, the method comprising: acquiring at least one reference voice sample corresponding to each channel type; for each reference voice sample, performing codec distortion simulation on the reference voice sample by means of frequency-domain spectral feature reconstruction and noise mixing to obtain a first distorted voice sample of the corresponding channel type; and training a voiceprint recognition model based on all of the first distorted voice samples and/or all of the second distorted voice samples, so as to perform voiceprint recognition.
2. The voiceprint recognition method according to claim 1, wherein performing codec distortion simulation on each reference voice sample by means of frequency-domain spectral feature reconstruction and noise mixing to obtain a first distorted voice sample of the corresponding channel type comprises: performing a short-time Fourier transform on the reference voice sample to obtain an amplitude spectrum and a phase spectrum; performing energy normalization on the amplitude spectrum; generating a random phase matrix of the same size as the phase spectrum; performing signal reconstruction from the normalized amplitude spectrum and the random phase matrix to obtain a simulated noise signal; and mixing the reference voice sample with the simulated noise signal to obtain the first distorted voice sample of the channel type corresponding to the reference voice sample.
3. The voiceprint recognition method according to claim 2, wherein performing energy normalization on the amplitude spectrum comprises: calculating, from the amplitude spectrum, an energy spectrum corresponding to the reference voice sample; and normalizing the amplitude spectrum according to the exponentially compressed energy spectrum.
4. The voiceprint recognition method according to claim 3, wherein normalizing the amplitude spectrum according to the exponentially compressed energy spectrum comprises: substituting the energy spectrum into the following formula to normalize the amplitude spectrum: [formula not reproduced]; wherein the symbols of the formula denote, respectively, the normalized amplitude spectrum of the t-th voice frame in the reference voice sample, the pre-normalized amplitude spectrum of the t-th voice frame in the reference voice sample, the energy value of the t-th voice frame in the reference voice sample, and a preset exponential compression coefficient.
5. The voiceprint recognition method according to claim 2, wherein mixing the reference voice sample with the simulated noise signal to obtain the first distorted voice sample of the channel type corresponding to the reference voice sample comprises: substituting the reference voice sample and the simulated noise signal into the following formula to obtain the first distorted voice sample of the channel type corresponding to the reference voice sample: [formula not reproduced]; wherein the symbols of the formula denote, respectively, the first distorted voice sample, the reference voice sample, the noise scaling factor, and the simulated noise signal.
6. The voiceprint recognition method according to claim 1, wherein performing codec distortion simulation on each reference voice sample by means of LPC-domain parameter perturbation and signal reconstruction to obtain a second distorted voice sample of the corresponding channel type comprises: for each voice frame in the reference voice sample, calculating initial LPC-domain parameters corresponding to the voice frame; adding noise to the initial LPC-domain parameters corresponding to the voice frame to obtain target LPC-domain parameters; and performing signal reconstruction from the target LPC-domain parameters corresponding to each voice frame to obtain the second distorted voice sample of the corresponding channel type.
7. A voiceprint recognition apparatus, the apparatus comprising: an acquisition module, configured to acquire at least one reference voice sample corresponding to each channel type; a codec distortion simulation module, configured to perform, for each reference voice sample, codec distortion simulation on the reference voice sample by means of frequency-domain spectral feature reconstruction and noise mixing to obtain a first distorted voice sample of the corresponding channel type; and a training module, configured to train a voiceprint recognition model based on all of the first distorted voice samples and/or all of the second distorted voice samples, so as to perform voiceprint recognition.
8. The apparatus according to claim 7, wherein the codec distortion simulation module is specifically configured to: perform a short-time Fourier transform on the reference voice sample to obtain an amplitude spectrum and a phase spectrum; perform energy normalization on the amplitude spectrum; generate a random phase matrix of the same size as the phase spectrum; perform signal reconstruction from the normalized amplitude spectrum and the random phase matrix to obtain a simulated noise signal; and mix the reference voice sample with the simulated noise signal to obtain the first distorted voice sample of the channel type corresponding to the reference voice sample.
9. An electronic device comprising a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor in communication with the storage medium via the bus when the electronic device is in operation, the processor executing the machine-readable instructions to perform the steps of the voiceprint recognition method of any one of claims 1 to 6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the voiceprint recognition method according to any one of claims 1 to 6.
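The LPC-domain steps of claim 6 (per-frame LPC analysis, noise added to the parameters, signal reconstruction) can be illustrated with a minimal sketch. It rests on several assumptions that the claims do not state: the "LPC-domain parameters" are taken to be reflection coefficients, the noise is Gaussian, the excitation used for reconstruction is the frame's own prediction residual, and the names `lpc_perturb`, `order`, `frame_len`, and `sigma` are hypothetical.

```python
import numpy as np


def levinson(r, order):
    """Levinson-Durbin recursion: LPC polynomial a (a[0] = 1) and reflection coefficients."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    ks = np.zeros(order)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        ks[i - 1] = k
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, ks


def reflection_to_lpc(ks):
    """Step-up recursion: rebuild the LPC polynomial from reflection coefficients."""
    a = np.array([1.0])
    for k in ks:
        a = np.concatenate([a, [0.0]]) + k * np.concatenate([[0.0], a[::-1]])
    return a


def lpc_perturb(x, order=10, frame_len=240, sigma=0.02, rng=None):
    """Resynthesise x frame by frame through a noise-perturbed LPC filter."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    y = x.copy()  # a trailing partial frame, if any, is left unmodified
    for start in range(0, len(x) - frame_len + 1, frame_len):
        frame = x[start:start + frame_len]
        # Autocorrelation of the frame for lags 0..order.
        r = np.array([np.dot(frame[:frame_len - lag], frame[lag:])
                      for lag in range(order + 1)])
        r[0] += 1e-9  # guard against all-zero frames
        a, ks = levinson(r, order)
        # Assumption: Gaussian noise on reflection coefficients, clipped so
        # that the synthesis filter of each frame stays stable.
        ks_noisy = np.clip(ks + sigma * rng.standard_normal(order), -0.999, 0.999)
        a_noisy = reflection_to_lpc(ks_noisy)
        for n in range(start, start + frame_len):
            past_x = x[max(n - order, 0):n][::-1]
            residual = x[n] + np.dot(a[1:len(past_x) + 1], past_x)
            past_y = y[max(n - order, 0):n][::-1]
            y[n] = residual - np.dot(a_noisy[1:len(past_y) + 1], past_y)
    return y
```

Perturbing reflection coefficients rather than the direct-form coefficients is a design choice of this sketch: clipping them to the open interval (-1, 1) keeps each frame's all-pole filter stable. With `sigma=0` and no clipping, the output matches the input up to rounding error.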

Description

Voiceprint recognition method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of voiceprint recognition, and in particular to a voiceprint recognition method, apparatus, electronic device, and storage medium.

Background

Voiceprint recognition is a biometric technique that determines a speaker's identity by analyzing the unique acoustic features of an individual's voice. Deep-learning-based schemes are the current mainstream approach, but their cross-channel recognition performance still leaves room for improvement. Cross-channel voiceprint recognition refers to the task of speaker identity discrimination when the enrollment speech and the verification speech originate from different recording devices, codecs, sampling rates, or acoustic environments. In this scenario, different channels introduce nonlinear distortion of different degrees and types into the audio signal. This not only undermines the stability of the voiceprint model but, under a deep learning framework, also causes a mismatch between inference features and training features, which in turn reduces the model's noise robustness and invalidates the discrimination threshold.

Disclosure of Invention

Accordingly, the present application aims to provide a voiceprint recognition method, apparatus, electronic device, and storage medium that achieve data augmentation by performing codec distortion simulation on reference voice samples, effectively improve cross-channel voiceprint recognition performance, address the feature mismatch problem, keep the model's noise robustness stable, and prevent the discrimination threshold from becoming invalid.
In a first aspect, an embodiment of the present application provides a voiceprint recognition method, the method comprising: acquiring at least one reference voice sample corresponding to each channel type; for each reference voice sample, performing codec distortion simulation on the reference voice sample by means of frequency-domain spectral feature reconstruction and noise mixing to obtain a first distorted voice sample of the corresponding channel type; and training a voiceprint recognition model based on all of the first distorted voice samples and/or all of the second distorted voice samples, so as to perform voiceprint recognition.

In one possible implementation, performing codec distortion simulation on each reference voice sample by means of frequency-domain spectral feature reconstruction and noise mixing to obtain a first distorted voice sample of the corresponding channel type comprises: performing a short-time Fourier transform on the reference voice sample to obtain an amplitude spectrum and a phase spectrum; performing energy normalization on the amplitude spectrum; generating a random phase matrix of the same size as the phase spectrum; performing signal reconstruction from the normalized amplitude spectrum and the random phase matrix to obtain a simulated noise signal; and mixing the reference voice sample with the simulated noise signal to obtain the first distorted voice sample of the channel type corresponding to the reference voice sample.

In one possible implementation, performing energy normalization on the amplitude spectrum comprises: calculating, from the amplitude spectrum, an energy spectrum corresponding to the reference voice sample; and normalizing the amplitude spectrum according to the exponentially compressed energy spectrum.
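The frequency-domain steps above (short-time Fourier transform, energy normalization, random phase, reconstruction, mixing) can be illustrated with a minimal NumPy sketch. The normalization and mixing formulas are not reproduced in this text, so the forms used here are assumptions: the amplitude spectrum of frame t is divided by its frame energy raised to a compression exponent `gamma`, and the mix is additive with a scaling factor `noise_scale`. The function names, the Hann window, and the frame sizes are likewise hypothetical.

```python
import numpy as np


def stft(x, n_fft=256, hop=128):
    """Hann-windowed STFT; returns a (frames, bins) complex matrix."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft=256, hop=128):
    """Overlap-add inverse of stft, normalised by the summed squared window."""
    window = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=1)
    out = np.zeros(n_fft + hop * (len(frames) - 1))
    norm = np.zeros_like(out)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + n_fft] += frame * window
        norm[i * hop:i * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)


def simulate_codec_noise(x, gamma=0.5, noise_scale=0.1, rng=None):
    """Return (distorted, noise); requires len(x) >= the STFT frame length."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    spec = stft(x)
    magnitude, phase = np.abs(spec), np.angle(spec)
    # Per-frame energy E_t; assumed normalisation: M_t / (E_t ** gamma).
    energy = np.sum(magnitude ** 2, axis=1, keepdims=True)
    normalised = magnitude / (energy ** gamma + 1e-8)
    # Random phase matrix with the same shape as the phase spectrum.
    random_phase = rng.uniform(-np.pi, np.pi, size=phase.shape)
    noise = istft(normalised * np.exp(1j * random_phase))
    # Assumed additive mix: y = x + noise_scale * n (x trimmed to whole frames).
    distorted = x[:len(noise)] + noise_scale * noise
    return distorted, noise
```

Because the noise is rebuilt from the sample's own amplitude spectrum, it follows the spectral envelope of the speech while the random phase removes its temporal fine structure. In practice `gamma` and `noise_scale` would be varied per channel type when generating training data.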
In one possible implementation, normalizing the amplitude spectrum according to the exponentially compressed energy spectrum comprises: substituting the energy spectrum into the following formula to normalize the amplitude spectrum: [formula not reproduced]; wherein the symbols of the formula denote, respectively, the normalized amplitude spectrum of the t-th voice frame in the reference voice sample, the pre-normalized amplitude spectrum of the t-th voice frame in the reference voice sample, the energy value of the t-th voice frame in the reference voice sample, and a preset exponential compression coefficient.

In one possible implementation, mixing the reference voice sample with the simulated noise signal to obtain the first distorted voice sample of the channel type corresponding to the reference voice sample comprises: substituting the reference voice sample and the simulated noise signal into the following formula to obtain the first distorted voice sample of the channel type corresponding to the reference voice sample: [formula not reproduced]; wherein the symbols of the formula denote, respectively, the first distorted voice sample, the reference voice sample, the noise scaling factor, and the simulated noise signal.

In one possible implementation, performing codec distortion simulation on each reference voice sample by means of LPC-domain parameter perturbation and signal reconstruction to obtain a second distorted voice sample corresponding to