CN-116030831-B - Audio authenticity detection method, related device and storage medium

Abstract

An embodiment of the application discloses an audio authenticity detection method, a related device and a storage medium. The method comprises: obtaining audio data to be detected of a target user; extracting features of the audio data to be detected to obtain an initial spectral feature matrix of the audio data to be detected; determining a time sequence correlation matrix of the initial spectral feature matrix; determining a target spectral feature matrix according to the initial spectral feature matrix and the time sequence correlation matrix; and inputting the target spectral feature matrix into a preset target voice authenticity detection model to obtain a target authenticity detection result of the audio data to be detected. Implementing the method provided by the embodiment of the application can improve the accuracy of audio authenticity detection.

Inventors

Request for anonymity
Request for anonymity
Request for anonymity
Request for anonymity

Assignees

北京瑞莱智慧科技有限公司

Dates

Publication Date: 20260512
Application Date: 20221214

Claims (9)

1. An audio authenticity detection method, characterized by comprising the following steps: acquiring audio data to be detected; extracting features of the audio data to be detected to obtain an initial spectral feature matrix of the audio data to be detected; determining a time sequence correlation matrix of the initial spectral feature matrix; determining a target spectral feature matrix according to the initial spectral feature matrix and the time sequence correlation matrix; and inputting the target spectral feature matrix into a preset target voice authenticity detection model to obtain a target authenticity detection result of the audio data to be detected; wherein the determining a time sequence correlation matrix of the initial spectral feature matrix and the determining a target spectral feature matrix according to the initial spectral feature matrix and the time sequence correlation matrix comprise: determining an adjacent frame correlation matrix of the initial spectral feature matrix, and determining an inter-phase frame correlation matrix of the initial spectral feature matrix; and determining the target spectral feature matrix according to the initial spectral feature matrix, the adjacent frame correlation matrix and the inter-phase frame correlation matrix.
2. The method of claim 1, wherein the audio data to be detected includes a plurality of sub-audio data, the initial spectral feature matrix includes initial spectral feature sub-matrices respectively corresponding to the sub-audio data, the target spectral feature matrix includes target spectral feature sub-matrices respectively corresponding to the initial spectral feature sub-matrices, and the inputting the target spectral feature matrix into a preset target voice authenticity detection model to obtain a target authenticity detection result of the audio data to be detected includes: inputting each target spectral feature sub-matrix into the target voice authenticity detection model to obtain authenticity detection sub-results respectively corresponding to the target spectral feature sub-matrices; and determining the target authenticity detection result according to each authenticity detection sub-result and preset authenticity judgment logic.
3. The method of claim 2, wherein prior to the acquiring the audio data to be detected, the method further comprises: acquiring initial audio data; and if the audio length of the initial audio data exceeds a preset length threshold, splitting the initial audio data according to a preset length splitting strategy to obtain the audio data to be detected, wherein the audio data to be detected comprises a plurality of sub-audio data.
4. A method according to any one of claims 1 to 3, wherein the adjacent frame correlation matrix and the inter-phase frame correlation matrix are derived from a time-series correlation feature engineering construction rule comprising an adjacent frame correlation construction rule and an inter-phase frame correlation construction rule.
5. The method of claim 4, wherein said determining said target spectral feature matrix according to said initial spectral feature matrix, said adjacent frame correlation matrix and said inter-phase frame correlation matrix comprises: performing feature dimension fusion on the initial spectral feature matrix, the adjacent frame correlation matrix and the inter-phase frame correlation matrix to obtain a fused spectral feature matrix; and determining the fused spectral feature matrix as the target spectral feature matrix.
6. The method according to any one of claims 1 to 3, wherein before the extracting features of the audio data to be detected to obtain an initial spectral feature matrix of the audio data to be detected, the method further comprises: performing data enhancement processing on the audio data to be detected to obtain a plurality of pieces of audio data to be matched; and matching each piece of audio data to be matched against preset target audio data to obtain a matching result; and wherein the extracting features of the audio data to be detected to obtain an initial spectral feature matrix of the audio data to be detected comprises: if the matching result indicates that the matching is passed, extracting features of the audio data to be detected to obtain the initial spectral feature matrix.
7. An audio authenticity detection device, comprising: a transceiver module, configured to acquire audio data to be detected; and a processing module, configured to extract features of the audio data to be detected to obtain an initial spectral feature matrix of the audio data to be detected, determine a time sequence correlation matrix of the initial spectral feature matrix, determine a target spectral feature matrix according to the initial spectral feature matrix and the time sequence correlation matrix, and input the target spectral feature matrix into a preset target voice authenticity detection model to obtain a target authenticity detection result of the audio data to be detected; wherein, when executing the steps of determining the time sequence correlation matrix of the initial spectral feature matrix and determining the target spectral feature matrix according to the initial spectral feature matrix and the time sequence correlation matrix, the processing module is specifically configured to: determine an adjacent frame correlation matrix of the initial spectral feature matrix, and determine an inter-phase frame correlation matrix of the initial spectral feature matrix; and determine the target spectral feature matrix according to the initial spectral feature matrix, the adjacent frame correlation matrix and the inter-phase frame correlation matrix.
8. A computer device, characterized in that it comprises a memory on which a computer program is stored and a processor which, when executing the computer program, implements the method according to any of claims 1-6.
9. A computer readable storage medium, characterized in that the storage medium stores a computer program comprising program instructions which, when executed by a processor, can implement the method of any of claims 1-6.
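
Claims 2 and 3 describe splitting over-length audio into sub-audio data, scoring each piece, and combining the sub-results through preset judgment logic. The claims do not state the splitting strategy or the judgment logic, so the following is only a minimal sketch under stated assumptions: fixed-length splitting, a majority-style rule, and a placeholder detector. The names `split_audio`, `aggregate`, `stub_model` and the parameters `max_len` and `min_fake_ratio` are hypothetical and do not come from the patent.

```python
def split_audio(samples, max_len):
    """Return one chunk if the audio fits within max_len, else fixed-length chunks.

    Fixed-length splitting is an assumed strategy; the last chunk may be shorter.
    """
    if len(samples) <= max_len:
        return [samples]
    return [samples[i:i + max_len] for i in range(0, len(samples), max_len)]


def aggregate(sub_results, min_fake_ratio=0.5):
    """Assumed judgment logic: 'fake' when more than min_fake_ratio of chunks are fake."""
    fake = sum(1 for r in sub_results if r == "fake")
    return "fake" if fake / len(sub_results) > min_fake_ratio else "real"


def stub_model(chunk):
    """Placeholder for the trained detection model (not a real detector).

    Flags a chunk as fake when its mean absolute amplitude exceeds 0.5.
    """
    return "fake" if sum(abs(x) for x in chunk) / len(chunk) > 0.5 else "real"


# Usage: 10 samples with a length threshold of 4 yield 3 sub-audio chunks.
audio = [0.1] * 4 + [0.9] * 6
chunks = split_audio(audio, max_len=4)
sub_results = [stub_model(c) for c in chunks]
verdict = aggregate(sub_results)
```

In practice the stub would be replaced by the trained model operating on each target spectral feature sub-matrix, and the aggregation rule could just as well be "any chunk fake" or a score average.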

Description

Audio authenticity detection method, related device and storage medium

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular, to an audio authenticity detection method, a related device, and a storage medium.

Background

With the rapid development of 5G technology, voice deep forgery technologies such as Text To Speech (TTS) and Voice Conversion (VC) are becoming mature, and have been widely used in fields such as medical rehabilitation (for example, "reconstructing" the voice of a patient who has lost it) and entertainment (for example, video production). To address the hidden dangers these technologies bring, the prior art provides a voice authenticity detection model trained on spectrograms. However, the accuracy of voice authenticity detection using that model is low, so an audio authenticity detection method capable of improving the accuracy of audio authenticity detection is needed.

Disclosure of Invention

The embodiment of the application provides an audio authenticity detection method, a related device and a storage medium, which can improve the accuracy of audio authenticity detection.
In a first aspect, an embodiment of the present application provides an audio authenticity detection method, including: acquiring audio data to be detected; extracting features of the audio data to be detected to obtain an initial spectral feature matrix of the audio data to be detected; determining a time sequence correlation matrix of the initial spectral feature matrix; determining a target spectral feature matrix according to the initial spectral feature matrix and the time sequence correlation matrix; and inputting the target spectral feature matrix into a preset target voice authenticity detection model to obtain a target authenticity detection result of the audio data to be detected.
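
To illustrate the feature extraction step of the first aspect, the following minimal sketch builds a frames-by-frequency-bins matrix from raw samples. The text here does not say which spectral feature is used, so a log-magnitude short-time Fourier spectrum with a Hann window is assumed. The function names and the `frame_len` and `hop` parameters are hypothetical, and the naive DFT is chosen only to keep the example dependency-free.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split a 1-D signal into overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def log_magnitude_spectrum(frame):
    """Hann-window one frame and return log-magnitudes of its non-negative-frequency DFT bins."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    bins = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        bins.append(math.log(abs(acc) + 1e-10))  # small offset avoids log(0)
    return bins


def initial_spectral_features(samples, frame_len=64, hop=32):
    """Assumed 'initial spectral feature matrix': rows are time frames, columns are bins."""
    return [log_magnitude_spectrum(f) for f in frame_signal(samples, frame_len, hop)]


# Usage: a 1 kHz tone sampled at 8 kHz falls in bin 1000 * 64 / 8000 = 8.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]
features = initial_spectral_features(tone)
peak_bin = features[0].index(max(features[0]))
```

A production pipeline would normally use an FFT-based routine and possibly mel or constant-Q features; the matrix layout is the part that matters for the later correlation steps.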
In a second aspect, an embodiment of the present application further provides an audio authenticity detection device, including: a transceiver module, configured to acquire audio data to be detected; and a processing module, configured to extract features of the audio data to be detected to obtain an initial spectral feature matrix of the audio data to be detected, determine a time sequence correlation matrix of the initial spectral feature matrix, determine a target spectral feature matrix according to the initial spectral feature matrix and the time sequence correlation matrix, and input the target spectral feature matrix into a preset target voice authenticity detection model to obtain a target authenticity detection result of the audio data to be detected.
In some embodiments, when executing the steps of determining the time sequence correlation matrix of the initial spectral feature matrix and determining the target spectral feature matrix according to the initial spectral feature matrix and the time sequence correlation matrix, the processing module is specifically configured to: determine an adjacent frame correlation matrix of the initial spectral feature matrix, and determine an inter-phase frame correlation matrix of the initial spectral feature matrix; and determine the target spectral feature matrix according to the initial spectral feature matrix, the adjacent frame correlation matrix and the inter-phase frame correlation matrix.
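
The text above does not define how the adjacent frame and inter-phase frame correlation matrices are computed. As one possible reading, the sketch below assumes a Pearson correlation between each frame and the next frame (adjacent, lag 1) and between each frame and the frame two steps later (inter-phase, lag 2), with the partner index clamped at the last frame, and it assumes that fusion means concatenation along the feature dimension. The names `pearson`, `lag_correlation` and `fuse`, the lag values, and the edge rule are all hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length vectors; 0.0 if either is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def lag_correlation(features, lag):
    """One correlation value per frame t, comparing frame t with frame t + lag.

    The partner index is clamped to the last frame (an assumed edge rule),
    so the result has one single-column row per input frame.
    """
    last = len(features) - 1
    return [[pearson(row, features[min(t + lag, last)])]
            for t, row in enumerate(features)]


def fuse(initial, adjacent, inter_phase):
    """Assumed feature-dimension fusion: concatenate the three matrices row by row."""
    return [r + a + s for r, a, s in zip(initial, adjacent, inter_phase)]


# Usage on a toy 4-frame, 3-bin feature matrix.
initial = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 2.0, 1.0],
    [1.0, 2.0, 3.0],
]
adjacent = lag_correlation(initial, lag=1)
inter_phase = lag_correlation(initial, lag=2)
target = fuse(initial, adjacent, inter_phase)  # 4 frames x (3 + 1 + 1) features
```

Other constructions, such as frame-to-frame differences or per-bin correlations, would fit the same description; the point of the sketch is only that the correlation features keep one row per frame so they can be appended to the initial matrix.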
In some embodiments, the audio data to be detected includes a plurality of sub-audio data, the initial spectral feature matrix includes initial spectral feature sub-matrices respectively corresponding to the sub-audio data, and the target spectral feature matrix includes target spectral feature sub-matrices respectively corresponding to the initial spectral feature sub-matrices. When executing the step of inputting the target spectral feature matrix into a preset target voice authenticity detection model to obtain a target authenticity detection result of the audio data to be detected, the processing module is specifically configured to: input each target spectral feature sub-matrix into the target voice authenticity detection model to obtain authenticity detection sub-results respectively corresponding to the target spectral feature sub-matrices; and determine the target authenticity detection result according to each authenticity detection sub-result and preset authenticity judgment logic.
In some embodiments, before executing the step of acquiring audio data to be detected, the processing module is further configured to: acquire initial audio data through the transceiver module; and if the audio length of the initial audio data exceeds a preset length threshold, split the initial audio data according to a preset length splitting strategy to obtain the audio data to be detected, wher