
CN-121999800-A - Audio living body detection method, device, medium and equipment

CN 121999800 A

Abstract

The embodiments of this specification disclose an audio living body (liveness) detection method. The method performs playback processing on the audio data to be detected to obtain corresponding playback audio, determines a first audio feature of the audio data and a second audio feature of the playback audio through a feature extraction layer of a living body detection model, and determines a living body detection result of the audio data through a twin (Siamese) network of the living body detection model according to the first audio feature and the second audio feature. Because simulated replay audio is actively generated and input into the twin network together with the original audio for comparison, a replay attack can be identified even when the model's recognition of replay-attack characteristics alone is not accurate enough: replaying audio that has already passed through a replay channel produces an abnormal consistency pattern that reveals the attack. This improves both the accuracy of living body detection and the generalization to unknown attacks.

Inventors

  • WANG TAO
  • ZHENG YU
  • LIU JIAN
  • ZHANG CHANGHAO

Assignees

  • 支付宝(杭州)数字服务技术有限公司 (Alipay (Hangzhou) Digital Services Technology Co., Ltd.)

Dates

Publication Date
2026-05-08
Application Date
2026-01-05

Claims (10)

  1. An audio living body detection method, the method comprising: acquiring audio data to be detected; performing playback processing on the audio data to obtain playback audio corresponding to the audio data; inputting the audio data and the playback audio into a preset living body detection model, and respectively determining a first audio feature of the audio data and a second audio feature of the playback audio through a feature extraction layer of the living body detection model; and determining a living body detection result of the audio data through a twin network of the living body detection model according to the first audio feature and the second audio feature.
  2. The method according to claim 1, wherein inputting the audio data and the playback audio into a preset living body detection model and respectively determining the first audio feature of the audio data and the second audio feature of the playback audio through the feature extraction layer of the living body detection model specifically comprises: aligning the audio data with the playback audio; and inputting the aligned playback audio and the audio data into the living body detection model, and respectively determining, through the feature extraction layer of the living body detection model, a second audio feature corresponding to each frame of the playback audio and a first audio feature corresponding to each frame of the audio data.
  3. The method according to claim 1, wherein determining the living body detection result of the audio data through the twin network of the living body detection model according to the first audio feature and the second audio feature specifically comprises: sequentially inputting, in time-sequence order, the first audio features corresponding to each frame of the audio data into a first long short-term memory network of the twin network to obtain a first high-dimensional feature, and sequentially inputting the second audio features corresponding to each frame of the playback audio into a second long short-term memory network of the twin network to obtain a second high-dimensional feature; and determining the living body detection result of the audio data through a classification layer of the twin network according to the first high-dimensional feature and the second high-dimensional feature.
  4. The method according to claim 3, wherein determining the living body detection result of the audio data through the classification layer of the twin network according to the first high-dimensional feature and the second high-dimensional feature specifically comprises: determining, through the classification layer, a similarity between the first high-dimensional feature and the second high-dimensional feature; obtaining, according to the similarity, a probability that the audio data is live speech through a preset decision function; and when the probability is greater than a preset threshold, determining that the living body detection result is a pass, and otherwise determining that the living body detection result is a fail.
  5. The method according to claim 3, further comprising: determining, according to the time sequence, the hidden-layer states of the first long short-term memory network and the second long short-term memory network when processing audio frames of the same time step as mutually associated intermediate features; for each time step in the time sequence, determining a differential feature between the mutually associated intermediate features at that time step; and combining the differential features of the respective time steps according to the time sequence to obtain an audio differential feature.
  6. The method according to claim 5, wherein determining the living body detection result of the audio data through the classification layer of the twin network according to the first high-dimensional feature and the second high-dimensional feature specifically comprises: combining the audio differential feature, the first high-dimensional feature, and the second high-dimensional feature to obtain a combined feature; and determining the living body detection result of the audio data through the classification layer based on the combined feature.
  7. The method according to claim 4, wherein determining, through the classification layer, the living body detection result of the audio data based on the combined feature specifically comprises: inputting the combined feature into at least one fully connected layer in the classification layer to perform nonlinear transformation and feature dimensionality reduction; mapping the transformed feature, through an activation function of an output layer of the classification layer, into a probability that the audio data is live speech; and when the probability is greater than a preset threshold, determining that the living body detection result is a pass, and otherwise determining that the living body detection result is a fail.
  8. An audio living body detection apparatus, comprising: an acquisition module, configured to acquire audio data to be detected; a playback module, configured to perform playback processing on the audio data to obtain playback audio corresponding to the audio data; a feature extraction module, configured to input the audio data and the playback audio into a preset living body detection model, and respectively determine a first audio feature of the audio data and a second audio feature of the playback audio through a feature extraction layer of the living body detection model; and a detection module, configured to determine a living body detection result of the audio data through a twin network of the living body detection model according to the first audio feature and the second audio feature.
  9. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any one of claims 1-7.
  10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1-7 when executing the program.
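The twin-network decision described in claims 3 and 4 can be sketched as follows. This is a minimal, illustrative sketch, not the patent's implementation: the cosine similarity, the sigmoid-style decision function, its scale factor, and the sign convention (lower similarity between the original and replayed branches maps to a higher live probability, on the assumption that already-replayed audio changes less under a second replay) are all assumptions not fixed by the claims.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-10)

def liveness_decision(first_feature, second_feature, threshold=0.5):
    """Claim-4-style decision: map the similarity between the two branch
    embeddings through a sigmoid-style decision function, then threshold.
    The scale factor 4.0 and the sign convention are illustrative."""
    sim = cosine_similarity(first_feature, second_feature)
    prob_live = 1.0 / (1.0 + math.exp(4.0 * sim))
    return prob_live, prob_live > threshold
```

In a real system the two embeddings would be the final hidden states of the twin LSTM branches; here any two vectors of equal length can be compared.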

Description

Audio living body detection method, device, medium and equipment

Technical Field

The present disclosure relates to the field of computer technologies, and in particular to an audio living body detection method, apparatus, storage medium, and device.

Background

In recent years, voice-based identity authentication and interaction technologies have become a popular application because of their convenience. To improve the security of voice interaction, audio living body detection is required: by identifying the source of the audio data, it is judged whether the audio comes from the user himself or from a spoofing attack such as a recording or synthesized speech, thereby improving the safety of service execution.

In the prior art, for living body detection of an audio signal, audio features of the signal are usually extracted first, for example Mel-frequency cepstral coefficients (MFCC), linear-frequency cepstral coefficients (LFCC), or constant-Q transform (CQT) spectrograms. A classifier then determines, based on these audio features, whether the audio is live or a spoofed playback. However, the prior art relies on end-to-end training of a single model, so the classification logic resides entirely in a single deep neural network; the model can only obtain a probability through nonlinear transformations of the audio features and lacks constraints on the physical properties of live speech. End-to-end training of a single model is also prone to overfitting: the model tends to learn only surface statistics of the training samples and has difficulty learning the deep relations among features. As a result, the false rejection rate for genuine speech is too high, or the recall rate for spoofed speech is too low. Based on this, the present specification provides an audio living body detection method to partially solve the problems existing in the prior art.
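The frame-level feature extraction mentioned in the background (MFCC, LFCC, CQT) can be illustrated with a much simpler stand-in: framing the waveform and computing per-frame log energy. This sketch only shows the framing structure such features share; the frame length, hop size, and the log-energy feature itself are illustrative assumptions, not the patent's feature extractor.

```python
import math

def frame_log_energies(samples, frame_len=400, hop=160):
    """Split a waveform into overlapping frames (e.g. 25 ms frames with a
    10 ms hop at 16 kHz) and compute per-frame log energy -- a toy
    stand-in for MFCC/LFCC/CQT frame-level features."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        feats.append(math.log(energy + 1e-10))  # floor avoids log(0)
    return feats

# 100 ms of a 440 Hz tone sampled at 16 kHz
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
features = frame_log_energies(tone)
```

Each element of `features` would, in the patent's model, instead be a multi-dimensional feature vector fed frame by frame into the LSTM branches.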
Disclosure of Invention

The embodiments of this specification provide an audio living body detection method, apparatus, storage medium, and electronic device to partially solve the problems existing in the prior art. The embodiments of this specification adopt the following technical scheme.

An audio living body detection method provided in this specification comprises: acquiring audio data to be detected; performing playback processing on the audio data to obtain playback audio corresponding to the audio data; inputting the audio data and the playback audio into a preset living body detection model, and respectively determining a first audio feature of the audio data and a second audio feature of the playback audio through a feature extraction layer of the living body detection model; and determining a living body detection result of the audio data through a twin network of the living body detection model according to the first audio feature and the second audio feature.

An audio living body detection apparatus provided in this specification comprises: an acquisition module, configured to acquire audio data to be detected; a playback module, configured to perform playback processing on the audio data to obtain playback audio corresponding to the audio data; a feature extraction module, configured to input the audio data and the playback audio into a preset living body detection model and respectively determine a first audio feature of the audio data and a second audio feature of the playback audio through a feature extraction layer of the living body detection model; and a detection module, configured to determine a living body detection result of the audio data through a twin network of the living body detection model according to the first audio feature and the second audio feature.
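The playback-processing step above can be approximated in software. The sketch below crudely simulates a loudspeaker-to-microphone replay channel with a one-pole low-pass filter (band-limiting) plus a small deterministic pseudo-noise term; both the filter and the noise model are illustrative assumptions, since the specification does not fix how the playback audio is generated.

```python
import math
import random

def simulate_playback(samples, alpha=0.6, noise_scale=0.002, seed=7):
    """Crudely approximate a replay channel: a one-pole low-pass filter
    models the band-limiting of a speaker/microphone pair, and a small
    seeded pseudo-noise term models channel noise. Illustrative only."""
    rng = random.Random(seed)  # seeded for reproducibility
    out = []
    prev = 0.0
    for x in samples:
        prev = alpha * prev + (1.0 - alpha) * x  # one-pole low-pass
        out.append(prev + noise_scale * (rng.random() - 0.5))
    return out

clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
replayed = simulate_playback(clean)
```

The resulting `replayed` signal, paired with the original, is what the two branches of the twin network would consume.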
A computer-readable storage medium provided in this specification stores a computer program which, when executed by a processor, implements the above audio living body detection method. An electronic device provided in this specification comprises a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the above audio living body detection method when executing the program.

At least one of the technical schemes adopted in the embodiments of this specification can achieve the following beneficial effects: playback processing is performed on the audio data to be detected to obtain corresponding playback audio; a first audio feature and a second audio feature are then determined for the audio data and the playback audio, respectively, through a feature extraction layer of a living body detection model; and a living body detection result of the audio data is determined through a twin network of the living body detection model according to the first audio feature and the second audio feature.
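The audio differential feature of claim 5 can be illustrated as follows: at each time step, take a difference between the two LSTM branches' hidden states, then concatenate the per-step differences in time order. The element-wise absolute difference used here is an assumption; the claim does not fix the exact difference operator.

```python
def differential_features(hidden_a, hidden_b):
    """Claim-5-style audio differential feature: for each time step,
    compute the element-wise absolute difference between the two
    branches' hidden states, then concatenate in time order.
    hidden_a, hidden_b: lists of equal-length hidden-state vectors,
    one vector per time step."""
    per_step = []
    for h_a, h_b in zip(hidden_a, hidden_b):
        per_step.append([abs(x - y) for x, y in zip(h_a, h_b)])
    # concatenate per-step differences into one flat differential feature
    return [v for step in per_step for v in step]
```

Per claim 6, this flat vector would be concatenated with the two high-dimensional features to form the combined feature fed to the classification layer.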