
CN-117649842-B - Voiceprint feature extraction method for specific content voice fragments

CN 117649842 B

Abstract

The application provides a voiceprint feature extraction method for a specific content voice segment, which comprises the steps of: obtaining an acoustic spectrum feature segment through preprocessing; constructing a time delay neural network module; constructing a residual time delay neural network module based on the time delay neural network module, a weighted excitation mechanism and a residual structure; constructing a residual attention time delay neural network module based on the time delay neural network module, the residual time delay neural network module and an attention pooling mechanism; and inputting the acoustic spectrum feature segment into the residual attention time delay neural network module to obtain the voiceprint feature of the specific content voice segment. The voiceprint feature extraction method provided by the application can extract deep feature information at a plurality of scales, and can effectively extract voiceprint features from specific content voice segments by combining a residual network, weighted excitation, an attention pooling mechanism and other techniques.

Inventors

  • LI ZHAN
  • ZHAO YONGGUO
  • YANG RONGXIA
  • YANG KAI
  • DU MEIHUA
  • QIAN LINJUN

Assignees

  • 南方电网大数据服务有限公司 (China Southern Power Grid Big Data Service Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2022-08-12

Claims (7)

  1. A voiceprint feature extraction method for a specific content speech segment, the voiceprint feature extraction method comprising: determining text containing specific content and the corresponding speech; extracting acoustic spectrum features of the corresponding speech, performing speech recognition on the acoustic spectrum features, and segmenting the acoustic spectrum features to obtain acoustic spectrum feature segments corresponding to the text containing specific content; and constructing a residual attention time delay neural network module based on a time delay neural network module, a residual time delay neural network module and an attention pooling mechanism, wherein the residual attention time delay neural network module is used for outputting the voiceprint features of the specific content speech segment from the acoustic spectrum feature segments, the time delay neural network module is used for performing a one-dimensional convolution operation over time on input feature information to extract acoustic feature information, and the residual time delay neural network module is constructed based on the time delay neural network module, a weighted excitation mechanism and a residual structure and is used for extracting multi-scale feature information from the input feature information; wherein constructing the residual attention time delay neural network module based on the time delay neural network module, the residual time delay neural network module and the attention pooling mechanism comprises: extracting feature information through at least one layer of the time delay neural network module and at least one layer of the residual time delay neural network module; performing a one-dimensional convolution on the output feature information of the last time delay neural network layer with a convolution function, activating with a tanh activation function, performing a further one-dimensional convolution with a convolution function, and obtaining attention coefficients through a softmax activation function; applying the attention coefficients to the output feature information of the last time delay neural network layer to obtain a mean value and a standard deviation, and concatenating the mean value and the standard deviation to obtain the output feature information of the attention pooling mechanism; and normalizing the output feature information of the attention pooling mechanism with a BatchNorm function and outputting the voiceprint features of the specific content speech segment through a linear layer.
  2. The voiceprint feature extraction method according to claim 1, wherein extracting the acoustic spectrum features of the corresponding speech, performing speech recognition on the acoustic spectrum features, and segmenting the acoustic spectrum features to obtain the acoustic spectrum feature segments corresponding to the text containing specific content comprises: extracting acoustic spectrum features from speech in which a test speaker or a registered speaker reads a digit string aloud; performing end-to-end speech recognition on the acoustic spectrum features to obtain the corresponding digit-string text label and the start and end time labels of the corresponding digit segments and silence segments; and cutting the acoustic spectrum features according to the start and end time labels of the corresponding digit segments and silence segments, and removing the silence segments to obtain the acoustic spectrum feature segments corresponding to the digit segments.
  3. The voiceprint feature extraction method according to claim 2, wherein extracting the acoustic spectrum features from the digit-string read-aloud speech of the test speaker or the registered speaker comprises: obtaining Mel-frequency cepstrum features or perceptual linear prediction features from the digit-string read-aloud speech of the test speaker or the registered speaker, and performing differential cepstral feature analysis on those features so that a 60-dimensional feature vector is obtained for each frame, thereby extracting the acoustic spectrum features.
  4. The voiceprint feature extraction method according to claim 1, wherein the time delay neural network module is configured to perform a one-dimensional convolution operation over time on the input feature information and extract acoustic feature information, comprising: performing a one-dimensional convolution on the time scale with a convolution kernel of a given size to extract feature information and fuse features of different channels; performing an activation calculation on the convolved output feature information with a LeakyReLU activation function; and performing batch normalization on the output feature information of the LeakyReLU activation function with a BatchNorm function, wherein the mean value and standard deviation of the BatchNorm function are calculated from the mean value and standard deviation of each dimension.
  5. The voiceprint feature extraction method according to claim 1, wherein the residual time delay neural network module is constructed based on the time delay neural network module, a weighted excitation mechanism and a residual structure and is used for extracting multi-scale feature information from the input feature information, comprising: using at least one layer of the time delay neural network module with different convolution kernel sizes and convolution strides; reducing the dimension with a Linear function, activating with a ReLU activation function, increasing the dimension with a Linear function, and activating with a sigmoid activation function to obtain a weighted excitation coefficient; applying the weighted excitation coefficient to the output feature information of the last time delay neural network module to obtain the output feature information of the weighted excitation mechanism; and adding the output feature information of the weighted excitation mechanism to the input feature information of the first time delay neural network layer to obtain the output feature information of the residual time delay neural network module.
  6. The voiceprint feature extraction method according to claim 5, wherein the ReLU activation function comprises: ReLU(x) = max(0, x); and the sigmoid activation function comprises: sigmoid(x) = 1 / (1 + e^(-x)).
  7. The voiceprint feature extraction method according to claim 1, wherein the tanh activation function comprises: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)).
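
Below is a minimal, non-authoritative Python sketch of the preprocessing described in claims 2 and 3: 20 MFCCs plus first- and second-order differential cepstral features give the 60-dimensional per-frame vectors, and the digit-segment start/end times are assumed to come from a separate end-to-end speech recognizer that is not implemented here. librosa is assumed for feature extraction; all function names and parameter values are illustrative and are not taken from the patent.

```python
# Hypothetical preprocessing sketch for claims 2-3 (not the patented implementation).
import numpy as np
import librosa

def extract_60dim_features(wav_path, sr=16000, hop_length=160):
    """MFCC + first/second-order differential cepstral features: 60 dims per frame."""
    signal, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20, hop_length=hop_length)
    delta1 = librosa.feature.delta(mfcc, order=1)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return np.concatenate([mfcc, delta1, delta2], axis=0)      # shape (60, n_frames)

def cut_digit_segments(feats, digit_spans, sr=16000, hop_length=160):
    """Keep only frames inside the ASR-provided (start_sec, end_sec) digit spans;
    everything outside the spans (the silence segments) is discarded."""
    kept = []
    for start_sec, end_sec in digit_spans:
        start_frame = int(start_sec * sr / hop_length)
        end_frame = int(end_sec * sr / hop_length)
        kept.append(feats[:, start_frame:end_frame])
    return np.concatenate(kept, axis=1)          # acoustic spectrum feature segment

# Example usage (digit_spans would come from the end-to-end recognizer's
# start/end time labels for each digit field):
# feats = extract_60dim_features("digits.wav")
# segment = cut_digit_segments(feats, digit_spans=[(0.31, 0.62), (0.84, 1.15)])
```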
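
The network modules of claims 1, 4 and 5 can likewise be illustrated with a hedged PyTorch sketch: a time delay neural network (TDNN) block built from a one-dimensional convolution, LeakyReLU and BatchNorm; a residual TDNN block whose weighted excitation coefficient is produced by Linear, ReLU, Linear and sigmoid operations and applied before the residual addition; and an attention pooling head (Conv1d, tanh, Conv1d, softmax, attention-weighted mean and standard deviation, BatchNorm, linear layer). Channel counts, kernel sizes and the number of stacked blocks below are assumptions for the example, not values fixed by the claims.

```python
# Illustrative sketch only; hyperparameters are assumptions, not from the patent.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TDNNBlock(nn.Module):
    """Claim 4: 1-D convolution over time, LeakyReLU activation, then BatchNorm."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1, stride=1):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, stride=stride,
                              dilation=dilation,
                              padding=dilation * (kernel_size - 1) // 2)
        self.act = nn.LeakyReLU()
        self.bn = nn.BatchNorm1d(out_ch)

    def forward(self, x):                      # x: (batch, channels, frames)
        return self.bn(self.act(self.conv(x)))

class ResidualTDNNBlock(nn.Module):
    """Claim 5: stacked TDNN layers, a weighted excitation mechanism
    (Linear -> ReLU -> Linear -> sigmoid), then a residual addition."""
    def __init__(self, channels, kernel_size=3, reduction=4):
        super().__init__()
        self.tdnn1 = TDNNBlock(channels, channels, kernel_size)
        self.tdnn2 = TDNNBlock(channels, channels, 1)
        self.fc_down = nn.Linear(channels, channels // reduction)   # dimension reduction
        self.fc_up = nn.Linear(channels // reduction, channels)     # dimension increase

    def forward(self, x):
        h = self.tdnn2(self.tdnn1(x))
        s = h.mean(dim=2)                                  # time-averaged channel vector
        w = torch.sigmoid(self.fc_up(F.relu(self.fc_down(s))))  # excitation coefficient
        h = h * w.unsqueeze(2)                             # apply weighted excitation
        return h + x                                       # residual structure

class AttentivePooling(nn.Module):
    """Claim 1: Conv1d -> tanh -> Conv1d -> softmax attention coefficients,
    then attention-weighted mean and standard deviation, concatenated."""
    def __init__(self, channels, bottleneck=128):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, bottleneck, 1)
        self.conv2 = nn.Conv1d(bottleneck, channels, 1)

    def forward(self, x):                                  # x: (batch, channels, frames)
        attn = torch.softmax(self.conv2(torch.tanh(self.conv1(x))), dim=2)
        mean = torch.sum(attn * x, dim=2)
        var = (torch.sum(attn * x ** 2, dim=2) - mean ** 2).clamp(min=1e-8)
        return torch.cat([mean, var.sqrt()], dim=1)        # (batch, 2 * channels)

class ResidualAttentionTDNN(nn.Module):
    """Claim 1 overall: TDNN layer(s) + residual TDNN layer(s) + attention pooling,
    BatchNorm, then a linear layer emitting the voiceprint feature."""
    def __init__(self, feat_dim=60, channels=512, embed_dim=192):
        super().__init__()
        self.front = TDNNBlock(feat_dim, channels, kernel_size=5)
        self.res_blocks = nn.Sequential(ResidualTDNNBlock(channels),
                                        ResidualTDNNBlock(channels))
        self.pool = AttentivePooling(channels)
        self.bn = nn.BatchNorm1d(2 * channels)
        self.linear = nn.Linear(2 * channels, embed_dim)

    def forward(self, feats):                              # feats: (batch, 60, frames)
        h = self.res_blocks(self.front(feats))
        return self.linear(self.bn(self.pool(h)))          # voiceprint feature

# Example: embedding = ResidualAttentionTDNN()(torch.randn(2, 60, 300))
```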

Description

Voiceprint feature extraction method for specific content voice fragments

Technical Field

The invention relates to the technical field of speaker recognition, and in particular to a voiceprint feature extraction method for a specific content voice segment.

Background

With the development of pattern recognition and artificial intelligence, speaker recognition technology has advanced rapidly in recent years, has been applied more and more widely, and has become one of the research hotspots of voice recognition technology. Voiceprint recognition technology, namely speaker recognition technology, has important applications in information security, public security and judicial work, and military and national defense, and voiceprint recognition currently performs well on many data sets. With the popularity of the internet and mobile devices, the importance of identity authentication has become particularly prominent. In this context, using a voiceprint password on top of existing authentication techniques can increase the security and reliability of account access. In practical applications, voiceprint recognition systems are often configured with specific content, and because random digit strings are simple and general to use, they have become a mainstream way of applying speaker recognition technology to passwords. However, because of the coarticulation problem of the specific content (i.e., the phenomenon that the pronunciation of a sound is affected by the preceding and following sounds), and because conventional voiceprint feature extraction techniques do not take into account the specific content or the limitation on its length (short phrases and segments) imposed to improve the user experience, speaker recognition systems based on specific content do not perform well in practical applications. Currently, a specific content voice segment usually lasts only 2-4 seconds, and common algorithms have difficulty making full use of the effective information in the voice features, so how to better extract the acoustic features of voice segments containing specific content has become an important research topic in the field.

Disclosure of Invention

The application aims to overcome the defects existing in the prior art. The application provides a voiceprint feature extraction method for a specific content voice segment, which takes the short duration of the specific content voice segment into consideration and extracts effective information in the voice features from a plurality of scales.
The application provides a voiceprint feature extraction method for a specific content voice segment, which comprises the steps of: determining a text containing specific content and the corresponding voice; extracting acoustic spectrum features of the corresponding voice, performing voice recognition on the acoustic spectrum features, and segmenting the acoustic spectrum features to obtain acoustic spectrum feature segments corresponding to the text containing specific content; and constructing a residual attention time delay neural network module based on a time delay neural network module, a residual time delay neural network module and an attention pooling mechanism, wherein the time delay neural network module is used for performing a one-dimensional convolution operation over time on input feature information to extract acoustic feature information, and the residual time delay neural network module is constructed based on the time delay neural network module, a weighted excitation mechanism and a residual structure and is used for extracting multi-scale feature information from the input feature information.

In one possible embodiment, extracting the acoustic spectrum features of the corresponding voice, performing voice recognition on the acoustic spectrum features, and segmenting the acoustic spectrum features to obtain acoustic spectrum feature segments corresponding to the text containing specific content includes: extracting acoustic spectrum features from the digit-string read-aloud voice of a test speaker or a registered speaker; performing end-to-end voice recognition on the acoustic spectrum features to obtain the corresponding digit-string text label and the start and end time labels of the corresponding digit segments and silence segments; and cutting the acoustic spectrum features according to the start and end time labels of the corresponding digit segments and silence segments, and removing the silence segments to obtain the acoustic spectrum feature segments corresponding to the digit segments.

In one possible embodiment, extracting the acoustic spectrum features based on the digit-string read-aloud voice of the test speaker or the registered speaker includes obtaining a mel-frequency cepstrum feature (MFCC) or a perceptual linear prediction feature (PLP) based on the digital string read-aloud v