CN-121983030-A - Speaker voice segmentation method based on non-local space U-Net and mixed features
Abstract
The invention belongs to the technical field of speech signal processing and provides a speaker voice segmentation method based on a non-local spatial U-Net and hybrid features. The invention constructs a non-local spatial U-Net network that captures long-range dependencies in the speech signal through a non-local spatial attention module, improving the expressive power of the spatial features; it also adopts a hybrid-feature fusion mechanism, combining time-frequency features with deep semantic features to enhance the discriminability of the speech features. In addition, a dynamically weighted loss function is designed to optimize feature fusion across scales and balance the contributions of different kinds of speech segments. The invention fully exploits the spatio-temporal correlations in the speech signal, overcomes the dependence of traditional methods on local and single features, and markedly improves the accuracy and robustness of speaker voice segmentation.
Inventors
- LI SUYUAN
- LIU XINYUE
- LU SIYU
- JIANG NAN
- WANG YIMING
- QIN JIA
- WANG DAN
- SHAO DONGMEI
- WANG HUAPENG
- YANG HONGCHEN
Assignees
- 中国刑事警察学院 (China Criminal Police University)
Dates
- Publication Date
- 20260505
- Application Date
- 20251216
Claims (5)
- 1. A speaker voice segmentation method based on a non-local spatial U-Net and hybrid features, characterized by comprising the following steps: S1, acquiring the time-domain waveform of an input speech signal and preprocessing it to obtain a spectrogram of the signal; S2, extracting Mel-frequency cepstral coefficients and Gammatone cepstral coefficients, stacking and normalizing the two features along the frequency dimension, and combining them as hybrid features for speaker voice segmentation, exploiting the high accuracy of the Mel cepstral features and the strong robustness of the Gammatone cepstral features; S3, constructing an encoder model based on the non-local spatial U-Net network, designing a non-local spatial module, and reducing the dimensionality of the spectral features of the speech; S4, measuring the distance between two segments of features with the KL divergence; S5, repeating steps S3 and S4, sliding the analysis window until it exceeds the range of the feature vectors, then smoothing the distances obtained in step S4 with a low-pass filter and taking the local extrema of the resulting curve as speech segmentation points; a speaker voice segmentation model with high segmentation accuracy and stability, based on the non-local spatial U-Net and hybrid features, is thereby obtained through training.
- 2. The speaker voice segmentation method based on the non-local spatial U-Net and hybrid features according to claim 1, wherein the speech signal is preprocessed as follows: the test speech sample is several minutes of multi-speaker dialogue audio recorded with a mobile phone; the audio file is first converted to WAV format, and noise from a noise library is added; the resulting WAV file is then used to extract the hybrid features for speaker voice segmentation, exploiting the high accuracy of the Mel cepstral features and the strong robustness of the Gammatone cepstral features.
  To obtain the Mel-frequency cepstral coefficients, the input speech is preprocessed. The sound wave radiated from the vocal tract through the lips loses part of its high-frequency content during propagation through the air, so the original speech is pre-emphasized with a first-order digital filter to boost the high frequencies and compensate for the suppression introduced by the vocal system; the filter is
  H(z) = 1 − a·z⁻¹,
  where a generally lies between 0.9 and 1.0. Let the original speech signal be x(n); the signal after the pre-emphasis filter is
  y(n) = x(n) − a·x(n − 1).
  The speech is framed by weighting it with a movable window of finite length. Consecutive frames are usually not disjoint but partially overlapping; assuming a frame length N and a frame shift T, the i-th frame of the speech signal is
  x_i(n) = y(i·T + n), 0 ≤ n < N,
  where the ratio T/N of frame shift to frame length lies between 0 and 0.5, ensuring a smooth transition between frames. Framing truncates the signal abruptly at the frame boundaries, producing discontinuities that distort the spectrum after the Fourier transform; this is mitigated by a window function, which is nonzero only on a given interval and zero elsewhere. The Hanning window is one of the most common window functions:
  w(n) = 0.5·(1 − cos(2π·n/(N − 1))), 0 ≤ n ≤ N − 1, and w(n) = 0 otherwise.
  After framing and windowing, the speech is analyzed in the time-frequency domain through the short-time Fourier transform:
  X(t, ω) = Σₙ y(n)·w(n − t)·e^(−jωn).
  Here w(n) is the window function; the short-time Fourier transform uses the shifted window w(n − t) to intercept the speech signal y(n) and applies the Fourier transform to the intercepted local signal, i.e. it computes the transform at time t. Continuously varying t yields the set of Fourier transforms at every moment, and X(t, ω) represents the frequency components of the speech signal in the neighborhood of time t. Taking the absolute value or the square of the data at each time t yields the energy spectrum.
  To compute the Mel-frequency cepstral coefficients, M triangular filters H_m(k), 1 ≤ m ≤ M, are arranged over the spectral range of each speech frame, where f(m) denotes the center frequency of the m-th triangular filter:
  H_m(k) = 0 for k < f(m − 1);
  H_m(k) = (k − f(m − 1)) / (f(m) − f(m − 1)) for f(m − 1) ≤ k ≤ f(m);
  H_m(k) = (f(m + 1) − k) / (f(m + 1) − f(m)) for f(m) ≤ k ≤ f(m + 1);
  H_m(k) = 0 for k > f(m + 1).
  The logarithmic energy output by each triangular filter is
  s(m) = ln( Σ_{k=0}^{N−1} |X_i(k)|²·H_m(k) ), 1 ≤ m ≤ M,
  where X_i(k) is the spectrum of the i-th speech frame. When the spectrum is filtered with the triangular filter bank, the amplitude at each point within a filter's bandwidth is multiplied by the energy of the corresponding point in the spectrogram and the products are summed; the sum is the output of that triangular filter, which maps the actual spectrum onto the Mel spectrum. The Mel-frequency cepstral coefficients are then obtained by applying the discrete cosine transform to the log filter-bank energies:
  C(n) = Σ_{m=1}^{M} s(m)·cos( π·n·(m − 0.5)/M ).
  For the Gammatone cepstral feature, the Gammatone filter resembles the basilar membrane of the human cochlea: it effectively simulates the cochlea's frequency-division physiology and establishes a cochlea-like auditory model. Its time-domain expression is
  g(t) = A·t^(n−1)·e^(−2π·b·t)·cos(2π·f_c·t + φ), t ≥ 0,
  where φ is the phase, f_c the center frequency, n the order of the filter, b the bandwidth of the filter, and A the filter gain. The bandwidth of a fourth-order Gammatone filter can be expressed as
  b = 1.019·ERB(f_c).
  The bandwidth of the filter increases with the center frequency; ERB denotes the equivalent rectangular bandwidth, a psychoacoustic measure that determines the decay speed of each filter's impulse response. The relationship between ERB and the frequency f is
  ERB(f) = 24.7·(4.37·f/1000 + 1),
  where f is the center frequency in Hz.
- 3. The speaker voice segmentation method based on the non-local spatial U-Net and hybrid features according to claim 1, wherein step S3 designs a non-local spatial module, constructs an encoder model based on the non-local spatial U-Net network, and reduces the dimensionality of the spectral features of the speech, specifically: a non-local spatial U-Net network is constructed to reduce the dimensionality of the spectral features, and the reduced features are used for the subsequent speaker segmentation; since the Mel cepstral and Gammatone cepstral coefficients serve as frequency-domain speech features, the numbers of convolution and transposed-convolution channels in the encoder and decoder of the original network are reduced to suit them, and several linear layers are added between the encoder and decoder to obtain a one-dimensional feature representation.
  A non-local spatial module is designed that suppresses the feature responses of irrelevant background regions and improves performance. Before each up-sampling block, the feature maps produced by the up-sampling block and the down-sampling block at the corresponding level are each passed through a convolution and then combined to obtain attention coefficients; these coefficients increase the weight of speech feature regions and ensure that effective feature regions are extracted in the non-local spatial module. Specifically, the decoder features at level l can be expressed as
  d_l = W_d ∗ D_l + b_d,
  and the encoder features as
  e_l = W_e ∗ E_l + b_e,
  where D_l and E_l denote the feature maps of the up-sampling and down-sampling blocks at level l, W_d and W_e denote the weight parameters learned for the feature maps D_l and E_l, b_d and b_e the corresponding learned biases, and H and W the height and width of the feature maps. After obtaining the encoder and decoder features, the attention coefficients are computed as
  α_l = σ( w_f ∗ ReLU(d_l + e_l) + b_f ),
  where w_f and b_f are the learnable weight parameters and bias of the feature projection at level l, and σ is the Sigmoid activation function, so each output value lies in the range (0, 1), i.e. α_l ∈ (0, 1). All learned weight parameters are realized as 1×1 convolutions, and W_d, W_e, and w_f are updated through back-propagation. The output features of the non-local spatial module can then be expressed as
  Ê_l = α_l ⊙ E_l,
  i.e. the attention coefficients re-weight the encoder feature map element-wise.
  For training, the auto-encoder is trained on the TED-LIUM dataset: the corpus is segmented with a fixed length, the Mel cepstral and Gammatone cepstral features are extracted, and the segments are fed into the encoder for training; the loss between the model output and the model input is computed with the mean squared error, and the model is optimized with Adam.
- 4. The speaker voice segmentation method based on the non-local spatial U-Net and hybrid features according to claim 1, wherein step S4 measures the distance between two segments of features with the KL divergence
  D_KL(P‖Q) = Σᵢ P(i)·log( P(i) / Q(i) ),
  where P and Q are the feature distributions of the two segments.
- 5. The speaker voice segmentation method based on the non-local spatial U-Net and hybrid features according to claim 1, wherein step S5 repeats steps S3 and S4, sliding the analysis window until it exceeds the range of the feature vectors, then smooths the distances obtained in step S4 with a low-pass filter and takes the local extrema of the resulting curve as the speech segmentation points.
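As a concrete illustration of the front end described in claim 2, the following Python sketch implements pre-emphasis, overlapping framing with a Hanning window, and the ERB-based bandwidth of a fourth-order Gammatone filter. The frame length (400 samples) and frame shift (160 samples) assume 16 kHz audio and are illustrative choices, not values fixed by the claims.

```python
import numpy as np

def preemphasis(x, alpha=0.97):
    # first-order high-pass: y(n) = x(n) - alpha * x(n-1), alpha in [0.9, 1.0)
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_and_window(x, frame_len=400, hop=160):
    # overlapping frames; hop/frame_len = 0.4 here, within the 0-0.5 range of claim 2
    n = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])
    return frames * np.hanning(frame_len)      # Hanning window applied per frame

def erb(f):
    # equivalent rectangular bandwidth in Hz: ERB(f) = 24.7 * (4.37 * f / 1000 + 1)
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_bandwidth(fc):
    # bandwidth of a 4th-order Gammatone filter centred at fc: b = 1.019 * ERB(fc)
    return 1.019 * erb(fc)
```

From here, the MFCC branch would apply an FFT, the triangular Mel filter bank, a logarithm, and a DCT, while the Gammatone branch filters the pre-emphasized signal with a bank whose bandwidths follow `gammatone_bandwidth`.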
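The attention coefficients of claim 3 can be sketched in NumPy as follows. The 1×1 convolutions are expressed as matrix products over the channel dimension; the additive combination with a ReLU before the sigmoid follows the standard attention-gate formulation and is an assumption where the claim only says the convolved features are "combined".

```python
import numpy as np

def attention_gate(enc, dec, W_e, W_d, b_e, b_d, w_f, b_f):
    """enc, dec: (C, H, W) encoder/decoder feature maps at the same level.
    A 1x1 convolution over channels is a matrix product applied at each pixel."""
    e = np.tensordot(W_e, enc, axes=1) + b_e[:, None, None]   # 1x1 conv on encoder features
    d = np.tensordot(W_d, dec, axes=1) + b_d[:, None, None]   # 1x1 conv on decoder features
    q = np.maximum(e + d, 0.0)                                # combine, then ReLU (assumed)
    s = np.tensordot(w_f, q, axes=1) + b_f                    # project to a single channel
    alpha = 1.0 / (1.0 + np.exp(-s))                          # sigmoid -> coefficients in (0, 1)
    return alpha[None] * enc                                  # re-weight the encoder feature map
```

Because every coefficient lies strictly between 0 and 1, the gate can only attenuate encoder responses, which matches the claim's goal of suppressing irrelevant background regions.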
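Claim 4 measures the distance between two feature segments with the KL divergence. A minimal sketch, with two assumptions flagged in the comments: the segment features are treated as non-negative histograms that get normalized, and the divergence is symmetrized so the distance does not depend on segment order (the claim itself only names the KL divergence).

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-10):
    # assumption: inputs are non-negative feature vectors, normalised to distributions;
    # eps avoids log(0) and division by zero for empty bins
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    # assumption: symmetrised as D(p||q) + D(q||p)
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```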
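For the smoothing-and-extrema step of claim 5, one possible realization of the unspecified low-pass filter is a short windowed-sinc FIR; the tap count and cutoff below are illustrative assumptions, not values from the claims.

```python
import numpy as np

def change_points(dist, num_taps=9, cutoff=0.1):
    """Smooth a KL-distance curve with a windowed-sinc low-pass FIR filter,
    then return indices of local maxima as candidate segmentation points."""
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(2 * cutoff * n) * np.hamming(num_taps)   # windowed-sinc low-pass taps
    h /= h.sum()                                          # normalise to unit DC gain
    smooth = np.convolve(dist, h, mode="same")
    # a local maximum of the smoothed curve marks a likely speaker change
    return [i for i in range(1, len(smooth) - 1)
            if smooth[i] > smooth[i - 1] and smooth[i] > smooth[i + 1]]
```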
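Putting steps S3-S5 of claim 1 together, the sketch below slides two adjacent windows over frame-level features, compares them, smooths the resulting distance curve, and reads off local maxima as candidate speaker-change frames. The encoder is replaced by a trivial stand-in, the smoother by a moving average, and all window sizes are illustrative; only the loop structure mirrors the claim.

```python
import numpy as np

def segment_speakers(features, win=50, hop=10, k_smooth=5):
    """features: (num_frames, num_coeffs) hybrid feature matrix."""
    def encode(x):                      # stand-in for the U-Net encoder of S3
        return np.abs(x).mean(axis=0) + 1e-8

    def kl(p, q):                       # symmetric KL divergence for S4
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

    dists, centers, t = [], [], 0
    while t + 2 * win <= len(features):  # S5: loop until the window leaves the feature range
        left, right = features[t:t + win], features[t + win:t + 2 * win]
        dists.append(kl(encode(left), encode(right)))
        centers.append(t + win)
        t += hop

    # moving-average smoothing as a stand-in for the low-pass filter of S5
    d = np.convolve(dists, np.ones(k_smooth) / k_smooth, mode="same")
    # local maxima of the smoothed curve -> candidate segmentation frames
    return [centers[i] for i in range(1, len(d) - 1)
            if d[i] > d[i - 1] and d[i] > d[i + 1]]
```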
Description
Speaker voice segmentation method based on non-local spatial U-Net and hybrid features

Technical Field

The invention belongs to the technical field of speech signal processing and provides a speaker voice segmentation method based on a non-local spatial U-Net and hybrid features.

Background

To raise the intelligence of speech signal processing and meet the high-precision segmentation requirements that intelligent customer service, meeting transcription, forensic evidence collection, and similar fields place on multi-speaker audio, speaker voice segmentation in complex acoustic environments has become increasingly important. Speaker voice segmentation aims to divide a continuous audio stream accurately into the speech segments of different speakers and to label their identities. Traditional methods based on hand-crafted features (such as MFCC or pitch) and clustering algorithms (such as K-means or GMM) degrade significantly at low signal-to-noise ratios or with overlapping speech, and adapt poorly to the acoustic differences between speakers. Researching speaker voice segmentation methods with strong robustness and high adaptability is therefore important for accurate speech analysis and processing.

Existing speaker voice segmentation methods fall mainly into those based on traditional signal processing and those based on deep learning. Traditional signal-processing methods typically rely on time-domain features such as short-time energy and zero-crossing rate, or frequency-domain features such as spectral centroid and harmonic structure, combined with Dynamic Time Warping (DTW) or Hidden Markov Models (HMM) for segmentation. Such methods, however, degrade sharply under noise interference, overlapping speech, or non-stationary background sounds. In recent years, deep-learning-based segmentation methods have improved accuracy significantly through end-to-end training; models based on Recurrent Neural Networks (RNN) or Convolutional Neural Networks (CNN), for example, can automatically learn the temporal and spectral features of speech. Two key problems nevertheless remain in mainstream speaker voice segmentation: first, algorithms based on fixed-scale analysis are insufficiently robust in complex noise environments and are easily disturbed by the background, lowering segmentation accuracy; second, traditional feature extraction produces many redundant segmentation points, which both increases computational complexity and degrades the accuracy of segmentation boundaries. These problems severely restrict the performance of speech segmentation systems in practical scenarios.

Disclosure of Invention

To solve the above technical problems, the invention provides a speaker voice segmentation method based on a non-local spatial U-Net and hybrid features.
The technical scheme adopted by the invention is as follows. A speaker voice segmentation method based on a non-local spatial U-Net and hybrid features comprises the following steps: S1, acquiring the time-domain waveform of an input speech signal and preprocessing it to obtain a spectrogram of the signal; S2, extracting Mel-frequency cepstral coefficients and Gammatone cepstral coefficients, stacking and normalizing the two features along the frequency dimension, and combining them as hybrid features for speaker voice segmentation, exploiting the high accuracy of the Mel cepstral features and the strong robustness of the Gammatone cepstral features; S3, constructing an encoder model based on the non-local spatial U-Net network, designing a non-local spatial module, and reducing the dimensionality of the spectral features of the speech; S4, measuring the distance between two segments of features with the KL divergence; S5, repeating steps S3 and S4, sliding the analysis window until it exceeds the range of the feature vectors, then smoothing the distances obtained in step S4 with a low-pass filter and taking the local extrema of the resulting curve as speech segmentation points; a speaker voice segmentation model with high segmentation accuracy and stability, based on the non-local spatial U-Net and hybrid features, is thereby obtained through training. Further, the method comprises the following steps: the speech signal is preprocessed; specifically, the test speech sample is several minutes of multi-speaker dialogue audio recorded with a mobile phone; the audio file is first converted to WAV format, and noise from a noise library is added; The obtained WAV fil