CN-121999804-A - Vocal music feature decoupling identification method and system based on mutual information minimization

CN121999804A

Abstract

The invention discloses a nasal sound recognition method based on acoustic feature decoupling. A dual-stream decoupling deep neural network is constructed and a mutual-information-minimization adversarial training strategy is introduced, forcibly separating sound-source features (vocal cord vibration) from vocal-tract features (oral and nasal cavity adjustment) within a single audio signal. On this basis, the high frequency resolution of the constant-Q transform (CQT) is exploited to accurately extract the acoustic characteristics related to nasality, and physiological offset correction is performed using a personal acoustic fingerprint, finally achieving accurate quantification and recognition of different nasal sound states (normal, defective, and stylistic) and providing objective, intuitive, and artistically style-adaptive feedback for vocal music teaching.

Inventors

  • WANG JIALONG
  • ZHAO LIN
  • CHENG YANSHUO
  • LIU XINYU
  • CAI YIMING
  • BIAN XINRUI

Assignees

  • 华中科技大学 (Huazhong University of Science and Technology)

Dates

Publication Date
2026-05-08
Application Date
2026-03-04

Claims (9)

  1. A nasal sound identification method based on acoustic feature decoupling, comprising the following steps: (1) acquiring a real-time singing audio signal to be detected, and performing microphone frequency-response compensation on it to obtain a standard digital audio stream; (2) performing a constant-Q transform (CQT) on the standard digital audio stream obtained in step (1) to obtain a time-frequency tensor; (3) inputting the time-frequency tensor obtained in step (2) into a pre-trained dual-stream decoupling deep neural network to obtain a sound-source feature vector and a vocal-tract transmission feature vector; (4) obtaining an individual acoustic feature reference of the singer corresponding to the real-time singing audio signal obtained in step (1), removing the fundamental-frequency interference generated by vocal cord vibration from the individual acoustic feature reference using the sound-source feature vector obtained in step (3) to obtain a processed individual acoustic feature reference, and performing physiological offset correction on the vocal-tract transmission feature vector obtained in step (3) using the processed individual acoustic feature reference, so as to obtain a normalized dynamic nasality coefficient; (5) obtaining a final nasal sound recognition result from the normalized dynamic nasality coefficient obtained in step (4).
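Step (2)'s constant-Q transform can be sketched as a naive constant-Q filterbank in plain NumPy. The patent gives no CQT parameters, so `fmin`, `n_bins`, and `bins_per_octave` below are illustrative assumptions:

```python
import numpy as np

def naive_cqt_frame(x, sr, fmin=55.0, n_bins=48, bins_per_octave=12):
    """Constant-Q coefficients of one audio frame.

    Bin k has center frequency fmin * 2**(k / bins_per_octave) and a window
    whose length keeps the quality factor Q constant, so low bins get long
    windows (fine pitch resolution) and high bins short ones.
    """
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    coeffs = np.zeros(n_bins, dtype=complex)
    for k in range(n_bins):
        fk = fmin * 2.0 ** (k / bins_per_octave)
        n_k = min(int(np.ceil(Q * sr / fk)), len(x))  # window length for this bin
        n = np.arange(n_k)
        kernel = np.hanning(n_k) * np.exp(-2j * np.pi * fk * n / sr)
        coeffs[k] = np.dot(x[:n_k], kernel) / n_k
    return coeffs

# A 220 Hz sine should concentrate energy in the bin centred at 220 Hz
# (bin 24 here, since 55 * 2**(24/12) = 220).
sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 220.0 * t)
mag = np.abs(naive_cqt_frame(x, sr))
peak_bin = int(np.argmax(mag))
```

Stacking such frames over hops along the time axis yields the time-frequency tensor of step (2); production systems would use an optimized CQT implementation rather than this direct form.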
  2. The nasal sound identification method based on acoustic feature decoupling of claim 1, wherein the recognition result specifically indicates whether the singer corresponding to the real-time singing audio signal exhibits normal nasality, a hypernasality defect, hyponasal occlusion, or a stylized nasal-sound technique.
  3. The nasal sound identification method based on acoustic feature decoupling of claim 1 or 2, wherein: if the dynamic nasality coefficient is between 0.2 and 0.5, the nasality of the singer corresponding to the real-time singing audio signal is normal and the oral-nasal resonance ratio is in acoustic balance, conforming to standard vocal technique; if the coefficient is between 0.7 and 1.0, the singer has a hypernasality defect, indicating that the soft palate is sagging or incompletely closed, so that excessive airflow enters the nasal cavity and produces a pronounced "honking" quality; if the coefficient is between 0 and 0.2, the singer has hyponasal occlusion, indicating that the nasal passage is obstructed and the necessary nasal resonance frequencies are missing from the acoustic signal; if the coefficient is between 0.5 and 0.7, the singer is producing a stylized nasal-sound technique, indicating deliberate exercise of a specific singing skill.
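The coefficient ranges in claim 3 map directly to four recognition states. A minimal lookup, with band edges taken verbatim from the claim (the assignment of boundary values themselves is an assumption, since the claim leaves them ambiguous):

```python
def classify_nasality(c):
    """Map a normalized dynamic nasality coefficient to a recognition state."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("coefficient must be normalized to [0, 1]")
    if c < 0.2:
        return "hyponasal occlusion"     # 0-0.2: obstructed nasal passage
    if c < 0.5:
        return "normal nasality"         # 0.2-0.5: balanced oral-nasal resonance
    if c < 0.7:
        return "stylized technique"      # 0.5-0.7: deliberate nasal effect
    return "hypernasality defect"        # 0.7-1.0: soft palate not closing

print([classify_nasality(c) for c in (0.1, 0.35, 0.6, 0.85)])
```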
  4. The nasal sound identification method based on acoustic feature decoupling of any one of claims 1 to 3, wherein the dual-stream decoupling deep neural network adopts a 12-layer structure, wherein: layer 1 is a shared feature extraction layer whose input is the single-channel CQT time-frequency tensor, whose frequency axis carries the total number of frequency points (characterizing the pitch resolution of the tensor) and whose time axis carries the number of time frames (characterizing its duration); the layer performs a convolution-normalization-activation operation on the tensor to obtain a feature map, the parameters of the operation being 64 output channels, a 3x3 convolution kernel, stride 1, and padding 1; layer 2 is a feature distribution layer whose input is the feature map output by the shared feature extraction layer; the layer outputs the feature map simultaneously to the first sound-source coding branch layer and the first vocal-tract coding branch layer; layer 3 is the first sound-source coding branch layer whose input is the feature map output by the feature distribution layer; the layer performs a one-dimensional depthwise separable convolution on the feature map to obtain and output an intermediate feature map, the parameters of the convolution being 128 output channels, a kernel size of 5, and stride 1; layer 4 is the second sound-source coding branch layer whose input is the feature map output by the first sound-source coding branch layer; the layer performs instance normalization on the feature map and then feature reconstruction on the normalized feature map, so as to obtain and output the sound-source coding features.
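Layer 3's one-dimensional depthwise separable convolution factors a full convolution into a per-channel filter followed by a 1x1 channel mix. A NumPy sketch, assuming zero padding and the 64 input channels implied by layer 1 (the random weights are placeholders, not the patent's learned parameters):

```python
import numpy as np

def depthwise_separable_conv1d(x, dw_kernels, pw_weights):
    """x: (C_in, T); dw_kernels: (C_in, K); pw_weights: (C_out, C_in)."""
    c_in, t = x.shape
    k = dw_kernels.shape[1]
    xp = np.pad(x, ((0, 0), (k // 2, k // 2)))   # zero padding, stride 1
    # Depthwise step: each channel filtered independently with its own kernel.
    dw = np.empty((c_in, t))
    for c in range(c_in):
        dw[c] = np.convolve(xp[c], dw_kernels[c][::-1], mode="valid")
    # Pointwise step: 1x1 convolution mixing channels into C_out outputs.
    return pw_weights @ dw

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 100))               # 64 channels, 100 time frames
out = depthwise_separable_conv1d(
    x, rng.standard_normal((64, 5)), rng.standard_normal((128, 64)))
```

With 64 inputs, 128 outputs, and kernel size 5 this costs 64*5 + 128*64 = 8512 weights, versus 128*64*5 = 40960 for a full convolution, which is the usual motivation for the separable form.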
  5. The nasal sound identification method based on acoustic feature decoupling of claim 4, wherein: layer 5 is the first vocal-tract coding branch layer whose input is the feature map output by the feature distribution layer; the layer sequentially performs global average pooling and linear mapping on the feature map to obtain spectral weight coefficients targeting the first-formant (F1) band and the 2 kHz-4 kHz nasal-resonance-sensitive region, and performs weighted enhancement using these coefficients to obtain and output a spectrally enhanced vocal-tract feature map; layer 6 is the second vocal-tract coding branch layer whose input is the vocal-tract feature map output by the first vocal-tract coding branch layer; the layer performs multi-scale convolution on the feature map to obtain multiple multi-dimensional formant-envelope features reflecting the physical morphology of the oral and nasal cavities, and concatenates all of them to obtain and output a vocal-tract morphological feature map; layer 7 is a mutual-information adversarial discrimination layer whose input is the vocal-tract morphological feature map output by the second vocal-tract coding branch layer; the layer first performs three-level fully connected mapping on the feature map to obtain a probability distribution predicting the sound-source information, then introduces the true pitch label corresponding to the standard digital audio stream, computes the cross-entropy loss between the predicted probability distribution and the true pitch label, and outputs this cross-entropy value as the mutual-information loss; layer 8 is a gradient reversal layer whose input is the 128-dimensional output of the second vocal-tract coding branch layer; it performs an identity mapping on the vocal-tract morphological feature map during forward propagation, and during backpropagation reverses the gradients returned by the mutual-information adversarial discrimination layer to obtain reversed gradients and passes them back to the second coding branch layer.
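Layer 8's gradient reversal can be illustrated without a deep-learning framework: identity on the forward pass, gradient negated (and here optionally scaled by a factor λ, which is an assumption; the claim specifies only reversal) on the backward pass:

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; flips (and scales) gradients backward.

    Placed between the vocal-tract encoder and the pitch discriminator of
    layer 7, it turns gradient descent on the discriminator's cross-entropy
    loss into an adversarial signal that pushes the encoder to discard
    sound-source (pitch) information, minimizing their mutual information.
    """
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                          # identity mapping

    def backward(self, grad_output):
        return -self.lam * grad_output    # reversed gradient to the encoder

grl = GradientReversal(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
y = grl.forward(x)
g = grl.backward(np.ones_like(x))
```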
  6. The nasal sound identification method based on acoustic feature decoupling of claim 5, wherein: layer 9 is a global pooling layer whose inputs are the sound-source coding feature map output by the second sound-source coding branch layer and the vocal-tract morphological feature map output by the second vocal-tract coding branch layer; the layer performs global average pooling on each of the two feature maps, compressing their spatial dimensions to 1, so as to obtain and output a sound-source feature vector and a vocal-tract feature vector, each of dimension 128; layer 10 is a dual-stream independent mapping layer whose inputs are the sound-source and vocal-tract feature vectors output by the global pooling layer; the layer applies an independent fully connected linear projection to each, so as to obtain and output a sound-source semantic vector and a vocal-tract semantic vector, each of dimension 64; layer 11 is a dimension-reduction feature output layer whose inputs are the sound-source and vocal-tract semantic vectors output by the dual-stream independent mapping layer; the layer applies fully connected dimension reduction to each, so as to obtain a compact sound-source feature vector and a vocal-tract feature vector, each of dimension 32; layer 12 is an output interface layer whose inputs are the compact sound-source feature vector and the vocal-tract transmission feature vector output by the dimension-reduction feature output layer; the layer performs output mapping on the two vectors, outputting the vocal-tract transmission feature vector as the physical characterization of the oral- and nasal-cavity resonance state and the compact sound-source feature vector as the physical characterization of the vocal cord vibration state, thereby obtaining the final acoustic-feature decoupling identification result.
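Layers 9-11 reduce each stream from a spatial feature map to a compact vector through the stated widths 128, 64, and 32. A NumPy shape sketch with randomly initialized projection matrices standing in for the learned weights, and illustrative spatial dimensions for the feature maps (neither is given numerically in the patent):

```python
import numpy as np

def stream_head(feature_map, w_proj, w_reduce):
    """feature_map: (128, F, T) -> pooled (128,) -> (64,) -> (32,)."""
    v = feature_map.mean(axis=(1, 2))    # layer 9: global average pooling
    s = w_proj @ v                       # layer 10: independent linear projection
    return w_reduce @ s                  # layer 11: fully connected reduction

rng = np.random.default_rng(1)
# Layer 10 is "dual-stream independent", so each stream gets its own weights.
source_vec = stream_head(rng.standard_normal((128, 21, 50)),
                         rng.standard_normal((64, 128)),
                         rng.standard_normal((32, 64)))
tract_vec = stream_head(rng.standard_normal((128, 21, 50)),
                        rng.standard_normal((64, 128)),
                        rng.standard_normal((32, 64)))
```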
  7. The nasal sound identification method based on acoustic feature decoupling of claim 6, wherein the dual-stream decoupling deep neural network is trained by: (1-1) acquiring an original singing audio sample set and dividing it into a training set and a test set in an 8:2 ratio; (1-2) performing a constant-Q transform on the training set obtained in step (1-1) to obtain a transformed training set; (1-3) performing data augmentation on the transformed training set obtained in step (1-2) to obtain an augmented training set; (1-4) for each singing audio sample in the augmented training set obtained in step (1-3), inputting the sample into the shared feature extraction layer of the dual-stream decoupling deep neural network for convolution-normalization-activation processing, so as to obtain the shallow feature map corresponding to the sample; (1-5) for each singing audio sample in the augmented training set, inputting the shallow feature map obtained in step (1-4) into the feature distribution layer of the network for distribution processing, so as to obtain two identical feature maps of the same dimension; (1-6) for each singing audio sample in the augmented training set, inputting one of the feature maps obtained in step (1-5) into the first sound-source coding branch layer for one-dimensional depthwise separable convolution, so as to obtain the sound-source intermediate feature map corresponding to the sample; (1-7) for each singing audio sample in the augmented training set, inputting the sound-source intermediate feature map obtained in step (1-6) into the second sound-source coding branch layer for instance normalization and feature reconstruction, so as to obtain the sound-source coding feature map corresponding to the sample; (1-8) for each singing audio sample in the augmented training set, inputting the other feature map obtained in step (1-5) into the first vocal-tract coding branch layer for weight calculation and weighted enhancement, so as to obtain the spectrally enhanced vocal-tract feature map corresponding to the sample; (1-9) for each singing audio sample in the augmented training set, inputting the spectrally enhanced vocal-tract feature map obtained in step (1-8) into the second vocal-tract coding branch layer for multi-scale convolution and concatenation, so as to obtain the vocal-tract morphological feature map corresponding to the sample; (1-10) for each singing audio sample in the augmented training set, inputting the vocal-tract morphological feature map obtained in step (1-9) into the mutual-information adversarial discrimination layer for three-level fully connected mapping, so as to obtain the predicted probability distribution of the sound-source information corresponding to the sample; (1-11) for each singing audio sample in the augmented training set, computing the cross-entropy between the predicted probability distribution obtained in step (1-10) and the true pitch label of the sample as the mutual-information loss corresponding to the sample; (1-12) for each singing audio sample in the augmented training set, performing a backward update on the mutual-information loss obtained in step (1-11) using the gradient reversal layer and gradient descent, so as to obtain reversed gradients and update the weights of the sound-source coding branch with them; (1-13) for each singing audio sample in the augmented training set, inputting the sound-source coding feature map obtained in step (1-7) and the vocal-tract morphological feature map obtained in step (1-9) into the global pooling layer for global average pooling, so as to obtain the two fixed-length feature vectors corresponding to the sample; (1-14) for each singing audio sample in the augmented training set, inputting the two fixed-length feature vectors obtained in step (1-13) into the dual-stream independent mapping layer for independent fully connected linear projection, so as to obtain the two semantic vectors corresponding to the sample; (1-15) for each singing audio sample in the augmented training set, inputting the two semantic vectors obtained in step (1-14) into the dimension-reduction feature output layer for dimension reduction, so as to obtain the final compact sound-source feature vector and vocal-tract transmission feature vector corresponding to the sample; (1-16) repeating steps (1-4) to (1-15) until the mutual-information loss obtained in step (1-11) converges or a preset number of iterations is reached, so as to obtain a preliminarily trained dual-stream decoupling deep neural network; and (1-17) performing fine-tuning tests and physiological reference calibration on the preliminarily trained network obtained in step (1-16) using the test set obtained in step (1-1), until the accumulated error of the network's dynamic quantization coefficient falls below a preset threshold, so as to obtain the finally trained dual-stream decoupling deep neural network.
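The mutual-information surrogate of step (1-11) is an ordinary cross-entropy between the discriminator's predicted pitch distribution and the true pitch label. A NumPy sketch (the number of pitch classes and the logit values are illustrative):

```python
import numpy as np

def cross_entropy(logits, true_label):
    """-log softmax(logits)[true_label]: the layer-7 mutual-information loss."""
    z = logits - logits.max()                  # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())    # log-softmax
    return -log_probs[true_label]

logits = np.array([0.2, 2.5, -1.0, 0.3])       # discriminator output, 4 pitch classes
loss_correct = cross_entropy(logits, 1)        # confident, right class: small loss
loss_wrong = cross_entropy(logits, 2)          # wrong class: large loss
```

Via the gradient reversal of step (1-12), training *ascends* this loss with respect to the encoder, so a well-trained vocal-tract feature map keeps the discriminator near chance on pitch.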
  8. The nasal sound identification method based on acoustic feature decoupling of claim 7, wherein: the original singing audio sample set in step (1-1) comprises singing audio samples covering bel canto, pop, and folk singing styles, a vocal range spanning C2 to C6, and various nasal sound states; the data augmentation in step (1-3) comprises randomly applying one of, or any combination of, minute pitch shifting, time warping, and the addition of Gaussian white noise.
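Two of claim 8's augmentations can be sketched directly on a waveform: Gaussian white noise at a target SNR, and a uniform time warp via linear resampling. The noise level and warp factor are illustrative, and a proper minute pitch shift (resampling plus time-stretching) is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(42)

def add_white_gaussian_noise(x, snr_db=30.0):
    """Add white Gaussian noise at a given signal-to-noise ratio in dB."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)

def time_warp(x, factor=1.1):
    """Uniformly stretch (factor > 1) or compress the waveform in time."""
    n_out = int(round(len(x) * factor))
    src = np.linspace(0.0, len(x) - 1, n_out)
    return np.interp(src, np.arange(len(x)), x)

t = np.arange(22050) / 22050.0
clean = np.sin(2 * np.pi * 440.0 * t)
noisy = add_white_gaussian_noise(clean)
warped = time_warp(clean, factor=1.1)
```

Since the claim applies augmentation after the CQT of step (1-2), an equivalent alternative is to operate on the time-frequency tensor itself (e.g., shifting along the CQT's log-frequency axis for pitch), which waveform-level code like this merely approximates.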
  9. A nasal sound recognition system based on acoustic feature decoupling, comprising: a first module for acquiring a real-time singing audio signal to be detected and performing microphone frequency-response compensation on it to obtain a standard digital audio stream; a second module for performing a constant-Q transform (CQT) on the standard digital audio stream obtained by the first module to obtain a time-frequency tensor; a third module for inputting the time-frequency tensor obtained by the second module into a pre-trained dual-stream decoupling deep neural network to obtain a sound-source feature vector and a vocal-tract transmission feature vector; a fourth module for obtaining an individual acoustic feature reference of the singer corresponding to the real-time singing audio signal obtained by the first module, removing, using the sound-source feature vector obtained by the third module, the fundamental-frequency interference generated by vocal cord vibration from the individual acoustic feature reference to obtain a processed individual acoustic feature reference, and performing physiological offset correction on the vocal-tract transmission feature vector obtained by the third module using the processed reference, so as to obtain a normalized dynamic nasality coefficient; and a fifth module for obtaining a final nasal sound recognition result from the normalized dynamic nasality coefficient obtained by the fourth module.

Description

Vocal music feature decoupling identification method and system based on mutual information minimization

Technical Field

The invention belongs to the intersection of audio signal processing and artificial-intelligence-assisted education, and in particular relates to a vocal music feature decoupling identification method and system based on mutual information minimization.

Background

Vocal feature decoupling identification refers to the process of separating and identifying, from a singing audio signal, the individual acoustic features generated by different physiological structures (e.g., the vocal cords, oral cavity, and nasal cavity). Accurate decoupling is a key prerequisite for objectively evaluating vocal technique, particularly for timbre control such as nasality management. In practical acoustic signals, however, the sound-source characteristics produced by vocal cord vibration (e.g., pitch and loudness) are highly coupled with the vocal-tract characteristics produced by oral-nasal adjustment (e.g., formants), which poses a great challenge for fine-grained vocal teaching based on non-contact audio.

Current methods for recognizing and evaluating vocal features include: first, acoustic feature analysis based on conventional signal processing, which detects nasality by computing the energy ratio of specific frequency bands, or evaluates pitch and rhythm with fundamental-frequency extraction algorithms; second, end-to-end black-box recognition based on deep learning models, which trains a neural network on large amounts of labeled data to directly map the nonlinear relationship from audio to an evaluation result; and third, methods that directly measure nasal vibration or airflow using contact or invasive sensors such as nasal flowmeters and accelerometers, relying on dedicated hardware.
However, the existing methods above have several non-negligible drawbacks. First, acoustic feature analysis based on traditional signal processing is easily disturbed by changes in high-frequency harmonic energy when singers sing high notes, producing serious misjudgments in identifying the key formants that characterize nasality, and making it difficult to give effective guidance in the complex case of accurate pitch but incorrect vocal production. Second, end-to-end black-box recognition based on deep learning models suffers from severe feature coupling: the model cannot distinguish "obstructive nasality" caused by physiological conditions such as colds and rhinitis from "functional nasality" caused by improper vocal technique, so the evaluation lacks physical interpretability and accuracy. Third, traditional signal-processing analysis has an inherent limitation of insufficient frequency resolution in the low-frequency band, making it difficult to clearly capture the fine acoustic structure that distinguishes dense low-frequency harmonics from the anti-resonance points of the nasal cavity, which at the physical level obstructs accurate quantitative evaluation of nasality. Fourth, the methods relying on dedicated hardware usually present feedback as abstract waveforms or spectrograms, which ordinary users find hard to understand and to associate with their own physiological actions (such as lifting the soft palate); this non-intuitive feedback makes it difficult for users to establish correct muscle memory and greatly limits self-learning efficiency. Fifth, both the traditional signal-processing analysis and the end-to-end black-box recognition above generally rely on fixed acoustic thresholds for judgment, lack the ability to dynamically adjust evaluation criteria for different artistic styles such as bel canto and pop, and easily "correct" artistic expression as error.

Disclosure of Invention

In view of the above defects and improvement needs of the prior art, the invention provides a vocal music feature decoupling identification method and system based on mutual information minimization, which aims to solve the technical problems that existing acoustic feature analysis based on traditional signal processing is easily disturbed by changes in harmonic energy when singers sing high notes, causing serious misjudgment of the key formants of nasality and making effective guidance difficult in the complex case of accurate pitch but incorrect vocal production, and that existing end-to-end black-box recognition based on deep learning models suffers from severe feature coupling, the model often being unable to distinguish "obstructive nasal sound