
CN-122002204-A - Non-invasive binaural speech assessment method based on multi-source sound image fusion

CN122002204A

Abstract

The invention discloses a non-invasive binaural speech assessment method based on multi-source sound image fusion, belonging to the technical fields of audio signal processing and deep learning. A frontier pre-trained self-supervised learning (SSL) speech foundation model is used to extract high-dimensional weighted representations of the speech under test; these, together with the spectral magnitude and personalized audiogram data expanded along the frequency dimension, serve as inputs to the feature fusion module of a binaural speech assessment network so as to deeply mine frame-level latent features of the speech under test. A two-branch evaluation network is constructed: one branch combines the fused features with the SSL representation to predict the speech quality score, and the other combines the fused features with the audiogram data to output the quality grade. The invention not only effectively achieves synchronous evaluation of the hearing-aid speech quality score and grade, but also markedly improves the Pearson linear correlation and Spearman rank correlation between predicted values and ground-truth evaluation indices, providing a high-precision solution for individualized hearing assessment.

Inventors

  • TANG GUICHEN
  • YANG YANG
  • LIANG RUIYU
  • WANG QINGYUN
  • XIE YUE
  • ZHAO XIAOYAN
  • CHENG JIAMING
  • ZOU CAIRONG
  • XU YU

Assignees

  • Nanjing Institute of Technology (南京工程学院)

Dates

Publication Date
2026-05-08
Application Date
2026-02-05

Claims (10)

  1. A non-invasive binaural speech assessment method based on multi-source sound image fusion, comprising the steps of: Step 1, acquiring a speech data set for model training, validation and testing, wherein binaural clean speech is used to simulate acquisition through three stages of noise addition, enhancement/denoising and personalized compensation; Step 2, calculating the speech quality scores and speech quality grades of the left and right channels of the binaural speech based on the speech data set; Step 3, extracting the spectral magnitudes of the left- and right-ear speech and extracting the weighted speech representation output by a pre-trained self-supervised learning speech foundation model, wherein the speech under test is first framed and windowed, an STFT is then performed on each frame and the spectral magnitude is computed, and the weighted representation is obtained by weighting the outputs of each hidden layer of the self-supervised learning model; Step 4, constructing a binaural speech evaluation model, wherein the binaural speech evaluation model comprises a feature fusion network, a speech score prediction network and a speech quality grading network; Step 5, evaluating binaural speech based on the trained binaural speech evaluation model.
  2. The non-invasive binaural speech assessment method based on multi-source sound image fusion according to claim 1, wherein step 2 comprises: determining the audiograms of the left and right ears as 1×8-dimensional vectors, recorded as HL^L and HL^R, whose dimensions correspond to the frequencies [250 Hz, 500 Hz, 1 kHz, 2 kHz, 3 kHz, 4 kHz, 6 kHz, 8 kHz]; recording, for each speech processing pass, the 3-dimensional speech quality ground-truth values for the left ear, the right ear and the overall synthesis as s^L, s^R and s^B respectively; and recording the 2-dimensional speech grade label values for the left and right ears as c^L and c^R respectively.
  3. The non-invasive binaural speech assessment method according to claim 1, wherein in step 3 the spectral magnitudes of the left- and right-ear speech are recorded as Y^L and Y^R respectively, and the weighted speech representations as W^L and W^R respectively; taking the left ear as an example, framing and windowing are expressed by formula (1): x'_i(n) = x_i(n)·w(n), 0 ≤ n ≤ N−1; (1) wherein N denotes the frame length, x_i(n) and x'_i(n) respectively denote the i-th speech frame before and after windowing, and w(n) is the Hamming window function, specifically expressed by formula (2): w(n) = (1−a) − a·cos(2πn/(N−1)); (2) wherein a is a constant set to 0.46. Next, a 512-point STFT is performed on each frame of data, with the frame shift set to half the frame length, and the spectral magnitude of each speech frame is computed; the STFT of the windowed speech is specifically expressed by formula (3): X_i(f) = Σ_{n=0}^{N−1} x'_i(n)·e^{−j2πfn/N}; (3) wherein X_i(f) denotes the complex spectrum corresponding to x'_i(n), and f is the frequency corresponding to a single spectral bin of the speech; the spectral magnitude is given by formula (4): |X_i(f)| = sqrt(Re(X_i(f))² + Im(X_i(f))²); (4) wherein Re(·) denotes the real part and Im(·) denotes the imaginary part; the spectral magnitude feature of the whole left-ear speech is recorded as Y^L, and Y^R for the right ear likewise. The weighted speech representation is obtained from the self-supervised learning speech foundation model by weighting the output features of each hidden layer in the model; taking the left ear as an example, it is specifically expressed by formula (5): W^L = Σ_{l=1}^{L} α_l·H_l, with Σ_{l=1}^{L} α_l = 1; (5) wherein α_l denotes the weight corresponding to the speech representation output by the l-th layer, the weights of all layers summing to 1, and H_l denotes the speech representation output by the l-th layer; the right-ear representation W^R is obtained likewise.
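As an illustrative sketch only (not the patent's implementation), the framing, Hamming windowing, 512-point STFT and per-frame magnitude computation described in claim 3 can be written as follows; the function name and the half-frame hop default are assumptions based on the claim text:

```python
import numpy as np

def frame_spectral_magnitude(x, frame_len=512, n_fft=512):
    """Frame a signal, apply a Hamming window (a = 0.46, as in the claim),
    take a 512-point STFT per frame with the frame shift set to half the
    frame length, and return the per-frame spectral magnitudes."""
    hop = frame_len // 2
    n = np.arange(frame_len)
    # Hamming window: w(n) = (1 - a) - a*cos(2*pi*n/(N-1)), a = 0.46
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
    n_frames = 1 + max(0, len(x) - frame_len) // hop
    mags = np.empty((n_frames, n_fft // 2 + 1))
    for t in range(n_frames):
        frame = x[t * hop : t * hop + frame_len] * window
        spec = np.fft.rfft(frame, n=n_fft)                # complex spectrum X(f)
        mags[t] = np.sqrt(spec.real ** 2 + spec.imag ** 2)  # |X(f)|
    return mags
```

For a 2048-sample input this yields 7 frames of 257 magnitude bins (the one-sided spectrum of a 512-point FFT).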
  4. The non-invasive binaural speech assessment method based on multi-source sound image fusion according to claim 1, wherein in step 3 the audiogram is expanded along the frequency axis according to the mapping rule of formula (6), obtaining a set of 1×256 audiogram hearing-loss features: E^L(f) = HL_i^L, f ∈ the i-th audiogram frequency zone, i = 0, 1, …, 7; (6) wherein f is the frequency corresponding to a single spectral bin of the speech, E^L is the expanded left-ear audiogram data, and HL_i^L is the hearing-loss value of the i-th audibility zone of the left-ear audiogram; after combination with the speech spectral feature Y^L and the weighted speech representation W^L, the data dimension is [3, number of frames, number of features], wherein 3 is the number of channels: the first channel is the weighted speech representation, the second channel is the speech spectral feature, and the third channel is the expanded audiogram.
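A minimal sketch of the 1×8 → 1×256 audiogram expansion in claim 4. The exact zone boundaries are not reproduced in this text, so the nearest-audiogram-frequency rule, the 16 kHz sampling rate, and the function name below are assumptions:

```python
import numpy as np

AUDIOGRAM_FREQS = np.array([250, 500, 1000, 2000, 3000, 4000, 6000, 8000])  # Hz

def expand_audiogram(hl, n_bins=256, fs=16000):
    """Expand a 1x8 audiogram to a 1xn_bins feature by assigning each
    spectral bin the hearing-loss value HL_i of the nearest audiogram
    frequency (an assumed realization of the claim's zone mapping)."""
    hl = np.asarray(hl, dtype=float)
    bin_freqs = np.arange(n_bins) * (fs / 2) / n_bins  # center frequency of each bin
    idx = np.abs(bin_freqs[:, None] - AUDIOGRAM_FREQS[None, :]).argmin(axis=1)
    return hl[idx]
```

The expanded vector can then be stacked with the spectral magnitude and the SSL representation as the third input channel.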
  5. The non-invasive binaural speech assessment method based on multi-source sound image fusion according to claim 1, wherein in step 4 the activation layer of the feature fusion network employs a LeakyReLU activation function, whose mathematical expression is formula (7): f(x) = x for x ≥ 0, f(x) = αx for x < 0; (7) wherein α is the slope for the negative region; the pooling layer uses L2-norm pooling, as shown in formula (8): y = (Σ_i |x_i|^p)^{1/p}, p = 2; (8) wherein x_i denotes the input data of the pooling layer, y denotes the output of the pooling layer, and p determines the pooling norm type; the output of the left-ear feature fusion network is recorded as F^L, and F^R for the right ear likewise.
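The activation and pooling operators of claim 5 are standard; a small sketch (function names and the 1-D, non-overlapping pooling window are illustrative assumptions, not the patent's layer shapes):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """LeakyReLU: identity for x >= 0, alpha*x for x < 0
    (alpha is the fixed negative-region slope of formula (7))."""
    return np.where(x >= 0, x, alpha * x)

def lp_pool(x, p=2, size=2):
    """L_p-norm pooling over non-overlapping windows, formula (8);
    p = 2 gives the L2-norm pooling used by the fusion network."""
    x = np.asarray(x, dtype=float)
    x = x[: len(x) // size * size].reshape(-1, size)
    return (np.abs(x) ** p).sum(axis=1) ** (1.0 / p)
```

With p = 2, each output is the Euclidean norm of its pooling window, e.g. the window [3, 4] pools to 5.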
  6. The non-invasive binaural speech assessment method according to claim 1, wherein in step 4 the BiLSTM layers and the multi-head attention layers in the speech score prediction network are represented by expressions (9)-(12); the BiLSTM layer consists of a forward LSTM and a backward LSTM, learning the bidirectional dependencies in the time-series data and exploiting context information, as in formula (9): i_t = σ(W_i x_t + U_i h_{t−1} + b_i), f_t = σ(W_f x_t + U_f h_{t−1} + b_f), o_t = σ(W_o x_t + U_o h_{t−1} + b_o), c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c x_t + U_c h_{t−1} + b_c), h_t = o_t ⊙ tanh(c_t); (9) wherein t denotes the index of the current frame, σ denotes the sigmoid activation function, and i_t, f_t, o_t are the activation values of the input gate, the forget gate and the output gate respectively; c_t and c_{t−1} are the current cell state and the cell state of the previous time step respectively, and h_t and h_{t−1} are the current hidden state and the hidden state of the previous time step respectively; x_t denotes the current input frame; W_i, W_f, W_c, W_o and U_i, U_f, U_c, U_o are the weight matrices of the input gate, the forget gate, the cell state and the output gate respectively, and b_i, b_f, b_c, b_o are the bias vectors of each gate; ⊙ denotes element-wise multiplication; tanh is the hyperbolic tangent activation function, defined by formula (10): tanh(x) = (e^x − e^{−x})/(e^x + e^{−x}); (10) wherein e denotes the natural exponential function and x the function input; the mathematical expressions of the multi-head attention layer are formulas (11)-(12): head_i = Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k)·V_i; (11) MultiHead = Concat(head_1, …, head_h)·W^O; (12) wherein Q_i, K_i, V_i are respectively the query, key and value matrices of the i-th head, d_k denotes the dimension of the queries and keys, W^O denotes the projection matrix mapping the concatenated output of all heads to the final output, and T denotes the matrix transpose; the outputs of the speech score prediction networks of the left and right ears are recorded as S^L and S^R respectively.
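A compact sketch of the scaled dot-product and multi-head attention of formulas (11)-(12), assuming single-sequence inputs and equal head splits of the model dimension (the head-splitting convention and function names are assumptions; the BiLSTM of formula (9) is omitted for brevity):

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, formula (11)."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Project X to Q, K, V, split into heads, attend per head,
    concatenate, and project with W^O, formula (12)."""
    d = X.shape[-1]
    dh = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = [scaled_dot_attention(Q[:, h * dh:(h + 1) * dh],
                                  K[:, h * dh:(h + 1) * dh],
                                  V[:, h * dh:(h + 1) * dh])
             for h in range(n_heads)]
    return np.concatenate(heads, axis=-1) @ Wo
```

The output keeps the input shape [frames, model dimension], so the layer can be stacked after the BiLSTM without reshaping.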
  7. The non-invasive binaural speech assessment method according to claim 1, wherein in step 4 the activation layer and the Softmax classification layer in the speech quality grading network are represented by expressions (13)-(14); the activation layer employs a PReLU activation function, whose mathematical expression is formula (13): f(x) = x for x ≥ 0, f(x) = βx for x < 0; (13) wherein β, the slope of the negative region, is a learnable parameter updated during training; the Softmax layer outputs the probability distribution over the divided speech quality grades, and argmax extracts the index corresponding to the maximum probability value in the Softmax output as the grading label, with the mathematical expression of formula (14): c = argmax_k softmax(z)_k, softmax(z)_k = e^{z_k} / Σ_{j=1}^{K} e^{z_j}; (14) wherein z is the output of the fully connected layer normalized by the Softmax activation function, and K is the dimension of the classification labels; the output of the speech quality grading network of the left ear is recorded as C^L, and C^R for the right ear likewise.
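A small sketch of the grading head of formulas (13)-(14); the function names are illustrative, and in a trained network β would be a learned tensor rather than a scalar argument:

```python
import numpy as np

def prelu(x, beta):
    """PReLU, formula (13): identity for x >= 0, beta*x for x < 0,
    where beta is a learnable negative-region slope."""
    return np.where(x >= 0, x, beta * x)

def grade_from_logits(logits):
    """Softmax over the fully connected output, then argmax, formula (14).
    Returns the grading label index and the grade probability distribution."""
    z = logits - logits.max()          # stabilize the exponentials
    p = np.exp(z) / np.exp(z).sum()
    return int(np.argmax(p)), p
```

The probabilities sum to 1 and the returned index is the predicted quality grade for one ear.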
  8. The non-invasive binaural speech assessment method based on multi-source sound image fusion according to claim 1, characterized in that in step 4 the loss function is represented by expression (15): L = MAE_B + MAE_S^L + MAE_S^R + MAE_C^L + MAE_C^R; (15) wherein MAE_B is the mean absolute error between the overall speech quality score output by the decision fusion layer and the true overall speech quality score; MAE_S^L and MAE_S^R are respectively the mean absolute errors between the left- and right-ear speech quality prediction scores and the true scores; and MAE_C^L and MAE_C^R are respectively the mean absolute errors between the left- and right-ear speech quality grading prediction labels and the true labels.
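The five MAE terms of claim 8 can be sketched as below. The relative weights of the terms are not reproduced in this text, so the weight tuple `w` (defaulting to a plain sum) and all function names are assumptions:

```python
import numpy as np

def mae(pred, target):
    """Mean absolute error between a prediction and its ground truth."""
    return float(np.mean(np.abs(np.asarray(pred, float) - np.asarray(target, float))))

def binaural_loss(fused, fused_true, sL, sL_true, sR, sR_true,
                  cL, cL_true, cR, cR_true, w=(1.0, 1.0, 1.0)):
    """Combine the five MAE terms named in the claim: fused overall score,
    left/right ear scores, and left/right ear grading labels.
    The weights w are an assumption; expression (15) is reconstructed."""
    return (w[0] * mae(fused, fused_true)
            + w[1] * (mae(sL, sL_true) + mae(sR, sR_true))
            + w[2] * (mae(cL, cL_true) + mae(cR, cR_true)))
```

When every prediction matches its target the loss is exactly zero, and each term contributes independently otherwise.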
  9. A computer device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method of claim 1.
  10. A computer-readable storage medium having stored thereon a computer program/instructions which, when executed by a processor, perform the steps of the method of claim 1.

Description

Technical Field

The invention relates to the technical fields of audio signal processing and deep learning, and in particular to a non-invasive binaural speech assessment method based on multi-source sound image fusion.

Background

Hearing loss remains a worldwide public health problem. Hearing aids are currently the best option for improving the hearing of individuals with hearing impairment, and their compensation effect directly influences the wearing rate and satisfaction of hearing-impaired patients. The compensation effect of a hearing aid is commonly judged by the quality of its output speech. However, traditional subjective hearing-aid speech quality assessment relies on audiologists and large numbers of subjects scoring speech quality, which is cumbersome, time-consuming, labor-intensive and subject to inter-subject variability. Existing objective hearing-aid speech quality assessment also faces many challenges and leaves considerable room for optimization. Early objective speech quality evaluation methods include the PESQ and POLQA indices proposed by the International Telecommunication Union, as well as Quality-Net and MOS-Net, proposed by researchers in the field in combination with deep learning techniques. However, these indices apply to normal-hearing populations and are used to predict subjective mean opinion scores. Hearing-aid speech quality is affected by many factors, such as individualized hearing loss and the acoustic environment, which makes its objective evaluation comparatively more complicated and challenging. Hearing-aid speech quality assessment indices can be classified as invasive or non-invasive based on the presence or absence of reference speech.
Invasive speech quality evaluation indices such as HASQI and PEMO-Q-HI achieve high evaluation accuracy, but clean reference signals are generally difficult to obtain in practical environments. Non-invasive hearing-aid speech quality assessment is favored by researchers in the field because it requires no reference signal, and it is becoming an important research direction. In recent years, deep learning and speech foundation models have developed rapidly, and their powerful nonlinear modeling capability shows great potential for non-invasive hearing-aid speech evaluation. However, models for overall binaural assessment are currently lacking. In addition, how well frontier self-supervised learning speech foundation models perform in hearing-aid speech quality evaluation scenarios remains to be explored and is worth further research. It can thus be seen that considerable room remains for optimizing existing non-invasive hearing-aid speech quality assessment methods. Given the challenges at the present stage, establishing a non-invasive binaural speech assessment method based on multi-source sound image fusion has great research value and significance.

Disclosure of Invention

The invention aims to remove the dependence on subjective hearing-aid speech quality evaluation and to further optimize and overcome the shortcomings of existing non-invasive objective hearing-aid speech quality evaluation methods. By constructing an overall evaluation framework suited to binaural speech quality and deeply optimizing and adapting a frontier speech foundation model, high-precision evaluation of hearing-aid output speech quality is achieved.
While improving model prediction accuracy, the method effectively reduces computational resource consumption, achieving dual optimization of evaluation performance and efficiency. The technical scheme adopted by the invention is a non-invasive binaural speech assessment method based on multi-source sound image fusion, comprising the following steps: Step 1, acquiring a speech data set for model training, validation and testing, wherein binaural clean speech is used to simulate acquisition through three stages of noise addition, enhancement/denoising and personalized compensation; Step 2, calculating the speech quality scores and speech quality grades of the left and right channels of the binaural speech based on the speech data set; Step 3, extracting the spectral magnitudes of the left- and right-ear speech and extracting the weighted speech representation output by a pre-trained self-supervised learning speech foundation model, wherein the speech under test is first framed and windowed, an STFT is then performed on each frame and the spectral magnitude is computed, and the weighted representation is obtained