CN-116386612-B - Training method of voice detection model, voice detection method, device and equipment
Abstract
The disclosure provides a training method for a voice detection model, a voice detection method, a device and equipment, and relates to the technical field of computers, in particular to the technical fields of deep learning and voice recognition. The specific implementation scheme is as follows: a voice sample set is obtained, wherein the voice sample set comprises sample features of a voice sample and labeling information of the voice sample; the sample features are input into an encoder of the voice detection model to obtain intermediate features, wherein the dimension of the intermediate features is smaller than that of the sample features; the intermediate features are input into a decoder of the voice detection model to obtain a prediction confidence of the voice sample, wherein the prediction confidence represents the probability that the voice sample is non-silent; and the voice detection model is trained according to the labeling information and the prediction confidence to obtain a trained voice detection model. The trained voice detection model can be used for voice activity detection; the processing speed of the voice detection model is improved, the model has good real-time performance, and multiple scenes can be supported.
Inventors
- ZHANG HUI
- MA SIMENG
- LI XIAOHUI
- ZHOU YANG
- ZHAO QIAN
- CHEN ZEYU
Assignees
- Beijing Baidu Netcom Science and Technology Co., Ltd. (北京百度网讯科技有限公司)
Dates
- Publication Date: 20260512
- Application Date: 20230508
Claims (20)
- 1. A training method for a voice detection model, comprising: acquiring a voice sample set, wherein the voice sample set comprises sample features of a voice sample and labeling information of the voice sample; inputting the sample features into an encoder of the voice detection model to obtain intermediate features, wherein the dimension of the intermediate features is smaller than that of the sample features, and the encoder comprises a plurality of first convolution blocks, each first convolution block comprising a first separable convolution layer, a first residual connection layer and a first normalization layer, wherein the first separable convolution layer is configured to perform first separable convolution processing on input features of the first convolution block to obtain a first feature, the first residual connection layer is configured to process the first feature to obtain a second feature and to fuse the second feature with the first feature to obtain a third feature, and the first normalization layer is configured to normalize the third feature; inputting the intermediate features into a decoder of the voice detection model to obtain a prediction confidence of the voice sample, wherein the prediction confidence represents the probability that the voice sample is non-silent; and training the voice detection model according to the labeling information and the prediction confidence to obtain a trained voice detection model.
- 2. The method of claim 1, wherein the encoder comprises N first convolution blocks connected in sequence, N being a positive integer greater than or equal to 2, and the first convolution blocks other than the Nth first convolution block among the N first convolution blocks are used to reduce the dimension of the sample features.
- 3. The method of claim 2, wherein the first normalization layer is further configured to perform first convolution processing on the third feature, and the stride of the first convolution processing in the first convolution blocks other than the Nth first convolution block is a positive integer greater than 1.
- 4. The method of any of claims 1-3, wherein the encoder further comprises a second convolution block connected between the (N-1)th first convolution block and the Nth first convolution block; the second convolution block comprises a second separable convolution layer, a second residual connection layer and a second normalization layer; the second separable convolution layer is configured to perform second separable convolution processing on input features of the second convolution block to obtain a fourth feature; the second residual connection layer is configured to process the input features of the second convolution block to obtain a fifth feature, and to fuse the fourth feature with the fifth feature to obtain a sixth feature; and the second normalization layer is configured to normalize the sixth feature.
- 5. The method according to any one of claims 1-3, wherein inputting the intermediate features into the decoder of the voice detection model to obtain the prediction confidence of the voice sample comprises: inputting the intermediate features into a recurrent neural network of the decoder to obtain a plurality of probability values of the voice sample, wherein the voice sample comprises a plurality of voice frames and the plurality of probability values are the probabilities of the plurality of voice frames respectively; and averaging the plurality of probability values to obtain the prediction confidence.
- 6. The method according to any one of claims 1-3, further comprising: extracting spectral features of the voice sample; normalizing the spectral features to obtain first input features; and concatenating the spectral features and the first input features to obtain the sample features.
- 7. The method according to any one of claims 1-3, further comprising: acquiring a plurality of voice samples for constructing the voice sample set according to a preset sampling rate and a plurality of preset sampling durations.
- 8. The method according to any one of claims 1-3, further comprising: acquiring sample audio; and performing generalization processing on the sample audio by adding background sound, so as to obtain the voice sample set.
- 9. The method according to any one of claims 1-3, further comprising: compressing the trained voice detection model to reduce the storage space occupied by the trained voice detection model; and trimming the trained voice detection model to remove, from the model inference library, those operators that are not involved in the training process of the voice detection model.
- 10. A voice detection method, comprising: acquiring a current voice segment in a voice stream to be detected; determining voice features of the current voice segment; inputting the voice features into a trained voice detection model to obtain a confidence of the current voice segment, wherein the trained voice detection model is obtained by training according to the method of any one of claims 1-9; and determining a first label of the current voice segment according to the confidence of the current voice segment, wherein the first label is used for determining the position of a non-silent segment in the voice stream to be detected.
- 11. The method of claim 10, wherein determining the first label of the current voice segment according to the confidence of the current voice segment comprises: determining a second label of the current voice segment according to the confidence of the current voice segment, wherein the second label indicates whether the current voice segment is silent; and determining the first label of the current voice segment according to the second label of the current voice segment.
- 12. The method of claim 11, wherein determining the second label of the current voice segment according to the confidence of the current voice segment comprises: determining that the second label of the current voice segment is non-silent when the confidence of the current voice segment is greater than a first confidence threshold; and determining that the second label of the current voice segment is silent when the confidence of the current voice segment is less than or equal to the first confidence threshold.
- 13. The method of claim 11, wherein determining the second label of the current voice segment according to the confidence of the current voice segment comprises: determining that the second label of the current voice segment is non-silent when the confidence of the current voice segment is greater than a second confidence threshold and the confidence of the next voice segment after the current voice segment is greater than a third confidence threshold, wherein the second confidence threshold is smaller than the third confidence threshold; determining that the second label of the current voice segment is silent when the confidence of the current voice segment is less than or equal to the second confidence threshold; and determining that the second label of the current voice segment is silent when the confidence of the current voice segment is greater than the second confidence threshold and the confidence of the next voice segment is less than or equal to the third confidence threshold.
- 14. The method of any of claims 11-13, wherein determining the first label of the current voice segment according to the second label of the current voice segment comprises: determining that the first label of the current voice segment is a start segment of a non-silent segment when the second label of the previous voice segment of the current voice segment is silent, the second label of the current voice segment is non-silent, and the second labels of X consecutive voice segments after the current voice segment are all non-silent, wherein X is a preset value and X is a positive integer greater than or equal to 1.
- 15. The method of claim 14, wherein determining the first label of the current voice segment according to the second label of the current voice segment further comprises: determining that the current voice segment is an end segment of the non-silent segment when a start segment of the non-silent segment has been detected before the current voice segment, the second label of the current voice segment is non-silent, and the second labels of Y consecutive voice segments after the current voice segment are all silent, wherein Y is a preset value and Y is a positive integer greater than or equal to 1.
- 16. The method of claim 14, wherein determining the first label of the current voice segment according to the second label of the current voice segment further comprises: determining that the first label of the current voice segment is a middle segment of the non-silent segment when a start segment of the non-silent segment has been detected before the current voice segment and an end segment of the non-silent segment has not been detected; and the method further comprises: reducing the confidence threshold used for determining the second label of the current voice segment when the first label of the current voice segment is a middle segment of the non-silent segment.
- 17. The method of claim 15, further comprising: shifting the start time of the start segment forward by a first preset duration, and taking the shifted start time of the start segment as the start time of the non-silent segment; and shifting the end time of the end segment forward by a second preset duration, and taking the shifted end time of the end segment as the end time of the non-silent segment.
- 18. The method according to any of claims 10-13, wherein acquiring the current voice segment in the voice stream to be detected comprises: downsampling the voice stream to be detected when its sampling rate is greater than a first preset sampling rate, so that the sampling rate of the voice stream to be detected becomes the first preset sampling rate; and acquiring the current voice segment in the voice stream to be detected according to the first preset sampling rate; and wherein inputting the voice features into the trained voice detection model to obtain the confidence of the current voice segment comprises: inputting the voice features into a first voice detection model to obtain the confidence of the current voice segment when the sampling rate of the voice stream to be detected is the first preset sampling rate, wherein the first voice detection model is a trained voice detection model whose applicable sampling rate is the first preset sampling rate.
- 19. The method of claim 18, wherein inputting the voice features into the trained voice detection model to obtain the confidence of the current voice segment further comprises: inputting the voice features into a second voice detection model to obtain the confidence of the current voice segment when the sampling rate of the voice stream to be detected is a second preset sampling rate lower than the first preset sampling rate, wherein the second voice detection model is a trained voice detection model whose applicable sampling rate is the second preset sampling rate.
- 20. A training device for a voice detection model, comprising: an acquisition unit, configured to acquire a voice sample set, wherein the voice sample set comprises sample features of a voice sample and labeling information of the voice sample; an encoding unit, configured to input the sample features into an encoder of the voice detection model to obtain intermediate features, wherein the dimension of the intermediate features is smaller than that of the sample features, and the encoder comprises a plurality of first convolution blocks, each first convolution block comprising a first separable convolution layer, a first residual connection layer and a first normalization layer, wherein the first separable convolution layer is configured to perform first separable convolution processing on input features of the first convolution block to obtain a first feature, the first residual connection layer is configured to process the first feature to obtain a second feature and to fuse the second feature with the first feature to obtain a third feature, and the first normalization layer is configured to normalize the third feature; a decoding unit, configured to input the intermediate features into a decoder of the voice detection model to obtain a prediction confidence of the voice sample, wherein the prediction confidence represents the probability that the voice sample is non-silent; and a training unit, configured to train the voice detection model according to the labeling information and the prediction confidence to obtain a trained voice detection model.
Description
Training method of voice detection model, voice detection method, device and equipment

Technical Field
The disclosure relates to the technical field of computers, in particular to the technical fields of deep learning and voice recognition, and specifically relates to a training method for a voice detection model, a voice detection method, a device and equipment.

Background
Voice activity detection (VAD), also referred to as voice silence detection, is an important front-end processing module in speech recognition; its main function is to detect whether a current voice segment is silent or non-silent. When applied to offline voice processing, VAD can be used to segment the audio of a video and, together with speech recognition, to generate voice subtitles with time stamps. When applied to streaming voice processing, VAD allows the speech recognition service to be requested for transcription only when voice is detected, which improves the utilization of the speech recognition service and saves cost, but also places requirements on the processing speed of VAD. Therefore, how to increase the processing speed of voice activity detection is attracting more and more attention.

Disclosure of Invention
The disclosure provides a training method for a voice detection model, a voice detection method, a device and equipment.

According to one aspect of the disclosed embodiments, a training method for a voice detection model is provided, which includes: obtaining a voice sample set, wherein the voice sample set includes sample features of a voice sample and labeling information of the voice sample; inputting the sample features into an encoder of the voice detection model to obtain intermediate features, wherein the dimension of the intermediate features is smaller than that of the sample features; inputting the intermediate features into a decoder of the voice detection model to obtain a prediction confidence of the voice sample, wherein the prediction confidence represents the probability that the voice sample is non-silent; and training the voice detection model according to the labeling information and the prediction confidence to obtain a trained voice detection model.

According to another aspect of the disclosed embodiments, a voice detection method is provided, which includes: obtaining a current voice segment in a voice stream to be detected; determining voice features of the current voice segment; inputting the voice features into a trained voice detection model to obtain a confidence of the current voice segment, wherein the trained voice detection model is obtained by training according to the method of any one of the above embodiments; and determining a first label of the current voice segment according to the confidence of the current voice segment, wherein the first label is used for determining the position of a non-silent segment in the voice stream to be detected.
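The sample features mentioned above are obtained, according to claim 6, by extracting spectral features of a voice sample, normalizing them, and concatenating the normalized features with the original spectral features. The following is a minimal NumPy sketch of that step; the concrete feature type, dimensions and normalization statistics are illustrative assumptions and are not specified by the patent text.

```python
# Minimal sketch of sample-feature preparation (claim 6). The shape of the
# spectral features and the per-dimension mean/variance normalization are
# assumptions for illustration only.
import numpy as np

def build_sample_features(spectral: np.ndarray) -> np.ndarray:
    """spectral: (num_frames, num_bins) spectral features of one voice sample."""
    # Normalize each spectral dimension to zero mean / unit variance
    # to obtain the "first input features".
    mean = spectral.mean(axis=0, keepdims=True)
    std = spectral.std(axis=0, keepdims=True) + 1e-5
    first_input = (spectral - mean) / std
    # Concatenate the raw spectral features with the normalized features
    # along the feature axis to obtain the sample features.
    return np.concatenate([spectral, first_input], axis=-1)

# Example: 100 frames of 64-dimensional spectral features -> (100, 128) sample features
features = build_sample_features(np.random.randn(100, 64).astype(np.float32))
```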
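The encoder/decoder structure recited in claims 1-5 (separable convolution blocks with residual connections and normalization, dimension reduction in all but the last block, a recurrent decoder, and averaging of per-frame probabilities into a prediction confidence) can be sketched as follows. This is a sketch under stated assumptions, not the patent's reference implementation: PyTorch, kernel sizes, channel counts, BatchNorm as the normalization, a pointwise convolution as the residual-branch processing, and a GRU as the recurrent network are all choices made here for illustration.

```python
# Minimal sketch of the voice detection model described in claims 1-5,
# assuming PyTorch. Hyperparameters and layer choices are illustrative.
import torch
import torch.nn as nn

class FirstConvBlock(nn.Module):
    def __init__(self, channels: int, stride: int = 1):
        super().__init__()
        # Depthwise-separable convolution ("first separable convolution layer").
        self.separable = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1, groups=channels),
            nn.Conv1d(channels, channels, kernel_size=1),
        )
        # "First residual connection layer": process the first feature and fuse
        # it back with the first feature (here, via a pointwise convolution).
        self.residual = nn.Conv1d(channels, channels, kernel_size=1)
        # "First normalization layer"; a strided convolution reduces the time
        # dimension in all blocks except the last one (claims 2-3).
        self.norm = nn.BatchNorm1d(channels)
        self.reduce = (nn.Conv1d(channels, channels, kernel_size=3, stride=stride, padding=1)
                       if stride > 1 else nn.Identity())

    def forward(self, x):                     # x: (batch, channels, frames)
        first = self.separable(x)
        third = first + self.residual(first)  # fuse the second feature with the first
        return self.reduce(self.norm(third))

class VoiceDetectionModel(nn.Module):
    def __init__(self, feat_dim: int = 128, channels: int = 64, num_blocks: int = 4):
        super().__init__()
        self.proj = nn.Conv1d(feat_dim, channels, kernel_size=1)
        # All blocks except the Nth one use stride 2 to reduce the dimension.
        strides = [2] * (num_blocks - 1) + [1]
        self.encoder = nn.Sequential(*[FirstConvBlock(channels, s) for s in strides])
        # Decoder: recurrent network over the intermediate features, then a
        # per-frame non-silence probability (claim 5).
        self.rnn = nn.GRU(channels, channels, batch_first=True)
        self.head = nn.Linear(channels, 1)

    def forward(self, sample_features):       # (batch, frames, feat_dim)
        x = self.proj(sample_features.transpose(1, 2))
        intermediate = self.encoder(x)                    # reduced time dimension
        hidden, _ = self.rnn(intermediate.transpose(1, 2))
        frame_probs = torch.sigmoid(self.head(hidden)).squeeze(-1)
        return frame_probs.mean(dim=1)         # prediction confidence per sample

# Training step sketch: binary cross-entropy against the non-silence label.
model = VoiceDetectionModel()
confidence = model(torch.randn(8, 100, 128))
loss = nn.functional.binary_cross_entropy(confidence, torch.ones(8))
loss.backward()
```

Here the dimension reduction is applied along the time axis, which is one plausible reading of "reducing the dimension of the sample features"; the patent text does not fix which axis is reduced.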
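The streaming post-processing of the detection method aspect (claims 11-17) turns per-segment confidences into second labels via thresholds and then into first labels marking the start and end segments of non-silent regions. The sketch below assumes concrete values for the thresholds, X, Y, the segment duration and the preset boundary shifts, and omits the adaptive threshold reduction of claim 16; none of these values come from the patent.

```python
# Minimal sketch of the label post-processing described in claims 11-17.
# All numeric constants are illustrative assumptions.
from typing import List, Tuple

SEGMENT_SEC = 0.1          # assumed duration of one voice segment
SECOND_THRESHOLD = 0.4     # lower threshold (claim 13)
THIRD_THRESHOLD = 0.6      # higher threshold for the next segment (claim 13)
X = 2                      # consecutive non-silent segments confirming a start (claim 14)
Y = 3                      # consecutive silent segments confirming an end (claim 15)
START_SHIFT_SEC = 0.2      # first preset duration (claim 17)
END_SHIFT_SEC = 0.1        # second preset duration (claim 17)

def second_labels(confidences: List[float]) -> List[bool]:
    """True = non-silent, decided from the current and the next segment (claim 13)."""
    labels = []
    for i, c in enumerate(confidences):
        nxt = confidences[i + 1] if i + 1 < len(confidences) else 0.0
        labels.append(c > SECOND_THRESHOLD and nxt > THIRD_THRESHOLD)
    return labels

def non_silent_regions(confidences: List[float]) -> List[Tuple[float, float]]:
    """Return (start_time, end_time) of detected non-silent regions."""
    labels = second_labels(confidences)
    regions, start = [], None
    for i, lab in enumerate(labels):
        prev_silent = i == 0 or not labels[i - 1]
        if start is None and lab and prev_silent and all(labels[i + 1:i + 1 + X]):
            # Start segment: shift the start time forward (earlier) by the preset duration.
            start = max(0.0, i * SEGMENT_SEC - START_SHIFT_SEC)
        elif start is not None and lab and not any(labels[i + 1:i + 1 + Y]):
            # End segment: shift the end time forward by the preset duration.
            regions.append((start, max(start, (i + 1) * SEGMENT_SEC - END_SHIFT_SEC)))
            start = None
    return regions

# Example: one burst of speech surrounded by silence -> one (start, end) region.
print(non_silent_regions([0.1, 0.7, 0.9, 0.8, 0.9, 0.2, 0.1, 0.1, 0.1, 0.05]))
```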
According to another aspect of the embodiments of the disclosure, a training device for a voice detection model is provided, which comprises an acquisition unit, an encoding unit, a decoding unit and a training unit, wherein the acquisition unit is used for acquiring a voice sample set, the voice sample set comprising sample features of a voice sample and labeling information of the voice sample; the encoding unit is used for inputting the sample features into an encoder of the voice detection model to obtain intermediate features, the dimension of the intermediate features being smaller than that of the sample features; the decoding unit is used for inputting the intermediate features into a decoder of the voice detection model to obtain a prediction confidence of the voice sample, the prediction confidence representing the probability that the voice sample is non-silent; and the training unit is used for training the voice detection model according to the labeling information and the prediction confidence to obtain a trained voice detection model.

According to another aspect of the disclosed embodiments, a voice detection device is provided, which includes a segmentation unit for obtaining a current voice segment in a voice stream to be detected, a determination unit for determining voice features of the current voice segment, a prediction unit for inputting the voice features into a trained voice detection model to obtain a confidence of the current voice segment, the trained voice detection model being obtained by training according to the method of any one of the above embodiments, and a processing unit for determining a first label of the current voice segment according to the confidence of the current voice segment, the first label being used for determining the position of a non-silent segment in the voice stream to be detected.

According to another aspect of the present disclosure, there is provided an electronic device comprising at least one processor, and a memory communicatively coupled to the at least