CN-116312619-B - Voice activity detection model generation method and device, medium and electronic equipment

Abstract

The invention relates to a voice activity detection model generation method and device, a medium, and electronic equipment. The method comprises: obtaining a voice sample data set, wherein the voice sample data set comprises a voice sample frame and a sample label corresponding to the voice sample frame; inputting the voice sample frame into a voice activity detection model to obtain a voice probability result that is output by the voice activity detection model and corresponds to the voice sample frame; determining a loss value according to a preset loss function, the voice probability result corresponding to the voice sample frame in the voice sample data set, and the sample label, wherein the loss function comprises a first loss function, and the first loss function is the negative of a smooth approximation of the F1 score; and adjusting model parameters of the voice activity detection model according to the loss value until a training parameter used to evaluate the voice activity detection model meets a preset condition, thereby obtaining the voice activity detection model. The effect of the voice activity detection model in practical application is thereby improved.

Inventors

WEN SHIXUE
MA ZEJUN

Assignees

北京有竹居网络技术有限公司

Dates

Publication Date: 20260505
Application Date: 20230129

Claims (10)

1. A method for generating a voice activity detection model, comprising: acquiring a voice sample data set, wherein the voice sample data set comprises a voice sample frame and a sample label corresponding to the voice sample frame; inputting the voice sample frame into a voice activity detection model to obtain a voice probability result which is output by the voice activity detection model and corresponds to the voice sample frame; determining a loss value according to a preset loss function, the voice probability result corresponding to the voice sample frame in the voice sample data set, and the sample label, wherein the loss function comprises a first loss function, the first loss function is the negative of a smooth approximation of the F1 score, and the smooth approximation of the F1 score is continuously differentiable; and adjusting model parameters of the voice activity detection model according to the loss value until a training parameter used to evaluate the voice activity detection model meets a preset condition, so as to obtain the voice activity detection model.
2. The method of claim 1, wherein the loss function further comprises a second loss function, the second loss function being a cross entropy loss function, and the determining a loss value according to a preset loss function, the voice probability result corresponding to the voice sample frame in the voice sample data set, and the sample label comprises: determining a first loss value according to the first loss function, the voice probability result corresponding to the voice sample frame in the voice sample data set, and the sample label; determining a second loss value according to the second loss function, the voice probability result corresponding to the voice sample frame in the voice sample data set, and the sample label; and weighting the first loss value and the second loss value according to a weight relation to obtain the loss value, wherein the weight relation represents that the sum of a first weight corresponding to the first loss value and a second weight corresponding to the second loss value is a preset value, the first weight is in direct proportion to the number of times the model parameters of the voice activity detection model have been adjusted, and the second weight is in inverse proportion to the number of times the model parameters of the voice activity detection model have been adjusted.
3. The method of claim 2, wherein the second weight is characterized by a formula [formula not reproduced in this text] defined in terms of the second weight, the current number of times the model parameters of the voice activity detection model have been adjusted, the preset total number of times the model parameters of the voice activity detection model are to be adjusted, and e, the natural constant.
4. The method of claim 3, wherein the preset condition comprises the number of times the model parameters of the voice activity detection model have been adjusted reaching the total number of times.
5. The method of claim 1, wherein inputting the voice sample frame into a voice activity detection model to obtain a voice probability result which is output by the voice activity detection model and corresponds to the voice sample frame comprises: inputting the voice sample frame into a candidate voice activity detection model to obtain a voice probability result which is output by the candidate voice activity detection model and corresponds to the voice sample frame, wherein the candidate voice activity detection model is obtained by training with a cross entropy loss function as the objective function.
6. The method of claim 5, wherein the preset condition comprises the difference between the loss value and the previously determined loss value being less than a first preset threshold.
7. A voice activity detection model generation apparatus, comprising: an acquisition module, configured to acquire a voice sample data set, wherein the voice sample data set comprises a voice sample frame and a sample label corresponding to the voice sample frame; an input module, configured to input the voice sample frame into a voice activity detection model to obtain a voice probability result which is output by the voice activity detection model and corresponds to the voice sample frame; a determining module, configured to determine a loss value according to a preset loss function, the voice probability result corresponding to the voice sample frame in the voice sample data set, and the sample label, wherein the loss function comprises a first loss function, the first loss function is the negative of a smooth approximation of the F1 score, and the smooth approximation of the F1 score is continuously differentiable; and an adjusting module, configured to adjust the model parameters of the voice activity detection model according to the loss value until a training parameter used to evaluate the voice activity detection model meets a preset condition, so as to obtain the voice activity detection model.
8. The apparatus of claim 7, wherein the loss function further comprises a second loss function, the second loss function being a cross entropy loss function, and the determining module comprises: a first determining submodule, configured to determine a first loss value according to the first loss function, the voice probability result corresponding to the voice sample frame in the voice sample data set, and the sample label; a second determining submodule, configured to determine a second loss value according to the second loss function, the voice probability result corresponding to the voice sample frame in the voice sample data set, and the sample label; and a weighting submodule, configured to weight the first loss value and the second loss value according to a weight relation to obtain the loss value, wherein the weight relation represents that the sum of a first weight corresponding to the first loss value and a second weight corresponding to the second loss value is a preset value, the first weight is in direct proportion to the number of times the model parameters of the voice activity detection model have been adjusted, and the second weight is in inverse proportion to the number of times the model parameters of the voice activity detection model have been adjusted.
9. A computer readable medium on which a computer program is stored, characterized in that the program, when being executed by a processing device, carries out the steps of the method according to any one of claims 1-6.
10. An electronic device, comprising: A storage device having at least one computer program stored thereon; at least one processing means for executing said at least one computer program in said storage means to carry out the steps of the method according to any one of claims 1-6.
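
Illustrative note (not part of the claims): the weighting described in claims 2 and 8 can be sketched as below. The formula of claim 3 is not reproduced in this text, so the linear schedule in `loss_weights` is a hypothetical stand-in chosen only because it keeps the weight sum fixed, raises the first weight as training proceeds, and lowers the second weight. The function names, the default preset sum of 1.0, and the choice of mean binary cross entropy are likewise assumptions.

```python
import math


def cross_entropy_loss(probs, labels, eps=1e-8):
    """Mean binary cross entropy over frames (an assumed form of the second loss)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def loss_weights(step, total_steps, weight_sum=1.0):
    """HYPOTHETICAL linear schedule; this is NOT the formula of claim 3.

    step        -- number of parameter adjustments performed so far
    total_steps -- preset total number of parameter adjustments
    weight_sum  -- preset value that the two weights must sum to (assumed 1.0)
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    first_weight = weight_sum * progress        # grows with the adjustment count
    second_weight = weight_sum - first_weight   # shrinks with the adjustment count
    return first_weight, second_weight


def combined_loss(first_loss, second_loss, step, total_steps):
    """Weighted sum of the first (F1-based) and second (cross entropy) loss values."""
    first_weight, second_weight = loss_weights(step, total_steps)
    return first_weight * first_loss + second_weight * second_loss
```

Under this stand-in schedule, training starts on the cross entropy loss alone and ends on the F1-based loss alone; any schedule with the same monotonic behaviour and a fixed weight sum would fit the relation stated in claim 2.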

Description

Voice activity detection model generation method and device, medium and electronic equipment

Technical Field

The disclosure relates to the technical field of neural networks, and in particular to a voice activity detection model generation method and device, a medium, and electronic equipment.

Background

A voice activity detection (VAD) model can detect speech in a piece of audio. In the related art, the voice activity detection model is usually optimized with the cross entropy loss function as the objective function. However, in practical application the objective of the voice activity detection model is not strictly consistent with the cross entropy loss function used in optimization, so the effect of the voice activity detection model in practical application is not ideal.

Disclosure of Invention

This section is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This section is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

In a first aspect, the present disclosure provides a method for generating a voice activity detection model, including: obtaining a voice sample data set, where the voice sample data set includes a voice sample frame and a sample label corresponding to the voice sample frame; inputting the voice sample frame into a voice activity detection model to obtain a voice probability result which is output by the voice activity detection model and corresponds to the voice sample frame; determining a loss value according to a preset loss function, the voice probability result corresponding to the voice sample frame in the voice sample data set, and the sample label, wherein the loss function comprises a first loss function which is the negative of a smooth approximation of the F1 score; and adjusting model parameters of the voice activity detection model according to the loss value until a training parameter used to evaluate the voice activity detection model meets a preset condition, so as to obtain the voice activity detection model.
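
As an illustrative sketch only, a first loss function of this kind can be built by replacing the hard counts of true positives, false positives, and false negatives with sums over the frame-level voice probabilities. This is one common smooth approximation of the F1 score and is an assumption here; the text above does not state the exact approximation used. The function name `soft_f1_loss` and the stabilizing constant `eps` are hypothetical.

```python
def soft_f1_loss(probs, labels, eps=1e-8):
    """Negative of a smooth F1 approximation (assumed construction).

    probs  -- per-frame voice probabilities in [0, 1] output by the model
    labels -- per-frame sample labels, 1 for speech and 0 for non-speech
    """
    # Soft counts: probabilities stand in for hard 0/1 decisions, so each
    # term is differentiable with respect to the model output.
    tp = sum(p * y for p, y in zip(probs, labels))
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))
    fn = sum((1 - p) * y for p, y in zip(probs, labels))
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)  # eps avoids division by zero
    # Negate so that a lower loss corresponds to a higher (soft) F1 score.
    return -soft_f1


# Usage: confident, correct predictions give a lower loss than hesitant ones.
good = soft_f1_loss([0.9, 0.1], [1, 0])
weak = soft_f1_loss([0.6, 0.4], [1, 0])
```

In this sketch the loss ranges from about -1 (all frames classified correctly with full confidence) to 0 (no soft true positives), so minimizing it pushes the soft F1 score upward.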
In a second aspect, the present disclosure provides a voice activity detection model generating apparatus, including: an acquisition module, configured to acquire a voice sample data set, wherein the voice sample data set comprises a voice sample frame and a sample label corresponding to the voice sample frame; an input module, configured to input the voice sample frame into a voice activity detection model to obtain a voice probability result which is output by the voice activity detection model and corresponds to the voice sample frame; a determining module, configured to determine a loss value according to a preset loss function, the voice probability result corresponding to the voice sample frame in the voice sample data set, and the sample label, wherein the loss function comprises a first loss function which is the negative of a smooth approximation of the F1 score; and an adjusting module, configured to adjust the model parameters of the voice activity detection model according to the loss value until a training parameter used to evaluate the voice activity detection model meets a preset condition, so as to obtain the voice activity detection model.

In a third aspect, the present disclosure provides a computer readable medium having stored thereon a computer program which, when executed by a processing device, performs the steps of the method described in the first aspect.

In a fourth aspect, the present disclosure provides an electronic device comprising: a storage device having at least one computer program stored thereon; and at least one processing device for executing the at least one computer program in the storage device to carry out the steps of the method described in the first aspect.

According to the above technical solution, it is taken into account that the F1 score is used as the metric when evaluating the model in practical application, that the F1 score is not continuously differentiable, and that in model training a lower loss value generally indicates better model performance. The constructed first loss function is therefore the negative of a smooth approximation of the F1 score: the smooth approximation of the F1 score is continuously differentiable, and its negative satisfies the criterion that a lower loss value in model training indicates better model performance. Therefore, the negative of the smooth approximation of the F1 score is used as the first loss function for training the voice activity detection model, the loss function used in the generation process of the voice acti