CN-122024711-A - Anti-interference voice awakening method, system and terminal equipment
Abstract
The application belongs to the technical field of voice signal processing and provides an anti-interference voice wake-up method. The method first acquires a voice signal and then performs noise suppression through a signal processing sub-network to output enhanced features. Based on the enhanced features, a wake-up acoustic model and a wake-up decoder respectively output a first discrimination result and a scene recognition result, while the wake-up acoustic model and an end-to-end wake-up discrimination sub-network output a second discrimination result. Finally, the first and second discrimination results are adaptively fused according to the scene recognition result to output a final wake-up decision. Through integrated modeling and scene-adaptive decision-making, the application improves wake-up robustness in high-noise scenes.
Inventors
- YUAN YUSHUAI
- FAN ZHIYONG
Assignees
- Nanjing University of Information Science and Technology (南京信息工程大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-10
Claims (10)
- 1. An anti-interference voice wake-up method, characterized by comprising the following steps: extracting features from a voice signal in a target application scene to obtain voice features; performing noise suppression and feature enhancement on the voice features through a signal processing sub-network in an integrated voice processing model to obtain enhanced voice features and a scene recognition result; outputting a first wake-up discrimination result through a wake-up acoustic model and a wake-up decoder in the integrated voice processing model based on the enhanced voice features and the voice features; outputting a second wake-up discrimination result through the wake-up acoustic model and an end-to-end wake-up discrimination sub-network in the integrated voice processing model based on the enhanced voice features and the voice features; and adaptively fusing the first wake-up discrimination result and the second wake-up discrimination result according to the scene recognition result, and outputting a final wake-up decision result.
- 2. The method of claim 1, wherein performing noise suppression and feature enhancement on the voice features through the signal processing sub-network in the integrated voice processing model to obtain the enhanced voice features comprises: inputting the voice features into the signal processing sub-network and performing noise suppression on the voice features based on time-frequency domain characteristics; and outputting the noise-suppressed enhanced voice features.
- 3. The method of claim 2, wherein outputting the first wake-up discrimination result through the wake-up acoustic model and the wake-up decoder in the integrated voice processing model based on the enhanced voice features and the voice features comprises: inputting the enhanced voice features and the voice features into the wake-up acoustic model for temporal modeling to obtain a syllable-level posterior probability representation; and decoding a preset keyword through the wake-up decoder based on the syllable-level posterior probability representation to obtain the first wake-up discrimination result.
- 4. The method of claim 3, wherein outputting the second wake-up discrimination result through the wake-up acoustic model and the end-to-end wake-up discrimination sub-network in the integrated voice processing model based on the enhanced voice features and the voice features comprises: inputting the voice features and the enhanced voice features into the wake-up acoustic model for temporal modeling and outputting an audio embedding; and inputting the audio embedding into the end-to-end wake-up discrimination sub-network, which performs holistic modeling without relying on explicit wake-up word alignment information and outputs the second wake-up discrimination result.
- 5. The method of claim 4, wherein adaptively fusing the first wake-up discrimination result and the second wake-up discrimination result according to the scene recognition result and outputting the final wake-up decision result comprises: determining fusion weights for the first wake-up discrimination result and the second wake-up discrimination result based on the scene recognition result; performing weighted fusion of the first wake-up discrimination result and the second wake-up discrimination result according to the fusion weights to obtain a comprehensive wake-up score; and comparing the comprehensive wake-up score with a wake-up threshold and outputting the final wake-up decision result according to the comparison.
- 6. The method of claim 5, wherein determining the fusion weights for the first wake-up discrimination result and the second wake-up discrimination result based on the scene recognition result comprises: when the scene recognition result indicates a high-noise scene, setting the fusion weight assigned to the second wake-up discrimination result higher than the fusion weight assigned to the first wake-up discrimination result; and when the scene recognition result indicates a low-noise scene, setting the fusion weight assigned to the first wake-up discrimination result higher than the fusion weight assigned to the second wake-up discrimination result.
- 7. The method of claim 1, wherein the integrated voice processing model is obtained by end-to-end joint training, the training comprising: constructing a training data set comprising noise-free voice samples, low-noise voice samples and high-noise voice samples; and training the integrated voice processing model on the training data set using a multi-loss-function joint optimization strategy, wherein the multiple loss functions comprise at least a signal processing loss, a wake-up discrimination loss and an end-to-end wake-up discrimination loss.
- 8. The method of claim 7, wherein training the integrated voice processing model using the multi-loss-function joint optimization strategy based on the training data set comprises: calculating the signal processing loss between the enhanced voice features output by the signal processing sub-network and the corresponding noise-free voice; calculating a classification discrimination loss between the syllable confidence output by the wake-up acoustic model and the ground-truth label; calculating the end-to-end wake-up discrimination loss between the second wake-up discrimination result output by the end-to-end wake-up discrimination sub-network and the ground-truth label using a focal loss; performing a weighted summation of the signal processing loss, the classification discrimination loss and the end-to-end wake-up discrimination loss to obtain a total loss; and updating parameters of the integrated voice processing model based on the total loss. Calculating the end-to-end wake-up discrimination loss using the focal loss specifically comprises: for high-noise voice samples, increasing the weighting coefficient of the focal loss function, and calculating the end-to-end wake-up discrimination loss between the second wake-up discrimination result and the ground-truth label using the focal loss function with the increased weighting coefficient.
- 9. An anti-interference voice wake-up system, comprising: a feature extraction module for extracting features from a voice signal in a target application scene to obtain voice features; a feature enhancement module for performing noise suppression and feature enhancement on the voice features through a signal processing sub-network in an integrated voice processing model to obtain enhanced voice features and a scene recognition result; a first wake-up discrimination module for outputting a first wake-up discrimination result through a wake-up acoustic model and a wake-up decoder in the integrated voice processing model based on the enhanced voice features and the voice features; a second wake-up discrimination module for outputting a second wake-up discrimination result through the wake-up acoustic model and an end-to-end wake-up discrimination sub-network in the integrated voice processing model based on the enhanced voice features and the voice features; and an adaptive fusion decision module for adaptively fusing the first wake-up discrimination result and the second wake-up discrimination result according to the scene recognition result and outputting a final wake-up decision result.
- 10. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 8 when executing the computer program.
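As an illustration of the scene-adaptive fusion described in claims 5 and 6, the sketch below weights the decoder-based and end-to-end scores according to the recognized scene and compares the fused score with a wake-up threshold. The patent does not specify numeric weights or a threshold, so the values here are hypothetical placeholders.

```python
def fuse_wake_scores(score_decoder, score_e2e, scene, threshold=0.5):
    """Scene-adaptive weighted fusion of two wake-up scores.

    In a high-noise scene the end-to-end branch is weighted higher;
    in a low-noise scene the decoder branch dominates (claim 6).
    The weights and threshold below are illustrative, not from the patent.
    """
    if scene == "high_noise":
        w_decoder, w_e2e = 0.3, 0.7
    else:  # low-noise (or quiet) scene
        w_decoder, w_e2e = 0.7, 0.3
    combined = w_decoder * score_decoder + w_e2e * score_e2e
    # Final wake-up decision: compare fused score against the threshold
    return combined, combined >= threshold
```

For example, a strong decoder score in a quiet room (`fuse_wake_scores(0.9, 0.2, "low_noise")`) fuses to 0.69 and triggers a wake-up, while the same scores in a high-noise scene fuse to 0.41 and do not.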
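Claim 8 trains the end-to-end branch with a focal loss whose weighting coefficient is increased for high-noise samples. A minimal sketch of the standard binary focal loss, with the per-sample weighting coefficient exposed as `alpha` (the specific `gamma` and `alpha` values are assumptions, not taken from the patent):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=1.0):
    """Binary focal loss: FL = -alpha * (1 - p_t)^gamma * log(p_t).

    p: predicted wake-up probability in (0, 1); y: ground-truth label (0 or 1).
    alpha is the weighting coefficient, which claim 8 increases for
    high-noise voice samples. gamma=2.0 is a common default, not specified
    in the patent.
    """
    p_t = p if y == 1 else 1.0 - p
    # Clamp to avoid log(0) for numerically extreme predictions
    return -alpha * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))
```

Doubling `alpha` for a high-noise sample doubles its loss contribution, while the `(1 - p_t)^gamma` factor keeps easy, well-classified samples from dominating the gradient.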
Description
Anti-interference voice awakening method, system and terminal equipment

Technical Field

The application belongs to the technical field of voice signal processing and particularly relates to an anti-interference voice wake-up method, an anti-interference voice wake-up system and a terminal device.

Background

Voice wake-up technology serves as a human-machine interaction entry point and is widely used in scenarios such as in-vehicle systems, smart homes and wearable devices. Existing voice wake-up systems generally adopt a two-stage architecture: noisy speech is first enhanced by a signal processing module, for example through spectral subtraction, Wiener filtering or neural-network-based speech denoising, and the enhanced speech is then fed into a wake-up module, usually based on a hidden Markov model or a deep neural network, for keyword recognition. However, in current methods the signal processing module and the voice wake-up module are deployed independently and trained toward inconsistent objectives; when the noise type or intensity changes, the speech enhancement output is difficult to match to the features the wake-up model expects, which reduces the wake-up rate or increases the false wake-up rate.

Disclosure of Invention

The embodiments of the application provide an anti-interference voice wake-up method, system and terminal device, which address the problems that in current methods the signal processing module and the voice wake-up module are deployed independently with inconsistent training objectives, so that when the noise type or intensity changes the speech enhancement results are difficult to match with the wake-up model's features, reducing the wake-up rate or increasing the false wake-up rate.
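The background cites spectral subtraction as a classic decoupled front-end. A minimal sketch of frame-wise magnitude spectral subtraction, which subtracts a noise magnitude estimate and floors the result to avoid negative magnitudes (the floor fraction is an illustrative assumption):

```python
def spectral_subtraction(noisy_mag, noise_mag, floor=0.02):
    """Classic spectral subtraction on magnitude spectra.

    noisy_mag: per-bin magnitudes of a noisy frame.
    noise_mag: per-bin noise magnitude estimate (e.g. from silent frames).
    Flooring at a small fraction of the noisy magnitude prevents negative
    values, the source of the 'musical noise' artifact this method is
    known for.
    """
    return [max(n - d, floor * n) for n, d in zip(noisy_mag, noise_mag)]
```

This kind of stand-alone enhancement is exactly the decoupled design the application argues against: its output is optimized for signal fidelity, not for matching the downstream wake-up model's features.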
According to a first aspect, an embodiment of the application provides an anti-interference voice wake-up method, comprising: extracting features from a voice signal in a target application scene to obtain voice features; performing noise suppression and feature enhancement on the voice features through a signal processing sub-network in an integrated voice processing model to obtain enhanced voice features and a scene recognition result; outputting a first wake-up discrimination result through a wake-up acoustic model and a wake-up decoder in the integrated voice processing model based on the enhanced voice features and the voice features; outputting a second wake-up discrimination result through the wake-up acoustic model and an end-to-end wake-up discrimination sub-network in the integrated voice processing model; and adaptively fusing the first wake-up discrimination result and the second wake-up discrimination result according to the scene recognition result to output a final wake-up decision result. In a possible implementation of the first aspect, performing noise suppression and feature enhancement on the voice features through the signal processing sub-network in the integrated voice processing model to obtain the enhanced voice features comprises: inputting the voice features into the signal processing sub-network and performing noise suppression on the voice features based on time-frequency domain characteristics; and outputting the noise-suppressed enhanced voice features.
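The patent does not detail how the signal processing sub-network suppresses noise in the time-frequency domain; a common realization is mask-based suppression, where a network predicts a per-bin mask that is multiplied onto the noisy spectrogram. The sketch below shows the ideal ratio mask (a typical training target) and its application — both are illustrative assumptions, not the patent's specified mechanism:

```python
def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-8):
    """Ideal ratio mask (IRM): clean energy over total energy per bin.

    A common supervision target for a denoising sub-network; at inference
    the network would predict this mask from the noisy input.
    """
    return [c * c / (c * c + n * n + eps) for c, n in zip(clean_mag, noise_mag)]

def apply_tf_mask(noisy_mag, mask):
    """Suppress noise by scaling each time-frequency bin by its mask value."""
    # Clamp mask values into [0, 1] before applying
    return [x * min(max(m, 0.0), 1.0) for x, m in zip(noisy_mag, mask)]
```

Bins dominated by speech get a mask near 1 and pass through; noise-dominated bins get a mask near 0 and are attenuated, yielding the enhanced features consumed by the wake-up branches.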
Optionally, in another possible implementation of the first aspect, outputting the first wake-up discrimination result through the wake-up acoustic model and the wake-up decoder in the integrated voice processing model based on the enhanced voice features and the voice features comprises: inputting the enhanced voice features and the voice features into the wake-up acoustic model for temporal modeling to obtain a syllable-level posterior probability representation; and decoding the preset keywords through the wake-up decoder based on the syllable-level posterior probability representation to obtain the first wake-up discrimination result. Optionally, in another possible implementation of the first aspect, outputting the second wake-up discrimination result through the wake-up acoustic model and the end-to-end wake-up discrimination sub-network in the integrated voice processing model based on the enhanced voice features and the voice features comprises: inputting the voice features and the enhanced voice features into the wake-up acoustic model for temporal modeling and outputting an audio embedding; and inputting the audio embedding into the end-to-end wake-up discrimination sub-network, which performs holistic modeling without relying on explicit wake-up word alignment information and outputs the second wake-up discrimination result. Optionally, in another possible