CN-121983094-A - Voice endpoint detection method, storage medium and program product

Abstract

Embodiments of the present application provide a voice endpoint detection method, a storage medium, and a program product. The method comprises: obtaining a voice signal to be detected, and framing and windowing the voice signal to obtain a framed and windowed signal; performing first nonlinear filtering and second nonlinear filtering on the framed and windowed signal to obtain a first enhanced signal and a second enhanced signal; performing cepstral-domain feature extraction and fusion on the first enhanced signal and the second enhanced signal to obtain a fusion feature; and detecting the voice signal according to the fusion feature to obtain a detection result. The method improves the accuracy and stability of voice endpoint detection.

Inventors

WEI MINGYANG

Assignees

锐迪科微电子（上海）有限公司

Dates

Publication Date: 2026-05-05
Application Date: 2026-01-23

Claims (10)

1. A voice endpoint detection method, comprising: obtaining a voice signal to be detected, and framing and windowing the voice signal to obtain a framed and windowed signal, wherein the voice signal contains noise; performing first nonlinear filtering and second nonlinear filtering on the framed and windowed signal to obtain a first enhanced signal and a second enhanced signal, wherein the delay parameter of the first nonlinear filtering is smaller than the delay parameter of the second nonlinear filtering; performing cepstral-domain feature extraction and fusion on the first enhanced signal and the second enhanced signal to obtain a fusion feature; and detecting the voice signal according to the fusion feature to obtain a detection result.
2. The method of claim 1, wherein performing cepstral-domain feature extraction and fusion on the first enhanced signal and the second enhanced signal to obtain the fusion feature comprises: performing cepstral-domain feature extraction on the first enhanced signal to obtain a first cepstral peak feature; performing cepstral-domain feature extraction on the second enhanced signal to obtain a second cepstral peak feature; and performing weighted fusion on the first cepstral peak feature and the second cepstral peak feature to obtain the fusion feature.
3. The method of claim 2, wherein performing cepstral-domain feature extraction on the first enhanced signal to obtain the first cepstral peak feature comprises: performing a discrete Fourier transform on the first enhanced signal to obtain a first amplitude spectrum; performing recursive smoothing on the first amplitude spectrum to obtain a first power spectrum; performing a logarithm operation and an inverse discrete Fourier transform on the first power spectrum to obtain a first cepstrum; and determining the maximum value of the first cepstrum within a preset cepstrum search interval, and taking the maximum value as the first cepstral peak feature.
4. The method of claim 2, wherein performing weighted fusion on the first cepstral peak feature and the second cepstral peak feature to obtain the fusion feature comprises: acquiring a preset fusion feature, a first preset recursive average feature, and a second preset recursive average feature; determining a first current recursive average feature according to the first preset recursive average feature, the first cepstral peak feature, and the second cepstral peak feature; determining a second current recursive average feature according to the second preset recursive average feature, the first cepstral peak feature, and the second cepstral peak feature; and determining the fusion feature according to the preset fusion feature, the first current recursive average feature, and the second current recursive average feature.
5. The method of claim 4, wherein after obtaining the fusion feature, the method further comprises: updating the preset fusion feature to the fusion feature; updating the first preset recursive average feature to the first current recursive average feature; and updating the second preset recursive average feature to the second current recursive average feature.
6. The method according to any one of claims 1-5, wherein detecting the voice signal according to the fusion feature to obtain the detection result comprises: determining whether the fusion feature is greater than or equal to a preset threshold; if so, determining that the detection result is that voice is present; and if not, determining that the detection result is that no voice is present.
7. The method according to any one of claims 1-5, wherein performing first nonlinear filtering and second nonlinear filtering on the framed and windowed signal to obtain the first enhanced signal and the second enhanced signal comprises: inputting the framed and windowed signal to a first nonlinear filter to obtain the first enhanced signal, wherein the first nonlinear filter is used to enhance high-frequency voice harmonics and suppress burst noise, and the delay parameter of the first nonlinear filter is a first preset value; and inputting the framed and windowed signal to a second nonlinear filter to obtain the second enhanced signal, wherein the second nonlinear filter is used to retain weak low-frequency voice harmonics and suppress continuous noise, the delay parameter of the second nonlinear filter is a second preset value, and the first preset value is smaller than the second preset value.
8. An electronic device, comprising a memory and a processor, wherein the memory stores computer-executable instructions, and the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the method of any one of claims 1-7.
9. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method of any one of claims 1-7.
10. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-7.
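
For illustration only, the cepstral peak extraction recited in claim 3 could be sketched in Python with NumPy as follows. This is a minimal sketch, not the applicant's implementation. The function name `cepstral_peak`, the smoothing coefficient `alpha`, the 16 kHz sample rate, the 60-400 Hz pitch range used to derive the search interval, and the choice to square the amplitude spectrum before smoothing are all assumptions that the claims do not specify.

```python
import numpy as np

def cepstral_peak(frame, prev_power=None, alpha=0.7, fs=16000,
                  f_lo=60.0, f_hi=400.0, eps=1e-12):
    """Return (peak, smoothed_power) for one framed and windowed signal.

    All parameter values are illustrative assumptions, not values from the patent.
    """
    # Discrete Fourier transform -> amplitude spectrum.
    magnitude = np.abs(np.fft.rfft(frame))
    # Assumed: power is the squared amplitude, smoothed recursively across frames.
    power = magnitude ** 2
    if prev_power is not None:
        power = alpha * prev_power + (1.0 - alpha) * power
    # Logarithm followed by inverse DFT gives the (real) cepstrum.
    cepstrum = np.fft.irfft(np.log(power + eps), n=len(frame))
    # Assumed search interval: quefrencies matching pitch between f_lo and f_hi.
    q_lo = int(fs / f_hi)
    q_hi = min(int(fs / f_lo), len(frame) // 2)
    # The maximum inside the search interval is the cepstral peak feature.
    peak = float(np.max(cepstrum[q_lo:q_hi + 1]))
    return peak, power
```

The returned `power` would be passed back as `prev_power` for the next frame, so that the smoothing is recursive over time. The same function would be applied to both enhanced signals to obtain the first and second cepstral peak features.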

Description

Voice endpoint detection method, storage medium and program product

Technical Field

The present application relates to the field of data processing, and in particular to a voice endpoint detection method, a storage medium, and a program product.

Background

Voice endpoint detection is widely used in intelligent voice interaction systems, such as intelligent voice assistants, voice conference systems, voice transcription services, telemedicine voice recording, and in-vehicle voice control systems. In these scenarios, interaction between the user and the device relies on accurate detection of the voice signal to distinguish voice activity from silent segments.

Practical applications often face complex non-stationary background noise, such as traffic noise, industrial noise, human voice interference, and burst noise. The statistical characteristics of such noise change rapidly over time, which significantly degrades the detection performance of conventional voice endpoint detection algorithms. For example, in a voice conference system, a conventional algorithm may mistake burst noise for a voice signal, so that the system wrongly retains noise segments and transcription quality suffers. In an in-vehicle voice control system, a conventional algorithm may miss a weak voice signal, so that the system fails to respond to the user. In non-stationary noise environments, false detections and missed detections therefore occur easily, and the accuracy of voice endpoint detection is poor.

Disclosure of Invention

Embodiments of the present application provide a voice endpoint detection method, a storage medium, and a program product for improving the accuracy of voice endpoint detection.
In a first aspect, an embodiment of the present application provides a voice endpoint detection method, comprising: obtaining a voice signal to be detected, and framing and windowing the voice signal to obtain a framed and windowed signal, wherein the voice signal contains noise; performing first nonlinear filtering and second nonlinear filtering on the framed and windowed signal to obtain a first enhanced signal and a second enhanced signal, wherein the delay parameter of the first nonlinear filtering is smaller than the delay parameter of the second nonlinear filtering; performing cepstral-domain feature extraction and fusion on the first enhanced signal and the second enhanced signal to obtain a fusion feature; and detecting the voice signal according to the fusion feature to obtain a detection result.

In one possible implementation, performing cepstral-domain feature extraction and fusion on the first enhanced signal and the second enhanced signal to obtain the fusion feature comprises: performing cepstral-domain feature extraction on the first enhanced signal to obtain a first cepstral peak feature; performing cepstral-domain feature extraction on the second enhanced signal to obtain a second cepstral peak feature; and performing weighted fusion on the first cepstral peak feature and the second cepstral peak feature to obtain the fusion feature.
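
As a rough illustration of the first and last steps of the first aspect, the framing and windowing step and the final threshold decision (the rule recited in claim 6) might look like the Python sketch below. It is a sketch under stated assumptions: the Hann window, the 400-sample frame and 160-sample hop (25 ms and 10 ms at an assumed 16 kHz sample rate), the threshold value, and the names `frame_and_window` and `detect` are hypothetical and are not given in the application. The nonlinear filters and the fusion step are deliberately left out because their details are not specified here.

```python
import numpy as np

def frame_and_window(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames and apply a Hann window.

    frame_len, hop, and the Hann window are assumed values, not from the patent.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        # Too short to form a single frame.
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hanning(frame_len)

def detect(fusion_feature, threshold=0.5):
    """Return True (voice present) when the fusion feature reaches the threshold.

    The threshold value 0.5 is an arbitrary placeholder.
    """
    return fusion_feature >= threshold
```

For a one-second signal at 16 kHz, `frame_and_window` yields 98 frames of 400 samples each under these assumed settings. Each frame would then pass through the two nonlinear filters before feature extraction.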
In one possible implementation, performing cepstral-domain feature extraction on the first enhanced signal to obtain the first cepstral peak feature comprises: performing a discrete Fourier transform on the first enhanced signal to obtain a first amplitude spectrum; performing recursive smoothing on the first amplitude spectrum to obtain a first power spectrum; performing a logarithm operation and an inverse discrete Fourier transform on the first power spectrum to obtain a first cepstrum; and determining the maximum value of the first cepstrum within a preset cepstrum search interval, and taking the maximum value as the first cepstral peak feature.

In one possible implementation, performing weighted fusion on the first cepstral peak feature and the second cepstral peak feature to obtain the fusion feature comprises: acquiring a preset fusion feature, a first preset recursive average feature, and a second preset recursive average feature; determining a first current recursive average feature according to the first preset recursive average feature, the first cepstral peak feature, and the second cepstral peak feature; determining a second current recursive average feature according to the second preset recursive average feature, the first cepstral peak feature, and the second cepstral peak feature; and determining the fusion feature according to the prese