CN-121983031-A - Voice quality detection method, device, medium and equipment

Abstract

The embodiments of this specification disclose a voice quality detection method. A preset voice activity detection (VAD) algorithm determines a voice classification result for each frame, and the audio data is divided into voice segments and non-voice segments. Each frame of audio in the non-voice segments whose classification result is voice data is treated as an interference frame and removed, and a signal-to-noise ratio is then calculated with the voice segments to determine a quality detection result. Removing from the non-voice segments the interference frames that were misjudged as voice yields a purer noise estimate. The signal-to-noise ratio calculated on this basis reflects the voice signal quality more faithfully and effectively addresses the overestimation of noise power and underestimation of signal-to-noise ratio caused by VAD false detection, which improves the accuracy and reliability of voice quality detection.

Inventors

WANG TAO
LIU JIAN
ZHANG CHANGHAO

Assignees

支付宝(杭州)数字服务技术有限公司

Dates

Publication Date: 2026-05-05
Application Date: 2026-01-05

Claims (10)

1. A voice quality detection method, applied to a head-mounted device, the method comprising: acquiring audio data to be detected; determining a voice classification result of each frame through a preset voice activity detection algorithm, so as to divide the audio data into voice segments and non-voice segments; according to the voice classification result of each frame, determining each frame of audio in the non-voice segments whose classification result is voice data as an interference frame; and, after deleting the interference frames from the non-voice segments, calculating a signal-to-noise ratio with the voice segments, and determining a voice quality detection result of the audio data based at least on the signal-to-noise ratio.
2. The method of claim 1, wherein dividing the audio data into voice segments and non-voice segments comprises: determining a plurality of candidate voice segments according to the voice classification result of each frame; recording the candidate voice segments whose duration is lower than a preset pronunciation duration, and taking the candidate voice segments whose duration is not lower than the preset pronunciation duration as basic voice segments; extending each basic voice segment forward and backward by a preset expansion duration to determine a voice segment, and recording the expanded portion of the voice segment; and taking the rest of the audio data as non-voice segments.
3. The method of claim 2, wherein determining, according to the voice classification result of each frame, each frame of audio in the non-voice segments whose classification result is voice data as an interference frame specifically comprises: determining, according to the recorded data, the candidate voice segments whose duration is lower than the preset pronunciation duration; and taking each frame in the recorded candidate voice segments as an interference frame in the non-voice segments.
4. The method of claim 3, wherein taking each frame in the recorded candidate voice segments as an interference frame in the non-voice segments specifically comprises: taking each recorded candidate voice segment as a candidate interference segment; and, for each candidate interference segment, taking the part of the candidate interference segment that is not assigned to a voice segment as interference audio, and determining the interference frames accordingly.
5. The method of claim 2, wherein calculating a signal-to-noise ratio with the voice segments after deleting the interference frames from the non-voice segments specifically comprises: deleting the expanded portion from each voice segment according to the recorded expanded portion; and deleting the interference frames from the non-voice segments, and calculating the signal-to-noise ratio from the voice segments with the expanded portions deleted and the non-voice segments with the interference frames deleted.
6. The method of claim 1, wherein after deleting the interference frames from the non-voice segments and before calculating the signal-to-noise ratio with the voice segments, the method further comprises: performing pulse detection on the non-voice segments according to a preset pulse threshold; and, when a pulse is present, deleting or zeroing the pulse portion of the non-voice segments.
7. The method of claim 2, wherein determining a voice quality detection result of the audio data based at least on the signal-to-noise ratio comprises: deleting the expanded portion from each voice segment according to the recorded expanded portion; determining a duration and a sound pressure value from the voice segments with the expanded portions deleted; and determining the voice quality detection result of the audio data through a preset quality detection model based on the signal-to-noise ratio, the duration and the sound pressure value.
8. A voice quality detection apparatus, applied to a head-mounted device, comprising: an acquisition module, configured to acquire audio data to be detected; a classification module, configured to determine a voice classification result of each frame through a preset voice activity detection algorithm, so as to divide the audio data into voice segments and non-voice segments; an interference shielding module, configured to determine, according to the voice classification result of each frame, each frame of audio in the non-voice segments whose classification result is voice data as an interference frame; and a quality detection module, configured to calculate a signal-to-noise ratio with the voice segments after deleting the interference frames from the non-voice segments, and to determine a voice quality detection result of the audio data based at least on the signal-to-noise ratio.
9. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any one of claims 1-7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1-7 when executing the program.
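
The segmentation recited in claims 2 to 4 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the function names, the frame-count parameters `min_voice_frames` and `expand_frames` (standing in for the preset pronunciation duration and the preset expansion duration), and the boolean per-frame VAD flags are all assumptions made for illustration.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # half-open frame range [start, end)


def find_runs(flags: List[bool]) -> List[Segment]:
    """Return maximal runs of consecutive True flags as [start, end) ranges."""
    runs: List[Segment] = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(flags)))
    return runs


def segment_audio(flags: List[bool], min_voice_frames: int, expand_frames: int):
    """Split frames into voice / non-voice and mark interference frames.

    Returns three per-frame boolean masks:
      in_voice     - frame belongs to a voice segment (core plus expansion)
      expanded     - frame belongs only to the expanded portion of a voice segment
      interference - frame lies in a short candidate run outside every voice segment
    """
    n = len(flags)
    candidates = find_runs(flags)
    # Short candidates are only recorded; long ones become basic voice segments.
    short = [(s, e) for s, e in candidates if e - s < min_voice_frames]
    basic = [(s, e) for s, e in candidates if e - s >= min_voice_frames]

    core = [False] * n
    in_voice = [False] * n
    for s, e in basic:
        for i in range(s, e):
            core[i] = True
        # Extend the basic segment forward and backward, clipped to the audio.
        for i in range(max(0, s - expand_frames), min(n, e + expand_frames)):
            in_voice[i] = True
    expanded = [in_voice[i] and not core[i] for i in range(n)]

    # Only the part of a short candidate not absorbed by a voice segment
    # counts as interference (the refinement described in claim 4).
    interference = [False] * n
    for s, e in short:
        for i in range(s, e):
            if not in_voice[i]:
                interference[i] = True
    return in_voice, expanded, interference


# Hypothetical VAD output: 1 = frame classified as voice.
flags = [bool(v) for v in [0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0]]
in_voice, expanded, interference = segment_audio(flags, 3, 2)
print([i for i, v in enumerate(interference) if v])
```

With these inputs, frames 5 to 8 form the only basic voice segment and the voice segment grows to frames 3 to 10. Frame 10 of the short run is absorbed by the expansion, so only frames 1 and 11 are marked as interference frames.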

Description

Voice quality detection method, device, medium and equipment

Technical Field

The present disclosure relates to the field of computer technologies, and in particular to a method, an apparatus, a storage medium, and a device for detecting voice quality.

Background

In recent years, with the rapid development of technologies such as Augmented Reality (AR) and Virtual Reality (VR), wearable devices have become increasingly popular. Among them, head-mounted devices such as smart glasses attract attention for their portability and immersive experience. Voice interaction is a mainstream interaction scheme for smart glasses by virtue of its naturalness and low power consumption. Voice quality can be evaluated first during voice interaction, so that the interaction is stopped in time when the voice quality is low, avoiding the poor user experience caused by interaction failures due to low voice quality.

In the prior art, voice activity detection (VAD) is generally performed on the collected audio data to separate voice segments from non-voice segments. The signal-to-noise ratio of the voice segments relative to the non-voice segments is then calculated from the two, and the voice quality is determined in combination with the sound pressure value and the duration of the voice segments.

However, in the voice interaction scenario of a head-mounted device, a directly applied VAD algorithm has difficulty adapting to a complex and dynamic acoustic environment, so over-inclusion and omission easily occur. To preserve the integrity of the voice, the expansion-duration strategy in the VAD algorithm tends to erroneously include non-semantic human sounds collected by the head-mounted device, such as breathing, coughing and filler words.

To improve the purity of the voice segments, the VAD algorithm applies a higher threshold when filtering low-energy audio, so weak consonants or trailing sounds collected by the head-mounted device are incorrectly classified into non-voice segments.

Based on this, the present specification provides a voice quality detection method to partially solve the problems existing in the prior art, especially on head-mounted devices.

Disclosure of Invention

Embodiments of the present disclosure provide a method, an apparatus, a storage medium, and an electronic device for detecting voice quality, so as to partially solve the problems in the prior art.

The embodiments of this specification adopt the following technical solutions:

The present specification provides a voice quality detection method, applied to a head-mounted wearable device, comprising: acquiring audio data to be detected; determining a voice classification result of each frame through a preset voice activity detection algorithm, so as to divide the audio data into voice segments and non-voice segments; according to the voice classification result of each frame, determining each frame of audio in the non-voice segments whose classification result is voice data as an interference frame; and, after deleting the interference frames from the non-voice segments, calculating a signal-to-noise ratio with the voice segments, and determining a voice quality detection result of the audio data based at least on the signal-to-noise ratio.
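
The effect of deleting interference frames before the signal-to-noise calculation can be sketched as follows. This is a minimal illustration under stated assumptions, not the method as implemented by the applicant: the source text does not spell out its signal-to-noise formula, so the conventional definition 10·log10(P_speech / P_noise) over mean squared amplitude is assumed, and the names `mean_power`, `snr_db` and `drop_interference` are hypothetical.

```python
import math
from typing import Sequence


def mean_power(frames: Sequence[Sequence[float]]) -> float:
    """Mean squared amplitude over all samples in the given frames."""
    total = sum(x * x for frame in frames for x in frame)
    count = sum(len(frame) for frame in frames)
    return total / count if count else 0.0


def snr_db(
    frames: Sequence[Sequence[float]],
    vad_flags: Sequence[bool],
    in_voice_segment: Sequence[bool],
    drop_interference: bool = True,
) -> float:
    """SNR in dB of voice-segment frames relative to non-voice frames.

    A frame outside every voice segment whose VAD flag is True is treated as
    an interference frame; with drop_interference=True it is excluded from
    the noise estimate.
    """
    speech = [f for f, seg in zip(frames, in_voice_segment) if seg]
    noise = [
        f
        for f, flag, seg in zip(frames, vad_flags, in_voice_segment)
        if not seg and not (drop_interference and flag)
    ]
    p_speech, p_noise = mean_power(speech), mean_power(noise)
    if p_speech <= 0.0 or p_noise <= 0.0:
        raise ValueError("need non-empty, non-silent speech and noise frames")
    return 10.0 * math.log10(p_speech / p_noise)


# Synthetic example (assumed values): quiet background at amplitude 0.01,
# speech at 1.0, and one cough-like frame at 0.5 that the VAD flags as voice
# but that is too short to become a voice segment.
frames = [[0.01] * 4, [0.5] * 4, [0.01] * 4, [1.0] * 4, [1.0] * 4, [0.01] * 4]
vad_flags = [False, True, False, True, True, False]
in_voice_segment = [False, False, False, True, True, False]

with_removal = snr_db(frames, vad_flags, in_voice_segment)
without_removal = snr_db(frames, vad_flags, in_voice_segment, drop_interference=False)
print(round(with_removal, 1), round(without_removal, 1))
```

In this synthetic case the estimate is about 40 dB once the interference frame is excluded, against roughly 12 dB when it is left in the noise estimate. This is the noise-power overestimation and signal-to-noise underestimation the method is meant to avoid.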

The present specification provides a voice quality detection apparatus, applied to a head-mounted device, comprising: an acquisition module, configured to acquire audio data to be detected; a classification module, configured to determine a voice classification result of each frame through a preset voice activity detection algorithm, so as to divide the audio data into voice segments and non-voice segments; an interference shielding module, configured to determine, according to the voice classification result of each frame, each frame of audio in the non-voice segments whose classification result is voice data as an interference frame; and a quality detection module, configured to calculate a signal-to-noise ratio with the voice segments after deleting the interference frames from the non-voice segments, and to determine a voice quality detection result of the audio data based at least on the signal-to-noise ratio.

The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above voice quality detection method.

The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the above voice quality detection method when executing the program.

At least one of the above technical solutions adopted in the embodiments of this specification can achieve the following beneficial effects:

The embodiment of the specification disclo