CN-121985068-A - Voice processing method and electronic equipment

Abstract

A voice processing method and an electronic device relate to the technical field of voice processing and address the following problems in the prior art: muffled-sounding voice caused by bandwidth limitation, and intermittent, blurred voice with reduced intelligibility caused by spectrum damage. The voice processing method comprises: performing bandwidth limitation detection on a first voice signal to determine whether the current call is bandwidth limited; if the current call is bandwidth limited, expanding the bandwidth of the first voice signal to wideband to obtain a second voice signal; performing spectrum damage detection on the second voice signal to determine whether the spectrum of the second voice signal is damaged; and if the spectrum of the second voice signal is damaged, repairing the spectrum of the second voice signal to obtain a repaired third voice signal.

Inventors

GENG JIANHUA

Assignees

Honor Device Co., Ltd. (荣耀终端股份有限公司)

Dates

Publication Date: 20260505
Application Date: 20260407

Claims (11)

1. A voice processing method, the method comprising: performing bandwidth limitation detection on a first voice signal to determine whether the current call is bandwidth limited; if the current call is bandwidth limited, expanding the bandwidth of the first voice signal to wideband to obtain a second voice signal; performing spectrum damage detection on the second voice signal to determine whether the spectrum of the second voice signal is damaged; and if the spectrum of the second voice signal is damaged, repairing the spectrum of the second voice signal to obtain a repaired third voice signal.
2. The method of claim 1, wherein performing bandwidth limitation detection on the first voice signal to determine whether the current call is bandwidth limited comprises: acquiring spectrum information of the first voice signal; obtaining, according to the spectrum information of the first voice signal, a low-frequency average energy and a high-frequency average energy of the first voice signal; and if the ratio of the low-frequency average energy to the high-frequency average energy is greater than or equal to a first threshold, determining that the current call is bandwidth limited, and otherwise determining that the current call is not bandwidth limited.
3. The method according to claim 1 or 2, wherein expanding the bandwidth of the first voice signal to wideband to obtain the second voice signal comprises: predicting, by an amplitude prediction model, high-frequency amplitude information of the second voice signal according to the spectrum information of the first voice signal; predicting, by a phase prediction model, high-frequency phase information of the second voice signal according to the spectrum information of the first voice signal; obtaining, by a fusion module, a high-frequency complex spectrum of the second voice signal according to the high-frequency amplitude information and the high-frequency phase information; and splicing, by the fusion module, the high-frequency complex spectrum of the second voice signal with the low-frequency complex spectrum of the first voice signal to obtain a wideband complex spectrum, and performing an inverse short-time Fourier transform on the wideband complex spectrum to obtain a time-domain waveform of the second voice signal.
4. The method according to claim 3, wherein the output of an intermediate layer of the amplitude prediction model is also used for adjusting the input of an intermediate layer of the phase prediction model, and the output of the intermediate layer of the phase prediction model is also used for adjusting the input of the intermediate layer of the amplitude prediction model.
5. The method of claim 3, wherein the amplitude prediction model comprises a convolution layer at an input layer of the model, a convolution layer at a middle layer of the model, and a fully connected layer at an output layer of the model, wherein the convolution layer at the input layer is used for extracting local spectrum information of the first voice signal, the convolution layer at the middle layer is used for expanding the receptive field, and the fully connected layer at the output layer is used for fusing all local spectrum information and mapping it to the high-frequency amplitude information or the high-frequency phase information.
6. The method according to claim 1 or 2, wherein performing spectrum damage detection on the second voice signal to determine whether the spectrum of the second voice signal is damaged comprises: acquiring spectrum information of the second voice signal; inputting the spectrum information of the second voice signal into a pre-trained voice quality evaluation model to obtain a mean opinion score of the second voice signal; and if the mean opinion score is less than or equal to a second threshold, determining that the spectrum of the second voice signal is damaged, and otherwise determining that the spectrum of the second voice signal is not damaged.
7. The method according to claim 1 or 2, wherein repairing the spectrum of the second voice signal to obtain a repaired third voice signal comprises: inputting the second voice signal into an encoder and a decoder, and estimating a logarithmic Mel spectrum of the repaired voice signal; and performing voice reconstruction on the logarithmic Mel spectrum with a vocoder to generate the repaired third voice signal.
8. The method of claim 7, wherein a convolution layer of at least one of the encoder and the decoder comprises a gated convolution, the gated convolution being an element-wise multiplication of the output of a main convolution branch by a gating mask, wherein the main convolution branch is used for extracting local spectral information of the second voice signal, and a gating branch is used for identifying damaged and undamaged regions in the spectral information of the second voice signal and generating the gating mask; for the undamaged region, the gating mask is greater than or equal to a first mask threshold; for boundaries in the damaged region adjacent to the undamaged region, the gating mask is less than the first mask threshold and greater than a second mask threshold; and for parts of the damaged region other than the boundaries, the gating mask is less than the second mask threshold.
9. An electronic device comprising a memory and one or more processors; the memory having stored therein computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method of any of claims 1-8.
10. A computer readable storage medium comprising instructions which, when run on an electronic device, cause the electronic device to perform the method of any of claims 1-8.
11. A computer program product comprising instructions which, when run on an electronic device, cause the electronic device to perform the method of any of claims 1-8.
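
As an illustration of the energy-ratio test recited in claim 2, the following is a minimal sketch and not the applicant's implementation. The 16 kHz sample rate, the 4 kHz split between the low and high bands, the threshold value of 100, and the names power_spectrum and is_bandwidth_limited are all assumptions chosen for the example; the claims do not fix any of them. A naive DFT is used so that the sketch needs only the standard library.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (adequate for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def is_bandwidth_limited(frame, sample_rate=16000, split_hz=4000.0,
                         ratio_threshold=100.0):
    """Return True when low-band / high-band average energy >= threshold.

    sample_rate, split_hz and ratio_threshold are illustrative assumptions.
    """
    power = power_spectrum(frame)
    bin_hz = sample_rate / len(frame)
    low = [p for k, p in enumerate(power) if k * bin_hz < split_hz]
    high = [p for k, p in enumerate(power) if k * bin_hz >= split_hz]
    if not low or not high:
        raise ValueError("split_hz must lie inside the analysed band")
    low_avg = sum(low) / len(low)
    high_avg = sum(high) / len(high)
    # Small constant avoids division by zero when the high band is empty
    # of energy, which is exactly the bandwidth-limited case.
    ratio = low_avg / (high_avg + 1e-12)
    return ratio >= ratio_threshold


# A 1 kHz tone alone has no energy above 4 kHz; adding a 6 kHz tone
# restores high-band energy and brings the ratio close to 1.
narrow = [math.sin(2 * math.pi * 1000 * i / 16000) for i in range(64)]
wide = [s + math.sin(2 * math.pi * 6000 * i / 16000)
        for i, s in enumerate(narrow)]
print(is_bandwidth_limited(narrow), is_bandwidth_limited(wide))
```

In practice the threshold would be tuned on real call data, and the ratio would likely be smoothed over several frames before a decision is taken; both points are outside what the claims specify.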

Description

Voice processing method and electronic equipment

Technical Field

The embodiments of the application relate to the technical field of voice processing, and in particular to a voice processing method and an electronic device.

Background

In current voice communication systems, in a weak network environment (such as an underground parking lot, an elevator, or a subway), electronic devices typically reduce the bandwidth of a voice signal from wideband (e.g., 0-8 kHz) to a limited bandwidth (e.g., 0-4 kHz) for transmission in order to ensure call continuity. The high-frequency part of the voice signal is lost, which makes the voice sound muffled. In addition, factors such as packet loss, a high compression ratio, or strong noise interference on the uplink damage the spectrum of the voice signal, so that the voice becomes intermittent and blurred and intelligibility is seriously affected.

Disclosure of Invention

The application provides a voice processing method and an electronic device, which are used for solving the prior-art problems of muffled-sounding voice caused by bandwidth limitation, and of intermittent, blurred voice with reduced intelligibility caused by spectrum damage.

In order to achieve the above purpose, the application adopts the following technical solutions.

In a first aspect, a voice processing method is provided and applied to an electronic device. The method includes: performing bandwidth limitation detection on a first voice signal to determine whether the current call is bandwidth limited. If the current call is bandwidth limited, the bandwidth of the first voice signal is expanded to wideband to obtain a second voice signal. If the current call is not bandwidth limited, the first voice signal is taken as the second voice signal. Spectrum damage detection is then performed on the second voice signal to determine whether the spectrum of the second voice signal is damaged. If the spectrum of the second voice signal is damaged, the spectrum of the second voice signal is repaired to obtain a repaired third voice signal.

The voice processing method provided by the embodiments of the application uses a cascaded processing flow: bandwidth limitation detection first, then bandwidth expansion, then spectrum damage detection, and finally spectrum repair. This achieves staged, fine-grained processing of the voice signal. When a bandwidth-limited call is detected, the lost high-frequency components are recovered through bandwidth expansion, which improves the brightness and naturalness of the voice and solves the problem of muffled voice in a weak network environment. When spectrum damage is detected, the blurred and intermittent parts are filled in by a generative repair technique, which improves speech intelligibility and solves the problem of poor listening quality in a weak network environment. Bandwidth expansion and spectrum repair are guaranteed not to interfere with each other; meanwhile, bandwidth limitation detection enables on-demand processing, which avoids unnecessary computation overhead and ultimately improves the user's call experience in a weak network environment across the board.
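
The cascaded flow of the first aspect can be sketched as a small dispatcher that calls the expensive stages only when the corresponding detector fires. This is a hedged illustration, not the implementation described in the application: the function name process_voice, the Signal type alias, and the four callables passed in are hypothetical placeholders for the detectors and models. Returning the second voice signal unchanged when no damage is detected is also an assumption, since the passage above only states what happens when damage is found.

```python
from typing import Callable, Sequence

Signal = Sequence[float]


def process_voice(
    first: Signal,
    is_bandwidth_limited: Callable[[Signal], bool],
    expand_bandwidth: Callable[[Signal], Signal],
    is_spectrum_damaged: Callable[[Signal], bool],
    repair_spectrum: Callable[[Signal], Signal],
) -> Signal:
    """Run detection -> expansion -> detection -> repair in cascade."""
    # Stage 1: expand only when the call is detected as bandwidth limited;
    # otherwise the first signal is used as the second signal.
    if is_bandwidth_limited(first):
        second = expand_bandwidth(first)
    else:
        second = first

    # Stage 2: repair only when the spectrum is detected as damaged.
    # Passing the signal through untouched otherwise is an assumption.
    if is_spectrum_damaged(second):
        return repair_spectrum(second)
    return second


# Usage with trivial stand-ins for the real detectors and models.
out = process_voice(
    [0.1, 0.2],
    is_bandwidth_limited=lambda s: True,
    expand_bandwidth=lambda s: list(s) + [0.0],
    is_spectrum_damaged=lambda s: False,
    repair_spectrum=lambda s: list(s),
)
print(out)
```

Passing the stages in as callables keeps the control flow separate from the models, so each detector can be replaced or tested on its own.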
In one possible implementation, performing bandwidth limitation detection on the first voice signal to determine whether the current call is bandwidth limited includes: obtaining spectrum information of the first voice signal; obtaining, according to the spectrum information of the first voice signal, the low-frequency average energy and the high-frequency average energy of the first voice signal; and determining that the current call is bandwidth limited if the ratio of the low-frequency average energy to the high-frequency average energy is greater than or equal to a first threshold, and otherwise determining that the current call is not bandwidth limited. By calculating the ratio of the low-frequency average energy to the high-frequency average energy of the first voice signal, this embodiment provides a bandwidth limitation detection method that is simple to compute and responsive in real time. A call that is not bandwidth limited is rich in high-frequency components, so its high-frequency average energy is relatively high and the ratio of low-frequency energy to high-frequency energy is small. For a bandwidth-limited call, the high-frequency energy is missing and the ratio increases markedly. By setting a reasonable threshold, calls that are bandwidth limited and calls that are not can be accurately distinguished. The detection can be completed by calculating frequency-domain energy alone, without complex model inference, the calculation complexity and