
CN-122024752-A - Electric power scene audio enhancement method and system based on deep learning

CN122024752A

Abstract

The invention discloses a deep-learning-based audio enhancement method and system for electric power scenes, relating to the technical field of audio signal processing and speech enhancement. The method constructs candidate harmonic frequency bands around the power-grid fundamental (power frequency) and its multi-order harmonics and generates a continuous soft-suppression weight map, forming smooth notches within the harmonic bands while maintaining energy continuity in the non-harmonic bands, thereby reducing the audibility of harmonic buzzing while limiting collateral damage to low-frequency speech information. In addition, the bandwidth of each candidate band is determined adaptively from the spectral-peak broadening characteristics, the local noise floor, and the frequency resolution, and the network outputs adjustments to the suppression strength, bandwidth, and center position, so that the suppression range and strength track changes in the spectral shape from frame to frame, improving adaptability to the noise characteristics of different sites and different equipment.
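The continuous soft-suppression weight map described in the abstract can be sketched as follows. This is a minimal illustrative implementation, not the patented network: it assumes a 50 Hz grid, a fixed (non-adaptive) half-bandwidth, a raised-cosine notch shape, and the parameter values shown, none of which are specified by the patent. The `floor` argument stands in for the minimum-retention constraint protecting low-frequency speech content.

```python
import numpy as np

def soft_suppression_weights(freqs, f0=50.0, n_harmonics=8,
                             half_bw=3.0, depth=0.7, floor=0.3):
    """Build a continuous frequency-domain suppression weight map.

    For each harmonic k*f0 a smooth raised-cosine notch is carved out of
    an all-pass (weight = 1) profile; outside the candidate bands the
    weights stay at 1, preserving energy continuity.  All parameter
    values here are illustrative assumptions.
    """
    w = np.ones_like(freqs)
    for k in range(1, n_harmonics + 1):
        d = np.abs(freqs - k * f0)
        in_band = d < half_bw
        # raised-cosine transition: deepest at the harmonic centre,
        # smoothly reaching weight 1 at the band edges
        pit = 1.0 - depth * 0.5 * (1.0 + np.cos(np.pi * d[in_band] / half_bw))
        w[in_band] = np.minimum(w[in_band], pit)
    # minimum-retention constraint (never suppress below `floor`)
    return np.maximum(w, floor)
```

Multiplying a complex spectrogram column by this map attenuates bins near each harmonic while leaving the rest of the spectrum untouched; the cosine taper avoids the hard band edges that cause musical-noise artifacts.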

Inventors

  • LI WENHUI
  • CHENG XIAOWEI
  • DOU CHENCHEN
  • YUAN PINGLIANG
  • WANG YUTING
  • ZHAN WENHAO
  • WANG KEMIN

Assignees

  • State Grid Gansu Electric Power Company (国网甘肃省电力公司)
  • State Grid Gansu Electric Power Company Information and Communication Company (国网甘肃省电力公司信息通信公司)

Dates

Publication Date
2026-05-12
Application Date
2026-01-31

Claims (10)

  1. A deep-learning-based electric power scene audio enhancement method, characterized by comprising the following steps: step S1, framing the noisy time-domain audio and performing a short-time Fourier transform to obtain a complex time-frequency spectrum; step S2, inputting the complex time-frequency spectrum into a harmonic suppression branch network and a transient repair branch network in parallel, wherein the harmonic suppression branch network generates a continuous frequency-domain suppression weight map according to the preset or estimated power frequency of the power grid and its harmonic frequencies, and weights the complex time-frequency spectrum to suppress steady-state harmonic noise, obtaining a first enhancement spectrum; and the transient repair branch network outputs a transient indication map and, for the time-frequency regions marked by the transient indication map, generates a substitution spectrum from the neighborhood context through a gated convolution repair network, obtaining a second enhancement spectrum; and step S3, calculating dynamic fusion weights according to the confidence values respectively output by the two branches, performing a weighted fusion of the first enhancement spectrum and the second enhancement spectrum to obtain a fused spectrum, and performing an inverse short-time Fourier transform on the fused spectrum to obtain the enhanced time-domain audio.
  2. The deep-learning-based power scene audio enhancement method according to claim 1, wherein the estimation of the power frequency of the power grid comprises: performing a peak search or autocorrelation analysis on the amplitude spectrum of the complex time-frequency spectrum within a preset power-frequency candidate range to obtain power-frequency candidate values, and selecting the candidate value with the largest energy as the power frequency of the power grid.
  3. The deep-learning-based power scene audio enhancement method according to claim 1, wherein the generation of the continuous frequency-domain suppression weights comprises: determining candidate harmonic frequency bands centered on the power frequency of the power grid and its harmonic frequencies, adaptively determining the band boundaries and suppression intensities by a neural network according to spectral features inside and outside the candidate harmonic bands, and outputting a continuous-valued frequency-domain suppression weight map; in the generation of the continuous frequency-domain suppression weight map, when candidate harmonic bands are built around the power frequency and its multi-order harmonic centers, the bandwidth of each candidate band is adaptively determined according to the broadening characteristics of the harmonic spectral peak, the local noise floor, and the frequency resolution; a soft suppression window with continuous transitions is applied within each candidate band to form a basic soft mask; the neural network outputs adaptive adjustments to the suppression intensity, bandwidth, and center position according to spectral features inside and outside the candidate bands, yielding a frequency-domain suppression weight map that varies continuously at the band boundaries; and meanwhile a minimum-retention constraint is set for the low-frequency band containing the speech fundamental frequency, and a constraint is imposed on the rate of change of the frequency-domain suppression weights.
  4. The deep-learning-based power scene audio enhancement method according to claim 1, wherein the transient indication map is a continuous-valued map with the same resolution as the complex time-frequency spectrum, characterizes the confidence that each time-frequency unit contains transient impulse noise, and yields the transient regions through thresholding.
  5. The deep-learning-based power scene audio enhancement method according to claim 1, wherein the gated convolution repair network takes the neighborhood time-frequency features of a transient region and its corresponding region mask as conditional input, regulates the contribution of neighborhood information to the repair result through a gating mechanism, and outputs a substitution spectrum for filling the transient region; in each network layer, the gated convolution repair network generates candidate repair features through a main convolution branch and gate values through a gating branch, and the gate values modulate the candidate repair features point-wise to control the contribution of neighborhood information to the repair result; the transient region mask is updated between layers according to the gate values, so that the region to be repaired shrinks gradually as the layers progress; and in the output stage, a continuous transition mask is constructed from the transient region mask, by which the substitution spectrum and the original time-frequency spectrum are fused with a continuous transition.
  6. The deep-learning-based power scene audio enhancement method of claim 1, wherein the confidence values comprise a harmonic residual metric output by the harmonic suppression branch and a transient-region identification metric output by the transient repair branch.
  7. The deep-learning-based power scene audio enhancement method according to claim 6, wherein the dynamic fusion weights are calculated per time frame or per time-frequency unit and are obtained by a normalized mapping of the confidence values, so that the branch with the higher confidence receives the higher fusion weight at the corresponding position.
  8. A deep-learning-based power scene audio enhancement system, based on the deep-learning-based power scene audio enhancement method of any one of claims 1 to 7, characterized by comprising: a time-frequency conversion module for converting the noisy time-domain audio into a complex time-frequency spectrum; a harmonic suppression branch network for generating the frequency-domain suppression weights and outputting a first enhancement spectrum; a transient repair branch network for generating a transient indication map and outputting a second enhancement spectrum; a fusion module for calculating the dynamic fusion weights according to the confidence values and outputting a fused spectrum; and a reconstruction module for performing an inverse short-time Fourier transform on the fused spectrum to output the enhanced time-domain audio.
  9. The deep-learning-based power scene audio enhancement system of claim 8, wherein the reconstruction module uses overlap-add time-domain synthesis when performing the inverse short-time Fourier transform, and uses the phase information corresponding to the fused spectrum or phase information output by a phase recovery network.
  10. The deep-learning-based power scene audio enhancement system of claim 8, wherein the harmonic suppression branch network, the transient repair branch network, and the fusion module are deployed on the same computing device or on an edge computing device.
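The power-frequency estimation in claim 2 (peak search within a candidate range) can be sketched as a harmonic-comb peak search. This is an illustrative reading of the claim, not the patented estimator: the candidate set, tolerance, and harmonic count are assumptions.

```python
import numpy as np

def estimate_power_frequency(mag_spec, freqs, candidates=(50.0, 60.0),
                             tol=1.0, n_harmonics=4):
    """Pick the power-frequency candidate whose harmonic comb captures
    the most energy in the magnitude spectrum.

    mag_spec: (n_freq,) time-averaged magnitude spectrum
    freqs:    (n_freq,) bin centre frequencies in Hz
    """
    def comb_energy(f0):
        e = 0.0
        for k in range(1, n_harmonics + 1):
            # peak search within a narrow band around the k-th harmonic
            band = np.abs(freqs - k * f0) <= tol
            if band.any():
                e += mag_spec[band].max()
        return e
    return max(candidates, key=comb_energy)
```

Summing the per-harmonic peaks rather than a single bin makes the search robust to slight grid-frequency drift within the tolerance band.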
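The gating mechanism in claim 5 (a feature branch whose candidate repair values are modulated point-wise by a gate branch) can be illustrated with a single untrained gated-convolution layer. This is a generic gated-convolution sketch, not the patented repair network: the 3x3 kernels, zero padding, tanh/sigmoid activations, and weights are all illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_conv2d(x, w_feat, w_gate, b_feat=0.0, b_gate=0.0):
    """One gated-convolution layer: a feature branch proposes candidate
    repair values and a gate branch modulates them point-wise,
    controlling how much neighbourhood context flows into the repaired
    region.  3x3 kernels with zero padding; weights are illustrative,
    not trained parameters.
    """
    H, W = x.shape
    xp = np.pad(x, 1)
    feat = np.zeros_like(x)
    gate = np.zeros_like(x)
    for i in range(3):          # unrolled 3x3 cross-correlation
        for j in range(3):
            patch = xp[i:i + H, j:j + W]
            feat += w_feat[i, j] * patch
            gate += w_gate[i, j] * patch
    # point-wise modulation: gate in (0, 1) scales the candidate features
    return np.tanh(feat + b_feat) * sigmoid(gate + b_gate)
```

A strongly negative gate bias drives the gate toward 0 and shuts the layer off, which is the mechanism the claim uses to keep unreliable neighbourhood information out of the repair result.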
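The dynamic fusion of claims 6-7 (a normalized mapping of per-unit confidences, so the more confident branch dominates at each position) can be sketched with a per-unit softmax. The softmax choice and the temperature parameter `tau` are assumptions; the patent only requires a normalized mapping.

```python
import numpy as np

def fuse_spectra(spec_harm, spec_trans, conf_harm, conf_trans, tau=1.0):
    """Confidence-driven fusion of the two branch outputs.

    Per time-frequency unit the two confidence maps are normalised with
    a softmax so the branch with higher confidence gets the larger
    weight; `tau` is an assumed smoothing temperature.
    """
    logits = np.stack([conf_harm, conf_trans]) / tau
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=0, keepdims=True)            # weights sum to 1 per unit
    return w[0] * spec_harm + w[1] * spec_trans
```

Equal confidences reduce to a plain average, and a large confidence gap makes the fusion follow one branch almost exclusively, matching the behaviour claim 7 describes.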

Description

Electric power scene audio enhancement method and system based on deep learning

Technical Field

The invention relates to the technical field of audio signal processing and speech enhancement, and in particular to an electric power scene audio enhancement method and system based on deep learning.

Background

Power dispatching and station-control teleconferences are an important communication scene for daily dispatching, emergency command, and collaborative operation in the power industry. Voice acquisition in the meeting or duty environment of a station control room, or in an area adjacent to substation equipment, is easily disturbed by industrial noise. The background noise commonly comprises: (1) continuous low-frequency buzzing caused by magnetostriction and structural vibration of the iron cores of equipment such as power transformers and reactors, whose spectrum usually appears as discrete spectral lines dominated by twice the power frequency and its harmonics (possibly accompanied by the power-frequency component and other low-frequency components), with pronounced low-frequency, pure-tone characteristics; and (2) transient impact sounds caused by the opening and closing of switching equipment such as circuit breakers and disconnectors, relay protection, and related actuator actions (console key presses and cabinet operation sounds may be superimposed indoors), which generally feature instantaneous high energy and a wide spectrum.
In the prior art, deep-learning-based speech/audio enhancement methods mostly perform masking estimation or mapping reconstruction in the time-frequency domain, and some schemes adopt a dual-branch time-domain/frequency-domain structure to fuse different representations. If a strong global low-frequency suppression or fixed masking strategy is applied to continuous low-frequency pure-tone interference, the speech fundamental frequency and its low-order harmonic components may be weakened while the buzzing is reduced, affecting the naturalness and fullness of the voice. For transient impact noise, enhancement methods based on time-frequency masking estimation may in some cases leave residual noise and processing artifacts (such as temporal smearing/tailing and musical noise), affecting auditory continuity and intelligibility. Some schemes introduce attention mechanisms or gated recurrent structures to model long-range dependence and temporal dynamics, but when low-frequency steady-state pure tones coexist with transient impulse noise in a power scene and the two differ markedly in energy and statistical characteristics, adaptively suppressing both types of noise while preserving speech fidelity still leaves room for further optimization.

Disclosure of Invention

The present invention has been made in view of the above problems in the prior art. The invention provides a deep-learning-based power scene audio enhancement method and system, which address the prior-art problems that a power scene containing strong power-frequency harmonic buzzing and transient switching pulses easily damages low-frequency speech and produces tailing distortion.
In order to solve the above technical problems, the invention provides the following technical scheme. In a first aspect, an embodiment of the invention provides a deep-learning-based power scene audio enhancement method, including: step S1, framing the noisy time-domain audio and performing a short-time Fourier transform to obtain a complex time-frequency spectrum; step S2, inputting the complex time-frequency spectrum into a harmonic suppression branch network and a transient repair branch network in parallel, wherein the harmonic suppression branch network generates a continuous frequency-domain suppression weight map according to the preset or estimated power frequency of the power grid and its harmonic frequencies, and weights the complex time-frequency spectrum to suppress steady-state harmonic noise, obtaining a first enhancement spectrum; and the transient repair branch network outputs a transient indication map and, for the time-frequency regions marked by the transient indication map, generates a substitution spectrum from the neighborhood context through a gated convolution repair network, obtaining a second enhancement spectrum; and step S3, calculating dynamic fusion weights according to the confidence values respectively output by the two branches, performing a weighted fusion of the first enhancement spectrum and the second enhancement spectrum to obtain a fused spectrum, and performing an inverse short-time Fourier transform on the fused spectrum to obtain the enhanced time-domain audio.
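The data flow of steps S1-S3 can be sketched end to end with standard STFT/ISTFT routines. This is a structural sketch only: the two branch networks are replaced by stand-ins (a fixed harmonic notch map for the suppression branch, the identity for the repair branch) and the confidences are fixed at 0.5/0.5; the grid frequency, notch parameters, and frame settings are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def harmonic_weights(freqs, f0=50.0, n_harmonics=6, half_bw=3.0, depth=0.7):
    """Continuous suppression weight map with smooth notches at k*f0."""
    w = np.ones_like(freqs)
    for k in range(1, n_harmonics + 1):
        d = np.abs(freqs - k * f0)
        m = d < half_bw
        w[m] = np.minimum(w[m],
                          1.0 - depth * 0.5 * (1.0 + np.cos(np.pi * d[m] / half_bw)))
    return w

def enhance(audio, fs=16000, nperseg=512):
    # S1: framing + short-time Fourier transform -> complex time-frequency spectrum
    freqs, times, Z = stft(audio, fs=fs, nperseg=nperseg)
    # S2a: harmonic suppression branch (weight map stands in for the network)
    first = Z * harmonic_weights(freqs)[:, None]
    # S2b: transient repair branch, stubbed as identity (no repair network here)
    second = Z
    # S3: confidence-weighted fusion with fixed 0.5/0.5 stand-in confidences
    fused = 0.5 * first + 0.5 * second
    # inverse STFT with overlap-add synthesis, trimmed to the input length
    _, out = istft(fused, fs=fs, nperseg=nperseg)
    return out[: len(audio)]
```

With the stand-in branches, a speech-band tone that avoids every notch passes through essentially unchanged, while energy at the grid harmonics would be attenuated by the averaged weight map.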