CN-121983072-A - Speech reconstruction method and system based on gating recalibration and routing weighting


Abstract

The invention discloses a speech reconstruction method and system based on gating recalibration and routing weighting. An intra-group channel gating recalibration module is introduced at the encoding end of a neural vocoder model; it adaptively recalibrates the coding features through channel grouping and a gating mechanism, improving the effectiveness of the feature representation and the quantization efficiency. In addition, a routing network is introduced in the training stage of the neural vocoder model. Based on a probability distribution that the routing network outputs from the coding features, an adaptive weighted reconstruction loss over Mel filter bank channel segments is constructed, so that at low bit rates the model focuses on reducing the reconstruction error of the channel segments corresponding to perceptually critical frequency regions. The routing network participates only in loss calculation and parameter updating during training and is removed at inference, so it introduces no additional inference cost. The invention can improve the perceived quality and stability of speech reconstruction at extremely low bit rates and is suitable for bandwidth-limited scenarios such as satellite communication and short-wave communication.

Inventors

LI YE
REN SHUXIAN
CAI TIANYU
WANG JINGXIANG
ZHANG PENG

Assignees

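Qilu University of Technology (Shandong Academy of Sciences) [齐鲁工业大学(山东省科学院)]
Shandong Computer Science Center (National Supercomputing Center in Jinan) [山东省计算中心（国家超级计算济南中心）]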

Dates

Publication Date: 2026-05-05
Application Date: 2026-04-08

Claims (10)

1. A speech reconstruction method based on gating recalibration and routing weighting, characterized by comprising the following steps: acquiring an original speech signal and preprocessing it; and inputting the preprocessed speech signal into a neural vocoder model, which outputs a reconstructed speech signal; wherein the processing of the preprocessed speech signal in the neural vocoder model comprises: the input speech signal first enters an encoder composed of a multi-stage one-dimensional convolutional downsampling structure and a sequence modeling structure, which produces coding features; the coding features are input into an intra-group channel gating recalibration module, which groups the channels of the coding features, extracts group-level global statistics, generates group gating weights, and adaptively recalibrates the coding features based on the group gating weights to obtain recalibrated features; the recalibrated features are input into a residual vector quantizer and discretized into a discrete index sequence; and, during decoding, quantized features are recovered from the discrete index sequence and input into a decoder, which obtains the reconstructed speech through upsampling and waveform reconstruction.
2. The speech reconstruction method based on gating recalibration and routing weighting according to claim 1, wherein the intra-group channel gating recalibration module comprises a feature statistics module, a gating generation module and a broadcast recalibration module.
3. The speech reconstruction method based on gating recalibration and routing weighting according to claim 2, wherein the coding features enter the feature statistics module, which performs global average pooling on the coding features to obtain a channel statistics vector, divides the channel dimension according to a preset number of groups, rearranges the statistics vector, and averages over the channels within each group to obtain a group-level description vector; the group-level description vector is input into the gating generation module, which is composed of two fully connected layers and generates the group gating weights; and the group gating weights are input into the broadcast recalibration module to obtain the recalibrated feature representation.
4. The speech reconstruction method based on gating recalibration and routing weighting according to claim 1, wherein a routing-guided adaptive weighted reconstruction loss mechanism is introduced into the training framework of the neural vocoder model and participates in loss calculation and parameter updating during the training stage.
5. The speech reconstruction method based on gating recalibration and routing weighting according to claim 4, wherein the routing-guided adaptive weighted reconstruction loss mechanism comprises: inputting the recalibrated features into a lightweight routing network formed by two one-dimensional convolution layers, and obtaining, based on the channel statistics of each time frame, a probability distribution over preset routing states, the probability distribution determining how the reconstruction loss weighting coefficients are combined; aggregating the frame-wise probabilities along the time dimension to obtain sample-level routing coefficients, and linearly combining the routing coefficients with preset weight templates to form a channel-segment weight vector acting on the Mel spectrum domain; and, in the reconstruction loss calculation, applying the weight vector to the Mel spectrum difference to obtain the routing-guided adaptive weighted reconstruction loss.
6. The speech reconstruction method based on gating recalibration and routing weighting according to claim 5, characterized in that the preset weight templates are three Mel channel-segment weight templates with different weighting characteristics, wherein the first template is a low-to-mid-frequency channel-segment weighting template, the second is a smooth transition template, and the third is a high-frequency channel-segment weighting template.
7. A speech reconstruction system based on gating recalibration and routing weighting, characterized in that it adopts the speech reconstruction method based on gating recalibration and routing weighting according to any one of claims 1-6 and comprises: a signal acquisition module, which acquires an original speech signal and preprocesses it; and a speech reconstruction module, which inputs the preprocessed speech signal into the neural vocoder model and outputs the reconstructed speech signal; wherein the processing of the preprocessed speech signal in the neural vocoder model comprises: the input speech signal first enters an encoder composed of a multi-stage one-dimensional convolutional downsampling structure and a sequence modeling structure, which produces coding features; the coding features are input into an intra-group channel gating recalibration module, which groups the channels of the coding features, extracts group-level global statistics, generates group gating weights, and adaptively recalibrates the coding features based on the group gating weights to obtain recalibrated features; the recalibrated features are input into a residual vector quantizer and discretized into a discrete index sequence; and, during decoding, quantized features are recovered from the discrete index sequence and input into a decoder, which obtains the reconstructed speech through upsampling and waveform reconstruction.
8. An electronic device, comprising: a memory for non-transitory storage of computer-readable instructions, and a processor for executing the computer-readable instructions, wherein the computer-readable instructions, when executed by the processor, perform the speech reconstruction method based on gating recalibration and routing weighting according to any one of claims 1-6.
9. A storage medium non-transitorily storing computer-readable instructions, wherein the computer-readable instructions, when executed by a computer, perform the speech reconstruction method based on gating recalibration and routing weighting according to any one of claims 1-6.
10. A computer program product comprising a computer program which, when run on one or more processors, implements the speech reconstruction method based on gating recalibration and routing weighting according to any one of claims 1-6.
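
For illustration only, the loss weighting recited in claims 5 and 6 might be sketched as follows in NumPy. This is a forward-pass sketch under stated assumptions, not the patented implementation: the routing network itself is omitted and `frame_logits` stands in for its per-frame output; the softmax, the mean over time, the L1 Mel difference, and all template values are assumptions, and the names `routed_mel_loss`, `low_mid`, `smooth` and `high` are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def routed_mel_loss(frame_logits, templates, mel_ref, mel_rec):
    """frame_logits: (K, T) routing outputs for K routing states.
    templates: (K, M) preset weight templates over M Mel channels.
    mel_ref, mel_rec: (M, T_mel) reference and reconstructed Mel spectra."""
    # Frame-wise probability distribution over the routing states (assumed softmax).
    frame_probs = softmax(frame_logits, axis=0)          # (K, T)
    # Aggregate along time to get sample-level routing coefficients (assumed mean).
    coeffs = frame_probs.mean(axis=1)                    # (K,), sums to 1
    # Linear combination of the templates gives one weight per Mel channel.
    weights = coeffs @ templates                         # (M,)
    # Apply the weights to the Mel spectrum difference (assumed L1 distance).
    diff = np.abs(mel_ref - mel_rec)                     # (M, T_mel)
    loss = float((weights[:, None] * diff).mean())
    return loss, coeffs, weights

# Three hypothetical templates over 8 Mel channels: low-to-mid emphasis,
# smooth transition, and high-frequency emphasis. The values are made up.
num_mel = 8
idx = np.arange(num_mel)
low_mid = np.where(idx < num_mel // 2, 1.5, 0.5)
smooth = np.linspace(0.75, 1.25, num_mel)
high = np.where(idx >= num_mel // 2, 1.5, 0.5)
templates = np.stack([low_mid, smooth, high])            # (3, M)

rng = np.random.default_rng(0)
frame_logits = rng.standard_normal((3, 10))
mel_ref = rng.standard_normal((num_mel, 20))
mel_rec = mel_ref + 0.1 * rng.standard_normal((num_mel, 20))
loss, coeffs, weights = routed_mel_loss(frame_logits, templates, mel_ref, mel_rec)
```

In an actual training setup this computation would live in an automatic-differentiation framework so that gradients reach both the vocoder and the routing network. Because the weighting only shapes the loss, nothing in it needs to run at inference.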

Description

Speech reconstruction method and system based on gating recalibration and routing weighting

Technical Field

The invention relates to the technical field of speech signal processing, and in particular to a speech reconstruction method and system based on gating recalibration and routing weighting.

Background

Low-rate speech coding has important practical value, and is urgently needed, in application scenarios with limited bandwidth and complex channel environments, such as satellite communication, short-wave communication, underwater acoustic communication and secure communication. A low-bit-rate speech coding scheme saves bandwidth and transmission resources, improves the utilization of the communication link, and leaves more room for speech encryption and secure transmission, making it a core supporting technology for voice communication over harsh and narrowband channels.

In recent years, neural vocoders based on deep learning have performed well in speech generation at medium and high bit rates, but they show clear bottlenecks at low bit rates: the coding features are redundant and unevenly distributed across the channel dimension; quantization resources are easily consumed by feature channels that contribute little to perception; and perceptually critical details such as high-frequency fricatives and abrupt transients are under-represented. As a result, the reconstructed speech suffers from high-frequency blurring, blunted boundaries and increased distortion.

Disclosure of Invention

To overcome the deficiencies of the prior art, the invention provides a speech reconstruction method and system based on gating recalibration and routing weighting. While keeping the main encoder-quantizer-decoder pipeline unchanged, it uses lightweight gating recalibration together with routing weighting in the training stage to improve the clarity, naturalness and detail fidelity of the reconstructed speech, and in particular to mitigate the damage that low bit rates tend to cause to high-frequency fricative details and transient structures.

In one aspect, a speech reconstruction method based on gating recalibration and routing weighting is provided, comprising: acquiring an original speech signal and preprocessing it; and inputting the preprocessed speech signal into a neural vocoder model, which outputs a reconstructed speech signal. The processing of the preprocessed speech signal in the neural vocoder model comprises the following. The input speech signal first enters an encoder composed of a multi-stage one-dimensional convolutional downsampling structure and a sequence modeling structure, which produces coding features. The coding features are input into an intra-group channel gating recalibration module, which groups the channels of the coding features, extracts group-level global statistics, generates group gating weights, and adaptively recalibrates the coding features based on the group gating weights to obtain recalibrated features. The recalibrated features are input into a residual vector quantizer and discretized into a discrete index sequence. During decoding, quantized features are recovered from the discrete index sequence and input into a decoder, which obtains the reconstructed speech through upsampling and waveform reconstruction.
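
The intra-group channel gating recalibration step described above, with the pooling, grouping and two fully connected layers recited in claim 3, might look roughly like the following NumPy sketch. It is a minimal illustration under assumptions rather than the disclosed implementation: the ReLU and sigmoid activations, the contiguous channel grouping, the toy sizes, and the names `group_gate_recalibrate`, `w1`, `b1`, `w2` and `b2` are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def group_gate_recalibrate(feats, num_groups, w1, b1, w2, b2):
    """Recalibrate encoder features of shape (C, T) with one gate per channel group."""
    num_channels, _ = feats.shape
    if num_channels % num_groups != 0:
        raise ValueError("channel count must be divisible by the group count")
    group_size = num_channels // num_groups

    # Feature statistics: global average pooling over time, one value per channel.
    channel_stats = feats.mean(axis=1)                                   # (C,)
    # Rearrange into (G, C/G) and average within each group (contiguous grouping assumed).
    group_desc = channel_stats.reshape(num_groups, group_size).mean(axis=1)  # (G,)

    # Gating generation: two fully connected layers.
    # ReLU and sigmoid are assumed activations; the source does not name them.
    hidden = np.maximum(w1 @ group_desc + b1, 0.0)
    gates = sigmoid(w2 @ hidden + b2)                                    # (G,)

    # Broadcast recalibration: every channel in a group is scaled by that group's gate.
    channel_gates = np.repeat(gates, group_size)                         # (C,)
    return feats * channel_gates[:, None], gates

# Toy example with hypothetical sizes: 8 channels, 5 frames, 4 groups, hidden width 6.
rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 5))
w1, b1 = rng.standard_normal((6, 4)), np.zeros(6)
w2, b2 = rng.standard_normal((4, 6)), np.zeros(4)
recalibrated, gates = group_gate_recalibrate(feats, 4, w1, b1, w2, b2)
print(recalibrated.shape, gates.shape)
```

Sharing one gate across a group keeps the gating layers small, since they operate on the number of groups rather than the number of channels. The output keeps the input shape, so it can be passed to the residual vector quantizer unchanged.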
Further, the intra-group channel gating recalibration module comprises a feature statistics module, a gating generation module and a broadcast recalibration module. Further, the coding features enter the feature statistics module, which performs global average pooling on the coding features to obtain a channel statistics vector, divides the channel dimension according to a preset number of groups, rearranges the statistics vector, and averages over the channels within each group to obtain a group-level description vector; the group-level description vector is input into the gating generation module, which is composed of two fully connected layers and generates the group gating weights; and the group gating weights are input into the broadcast recalibration module to obtain the recalibrated feature representation. Furthermore, a routing-guided adaptive weighted reconstruction loss mechanism is introduced into the training framework of the neural vocoder model and participates in loss calculation and parameter updating during the training stage. Further, the routing-guided adaptive weighted reconstruction loss mechanism comprises: inputting the recalibrated features into a lightweight routing network, wherein the routing network is formed by two layers of one-dimension