CN-121983068-A - Speech coding method, device, electronic equipment and computer storage medium
Abstract
The invention provides a speech coding method, a speech coding device, an electronic device, and a computer storage medium. After speech data is received, a neural-network speech coding model generates a speech coding result by fusing, according to the conditions at hand, the encoded semantic features output by a pre-trained feature encoder, the video features output by a video encoder, and the coding features output by the current speech encoder, effectively improving the sound quality of the coded speech.
Inventors
- GUO YAO
- AI YANG
- DU HUIPENG
- LING ZHENHUA
Assignees
- University of Science and Technology of China (中国科学技术大学)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-02-28
Claims (10)
- 1. A speech coding method, comprising: receiving speech data; if a target video stream currently exists, inputting the speech data and the target video stream into a neural-network speech coding model and outputting a speech coding result, wherein the target video stream is a video stream synchronized with the speech data; and if no target video stream currently exists, inputting the speech data into the neural-network speech coding model and outputting a speech coding result; wherein the neural-network speech coding model comprises at least a video encoder, a pre-trained feature encoder, an information fusion module, and a speech encoder; the video encoder is configured to encode the target video stream to obtain video features; the pre-trained feature encoder is configured to encode semantic features to obtain encoded semantic features, wherein the semantic features are extracted from the speech data; the speech encoder is configured to encode the speech to obtain coding features; if a target video stream currently exists, the information fusion module is configured to generate the speech coding result based on multi-modal high-level features and the coding features, wherein the multi-modal high-level features are obtained by fusing the video features and the encoded semantic features; and if no target video stream currently exists, the information fusion module is configured to generate the speech coding result based on the encoded semantic features, the coding features, and a distillation loss.
- 2. The speech coding method according to claim 1, wherein the speech encoder comprises a preprocessing module, an encoder backbone network, and a first post-processing module; the preprocessing module comprises a 1-dimensional deconvolution layer and a first layer-normalization unit; each convolutional network in the encoder backbone network comprises a 1-dimensional depthwise separable convolution layer, a second layer-normalization unit, a first linear layer, a global response normalization unit, and a first activation function; and the first post-processing module comprises a layer-normalization unit, a second linear layer, a downsampling 1-dimensional convolution layer, and a first 1-dimensional convolution layer.
- 3. The speech coding method according to claim 1, wherein the video encoder comprises a first analysis module, a second analysis module, and a second post-processing module; the first analysis module comprises a first 3-dimensional convolution layer, a first batch-normalization unit, and a second activation function; the second analysis module comprises a second 3-dimensional convolution layer, a second batch-normalization unit, a third activation function, and a pooling layer; and the second post-processing module comprises a third linear layer and a cascaded 1-dimensional convolution layer.
- 4. The speech coding method according to claim 1, wherein the pre-trained feature encoder comprises a second 1-dimensional convolution layer, a first encoding module, and a second encoding module.
- 5. The speech coding method according to claim 4, wherein the encoded semantic features comprise a first semantic sub-feature and a second semantic sub-feature, the first semantic sub-feature being output by the first encoding module and the second semantic sub-feature being output by the second encoding module, and wherein obtaining the multi-modal high-level features by fusing the video features and the encoded semantic features comprises: determining an attention weight based on the first semantic sub-feature and the video features; and generating the multi-modal high-level features based on the attention weight and the second semantic sub-feature.
- 6. The speech coding method according to claim 1, wherein, if the target video stream currently exists, generating the speech coding result based on the multi-modal high-level features and the coding features comprises: splicing, by the information fusion module, the multi-modal high-level features and first intermediate features to obtain first spliced features, wherein the first intermediate features are coding features generated by the i-th layer convolutional network of the speech encoder during encoding; performing dimension reduction on the first spliced features to obtain first dimension-reduced features; and inputting the first dimension-reduced features into the (i+1)-th layer convolutional network of the speech encoder, which continues encoding based on the first dimension-reduced features to generate the speech coding result.
- 7. The speech coding method according to claim 1, wherein, if no target video stream currently exists, generating the speech coding result based on the encoded semantic features, the coding features, and the distillation loss comprises: splicing, by the information fusion module, the encoded semantic features and second intermediate features to obtain second spliced features, wherein the second intermediate features are coding features generated by the i-th layer convolutional network of the speech encoder during encoding; performing dimension reduction on the second spliced features to obtain second dimension-reduced features; generating third dimension-reduced features based on the second dimension-reduced features and the distillation loss; and inputting the third dimension-reduced features into the (i+1)-th layer convolutional network of the speech encoder, which continues encoding based on the third dimension-reduced features to generate the speech coding result.
- 8. A speech coding device, comprising: a receiving unit configured to receive speech data; and an input unit configured to, if a target video stream currently exists, input the speech data and the target video stream into a neural-network speech coding model and output a speech coding result, wherein the target video stream is a video stream synchronized with the speech data; the input unit being further configured to, if no target video stream currently exists, input the speech data into the neural-network speech coding model and output a speech coding result; wherein the neural-network speech coding model comprises at least a video encoder, a pre-trained feature encoder, an information fusion module, and a speech encoder; the video encoder is configured to encode the target video stream to obtain video features; the pre-trained feature encoder is configured to encode semantic features to obtain encoded semantic features, wherein the semantic features are extracted from the speech data; the speech encoder is configured to encode the speech to obtain coding features; if a target video stream currently exists, the information fusion module is configured to generate the speech coding result based on multi-modal high-level features and the coding features, wherein the multi-modal high-level features are obtained by fusing the video features and the encoded semantic features; and if no target video stream currently exists, the information fusion module is configured to generate the speech coding result based on the encoded semantic features, the coding features, and a distillation loss.
- 9. An electronic device, comprising: one or more processors; and a storage device having one or more programs stored thereon, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the speech coding method of any one of claims 1 to 7.
- 10. A computer storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the speech coding method of any one of claims 1 to 7.
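The attention-based fusion recited in claim 5 can be illustrated with a minimal NumPy sketch. This is a generic single-head scaled dot-product attention, not the patent's exact operator: the query/key/value roles (first semantic sub-feature as query, video features as key, second semantic sub-feature as value), the absence of learned projections, and all shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_multimodal(sem_feat_1, video_feat, sem_feat_2):
    """Claim-5-style fusion (sketch): attention weights are computed from the
    first semantic sub-feature and the video features, then applied to the
    second semantic sub-feature. Shapes: (T, D) each; T frames, D channels."""
    d = sem_feat_1.shape[-1]
    # scaled dot-product attention weights between semantic and video features
    attn = softmax(sem_feat_1 @ video_feat.T / np.sqrt(d), axis=-1)  # (T, T)
    # weighted combination of the second semantic sub-feature
    return attn @ sem_feat_2  # (T, D) multi-modal high-level feature

rng = np.random.default_rng(0)
T, D = 4, 8
out = fuse_multimodal(rng.normal(size=(T, D)), rng.normal(size=(T, D)),
                      rng.normal(size=(T, D)))
print(out.shape)  # (4, 8)
```

Because each row of the attention matrix sums to 1, the output stays on the scale of the second semantic sub-feature, which is what lets the fused feature be spliced with the speech encoder's intermediate features downstream.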
Description
Speech coding method, device, electronic equipment and computer storage medium

Technical Field

The present invention relates to the field of speech signal processing technologies, and in particular to a speech coding method, a speech coding device, an electronic device, and a computer storage medium.

Background

In the field of speech signal processing, neural speech coding algorithms are a key technology for efficient speech compression and high-quality reconstruction, and are important for applications such as speech communication, speech recognition, and speech synthesis. When processing speech signals, traditional neural speech coding algorithms commonly employ framing techniques to preprocess the speech. The basic principle of framing is to divide a continuous speech signal into a series of short-time frames, each typically 10-30 milliseconds long. This processing converts the speech signal into a short-time feature sequence suitable for neural network processing. During framing, the algorithm focuses mainly on short-time information within the current frame, characterizing the speech by extracting in-frame acoustic features (such as spectrum and energy). However, this approach has significant limitations. Because frames are short, it is difficult for conventional algorithms to adequately capture contextual information in the speech. Speech signals are highly temporal and context-dependent, and adjacent frames often have close semantic and acoustic associations. For example, in speech recognition, the speech content of the current frame may form a complete semantic unit together with that of preceding frames.
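The framing step described above can be sketched in a few lines of NumPy. The 25 ms window and 10 ms hop used here are conventional choices assumed for illustration; the patent only states the 10-30 ms range.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a 1-D speech signal into overlapping short-time frames.
    frame_ms/hop_ms are typical values (25 ms window, 10 ms hop)."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    hop_len = int(sample_rate * hop_ms / 1000)      # samples between frame starts
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

speech = np.zeros(16000)      # 1 second of (silent) 16 kHz audio
frames = frame_signal(speech)
print(frames.shape)           # (98, 400): 98 frames of 400 samples each
```

Each row of the result is one short-time frame; per-frame acoustic features (spectrum, energy) would then be extracted from these rows, which is exactly where the context loss criticized above arises: each frame is analyzed in isolation.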
However, traditional framing focuses only on the features of the current frame and ignores the inter-frame context, so important speech information is lost during processing. This lack of contextual attention is particularly pronounced in the sound quality of the decoded speech: quality is often degraded, with the speech falling short of an ideal level in naturalness, smoothness, intelligibility, and the like. For example, in a speech synthesis scenario, the synthesized speech may sound stiff and unnatural due to the lack of contextual support, failing to accurately restore the prosodic and emotional characteristics of the original speech.

Disclosure of the Invention

In view of the above, the present invention provides a speech coding method, device, electronic device, and computer storage medium, which effectively improve the quality of the coded speech.
A first aspect of the present invention provides a speech coding method, comprising: receiving speech data; if a target video stream currently exists, inputting the speech data and the target video stream into a neural-network speech coding model and outputting a speech coding result, wherein the target video stream is a video stream synchronized with the speech data; and if no target video stream currently exists, inputting the speech data into the neural-network speech coding model and outputting a speech coding result. The neural-network speech coding model comprises at least a video encoder, a pre-trained feature encoder, an information fusion module, and a speech encoder. The video encoder encodes the target video stream to obtain video features; the pre-trained feature encoder encodes semantic features, extracted from the speech data, to obtain encoded semantic features; and the speech encoder encodes the speech to obtain coding features. If a target video stream currently exists, the information fusion module generates the speech coding result based on multi-modal high-level features and the coding features, where the multi-modal high-level features are obtained by fusing the video features and the encoded semantic features; if no target video stream currently exists, the information fusion module generates the speech coding result based on the encoded semantic features, the coding features, and a distillation loss.
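The fusion mechanics of the information fusion module (claims 6 and 7) amount to: concatenate the high-level feature with the i-th layer's intermediate coding feature, reduce the dimension, and feed the result to layer i+1. A minimal NumPy sketch follows; the use of a plain linear projection for the dimension reduction is an assumption, since the patent does not specify the operator.

```python
import numpy as np

def fuse_and_reduce(high_level, intermediate, w_reduce):
    """Sketch of the claim-6/7 fusion step: splice (concatenate along the
    channel axis), then project back down to the encoder's channel width so
    the (i+1)-th layer can consume the result unchanged.
    high_level, intermediate: (T, D); w_reduce: (2*D, D)."""
    spliced = np.concatenate([high_level, intermediate], axis=-1)  # (T, 2D)
    return spliced @ w_reduce                                      # (T, D)

T, D = 5, 16
rng = np.random.default_rng(1)
reduced = fuse_and_reduce(rng.normal(size=(T, D)), rng.normal(size=(T, D)),
                          rng.normal(size=(2 * D, D)))
print(reduced.shape)  # (5, 16)
```

Keeping the output width equal to the encoder's channel width is what allows the same backbone layers to run identically whether or not a video stream (and hence the multi-modal high-level feature) is present.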
Optionally, the speech encoder comprises a preprocessing module, an encoder backbone network, and a first post-processing module; the preprocessing module comprises a 1-dimensional deconvolution layer and a first layer-normalization unit; each convolutional network in the encoder backbone network comprises a 1-dimensional depthwise separable convolution layer, a second layer-normalization unit, a first linear layer, a global response normalization unit, and a first activation function; and the first post-processing module comprises a layer-normalization unit, a second linear layer, a downsampling 1-dimensional convolution layer, and a first 1-dimensional convolution layer.
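One backbone block as listed in claim 2 (depthwise 1-D convolution, layer normalization, linear layer, global response normalization, activation) can be sketched in NumPy as below. The kernel size, channel widths, ReLU choice for the activation, and 'same' padding are all illustrative assumptions; the patent names the components but not their hyperparameters.

```python
import numpy as np

def depthwise_conv1d(x, kernels):
    """x: (T, C); kernels: (C, K). Each channel is convolved with its own
    kernel ('same' padding), the defining property of a depthwise conv."""
    T, C = x.shape
    K = kernels.shape[1]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty_like(x)
    for t in range(T):
        # sum over the kernel axis, independently per channel
        out[t] = np.einsum('kc,ck->c', xp[t:t + K], kernels)
    return out

def layer_norm(x, eps=1e-5):
    """Normalize each time step over its channel dimension."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def grn(x, eps=1e-6):
    """Global response normalization: rescale each channel by its global L2
    norm relative to the mean norm across channels."""
    gx = np.linalg.norm(x, axis=0, keepdims=True)        # (1, C)
    nx = gx / (gx.mean(-1, keepdims=True) + eps)
    return x * nx

def backbone_block(x, kernels, w):
    """One encoder-backbone block in the order claim 2 lists the parts:
    depthwise conv -> layer norm -> linear -> GRN -> activation (ReLU assumed)."""
    h = depthwise_conv1d(x, kernels)   # 1-D depthwise separable conv layer
    h = layer_norm(h)                  # second layer-normalization unit
    h = h @ w                          # first linear layer
    h = grn(h)                         # global response normalization unit
    return np.maximum(h, 0.0)          # first activation function

T, C, K = 6, 8, 3
rng = np.random.default_rng(2)
y = backbone_block(rng.normal(size=(T, C)), rng.normal(size=(C, K)),
                   rng.normal(size=(C, C)))
print(y.shape)  # (6, 8)
```

The depthwise conv plus pointwise linear layer together form the "depthwise separable" pair: the conv mixes information over time within each channel, and the linear layer mixes across channels.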