CN-121619609-B - Self-adaptive voice semantic communication method based on hierarchical time sequence importance


Abstract

The invention relates to a self-adaptive voice semantic communication method based on hierarchical time sequence importance, which belongs to the field of voice analysis. The method comprises the following steps: a transmitting end prioritizes a discrete feature matrix according to the hierarchy and time sequence attributes of voice features, and screens the features according to importance scores in a feature matrix packaging and selection stage; a channel self-adaptive scheduling module interacts with the wireless communication link, introduces a channel feedback mechanism, and dynamically adjusts the transmission power and resource allocation strategy by sensing current channel state information; after signal demodulation is completed, a receiving end inputs the acquired sparse feature stream into a voice repair module, which globally reconstructs the received features using deeply pre-trained voice prior knowledge. The invention combines a hierarchical voice feature extraction technique based on a discrete codebook with a generative semantic restoration technique, so as to reduce the voice word error rate as far as possible in severe channel environments and improve semantic intelligibility at the receiving end.

Inventors

HUANG CHUAN
HONG YUHENG

Assignees

电子科技大学(深圳)高等研究院

Dates

Publication Date: 20260508
Application Date: 20260203

Claims (6)

1. A self-adaptive voice semantic communication method based on hierarchical time sequence importance, characterized by comprising the following steps: step one, a transmitting end prioritizes a discrete feature matrix according to the hierarchy and time sequence attributes of voice features, and screens the features according to importance scores in a feature matrix packaging and selection stage; step two, the screened feature packets enter a physical layer, where a channel self-adaptive scheduling module interacts with the wireless communication link, introduces a channel feedback mechanism, and dynamically adjusts the transmission power and resource allocation strategy by sensing current channel state information; step three, after completing signal demodulation, a receiving end inputs the acquired sparse feature stream into a voice repair module, which globally reconstructs the received features using deeply pre-trained voice prior knowledge; wherein the prioritizing of the discrete feature matrix in step one comprises a two-dimensional importance evaluation, consisting of a hierarchical dimension importance evaluation and a time sequence dimension importance evaluation, and a priority mapping and discarding strategy; the hierarchical dimension importance evaluation comprises: analyzing, with a semantic evaluation module, the 8 quantization layers in the discrete feature matrix extracted by the discrete codebook, measuring the contribution of each layer to the semantic restoration of the original voice, identifying, according to the contribution scores, layer 0, which carries the core semantic information, as a semantic base, and assigning layer 0 to a first priority area, so that this layer has the highest protection level during transmission and serves as the basis of subsequent voice reconstruction; the time sequence dimension importance evaluation comprises: introducing a key time sequence frame sampling mechanism in the time domain dimension, and screening the feature columns along the time axis with a fixed sampling step K to obtain key time sequence frames, wherein the feature columns meeting the sampling condition are defined as a time sequence skeleton, and the key time sequence frames in the feature layers other than layer 0 are assigned to a second priority area; the evaluation results of the hierarchical dimension importance evaluation and the time sequence dimension importance evaluation are mapped to a unified importance weight map, which comprises a core protection area and an elastic detail area; the core protection area consists of the first priority area and the second priority area, and complete transmission of its multi-level discrete features is guaranteed first during resource scheduling; the elastic detail area comprises a third priority area; when the wireless channel environment is bad or the bandwidth is limited, the multi-level discrete features of this area are actively discarded according to the adaptive rule, and the resulting gaps in acoustic detail are filled in generatively at the receiving end by the voice repair module in combination with the semantic base and the time sequence skeleton, thereby realizing high-quality communication.
2. The self-adaptive voice semantic communication method based on hierarchical time sequence importance according to claim 1, wherein the screening in step one comprises: packaging the core features with the highest priority into an enhanced-protection transmission queue, and masking or discarding the detail features of low importance according to preset rules or the real-time channel state, so as to realize semantic-level content-adaptive compression at the source.
3. The self-adaptive voice semantic communication method based on hierarchical time sequence importance according to claim 1, wherein step three is realized by constructing a comprehensive importance scoring model, a channel-aware adaptive selection strategy, dynamic threshold mapping, mask generation and sparse transmission, and objective function optimization of the generative reconstruction.
4. The self-adaptive voice semantic communication method based on hierarchical time sequence importance according to claim 3, wherein constructing the comprehensive importance scoring model comprises: for the discrete feature matrix extracted by the discrete codebook, with a given number of layers and a given number of time sequence frames, the comprehensive importance score $I(l, t)$ of any feature unit at layer $l$ and frame $t$ is defined as a linear weighted combination of a hierarchical semantic weight and a time sequence structure weight, $I(l, t) = \alpha \cdot H(l) + \beta \cdot T(t)$, where $H(l)$ is the hierarchical importance decay function, $T(t)$ is the time sequence importance sampling function, $\alpha$ and $\beta$ are weighting coefficients, and $I(l, t)$ represents the comprehensive transmission priority score of the feature unit at layer $l$ and frame $t$; considering that layer 0 carries the core semantic information while subsequent layers carry acoustic detail information, the hierarchical weight is modeled with a piecewise exponential model, $H(l) = \mathbb{I}(l = 0) + \eta e^{-\mu l} \cdot (1 - \mathbb{I}(l = 0))$, to ensure that the semantic base has the globally highest base weight, where $\mathbb{I}(\cdot)$ is the indicator function, $\mu$ is the hierarchical attenuation coefficient, and $\eta$ is the scaling factor of the detail layers; to preserve the skeleton frames that maintain the prosodic structure of voice in the time domain, $T(t)$ is defined with a discrete comb function, $T(t) = \delta(t \bmod K)$, so that when $t$ is an integer multiple of $K$, $T(t) = 1$ and the frame is identified as a key time sequence frame, where $\delta(\cdot)$ is the Kronecker delta function and $K$ is the preset grid sampling step.
5. The self-adaptive voice semantic communication method based on hierarchical time sequence importance according to claim 4, wherein the channel-aware adaptive selection strategy comprises: extracting a signal-to-noise ratio (SNR) parameter and an available bandwidth B from the channel state information fed back by the physical layer in real time, constructing a dynamic cut-off threshold based on the channel state, and generating a binary transmission mask matrix; the dynamic threshold mapping comprises: to achieve smooth adaptive control, the cut-off threshold is set by a mapping parameterized by two threshold boundaries, corresponding respectively to the bad channel and the ideal channel, a sensitivity coefficient, and a center offset; this mapping enables the system to automatically raise the decision threshold in a low signal-to-noise ratio environment, so that only high-priority features are retained.
6. The self-adaptive voice semantic communication method based on hierarchical time sequence importance according to claim 5, wherein the mask generation and sparse transmission comprises: generating the transmission mask matrix with a unit step function, the sparse feature matrix that finally enters the physical layer transmission queue being the Hadamard product of the original matrix and the mask matrix; the objective function optimization of the generative reconstruction comprises: at the receiving end, the voice repair module is configured to solve for a generative optimal solution under the constraint conditions, the voice repair module being a parameterized deep generative network that outputs the reconstructed voice, whose training and inference aim to minimize a multi-objective joint loss function composed of a semantic consistency loss and an adversarial perceptual loss.
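The scoring, thresholding and masking chain of claims 4 to 6 can be illustrated with a minimal Python sketch. The score formula I(l, t) = α·H(l) + β·T(t) and the piecewise exponential H(l) follow the claim text. Everything else is an assumption made only for illustration: the function names, all numeric parameter values, 0-based frame indexing, the convention u(0) = 1 for the unit step, and in particular the logistic form of the threshold mapping, which is not recoverable from this record.

```python
import math

def hierarchical_weight(l, eta=0.5, mu=0.3):
    # H(l): 1 for the semantic base (layer 0), eta * exp(-mu * l) for detail layers.
    return 1.0 if l == 0 else eta * math.exp(-mu * l)

def temporal_weight(t, K=4):
    # T(t): comb function, 1 on frames whose index is a multiple of K
    # (0-based frame indexing is an assumption).
    return 1.0 if t % K == 0 else 0.0

def importance_score(l, t, alpha=0.6, beta=0.4, eta=0.5, mu=0.3, K=4):
    # I(l, t) = alpha * H(l) + beta * T(t); parameter values are illustrative.
    return alpha * hierarchical_weight(l, eta, mu) + beta * temporal_weight(t, K)

def cutoff_threshold(snr_db, tau_min=0.0, tau_max=0.35, k=0.3, snr0_db=0.0):
    # ASSUMPTION: a logistic mapping between two threshold boundaries, with
    # sensitivity k and center offset snr0_db. The threshold approaches
    # tau_max at low SNR and tau_min at high SNR.
    return tau_min + (tau_max - tau_min) / (1.0 + math.exp(k * (snr_db - snr0_db)))

def transmission_mask(num_layers, num_frames, tau):
    # Unit step applied to I(l, t) - tau; u(0) = 1 is an assumed convention.
    return [[1 if importance_score(l, t) >= tau else 0
             for t in range(num_frames)]
            for l in range(num_layers)]

def sparsify(features, mask):
    # Hadamard (element-wise) product of the feature matrix and the mask.
    return [[x * m for x, m in zip(frow, mrow)]
            for frow, mrow in zip(features, mask)]

if __name__ == "__main__":
    # Toy 8-layer, 8-frame matrix of made-up token indices.
    features = [[l * 10 + t + 1 for t in range(8)] for l in range(8)]
    for snr_db in (-15.0, 20.0):
        tau = cutoff_threshold(snr_db)
        mask = transmission_mask(8, 8, tau)
        sparse = sparsify(features, mask)
        kept = sum(map(sum, mask))
        print(f"SNR {snr_db:+.0f} dB: tau={tau:.3f}, kept {kept}/64 cells")
```

With these illustrative parameters, the mask at -15 dB keeps all of layer 0 plus the key frames of the other layers (22 of 64 cells), while at 20 dB every cell is transmitted. The boundaries were chosen so that the upper threshold stays below both α and β, which is what keeps the core protection area intact in this sketch.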

Description

Self-adaptive voice semantic communication method based on hierarchical time sequence importance

Technical Field

The invention relates to the field of voice analysis, in particular to a self-adaptive voice semantic communication method based on hierarchical time sequence importance.

Background

Future communication systems urgently demand highly reliable, low-latency voice interaction, so the sixth-generation mobile network (6G) needs to achieve higher-quality semantic transmission with limited spectrum resources. Semantic communication uses artificial intelligence to extract the deep semantic features of the original voice, reducing the data volume while preserving the validity of the information. Currently, speech encoding techniques based on discrete codebooks (e.g., SpeechTokenizer) are a research hotspot in this field because of their high compression rate and anti-interference potential. However, existing voice semantic communication technology still faces the following significant problems:

1. Missing feature importance evaluation: the prior art generally treats the multi-level discrete features (tokens) extracted by a neural network as equally important information flows. In fact, the semantic content (Semantics) and acoustic details (Acoustics) of voice are distributed over different feature levels, and skeleton (Skeleton) and redundancy differ along the time axis. The prior art lacks a fine-grained distinction of this dual hierarchy and time sequence importance, so channel resources are allocated evenly and the core information cannot be given focused protection.

2. Insufficient robustness under extreme channels: in an environment with an extremely low signal-to-noise ratio (e.g., SNR ≤ -15 dB), the traditional uniform protection strategy exposes all data packets to noise pollution, and the decoded voice often contains a large amount of noise or is even completely unintelligible. Existing methods lack a non-uniform, channel-aware protection mechanism that ensures the survival of the core skeleton.

3. Single receiving-end repair mechanism: existing semantic receivers mostly rely on deterministic decoding algorithms. When some non-key features are lost in transmission, the system cannot use the prior knowledge of a generative model to repair the missing acoustic details, so the listening quality of the reconstructed voice drops sharply under severe channels.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a self-adaptive voice semantic communication method based on hierarchical time sequence importance.
The aim of the invention is achieved by the following technical scheme: a self-adaptive voice semantic communication method based on hierarchical time sequence importance, comprising the following steps: step one, a transmitting end prioritizes a discrete feature matrix according to the hierarchy and time sequence attributes of voice features, and screens the features according to importance scores in a feature matrix packaging and selection stage; step two, the screened feature packets enter a physical layer, where a channel self-adaptive scheduling module interacts with the wireless communication link, introduces a channel feedback mechanism, and dynamically adjusts the transmission power and resource allocation strategy by sensing current channel state information; step three, after completing signal demodulation, a receiving end inputs the acquired sparse feature stream into a voice repair module, which globally reconstructs the received features using deeply pre-trained voice prior knowledge.

The screening in step one comprises: packaging the core features with the highest priority into an enhanced-protection transmission queue, and masking or discarding the detail features of low importance according to preset rules or the real-time channel state, so as to realize semantic-level content-adaptive compression at the source.

The prioritizing of the discrete feature matrix in step one comprises a two-dimensional importance evaluation, consisting of a hierarchical dimension importance evaluation and a time sequence dimension importance evaluation, and a priority mapping and discarding strategy.
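The two-dimensional prioritization of step one can be sketched in Python as follows. The area definitions follow the claims: layer 0 forms the first priority area, key frames sampled at a fixed step K in the other layers form the second priority area, and all remaining cells form the third priority (elastic detail) area. The function names, 0-based frame indexing, the default sizes and the binary bad-channel flag are hypothetical choices made only for this illustration.

```python
FIRST, SECOND, THIRD = 1, 2, 3  # priority area labels (illustrative constants)

def priority_map(num_layers=8, num_frames=8, K=4):
    # Builds the unified importance map as a layers x frames grid of labels.
    pmap = []
    for l in range(num_layers):
        row = []
        for t in range(num_frames):
            if l == 0:
                row.append(FIRST)    # semantic base: always protected
            elif t % K == 0:
                row.append(SECOND)   # time sequence skeleton in detail layers
            else:
                row.append(THIRD)    # elastic detail area: may be discarded
        pmap.append(row)
    return pmap

def cells_to_send(pmap, channel_is_bad):
    # Core protection area (first + second) is always sent; the elastic
    # detail area is sent only when the channel is not bad.
    keep_up_to = SECOND if channel_is_bad else THIRD
    return [(l, t)
            for l, row in enumerate(pmap)
            for t, p in enumerate(row)
            if p <= keep_up_to]

if __name__ == "__main__":
    pmap = priority_map()
    for row in pmap:
        print(" ".join(str(p) for p in row))
    print(len(cells_to_send(pmap, channel_is_bad=True)), "cells sent on a bad channel")
```

A single boolean stands in here for the channel-state decision. In the method itself the decision is driven by the adaptive rule and the channel feedback described in step two.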
The hierarchical dimension importance evaluation comprises: analyzing, with a semantic evaluation module, the 8 quantization layers in the discrete feature matrix extracted by the discrete codebook, measuring the contribution of each layer to the semantic restoration of the original voice, identifying, according to the contribution scores, layer 0, which carries the core semantic information, as a semantic base, assigning layer 0 to a first priority area, and ensuring that the layer has the highest protection level in the transmission process and serving a