
CN-122024700-A - Voice synthesis method suitable for offline scene of Internet of things


Abstract

The invention discloses a voice synthesis method suitable for an offline scene of the Internet of Things, and relates to the technical field of voice data processing. The method comprises: obtaining an original human voice syllable sample and a target syllable sequence to be synthesized; calculating the short-time root mean square energy of each frame of the original voice; performing endpoint detection, framing, windowing, and FFT (fast Fourier transform) to obtain a frequency-domain spectrum; extracting, from each frame of the spectrum, the amplitudes higher than a preset multiple of the frame's average amplitude, together with their corresponding frequencies, as characteristic peaks; and packaging the characteristic peaks into compressed feature data packets stored locally. At synthesis time, the compressed data packets corresponding to the target syllables are retrieved from local storage; with the characteristic peaks of each frame as nodes, the spectrum amplitudes between nodes are filled in by linear interpolation; the amplitudes are scaled according to the short-time energy of each frame to obtain time-domain voice frames; and frame splicing and tone adjustment are performed to output high-quality human voice. The invention simultaneously meets the low-storage, real-time, and clear-voice-quality requirements of voice synthesis on Internet of Things devices.
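The core of the scheme is the compressed feature packet: for each analysis frame, only the short-time RMS energy and a handful of spectral peaks are kept instead of the full waveform. Below is a minimal sketch of that per-frame extraction, assuming NumPy and that the frame's time-domain samples and magnitude spectrum are already available; the 1.5x threshold and the 5-8 peak count come from claim 1, while all names and the packet layout are illustrative assumptions.

```python
import numpy as np

def frame_features(frame_time, mag_spectrum, max_peaks=8):
    """Per-frame compressed features: short-time RMS energy plus the
    strongest spectral peaks above 1.5x the frame's mean amplitude."""
    rms = float(np.sqrt(np.mean(frame_time ** 2)))        # short-time RMS energy
    candidates = np.where(mag_spectrum > 1.5 * mag_spectrum.mean())[0]
    # rank candidate bins by amplitude, descending; keep the top 5-8
    top = candidates[np.argsort(mag_spectrum[candidates])[::-1]][:max_peaks]
    peaks = [(int(k), float(mag_spectrum[k])) for k in top]  # (bin index, amplitude)
    return {"rms": rms, "peaks": peaks}                   # one entry of the packet
```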

Inventors

  • ZHANG YAN
  • ZHANG GUOFENG
  • FENG LING
  • ZHU XIANGCAI

Assignees

  • Taishan University (泰山学院)

Dates

Publication Date
2026-05-12
Application Date
2026-02-12

Claims (7)

  1. A voice synthesis method suitable for an offline scene of the Internet of Things, characterized by comprising the following steps: acquiring an original human voice syllable sample covering a target language type and a syllable sequence of the voice to be synthesized; determining the short-time root mean square energy of each frame of the original human voice syllable sample; sequentially performing endpoint detection, framing, windowing, and FFT (fast Fourier transform) on the original human voice syllable sample to obtain the frequency-domain voice spectrum of each frame; extracting, in each frame's frequency-domain voice spectrum, the amplitudes larger than 1.5 times the frame's average amplitude, sorting them in descending order, and selecting the top 5-8 amplitudes, together with the frequency corresponding to each amplitude, as characteristic peaks; packaging the short-time root mean square energy and the characteristic peaks into compressed feature data packets and importing the packets into local Flash for storage, wherein the compressed feature data corresponding to each syllable comprises a plurality of voice frames, the short-time root mean square energy corresponding to each voice frame, and a plurality of characteristic peaks corresponding to each voice frame; taking the characteristic peaks of each frame in the compressed feature data packet corresponding to each syllable as nodes, completing the frequency-domain voice spectrum amplitudes between the nodes through linear interpolation, and assigning a fixed phase offset within the range of 0 to π/2 to the completed amplitudes while keeping the phase in linear transition between adjacent nodes, to obtain the completed frequency-domain voice spectrum; adjusting the completed frequency-domain voice spectrum according to the short-time root mean square energy of each frame and transforming it back to the time domain to obtain time-domain voice frames; and sequentially performing frame splicing and tone adjustment on the time-domain voice frames of the syllable sequence of the voice to be synthesized to obtain the human voice audio of the syllable sequence to be synthesized.
  2. The voice synthesis method according to claim 1, wherein the sampling frequency of the original human voice syllable sample is 16 kHz.
  3. The voice synthesis method according to claim 1, wherein performing endpoint detection on the original human voice syllable sample specifically comprises: calculating the short-time root mean square energy of each 10 ms time window in the original human voice syllable sample, distinguishing silent segments from voice segments, and dividing the sample into a plurality of syllable signals; and calculating the number of zero crossings of the amplitude signal in each 10 ms time window, distinguishing unvoiced segments from voiced segments within each syllable signal (sketched in code after the claims).
  4. The voice synthesis method according to claim 3, characterized in that framing and windowing the original human voice syllable sample specifically comprises the following steps: setting the frame shift to 1/2 of the frame length and framing the segmented syllable signals; and applying a Hamming window to each framed syllable signal, with the formula: w(n) = 0.54 − 0.46·cos(2πn/(N − 1)), 0 ≤ n ≤ N − 1; in the formula, w(n) is the Hamming window value at the n-th sample of a syllable signal frame, and N is the total frame length of each syllable signal.
  5. The voice synthesis method suitable for an offline scene of the Internet of Things according to claim 2 or 4, wherein performing the FFT on the original human voice syllable sample specifically comprises: performing a 1024-point FFT (fast Fourier transform) on each frame of the time-domain signal in each windowed syllable signal, converting the original human voice syllable sample from the time domain into a frequency-domain voice spectrum, wherein at the 16 kHz sampling rate the 1024-point FFT covers the 0-8 kHz band with a spectral resolution of 15.6 Hz; and, if the computing power of the Internet of Things device cannot support a 1024-point FFT, replacing the 1024-point FFT with a 512-point FFT (framing, windowing, and the FFT are sketched in code after the claims).
  6. The voice synthesis method according to claim 1, wherein adjusting the completed frequency-domain voice spectrum according to the short-time root mean square energy of each frame in the compressed feature data packet corresponding to each syllable specifically comprises: multiplying the short-time root mean square energy of each frame by the amplitude of the completed frequency-domain voice spectrum, so as to adjust the completed spectrum (the interpolation and scaling are sketched in code after the claims).
  7. The voice synthesis method applicable to an offline scene of the Internet of Things according to claim 4, wherein sequentially performing frame splicing and pitch adjustment on the time-domain voice frames of the syllable sequence of the voice to be synthesized specifically comprises: because the frame shift is 1/2 of the frame length, a 50% overlap region exists between adjacent voice frames of each syllable in the syllable sequence; performing linear weighted superposition of the overlap region of the next frame with that of the previous frame, thereby splicing the time-domain voice frames; and adjusting the fundamental frequency of the spliced time-domain voice frames according to the user's preset voice synthesis requirement, which comprises: extracting the fundamental frequency of the spliced frames with an autocorrelation algorithm; if the preset requirement is to raise the pitch, multiplying the fundamental frequency by 1.2 to 1.5; and if the preset requirement is to lower the pitch, multiplying the fundamental frequency by 0.7 to 0.9 (splicing and pitch scaling are sketched in code after the claims).
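Claim 3's endpoint detection can be pictured as two short-time measures over 10 ms windows: RMS energy separates silence from speech, and zero-crossing rate separates unvoiced from voiced segments. A minimal sketch follows, assuming NumPy and a 16 kHz input (claim 2); the threshold values are assumptions, since the patent gives none.

```python
import numpy as np

FS = 16_000                  # sampling rate from claim 2
WIN = FS // 100              # 10 ms window = 160 samples

def endpoint_detect(x, energy_thresh=0.01, zcr_thresh=0.3):
    """Per-window speech/silence and voiced/unvoiced flags (claim 3).
    Both thresholds are illustrative assumptions."""
    n_win = len(x) // WIN
    wins = x[: n_win * WIN].reshape(n_win, WIN)
    rms = np.sqrt(np.mean(wins ** 2, axis=1))                          # short-time RMS energy
    zcr = np.mean(np.abs(np.diff(np.sign(wins), axis=1)) > 0, axis=1)  # zero-crossing rate
    is_speech = rms > energy_thresh                                    # silence vs. speech
    is_voiced = is_speech & (zcr < zcr_thresh)                         # voiced speech has low ZCR
    return is_speech, is_voiced
```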
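Claims 4 and 5 together define the analysis front end: frames with a shift of half the frame length, a Hamming window, and a 1024-point FFT (15.6 Hz bins at 16 kHz, with a 512-point fallback for weaker MCUs). A sketch under those numbers, assuming NumPy; the 512-sample frame length is an assumption, as the claims fix only the shift ratio.

```python
import numpy as np

N_FFT = 1024                 # 1024-point FFT; 16000 / 1024 ≈ 15.6 Hz resolution
FRAME = 512                  # frame length in samples (assumed; claims fix only the shift)

def analyze(syllable):
    """Frame, window, and transform one segmented syllable signal (claims 4-5)."""
    hop = FRAME // 2                          # frame shift = 1/2 frame length (claim 4)
    window = np.hamming(FRAME)                # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    spectra = []
    for start in range(0, len(syllable) - FRAME + 1, hop):
        frame = syllable[start : start + FRAME] * window
        spectra.append(np.abs(np.fft.rfft(frame, n=N_FFT)))  # magnitude spectrum
    return np.array(spectra)                  # one row per frame, 513 bins each
```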
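On the synthesis side (claims 1 and 6), a frame is rebuilt from its compressed packet alone: magnitudes are linearly interpolated between the stored peak nodes, a fixed phase offset in [0, π/2] is applied (read here as a single constant, one possible interpretation of the claim's "linear transition" wording), the result is scaled by the stored RMS energy, and an inverse FFT yields the time-domain frame. A sketch assuming NumPy; the π/4 offset is an assumption within the claimed range.

```python
import numpy as np

def rebuild_frame(peaks, rms, n_bins=513, phase_offset=np.pi / 4):
    """Rebuild one time-domain frame from (bin, amplitude) peaks and RMS
    energy (claims 1 and 6). phase_offset is an assumed constant in [0, pi/2]."""
    bins = np.array([k for k, _ in peaks], dtype=float)
    amps = np.array([a for _, a in peaks])
    order = np.argsort(bins)
    grid = np.arange(n_bins)
    mag = np.interp(grid, bins[order], amps[order])   # linear interpolation between nodes
    phase = np.full(n_bins, phase_offset)             # fixed offset; the linear transition
                                                      # between equal node phases is constant
    spectrum = rms * mag * np.exp(1j * phase)         # claim 6: scale by RMS energy
    return np.fft.irfft(spectrum)                     # back to the time domain
```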
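Claim 7's splicing exploits the 50% overlap left by the half-frame shift: the overlapping halves of adjacent frames are cross-faded with linear weights, and the fundamental frequency of the spliced signal is then multiplied by 1.2-1.5 (raise) or 0.7-0.9 (lower). The sketch below, assuming NumPy, realizes the pitch change by naive resampling, which also alters duration; the patent specifies only autocorrelation-based F0 extraction and the scaling factors, so this resynthesis choice is an assumption.

```python
import numpy as np

def splice_and_pitch(frames, factor=None):
    """Linear cross-fade over the 50% overlap, then optional pitch scaling
    (claim 7). factor: 1.2-1.5 raises the pitch, 0.7-0.9 lowers it."""
    n = len(frames[0])
    hop = n // 2                              # frame shift = 1/2 frame length
    ramp = np.linspace(0.0, 1.0, hop)
    out = np.zeros(hop * (len(frames) + 1))
    for i, f in enumerate(frames):
        f = np.asarray(f, dtype=float).copy()
        f[:hop] *= ramp                       # fade in over the shared region
        f[hop:] *= ramp[::-1]                 # fade out where the next frame enters
        out[i * hop : i * hop + n] += f       # linear weighted superposition
    if factor is not None:                    # naive resample: pitch (and duration) change
        idx = np.arange(0, len(out) - 1, factor)
        out = np.interp(idx, np.arange(len(out)), out)
    return out
```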

Description

Voice synthesis method suitable for offline scene of Internet of things

Technical Field

The application relates to the technical field of voice data processing, in particular to a voice synthesis method suitable for an offline scene of the Internet of Things.

Background

Internet of Things (IoT) devices are widely applied in fields such as smart home (e.g., intelligent switches and temperature controllers), industrial control (e.g., sensors and small controllers), and intelligent wearables (e.g., low-power bracelets), where offline voice interaction is one of the core functions. Devices in most of these scenes have no stable network (e.g., industrial workshops, equipment in remote areas) and must complete voice synthesis independently of online computing power, yet they commonly face the hardware limitations of small storage capacity (mostly MB-level Flash) and low computing power (mostly low-power MCUs).

Current offline voice synthesis technology for the Internet of Things mainly falls into two categories. The first is the pre-stored splicing scheme: common voice segments are recorded in advance, compressed with an audio compression algorithm (such as MP3 or AAC) or stored uncompressed on the device, and the corresponding segments are retrieved on demand at synthesis time and assembled into complete speech through frame splicing and smoothing. The second is the traditional parametric synthesis scheme: parameters of the voice signal such as linear prediction coefficients (LPC), fundamental frequency (F0), and formants are extracted and stored instead of the complete voice, and the voice waveform is reconstructed by LPC inverse filtering at synthesis time. However, the pre-stored splicing scheme requires storing a large number of complete voice fragments; a single short sentence takes roughly 5-10 KB, and common scenes require several MB of storage, far exceeding the KB-level storage budget of many IoT devices (such as intelligent sensors). Moreover, only combinations of pre-stored fragments can be synthesized; content that has not been pre-stored cannot be generated dynamically. The traditional parametric synthesis scheme, in turn, struggles to balance storage and voice quality and partly suffers from high latency, so neither approach can simultaneously meet the low-storage, real-time, and clear-voice-quality requirements of IoT devices. Therefore, a speech synthesis method suitable for the offline scene of the Internet of Things is needed that meets real-time and voice-clarity requirements while keeping storage demands low.

Disclosure of Invention

Based on this, in order to solve the above technical problems, it is necessary to provide a voice synthesis method suitable for the offline scene of the Internet of Things.
The invention adopts the following technical scheme. The invention provides a voice synthesis method suitable for an offline scene of the Internet of Things, which comprises the following steps: acquiring an original human voice syllable sample covering a target language type and a syllable sequence of the voice to be synthesized; determining the short-time root mean square energy of each frame of the original human voice syllable sample; sequentially performing endpoint detection, framing, windowing, and FFT (fast Fourier transform) on the original human voice syllable sample to obtain the frequency-domain voice spectrum of each frame; extracting, in each frame's frequency-domain voice spectrum, the amplitudes larger than 1.5 times the frame's average amplitude, sorting them in descending order, and selecting the top 5-8 amplitudes, together with the frequency corresponding to each amplitude, as characteristic peaks; packaging the short-time root mean square energy and the characteristic peaks into compressed feature data packets and importing the packets into local Flash for storage, wherein the compressed feature data corresponding to each syllable comprises a plurality of voice frames, the short-time root mean square energy corresponding to each voice frame, and a plurality of characteristic peaks corresponding to each voice frame; taking the characteristic peaks of each frame in the compressed feature data packet corresponding to each syllable as nodes, completing the frequency-domain voice spectrum amplitudes between the nodes through linear interpolation, and assigning a fixed phase offset within the range of 0 to π/2 to the completed amplitudes while keeping the phase in linear transition between adjacent nodes, to obtain the completed frequency-domain voice spectrum; and