US-12620386-B1 - Synthesizing personalized speech through adaptive excitation signal generation
Abstract
A speech synthesis system is described and may include at least one microphone; a speaker; a sensing system; at least one processor; and memory storing processor-executable instructions, which when executed by the processor, cause the processor to: detect speech-related signals emanating from the subject; generate a variable excitation signal; shape the generated variable excitation signal according to previously stored speech recordings to match one or more voice characteristics in those recordings; and cause, from the speaker and based on the shaped variable excitation signal, produced speech content that approximates the matched one or more voice characteristics.
Inventors
- John Woodruff
- James E. Kemler
- Gina Vess
- Sam Altonji
Assignees
- INCENTMED IP, LLC
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-10-08
Claims (20)
- 1 . A speech synthesis system comprising: at least one microphone; a speaker configured to be positioned within an oral cavity of a subject; a sensing system configured to detect speech-related signals; at least one processor operatively coupled to the sensing system, the speaker, and memory storing processor-executable instructions, which when executed by the processor, cause the processor to: detect, using the sensing system, speech-related signals emanating from the subject; generate, based on the detected speech-related signals emanating from the subject, a variable excitation signal, the generating comprising: automatically varying an excitation signal over time and predicting an upcoming trajectory of fundamental frequencies associated with the excitation signal, and adjusting the predicted fundamental frequencies associated with the excitation signal at predetermined time intervals to capture natural intonation patterns for the subject; shape the generated variable excitation signal according to previously stored speech recordings, the shaping comprising comparing the generated variable excitation signal to match one or more voice characteristics in the previously stored speech recordings; and cause, from the speaker and based on the shaped variable excitation signal, produced speech content that approximates the matched one or more voice characteristics in the previously stored speech recordings.
- 2 . The system of claim 1 , wherein the speaker is a straw conduit configured to audibly transmit the produced speech content into the oral cavity of the subject.
- 3 . The system of claim 1 , wherein predicting the upcoming trajectory comprises: using a first machine learning model to predict an initial excitation state corresponding to a state of the trajectory of one or more of the predicted fundamental frequencies, the states comprising an inactive state, an unvoiced state, and a voiced state; and using a second machine learning model to predict a pitch sequence when the excitation state is predicted to be voiced.
- 4 . The system of claim 1 , wherein predicting the upcoming trajectory comprises determining upcoming time periods in which the variable excitation signal is to include white noise with a lack of a fundamental frequency.
- 5 . The system of claim 1 , wherein the at least one microphone is positioned to detect acoustic signals from speech attempts performed by the subject, and wherein the system further comprises: at least one sensor positioned on the subject to detect physiological indicators of speech initiation.
- 6 . The system of claim 5 , wherein the at least one sensor is configured to detect movement associated with one or more anatomical structures of the subject and generate control signals for activating and deactivating the speaker and the at least one microphone.
- 7 . The system of claim 1 , wherein the predetermined time intervals are about 5 milliseconds to about 50 milliseconds.
- 8 . The system of claim 1 , wherein the previously stored speech recordings correspond to one or more of: digital audio recordings of speech produced by the subject, digital audio recordings of speech produced by subjects other than the subject, or a combination of the digital audio recordings of speech produced by the subject and the digital audio recordings of speech produced by subjects other than the subject.
- 9 . A computer-implemented method for generating a personalized excitation signal for a subject, the method comprising: detecting acoustic signals from an oral cavity of the subject; processing the detected signals through at least one artificial intelligence algorithm trained on banked speech corresponding to the subject; predicting upcoming excitation signals based on the processed signals, the predicting comprising processing the signals in temporal segments and determining excitation signal parameters for subsequent temporal segments; generating, based on the predicting, new excitation signals acoustically shaped according to one or more characteristics in the banked speech corresponding to the subject; and causing production of speech according to the new excitation signals, wherein the produced speech substantially matches patterns and intonation in the banked speech.
- 10 . The computer-implemented method of claim 9 , wherein causing the production of speech comprises emission of the produced speech as output through an intraoral speaker provided in the oral cavity of the subject.
- 11 . The computer-implemented method of claim 9 , wherein predicting the upcoming excitation signals comprises: comparing the detected acoustic signals to one or more characteristics in the banked speech corresponding to the subject; and minimizing differences between the generated speech and the one or more characteristics in the banked speech corresponding to the subject.
- 12 . The computer-implemented method of claim 11 , wherein the one or more characteristics comprise at least one of: audio characteristics in voice recordings captured from the subject prior to a medical procedure; and voice characteristics selected from a voice library.
- 13 . The computer-implemented method of claim 9 , wherein detecting the acoustic signals from the oral cavity are performed by a sensing system comprising: a microphone positioned to detect acoustic signals from speech attempts performed by the subject; and at least one sensor positioned on the subject to detect physiological indicators of speech initiation.
- 14 . The computer-implemented method of claim 13 , wherein the at least one sensor is positioned in a neck region or a jaw region on the subject, the at least one sensor being configured to detect movement associated with one or more anatomical structures of the oral cavity of the subject, and generate control signals for activating and deactivating a speaker and a microphone, wherein the speaker and the microphone are within a predetermined range of the neck region or the jaw region of the subject.
- 15 . A computer-implemented method for generating speech from brain signals of a subject, the method comprising: detecting, based on a brain-computer interface coupled to the subject, neural signals associated with intended speech from the subject; decoding intended speech content from the detected neural signals; predicting, based on the decoded intended speech content, an excitation signal for use in producing speech corresponding to the intended speech content; generating, based on the predicted excitation signal, a variable excitation signal that automatically changes over time to match intonation patterns associated with the intended speech content; and causing, based on the variable excitation signal, intelligible speech output corresponding to the intended speech content, wherein the intelligible speech output comprises the intended speech acoustically shaped according to one or more voice characteristics in banked speech audio recordings of the subject.
- 16 . The computer-implemented method of claim 15 , wherein detecting the neural signals comprises utilizing a machine learning model trained to recognize neural patterns associated with a plurality of predefined phonemes, words, and speech intentions.
- 17 . The computer-implemented method of claim 15 , wherein detecting the neural signals comprises: capturing neural data in time blocks representing about 5 to about 50 milliseconds of neural activity associated with the subject; transforming high-rate neural signals into feature vectors suitable for real-time processing; and maintaining processing latency within a limit that preserves natural speech timing and intonation patterns for the subject.
- 18 . The computer-implemented method of claim 15 , wherein the generated variable excitation signal comprises multiple harmonic components configured to simulate spectral characteristics of natural vocal fold vibration for the intended speech content.
- 19 . The computer-implemented method of claim 15 , further comprising: comparing the intelligible speech output with predefined speech characteristics corresponding to the intended speech content and the banked speech audio recordings; and adjusting the generated variable excitation signal based on the comparing.
- 20 . The computer-implemented method of claim 15 , wherein predicting the excitation signals comprises: automatically adjusting a fundamental frequency of the variable excitation signal at predetermined time intervals of about 5 milliseconds to about 50 milliseconds to capture natural intonation patterns associated with the intended speech content.
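The two-stage excitation prediction recited in claims 3 and 4 (a first model classifying each frame as inactive, unvoiced, or voiced, and a second model producing a pitch sequence only for voiced frames) can be sketched with simple stand-in heuristics. The energy and periodicity thresholds, autocorrelation lag, and smoothing constant below are illustrative assumptions, not values or models from the patent:

```python
import numpy as np

STATES = ("inactive", "unvoiced", "voiced")

def predict_state(frame, energy_gate=0.01, periodicity_gate=0.3):
    """Stage 1 (stand-in for the first machine learning model):
    classify a frame as inactive, unvoiced, or voiced."""
    energy = float(np.mean(frame ** 2))
    if energy < energy_gate:
        return "inactive"
    # Normalised autocorrelation at a plausible pitch lag as a periodicity cue
    lag = 80  # ~200 Hz at a 16 kHz sample rate (assumed)
    ac = float(np.dot(frame[:-lag], frame[lag:]) / (np.dot(frame, frame) + 1e-9))
    return "voiced" if ac > periodicity_gate else "unvoiced"

def predict_pitch(prev_f0, target_f0=120.0, smoothing=0.8):
    """Stage 2 (stand-in for the second machine learning model):
    one step of a smoothed F0 sequence, invoked only for voiced frames."""
    return smoothing * prev_f0 + (1 - smoothing) * target_f0

# Three 100 ms frames at 16 kHz exercising each state
sr = 16000
t = np.arange(sr // 10) / sr
voiced_frame = np.sin(2 * np.pi * 200 * t)                            # periodic
noise_frame = 0.2 * np.random.default_rng(0).standard_normal(len(t))  # aperiodic
silent_frame = np.zeros(len(t))
```

In a real system both stages would be learned models conditioned on the detected speech-related signals; the interface shown (state label in, pitch value out per frame) is the part the claims actually describe.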
Description
INCORPORATION BY REFERENCE

All publications and patent applications mentioned in this specification are herein incorporated by reference in their entirety, as if each individual publication or patent application was specifically and individually indicated to be incorporated by reference in its entirety.

TECHNICAL FIELD

This disclosure relates generally to the field of speech synthesis, and more specifically to the field of voice restoration and mimicry in subjects having vocal cord impairments.

BACKGROUND

Speech synthesis technology has evolved with the development of text-to-speech systems and voice conversion methods. Traditional electrolarynx devices provide basic voice replacement for patients with vocal cord damage and/or irreversible loss of voice, but produce robotic, monotone speech with no temporal variation in harmonics.

SUMMARY

There is a need for new and useful systems and methods for synthesizing personalized voices that map to a human vocal range and recreate the natural sound of a subject's voice. The systems described herein may synthesize personalized voices using training recordings, with systems like text-to-speech synthesis and voice conversion demonstrating regular patterns in excitation sequences when provided with linguistic information. A source-filter model of speech production may be used to identify that speech is generated by an excitation signal from the vocal folds, which may then be refined into intelligible speech by the oropharynx and/or the oral cavity through the tongue, palate, and lips. The described techniques relate to improved methods, systems, devices, and apparatuses that support techniques for generating personalized speech signals with real-time intonation and voice matching.
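The source-filter model invoked in the summary can be illustrated in a few lines: a periodic excitation (the "source", standing in for vocal-fold pulses) is passed through resonant filters (the "filter", standing in for the vocal tract). The function names, formant frequencies, and bandwidths below are illustrative assumptions, not values from the patent:

```python
import numpy as np

def glottal_pulse_train(f0_hz, duration_s, sr=16000):
    """Source: impulse-train excitation at fundamental frequency f0."""
    n = int(duration_s * sr)
    excitation = np.zeros(n)
    period = int(sr / f0_hz)
    excitation[::period] = 1.0
    return excitation

def resonator(x, center_hz, bandwidth_hz, sr=16000):
    """Filter: two-pole resonator approximating one vocal-tract formant."""
    r = np.exp(-np.pi * bandwidth_hz / sr)
    theta = 2 * np.pi * center_hz / sr
    b0 = 1 - r  # rough gain normalisation
    y = np.zeros_like(x)
    for i in range(len(x)):
        y[i] = b0 * x[i]
        if i >= 1:
            y[i] += 2 * r * np.cos(theta) * y[i - 1]
        if i >= 2:
            y[i] -= r * r * y[i - 2]
    return y

# 120 Hz excitation shaped by two formants (roughly an /a/-like vowel)
src = glottal_pulse_train(120, 0.1)
speech = resonator(resonator(src, 700, 110), 1200, 120)
```

The claimed system replaces the fixed impulse train with a predicted, time-varying excitation and derives the shaping from banked recordings, but the source-then-filter decomposition is the same.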
In some aspects, the techniques described herein relate to a speech synthesis system including: at least one microphone; a speaker configured to be positioned within an oral cavity of a subject; a sensing system configured to detect speech-related signals; at least one processor operatively coupled to the sensing system, the speaker, and memory storing processor-executable instructions, which when executed by the processor, cause the processor to: detect, using the sensing system, speech-related signals emanating from the subject; generate, based on the detected speech-related signals emanating from the subject, a variable excitation signal, the generating including: automatically varying an excitation signal over time and predicting an upcoming trajectory of fundamental frequencies associated with the excitation signal, and adjusting the predicted fundamental frequencies associated with the excitation signal at predetermined time intervals to capture natural intonation patterns for the subject; shape the generated variable excitation signal according to previously stored speech recordings, the shaping including comparing the generated variable excitation signal to match one or more voice characteristics in the previously stored speech recordings; and cause, from the speaker and based on the shaped variable excitation signal, produced speech content that approximates the matched one or more voice characteristics in the previously stored speech recordings.

In some aspects, the techniques described herein relate to a system, wherein the speaker is a straw conduit configured to audibly transmit the produced speech content into the oral cavity of the subject.
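The claimed step of adjusting predicted fundamental frequencies at predetermined time intervals (stated elsewhere in the claims as about 5 to about 50 milliseconds) can be sketched as frame-wise F0 updates with sample-level phase accumulation, so that pitch changes between frames are click-free. The sample rate, 10 ms frame length, and sawtooth waveform are illustrative assumptions, not specifics from the patent:

```python
import numpy as np

SR = 16000
FRAME_MS = 10  # within the claimed range of about 5 to about 50 ms

def variable_excitation(f0_contour_hz, sr=SR, frame_ms=FRAME_MS):
    """Excitation whose fundamental frequency follows a per-frame contour.

    F0 is held constant within each frame and updated at frame boundaries;
    phase is accumulated sample by sample so the waveform stays continuous
    across pitch changes."""
    frame_len = int(sr * frame_ms / 1000)
    phase = 0.0
    out = []
    for f0 in f0_contour_hz:
        for _ in range(frame_len):
            phase = (phase + f0 / sr) % 1.0
            out.append(2.0 * phase - 1.0)  # sawtooth-like pulse in [-1, 1)
    return np.asarray(out)

# Rising intonation: 100 Hz -> 140 Hz over 40 frames (0.4 s)
contour = np.linspace(100.0, 140.0, 40)
sig = variable_excitation(contour)
```

In the claimed system the contour would come from the trajectory-prediction models rather than a fixed ramp, and the excitation would then be spectrally shaped against the stored recordings.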
In some aspects, the techniques described herein relate to a system, wherein predicting the upcoming trajectory includes: using a first machine learning model to predict an initial excitation state corresponding to a state of the trajectory of one or more of the predicted fundamental frequencies, the states including an inactive state, an unvoiced state, and a voiced state; and using a second machine learning model to predict a pitch sequence when the excitation state is predicted to be voiced.

In some aspects, the techniques described herein relate to a system, wherein predicting the upcoming trajectory includes determining upcoming time periods in which the variable excitation signal is to include white noise with a lack of a fundamental frequency.

In some aspects, the techniques described herein relate to a system, wherein the at least one microphone is positioned to detect acoustic signals from speech attempts performed by the subject, and wherein the system further includes: at least one sensor positioned on the subject to detect physiological indicators of speech initiation.

In some aspects, the techniques described herein relate to a system, wherein the at least one sensor is configured to detect movement associated with one or more anatomical structures of the subject and generate control signals for activating and deactivating the speaker and the at least one microphone.

In some aspects, the techniques described herein relate to a system, wherein the predetermined time intervals are about 5 milliseconds to about 50 milliseconds.

In some aspects,