
US-20260128033-A1 - REAL-TIME VOICE GENERATOR SYSTEM WITH ARTIFICIAL INTELLIGENCE


Abstract

Embodiments of the present disclosure may include a real-time voice generator system with generative artificial intelligence (AI), including a processor.

Inventors

  • Mehmet Efe Akengin
  • Steve Gu

Assignees

  • BITHUMAN INC

Dates

Publication Date
2026-05-07
Application Date
2024-11-03

Claims (9)

  1. A real-time voice generator system with generative artificial intelligence (AI), comprising: a processor; a multi-modal user interface input unit coupled to the processor, wherein the multi-modal user interface input unit is configured to receive various types of inputs, wherein the various types of inputs comprise one or more of a first set of characteristics, wherein the one or more of the first set of characteristics comprise text prompts, voice personality descriptions, images, existing voice samples, documents and websites, videos, and multi-language personality profiles, wherein the text prompts are configured to describe desired voice characteristics, wherein the voice personality descriptions are configured to describe one or more of a second set of characteristics, wherein the one or more of the second set of characteristics comprise tone, pitch, accent, and gender, wherein the documents and websites are configured to match voice to content tone in the documents and websites, wherein the various types of inputs comprise contextual inputs such as language, intonation, and mood to further refine the generated voice; a real-time voice synthesis engine coupled to the processor, wherein the real-time voice synthesis engine is configured to analyze the various types of inputs and apply a generative AI model to synthesize a voice based on the various types of inputs, wherein the real-time voice synthesis engine is configured to create novel voice outputs by manipulating fundamental voice characteristics, wherein the processor is configured to transform the synthesized voice into an audio file or stream for real-time or post-generation playback, wherein the synthesized voice is configured to be customized and fine-tuned in real time based on user feedback and changing requirements; a voice persona creation engine coupled to the processor, wherein the voice persona creation engine is configured to define comprehensive voice profiles based on utility, objective, target audience, and tone; a voice mixing engine coupled to the processor, wherein the voice mixing engine is configured to mix and combine multiple high-quality base voices from multiple characters; a vector embedding system coupled to the processor, wherein the vector embedding system is configured to make precise adjustments to voice parameters; an observable voice system coupled to the processor, wherein the observable voice system is configured to enable real-time monitoring and modification of voice outputs; and a feedback mechanism that adjusts the generated voice based on user corrections or preferences provided after an initial voice is synthesized.
  2. The real-time voice generator system with generative artificial intelligence of claim 1, wherein the synthetic voice can be integrated into multimedia applications, virtual assistants, or live interactions with users in real-time.
  3. The real-time voice generator system with generative artificial intelligence of claim 1, wherein the synthesized voice is automatically optimized for different output devices, including mobile, desktop, and smart speakers.
  4. A method with generative artificial intelligence (AI) for generating a synthetic voice from various types of inputs from one or more users, comprising: receiving the various types of inputs from the one or more users via a user interface, wherein the various types of inputs comprise one or more of a first set of characteristics, wherein the one or more of the first set of characteristics comprise text prompts, voice personality descriptions, images, existing voice samples, documents and websites, videos, and multi-language personality profiles, wherein the text prompts are configured to describe desired voice characteristics, wherein the voice personality descriptions are configured to describe one or more of a second set of characteristics, wherein the one or more of the second set of characteristics comprise tone, pitch, accent, and gender, wherein the documents and websites are configured to match voice to content tone in the documents and websites, wherein the various types of inputs comprise contextual inputs such as language, intonation, and mood to further refine the generated voice; processing the various types of inputs through a generative AI model trained on a plurality of voices; generating a synthetic voice based on the various types of inputs, wherein the synthesized voice is configured to be customized and fine-tuned in real time based on user feedback and changing requirements, wherein the synthetic voice can come from combining multiple high-quality base voices from multiple characters by the generative AI model; and outputting the generated voice in an audio format.
  5. The method with generative artificial intelligence (AI) for generating a synthetic voice from various types of inputs from one or more users of claim 4, wherein the voice synthesis engine integrates voice cloning techniques to imitate or blend existing voices with newly synthesized elements.
  6. The method with generative artificial intelligence (AI) for generating a synthetic voice from various types of inputs from one or more users of claim 4, further comprising utilizing natural language processing algorithms to infer implicit voice characteristics from complex user prompts.
  7. A real-time voice generator system with generative artificial intelligence (AI), comprising: a processor; a multi-modal user interface input unit coupled to the processor, wherein the multi-modal user interface input unit is configured to receive various types of inputs, wherein the various types of inputs comprise one or more of a first set of characteristics, wherein the one or more of the first set of characteristics comprise text prompts, voice personality descriptions, images, existing voice samples, documents and websites, videos, and multi-language personality profiles; a real-time voice synthesis engine coupled to the processor, wherein the real-time voice synthesis engine is configured to analyze the various types of inputs and apply a generative AI model to synthesize a voice based on the various types of inputs, wherein the real-time voice synthesis engine is configured to create novel voice outputs by manipulating fundamental voice characteristics, wherein the processor is configured to transform the synthesized voice into an audio file or stream for real-time or post-generation playback; and a feedback mechanism that adjusts the generated voice based on user corrections or preferences provided after an initial voice is synthesized.
  8. The real-time voice generator system with generative artificial intelligence of claim 7, wherein the synthetic voice can be integrated into multimedia applications, virtual assistants, or live interactions with users in real-time.
  9. The real-time voice generator system with generative artificial intelligence of claim 7, wherein the synthesized voice is automatically optimized for different output devices, including mobile, desktop, and smart speakers.
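The end-to-end flow claimed above — receive multi-modal inputs, synthesize a voice, then adjust it through a feedback mechanism — can be loosely sketched in code. The sketch below is purely illustrative and is not the patented implementation: every name (`VoiceRequest`, `synthesize`, `apply_feedback`) is hypothetical, and the synthesis step returns placeholder bytes rather than real audio.

```python
from dataclasses import dataclass, field

@dataclass
class VoiceRequest:
    # First set of characteristics (claim 1): any subset may be supplied.
    text_prompt: str = ""
    personality_description: str = ""   # tone, pitch, accent, gender
    voice_samples: list = field(default_factory=list)
    # Contextual inputs that further refine the generated voice.
    language: str = "en"
    mood: str = "neutral"

def synthesize(request: VoiceRequest) -> bytes:
    """Stand-in for the generative AI synthesis engine: analyzes the
    inputs and returns audio. Here it just encodes a description."""
    description = f"{request.personality_description} ({request.mood}, {request.language})"
    return description.encode("utf-8")  # placeholder for a real audio stream

def apply_feedback(audio: bytes, correction: str) -> bytes:
    """Feedback mechanism: adjust the generated voice after initial
    synthesis, based on user corrections or preferences."""
    return audio + f" | adjusted: {correction}".encode("utf-8")

req = VoiceRequest(text_prompt="warm narrator voice",
                   personality_description="low pitch, British accent")
audio = synthesize(req)
audio = apply_feedback(audio, "slightly faster pacing")
```

A production system would replace both stubs with calls into a trained generative model; the point of the sketch is only the claimed sequencing: inputs in, voice out, feedback applied after the initial result.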

Description

BACKGROUND OF THE INVENTION

Embodiments of the present disclosure may include a real-time voice generator system with generative artificial intelligence (AI).

BRIEF SUMMARY

Embodiments of the present disclosure may include a real-time voice generator system with generative artificial intelligence (AI), including a processor. Embodiments may also include a multi-modal user interface input unit coupled to the processor. In some embodiments, the multi-modal user interface input unit may be configured to receive various types of inputs. In some embodiments, the various types of inputs may include one or more of a first set of characteristics. In some embodiments, the one or more of the first set of characteristics may include text prompts, voice personality descriptions, images, existing voice samples, documents and websites, videos, and multi-language personality profiles. In some embodiments, the text prompts may be configured to describe desired voice characteristics. In some embodiments, the voice personality descriptions may be configured to describe one or more of a second set of characteristics. In some embodiments, the one or more of the second set of characteristics may include tone, pitch, accent, and gender. In some embodiments, the documents and websites may be configured to match voice to content tone in the documents and websites. In some embodiments, the various types of inputs may include contextual inputs such as language, intonation, and mood to further refine the generated voice.

Embodiments may also include a real-time voice synthesis engine coupled to the processor. In some embodiments, the real-time voice synthesis engine may be configured to analyze the various types of inputs and apply a generative AI model to synthesize a voice based on the various types of inputs. In some embodiments, the real-time voice synthesis engine may be configured to create novel voice outputs by manipulating fundamental voice characteristics.
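The summary above repeatedly refers to voice personality descriptions that describe a "second set of characteristics" (tone, pitch, accent, gender). One hypothetical way such free-text descriptions could be mapped to structured voice parameters is simple pattern extraction; the vocabulary and attribute names below are illustrative assumptions, not part of the disclosure.

```python
import re

# Hypothetical vocabulary for the "second set of characteristics":
# tone, pitch, accent, and gender extracted from a free-text description.
ATTRIBUTES = {
    "pitch":  re.compile(r"\b(low|medium|high) pitch\b"),
    "gender": re.compile(r"\b(male|female|neutral)\b"),
    "accent": re.compile(r"\b(british|american|australian) accent\b"),
    "tone":   re.compile(r"\b(warm|formal|playful) tone\b"),
}

def parse_description(text: str) -> dict:
    """Extract structured voice characteristics from a personality
    description; attributes not mentioned are simply omitted."""
    text = text.lower()
    found = {}
    for name, pattern in ATTRIBUTES.items():
        match = pattern.search(text)
        if match:
            found[name] = match.group(1)
    return found

parse_description("A warm tone, low pitch female voice with a British accent")
# → {'pitch': 'low', 'gender': 'female', 'accent': 'british', 'tone': 'warm'}
```

A real system would more plausibly use a language model to infer implicit characteristics (as claim 6 suggests); the regex version only makes the input-to-parameter mapping concrete.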
In some embodiments, the processor may be configured to transform the synthesized voice into an audio file or stream for real-time or post-generation playback. In some embodiments, the synthesized voice may be configured to be customized and fine-tuned in real time based on user feedback and changing requirements. Embodiments may also include a voice persona creation engine coupled to the processor. In some embodiments, the voice persona creation engine may be configured to define comprehensive voice profiles based on utility, objective, target audience, and tone. Embodiments may also include a voice mixing engine coupled to the processor. In some embodiments, the voice mixing engine may be configured to mix and combine multiple high-quality base voices from multiple characters. Embodiments may also include a vector embedding system coupled to the processor. In some embodiments, the vector embedding system may be configured to make precise adjustments to voice parameters. Embodiments may also include an observable voice system coupled to the processor. In some embodiments, the observable voice system may be configured to enable real-time monitoring and modification of voice outputs. Embodiments may also include a feedback mechanism that adjusts the generated voice based on user corrections or preferences provided after an initial voice is synthesized. In some embodiments, the synthetic voice can be integrated into multimedia applications, virtual assistants, or live interactions with users in real-time. In some embodiments, the synthesized voice may be automatically optimized for different output devices, including mobile, desktop, and smart speakers.

Embodiments of the present disclosure may also include a method with generative artificial intelligence (AI) for generating a synthetic voice from various types of inputs from one or more users, receiving the various types of inputs from the one or more users via a user interface.
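The voice mixing engine and the vector embedding system described above are reminiscent of a common technique in embedding-based speech synthesis: representing each base voice as a vector and blending or nudging those vectors. The sketch below illustrates that general idea only — the function names, the three-dimensional embeddings, and the weights are all invented for illustration and are not the patented method.

```python
def mix_voices(embeddings, weights):
    """Blend multiple base-voice embedding vectors with normalized
    weights — one illustrative way a mixing engine could combine
    high-quality base voices from multiple characters."""
    total = sum(weights)
    norm = [w / total for w in weights]
    dim = len(embeddings[0])
    return [sum(w * e[i] for w, e in zip(norm, embeddings)) for i in range(dim)]

def adjust(embedding, direction, amount):
    """Vector-embedding-style precise adjustment: nudge a voice embedding
    along a parameter direction (e.g. 'brightness') by a small amount."""
    return [e + amount * d for e, d in zip(embedding, direction)]

# Toy 3-dimensional embeddings; real speaker embeddings are much larger.
narrator = [0.2, 0.8, 0.1]
announcer = [0.6, 0.0, 0.5]
blended = mix_voices([narrator, announcer], weights=[3, 1])  # 75/25 blend
brighter = adjust(blended, direction=[0.0, 0.0, 1.0], amount=0.1)
```

Weighted interpolation keeps the blend inside the span of the base voices, while the direction-based adjustment gives the fine-grained, observable control over individual parameters that the summary attributes to the vector embedding system.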
In some embodiments, the various types of inputs may include one or more of a first set of characteristics. In some embodiments, the one or more of the first set of characteristics may include text prompts, voice personality descriptions, images, existing voice samples, documents and websites, videos, and multi-language personality profiles. In some embodiments, the text prompts may be configured to describe desired voice characteristics. In some embodiments, the voice personality descriptions may be configured to describe one or more of a second set of characteristics. In some embodiments, the one or more of the second set of characteristics may include tone, pitch, accent, and gender. In some embodiments, the documents and websites may be configured to match voice to content tone in the documents and websites. In some embodiments, the various types of inputs may include contextual inputs such as language, intonation, and mood to further refine the generated voice. Embodiments may also include processing the various types of inputs through a generative AI model trained on a plurality of voices.
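Both the system and method embodiments emphasize that the synthesized voice is delivered as an audio file or stream for real-time playback. A minimal sketch of the streaming half of that idea, assuming the audio already exists as bytes, is chunked delivery so playback can begin before generation completes; the function name and chunk size are hypothetical, and a real system would stream codec frames rather than raw byte slices.

```python
def stream_audio(audio: bytes, chunk_size: int = 4):
    """Yield fixed-size chunks of synthesized audio so a client can start
    real-time playback before the full output is available (illustrative;
    chunk_size is arbitrary here and far larger in practice)."""
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]

# Placeholder bytes stand in for synthesized PCM data.
chunks = list(stream_audio(b"synthesized-voice-pcm"))
```

Because the generator yields as it goes, the same shape works whether the audio is fully synthesized up front (post-generation playback) or produced incrementally by the model (real-time playback).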