
US-20260129270-A1 - GENERATING TRANSLATED MEDIA STREAMS WITH ACCENT

US 20260129270 A1

Abstract

Systems, methods, and non-transitory computer-readable media for generating media streams are provided. The generation of the media streams involves receiving an input media stream including an individual speaking in a first language; determining from the input media stream a set of speech-related characteristics associated with the individual; generating, using the set of speech-related characteristics associated with the individual, a synthesized voice for the individual; and producing an output media stream comprising, for each set of words spoken by the individual in the first language in the input media stream, a translated set of words in a target language using the synthesized voice.
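As a rough, non-authoritative illustration of the pipeline the abstract describes (receive a stream, extract speech-related characteristics, synthesize a voice, emit translated words in that voice), the sketch below uses entirely hypothetical names and toy stand-ins: the "voice profile" is a fixed record rather than real signal analysis, and translation is a word-for-word lookup table. It is not the patented implementation.

```python
from dataclasses import dataclass

@dataclass
class VoiceProfile:
    # Hypothetical stand-in for the "speech-related characteristics"
    # the abstract mentions (e.g., pitch, accent).
    pitch_hz: float
    accent: str

@dataclass
class Segment:
    speaker: str
    words: list

def extract_voice_profile(segments, speaker):
    # Toy "analysis": a real system would process the audio signal;
    # here we simply return fixed traits for the speaker.
    return VoiceProfile(pitch_hz=180.0, accent="none")

def translate(words, lexicon):
    # Toy word-for-word translation via a lookup table; unknown
    # words pass through unchanged.
    return [lexicon.get(w, w) for w in words]

def revoice_stream(segments, speaker, lexicon):
    # Pair the speaker's voice profile with translated segments,
    # mirroring the abstract's output-stream step.
    profile = extract_voice_profile(segments, speaker)
    out = [Segment(speaker, translate(seg.words, lexicon))
           for seg in segments if seg.speaker == speaker]
    return profile, out
```

In a real system the returned profile would drive a speech synthesizer so the translated words are rendered in the original speaker's voice.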

Inventors

  • Ben Avi Ingel
  • Ron Zass

Assignees

  • VIDUBLY LTD

Dates

Publication Date
May 7, 2026
Application Date
Jan. 5, 2026

Claims (20)

  1. 1.-21. (canceled)
  2. 22. A computer program product for generating media streams, the computer program product embodied in a non-transitory computer-readable medium and including instructions that when executed by at least one processor cause the at least one processor to execute a method, the method comprising: receiving an input media stream including a first individual speaking first words in a first language and a second individual speaking second words associated with a second language; determining from the input media stream a first set of speech-related characteristics associated with the first individual; generating, using the first set of speech-related characteristics associated with the first individual, a first synthesized voice for the first individual; determining from the input media stream a second set of speech-related characteristics associated with the second individual; generating, using the second set of speech-related characteristics, a second synthesized voice for the second individual; and producing an output media stream comprising a first plurality of words corresponding to the first words articulated using the first synthesized voice and a second plurality of words corresponding to the second words articulated using the second synthesized voice; wherein both the first plurality of words and the second plurality of words are spoken in a target language, and the second plurality of words is spoken in the target language with an accent of the second language.
  3. 23. The computer program product of claim 22, wherein, when the first language is English, the second language is Russian, and the target language is German, the first plurality of words in the output media stream are spoken in German without a Russian accent and the second plurality of words in the output media stream are spoken in German with a Russian accent.
  4. 24. The computer program product of claim 22, wherein the second words recorded in the input media stream are spoken in the first language with an accent of the second language.
  5. 25. The computer program product of claim 22, wherein the second words recorded in the input media stream are spoken in the second language.
  6. 26. The computer program product of claim 22, wherein the method further includes determining a desired level of accent for the second synthesized voice to be introduced into the output media stream, and wherein the second plurality of words are spoken in the target language using the second synthesized voice with the desired level of accent.
  7. 27. The computer program product of claim 22, wherein the method further includes determining at least one factor indicative of a desired level of accent to be introduced into the output media stream, based on the target language.
  8. 28. The computer program product of claim 22, wherein the method further includes accessing one or more databases to determine at least one factor indicative of a desired level of accent to be introduced into the output media stream.
  9. 29. The computer program product of claim 22, wherein the method further includes accessing rules regarding which languages to dub with a level of accent.
  10. 30. The computer program product of claim 22, wherein the first words and the second words are translated into the target language in a manner that takes into account genders of the first individual and the second individual.
  11. 31. The computer program product of claim 22, wherein the first set of speech-related characteristics includes voice characteristics of the first individual.
  12. 32. The computer program product of claim 31, wherein the voice characteristics include at least one of: prosodic characteristics, pitch, loudness, intonation, or stress of a voice of the first individual.
  13. 33. The computer program product of claim 22, wherein the first set of speech-related characteristics includes articulation characteristics of the first individual.
  14. 34. The computer program product of claim 33, wherein the articulation characteristics include an accent of the speech.
  15. 35. The computer program product of claim 22, wherein the method further includes causing a display in a graphical user interface (GUI) of information indicative of a plurality of available target languages; and receiving, via the GUI, a selection of the target language.
  16. 36. The computer program product of claim 22, wherein the input media stream is associated with a real-time conversation between the first individual, the second individual, and at least one other individual, and the method further includes determining the target language based on an identity of the at least one other individual.
  17. 37. The computer program product of claim 22, wherein the method further includes storing data associated with at least one of the first set of speech-related characteristics or the second set of speech-related characteristics for future generation of other media streams using at least one of the first synthesized voice or the second synthesized voice.
  18. 38. The computer program product of claim 22, wherein the input media stream is associated with a real-time conversation between the first individual, the second individual, and at least one other individual, and the method further includes changing at least one of the first synthesized voice or the second synthesized voice as the real-time conversation progresses to improve how at least one of the first individual or the second individual sounds when speaking the target language.
  19. 39. The computer program product of claim 22, wherein the method further includes receiving input indicative of a preferred accent level and determining a level of the accent of the second plurality of words spoken in the target language based on the received input.
  20. 40. A system for generating translated media streams, comprising: a microphone for recording an input media stream of a first individual speaking first words in a first language and a second individual speaking second words associated with a second language; at least one processing device configured to: receive the input media stream; determine from the input media stream a first set of speech-related characteristics associated with the first individual; generate, using the first set of speech-related characteristics, a first synthesized voice for the first individual; determine from the input media stream a second set of speech-related characteristics associated with the second individual; generate, using the second set of speech-related characteristics, a second synthesized voice for the second individual; and produce an output media stream comprising a first plurality of words corresponding to the first words articulated using the first synthesized voice and a second plurality of words corresponding to the second words articulated using the second synthesized voice; wherein both the first plurality of words and the second plurality of words are spoken in a target language, and the second plurality of words is spoken in the target language with an accent of the second language.
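Claims 26-29 and 39 describe choosing an accent level from stored rules (possibly keyed by the target language) with an optional user-supplied preference. A minimal sketch of that decision logic, assuming a hypothetical rules table and function names not taken from the patent, might look like this; the example pair of languages mirrors claim 23 (English and Russian speakers dubbed into German):

```python
# Hypothetical rules table: how strongly to render a speaker's
# native accent when dubbing into a given target language
# (0.0 = no accent, 1.0 = full accent), in the spirit of
# claims 26-29.
ACCENT_RULES = {
    ("Russian", "German"): 0.8,
    ("English", "German"): 0.0,
}

def accent_level(speaker_language, target_language, user_preference=None):
    # A user-supplied preferred accent level (claim 39) overrides
    # the stored rules; unknown language pairs default to no accent.
    if user_preference is not None:
        return user_preference
    return ACCENT_RULES.get((speaker_language, target_language), 0.0)
```

A production system would presumably query the "one or more databases" of claim 28 rather than an in-memory table, but the lookup-then-override shape is the same.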

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 19/027,088, filed Jan. 17, 2025 (pending), which is a continuation of U.S. patent application Ser. No. 18/643,486, filed Apr. 23, 2024 (now U.S. Pat. No. 12,279,022), which is a continuation of U.S. patent application Ser. No. 18/097,900, filed Jan. 17, 2023 (now U.S. Pat. No. 12,010,399), which is a continuation of U.S. patent application Ser. No. 17/460,644, filed Aug. 30, 2021 (now U.S. Pat. No. 11,595,738), which is a continuation of U.S. patent application Ser. No. 16/813,984, filed Mar. 10, 2020 (now U.S. Pat. No. 11,140,459), which claims the benefit of U.S. Provisional Patent Application No. 62/816,137, filed on Mar. 10, 2019, and U.S. Provisional Patent Application No. 62/822,856, filed on Mar. 23, 2019. This application is also a continuation-in-part of U.S. patent application Ser. No. 17/467,236, filed Sep. 5, 2021 (pending), which is a continuation of U.S. patent application Ser. No. 16/777,097, filed Jan. 30, 2020 (now U.S. Pat. No. 11,159,597), which claims the benefit of U.S. Provisional Patent Application No. 62/822,856, filed on Mar. 23, 2019, U.S. Provisional Patent Application No. 62/816,137, filed on Mar. 10, 2019, and U.S. Provisional Patent Application No. 62/799,970, filed on Feb. 1, 2019. The entire contents of all of the above-identified applications are herein incorporated by reference.

BACKGROUND OF THE INVENTION

Technological Field

The disclosed embodiments generally relate to systems and methods for generating media streams. More particularly, the disclosed embodiments relate to systems and methods for generating personalized videos from textual information and user preferences.

Background Information

Thousands of original media streams are created for entertainment on a daily basis, such as personal home videos, vlogs, TV series, movies, podcasts, live radio shows, and more.
Without the long and tedious process of professional dubbing, the vast majority of these media streams are available for consumption by only a fraction of the world population. Existing technologies, such as neural machine translation services that can deliver real-time subtitles, offer only a partial solution to the language barrier. Yet for many people, consuming content with subtitles is not a viable option, and many others consider it less pleasant. The disclosed embodiments are directed to providing new and improved ways of generating artificial voices for dubbing, and more specifically to systems, methods, and devices for generating revoiced audio streams that sound as if the individuals in the original audio stream were speaking the target language.

SUMMARY OF THE INVENTION

Embodiments consistent with the present disclosure provide systems, methods, and devices for generating media streams for dubbing purposes and for generating personalized media streams. In one embodiment, a method for artificially generating a revoiced media stream is provided.
The method comprises: receiving a media stream including an individual speaking in an origin language, wherein the individual is associated with a particular voice; obtaining a transcript of the media stream including utterances spoken in the origin language; translating the transcript of the media stream to a target language, wherein the translated transcript includes a set of words in the target language for each of at least some of the utterances spoken in the origin language; analyzing the media stream to determine a voice profile for the individual, wherein the voice profile includes characteristics of the particular voice; determining a synthesized voice for a virtual entity intended to dub the individual, wherein the synthesized voice has characteristics identical to the characteristics of the particular voice; and generating a revoiced media stream in which the translated transcript in the target language is spoken by the virtual entity.

In another embodiment, a method for artificially generating a revoiced media stream is provided. The method comprises: receiving a media stream including a plurality of first individuals speaking in a primary language and at least one second individual speaking in a secondary language; obtaining a transcript of the received media stream associated with utterances in the primary language and utterances in the secondary language; determining that dubbing of the utterances in the primary language to a target language is needed and that dubbing of the utterances in the secondary language to the target language is unneeded; analyzing the received media stream to determine a set of voice parameters for each of the plurality of first individuals; determining a voice profile for each of the plurality of first individuals based on an associated set of voice parameters; and using the determined voice profiles and a translated version of the transcript to artificially generate a revoiced media stream i