US-12627724-B2 - Systems and methods for artificial dubbing
Abstract
Methods, systems, and computer-readable media for artificially generating a revoiced media stream are provided. In one implementation, a system may receive a media stream including an individual with a particular voice speaking in an origin language. The system may obtain a transcript of the media stream including utterances spoken in the origin language and translate the transcript to a target language. The translated transcript may include a set of words in the target language for each of at least some of the utterances spoken in the origin language. The system may analyze the media stream to determine a voice profile for the individual. Thereafter, the system may determine a synthesized voice, similar to the particular voice, for a virtual entity intended to dub the individual. Then, the system may generate a revoiced media stream in which the translated transcript in the target language is spoken by the virtual entity.
Inventors
- Ben Avi Ingel
- Ron Zass
Assignees
- VIDUBLY LTD
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2021-09-05
Claims (20)
- 1 . A computer program product for artificially generating a revoiced media stream, the computer program product embodied in a non-transitory computer-readable medium and including instructions for causing at least one processor to execute a method comprising:
  receiving a single media stream including utterances spoken in an origin language by an individual and sounds from a sound-emanating object, wherein the individual is associated with a particular voice;
  analyzing the single media stream to identify a first word that the individual spoke while being in a first emotional state and a second word that the individual spoke while being in a second emotional state;
  using a neural network to process the single media stream for determining a voice profile specific to the individual recorded in the single media stream, wherein the voice profile is indicative of a manner in which the first word and the second word spoken in the origin language are pronounced by the individual in the received single media stream, the voice profile includes first characteristics of the particular voice for a first speech segment associated with the first emotional state of the individual and second characteristics of the particular voice for a second speech segment associated with the second emotional state of the individual, and the second characteristics differ from the first characteristics;
  receiving an indication of a desired value of at least one characteristic in the voice profile specific to the individual recorded in the single media stream;
  updating the voice profile specific to the individual recorded in the single media stream based on the received indication;
  based on the updated voice profile, determining a synthesized voice for dubbing the first speech segment and the second speech segment to a target language, wherein the synthesized voice sounds like the particular voice;
  generating an artificial dubbed version of the received single media stream, the artificial dubbed version including a dubbed version of the first speech segment having the first characteristics associated with the first emotional state of the individual and a dubbed version of the second speech segment having the second characteristics associated with the second emotional state of the individual;
  determining an auditory relationship between the individual and the sound-emanating object, wherein the auditory relationship is indicative of a ratio of volume levels between the utterances spoken by the individual in the origin language and the sounds from the sound-emanating object as they are recorded in the single media stream; and
  determining a category of the sound-emanating object;
  wherein: when the sound-emanating object is from a first category, a ratio of the volume levels between utterances spoken in the target language in the artificial dubbed version of the received single media stream and sounds from the sound-emanating object is maintained substantially identical to the ratio of volume levels between utterances spoken in the origin language and sounds from the sound-emanating object as they are recorded in the single media stream; and when the sound-emanating object is from a second category, the ratio of the volume levels between utterances spoken in the target language in the artificial dubbed version of the received single media stream and sounds from the sound-emanating object is changed to reduce a relative volume level of the sound-emanating object.
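As an editorial illustration of the category-dependent mixing rule at the end of claim 1: a minimal sketch, assuming RMS level in decibels as the volume measure. The category labels, helper names, and the 6 dB attenuation are assumptions for demonstration, not taken from the patent.

```python
import numpy as np

def rms_db(samples: np.ndarray) -> float:
    """RMS level of an audio segment, in decibels."""
    rms = np.sqrt(np.mean(np.square(samples.astype(np.float64))))
    return 20.0 * np.log10(max(rms, 1e-12))

def object_gain_db(original_speech: np.ndarray,
                   object_sound: np.ndarray,
                   dubbed_speech: np.ndarray,
                   category: str,
                   reduction_db: float = 6.0) -> float:
    """Gain (dB) to apply to the object's sound when mixing the dub.

    First-category objects keep the recorded speech-to-object ratio;
    second-category objects are attenuated relative to the dubbed speech.
    """
    recorded_ratio = rms_db(original_speech) - rms_db(object_sound)
    # Preserve the ratio for the first category; widen it otherwise.
    target_ratio = recorded_ratio if category == "first" else recorded_ratio + reduction_db
    target_object_level = rms_db(dubbed_speech) - target_ratio
    return target_object_level - rms_db(object_sound)
```

With dubbed speech at the original speech level, a first-category object gets 0 dB of gain (ratio preserved) and a second-category object gets -6 dB (relative volume reduced), which is the behavior the claim distinguishes.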
- 2 . The computer program product of claim 1 , wherein the single media stream is a real-time conversation including at least one additional individual speaking a secondary language, and the method further includes: identifying specific utterances spoken by the at least one additional individual as background chatter; and generating an artificial dubbed version of the received single media stream in which the individual speaks the target language using the synthesized voice and the at least one additional individual speaks the secondary language at a reduced volume.
- 3 . The computer program product of claim 1 , wherein the single media stream includes at least one additional individual speaking a secondary language, and the method further includes: receiving input indicative of user preferences indicating that utterances spoken in the secondary language should be dubbed; and based on the input indicative of user preferences, generating an artificial dubbed version of the received single media stream in which the individual and the at least one additional individual speak the target language using synthesized voices.
- 4 . The computer program product of claim 1 , wherein in the single media stream, the individual speaks first words with an accent in a second language and second words without an accent in the second language, and the method further includes: using the neural network to update the voice profile for indicating that the first words were pronounced with the accent in the second language while the second words were not pronounced with the accent; and using the synthesized voice to artificially generate the dubbed version of the received single media stream in which the individual speaks the first words in the target language with the accent in the second language and speaks the second words in the target language without the accent in the second language.
- 5 . The computer program product of claim 1 , further comprising: obtaining a transcript of the single media stream; using an artificial neural network to analyze the transcript and to determine whether dubbing of words is needed in different languages; based on at least one rule for revising transcripts of media streams, automatically revising a first part of the transcript and refraining from revising a second part of the transcript; and generating the artificial dubbed version of the received single media stream using the synthesized voice that includes a dubbed version of the first and second parts of the transcript in the target language.
- 6 . The computer program product of claim 1 , wherein the single media stream is destined for a particular user, and the method further includes: obtaining a transcript of the single media stream; determining a user category indicative of a desired vocabulary based on demographic or behavioral data associated with the particular user; based on the determined user category for the particular user, revising the transcript of the single media stream in accordance with the desired vocabulary associated with the user category; and using the synthesized voice and the revised transcript to artificially generate the artificial dubbed version of the received single media stream.
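Claim 6's vocabulary revision can be pictured as a simple substitution pass over the transcript. The user categories and the word table below are invented examples; a production system would presumably use trained models rather than a lookup table.

```python
# Toy sketch of claim 6's category-driven vocabulary revision.
# Category names and substitutions are hypothetical.
VOCABULARY_BY_CATEGORY = {
    "children": {"purchase": "buy", "automobile": "car"},
    "formal": {"buy": "purchase", "car": "automobile"},
}

def revise_transcript(transcript: str, user_category: str) -> str:
    """Replace words according to the vocabulary for the user category."""
    table = VOCABULARY_BY_CATEGORY.get(user_category, {})
    return " ".join(table.get(word.lower(), word) for word in transcript.split())

print(revise_transcript("please purchase an automobile", "children"))
# -> "please buy an car"-style output; a real reviser would also fix agreement
```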
- 7 . The computer program product of claim 1 , wherein the single media stream is destined for a particular user, and the method further includes: obtaining a transcript of the single media stream; receiving preferred language characteristics associated with the particular user, the preferred language characteristics including at least one of a language register, dialect, style, or level of slang; translating the transcript of the single media stream to the target language based on the received preferred language characteristics associated with the particular user; and using the synthesized voice and the translated transcript to artificially generate the artificial dubbed version of the received single media stream.
- 8 . The computer program product of claim 1 , wherein the method further includes: obtaining a transcript of the single media stream; analyzing the transcript to determine that the transcript includes a subject likely to be unfamiliar to users associated with the target language, wherein the subject comprises at least one of a public figure, an event, or a cultural reference; and presenting in the artificial dubbed version of the received single media stream a visual explanation in the target language of the subject discussed in the origin language.
- 9 . The computer program product of claim 1 , wherein the method further includes: obtaining a transcript of the single media stream; analyzing the transcript to determine that an original name of a character in the received single media stream is likely to cause antagonism among users who speak the target language, wherein the determination is based on at least one of pronunciation difficulty, religious significance, historical association, or resemblance to a public figure; and wherein in the artificial dubbed version of the received single media stream the character has a substitute name.
- 10 . The computer program product of claim 1 , wherein the method further includes: obtaining a transcript of the single media stream; using a trained machine learning model to analyze the transcript and determine that the transcript includes a first sentence ending with a first utterance that rhymes with a second utterance that ends a second sentence; and translating the first sentence such that it ends with a first word in the target language, and translating the second sentence such that it ends with a second word in the target language, wherein the second word rhymes with the first word.
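A toy version of the rhyme check in claim 10, using a crude vowel-suffix key on the final word of each sentence. A real implementation would compare phoneme transcriptions (e.g., from a grapheme-to-phoneme model); this heuristic is only an illustrative stand-in, not the patented method.

```python
# Crude rhyme detector: two sentences "rhyme" if their last words share
# everything from the final vowel group onward. English-only toy logic.
VOWELS = set("aeiouy")

def rhyme_key(word: str) -> str:
    """Return the word from its last vowel run onward, lowercased."""
    w = word.lower().strip(".,!?;:\"'")
    for i in range(len(w) - 1, -1, -1):
        if w[i] in VOWELS:
            j = i
            while j > 0 and w[j - 1] in VOWELS:
                j -= 1                     # back up to the start of the vowel run
            return w[j:]
    return w                               # no vowels: fall back to the whole word

def sentences_rhyme(sentence_a: str, sentence_b: str) -> bool:
    return rhyme_key(sentence_a.split()[-1]) == rhyme_key(sentence_b.split()[-1])

assert sentences_rhyme("The cat sat on the mat", "He tipped his hat")
assert not sentences_rhyme("The cat sat on the mat", "He opened the door")
```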
- 11 . The computer program product of claim 1 , wherein the voice profile is indicative of changes of voice intonation of the individual during the single media stream, and the method further includes: determining that the first speech segment is a question and that the second speech segment is a statement, wherein the artificial dubbed version of the received single media stream includes a dubbed version of the first speech segment having a question intonation and a dubbed version of the second speech segment having a statement intonation.
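The question/statement distinction in claim 11 could rest on a pitch-contour test such as the sketch below. The rising-tail heuristic, the 20% tail window, and the 10% rise threshold are assumptions for illustration; the patent does not specify how the classification is made.

```python
import numpy as np

def classify_intonation(f0_contour: np.ndarray,
                        tail_fraction: float = 0.2,
                        rise_threshold: float = 1.10) -> str:
    """Label an F0 contour (Hz per frame, 0 = unvoiced) as question/statement."""
    f0 = f0_contour[f0_contour > 0]            # keep voiced frames only
    if f0.size == 0:
        return "statement"                     # nothing voiced to judge
    tail = f0[-max(1, int(len(f0) * tail_fraction)):]
    body = f0[: len(f0) - len(tail)] if len(f0) > len(tail) else f0
    # A tail rising well above the preceding median suggests a question.
    return "question" if np.median(tail) > rise_threshold * np.median(body) else "statement"
```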
- 12 . The computer program product of claim 1 , wherein the method further includes: analyzing the received single media stream to determine visual data, wherein the visual data includes facial images of the individual; and using the visual data to determine the first emotional state and the second emotional state.
- 13 . The computer program product of claim 1 , wherein the method further includes: obtaining a transcript of the single media stream; analyzing the received single media stream to determine visual data indicative of a number of people the individual is speaking to; and translating the transcript to the target language based on the visual data using a language register appropriate for the number of people.
- 14 . The computer program product of claim 1 , wherein the method further includes analyzing the single media stream to determine visual data that includes text written in the origin language and determining an importance level for the text, wherein the artificial dubbed version of the received single media stream provides a visual translation in the target language of the text written in the origin language when the importance level exceeds a threshold.
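One plausible reading of claim 14's threshold test, with invented scoring factors and weights; the patent does not say how the importance level is computed, so every number below is an assumption.

```python
# Hypothetical importance score for on-screen text: larger, longer-visible,
# more central text scores higher and gets a visual translation.
def text_importance(area_fraction: float, seconds_visible: float,
                    is_centered: bool) -> float:
    return (0.5 * area_fraction
            + 0.1 * min(seconds_visible, 5.0)
            + (0.2 if is_centered else 0.0))

IMPORTANCE_THRESHOLD = 0.3  # assumed tuning value

def should_translate_visually(area_fraction: float, seconds_visible: float,
                              is_centered: bool) -> bool:
    return text_importance(area_fraction, seconds_visible, is_centered) > IMPORTANCE_THRESHOLD
```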
- 15 . A method for artificially generating a revoiced media stream, the method comprising:
  receiving a single media stream including utterances spoken in an origin language by an individual and sounds from a sound-emanating object, wherein the individual is associated with a particular voice;
  analyzing the single media stream to identify a first word that the individual spoke while being in a first emotional state and a second word that the individual spoke while being in a second emotional state;
  using a neural network to process the single media stream for determining a voice profile specific to the individual recorded in the single media stream, wherein the voice profile is indicative of a manner in which the first word and the second word spoken in the origin language are pronounced by the individual in the received single media stream, the voice profile includes first characteristics of the particular voice for a first speech segment associated with the first emotional state of the individual and second characteristics of the particular voice for a second speech segment associated with the second emotional state of the individual, and the second characteristics differ from the first characteristics;
  receiving input indicative of user preferences about characteristics of the particular voice;
  based on the voice profile and the received input, determining a synthesized voice for dubbing the first speech segment and the second speech segment in a target language, wherein the synthesized voice sounds like the particular voice;
  generating an artificial dubbed version of the received single media stream, the artificial dubbed version including a dubbed version of the first speech segment having the first characteristics associated with the first emotional state of the individual and a dubbed version of the second speech segment having the second characteristics associated with the second emotional state of the individual;
  determining an auditory relationship between the individual and the sound-emanating object, wherein the auditory relationship is indicative of a ratio of volume levels between the utterances spoken by the individual in the origin language and the sounds from the sound-emanating object as they are recorded in the single media stream; and
  determining a category of the sound-emanating object;
  wherein: when the sound-emanating object is from a first category, a ratio of the volume levels between utterances spoken in the target language in the artificial dubbed version of the received single media stream and sounds from the sound-emanating object is maintained substantially identical to the ratio of volume levels between utterances spoken in the origin language and sounds from the sound-emanating object as they are recorded in the single media stream; and when the sound-emanating object is from a second category, the ratio of the volume levels between utterances spoken in the target language in the artificial dubbed version of the received single media stream and sounds from the sound-emanating object is changed to reduce a relative volume level of the sound-emanating object.
- 16 . A system for artificially generating a revoiced media stream, the system comprising: at least one processing device configured to:
  receive a single media stream including utterances spoken in an origin language by an individual and sounds from a sound-emanating object, wherein the individual is associated with a particular voice;
  analyze the single media stream to identify a first word that the individual spoke while being in a first emotional state and a second word that the individual spoke while being in a second emotional state;
  use a neural network to process the single media stream for determining a voice profile specific to the individual recorded in the single media stream, wherein the voice profile is indicative of a manner in which the first word and the second word spoken in the origin language are pronounced by the individual in the received single media stream, the voice profile includes first characteristics of the particular voice for a first speech segment associated with the first emotional state of the individual and second characteristics of the particular voice for a second speech segment associated with the second emotional state of the individual, and the second characteristics differ from the first characteristics;
  receive input indicative of user preferences about characteristics of the particular voice;
  based on the voice profile and the received input, determine a synthesized voice for dubbing the first speech segment and the second speech segment in a target language, wherein the synthesized voice sounds like the particular voice;
  generate an artificial dubbed version of the received single media stream, the artificial dubbed version including a dubbed version of the first speech segment having the first characteristics associated with the first emotional state of the individual and a dubbed version of the second speech segment having the second characteristics associated with the second emotional state of the individual;
  determine an auditory relationship between the individual and the sound-emanating object, wherein the auditory relationship is indicative of a ratio of volume levels between the utterances spoken by the individual in the origin language and the sounds from the sound-emanating object as they are recorded in the single media stream; and
  determine a category of the sound-emanating object;
  wherein: when the sound-emanating object is from a first category, a ratio of the volume levels between utterances spoken in the target language in the artificial dubbed version of the received single media stream and sounds from the sound-emanating object is maintained substantially identical to the ratio of volume levels between utterances spoken in the origin language and sounds from the sound-emanating object as they are recorded in the single media stream; and when the sound-emanating object is from a second category, the ratio of the volume levels between utterances spoken in the target language in the artificial dubbed version of the received single media stream and sounds from the sound-emanating object is changed to reduce a relative volume level of the sound-emanating object.
- 17 . The computer program product of claim 1 , wherein determining the voice profile for the individual involves extracting from the received single media stream spectral features including at least one of: spectral centroid, spectral spread, spectral skewness, spectral kurtosis, spectral slope, spectral decrease, spectral roll-off, or spectral variation for each of the first speech segment and second speech segment, and using the neural network to generate the voice profile based on the extracted spectral features.
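Two of the spectral features listed in claim 17, centroid and spread, follow standard definitions (power-weighted mean frequency and its standard deviation) and can be computed from a windowed frame's magnitude spectrum as in this sketch. The framing parameters (window type, frame length) are assumptions the patent does not fix.

```python
import numpy as np

def spectral_centroid_and_spread(frame: np.ndarray, sample_rate: int):
    """Centroid (Hz) and spread (Hz) of one audio frame (1-D float array)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    power = spectrum ** 2
    total = power.sum() + 1e-12                      # avoid division by zero
    centroid = (freqs * power).sum() / total         # power-weighted mean frequency
    spread = np.sqrt(((freqs - centroid) ** 2 * power).sum() / total)
    return centroid, spread
```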
- 18 . The computer program product of claim 1 , wherein the voice profile is associated with a first vector representing explicit characteristics of the individual's voice in the first speech segment that includes loudness, rhythm pattern, or pitch and a second vector representing explicit characteristics of the individual's voice in the second speech segment that includes loudness, rhythm pattern, or pitch, wherein a first distance between the first vector and the second vector is smaller than a second distance between the first vector and another vector extracted from a voice of another individual.
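The vector property in claim 18 is essentially a distance inequality: two segments of the same speaker lie closer together than segments from different speakers. A sketch with invented feature scaling and example values (the feature names follow the claim; the Euclidean metric is an illustrative choice):

```python
import numpy as np

def voice_vector(loudness_db: float, rhythm_wpm: float, pitch_hz: float) -> np.ndarray:
    # Scale features to comparable ranges before measuring distance.
    return np.array([loudness_db / 10.0, rhythm_wpm / 100.0, pitch_hz / 100.0])

first_segment = voice_vector(62.0, 140.0, 118.0)
second_segment = voice_vector(68.0, 150.0, 131.0)   # same speaker, different emotion
other_speaker = voice_vector(55.0, 110.0, 205.0)

same = np.linalg.norm(first_segment - second_segment)
diff = np.linalg.norm(first_segment - other_speaker)
assert same < diff   # the property the claim describes
```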
- 19 . The computer program product of claim 1 , wherein the method further includes: using a trained neural network model to analyze the received single media stream and identify contextual information based on visual data and audio data extracted from the received single media stream; and using the contextual information to determine the first emotional state and the second emotional state.
- 20 . The computer program product of claim 1 , wherein the voice profile includes a first set of characteristics of the particular voice for a first social activity and a second set of characteristics of the particular voice for a second social activity, the first set of characteristics differing from the second set of characteristics.
Description
CROSS REFERENCE TO RELATED APPLICATIONS

This is a continuation of U.S. patent application Ser. No. 16/777,097, filed Jan. 30, 2020 (pending), which claims the benefit of U.S. Provisional Patent Application No. 62/799,970, filed on Feb. 1, 2019, U.S. Provisional Patent Application No. 62/816,137, filed on Mar. 10, 2019, and U.S. Provisional Patent Application No. 62/822,856, filed on Mar. 23, 2019. The entire contents of all of the above-identified applications are herein incorporated by reference.

BACKGROUND

I. Technical Field

The present disclosure relates generally to the field of audio processing. More specifically, the present disclosure relates to systems, methods, and devices for generating audio streams for dubbing purposes.

II. Background Information

Thousands of original media streams are created for entertainment on a daily basis, such as personal home videos, vlogs, TV series, movies, podcasts, live radio shows, and more. Without the long and tedious process of professional dubbing, the vast majority of these media streams are available for consumption by only a fraction of the world population. Existing technologies, such as neural machine translation services that can deliver real-time subtitles, offer a partial solution for overcoming the language barrier. Yet for many people consuming content with subtitles is not a viable option, and for many others it is considered less pleasant. The disclosed embodiments are directed to providing new and improved ways of generating artificial voices for dubbing, and more specifically to systems, methods, and devices for generating revoiced audio streams that sound as if the individuals in the original audio stream speak the target language.

SUMMARY

Embodiments consistent with the present disclosure provide systems, methods, and devices for generating media streams for dubbing purposes and for generating personalized media streams.

In one embodiment, a method for artificially generating a revoiced media stream is provided. The method comprises: receiving a media stream including an individual speaking in an origin language, wherein the individual is associated with a particular voice; obtaining a transcript of the media stream including utterances spoken in the origin language; translating the transcript of the media stream to a target language, wherein the translated transcript includes a set of words in the target language for each of at least some of the utterances spoken in the origin language; analyzing the media stream to determine a voice profile for the individual, wherein the voice profile includes characteristics of the particular voice; determining a synthesized voice for a virtual entity intended to dub the individual, wherein the synthesized voice has characteristics identical to the characteristics of the particular voice; and generating a revoiced media stream in which the translated transcript in the target language is spoken by the virtual entity.

In another embodiment, a method for artificially generating a revoiced media stream is provided.
The method comprises: receiving a media stream including a plurality of first individuals speaking in a primary language and at least one second individual speaking in a secondary language; obtaining a transcript of the received media stream associated with utterances in the primary language and utterances in the secondary language; determining that dubbing of the utterances in the primary language to a target language is needed and that dubbing of the utterances in the secondary language to the target language is unneeded; analyzing the received media stream to determine a set of voice parameters for each of the plurality of first individuals; determining a voice profile for each of the plurality of first individuals based on an associated set of voice parameters; and using the determined voice profiles and a translated version of the transcript to artificially generate a revoiced media stream in which the plurality of first individuals speak the target language and the at least one second individual speaks the secondary language.

In another embodiment, a method for artificially generating a revoiced media stream is provided. The method comprises: receiving an input media stream including a first individual speaking in a first language and a second individual speaking in a second language; obtaining a transcript of the input media stream associated with utterances in the first language and utterances in the second language; analyzing the received media stream to determine a first set of voice parameters of the first individual and a second set of voice parameters of the second individual; determining a first voice profile of the first individual based on the first set of voice parameters; determining a second voice profile of the second individual based on the second set of voice parameters; and using the determined voice profiles and a translated version of the transcript to artificially generate a revoiced media stream.
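For orientation, the first summarized embodiment reduces to the pipeline skeleton below. Every helper is a stub standing in for a component the disclosure leaves unspecified (speech recognition, machine translation, voice analysis, voice-cloned synthesis, remixing); all names, signatures, and placeholder values are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class VoiceProfile:
    pitch_hz: float
    loudness_db: float
    rhythm_wpm: float

def transcribe(stream: bytes, origin_language: str) -> str:
    return "hello world"                      # stub: speech recognition

def translate(transcript: str, target_language: str) -> str:
    return "hola mundo"                       # stub: machine translation

def extract_voice_profile(stream: bytes) -> VoiceProfile:
    return VoiceProfile(120.0, 60.0, 140.0)   # stub: voice analysis

def synthesize(text: str, profile: VoiceProfile) -> bytes:
    return b""                                # stub: voice-cloned TTS

def remix(stream: bytes, dubbed_audio: bytes) -> bytes:
    return dubbed_audio                       # stub: replace the speech track

def revoice(stream: bytes, origin_language: str, target_language: str) -> bytes:
    """Transcribe, translate, profile the voice, synthesize, and remix."""
    transcript = transcribe(stream, origin_language)
    translated = translate(transcript, target_language)
    profile = extract_voice_profile(stream)
    dubbed = synthesize(translated, profile)
    return remix(stream, dubbed)
```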