RU-2861587-C2 - METHOD FOR ADAPTING VIDEO AND AUDIO PRODUCTS TO NATIONAL LANGUAGE
Abstract
FIELD: computing.
SUBSTANCE: the invention relates to computing, in particular to the processing of audio data. The method comprises translating the text of a speech file into the national language by a professional translator, creating an additional audio track, and merging the files into one according to the timestamps created when the original file was split, wherein a trained neural network and a vocoder convert this audio track into the speech of a selected speaker, outputting an audio file with the effect of dubbing in the national language in a voice identical to that of the original speaker; intonation, emotional and other characteristics of the original speech are extracted, and the synthesized speech is adapted using the extracted characteristics, with modification of intonation and emotional expression and correction of accent.
EFFECT: enables the creation of an audio track in the national language by replacing the translator's voice with the original voice, conveying accurate meaning and reliable intonation and taking into account the national characteristics of the cloned speech sample of the selected speaker in the base language.
1 cl
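For illustration only, the merging of files according to the timestamps mentioned above can be sketched as follows. This is a minimal sketch, assuming the pydub library; the merge_by_timecodes helper, the segment boundaries and the file names are hypothetical and are not taken from the patent.

```python
# Minimal sketch: splice dubbed (national-language) segments back into the
# original track at the time codes produced when the original file was split.
from pydub import AudioSegment

def merge_by_timecodes(original_path, dubbed_segments):
    """dubbed_segments: list of (start_ms, end_ms, dubbed_wav_path) tuples."""
    original = AudioSegment.from_file(original_path)
    result = AudioSegment.empty()
    cursor = 0
    for start_ms, end_ms, dubbed_path in sorted(dubbed_segments):
        # Keep non-speech material (music, effects) from the original track.
        result += original[cursor:start_ms]
        # Insert the synthesized national-language speech for this interval.
        result += AudioSegment.from_file(dubbed_path)[: end_ms - start_ms]
        cursor = end_ms
    result += original[cursor:]  # tail of the original track
    return result

# Hypothetical usage:
# merged = merge_by_timecodes("original.wav", [(1200, 4800, "segment1_translated.wav")])
# merged.export("dubbed_track.wav", format="wav")
```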
Inventors
- Shabalin Maksim Konstantinovich
Dates
- Publication Date
- 2026-05-06
- Application Date
- 2024-07-18
Claims (1)
- A method for translating the audio tracks of video and audio files into a national language using neural networks to synthesize the speech of a selected speaker with faithful transmission of intonation and the emotional content of speech and without an accent, the method comprising deep training of a neural network on a training dataset to obtain at the output a mel-spectrogram of the voice of the selected speaker and conversion of the mel-spectrogram by a vocoder to obtain at the output an audio file in WAV format, characterized in that the text of the speech file is translated into the national language by a professional translator to create an additional audio track, the files are merged into one according to the time codes created when the original file was split, the trained neural network and the vocoder convert this audio track into the speech of the selected speaker to obtain at the output an audio file with the effect of dubbing in the national language in a voice identical to that of the original speaker, intonation, emotional and other characteristics of the original speech are extracted, and the synthesized speech is adapted using the extracted characteristics, with modification of intonation, application of emotional expression and correction of accent.
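For illustration only, the final step of the claim (conversion of a mel-spectrogram into a WAV file by a vocoder) can be sketched as follows. This is a minimal sketch, assuming the librosa and soundfile libraries; the patent does not name a specific vocoder, so Griffin-Lim inversion is used here purely as a stand-in for the trained neural vocoder, and the mel_to_wav helper and its parameters are hypothetical.

```python
# Minimal sketch: invert a (power) mel-spectrogram to audio and write a WAV file.
import librosa
import numpy as np
import soundfile as sf

def mel_to_wav(mel: np.ndarray, sr: int = 22050, n_fft: int = 1024,
               hop_length: int = 256, out_path: str = "synthesized.wav") -> str:
    # In the claimed method this conversion is performed by the trained vocoder;
    # Griffin-Lim inversion is only an illustrative placeholder.
    audio = librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length)
    sf.write(out_path, audio, sr)
    return out_path
```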
Description
Field of technology to which the invention relates
The invention relates to the field of methods and devices for recognizing, processing, analyzing and synthesizing speech, namely to methods for synthesizing speech using artificial neural networks, and can be used for cloning and synthesizing the speech of a selected speaker while faithfully conveying intonation and the emotional content of speech and without an accent corresponding to the original language of the cloned sample.
Description of the prior art
Various technical solutions for recognizing, processing, analyzing and synthesizing speech are known in the prior art. Some of these solutions use artificial neural networks for speech processing, analysis and synthesis. The primary objective of such speech synthesis techniques is to convert text into audible speech after pre-training the neural network on the voice of a selected speaker.
The closest analogue of the proposed method is the "Method for Speech Synthesis with Reliable Intonation of a Cloned Sample", patent RU 2754920. The main drawback of this method is that it synthesizes speech from text, so the resulting audio track may not fully match the meaning and length of the audio and video intervals. Creating an audio track in a national language from text in this way has several drawbacks: the result will not fully match the original and, in the case of a video file, will not fully match the audio intervals (mouth movements).
Currently, no existing technical solution in this area offers a fully fledged hardware-software method for synthesizing speech in any natural language, including Russian or other complex languages, from any speaker, while accurately conveying the cloned sample's intonation in all its aspects and maximizing the synthesized voice's fidelity to the real human speaker. As a result of translation, the intonation and emotional coloring of the original voice are lost and a noticeable accent appears. Existing voice cloning systems lack this capability because they do not use emotional speech synthesis technology. As for accent, deeper processing is required not only to clone the voice but also to imitate pronunciation characteristics, which requires modeling the phonetic characteristics of the source language. Furthermore, the timing of the audio track created from the translated text will not fully match the timing of the source file, creating a discrepancy between the synthesized voice and the visual image of the speaker; this is an additional problem that has not yet been resolved. The proposed method for adapting video and audio products to the national language eliminates these drawbacks.
Disclosure of invention
The method is implemented using a neural network which, after an audio or video file is loaded, controls all stages of translation into the selected language. As a result, a digital audio track is created in the selected national language using a digital clone of the voice from the original audio recording, and this track is then embedded into the original file as a duplicate speech track. The developed neural network is a complex of programs and neural networks, each performing specific functions. It is not only a multimodal neural network but also controls all stages of creating the final product: a video or audio file containing an audio track with a voice completely identical to the original in the selected language. To achieve this result, a method was developed for extracting the intonation and emotional characteristics of the original speech using a neural network, and a block was introduced for adapting the synthesized speech using the extracted characteristics in order to preserve intonation and emotional coloring and to correct accent.
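For illustration only, the extraction of intonation-related characteristics from the original speech can be sketched as follows. This is a minimal sketch, assuming the librosa library; the pitch contour and frame energy are simple proxies chosen for the example, since the patent does not specify the exact feature set, and the extract_prosody helper is hypothetical.

```python
# Minimal sketch: extract a pitch contour and an energy contour from a speech file
# as rough stand-ins for the intonation and emotional characteristics mentioned above.
import librosa
import numpy as np

def extract_prosody(path: str, sr: int = 16000) -> dict:
    y, sr = librosa.load(path, sr=sr)
    # Fundamental-frequency (pitch) contour: the main carrier of intonation.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    # Frame-level energy: a rough correlate of emotional intensity.
    rms = librosa.feature.rms(y=y)[0]
    return {
        "f0_contour": f0,                                   # Hz per frame (NaN when unvoiced)
        "voiced_ratio": float(np.mean(voiced_flag.astype(float))),
        "energy_contour": rms,
    }
```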
Processing of a file by the neural network consists of several stages:
1. Conversion.
2. Import of the audio/video file.
3. Selection of an audio track (if a video file was loaded).
4. Extraction of the original speech signal.
5. Construction of a mel-spectrogram of the original speech.
6. Deep training of a neural network on the mel-spectrogram of the original speech.
7. Extraction of intonation, emotional and other characteristics of the original speech.
8. Translation of the text of the speech file into the national language by a professional translator (human).
9. Preliminary speech synthesis in the national language using the trained neural network (generation of a mel-spectrogram, application of a vocoder).
10. Adaptation of the synthesized speech using the extracted characteristics (modification of intonation, application of emotional expression, correction of accent).
11. Final post-processing of the resulting audio file.
12. Merging of the translated audio with the original video (if a video file was uploaded).
13. Final export of the adapted audio or video product.
The invention can be illustrated by the following examples:
Example 1. When