US-12620387-B2 - Voice generation method and apparatus, device, and computer readable medium
Abstract
A voice generation method and apparatus, an electronic device, and a computer readable storage medium. Said method comprises: performing speaker segmentation on an original voice to determine starting time and ending time of each speaking voice segment in the original voice, so as to obtain segmented voices; determining a voiceprint feature vector corresponding to each speaking voice segment in the original voice; converting a text corresponding to each speaking voice segment in the original voice into a target language text, to obtain a target language text corresponding to each speaking voice segment in the original voice; and generating a target voice on the basis of the starting time and the ending time of each speaking voice segment in the original voice, the voiceprint feature vectors corresponding to the speaking voice segments and the target language texts corresponding to the speaking voice segments.
Inventors
- Meng Cai
- Yalu Kong
Assignees
- BEIJING BYTEDANCE NETWORK TECHNOLOGY CO., LTD.
Dates
- Publication Date: 2026-05-05
- Application Date: 2021-07-30
- Priority Date: 2020-08-17
Claims (15)
- 1 . A speech generating method, comprising: determining starting time and ending time of each speaking speech fragment in original speech by performing speaker segmentation on the original speech to obtain segmented speech; determining a voiceprint feature vector corresponding to each speaking speech fragment in the original speech; converting a text corresponding to each speaking speech fragment in the original speech into a target language text, to obtain the target language text corresponding to each speaking speech fragment in the original speech; and generating target speech based on the starting time and the ending time of each speaking speech fragment in the original speech, the voiceprint feature vector corresponding to each speaking speech fragment, and the target language text corresponding to each speaking speech fragment, wherein generating the target speech comprises: generating a set of target speaking speech fragments based on the voiceprint feature vector and the target language text corresponding to each speaking speech fragment in the original speech, determining the starting time and the ending time of each speaking speech fragment as starting time and ending time of a corresponding target speaking speech fragment, and generating a silent speech fragment between every two adjacent target speaking speech fragments, wherein a duration of the silent speech fragment is determined based on a difference between starting time of a latter target speaking speech fragment and ending time of the former target speaking speech fragment in the two adjacent target speaking speech fragments.
- 2 . The method according to claim 1 , wherein determining the voiceprint feature vector corresponding to each speaking speech fragment in the original speech comprises: determining the voiceprint feature vector corresponding to each speaking speech fragment in the original speech by performing speaker clustering on the segmented speech.
- 3 . The method according to claim 1 , further comprising: splicing the set of target speaking speech fragments based on the starting time and the ending time of each speaking speech fragment in the original speech to obtain the target speech.
- 4 . The method according to claim 1 , further comprising: inputting the voiceprint feature vector corresponding to each speaking speech fragment and the target language text corresponding to each speaking speech fragment into a pre-trained speech synthesis network to obtain the set of target speaking speech fragments.
- 5 . The method according to claim 1 , wherein before converting the text corresponding to each speaking speech fragment in the original speech into the target language text, the method comprises: performing speech recognition and automatic sentence division on the original speech, to obtain the text corresponding to each speaking speech fragment in the original speech.
- 6 . An electronic device, comprising: one or more processors; and a memory, storing one or more programs therein; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement operations comprising: determining starting time and ending time of each speaking speech fragment in original speech by performing speaker segmentation on the original speech to obtain segmented speech; determining a voiceprint feature vector corresponding to each speaking speech fragment in the original speech; converting a text corresponding to each speaking speech fragment in the original speech into a target language text, to obtain the target language text corresponding to each speaking speech fragment in the original speech; and generating target speech based on the starting time and the ending time of each speaking speech fragment in the original speech, the voiceprint feature vector corresponding to each speaking speech fragment, and the target language text corresponding to each speaking speech fragment, wherein generating the target speech comprises: generating a set of target speaking speech fragments based on the voiceprint feature vector and the target language text corresponding to each speaking speech fragment in the original speech, determining the starting time and the ending time of each speaking speech fragment as starting time and ending time of a corresponding target speaking speech fragment, and generating a silent speech fragment between every two adjacent target speaking speech fragments, wherein a duration of the silent speech fragment is determined based on a difference between starting time of a latter target speaking speech fragment and ending time of the former target speaking speech fragment in the two adjacent target speaking speech fragments.
- 7 . A non-transitory computer readable medium, storing a computer program therein, wherein the program, when executed by a processor, causes the processor to implement operations comprising: determining starting time and ending time of each speaking speech fragment in original speech by performing speaker segmentation on the original speech to obtain segmented speech; determining a voiceprint feature vector corresponding to each speaking speech fragment in the original speech; converting a text corresponding to each speaking speech fragment in the original speech into a target language text, to obtain the target language text corresponding to each speaking speech fragment in the original speech; and generating target speech based on the starting time and the ending time of each speaking speech fragment in the original speech, the voiceprint feature vector corresponding to each speaking speech fragment, and the target language text corresponding to each speaking speech fragment, wherein generating the target speech comprises: generating a set of target speaking speech fragments based on the voiceprint feature vector and the target language text corresponding to each speaking speech fragment in the original speech, determining the starting time and the ending time of each speaking speech fragment as starting time and ending time of a corresponding target speaking speech fragment, and generating a silent speech fragment between every two adjacent target speaking speech fragments, wherein a duration of the silent speech fragment is determined based on a difference between starting time of a latter target speaking speech fragment and ending time of the former target speaking speech fragment in the two adjacent target speaking speech fragments.
- 8 . The electronic device according to claim 6 , wherein determining the voiceprint feature vector corresponding to each speaking speech fragment in the original speech comprises: determining the voiceprint feature vector corresponding to each speaking speech fragment in the original speech by performing speaker clustering on the segmented speech.
- 9 . The electronic device according to claim 6 , the operations further comprising: splicing the set of target speaking speech fragments based on the starting time and the ending time of each speaking speech fragment in the original speech to obtain the target speech.
- 10 . The electronic device according to claim 6 , the operations further comprising: inputting the voiceprint feature vector corresponding to each speaking speech fragment and the target language text corresponding to each speaking speech fragment into a pre-trained speech synthesis network to obtain the set of target speaking speech fragments.
- 11 . The electronic device according to claim 6 , wherein before converting the text corresponding to each speaking speech fragment in the original speech into the target language text, the operations comprise: performing speech recognition and automatic sentence division on the original speech, to obtain the text corresponding to each speaking speech fragment in the original speech.
- 12 . The non-transitory computer readable medium according to claim 7 , wherein determining the voiceprint feature vector corresponding to each speaking speech fragment in the original speech comprises: determining the voiceprint feature vector corresponding to each speaking speech fragment in the original speech by performing speaker clustering on the segmented speech.
- 13 . The non-transitory computer readable medium according to claim 7 , the operations further comprising: splicing the set of target speaking speech fragments based on the starting time and the ending time of each speaking speech fragment in the original speech to obtain the target speech.
- 14 . The non-transitory computer readable medium according to claim 7 , the operations further comprising: inputting the voiceprint feature vector corresponding to each speaking speech fragment and the target language text corresponding to each speaking speech fragment into a pre-trained speech synthesis network to obtain the set of target speaking speech fragments.
- 15 . The non-transitory computer readable medium according to claim 7 , wherein before converting the text corresponding to each speaking speech fragment in the original speech into the target language text, the operations comprise: performing speech recognition and automatic sentence division on the original speech, to obtain the text corresponding to each speaking speech fragment in the original speech.
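The silence-insertion rule in claims 1, 6, and 7 (a silent fragment between every two adjacent target fragments, with duration equal to the starting time of the latter minus the ending time of the former) together with the splicing of claims 3, 9, and 13 can be sketched as follows. This is an illustrative reading only, not the patent's implementation; the function name, fragment representation, and sample rate are assumptions.

```python
import numpy as np

def assemble_target_speech(fragments, sample_rate=16000):
    """Splice synthesized target fragments into one target speech signal,
    inserting a silent fragment between every two adjacent fragments whose
    duration is the start time of the latter minus the end time of the former.

    fragments: list of (start_sec, end_sec, samples) tuples, sorted by start
    time, where samples is a 1-D NumPy array of synthesized audio.
    """
    pieces = []
    prev_end = None
    for start, end, samples in fragments:
        if prev_end is not None:
            # Silent-fragment duration per claim 1: start of latter - end of former.
            gap = max(0.0, start - prev_end)
            pieces.append(np.zeros(int(round(gap * sample_rate)),
                                   dtype=samples.dtype))
        pieces.append(samples)
        prev_end = end
    return np.concatenate(pieces) if pieces else np.zeros(0, dtype=np.float32)
```

Because each target fragment inherits the original fragment's starting and ending time, the gaps reproduced here keep the dubbed track aligned with the original timeline.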
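The speaker clustering of claims 2, 8, and 12 groups the segmented fragments so that fragments from the same speaker share one voiceprint feature vector. A minimal sketch of such clustering is shown below; it uses a greedy cosine-similarity scheme with a hypothetical threshold, which is an assumption for illustration, not the clustering algorithm the patent specifies.

```python
import numpy as np

def cluster_voiceprints(embeddings, threshold=0.75):
    """Greedy speaker clustering over per-fragment voiceprint embeddings.

    Each embedding is assigned to the existing cluster whose running centroid
    has the highest cosine similarity, provided it meets the threshold;
    otherwise a new cluster (new speaker) is created. Returns one cluster
    label per fragment.
    """
    centroids, counts, labels = [], [], []
    for e in embeddings:
        e = np.asarray(e, dtype=float)
        e = e / np.linalg.norm(e)
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = float(np.dot(e, c / np.linalg.norm(c)))
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(e.copy())
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            # Update the cluster centroid incrementally with the new embedding.
            counts[best] += 1
            centroids[best] += (e - centroids[best]) / counts[best]
            labels.append(best)
    return labels
```

In practice the per-speaker voiceprint fed to the synthesis network could then be each cluster's centroid, so all fragments of one speaker are dubbed with a consistent voice.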
Description
CROSS REFERENCE TO RELATED APPLICATIONS

The present application is the U.S. National Stage of International Application No. PCT/CN2021/109550, filed on Jul. 30, 2021, which claims priority to Chinese Patent Application No. 202010823774.X, filed with the China Patent Office on Aug. 17, 2020 and entitled "VOICE GENERATION METHOD AND APPARATUS, DEVICE, AND COMPUTER READABLE MEDIUM", both of which are incorporated herein by reference in their entireties.

FIELD

Embodiments of the present disclosure relate to the technical field of computers, and in particular to a speech generating method and apparatus, a device, and a computer readable medium.

BACKGROUND

Converting an audio/video file in a first language into an audio/video file in a second language takes considerable manpower. Moreover, it cannot be ensured that the time nodes of the speaking speech fragments in the second-language file correspond to those in the first-language file, nor that their voiceprint features are similar. There is therefore a demand for automatic translation and dubbing technology.

SUMMARY

The summary part of the present disclosure briefly introduces ideas that will be described in detail in the following implementation part. It is not intended to identify key features or essential features of the claimed technical solutions, nor is it used to limit the scope of the claimed technical solutions. Some embodiments of the present disclosure propose a speech generating method and apparatus, a device, and a computer readable medium, so as to solve the technical problem mentioned in the above background.
In a first aspect, some embodiments of the present disclosure provide a speech generating method, comprising: determining starting time and ending time of each speaking speech fragment in an original speech by performing speaker segmentation on the original speech to obtain segmented speech; determining a voiceprint feature vector corresponding to each speaking speech fragment in the original speech; converting a text corresponding to each speaking speech fragment in the original speech into a target language text, to obtain the target language text corresponding to each speaking speech fragment in the original speech; and generating a target speech based on the starting time and ending time of each speaking speech fragment in the original speech, the voiceprint feature vector corresponding to each speaking speech fragment, and the target language text corresponding to each speaking speech fragment.

In a second aspect, some embodiments of the present disclosure provide a speech generating apparatus, comprising: a first determining unit configured to determine starting time and ending time of each speaking speech fragment in an original speech by performing speaker segmentation on the original speech to obtain segmented speech; a second determining unit configured to determine a voiceprint feature vector corresponding to each speaking speech fragment in the original speech; a converting unit configured to convert a text corresponding to each speaking speech fragment in the original speech into a target language text, to obtain the target language text corresponding to each speaking speech fragment in the original speech; and a generating unit configured to generate a target speech based on the starting time and ending time of each speaking speech fragment in the original speech, the voiceprint feature vector corresponding to each speaking speech fragment, and the target language text corresponding to each speaking speech fragment.
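The first-aspect method above is a four-step pipeline: speaker segmentation, voiceprint extraction, translation, and synthesis. A minimal sketch is given below, assuming hypothetical helper callables (`segmenter`, `embedder`, `recognizer`, `translator`, `synthesizer`) standing in for components the disclosure names but does not define at this point, and representing audio as an indexable sequence of samples.

```python
def speech_generation_pipeline(original_audio, segmenter, embedder,
                               recognizer, translator, synthesizer):
    """Sketch of the first-aspect method; each callable is a placeholder.

    Returns a list of (start, end, target_audio) tuples, one per speaking
    fragment, preserving the original fragment's time stamps.
    """
    # Step 1: speaker segmentation -> (start, end) per speaking fragment.
    segments = segmenter(original_audio)
    results = []
    for start, end in segments:
        clip = original_audio[start:end]
        # Step 2: voiceprint feature vector for this fragment's speaker.
        voiceprint = embedder(clip)
        # Step 3: recognize the fragment's text, then translate it.
        text = recognizer(clip)
        target_text = translator(text)
        # Step 4: synthesize target-language speech in the original voice.
        target_audio = synthesizer(target_text, voiceprint)
        results.append((start, end, target_audio))
    return results
```

The returned time stamps are what the generating step later uses to place each synthesized fragment, and to size the silent fragments between adjacent ones.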
In a third aspect, some embodiments of the present disclosure provide an electronic device, comprising: one or more processors; and a storage apparatus storing one or more programs. The one or more programs, when executed by the one or more processors, cause the one or more processors to implement any one of the methods in the first aspect.

In a fourth aspect, some embodiments of the present disclosure provide a computer readable medium storing computer programs. The programs, when executed by a processor, cause the processor to implement any one of the methods in the first aspect.

One of the above embodiments of the present disclosure has the following beneficial effect: automatically converting an audio/video file in a first language into an audio/video file in a second language.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other features, advantages and aspects of the embodiments of the present disclosure will become clearer in combination with the accompanying drawings and with reference to the following specific implementations. Throughout the accompanying drawings, the same or similar reference signs represent the same or similar elements. It should be understood that the accompanying drawings are merely illustrative, and that members and elements are not necessarily drawn to scale. FIG. 1 is a schematic