
US-12620399-B2 - Voice processing method and apparatus, electronic device, and computer readable medium

US 12620399 B2

Abstract

A voice processing method, comprising: segmenting a voice to be processed into at least one voice segment; generating at least one first voice on the basis of a clustering result of the at least one voice segment; performing feature extraction on each of the at least one first voice, to obtain a voiceprint feature vector corresponding to each first voice; and generating a second voice on the basis of the voiceprint feature vector, the second voice being an unmixed voice of the same sound source. Further disclosed are a voice processing apparatus, an electronic device, and a computer readable medium. By performing feature extraction on the first voice and further performing voice separation on the first voice, a more accurate second voice is obtained, thereby improving the overall voice segmentation effect.

Inventors

  • Meng Cai

Assignees

  • BEIJING BYTEDANCE NETWORK TECHNOLOGY CO., LTD.

Dates

Publication Date
2026-05-05
Application Date
2021-07-29
Priority Date
2020-08-17

Claims (12)

  1. A speech processing method, comprising: segmenting a to-be-processed speech into a plurality of speech fragments based on a preset duration; generating a plurality of initial first speeches based on a result of clustering the plurality of speech fragments, wherein the result comprises a plurality of speech fragment clusters; generating a plurality of speech frame clusters by performing segmentation and clustering on audio frames in each of the plurality of initial first speeches that are generated based on the result of clustering the plurality of speech fragments of the to-be-processed speech; generating a plurality of first speeches by splicing speech frames in each of the plurality of speech frame clusters; performing feature extraction on each first speech in the plurality of first speeches to obtain a voiceprint feature vector corresponding to each first speech; and generating a second speech based on the voiceprint feature vectors, wherein the second speech is an unmixed speech of a same sound source.
  2. The method according to claim 1, wherein each first speech in the plurality of first speeches comprises at least one of the unmixed speech and a mixed speech.
  3. The method according to claim 1, wherein the voiceprint feature vector corresponding to each first speech comprises at least one of a voiceprint feature vector corresponding to the unmixed speech and a voiceprint feature vector corresponding to a mixed speech.
  4. The method according to claim 3, wherein generating the second speech based on the voiceprint feature vectors comprises: inputting the voiceprint feature vectors into a pre-trained time-domain audio separation network to generate the second speech, wherein the time-domain audio separation network is used to generate an unmixed speech of a target sound source according to the voiceprint feature vectors.
  5. An electronic device, comprising: one or more processors; and a storage apparatus in which one or more programs are stored, the one or more programs, when executed by the one or more processors, causing the one or more processors to implement operations comprising: segmenting a to-be-processed speech into a plurality of speech fragments based on a preset duration; generating a plurality of initial first speeches based on a result of clustering the plurality of speech fragments, wherein the result comprises a plurality of speech fragment clusters; generating a plurality of speech frame clusters by performing segmentation and clustering on audio frames in each of the plurality of initial first speeches that are generated based on the result of clustering the plurality of speech fragments of the to-be-processed speech; generating a plurality of first speeches by splicing speech frames in each of the plurality of speech frame clusters; performing feature extraction on each first speech in the plurality of first speeches to obtain a voiceprint feature vector corresponding to each first speech; and generating a second speech based on the voiceprint feature vectors, wherein the second speech is an unmixed speech of a same sound source.
  6. The electronic device according to claim 5, wherein each first speech in the plurality of first speeches comprises at least one of the unmixed speech and a mixed speech.
  7. The electronic device according to claim 5, wherein the voiceprint feature vector corresponding to each first speech comprises at least one of a voiceprint feature vector corresponding to the unmixed speech and a voiceprint feature vector corresponding to a mixed speech.
  8. The electronic device according to claim 7, wherein generating the second speech based on the voiceprint feature vectors comprises: inputting the voiceprint feature vectors into a pre-trained time-domain audio separation network to generate the second speech, wherein the time-domain audio separation network is used to generate an unmixed speech of a target sound source according to the voiceprint feature vectors.
  9. A non-transitory computer readable medium in which computer programs are stored, wherein the programs, when executed by a processor, cause the processor to implement operations comprising: segmenting a to-be-processed speech into a plurality of speech fragments based on a preset duration; generating a plurality of initial first speeches based on a result of clustering the plurality of speech fragments, wherein the result comprises a plurality of speech fragment clusters; generating a plurality of speech frame clusters by performing segmentation and clustering on audio frames in each of the plurality of initial first speeches that are generated based on the result of clustering the plurality of speech fragments of the to-be-processed speech; generating a plurality of first speeches by splicing speech frames in each of the plurality of speech frame clusters; performing feature extraction on each first speech in the plurality of first speeches to obtain a voiceprint feature vector corresponding to each first speech; and generating a second speech based on the voiceprint feature vectors, wherein the second speech is an unmixed speech of a same sound source.
  10. The non-transitory computer readable medium according to claim 9, wherein each first speech in the plurality of first speeches comprises at least one of the unmixed speech and a mixed speech.
  11. The non-transitory computer readable medium according to claim 9, wherein the voiceprint feature vector corresponding to each first speech comprises at least one of the following: a voiceprint feature vector corresponding to the unmixed speech and a voiceprint feature vector corresponding to a mixed speech.
  12. The non-transitory computer readable medium according to claim 11, wherein generating the second speech based on the voiceprint feature vectors comprises: inputting the voiceprint feature vectors into a pre-trained time-domain audio separation network to generate the second speech, wherein the time-domain audio separation network is used to generate an unmixed speech of a target sound source according to the voiceprint feature vectors.
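The pipeline of claim 1 (segment by a preset duration, cluster the fragments, then splice each cluster into a "first speech") can be sketched as follows. This is an illustrative assumption only: all function names are hypothetical, the two-number "embedding" is a toy stand-in for a trained voiceprint extractor, and the naive k-means stands in for whatever clustering the disclosure actually uses.

```python
import numpy as np

def segment_speech(signal: np.ndarray, sr: int, preset_duration: float) -> list:
    """Split a waveform into fixed-length fragments (claim 1, first step)."""
    step = int(sr * preset_duration)
    return [signal[i:i + step] for i in range(0, len(signal), step)]

def fragment_embedding(fragment: np.ndarray) -> np.ndarray:
    """Toy stand-in for a voiceprint embedding; a real system would use a
    trained speaker-embedding network."""
    return np.array([fragment.mean(), fragment.std()])

def cluster_fragments(fragments: list, n_clusters: int = 2, iters: int = 20) -> np.ndarray:
    """Naive k-means over fragment embeddings, yielding speech fragment clusters."""
    emb = np.stack([fragment_embedding(f) for f in fragments])
    centers = emb[:n_clusters].copy()  # naive init: first n_clusters embeddings
    for _ in range(iters):
        labels = np.argmin(((emb[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_clusters):
            if (labels == k).any():
                centers[k] = emb[labels == k].mean(axis=0)
    return labels

def splice_clusters(fragments: list, labels: np.ndarray) -> dict:
    """Concatenate the fragments of each cluster into one 'first speech'."""
    return {k: np.concatenate([f for f, l in zip(fragments, labels) if l == k])
            for k in set(labels.tolist())}
```

On a toy signal that alternates between a quiet and a loud "source", the two alternating fragments land in the same cluster and are spliced back together, mirroring how same-source fragments form one first speech.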

Description

CROSS REFERENCE TO RELATED APPLICATIONS

The present application is the U.S. National Stage of International Application No. PCT/CN2021/109283, filed on Jul. 29, 2021, which claims priority to Chinese Patent Application No. 202010824772.2, filed with the China Patent Office on Aug. 17, 2020 and entitled "VOICE PROCESSING METHOD AND APPARATUS, ELECTRONIC DEVICE, AND COMPUTER READABLE MEDIUM", the entire contents of both of which are incorporated herein by reference.

FIELD

Embodiments of the present disclosure relate to the technical field of computers, and in particular to a speech processing method and apparatus, a device, and a computer readable medium.

BACKGROUND

In the process of speech separation, it is often necessary to separate a target speech from a given speech. A related approach is to adopt a segmentation clustering method to obtain the target speech from the given speech. However, the target speech obtained by the segmentation clustering method has a low precision rate.

SUMMARY

This summary is provided to introduce ideas in a brief form; they are described in detail in the specific embodiments below. This summary is not intended to identify the key features or essential features of the claimed technical solution, nor is it intended to limit the scope of the claimed technical solution. Some embodiments of the present disclosure propose a speech processing method and apparatus, an electronic device, and a computer readable medium to solve the technical problems mentioned in the background above.
In a first aspect, some embodiments of the present disclosure provide a speech processing method, comprising: segmenting a to-be-processed speech into at least one speech fragment, wherein the speech fragment is a fragment from beginning to end of a segment of speech of the same sound source; generating at least one first speech based on a clustering result of the at least one speech fragment, wherein the first speech contains at least one speech fragment of the same sound source; performing feature extraction on each first speech in the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech; and generating a second speech based on the voiceprint feature vector, wherein the second speech is an unmixed speech of the same sound source.

In a second aspect, some embodiments of the present disclosure provide a speech processing apparatus, comprising: a segmentation unit configured to segment a to-be-processed speech into at least one speech fragment, wherein the speech fragment is a fragment from beginning to end of a segment of speech of the same sound source; a first generating unit configured to generate at least one first speech based on a clustering result of the at least one speech fragment, wherein the first speech contains at least one speech fragment of the same sound source; a feature extraction unit configured to perform feature extraction on each first speech in the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech; and a second generating unit configured to generate a second speech based on the voiceprint feature vector, wherein the second speech is an unmixed speech of the same sound source.

In a third aspect, some embodiments of the present disclosure provide an electronic device, comprising: one or more processors; and a storage apparatus in which one or more programs are stored.
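The first step of the method cuts the recording wherever the sound source appears to change, so that each fragment runs "from beginning to end of a segment of speech of the same sound source". The disclosure does not prescribe a particular detector; as a rough, hypothetical illustration, a change point can be declared wherever a per-frame feature jumps by more than a threshold:

```python
import numpy as np

def split_at_source_changes(frames: np.ndarray, threshold: float) -> list:
    """Split a sequence of per-frame feature values (e.g. frame energies)
    wherever consecutive frames differ by more than `threshold`,
    approximating a cut at each change of sound source."""
    cuts = [0]
    for i in range(1, len(frames)):
        if abs(frames[i] - frames[i - 1]) > threshold:
            cuts.append(i)  # a new fragment starts at the change point
    cuts.append(len(frames))
    return [frames[a:b] for a, b in zip(cuts, cuts[1:])]
```

Real systems would typically use speaker-change detection over learned frame embeddings rather than a raw energy threshold; the sketch only shows where the fragment boundaries come from.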
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement any one of the methods in the first aspect.

In a fourth aspect, some embodiments of the present disclosure provide a computer readable medium in which computer programs are stored. The programs, when executed by a processor, cause the processor to implement any one of the methods in the first aspect.

One embodiment among the above embodiments of the present disclosure has the following beneficial effects: first, a to-be-processed speech is segmented into at least one speech fragment, the speech fragment being a fragment from beginning to end of a segment of speech of the same sound source; then, at least one first speech is generated based on a clustering result of the at least one speech fragment, the first speech containing at least one speech fragment of the same sound source. Through this process, speech segmentation of a certain precision can be performed on the target speech, laying a foundation for generating the second speech. Furthermore, feature extraction is performed on each first speech in the at least one first speech to obtain a voiceprint feature vector corresponding to each first speech, and the second speech is generated based on the voiceprint feature vector, the second speech being an unmixed speech of the same sound source. A more precise second speech is thereby obtained, improving the overall speech segmentation effect.
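In the disclosure, the final step feeds the voiceprint feature vector into a pre-trained time-domain audio separation network that outputs the unmixed speech of the target sound source. As a crude, hypothetical analogue of that conditioning (not the network itself), one can mask a mixture frame by frame, keeping only frames whose embedding is similar to the target voiceprint; the function name, the hard 0.5 similarity cutoff, and the binary mask are all illustrative assumptions:

```python
import numpy as np

def separate_by_voiceprint(mixture_frames: np.ndarray,
                           frame_embeddings: np.ndarray,
                           target_voiceprint: np.ndarray) -> np.ndarray:
    """Keep only frames whose embedding lies close to the target voiceprint:
    a toy stand-in for a separation network conditioned on a speaker
    embedding. mixture_frames: (n_frames, frame_len); frame_embeddings:
    (n_frames, dim); target_voiceprint: (dim,)."""
    norm = np.linalg.norm
    # Cosine similarity between each frame embedding and the target vector.
    sims = frame_embeddings @ target_voiceprint / (
        norm(frame_embeddings, axis=1) * norm(target_voiceprint) + 1e-8)
    mask = (sims > 0.5).astype(float)      # hard binary mask per frame
    return mixture_frames * mask[:, None]  # zero out frames of other sources
```

A trained time-domain separation network learns a far finer-grained, sample-level version of this mapping instead of a per-frame binary decision.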