CN-122024699-A - Speech synthesis method, device, equipment and medium


Abstract

The invention relates to the technical field of data processing and discloses a speech synthesis method, device, equipment and medium. The method comprises: obtaining a target text to be subjected to speech synthesis and a target audio that provides a reference for the speech synthesis; determining a target query vector according to the target text and the target audio, and projecting the target query vector into a style retrieval space of a preset index library to obtain a target style query vector, wherein the preset index library comprises a mapping relation between a reference style vector and a corresponding reference timbre vector and reference speaker feature vector; matching a target style vector from the reference style vectors of the preset index library according to the target style query vector; and performing speech synthesis according to the target text, the target audio and the target style vector to obtain a speech synthesis result. The invention can be applied to the fields of financial technology and healthcare, and improves the accuracy of retrieval-augmented speech synthesis.

Inventors

SHI YAN

Assignees

Ping An Technology (Shenzhen) Co., Ltd. (平安科技（深圳）有限公司)

Dates

Publication Date: 2026-05-12
Application Date: 2026-01-14

Claims (10)

1. A speech synthesis method, comprising: acquiring a target text to be subjected to speech synthesis and a target audio providing a reference for the speech synthesis; determining a target query vector according to the target text and the target audio, and projecting the target query vector to a style retrieval space of a preset index library to obtain a target style query vector, wherein the preset index library comprises a mapping relation between a reference style vector and a corresponding reference timbre vector and reference speaker feature vector; matching, according to the target style query vector, a target style vector from the reference style vectors of the preset index library; and performing speech synthesis according to the target text, the target audio and the target style vector to obtain a speech synthesis result.
2. The method of claim 1, wherein the determining a target query vector according to the target text and the target audio, and projecting the target query vector to a style retrieval space of a preset index library to obtain a target style query vector, comprises: encoding the target text to obtain a text vector, and encoding the target audio to obtain an audio vector; concatenating the text vector and the audio vector to obtain the target query vector; and projecting the target query vector to the style retrieval space of the preset index library through a preset multi-layer perceptron to obtain the target style query vector.
3. The method according to claim 1, wherein the matching, according to the target style query vector, a target style vector from the reference style vectors of the preset index library comprises: calculating a similarity score between the target style query vector and each reference style vector in the preset index library; and matching the target style vector from the reference style vectors of the preset index library according to the similarity scores.
4. The method according to claim 1, wherein the performing speech synthesis according to the target text, the target audio and the target style vector to obtain a speech synthesis result comprises: encoding the target text and the target style vector to obtain intermediate features containing semantic information of the target text and style information of the target style vector; and performing feature extraction on the target audio to obtain target timbre features, and performing speech synthesis according to the target timbre features and the intermediate features to obtain the speech synthesis result.
5. The method according to claim 1, further comprising: acquiring reference audio; for any reference audio, performing style feature extraction on the reference audio to obtain a reference style vector, performing timbre feature extraction on the reference audio to obtain a reference timbre vector, and performing speaker feature extraction on the reference audio to obtain a reference speaker feature vector; and storing the reference style vector, the reference timbre vector and the reference speaker feature vector in association with one another in the preset index library.
6. The method according to claim 1, further comprising, after the matching a target style vector from the reference style vectors of the preset index library according to the target style query vector: obtaining, according to the target style vector, a target timbre vector associated with the target style vector from the reference timbre vectors of the preset index library; and performing speech synthesis according to the target text, the target style vector and the target timbre vector to obtain the speech synthesis result.
7. The method according to claim 1, further comprising, after the matching a target style vector from the reference style vectors of the preset index library according to the target style query vector: obtaining, according to the target style vector, a target speaker feature vector associated with the target style vector from the reference speaker feature vectors of the preset index library; and performing speech synthesis according to the target text, the target audio, the target style vector and the target speaker feature vector to obtain the speech synthesis result.
8. A speech synthesis apparatus, comprising: a first acquisition module, configured to acquire a target text to be subjected to speech synthesis and a target audio providing a reference for the speech synthesis; a query determining module, configured to determine a target query vector according to the target text and the target audio, and to project the target query vector to a style retrieval space of a preset index library to obtain a target style query vector, wherein the preset index library comprises a mapping relation between a reference style vector and a corresponding reference timbre vector and reference speaker feature vector; a matching module, configured to match, according to the target style query vector, a target style vector from the reference style vectors of the preset index library; and a first synthesis module, configured to perform speech synthesis according to the target text, the target audio and the target style vector to obtain a speech synthesis result.
9. A computer device, comprising a processor, a memory and a computer program stored in the memory and executable on the processor, wherein the processor implements the speech synthesis method according to any one of claims 1 to 7 when executing the computer program.
10. A computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the speech synthesis method according to any one of claims 1 to 7.
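
The index library of claim 5 and the similarity-based matching of claims 3, 6 and 7 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the patent: the `IndexEntry` record, the choice of cosine similarity as the similarity score, top-1 selection, and the toy two-dimensional vectors.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class IndexEntry:
    """Hypothetical record: the three reference vectors stored in association."""
    style: List[float]
    timbre: List[float]
    speaker: List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, used here as an assumed similarity score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def match_style(index: Sequence[IndexEntry],
                style_query: Sequence[float]) -> Tuple[IndexEntry, float]:
    """Score every reference style vector and return the best entry with its score."""
    if not index:
        raise ValueError("index library is empty")
    scored = [(cosine_similarity(style_query, entry.style), entry) for entry in index]
    best_score, best_entry = max(scored, key=lambda pair: pair[0])
    return best_entry, best_score


# Toy index library with made-up vectors (illustrative values only).
index = [
    IndexEntry(style=[1.0, 0.0], timbre=[0.3, 0.7], speaker=[0.9, 0.1]),
    IndexEntry(style=[0.0, 1.0], timbre=[0.6, 0.4], speaker=[0.2, 0.8]),
]

# The matched entry carries the associated timbre and speaker vectors with it.
entry, score = match_style(index, [0.1, 0.9])
print(entry.style, entry.timbre, entry.speaker, round(score, 3))
```

Because each record holds all three vectors, the associated timbre and speaker feature vectors are available as soon as the style vector is matched, without a second search.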

Description

Speech synthesis method, device, equipment and medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular to a speech synthesis method, device, equipment and medium.

Background

In recent years, the combination of zero-shot speech synthesis (Text To Speech, TTS) and retrieval-augmented generation (Retrieval-Augmented Generation, RAG) has made highly personalized and emotionally expressive intelligent speech services possible in the financial insurance field. However, existing methods have certain limitations when transferring retrieval augmentation to speech synthesis.

The first limitation is semantic bias of the retrieval space. Mainstream contrastive learning and semantic-level embeddings encourage text and audio to be aligned in a shared semantic space, so the retrieval model tends to return reference audio that is semantically similar but mismatched in style. For example, in an insurance customer service scenario, a query expressing a deep apology and a promise of full assistance may retrieve routine service speech with a flat tone, which cannot convey the required empathy and urgency and harms customer experience and service professionalism. Likewise, in a medical scenario, a query informing a patient that an examination result is ready for collection may retrieve speech with a relaxed or routine-business tone, which cannot convey concern for health risks or the seriousness of the situation and cannot meet the patient's psychological needs.

In addition, as in the insurance and medical scenarios above, the style in which a given text should be expressed depends on the situation, so the same text may correspond to different desired styles.
The existing deterministic retrieval mechanism cannot dynamically adapt to this diversity, which makes personalized style inference for unseen customer timbres even more difficult. The synthesized speech therefore lacks expressiveness and scene adaptability, which limits practical deployment in emotionally intensive interactions such as insurance sales, claims settlement and reassurance, and risk warnings. Therefore, how to improve the accuracy of retrieval-augmented speech synthesis is a problem to be solved.

Disclosure of Invention

The embodiments of the invention provide a speech synthesis method, device, equipment and medium, to solve the problem of how to improve the accuracy of retrieval-augmented speech synthesis.

In a first aspect, a speech synthesis method is provided, including: acquiring a target text to be subjected to speech synthesis and a target audio providing a reference for the speech synthesis; determining a target query vector according to the target text and the target audio, and projecting the target query vector to a style retrieval space of a preset index library to obtain a target style query vector, wherein the preset index library comprises a mapping relation between a reference style vector and a corresponding reference timbre vector and reference speaker feature vector; matching, according to the target style query vector, a target style vector from the reference style vectors of the preset index library; and performing speech synthesis according to the target text, the target audio and the target style vector to obtain a speech synthesis result.
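
The query-construction step of the first aspect (detailed in claim 2 as encoding, concatenation, and projection through a multi-layer perceptron) can be sketched as follows. This is a sketch under stated assumptions, not the patented implementation: the text and audio vectors are made-up stand-ins for encoder outputs, the weights are random rather than trained, and the ReLU activation, two-layer depth and final L2 normalization are illustrative choices that the patent does not specify.

```python
import math
import random
from typing import List, Sequence


def build_query(text_vec: Sequence[float], audio_vec: Sequence[float]) -> List[float]:
    """Concatenate the text vector and the audio vector into one query vector."""
    return list(text_vec) + list(audio_vec)


def linear(x: Sequence[float], weights: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def l2_normalize(x: Sequence[float]) -> List[float]:
    """Assumed post-processing so that cosine scores depend only on direction."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0.0 else list(x)


class StyleProjector:
    """Hypothetical two-layer perceptron mapping a query into the style retrieval space."""

    def __init__(self, in_dim: int, hidden_dim: int, style_dim: int, seed: int = 0):
        rng = random.Random(seed)  # random weights stand in for trained parameters

        def init(rows: int, cols: int) -> List[List[float]]:
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.w1, self.b1 = init(hidden_dim, in_dim), [0.0] * hidden_dim
        self.w2, self.b2 = init(style_dim, hidden_dim), [0.0] * style_dim

    def __call__(self, query: Sequence[float]) -> List[float]:
        hidden = relu(linear(query, self.w1, self.b1))
        return l2_normalize(linear(hidden, self.w2, self.b2))


# Made-up encoder outputs: a 3-dimensional text vector and a 2-dimensional audio vector.
query = build_query([0.2, -0.1, 0.4], [0.7, 0.3])
projector = StyleProjector(in_dim=len(query), hidden_dim=8, style_dim=4)
style_query = projector(query)
print(len(query), len(style_query))
```

The projected vector has the dimensionality of the reference style vectors, so it can be compared against them directly in the style retrieval space instead of in the shared semantic space.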
In a second aspect, a speech synthesis apparatus is provided, comprising: a first acquisition module, configured to acquire a target text to be subjected to speech synthesis and a target audio providing a reference for the speech synthesis; a query determining module, configured to determine a target query vector according to the target text and the target audio, and to project the target query vector to a style retrieval space of a preset index library to obtain a target style query vector, wherein the preset index library comprises a mapping relation between a reference style vector and a corresponding reference timbre vector and reference speaker feature vector; a matching module, configured to match, according to the target style query vector, a target style vector from the reference style vectors of the preset index library; and a first synthesis module, configured to perform speech synthesis according to the target text, the target audio and the target style vector to obtain a speech synthesis result.

In a third aspect, a computer device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the above speech synthesis method when executing the computer program.

In a fourth aspect, a computer readable storage medium is provided, the computer readable storage medium s