CN-115132206-B - Voice message processing method and device and electronic equipment

Abstract

The application discloses a voice message processing method, a voice message processing device, and electronic equipment, and belongs to the technical field of computers. The method comprises: obtaining a voice message to be processed; determining a key voice message corresponding to the voice message to be processed; and performing voice correction on the key voice message, through a trained voice correction model and based on the degree of similarity between the key voice message and the voice message to be processed, to obtain a target voice message. The trained voice correction model is obtained by fine-tuning training using sample voice of a target object, the target object being the message sending object corresponding to the voice message to be processed. The target voice message has the voice characteristics of the voice message to be processed and the voiceprint characteristics of the target object.

Inventors

LEI XIAFEI

Assignees

Vivo Mobile Communication Co., Ltd. (维沃移动通信有限公司)

Dates

Publication Date: 20260505
Application Date: 20220628

Claims (11)

1. A method for processing a voice message, comprising: acquiring a voice message to be processed; converting the voice message to be processed into a message text through a voice conversion model; extracting key content from the message text through a text extraction model to obtain a key text; converting the key text into a key voice message based on a text-to-speech mapping relation corresponding to the target object through a text conversion model; performing voice correction on the key voice message based on the similarity between the key voice message and the voice message to be processed through a trained voice correction model to obtain a target voice message; wherein the text-to-speech mapping relation comprises a mapping relation between characters and at least one speech segment, the at least one speech segment has voiceprint characteristics of the target object, and the speech characteristics of each speech segment are different; the trained voice correction model is obtained by fine-tuning training using sample voice of a target object, wherein the target object is a message sending object corresponding to the voice message to be processed, and the target voice message has voice characteristics of the voice message to be processed and voiceprint characteristics of the target object; the training step of the voice correction model comprises: acquiring voice characteristics of at least two sample voice messages in universal sample voice through a voice network of a voice correction model to be trained, wherein a preset similarity degree exists between any two sample voice messages; acquiring text-to-speech combination characteristics of each sample voice message based on the voice characteristics of the at least two sample voice messages through the text network of the voice correction model to be trained; determining the similarity degree between any two sample voice messages based on the text-to-speech combination characteristics of the any two sample voice messages through a similarity evaluation network of the voice correction model to be trained; training the voice network and the text network based on the difference between the similarity between any two sample voice messages and the preset similarity between any two sample voice messages until the similarity between any two sample voice messages is greater than or equal to the preset similarity, so as to obtain a pre-trained voice correction model; and performing fine-tuning training on the pre-trained voice correction model by utilizing sample voice of the target object until the training is finished, and obtaining the trained voice correction model.
2. The method of claim 1, wherein before the extracting key content from the message text through the text extraction model to obtain the key text, the method further comprises: performing error correction processing on the message text through a text error correction model; and the extracting key content from the message text through the text extraction model to obtain the key text comprises: extracting key content from the error-corrected message text through the text extraction model to obtain the key text.
3. The method of claim 1, wherein converting the voice message to be processed into message text through a voice conversion model comprises: converting the voice message to be processed into the message text based on a text-to-speech mapping relation corresponding to the target object through the voice conversion model; wherein the text-to-speech mapping relation is obtained by performing fine-tuning training on the voice correction model using sample voice of the target object.
4. The method of claim 1, wherein the text-to-speech mapping relation is obtained by fine-tuning the voice correction model using the sample voice of the target object.
5. The method according to claim 1, wherein the performing, by the trained voice correction model, voice correction on the key voice message based on the similarity between the key voice message and the voice message to be processed to obtain the target voice message includes: determining the similarity degree between the key voice message and the voice message to be processed through a voice correction model; and under the condition that the similarity degree is smaller than a preset threshold value, transmitting the similarity degree to at least one of the voice conversion model and the text conversion model so that at least one of the voice conversion model and the text conversion model adjusts respective output results until the similarity degree between the key voice message and the voice message to be processed is larger than or equal to the preset threshold value, and obtaining the target voice message.
6. The method of claim 5, wherein the determining, through a voice correction model, the similarity degree between the key voice message and the voice message to be processed comprises: acquiring the voice characteristics of the key voice message and the voice characteristics of the voice message to be processed through a voice network of the voice correction model; acquiring text-to-speech combination characteristics of the key voice message based on the voice characteristics of the key voice message through a text network of the voice correction model, and acquiring the text-to-speech combination characteristics of the voice message to be processed based on the voice characteristics of the voice message to be processed; and determining the similarity degree between the key voice message and the voice message to be processed based on the text-to-speech combination characteristics of the key voice message and the text-to-speech combination characteristics of the voice message to be processed through a similarity evaluation network of the voice correction model.
7. The method according to claim 3 or 4, wherein the step of obtaining the text-to-speech mapping relation includes: acquiring a plurality of speech segments from sample voice of a target object; acquiring the characters corresponding to each speech segment through a text network of the voice correction model; and determining the text-to-speech mapping relation according to each speech segment and the characters corresponding to each speech segment.
8. The method of claim 1, wherein the acquiring the voice message to be processed comprises: receiving at least one voice message sent by a target object in a session interface; and the method further comprises: displaying the target voice message in the session interface.
9. A voice message processing apparatus, comprising: the acquisition module is used for acquiring the voice message to be processed; the system comprises a determining module, a text extraction module, a text conversion module and a text processing module, wherein the determining module is used for converting the voice message to be processed into a message text through a voice conversion model; the correction module is used for carrying out voice correction on the key voice message based on the similarity degree between the key voice message and the voice message to be processed through a trained voice correction model to obtain a target voice message; wherein the text-to-speech mapping relation comprises a mapping relation between characters and at least one speech segment, the at least one speech segment has voiceprint characteristics of the target object, and the speech characteristics of each speech segment are different; the trained voice correction model is obtained by fine-tuning training using sample voice of a target object, wherein the target object is a message sending object corresponding to the voice message to be processed, and the target voice message has voice characteristics of the voice message to be processed and voiceprint characteristics of the target object; the correction module is used for: acquiring voice characteristics of at least two sample voice messages in universal sample voice through a voice network of a voice correction model to be trained, wherein a preset similarity degree exists between any two sample voice messages; acquiring text-to-speech combination characteristics of each sample voice message based on the voice characteristics of the at least two sample voice messages through the text network of the voice correction model to be trained; determining the similarity degree between any two sample voice messages based on the text-to-speech combination characteristics of the any two sample voice messages through a similarity evaluation network of the voice correction model to be trained; training the voice network and the text network based on the difference between the similarity between any two sample voice messages and the preset similarity between any two sample voice messages until the similarity between any two sample voice messages is greater than or equal to the preset similarity, so as to obtain a pre-trained voice correction model; and performing fine-tuning training on the pre-trained voice correction model by utilizing sample voice of the target object until the training is finished, and obtaining the trained voice correction model.
10. An electronic device comprising a processor, a memory and a program or instruction stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the voice message processing method of any of claims 1-8.
11. A readable storage medium, characterized in that the readable storage medium has stored thereon a program or instructions which, when executed by a processor, implement the steps of the voice message processing method according to any of claims 1-8.
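
The threshold-driven correction loop of claims 5 and 6 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: feature vectors are plain lists of floats, cosine similarity stands in for the similarity evaluation network, and `regenerate` is a hypothetical callable standing in for the voice conversion model and/or text conversion model that adjusts its output when the similarity degree is fed back to it. The claims do not specify any of these details.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed stand-in for the similarity evaluation network."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class CorrectionResult:
    features: List[float]   # features of the last candidate key voice message
    similarity: float       # similarity degree to the message to be processed
    rounds: int             # number of candidates that were evaluated
    converged: bool         # True if the preset threshold was reached


def correct_until_similar(
    original: Sequence[float],
    regenerate: Callable[[Optional[float]], List[float]],
    threshold: float = 0.9,
    max_rounds: int = 10,
) -> CorrectionResult:
    """Feed the similarity degree back until it reaches the threshold.

    `regenerate(feedback)` is hypothetical: it returns the features of a new
    key voice message; `feedback` is None on the first call and the previous
    similarity degree afterwards.
    """
    feedback: Optional[float] = None
    candidate: List[float] = []
    similarity = 0.0
    for rounds in range(1, max_rounds + 1):
        candidate = regenerate(feedback)
        similarity = cosine_similarity(candidate, original)
        if similarity >= threshold:
            return CorrectionResult(candidate, similarity, rounds, True)
        feedback = similarity  # below threshold: pass the degree back
    return CorrectionResult(candidate, similarity, max_rounds, False)


def make_toy_regenerator(start: Sequence[float], target: Sequence[float],
                         step: float = 0.5) -> Callable[[Optional[float]], List[float]]:
    """Toy model: each time feedback arrives, move halfway toward the target."""
    state = list(start)

    def regenerate(feedback: Optional[float]) -> List[float]:
        if feedback is not None:
            for i in range(len(state)):
                state[i] += step * (target[i] - state[i])
        return list(state)

    return regenerate
```

The `max_rounds` cap is a design choice of this sketch, not something the claims state; it keeps the loop from running forever when the conversion models cannot reach the preset threshold.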

Description

Voice message processing method and device and electronic equipment

Technical Field

The application belongs to the technical field of computers, and particularly relates to a voice message processing method, a voice message processing device, and electronic equipment.

Background

With the continuous advancement of science and technology, people use electronic equipment more and more frequently, and often use the voice message function of applications to contact one another. Voice messages are convenient, vivid, strongly characteristic of the user, and lose little information in transmission. However, when a user receives a plurality of long voice messages, the user needs to click on the voice messages one by one to hear the content, and because the voice messages contain little effective information, the efficiency of acquiring information from them is low.

Disclosure of Invention

The embodiments of the application provide a voice message processing method, a voice message processing device, and electronic equipment, which can solve the prior-art problem that acquiring information from voice messages is inefficient because listening to a plurality of voice messages in sequence takes a long time.
In a first aspect, an embodiment of the present application provides a method for processing a voice message, where the method includes: acquiring a voice message to be processed; determining a key voice message corresponding to the voice message to be processed; and performing voice correction on the key voice message based on the similarity between the key voice message and the voice message to be processed through a trained voice correction model to obtain a target voice message. The trained voice correction model is obtained by fine-tuning training using sample voice of a target object, the target object is a message sending object corresponding to the voice message to be processed, and the target voice message has voice characteristics of the voice message to be processed and voiceprint characteristics of the target object.

In a second aspect, an embodiment of the present application provides a voice message processing apparatus, including: an acquisition module, used for acquiring the voice message to be processed; a determining module, used for determining the key voice message corresponding to the voice message to be processed; and a correction module, used for performing voice correction on the key voice message based on the similarity degree between the key voice message and the voice message to be processed through a trained voice correction model to obtain a target voice message. The trained voice correction model is obtained by fine-tuning training using sample voice of a target object, the target object is a message sending object corresponding to the voice message to be processed, and the target voice message has voice characteristics of the voice message to be processed and voiceprint characteristics of the target object.
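
The flow of the first aspect, with the "determining" step broken down as in claim 1 (voice to text, key content extraction, text back to voice, then correction), can be sketched as follows. This is only an illustration under stated assumptions: audio is replaced by lists of string labels, the four stages are hypothetical callables, and the toy text-to-speech mapping holds a single segment per word for each sender, whereas claim 1 describes a mapping between characters and at least one speech segment. None of the names below come from the application.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VoiceMessage:
    sender: str           # the target object (message sending object)
    segments: List[str]   # stand-in for audio: one label per spoken segment


def process_voice_message(
    message: VoiceMessage,
    speech_to_text: Callable[[VoiceMessage], str],
    extract_key_text: Callable[[str], str],
    text_to_speech: Callable[[str, str], VoiceMessage],
    correct: Callable[[VoiceMessage, VoiceMessage], VoiceMessage],
) -> VoiceMessage:
    """Chain the stages; every callable is a hypothetical stand-in model."""
    text = speech_to_text(message)                        # voice conversion model
    key_text = extract_key_text(text)                     # text extraction model
    key_voice = text_to_speech(key_text, message.sender)  # text conversion model
    return correct(key_voice, message)                    # voice correction model


# Toy stand-ins, for illustration only.
FILLERS = {"um", "well", "so"}


def toy_speech_to_text(message: VoiceMessage) -> str:
    return " ".join(message.segments)


def toy_extract_key_text(text: str) -> str:
    return " ".join(word for word in text.split() if word not in FILLERS)


def make_toy_text_to_speech(
    mapping: Dict[str, Dict[str, str]]
) -> Callable[[str, str], VoiceMessage]:
    """`mapping[sender][word]` plays the role of the per-sender mapping."""
    def text_to_speech(key_text: str, sender: str) -> VoiceMessage:
        return VoiceMessage(sender, [mapping[sender][w] for w in key_text.split()])
    return text_to_speech


def toy_correct(key_voice: VoiceMessage, original: VoiceMessage) -> VoiceMessage:
    # A real correction model would adjust the voice; the toy returns it as is.
    return key_voice
```

Passing the stages in as callables keeps the sketch independent of any particular model; each stand-in can be swapped for a real component without changing the chaining function.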
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, and a program or instruction stored on the memory and executable on the processor, the program or instruction implementing the steps of the method according to the first aspect when executed by the processor.

In a fourth aspect, embodiments of the present application provide a readable storage medium having stored thereon a program or instructions which, when executed by a processor, perform the steps of the method according to the first aspect.

In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and where the processor is configured to execute a program or instructions to implement the method according to the first aspect.

In the embodiment of the application, a voice message to be processed is first obtained; then a key voice message corresponding to the voice message to be processed is determined; finally, voice correction is performed on the key voice message, through a trained voice correction model obtained by fine-tuning training using sample voice of a target object and based on the similarity degree between the key voice message and the voice message to be processed, to obtain the target voice message having the voice characteristics of the voice message to be processed and the voiceprint characteristics of the target object. According to the embodiment of the application, the key voice message is obtained by processing at least one voice message, the key voice message is corrected by utilizing the trained voice correction model, so that the target voice message with the voice characteristics of the voice message to be processed a