CN-121985197-A - System audio generation method, device, equipment and medium based on multi-round dialogue
Abstract
The application belongs to the field of artificial intelligence and relates to a system audio generation method, device, equipment and medium based on multi-round dialogue. The method comprises: obtaining the user text and audio and the system text and audio of each dialogue round generated by interaction between a user and a multi-round dialogue system, and fusing the user text with the user audio and the system text with the system audio of each round to obtain user and system short-term summary features; based on the system text, performing bidirectional semantic fusion to obtain user summary features and system summary features; performing feature interaction among all the summary features, guided by the system texts, to obtain long-term features; obtaining the total number of dialogue rounds and mapping it into an intimacy embedding vector; and generating the target system audio of the next dialogue round by combining the target system text of the next round with the long-term features and the intimacy embedding vector. The application can be applied to business fields such as finance, technology, insurance and medical treatment, and can meet personalized voice interaction requirements.
Inventors
- WU JING
- CHEN MINCHUAN
Assignees
- 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-07
Claims (10)
- 1. A system audio generation method based on multi-round dialogue, comprising the steps of: acquiring user texts, user audios, system texts and system audios of a plurality of dialogue rounds generated by interaction between a user and a multi-round dialogue system; for each dialogue round, fusing the user text with the user audio, and fusing the system text with the system audio, to obtain a user short-term summary feature and a system short-term summary feature of each dialogue round; based on the system text of each dialogue round, performing bidirectional semantic fusion processing on the user short-term summary feature and the system short-term summary feature of that round to obtain a user summary feature and a system summary feature of each dialogue round; based on the system texts of the plurality of dialogue rounds, performing feature interaction on all the user summary features and system summary features to obtain long-term features; obtaining the total number of dialogue rounds between the user and the multi-round dialogue system, and mapping the round number into an intimacy embedding vector; and obtaining a target system text of the multi-round dialogue system for the next dialogue round, and performing fusion conversion processing on the target system text, the long-term features and the intimacy embedding vector to generate target system audio of the multi-round dialogue system in the next dialogue round.
- 2. The method according to claim 1, wherein the step of fusing, for each dialogue round, the user text with the user audio and the system text with the system audio to obtain the user short-term summary feature and the system short-term summary feature of each dialogue round comprises: adopting a text encoder to perform semantic representation on the user text and the system text of each dialogue round, obtaining a user text vector and a system text vector of each dialogue round; adopting an audio encoder to extract acoustic features from the user audio and the system audio of each dialogue round, obtaining a user voice vector and a system voice vector of each dialogue round; and performing joint modeling on the user text vector, the user voice vector, the system text vector and the system voice vector through a fusion network based on a multi-layer perceptron to obtain the user short-term summary feature and the system short-term summary feature of each dialogue round.
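The encode-then-fuse step of claim 2 can be illustrated with a minimal NumPy sketch. The `encode_text` and `encode_audio` functions below are toy stand-ins (a real system would use a pretrained language model and a speech encoder); only the multi-layer-perceptron fusion over the concatenated modality vectors follows the claim's structure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the text and audio encoders named in claim 2.
def encode_text(text: str, dim: int = 16) -> np.ndarray:
    """Map a string to a fixed-size semantic vector (toy byte-based embedding)."""
    vec = np.zeros(dim)
    for i, ch in enumerate(text.encode("utf-8")):
        vec[i % dim] += ch / 255.0
    return vec / max(len(text), 1)

def encode_audio(samples: np.ndarray, dim: int = 16) -> np.ndarray:
    """Map a waveform to a fixed-size acoustic vector (toy frame statistics)."""
    frames = np.array_split(samples, dim)
    return np.array([f.mean() for f in frames])

def mlp_fuse(text_vec: np.ndarray, audio_vec: np.ndarray,
             w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Two-layer perceptron fusing one text and one audio vector into a
    single short-term summary feature, as in claim 2's fusion network."""
    x = np.concatenate([text_vec, audio_vec])   # joint modeling of both modalities
    h = np.maximum(w1 @ x, 0.0)                 # hidden layer with ReLU
    return w2 @ h

dim, hidden = 16, 32
w1 = rng.standard_normal((hidden, 2 * dim)) * 0.1
w2 = rng.standard_normal((dim, hidden)) * 0.1

# One fused summary feature per side for a single dialogue round.
user_feat = mlp_fuse(encode_text("hello"), encode_audio(rng.standard_normal(480)), w1, w2)
system_feat = mlp_fuse(encode_text("hi, how can I help?"),
                       encode_audio(rng.standard_normal(480)), w1, w2)
print(user_feat.shape, system_feat.shape)
```

In practice the same fusion weights would be shared across rounds, producing one user and one system short-term summary feature per dialogue round.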
- 3. The method according to claim 1, wherein the step of performing bidirectional semantic fusion processing on the user short-term summary feature and the system short-term summary feature of each dialogue round based on the system text of each dialogue round to obtain the user summary feature and the system summary feature of each dialogue round specifically comprises: performing feature integration processing on the system text, the user short-term summary feature and the system short-term summary feature of each dialogue round to obtain a joint feature sequence of each dialogue round; constructing a query vector, a key vector and a value vector based on the joint feature sequence, and performing attention calculation on the query vector, the key vector and the value vector to obtain joint features of each dialogue round; splitting the joint features into a system-side sub-feature and a user-side sub-feature based on the original sequence lengths of the user short-term summary feature and the system short-term summary feature of each dialogue round; performing nonlinear feature conversion on the system-side sub-feature through a first feature enhancement network to obtain the system summary feature of each dialogue round; and performing nonlinear feature conversion on the user-side sub-feature through a second feature enhancement network to obtain the user summary feature of each dialogue round.
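The attend-then-split pattern of claim 3 can be sketched as follows. This is an illustrative single-head self-attention over the concatenated sequences, with a one-layer `tanh` network standing in for each "feature enhancement network"; the dimensions and weights are arbitrary assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(seq, wq, wk, wv):
    """Scaled dot-product attention over one joint feature sequence."""
    q, k, v = seq @ wq, seq @ wk, seq @ wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(1)
d = 8
sys_short = rng.standard_normal((4, d))   # system short-term summary features
usr_short = rng.standard_normal((3, d))   # user short-term summary features
wq, wk, wv = (rng.standard_normal((d, d)) * 0.3 for _ in range(3))

# Joint sequence -> attention -> split back by the original sequence lengths.
joint = np.concatenate([sys_short, usr_short], axis=0)
fused = self_attention(joint, wq, wk, wv)
sys_side, usr_side = fused[:len(sys_short)], fused[len(sys_short):]

# Two separate "feature enhancement networks": one nonlinear layer each here.
w_sys = rng.standard_normal((d, d)) * 0.3
w_usr = rng.standard_normal((d, d)) * 0.3
system_summary = np.tanh(sys_side @ w_sys)
user_summary = np.tanh(usr_side @ w_usr)
print(system_summary.shape, user_summary.shape)
```

Because both sides attend over the same joint sequence, each side's summary feature reflects the other side's content, which is what makes the fusion bidirectional.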
- 4. The method of claim 3, wherein the step of performing feature integration processing on the system text, the user short-term summary feature and the system short-term summary feature of each dialogue round to obtain the joint feature sequence of each dialogue round specifically comprises: normalizing the user short-term summary feature and the system short-term summary feature of each dialogue round respectively to obtain a normalized user short-term summary feature and a normalized system short-term summary feature; fusing the system text vector corresponding to the system text of each dialogue round with the normalized system short-term summary feature of that round to obtain a fused feature; scaling and shifting the fused feature to obtain a system-side scale-shift feature, and performing linear transformation on the system-side scale-shift feature to obtain a system-side linear transformation feature; scaling and shifting the normalized user short-term summary feature of each dialogue round to obtain a user-side scale-shift feature, and performing linear transformation on the user-side scale-shift feature to obtain a user-side linear transformation feature; and splicing the system-side linear transformation feature and the user-side linear transformation feature along the sequence dimension to obtain the joint feature sequence of each dialogue round.
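A minimal sketch of claim 4's integration step, assuming layer normalization, element-wise text fusion, and shared scale/shift and linear weights for brevity (a real model would learn separate parameters per side):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each feature vector to zero mean and unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

rng = np.random.default_rng(5)
d = 8
usr = rng.standard_normal((3, d))     # user short-term summary features
sys_ = rng.standard_normal((4, d))    # system short-term summary features
sys_text = rng.standard_normal(d)     # system text vector for this round

# Normalize both sides; fuse the system side with the text vector (element-wise add here).
usr_n, sys_n = layer_norm(usr), layer_norm(sys_)
fused = sys_n + sys_text

# Scale and shift (gamma/beta would be learned), then a linear transformation.
gamma, beta = rng.standard_normal(d), rng.standard_normal(d)
w_lin = rng.standard_normal((d, d)) * 0.3
sys_lin = (fused * gamma + beta) @ w_lin
usr_lin = (usr_n * gamma + beta) @ w_lin

# Sequence-dimension splice yields the joint feature sequence for the round.
joint_seq = np.concatenate([sys_lin, usr_lin], axis=0)
print(joint_seq.shape)
```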
- 5. The method according to claim 1, wherein the step of performing feature interaction on all the user summary features and system summary features based on the system texts of the plurality of dialogue rounds to obtain long-term features specifically comprises: splicing the user summary feature and the system summary feature of each dialogue round along the sequence dimension to obtain a spliced feature of each dialogue round; stacking the spliced features of all dialogue rounds along the round dimension to obtain a multi-round joint feature sequence; performing element-level fusion on the system text feature corresponding to the system text of each dialogue round and the spliced feature of the corresponding round to obtain multi-round fusion features; performing feature enhancement and interaction modeling on the multi-round fusion features to obtain interaction features; inputting the text features corresponding to the system texts of all dialogue rounds into a gating adjustment unit to obtain a gating weight vector; performing weighted fusion on the interaction features and the gating weight vector to obtain gated encoded features; and performing deep encoding and global feature extraction on the gated encoded features through a multi-round feature deep encoding unit to obtain the long-term features.
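The stack-fuse-gate flow of claim 5 can be sketched as below. All shapes and the sigmoid gating unit are illustrative assumptions, and the final "deep encoding and global feature extraction" is reduced to a mean-pool for compactness.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
rounds, seq, d = 5, 6, 8

# Per-round spliced user+system summary features, stacked along the round axis.
multi_round = rng.standard_normal((rounds, seq, d))

# Element-level fusion with per-round system-text features (broadcast add here).
text_feat = rng.standard_normal((rounds, 1, d))
fused = multi_round + text_feat

# Gating adjustment unit: system-text features -> one weight per round.
w_gate = rng.standard_normal(d) * 0.3
gate = sigmoid(text_feat.squeeze(1) @ w_gate)      # shape (rounds,), values in (0, 1)
gated = fused * gate[:, None, None]                # weighted fusion per round

# "Deep encoding and global feature extraction": pool everything to one
# long-term vector summarizing the whole conversation history.
long_term = gated.reshape(-1, d).mean(axis=0)
print(long_term.shape)
```

The gate lets the model down-weight rounds whose system text carries little stylistic information before the long-term feature is extracted.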
- 6. The method according to claim 5, wherein the step of performing feature enhancement and interaction modeling on the multi-round fusion features to obtain interaction features specifically comprises: normalizing the multi-round fusion features to obtain normalized multi-round fusion features, and scaling and shifting the normalized multi-round fusion features to obtain scale-shift features; performing multi-layer perceptron transformation on the scale-shift features to obtain local nonlinear features; performing attention modeling on the scale-shift features to obtain global dependency features; and fusing the local nonlinear features and the global dependency features at element level to obtain the interaction features.
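Claim 6's two-branch design (a local MLP branch plus a global attention branch over the same scale-shift features) can be sketched as follows; the weights, dimensions and element-wise addition used for the final fusion are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

rng = np.random.default_rng(4)
seq, d = 6, 8
fused = rng.standard_normal((seq, d))         # multi-round fusion features

# Normalize, then scale and shift (gamma/beta would be learned parameters).
gamma, beta = rng.standard_normal(d), rng.standard_normal(d)
scaled = layer_norm(fused) * gamma + beta

# Local branch: a small per-position MLP captures local nonlinear structure.
w1, w2 = rng.standard_normal((d, d)) * 0.3, rng.standard_normal((d, d)) * 0.3
local = np.maximum(scaled @ w1, 0.0) @ w2

# Global branch: single-head self-attention captures cross-position dependencies.
scores = scaled @ scaled.T / np.sqrt(d)
attn = np.exp(scores - scores.max(-1, keepdims=True))
attn /= attn.sum(-1, keepdims=True)
global_dep = attn @ scaled

# Element-level fusion of the two branches gives the interaction features.
interaction = local + global_dep
print(interaction.shape)
```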
- 7. The method according to claim 1, wherein the step of performing fusion conversion processing on the target system text, the long-term features and the intimacy embedding vector to generate the target system audio of the multi-round dialogue system in the next dialogue round specifically comprises: inputting the target system text into a phoneme sequence generation model to obtain a phoneme sequence, and extracting phoneme hidden features from the phoneme sequence with a phoneme hidden feature extraction model; splicing the long-term features, the intimacy embedding vector and the phoneme hidden features to obtain fusion features; performing acoustic feature conversion processing on the fusion features through an acoustic decoder to generate a target Mel spectrogram; and converting the target Mel spectrogram into waveform speech through a vocoder to obtain the target system audio of the multi-round dialogue system in the next dialogue round.
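The synthesis pipeline of claim 7 (text → phonemes → spliced fusion features → Mel spectrogram → waveform) can be sketched end to end. Everything here is a toy stand-in: `PHONES` is a hypothetical phoneme inventory (a real system uses a grapheme-to-phoneme model), the decoder is one linear layer, and the "vocoder" is a placeholder; only the data flow mirrors the claim.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical phoneme inventory; a real system would use a G2P model.
PHONES = {"h": 0, "e": 1, "l": 2, "o": 3}

def text_to_phonemes(text):
    """Phoneme sequence generation (toy character lookup)."""
    return [PHONES[c] for c in text if c in PHONES]

def phoneme_hidden(phoneme_ids, table):
    """Look up per-phoneme hidden features, then average into one vector."""
    return table[phoneme_ids].mean(axis=0)

d = 8
embed_table = rng.standard_normal((len(PHONES), d))
long_term = rng.standard_normal(d)            # long-term features from claim 5
intimacy = rng.standard_normal(d)             # mapped from the total round count

# Splice phoneme hidden features with the long-term and intimacy vectors,
# decode to a (toy, single-frame) Mel spectrogram, then "vocode" to a waveform.
fusion = np.concatenate([long_term, intimacy,
                         phoneme_hidden(text_to_phonemes("hello"), embed_table)])
w_dec = rng.standard_normal((80, fusion.size)) * 0.1   # 80 Mel bins per frame
mel_frame = w_dec @ fusion
waveform = np.tile(mel_frame, 4)              # stand-in for a neural vocoder
print(mel_frame.shape, waveform.shape)
```

Because the long-term and intimacy vectors are spliced in before decoding, the same target text can yield different speaking styles as the conversation history and round count change.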
- 8. A system audio generation device based on multi-round dialogue, comprising: an acquisition module, configured to acquire user texts, user audios, system texts and system audios of a plurality of dialogue rounds generated by interaction between a user and a multi-round dialogue system; a fusion module, configured to fuse, for each dialogue round, the user text with the user audio and the system text with the system audio to obtain the user short-term summary feature and the system short-term summary feature of each dialogue round; a semantic fusion module, configured to perform bidirectional semantic fusion processing on the user short-term summary feature and the system short-term summary feature of each dialogue round based on the system text of each dialogue round to obtain the user summary feature and the system summary feature of each dialogue round; an interaction module, configured to perform feature interaction on all the user summary features and system summary features based on the system texts of the plurality of dialogue rounds to obtain long-term features; a mapping module, configured to obtain the total number of dialogue rounds between the user and the multi-round dialogue system and map the round number into an intimacy embedding vector; and a fusion conversion module, configured to obtain a target system text of the multi-round dialogue system for the next dialogue round, perform fusion conversion processing on the target system text, the long-term features and the intimacy embedding vector, and generate target system audio of the multi-round dialogue system in the next dialogue round.
- 9. A computer device comprising a memory and a processor, the memory storing computer-readable instructions which, when executed by the processor, implement the steps of the multi-round-dialogue-based system audio generation method of any one of claims 1 to 7.
- 10. A computer-readable storage medium having stored thereon computer-readable instructions which, when executed by a processor, implement the steps of the multi-round-dialogue-based system audio generation method of any one of claims 1 to 7.
Description
System audio generation method, device, equipment and medium based on multi-round dialogue

Technical Field

The application relates to the technical field of artificial intelligence, is applied to the online processing of business scenarios such as financial technology, insurance and medical treatment, and particularly relates to a system audio generation method, device, equipment and medium based on multi-round dialogue.

Background

Intelligent dialogue systems have advanced significantly with the development of large language model (LLM) technology. LLMs possess powerful internal reasoning, external knowledge acquisition and an accurate grasp of dialogue context, enabling an intelligent dialogue system to produce high-quality, context-appropriate reply text and bring a more natural, intelligent communication experience to the user. After the LLM completes text generation, Text-to-Speech (TTS) technology is the key to endowing a dialogue system with real voice interaction capability: it converts the text content into smooth, natural speech, making communication more intuitive and convenient. With a turn-taking mechanism introduced, the system can memorize historical dialogue rounds, better simulate human communication rhythm, update user data across long-term interaction, and understand and adapt to user preferences in depth, improving the degree of personalization. However, the prior art has a significant shortcoming in voice interaction: although LLM-based text generation already supports advanced features such as gradual updating and actively initiating dialogue, the subsequent TTS module is not sufficiently adaptable.
At present, the TTS stage outputs a single language and style. It can be coarsely controlled by emotion labels, but in the face of changes in interaction count, intimacy and context it cannot be controlled in a dynamically and continuously varying manner at fine-granularity levels such as speech speed and intonation, and it is difficult to meet users' requirements for high-quality, personalized voice interaction.

Disclosure of Invention

The embodiments of the application aim to provide a system audio generation method, device, computer equipment and storage medium based on multi-round dialogue, to solve the problem that an existing dialogue system cannot control the fine-granularity dynamic style characteristics of output speech when facing multi-round dialogue and interaction changes.

In a first aspect, a system audio generation method based on multi-round dialogue is provided, which adopts the following technical scheme: the method comprises the steps of obtaining user texts, user audios, system texts and system audios of a plurality of dialogue rounds generated by interaction between a user and a multi-round dialogue system; fusing, for each dialogue round, the user text with the user audio and the system text with the system audio to obtain user short-term summary features and system short-term summary features of each dialogue round; performing bidirectional semantic fusion processing on the user short-term summary features and the system short-term summary features of each dialogue round based on the system text of each dialogue round to obtain user summary features and system summary features of each dialogue round; performing feature interaction on all the user summary features and system summary features based on the system texts of the plurality of dialogue rounds to obtain long-term features; obtaining the total number of dialogue rounds between the user and the multi-round dialogue system and mapping the round number into an intimacy embedding vector; obtaining a target system text of the multi-round dialogue system for the next dialogue round; and performing fusion conversion processing on the target system text, the long-term features and the intimacy embedding vector to generate target system audio of the multi-round dialogue system in the next dialogue round.

In a second aspect, a system audio generation device based on multi-round dialogue is provided, which adopts the following technical scheme: the device comprises an acquisition module, configured to acquire user texts, user audios, system texts and system audios of a plurality of dialogue rounds generated by interaction between a user and a multi-round dialogue system; a fusion module, configured to fuse, for each dialogue round, the user text with the user audio and the system text with the system audio to obtain the user short-term summary feature and the system short-term summary feature of each dialogue round; a semantic fusion module, configured to perform bidirectional semantic fusion processing on the user short-term summary features and the system short-term summary