CN-121999756-A - Method for determining dialogue model training sample and audio dialogue method
Abstract
The present disclosure relates to the field of speech dialogue technology, and in particular to a method for determining dialogue model training samples and an audio dialogue method. The method for determining dialogue model training samples comprises: determining an audio dialogue data set; determining dialogue states of text data corresponding to the audio dialogue data set; training a first model that takes the text data as model input and the dialogue states as model output; labeling the text data corresponding to the audio dialogue data set with the first model; and determining the audio dialogue data set together with the labeled dialogue states as training samples of a second model, wherein the sample demand of the second model is greater than a quantity threshold.
Inventors
- Cao Anji
- Li Yuxin
- Wei Haiwei
- Liu Kai
Assignees
- Gongdao Network Technology Co., Ltd. (共道网络科技有限公司)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2026-04-08
Claims (10)
- 1. A method for determining dialogue model training samples, the method comprising: determining an audio dialogue data set comprising collected real dialogue audio and generated synthesized dialogue audio; determining a dialogue state of text data corresponding to the audio dialogue data set; training a first model with the text data as model input and the dialogue state as model output; and labeling the text data corresponding to the audio dialogue data set with the first model, and determining the audio dialogue data set and the labeled dialogue state as training samples of a second model, wherein the sample demand of the second model is greater than a quantity threshold.
- 2. The method according to claim 1, further comprising: training the second model with the training samples, wherein the input of the second model is the audio dialogue data set, the output of the second model is a dialogue state, and the response duration of the second model is less than a duration threshold.
- 3. The method of claim 2, wherein the second model comprises an audio encoding module, an adaptation module, and a semantic module, and wherein training the second model with the training samples comprises: freezing the semantic module and training the audio encoding module and the adaptation module with speech recognition data so as to align audio features; and then training the audio encoding module, the adaptation module, and the semantic module with the training samples.
- 4. An audio dialogue method, characterized in that the method is implemented on the basis of the second model in the method according to any one of claims 1 to 3, the method comprising: in a case where it is determined that an audio signal contains a human voice, inputting the audio signal into the trained second model and determining a dialogue state according to the output of the second model; in a case where the dialogue state is responding, generating a response text for the audio signal through a third model; and converting the response text into audio and playing the audio.
- 5. The method of claim 4, wherein, in the case where it is determined that the audio signal contains a human voice, the method further comprises: performing speech recognition on the audio signal to determine text data, wherein the text data is used by the third model to generate the response text.
- 6. The method of claim 4, wherein converting the response text into audio and playing the audio comprises: upon the response text having been converted into audio, determining a current dialogue state according to the second model; playing the audio in a case where the current dialogue state is responding; and discarding the audio in a case where the current dialogue state is listening.
- 7. The method according to claim 4, further comprising: in a case where an audio signal containing a human voice is detected while the audio is being played, determining, through the second model, the dialogue state corresponding to that audio signal; and stopping playback of the audio in a case where the dialogue state corresponding to the audio signal containing the human voice is not responding.
- 8. The method of claim 7, further comprising: reducing the playback volume of the audio in a case where an audio signal containing a human voice is detected while the audio is being played; and, within a first duration after the playback volume has been reduced, if the dialogue state corresponding to the audio signal containing the human voice is responding, restoring the original volume and continuing to play the audio.
- 9. An electronic device, comprising a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to perform the method of any one of claims 1 to 8 by invoking the computer program.
- 10. A computer program product comprising a computer program which, when executed by a processor, implements the method as claimed in any one of claims 1 to 8.
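The two-stage schedule of claim 3 — first freezing the semantic module while the audio encoding and adaptation modules are aligned on speech recognition data, then training all three modules jointly on the labeled samples — can be sketched in plain Python. This is a minimal illustration of the freezing logic only; the module names follow the claim, but the `Module`/`train` machinery is a hypothetical stand-in, not the patent's implementation:

```python
class Module:
    """Minimal stand-in for a trainable network module."""
    def __init__(self, name):
        self.name = name
        self.frozen = False
        self.update_count = 0

    def step(self):
        # A frozen module receives no gradient updates.
        if not self.frozen:
            self.update_count += 1


def train(modules, batches):
    """One training pass: every unfrozen module is updated once per batch."""
    for _ in range(batches):
        for m in modules:
            m.step()


# Second model = audio encoding module + adaptation module + semantic module.
audio_encoder = Module("audio_encoder")
adapter = Module("adapter")
semantic = Module("semantic")

# Stage 1 (claim 3): freeze the semantic module; align audio features
# by training encoder + adapter on speech recognition data.
semantic.frozen = True
train([audio_encoder, adapter, semantic], batches=3)

# Stage 2: unfreeze and train all three modules on the training samples.
semantic.frozen = False
train([audio_encoder, adapter, semantic], batches=2)

print(audio_encoder.update_count, adapter.update_count, semantic.update_count)
# prints: 5 5 2  -- the semantic module was untouched during stage 1
```

In a real framework the same effect would come from disabling gradients on the semantic module's parameters during stage 1, then re-enabling them for joint training.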
Description
Method for determining dialogue model training samples and audio dialogue method

Technical Field
The present disclosure relates to the field of speech dialogue technology, and in particular to a method for determining dialogue model training samples and an audio dialogue method.

Background
In the field of speech dialogue technology, it is necessary to determine whether a user's speech requires a response, that is, for a received audio signal, whether the corresponding dialogue state is listening (listen) or responding (speak). In the related art, a received audio signal is converted into text, and semantic analysis of the text then determines whether to respond (speak) or continue listening (listen) to the user's utterance. However, this approach must wait for the audio signal to be converted into text and then wait for the semantic analysis result, so the dialogue state is judged late, the delay is long, and the user experience is poor.

Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a method for determining dialogue model training samples and an audio dialogue method, which can solve the above-mentioned problems.
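The latency problem of the related-art cascade — transcribe first, then decide the dialogue state from the text — can be sketched as follows. `transcribe` and `analyze` are hypothetical stand-ins with illustrative latencies; the point is only that the delays of the two stages accumulate before any decision is available:

```python
def transcribe(audio):
    # Hypothetical ASR stand-in: returns (text, latency in seconds).
    # Here the "audio" is already a transcript, for illustration only.
    return audio, 0.25

def analyze(text):
    # Hypothetical semantic analysis: a completed sentence warrants a reply.
    state = "speak" if text.rstrip().endswith(("?", ".")) else "listen"
    return state, 0.25

def cascaded_state(audio):
    # Total delay is the sum of both stages -- the bottleneck the
    # disclosure's second (audio-in, state-out) model is meant to avoid.
    text, t1 = transcribe(audio)
    state, t2 = analyze(text)
    return state, t1 + t2

print(cascaded_state("how do I reset my password?"))  # ('speak', 0.5)
print(cascaded_state("so what I wanted to ask is"))   # ('listen', 0.5)
```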
According to a first aspect of embodiments of the present disclosure, a method for determining dialogue model training samples is provided. The method includes: determining an audio dialogue data set, where the audio dialogue data set includes collected real dialogue audio and generated synthesized dialogue audio; determining a dialogue state of text data corresponding to the audio dialogue data set; training a first model with the text data as model input and the dialogue state as model output; labeling the text data corresponding to the audio dialogue data set with the first model; and determining the audio dialogue data set and the labeled dialogue state as training samples of a second model, where the sample demand of the second model is greater than a quantity threshold. According to a second aspect of embodiments of the present disclosure, an audio dialogue method is provided. The method is implemented based on the second model in the method of the first aspect, and includes: inputting an audio signal into the trained second model when it is determined that the audio signal contains a human voice, and determining a dialogue state according to the output of the second model; generating a response text for the audio signal through a third model when the dialogue state is responding; and converting the response text into audio and playing the audio.
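The first aspect amounts to a weak-labeling pipeline: a text-based first model, trainable on relatively few labeled transcripts, annotates the large audio corpus so that the data-hungry second model receives enough (audio, state) pairs. A minimal sketch follows; the file names, the keyword-rule `first_model`, and the threshold value are hypothetical placeholders standing in for the trained text classifier and the patent's quantity threshold:

```python
# Audio corpus with paired transcripts (real + synthesized dialogue audio).
corpus = [
    {"audio": "clip_001.wav", "text": "can you check my order status?"},
    {"audio": "clip_002.wav", "text": "well, the thing is, um"},
    {"audio": "clip_003.wav", "text": "thanks, that is all I needed."},
]

def first_model(text):
    """Hypothetical text -> dialogue-state classifier standing in for the
    trained first model (input: text, output: dialogue state)."""
    return "speak" if text.rstrip().endswith(("?", ".")) else "listen"

# Label the transcripts with the first model, then pair each label with
# the *audio* so the second model can learn audio -> state directly.
training_samples = [
    {"audio": item["audio"], "state": first_model(item["text"])}
    for item in corpus
]

SAMPLE_DEMAND = 3  # illustrative stand-in for the quantity threshold
print(len(training_samples) >= SAMPLE_DEMAND)
print(training_samples[0])  # {'audio': 'clip_001.wav', 'state': 'speak'}
```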
According to a third aspect of embodiments of the present disclosure, there is provided an apparatus for determining dialogue model training samples, the apparatus including: a first determination unit configured to determine an audio dialogue data set including collected real dialogue audio and generated synthesized dialogue audio; a second determination unit configured to determine a dialogue state of text data corresponding to the audio dialogue data set; a model training unit configured to train a first model with the text data as model input and the dialogue state as model output; and a labeling unit configured to label the text data corresponding to the audio dialogue data set with the first model and to determine the audio dialogue data set and the labeled dialogue state as training samples of a second model, the sample demand of the second model being greater than a quantity threshold. According to a fourth aspect of embodiments of the present disclosure, there is provided an audio dialogue apparatus implemented based on the second model in the method of the first aspect, the apparatus including: an input unit configured to, if it is determined that an audio signal contains a human voice, input the audio signal into the trained second model and determine a dialogue state according to the output of the second model; a generation unit configured to generate a response text for the audio signal through a third model if the dialogue state is responding; and a playback unit configured to convert the response text into audio and play the audio. According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device comprising a processor and a memory, the memory being configured to store a computer program, and the processor being configured to perform the method according to the first or second aspect by invoking the computer program.
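The audio dialogue flow of the second aspect, together with the barge-in handling of claims 7 and 8, can be sketched as one control loop. Every function here is a stub placeholder for the corresponding model (VAD, second model, third model); the string-matching rules are purely illustrative:

```python
def contains_voice(signal):
    # Hypothetical voice-activity detector.
    return signal != "silence"

def second_model(signal):
    # Stub for the trained audio -> dialogue-state model.
    return "speak" if signal.endswith("?") else "listen"

def third_model(signal):
    # Stub for the response-text generator (third model).
    return "reply to: " + signal

def handle(signal, interruption=None):
    """One turn of the audio dialogue method (second aspect)."""
    if not contains_voice(signal):
        return "idle"
    if second_model(signal) != "speak":
        return "listening"
    response = third_model(signal)  # would be converted to audio and played
    if interruption is not None and contains_voice(interruption):
        # Barge-in (claims 7-8): lower the volume, re-check the dialogue
        # state of the interrupting audio; stop playback unless that state
        # is still "responding", in which case restore the volume.
        if second_model(interruption) != "speak":
            return "playback stopped"
        return "playing (volume restored): " + response
    return "playing: " + response

print(handle("silence"))                        # idle
print(handle("hold on, I am still talking"))    # listening
print(handle("what are your opening hours?"))   # playing: reply to: ...
print(handle("what are your opening hours?", "wait, stop"))  # playback stopped
```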
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to the first or second aspect. The technical solution provided by embodiments of the present disclosure can have the following beneficial effects: the audio dialogue method provided by the present disclosure can directly determine the dialogue state according t