CN-119626209-B - Voice recognition model training method and intelligent customer service voice processing method
Abstract
The invention provides a speech recognition model training method, an intelligent customer service voice processing method and device, an electronic device, and a computer-readable storage medium. The method comprises: obtaining voice sample data and generating a label for each voice sample; training a pre-training model with the voice sample data to obtain a trained pre-training model; performing dialect adaptation training on the pre-training model with the voice sample data carrying dialect labels to obtain a dialect understanding model; and fusing the pre-training model with the dialect understanding model to obtain a language understanding model. By labelling the voice sample data and training the pre-training model on dialect data, the dialect understanding model can accurately recognize dialects, and the fused model can accurately recognize both Mandarin and dialects. User questions can therefore be matched to answers more accurately, communication efficiency is improved, and users receive a better service experience.
Inventors
- GUO HONGLEI
Assignees
- 中电信人工智能科技(北京)有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20241114
Claims (10)
- 1. A speech recognition model training method, comprising: obtaining voice sample data and generating a label for each voice sample, wherein the voice sample data comprises Mandarin and dialects, the labels comprise Mandarin labels and dialect labels, and a dialect label indicates the type of dialect; training a pre-training model with the voice sample data to obtain a trained pre-training model; performing dialect adaptation training on the pre-training model with the voice sample data carrying dialect labels to obtain a dialect understanding model, wherein the dialect understanding model recognizes text from dialect voice data and the pre-training model recognizes text from Mandarin voice data; and fusing the pre-training model with the dialect understanding model to obtain a language understanding model, wherein the language understanding model recognizes text from both dialect and Mandarin voice data; wherein fusing the pre-training model with the dialect understanding model to obtain the language understanding model comprises: inputting the voice sample data into the pre-training model and outputting a first predicted text recognition result for the voice sample data; inputting the voice sample data into the dialect understanding model and outputting a second predicted text recognition result; fusing the first and second predicted text recognition results into a target predicted text recognition result for the voice sample data; inputting the voice sample data carrying Mandarin labels, the voice sample data carrying dialect labels, and the voice sample data associated with the target predicted text recognition result into an initial model to obtain a third predicted text recognition result output by the initial model; determining a first loss value from the difference between the third predicted text recognition result and the target predicted text recognition result; determining a second loss value from the difference between the third predicted text recognition result and the Mandarin information stored in the Mandarin labels and the dialect information stored in the dialect labels; and training the initial model according to the first loss value, the second loss value, and a preset loss function to obtain the language understanding model, wherein the preset loss function corresponds to the recognition accuracy on real voice sample data.
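The fusion-and-distillation procedure of claim 1 can be sketched as follows. This is a minimal illustration under assumed names (`fuse_predictions`, `first_loss`, `second_loss`); the patent does not specify the fusion weights or the exact loss forms, so a weighted average, a squared-difference distillation loss, and a negative-log-probability supervised loss are used purely as placeholders.

```python
import math

def fuse_predictions(p_pretrain, p_dialect, w=0.5):
    """Fuse two models' per-token probabilities into the 'target'
    prediction by weighted averaging (weight w is an assumption)."""
    return [w * a + (1 - w) * b for a, b in zip(p_pretrain, p_dialect)]

def first_loss(p_student, p_target):
    """First loss of claim 1: difference between the initial model's
    (student's) prediction and the fused target, here mean squared."""
    return sum((s - t) ** 2 for s, t in zip(p_student, p_target)) / len(p_target)

def second_loss(p_student, label_indices):
    """Second loss of claim 1: difference between the student's
    prediction and the label information, here the mean negative
    log-probability assigned to the labelled tokens."""
    return -sum(math.log(max(p_student[i], 1e-9))
                for i in label_indices) / len(label_indices)
```

The total training objective would then combine both values with the preset loss function, e.g. `total = alpha * l1 + beta * l2`, with weights left unspecified by the patent.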
- 2. The method of claim 1, wherein before training the pre-training model with the voice sample data, the method further comprises: performing noise filtering on the voice sample data to obtain noise-filtered voice sample data; and performing word-by-word audio segmentation on the noise-filtered voice sample data to obtain single-word voice data; and wherein training the pre-training model with the voice sample data comprises: after the audio segmentation is detected to be complete, training the pre-training model on the single-word voice data and the labels of the voice sample data to which the single-word voice data belongs, to obtain the trained pre-training model.
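A toy version of the claim-2 preprocessing (noise filtering followed by word-by-word segmentation) might look like the following. The amplitude threshold and the silence-gap heuristic are illustrative assumptions, not the patent's method, and real systems would operate on waveforms or spectrograms rather than bare sample lists.

```python
def noise_filter(samples, threshold=0.05):
    """Crude noise gate: zero out samples below an amplitude threshold."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]

def segment_by_silence(samples, min_gap=2):
    """Split filtered audio into word-level chunks wherever a run of
    at least `min_gap` silent (zero) samples is found."""
    chunks, cur, gap = [], [], 0
    for s in samples:
        if s == 0.0:
            gap += 1
            if gap >= min_gap and cur:
                chunks.append(cur)
                cur = []
        else:
            cur.append(s)
            gap = 0
    if cur:
        chunks.append(cur)
    return chunks
```

Each resulting chunk would then be paired with the label of the utterance it came from before being fed to the pre-training model.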
- 3. The method of claim 1, wherein performing dialect adaptation training on the pre-training model with the voice sample data carrying dialect labels comprises: when an update to the dialect-labelled voice sample data is detected, performing dialect adaptation training on the pre-training model with the voice sample data carrying dialect labels to obtain the dialect understanding model; when the number of concurrent tasks being processed by the dialect understanding model is detected to be below a preset threshold, performing dialect adaptation training on the pre-training model with the voice sample data carrying dialect labels to obtain the dialect understanding model; and when a fine-tuning instruction is received, training the pre-training model with the voice sample data and performing dialect adaptation training on it with the voice sample data carrying dialect labels, to obtain a dialect understanding model corresponding to the fine-tuning instruction.
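The three retraining triggers of claim 3 amount to a simple dispatch condition, which can be sketched as below; the function name and parameters are assumptions for illustration only.

```python
def should_retrain(dialect_data_updated: bool,
                   concurrent_tasks: int,
                   threshold: int,
                   fine_tune_requested: bool) -> bool:
    """Claim-3 triggers for dialect adaptation training: new
    dialect-labelled data has arrived, the model has spare capacity
    (concurrent tasks below the preset threshold), or an explicit
    fine-tuning instruction was received."""
    return (dialect_data_updated
            or concurrent_tasks < threshold
            or fine_tune_requested)
```

Training under load is avoided because the second condition only fires when the concurrent task count drops below the threshold.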
- 4. The method of claim 1, wherein the pre-training model comprises convolution layers, recurrent layers, and fully-connected layers, and wherein during the dialect adaptation training the method further comprises: increasing the number of convolution layers of the pre-training model to obtain dialect-specific convolution layers, and, when extracting features from the voice sample data carrying dialect labels, freezing the convolution layers other than the dialect-specific convolution layers so as to retain general features among the extracted features; increasing the number of recurrent layers of the pre-training model to obtain bidirectional recurrent layers; and increasing the number of fully-connected layers of the pre-training model to obtain a dialect-specific fully-connected layer, which maps the voice sample data carrying dialect labels to dialect categories.
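The layer-growing and freezing scheme of claim 4 can be illustrated with a minimal stand-in model, here a plain dictionary of layer records rather than a real deep-learning framework; all names (`build_pretrained`, `adapt_for_dialect`, `conv_dialect`, etc.) are assumptions.

```python
def build_pretrained():
    """A stand-in pre-training model: convolution, recurrent,
    and fully-connected layers, all trainable."""
    return {
        "conv": [{"name": f"conv{i}", "trainable": True} for i in range(3)],
        "rnn":  [{"name": "bilstm0", "trainable": True}],
        "fc":   [{"name": "fc0", "trainable": True}],
    }

def adapt_for_dialect(model):
    """Apply the claim-4 adaptation: freeze the general convolution
    layers so their general features are retained, then append
    dialect-specific layers that remain trainable."""
    for layer in model["conv"]:
        layer["trainable"] = False
    model["conv"].append({"name": "conv_dialect", "trainable": True})
    model["rnn"].append({"name": "bilstm_dialect", "trainable": True})
    model["fc"].append({"name": "fc_dialect", "trainable": True})
    return model
```

In a framework such as PyTorch, the freezing step would correspond to setting `requires_grad = False` on the original convolution parameters.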
- 5. The method of claim 1, wherein obtaining voice sample data and generating a label for each voice sample comprises: performing text transcription on the obtained voice sample data to obtain a transcription result comprising the text corresponding to each voice sample; combining the transcription result with the per-character phonetic notation (pinyin) to obtain text sample data; and generating labels for the text sample data and the voice sample data respectively, wherein the text sample data comprises Mandarin and regional dialects, the labels comprise Mandarin labels and dialect labels, and a dialect label indicates the type of dialect; and wherein performing dialect adaptation training on the pre-training model with the voice sample data carrying dialect labels comprises: combining the dialect-labelled text sample data with the dialect-labelled voice sample data to obtain multi-modal dialect data containing the voice sample data and its corresponding text sample data; and performing dialect adaptation training on the pre-training model with the multi-modal dialect data to obtain the dialect understanding model, which recognizes dialect voice data to produce a text recognition result.
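The multi-modal pairing described in claim 5 can be sketched as a simple record-building step; the field names and function signature are assumptions, and the transcription and pinyin inputs are taken as given.

```python
def build_multimodal_dialect_data(voice_samples, transcripts, pinyin):
    """Pair each dialect voice sample with its text transcription and
    per-character pinyin, producing the multi-modal records that
    claim 5 feeds into dialect adaptation training."""
    return [
        {"audio": a, "text": t, "pinyin": p}
        for a, t, p in zip(voice_samples, transcripts, pinyin)
    ]
```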
- 6. A voice processing method for intelligent customer service, comprising: acquiring to-be-processed audio data input by a user; inputting the to-be-processed audio data into a language understanding model trained according to any one of claims 1 to 5, to obtain a text recognition result corresponding to the audio data; determining a reply sentence matching the text recognition result according to the text recognition result and a preset knowledge base; and sending the reply sentence to the user's device as a reply to the to-be-processed audio data.
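The knowledge-base matching step of claim 6 is unspecified in the patent; a naive word-overlap matcher, purely as a stand-in, could look like this (the `question`/`answer` record layout is an assumption):

```python
def match_reply(text, knowledge_base):
    """Pick the knowledge-base entry whose question shares the most
    words with the recognized text, and return its answer."""
    words = set(text.lower().split())
    best = max(
        knowledge_base,
        key=lambda item: len(words & set(item["question"].lower().split())),
    )
    return best["answer"]
```

A production system would typically use semantic embeddings rather than word overlap, but the pipeline shape (recognize, match, reply) is the same.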
- 7. A speech recognition model training device, comprising: a first acquisition module for obtaining voice sample data and generating a label for each voice sample, wherein the voice sample data comprises Mandarin and dialects, the labels comprise Mandarin labels and dialect labels, and a dialect label indicates the type of dialect; a pre-training module for training a pre-training model with the voice sample data to obtain a trained pre-training model; a dialect training module for performing dialect adaptation training on the pre-training model with the voice sample data carrying dialect labels to obtain a dialect understanding model, wherein the dialect understanding model recognizes text from dialect voice data and the pre-training model recognizes text from Mandarin voice data; and a fusion module for fusing the pre-training model with the dialect understanding model to obtain a language understanding model, wherein the language understanding model recognizes text from both dialect and Mandarin voice data; wherein the fusion module comprises: a first predicted text sub-module for inputting the voice sample data into the pre-training model and outputting a first predicted text recognition result; a second predicted text sub-module for inputting the voice sample data into the dialect understanding model and outputting a second predicted text recognition result; a target predicted text sub-module for fusing the first and second predicted text recognition results into a target predicted text recognition result for the voice sample data; a third predicted text sub-module for inputting the voice sample data carrying Mandarin labels, the voice sample data carrying dialect labels, and the voice sample data associated with the target predicted text recognition result into an initial model to obtain a third predicted text recognition result output by the initial model; a first loss value sub-module configured to determine a first loss value from the difference between the third predicted text recognition result and the target predicted text recognition result; a second loss value sub-module configured to determine a second loss value from the difference between the third predicted text recognition result and the Mandarin information stored in the Mandarin labels and the dialect information stored in the dialect labels; and a loss function sub-module for training the initial model according to the first loss value, the second loss value, and a preset loss function to obtain the language understanding model, wherein the preset loss function corresponds to the recognition accuracy on real voice sample data.
- 8. An intelligent customer service voice processing device, comprising: a second acquisition module for acquiring to-be-processed audio data input by a user; a recognition module for inputting the to-be-processed audio data into a language understanding model trained according to any one of claims 1 to 5, to obtain a text recognition result corresponding to the audio data; a question-answer module for determining a reply sentence matching the text recognition result according to the text recognition result and a preset knowledge base; and a reply module for sending the reply sentence to the user's device as a reply to the to-be-processed audio data.
- 9. An electronic device comprising a processor and a memory storing a program or instructions executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the method of any one of claims 1 to 6.
- 10. A readable storage medium, wherein instructions in the readable storage medium, when executed by a processor of an electronic device, cause the electronic device to perform the method of any one of claims 1 to 6.
Description
Voice recognition model training method and intelligent customer service voice processing method

Technical Field

The invention relates to the field of voice recognition, and in particular to a voice recognition model training method, an intelligent customer service voice processing method and device, an electronic device, and a computer-readable storage medium.

Background

In recent years, intelligent customer service systems have developed widely thanks to their broad coverage of working hours and rapid response; many operators combine human and intelligent customer service to serve more users, earning many positive evaluations. Current intelligent customer service is mostly designed and optimized for Mandarin, since Mandarin users form the largest user group: the system determines a user's question through speech recognition and replies accordingly, and recognition accuracy for Mandarin is good. However, besides the large number of Mandarin users there are many dialect users, for whom speech recognition performs poorly after voice input, degrading the quality of the answers they receive.

Disclosure of Invention

An embodiment of the invention provides a voice recognition model training method to solve the problem of insufficient recognition accuracy for dialects in the prior art. Correspondingly, embodiments of the invention also provide an intelligent customer service voice processing method and device, an electronic device, and a computer-readable storage medium to ensure the implementation and application of the method.
In a first aspect, an embodiment of the invention provides a speech recognition model training method, comprising: obtaining voice sample data and generating a label for each voice sample, wherein the voice sample data comprises Mandarin and dialects, the labels comprise Mandarin labels and dialect labels, and a dialect label indicates the type of dialect; training a pre-training model with the voice sample data to obtain a trained pre-training model; performing dialect adaptation training on the pre-training model with the voice sample data carrying dialect labels to obtain a dialect understanding model, wherein the dialect understanding model recognizes text from dialect voice data and the pre-training model recognizes text from Mandarin voice data; and fusing the pre-training model with the dialect understanding model to obtain a language understanding model, wherein the language understanding model recognizes text from both dialect and Mandarin voice data.

In a second aspect, an embodiment of the invention provides a voice processing method, comprising: acquiring to-be-processed audio data input by a user; inputting the to-be-processed audio data into the trained language understanding model to obtain a text recognition result corresponding to the audio data; determining a reply sentence matching the text recognition result according to the text recognition result and a preset knowledge base; and sending the reply sentence to the user's device as a reply to the to-be-processed audio data.
In a third aspect, an embodiment of the invention provides a speech recognition model training device, comprising: a first acquisition module for obtaining voice sample data and generating a label for each voice sample; a pre-training module for training a pre-training model with the voice sample data to obtain a trained pre-training model; a dialect training module for performing dialect adaptation training on the pre-training model with the voice sample data carrying dialect labels to obtain a dialect understanding model; and a fusion module for fusing the pre-training model with the dialect understanding model to obtain a language understanding model.

In a fourth aspect, an embodiment of the invention provides an intelligent customer service voice processing device, comprising: a second acquisition module for acquiring to-be-processed audio data input by a user; a recognition module for inputting the to-be-processed audio data into the trained language understanding model to obtain a text recognition result corresponding to the audio data; a question-answer module for determining a reply sentence matching the text recognition result according to the text recognition result and a preset knowledge base; and a reply module for sending the reply sentence to the user's device as a reply to the to-be-processed audio data.