CN-114416955-B - Training method, device, equipment and storage medium of heterogeneous language model

CN114416955B

Abstract

An embodiment of the present application provides a training method, apparatus, device, and storage medium for a heterogeneous language model. The method includes: acquiring a speech training sample set; and training a first initial network model and a second initial network model with the speech training sample set to obtain at least two first network models and at least two second network models, where the first network models and the second network models differ in structure, the first network models are used to process an input pinyin sequence to obtain at least one text sequence corresponding to the pinyin sequence, and the second network models are used to determine, from the at least one text sequence, a target text sequence corresponding to the pinyin sequence. The heterogeneous language model is then determined from the at least two first network models and the at least two second network models. The training method, apparatus, device, and storage medium are intended to improve the accuracy of the language model.
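The two-stage pipeline in the abstract (a first network model that proposes candidate text sequences for a pinyin sequence, and a second network model that selects the target sequence among them) can be sketched as follows. This is a minimal illustration only: the patent does not specify the model architectures, and both models here are stand-in callables with a hypothetical scoring function.

```python
# Illustrative sketch only; the patent does not fix the architectures.
# first_network_model: pinyin sequence -> candidate text sequences.
# second_network_model: candidates -> target text sequence (best-scoring).

def first_network_model(pinyin_seq):
    # Stand-in candidate generator: a toy lookup from pinyin to homophones.
    candidates = {
        ("yu", "yan"): ["语言", "寓言"],  # "language" vs. "fable"
    }
    return candidates.get(tuple(pinyin_seq), [])

def second_network_model(candidates, score_candidate):
    # Stand-in selector: pick the highest-scoring candidate as the target.
    return max(candidates, key=score_candidate)

pinyin = ["yu", "yan"]
texts = first_network_model(pinyin)
# Hypothetical scorer standing in for the second network model's learned score.
target = second_network_model(texts, score_candidate=lambda t: 1.0 if t == "语言" else 0.0)
```

The design point the claims rely on is that the two stages have different structures, so their errors are partly independent and the selector can correct the generator.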

Inventors

  • JIANG DI

Assignees

  • Shenzhen Qianhai WeBank Co., Ltd. (深圳前海微众银行股份有限公司)

Dates

Publication Date
2026-05-12
Application Date
2022-01-21

Claims (14)

  1. A method for training a heterogeneous language model, comprising: acquiring a speech training sample set; training a first initial network model and a second initial network model with the speech training sample set to obtain at least two first network models and at least two second network models, wherein the first network models and the second network models differ in structure, the first network models are used to process an input pinyin sequence to obtain at least one text sequence corresponding to the pinyin sequence, and the second network models are used to determine, from the at least one text sequence, a target text sequence corresponding to the pinyin sequence; determining at least four first language models from the at least two first network models and the at least two second network models; acquiring a speech verification sample set, wherein the speech verification sample set comprises a plurality of pinyin verification samples and a text verification result corresponding to each pinyin verification sample, and each of the at least four first language models comprises one first network model and one second network model; for each first language model, processing the plurality of pinyin verification samples sequentially through its first network model and second network model to obtain a text output result for each pinyin verification sample; determining, as the error rate of the first language model, the ratio of the number of pinyin verification samples whose text output result differs from the text verification result to the total number of pinyin verification samples; and determining the heterogeneous language model from the at least four first language models and the error rates; or, for each first network model, performing binary conversion on the model parameters of the first network model to obtain a first initial parameter sequence, performing crossover and mutation on the first initial parameter sequence to obtain at least two first intermediate parameter sequences, replacing the model parameters of the first network model with the model parameters corresponding to the at least two first intermediate parameter sequences to obtain at least two third network models corresponding to the first network model, and determining a first target network model from the plurality of third network models; or generating a first weight for each first network model through a weight generation model, fusing the model parameters of the first network models according to the first weights to obtain first target model parameters, and replacing the model parameters of a first network model with the first target model parameters to obtain the first target network model; determining a second target network model from the at least two second network models; and determining the first target network model and the second target network model as the heterogeneous language model.
  2. The method of claim 1, wherein determining at least four first language models from the at least two first network models and the at least two second network models comprises: randomly combining the at least two first network models and the at least two second network models to obtain the at least four first language models.
  3. The method of claim 1, wherein determining at least four first language models from the at least two first network models and the at least two second network models comprises: for each first network model, performing binary conversion on its model parameters to obtain a first initial parameter sequence, performing crossover and mutation on the first initial parameter sequence to obtain at least two first intermediate parameter sequences, and replacing the model parameters of the first network model with the model parameters corresponding to the at least two first intermediate parameter sequences to obtain at least two third network models corresponding to the first network model; for each second network model, performing binary conversion on its model parameters to obtain a second initial parameter sequence, performing crossover and mutation on the second initial parameter sequence to obtain at least two second intermediate parameter sequences, and replacing the model parameters of the second network model with the model parameters corresponding to the at least two second intermediate parameter sequences to obtain at least two fifth network models corresponding to the second network model; and randomly combining the at least two third network models corresponding to the at least two first network models with the at least two fifth network models corresponding to the at least two second network models to obtain the at least four first language models.
  4. The method of claim 1, wherein determining the heterogeneous language model from the at least four first language models and the error rates comprises: judging whether a first language model whose error rate is smaller than a preset value exists among the at least four first language models; if so, determining the first language model whose error rate is smaller than the preset value as the heterogeneous language model; and if not, acquiring the first model parameter sequences corresponding to the model parameters of a preset number of the first language models to obtain at least one second language model corresponding to each of the preset number of first language models, and determining the heterogeneous language model from the plurality of second language models and the error rate of each second language model.
  5. The method of claim 1, wherein determining the first target network model from the plurality of third network models comprises: acquiring a speech verification sample set; determining the error rate of each third network model from the speech verification sample set; and determining the first target network model from the third network models and the error rates.
  6. The method of claim 5, wherein the speech verification sample set comprises a plurality of pinyin verification samples and a text verification result corresponding to each pinyin verification sample, and determining the error rate of a third network model from the speech verification sample set comprises: processing the plurality of pinyin verification samples sequentially through the third network model and any one of the second network models to obtain a text output result for each pinyin verification sample; and determining, as the error rate of the third network model, the ratio of the number of pinyin verification samples whose text output result differs from the text verification result to the total number of pinyin verification samples.
  7. The method of claim 5, wherein determining the first target network model from the plurality of third network models and the error rates comprises: judging whether a third network model whose error rate is smaller than a preset value exists among the plurality of third network models; if so, determining the third network model whose error rate is smaller than the preset value as the first target network model; and if not, acquiring the first initial parameter sequences corresponding to the model parameters of a preset number of the third network models to obtain at least one fourth network model corresponding to each of the preset number of third network models, and determining the first target network model from the plurality of fourth network models and the error rate of each fourth network model.
  8. The method of claim 1, wherein determining the second target network model from the at least two second network models comprises: generating a second weight for each second network model through the weight generation model; fusing the model parameters of the second network models according to the second weights to obtain second target model parameters; and replacing the model parameters of a second network model with the second target model parameters to obtain the second target network model.
  9. The method of claim 8, further comprising: acquiring a speech verification sample set; determining the error rate of the heterogeneous language model from the speech verification sample set, and determining a reward value from the error rate; and updating the model parameters of the weight generation model according to the reward value, and regenerating the first weight of each first network model and the second weight of each second network model to obtain a new heterogeneous language model.
  10. The method of any one of claims 1 to 9, wherein the speech training sample set comprises at least two sample subsets, and training the first initial network model and the second initial network model with the speech training sample set to obtain the at least two first network models and the at least two second network models comprises: training the first initial network model with each of the at least two sample subsets to obtain a first network model corresponding to each sample subset; training the second initial network model with each of the at least two sample subsets to obtain a second network model corresponding to each sample subset; determining the first network models corresponding to the at least two sample subsets as the at least two first network models; and determining the second network models corresponding to the at least two sample subsets as the at least two second network models.
  11. A training apparatus for a heterogeneous language model, comprising a processing module configured to: acquire a speech training sample set; train a first initial network model and a second initial network model with the speech training sample set to obtain at least two first network models and at least two second network models, wherein the first network models and the second network models differ in structure, the first network models are used to process an input pinyin sequence to obtain at least one text sequence corresponding to the pinyin sequence, and the second network models are used to determine, from the at least one text sequence, a target text sequence corresponding to the pinyin sequence; determine at least four first language models from the at least two first network models and the at least two second network models; acquire a speech verification sample set, wherein the speech verification sample set comprises a plurality of pinyin verification samples and a text verification result corresponding to each pinyin verification sample, and each of the at least four first language models comprises one first network model and one second network model; for each first language model, process the plurality of pinyin verification samples sequentially through its first network model and second network model to obtain a text output result for each pinyin verification sample; determine, as the error rate of the first language model, the ratio of the number of pinyin verification samples whose text output result differs from the text verification result to the total number of pinyin verification samples; and determine the heterogeneous language model from the at least four first language models and the error rates; or, for each first network model, perform binary conversion on the model parameters of the first network model to obtain a first initial parameter sequence, perform crossover and mutation on the first initial parameter sequence to obtain at least two first intermediate parameter sequences, replace the model parameters of the first network model with the model parameters corresponding to the at least two first intermediate parameter sequences to obtain at least two third network models corresponding to the first network model, and determine a first target network model from the plurality of third network models; or generate a first weight for each first network model through a weight generation model, fuse the model parameters of the first network models according to the first weights to obtain first target model parameters, and replace the model parameters of a first network model with the first target model parameters to obtain the first target network model; determine a second target network model from the at least two second network models; and determine the first target network model and the second target network model as the heterogeneous language model.
  12. An electronic device, comprising a processor and a memory communicatively coupled to the processor, wherein the memory stores computer-executable instructions, and the processor executes the computer-executable instructions stored in the memory to implement the method of any one of claims 1 to 10.
  13. A computer-readable storage medium having computer-executable instructions stored therein which, when executed by a processor, implement the method of any one of claims 1 to 10.
  14. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1 to 10.
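The error-rate criterion used throughout the claims (the fraction of pinyin verification samples whose pipeline output differs from the reference text) reduces to a simple computation. The sketch below is illustrative: `pipeline` stands in for a first-network/second-network pair, and the sample data are hypothetical.

```python
# Hedged sketch of the claimed error-rate computation over a speech
# verification sample set. `pipeline` and the sample data are stand-ins.

def error_rate(pipeline, verification_set):
    """verification_set: iterable of (pinyin_sample, text_verification_result)."""
    wrong = sum(1 for pinyin, ref in verification_set if pipeline(pinyin) != ref)
    return wrong / len(verification_set)

samples = [("ni hao", "你好"), ("yu yan", "语言"), ("mo xing", "模型")]
# Toy pipeline that errs on one of the three samples ("寓言" vs. "语言").
lookup = {"ni hao": "你好", "yu yan": "寓言", "mo xing": "模型"}
rate = error_rate(lambda p: lookup[p], samples)
```

Under claims 4 and 7, a candidate model is accepted once this rate falls below a preset value; otherwise another round of parameter-sequence search is performed.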

Description

Training method, device, equipment and storage medium of heterogeneous language model

Technical Field

The present application relates to the field of speech recognition technologies, and in particular to a training method, apparatus, device, and storage medium for a heterogeneous language model.

Background

An automatic speech recognition (ASR) system recognizes speech to obtain the corresponding text. An ASR system includes an acoustic model (AM) and a language model (LM). The AM derives a pinyin sequence from the speech; the LM, a homogeneous language model, derives text characters from the pinyin. In the related art, the LM of the ASR system is typically obtained by training an initial LM on a speech data set. However, owing to information-privacy restrictions, the number of speech samples in the data set is small, and the homogeneous language model is an n-gram model or a deep neural network (DNN), so the accuracy of the trained LM is low (that is, the text characters it outputs are often incorrect).

Disclosure of Invention

Embodiments of the present application provide a training method, apparatus, device, and storage medium for a heterogeneous language model, intended to address the low accuracy of the language model.
In a first aspect, an embodiment of the present application provides a training method for a heterogeneous language model, including: acquiring a speech training sample set; training a first initial network model and a second initial network model with the speech training sample set to obtain at least two first network models and at least two second network models, where the first network models and the second network models differ in structure, the first network models are used to process an input pinyin sequence to obtain at least one text sequence corresponding to the pinyin sequence, and the second network models are used to determine, from the at least one text sequence, a target text sequence corresponding to the pinyin sequence; and determining the heterogeneous language model from the at least two first network models and the at least two second network models.

Optionally, determining the heterogeneous language model from the at least two first network models and the at least two second network models includes: determining at least four first language models from the at least two first network models and the at least two second network models; acquiring a speech verification sample set; determining the error rate of each first language model from the speech verification sample set; and determining the heterogeneous language model from the at least four first language models and the error rates.

Optionally, determining at least four first language models from the at least two first network models and the at least two second network models includes: randomly combining the at least two first network models and the at least two second network models to obtain the at least four first language models.
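The "random combination" step above pairs each first network model with each second network model, so two models of each kind already yield the four candidate language models the claims require. A minimal sketch, with string placeholders standing in for trained models:

```python
# Sketch of the random-combination step: 2 first models x 2 second models
# gives (at least) 4 candidate first language models. Names are stand-ins.
import itertools

first_models = ["F1", "F2"]    # placeholders for trained first network models
second_models = ["S1", "S2"]   # placeholders for trained second network models

# Every (first, second) pairing is a candidate language model.
language_models = list(itertools.product(first_models, second_models))
```

Each pair is then scored by its error rate on the verification set, and the best pairing (or one below the preset error threshold) is kept.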
Optionally, determining at least four first language models from the at least two first network models and the at least two second network models includes: for each first network model, performing binary conversion on its model parameters to obtain a first initial parameter sequence, performing crossover and mutation on the first initial parameter sequence to obtain at least two first intermediate parameter sequences, and replacing the model parameters of the first network model with the model parameters corresponding to the at least two first intermediate parameter sequences to obtain at least two third network models; for each second network model, performing binary conversion on its model parameters to obtain a second initial parameter sequence, performing crossover and mutation on the second initial parameter sequence to obtain at least two second intermediate parameter sequences, and replacing the model parameters of the second network model with the model parameters corresponding to the at least two second intermediate parameter sequences to obtain at least two fifth network models; and randomly combining the at least two third network models corresponding to the at least two first network models with the at least two fifth network models corresponding to the at least two second network models to obtain the at least four first language models.

Optionally, the speech verification sample set includes a plurality of pinyin verification samples and a text verification result corresponding to each pinyin verification sample, and each of the at least four first language models includes one first network model and one second network model. Determining the error rate of a first language model from the speech verification sample set includes: processing the plurality of pinyin verification samples sequentially through the first network model and the second network model to obtain a text output result for each pinyin verification sample; and determining, as the error rate of the first language model, the ratio of the number of pinyin verification samples whose text output result differs from the text verification result to the total number of pinyin verification samples.
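The binary-conversion, crossover, and mutation steps described above follow the pattern of a genetic algorithm over binarized model parameters. The sketch below is a hedged illustration: fixed-point quantization of parameters in [0, 1] to 8-bit strings is an assumption, since the patent does not fix the encoding, and single-point crossover with bit-flip mutation is one common choice among several.

```python
# Hedged sketch of genetic-style parameter search: binarize parameters,
# apply crossover and mutation, then decode the children back to
# parameters. The 8-bit fixed-point encoding is an assumption.
import random

def to_bits(params, bits=8):
    # Quantize each parameter in [0, 1] to a fixed-width bit string.
    return "".join(format(int(p * (2**bits - 1)), f"0{bits}b") for p in params)

def from_bits(seq, bits=8):
    # Decode a concatenated bit string back to parameters in [0, 1].
    vals = [int(seq[i:i + bits], 2) for i in range(0, len(seq), bits)]
    return [v / (2**bits - 1) for v in vals]

def crossover(a, b, point):
    # Single-point crossover: swap tails of two parameter sequences.
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(seq, rate, rng):
    # Flip each bit independently with probability `rate`.
    return "".join(c if rng.random() > rate else ("1" if c == "0" else "0")
                   for c in seq)

rng = random.Random(0)
p1, p2 = to_bits([0.25, 0.75]), to_bits([0.5, 0.1])   # two parent models
c1, c2 = crossover(p1, p2, point=8)
child_params = from_bits(mutate(c1, rate=0.05, rng=rng))  # one "third network model"
```

Each decoded child plays the role of a "third network model" (or "fifth network model" for the second-model family); the children are then ranked by error rate on the verification set, mirroring claims 5 to 7.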