CN-116050463-B - Front-end adapter training method, electronic device and storage medium
Abstract
The invention discloses a front-end adapter training method, an electronic device and a storage medium. The training method comprises a first stage and a second stage. The first stage comprises: inputting the waveform format of a speech sample to an original front end and obtaining a first output of the original front end; inputting another format of the same speech to the front-end adapter, obtaining a second output of the front-end adapter, and calculating a first loss between the first output and the second output to train the front-end adapter; and inputting the second output of the front-end adapter to a backbone transducer model and calculating a second loss, wherein the first loss and the second loss are jointly optimized for a preset number of epochs. By minimizing the distance between the outputs of the different front ends, speech features in other formats can also be made compatible with self-supervised learning models that were pre-trained on waveforms.
Inventors
- YU KAI
- CHEN XIE
- MA ZIYANG
- ZHENG ZHISHENG
Assignees
- 思必驰科技股份有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20230130
Claims (7)
- 1. A front-end adapter training method, wherein the training method comprises a first stage and a second stage, the first stage comprising: inputting the waveform format of a speech sample to an original front end, and obtaining a first output of the original front end; inputting another format of the speech to the front-end adapter, obtaining a second output of the front-end adapter, and calculating a first loss between the first output and the second output to train the front-end adapter; and inputting the second output of the front-end adapter to a backbone transducer model, and calculating a second loss, wherein the first loss and the second loss are jointly optimized for a preset number of epochs, the first loss is an L2 loss, and the second loss is a CTC loss; the second stage comprising: after inputting the other format of the speech to the front-end adapter, inputting the obtained third output of the front-end adapter to the backbone transducer model, and calculating only the second loss; wherein in the first stage both the CTC loss and the L2 loss are applied, the L2 loss does not back-propagate to the waveform front-end block, and the CTC loss does not back-propagate to the Fbank front end; in the second stage all modules are co-optimized with the CTC loss; the original waveform front end is frozen during training and used to normalize the Fbank front-end output; and the L2 loss is applied to minimize the distance between the waveform-based and Fbank-based front-end outputs (see the training sketch following the claims).
- 2. The method of claim 1, wherein the preset number of epochs is 200.
- 3. The method of claim 1, wherein the other formats include an Fbank format and an MFCC format.
- 4. A self-supervised speech model comprising a front-end adapter trained according to the method of any one of claims 1-3 and a backbone transducer model.
- 5. The model of claim 4, wherein the self-supervised speech model is a speech recognition model.
- 6. An electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 3.
- 7. A storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of any one of claims 1 to 3.
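The patent publishes no reference code, so the following is only a minimal PyTorch-style sketch of the first-stage gradient routing described in claim 1. All module names (wave_frontend, fbank_frontend, backbone) and shapes are hypothetical placeholders, and mean squared error stands in for the L2 loss.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Hypothetical stand-in modules; the patent does not fix an architecture.
# Both front ends are assumed to emit frame-aligned (N, T, D) features.
wave_frontend = nn.Linear(400, 256)   # toy stand-in for the original waveform front end
fbank_frontend = nn.Linear(80, 256)   # toy stand-in for the Fbank front-end adapter
backbone = nn.Linear(256, 32)         # toy stand-in for the backbone ending in a CTC head

for p in wave_frontend.parameters():  # the original waveform front end is frozen
    p.requires_grad_(False)

def stage_one_loss(wave_frames, fbank, targets, in_lens, tgt_lens):
    """First stage: the L2 loss trains the adapter; the CTC loss trains the backbone.

    Gradient routing per claim 1: the L2 loss never reaches the (frozen)
    waveform front end, and the CTC loss never reaches the Fbank front end.
    """
    with torch.no_grad():                        # first output, no gradients
        first_out = wave_frontend(wave_frames)
    second_out = fbank_frontend(fbank)           # second output

    l2_loss = F.mse_loss(second_out, first_out)  # first loss (L2)

    # Detach so the CTC loss does not back-propagate into the adapter.
    log_probs = backbone(second_out.detach()).log_softmax(-1)
    ctc_loss = F.ctc_loss(log_probs.transpose(0, 1),  # (T, N, C) layout
                          targets, in_lens, tgt_lens)
    return l2_loss + ctc_loss

# Toy batch: 2 utterances, 50 frames each, of the same speech in two formats.
wave_frames = torch.randn(2, 50, 400)            # framed raw waveform (toy)
fbank = torch.randn(2, 50, 80)                   # Fbank features (toy)
targets = torch.randint(1, 32, (2, 10))
loss = stage_one_loss(wave_frames, fbank, targets,
                      torch.full((2,), 50), torch.full((2,), 10))
loss.backward()
```

In this sketch the two losses are summed so that one backward pass optimizes both branches for the preset number of epochs, while the detach calls keep each loss confined to the modules claim 1 assigns it to.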
Description
Front-end adapter training method, electronic device and storage medium
Technical Field
The invention belongs to the technical field of front-end adapter training, and particularly relates to a front-end adapter training method, an electronic device and a storage medium.
Background
In the related art, self-supervised speech models include the wav2vec series, HuBERT, data2vec and the like. These models are trained on massive amounts of unlabeled data during the pre-training phase and on small amounts of labeled data during the fine-tuning phase. While the above-described efforts greatly improve performance on keyword discovery tasks under certain specific conditions, some unresolved issues limit the versatility of these approaches. In the course of making the present application, the inventors found that these models require waveform-format input in the pre-training, fine-tuning and decoding stages, which is unfriendly to production scenarios.
Disclosure of the Invention
The embodiments of the invention provide a front-end adapter training method, an electronic device and a storage medium, which at least solve one of the above technical problems.
In a first aspect, an embodiment of the present invention provides a front-end adapter training method, where the training method includes a first stage and a second stage. The first stage includes: inputting the waveform format of a speech sample to an original front end and obtaining a first output of the original front end; inputting another format of the speech to the front-end adapter, obtaining a second output of the front-end adapter, and calculating a first loss between the first output and the second output to train the front-end adapter; and inputting the second output of the front-end adapter to a backbone transducer model and calculating a second loss, where the first loss and the second loss are jointly optimized for a preset number of epochs.
In a second aspect, an electronic device is provided that includes at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the front-end adapter training method of any of the embodiments of the present invention.
In a third aspect, embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of the front-end adapter training method of any of the embodiments of the present invention.
According to the method provided by the embodiments of the present application, by training front-end adapters that take other formats of the same speech, those adapters can achieve substantially the same effect as a front end that takes the waveform format. By minimizing the distance between the outputs of the different front ends, speech features in other formats can also be made compatible with self-supervised learning models that were pre-trained on waveforms.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments are briefly described below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and other drawings may be obtained from these drawings without inventive effort by a person skilled in the art.
FIG. 1 is a flowchart of a front-end adapter training method according to an embodiment of the present invention;
FIG. 2 shows the first stage of the front-end adapter according to an embodiment of the present invention;
FIG. 3 shows the second stage of the front-end adapter according to an embodiment of the present invention;
FIG. 4 illustrates three different types of self-supervised learning models in the related art;
FIG. 5 shows waveform-based fine-tuning error rate results according to an embodiment of the present invention;
FIG. 6 is a graph of the Euclidean distance between the waveform-based and Fbank front-end outputs, according to an embodiment of the present invention;
FIG. 7 is a graph of the error rate results of three self-supervised learning models across the LibriSpeech and GigaSpeech test sets, as provided by an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that