CN-121999760-A - Tibetan multi-dialect speech recognition method based on a low-rank parameter expert model
Abstract
A Tibetan multi-dialect speech recognition method based on a low-rank parameter expert model comprises the steps of: extracting acoustic features from an input Tibetan speech signal to obtain a high-level acoustic representation; analyzing the high-level acoustic representation with a dialect discriminator to determine an expert routing weight distribution corresponding to the various Tibetan dialects; dynamically weighting and combining, based on the expert routing weight distribution, a plurality of low-rank parameter expert sub-modules built into a Tibetan large language model to form an adaptive decoding network tailored to the dialect characteristics of the current input speech; inputting the high-level acoustic representation together with pseudo-label prompt information generated by CTC into the adaptive decoding network; and performing autoregressive decoding through the adaptive decoding network to generate a text token sequence step by step and output the final Tibetan multi-dialect speech recognition result. The method supports Tibetan multi-dialect speech recognition and improves the accuracy of Tibetan speech recognition.
Inventors
- Xiao Sujie
- Li Ta
- Cheng Gaofeng
- Zhao Qingwei
Assignees
- Institute of Acoustics, Chinese Academy of Sciences (中国科学院声学研究所)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-29
Claims (9)
- 1. A Tibetan multi-dialect speech recognition method based on a low-rank parameter expert model, the method comprising: extracting acoustic features from an input Tibetan speech signal to obtain a high-level acoustic representation; analyzing the high-level acoustic representation with a dialect discriminator to determine an expert routing weight distribution corresponding to a plurality of Tibetan dialects; dynamically weighting and combining, based on the expert routing weight distribution, a plurality of low-rank parameter expert sub-modules built into a Tibetan large language model to form an adaptive decoding network tailored to the dialect characteristics of the current input speech; inputting the high-level acoustic representation together with pseudo-label prompt information generated by CTC into the adaptive decoding network; and performing autoregressive decoding through the adaptive decoding network to generate a text token sequence step by step and output a final Tibetan multi-dialect speech recognition result.
- 2. The method of claim 1, wherein the low-rank parameter expert sub-modules adopt a low-rank adaptation structure and are inserted, in the form of low-rank matrix decomposition, into a feed-forward network layer or an attention projection layer of the Tibetan large language model, and each low-rank parameter expert sub-module corresponds to a specific Tibetan dialect.
- 3. The method of claim 1, wherein generating the pseudo-label prompt information by CTC specifically comprises: performing greedy decoding or beam search decoding on the high-level acoustic representation using a CTC branch to generate the pseudo-label prompt information.
- 4. The method of claim 1, wherein the Tibetan multi-dialect speech recognition model adopts multi-task joint optimization during the training phase, and wherein the overall loss function includes at least a CTC loss for constraining speech-text alignment, a dialect discrimination loss for enhancing dialect features, and an autoregressive prediction loss for guiding the large language model to output the target text sequence.
- 5. The method of claim 1, wherein the dialect discriminator is trained by minimizing the dialect discrimination loss, which is calculated from the dialect class labels provided in the training data.
- 6. The method of claim 1, wherein the expert routing weight distribution is a probability vector, and wherein the dynamic weighted combination specifically comprises: weighting and summing the output of each low-rank parameter expert sub-module according to its corresponding routing weight to obtain the combined output of the adaptive decoding network at the current layer.
- 7. The method of claim 1, wherein the acoustic feature extraction is performed by a pre-trained speech encoder model, and wherein the Tibetan large language model is a Transformer architecture-based autoregressive decoder model.
- 8. The method of claim 7, wherein in a model fine-tuning stage, the backbone network of the Tibetan large language model remains frozen, and the parameters of the pre-trained speech encoder model, the low-rank parameter expert sub-modules, the dialect discriminator, and the associated linking layers are updated.
- 9. A Tibetan multi-dialect speech recognition device based on a low-rank parameter expert model, the device comprising: an acquisition module, configured to extract acoustic features from an input Tibetan speech signal and obtain a high-level acoustic representation; and a processing module, configured to analyze the high-level acoustic representation with a dialect discriminator and determine an expert routing weight distribution corresponding to a plurality of Tibetan dialects; the processing module is further configured to dynamically weight and combine, based on the expert routing weight distribution, a plurality of low-rank parameter expert sub-modules built into a Tibetan large language model to form an adaptive decoding network tailored to the dialect characteristics of the current input speech; the processing module is further configured to input the high-level acoustic representation together with pseudo-label prompt information generated by CTC into the adaptive decoding network; and the processing module is further configured to perform autoregressive decoding through the adaptive decoding network, generate a text token sequence step by step, and output a final Tibetan multi-dialect speech recognition result.
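The dynamic weighted combination of low-rank experts in claims 2 and 6 can be sketched as follows. This is a minimal pure-Python illustration, not the patent's implementation: each expert is assumed to be a LoRA matrix pair (A, B) producing a low-rank update, and the routing probability vector weights and sums the expert outputs; all names and dimensions are illustrative.

```python
def lora_expert(x, A, B, scale=1.0):
    """One low-rank expert: delta = scale * (x @ A) @ B, with A of shape d x r and B of shape r x d.
    x is a plain list of d floats; matrices are lists of rows."""
    h = [sum(x[i] * A[i][j] for i in range(len(x))) for j in range(len(A[0]))]
    return [scale * sum(h[j] * B[j][k] for j in range(len(h))) for k in range(len(B[0]))]

def combine_experts(x, experts, routing_weights):
    """Weighted summation over expert outputs (claim 6): the routing weights form a
    probability vector produced by the dialect discriminator."""
    combined = [0.0] * len(x)
    for (A, B), w in zip(experts, routing_weights):
        delta = lora_expert(x, A, B)
        for k in range(len(combined)):
            combined[k] += w * delta[k]
    return combined
```

For example, with two rank-1 experts over a 2-dimensional feature and routing weights (0.7, 0.3), the combined output is simply the convex mixture of the two experts' low-rank updates.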
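The CTC greedy decoding step of claim 3 (take the argmax label per frame, collapse consecutive repeats, drop blanks) can be illustrated with this small sketch; the frame scores and the choice of index 0 as the blank symbol are illustrative assumptions.

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Greedy CTC decoding used to produce pseudo-label prompts:
    argmax per frame, then collapse repeated labels and remove blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    out, prev = [], None
    for t in best:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out
```

Note that a label repeated across a blank frame (e.g. the per-frame path 1, blank, 1) is kept twice, while a label repeated in consecutive frames is collapsed to one token.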
Description
Tibetan multi-dialect speech recognition method based on a low-rank parameter expert model
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a Tibetan multi-dialect speech recognition method and device based on a low-rank parameter expert model.
Background
Speech recognition is one of the important foundational technologies in the field of artificial intelligence. It aims to automatically analyze and model speech signals to generate the corresponding text sequences, and is widely applied in fields such as intelligent in-vehicle systems, human-computer interaction, telephone navigation, speech translation, and conference systems. At present, end-to-end large-vocabulary continuous speech recognition has made remarkable progress on mainstream-language tasks such as Mandarin; in particular, speech encoder models based on self-supervised pre-training have significantly improved the modeling of speech features. Self-supervised pre-trained models, represented by wav2vec 2.0, effectively reduce the dependence on manually annotated corpora by pre-training on large-scale unlabeled speech data, and are widely used in low-resource speech recognition tasks. In recent years, large language models have shown strong modeling capability in natural language understanding and generation tasks, providing a new technical path for language modeling in speech recognition and further improving recognition for mainstream languages such as English and Mandarin; however, recognition accuracy for Tibetan multi-dialect speech remains low.
Disclosure of Invention
In order to solve the problems in the prior art, the embodiments of the present application provide a Tibetan multi-dialect speech recognition method, device, computing device, computer storage medium, and computer program product based on a low-rank parameter expert model, which can support Tibetan multi-dialect speech recognition and improve the accuracy of Tibetan speech recognition. According to the method, acoustic features are extracted from an input Tibetan speech signal to obtain a high-level acoustic representation; the high-level acoustic representation is analyzed by a dialect discriminator to determine an expert routing weight distribution corresponding to the various Tibetan dialects; based on the expert routing weight distribution, a plurality of low-rank parameter expert sub-modules built into a Tibetan large language model are dynamically weighted and combined to form an adaptive decoding network tailored to the dialect characteristics of the current input speech; the high-level acoustic representation and pseudo-label prompt information generated by CTC are input into the adaptive decoding network together; and autoregressive decoding is performed through the adaptive decoding network to generate a text token sequence step by step and output the final Tibetan multi-dialect speech recognition result. In some possible implementations, the low-rank parameter expert sub-modules adopt a low-rank adaptation structure and are inserted, in the form of low-rank matrix decomposition, into a feed-forward network layer or an attention projection layer of the Tibetan large language model, and each low-rank parameter expert sub-module corresponds to a specific Tibetan dialect.
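The dialect discriminator described above maps the high-level acoustic representation to a routing probability vector over the dialects. One plausible minimal form, a linear classifier with softmax over a mean-pooled utterance feature, is sketched below; this is an assumption for illustration, not the patent's exact architecture, and all weights are made up.

```python
import math

def mean_pool(frames):
    """Mean-pool frame-level features into a single utterance-level vector."""
    d = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(d)]

def dialect_routing_weights(feature, W, b):
    """Linear classifier plus softmax: one logit per Tibetan dialect,
    normalized into the expert routing probability vector."""
    logits = [sum(x * w for x, w in zip(feature, row)) + bi for row, bi in zip(W, b)]
    m = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

During training, this classifier would be optimized with the dialect discrimination loss (cross-entropy against the dialect class labels in the training data), as the implementations above describe.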
In some possible implementations, generating the pseudo-label prompt information by CTC specifically includes performing greedy decoding or beam search decoding on the high-level acoustic representation using a CTC branch. In some possible implementations, the Tibetan multi-dialect speech recognition model adopts multi-task joint optimization during the training phase, and the overall loss function includes at least a CTC loss for constraining speech-text alignment, a dialect discrimination loss for enhancing dialect features, and an autoregressive prediction loss for guiding the large language model to output the target text sequence. In some possible implementations, the dialect discriminator is trained by minimizing a dialect discrimination loss calculated from the dialect class labels provided in the training data. In some possible implementations, the expert routing weight distribution is a probability vector, and the dynamic weighted combination specifically comprises weighting and summing the outputs of the low-rank parameter expert sub-modules according to their corresponding routing weights to obtain the combined output of the adaptive decoding network at the current layer. In some possible implementations, acoustic feature extraction is performed by a pre-trained speech encoder model, and the Tibetan large language model is an autoregressive decoder model based on the Transformer architecture.
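The multi-task objective and the fine-tuning parameter selection described in the implementations above can be sketched as follows. The interpolation weights `alpha`/`beta` and the `llm_backbone.` parameter-name prefix are assumptions for illustration; the patent does not specify them.

```python
def total_loss(ctc_loss, dialect_loss, ar_loss, alpha=0.3, beta=0.1):
    """Joint training objective: CTC alignment loss + dialect discrimination loss
    + autoregressive prediction loss, mixed with assumed weights alpha and beta."""
    return alpha * ctc_loss + beta * dialect_loss + (1.0 - alpha - beta) * ar_loss

def trainable_param_names(all_names, frozen_prefixes=("llm_backbone.",)):
    """Fine-tuning stage: the LLM backbone stays frozen; the speech encoder,
    low-rank experts, dialect discriminator, and linking layers are updated."""
    return [n for n in all_names if not n.startswith(frozen_prefixes)]
```

In a real training loop the frozen/trainable split would be applied by toggling gradient computation on each parameter group before constructing the optimizer.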