CN-121983064-A - Method for constructing a multilingual speech recognition model
Abstract
The invention discloses a method for constructing a multilingual speech recognition model, relating to the field of computer technology. Candidate texts are generated in parallel by a spoken-language branch and a lyric branch; the candidates within each time window are given a comparative fluency evaluation under the same spoken-language model, and a gating decision marks time windows that better fit the generative patterns of lyrics as suppression intervals, so that lyric semantics do not pollute the spoken-language subtitles at the output end. In addition, the gating score difference is mapped to a continuous gating strength that is applied to the vocabulary output during decoding to form a soft suppression, so that non-spoken segments tend to be occupied by a placeholder token rather than by spurious spoken-language words. A hysteresis or consecutive-window consistency trigger constraint is further introduced to reduce gating jitter at the boundaries of overlapping time windows and improve subtitle consistency.
Inventors
- XU YIDAN
- WANG LI
Assignees
- 南昌交通学院
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-03-31
Claims (10)
- 1. A method for constructing a multilingual speech recognition model, comprising: step S1, obtaining mixed audio containing spoken speech and background music, together with a target language label; step S2, constructing a speech recognition model, which comprises acquiring pre-training parameters, loading them into a shared speech encoder, a first decoder, and a second decoder, associating the first decoder with a first language model, and associating the second decoder with a second language model; step S3, extracting an acoustic feature sequence with the speech encoder, wherein the first decoder generates a first candidate text and first time-alignment information and the second decoder generates a second candidate text and second time-alignment information; step S4, evaluating the probability of both candidate texts within the same time window using the first language model to obtain a first fluency score and a second fluency score, and making a gating decision from the score difference and a threshold to determine the time windows to be suppressed, wherein a time window is a time interval obtained by dividing the mixed-audio time axis with a preset window length and step size; and step S5, deleting or placeholder-replacing the first candidate text within the suppressed time windows according to the first time-alignment information, and outputting the spoken-language recognition text of the target language, wherein placeholder replacement substitutes a placeholder token for the suppressed word segment, the placeholder token being a special vocabulary entry representing a suppressed non-spoken segment.
- 2. The method of claim 1, wherein the mixed audio is divided into a plurality of mutually overlapping time windows, and the first and second fluency scores are calculated independently, and the gating decision made, for each time window.
- 3. The method for constructing a multilingual speech recognition model according to claim 1, wherein the first fluency score and the second fluency score are both obtained by taking the negative logarithm of the conditional probability that the first language model assigns to the candidate text within a time window and normalizing it by the number of tokens, a smaller score indicating text that is more fluent under the first language model.
- 4. A method for constructing a multilingual speech recognition model according to claim 1, wherein the first time-alignment information is generated from frame-level timestamps or frame-level alignment results output by the first decoder during decoding, and is used to map the tokens in the first candidate text to the time windows.
- 5. The method for constructing a multilingual speech recognition model of claim 1, wherein the second language model is a lyrics language model and the second decoder is configured to generate the vocal-lyrics text present in the background music as the second candidate text.
- 6. The method for constructing a multilingual speech recognition model according to claim 5, wherein, when the second fluency score is lower than the first fluency score and the difference is greater than the threshold, the corresponding time window is determined to be a lyrics-dominant time window and suppression of the corresponding tokens in the first candidate text is triggered; in the gating-decision stage, in addition to making the gating decision from the two branches' fluency-score difference and the threshold, the difference is mapped to a continuous gating strength, which softly suppresses the vocabulary output of the first decoder in the corresponding time window while the first candidate text is being decoded, so that the output probability of the placeholder token rises, and the output probability of non-placeholder tokens falls, as the gating strength increases; the gating decision further introduces a hysteresis boundary or a consecutive-window consistency trigger constraint.
- 7. The method for the multilingual speech recognition model of claim 1, further comprising a dynamic context buffer, wherein, when an external confirmation signal is received, the recognized token, its language tag, and its corresponding acoustic feature sub-segment are written into the buffer.
- 8. The method for constructing a multilingual speech recognition model according to claim 7, wherein the acoustic feature sub-segment is represented as a vector obtained by pooling, projection, and normalization of frame-level acoustic features; during subsequent decoding, the cosine similarity between the current acoustic feature sub-segment and the sub-segments in the buffer is computed, and when an activation condition is satisfied, the vocabulary probability distribution output at the current step of the first decoder is directionally biased; the activation condition is bounded by a similarity threshold and a consecutive-trigger constraint; when the directional bias is triggered, the bias strength is in monotone correspondence with the similarity and is applied to the vocabulary output of the first decoder over several consecutive decoding steps according to a decay rule; and under multilingual mixed decoding, the directional bias is superimposed in parallel with the prior-weight adjustment of the language sub-vocabulary, raising the probability that the sub-vocabulary of the activated language and the confirmed token are biased in consistent directions.
- 9. The method of claim 7 or 8, wherein the first decoder supports multilingual mixed decoding and divides the vocabulary into a plurality of sub-vocabularies, and wherein the probability weights of the sub-vocabularies in the current decoding step are adjusted when the language tag of a buffer entry is activated.
- 10. The method for constructing a multilingual speech recognition model according to claim 1, wherein, when the target language is a low-resource language, a probabilistic sound-change rule set between the target language and a related language is obtained; initial candidate texts, generated by the first decoder from the mixed audio and expressed in the related language, are mapped through the sound-change rules to obtain a plurality of constrained candidate texts, and fusion scoring is performed in combination with the acoustic feature sequence and the output of the first decoder to determine the final spoken-language recognition text; the low-resource language is a language whose training-corpus scale is below a preset scale threshold, the related language is a language genealogically related to the low-resource language, the probabilistic sound-change rule set is a set of sound-change rules carrying occurrence-probability weights, a constrained candidate text is a target-language candidate text mapped by the probabilistic sound-change rule set, and the fusion score is a composite score combining acoustic consistency, first-language-model consistency, and rule consistency.
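The window division of claim 2 and the fluency scoring and hard gating decision of claims 3 and 6 can be sketched as follows. This is a minimal illustration, not the patented implementation: the window length, hop, and threshold values are placeholders, and per-token log-probabilities are assumed as the interface to the first (spoken-language) language model.

```python
def time_windows(duration, win_len=2.0, hop=1.0):
    """Divide the mixed-audio time axis into overlapping windows (claim 2).

    `win_len` is the preset window length and `hop` the preset step size;
    the concrete values here are illustrative only."""
    t, windows = 0.0, []
    while t < duration:
        windows.append((t, min(t + win_len, duration)))
        t += hop
    return windows

def fluency_score(token_logprobs):
    """Fluency score of one candidate in one window (claim 3): the negative
    log conditional probability normalized by token count. Lower means more
    fluent under the spoken-language (first) LM."""
    if not token_logprobs:
        return float("inf")
    return -sum(token_logprobs) / len(token_logprobs)

def is_lyrics_dominant(spoken_logprobs, lyric_logprobs, threshold=0.5):
    """Hard gating decision (claim 6): mark the window for suppression when
    the lyric candidate scores as *more* fluent under the spoken LM and the
    score difference exceeds the threshold."""
    s1 = fluency_score(spoken_logprobs)  # first candidate, spoken branch
    s2 = fluency_score(lyric_logprobs)   # second candidate, lyric branch
    return s2 < s1 and (s1 - s2) > threshold
```

A window flagged by `is_lyrics_dominant` would then have its tokens deleted or replaced by the placeholder token in step S5, using the first time-alignment information.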
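Claim 6 also maps the score difference to a continuous gating strength applied to the decoder's vocabulary logits, with hysteresis to debounce the boundaries of overlapping windows. The sigmoid mapping, the slope and scale constants, and the two-threshold hysteresis form below are assumptions for illustration; the claim only requires some continuous mapping and a hysteresis or consecutive-window consistency constraint.

```python
import math

def gating_strength(s1, s2, threshold=0.5, slope=4.0):
    """Map the fluency-score difference (s1 - s2) to a gate in (0, 1).
    A sigmoid is assumed here; the patent does not fix the mapping."""
    return 1.0 / (1.0 + math.exp(-slope * ((s1 - s2) - threshold)))

def soft_suppress(logits, placeholder_id, g, scale=5.0):
    """Soft suppression (claim 6): as the gate g grows, the placeholder
    token's logit rises and every other token's logit falls."""
    out = [x - g * scale for x in logits]
    out[placeholder_id] = logits[placeholder_id] + g * scale
    return out

class HysteresisGate:
    """Debounce gating across overlapping windows: switch on only above
    `high`, off only below `low`, to avoid jitter at window boundaries."""
    def __init__(self, low=0.3, high=0.7):
        self.low, self.high, self.on = low, high, False

    def step(self, g):
        if self.on and g < self.low:
            self.on = False
        elif not self.on and g > self.high:
            self.on = True
        return self.on
```

With this design a window whose gate value hovers near the decision threshold no longer flips the suppression state on and off between consecutive overlapping windows, which is the jitter the claim addresses.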
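Claims 7 and 8 describe a dynamic context buffer that biases decoding toward previously confirmed tokens whose acoustic sub-segments resemble the current one. A minimal sketch follows, assuming mean pooling of the frame-level features, plain L2 normalization in place of the learned projection the claim mentions, and an additive logit bias whose strength grows monotonically with cosine similarity; the decay over consecutive decoding steps and the consecutive-trigger constraint are omitted for brevity.

```python
import math

def _l2_norm(v):
    """L2-normalize a vector (stand-in for the claim's projection + norm)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(_l2_norm(a), _l2_norm(b)))

class ContextBuffer:
    """Dynamic context buffer (claims 7-8): stores confirmed tokens with
    their language tags and pooled, normalized acoustic sub-segment vectors."""

    def __init__(self, sim_threshold=0.8):
        self.entries = []  # (token_id, lang_tag, vector)
        self.sim_threshold = sim_threshold

    def confirm(self, token_id, lang_tag, frame_feats):
        """On an external confirmation signal, mean-pool the frame-level
        features, normalize, and write the entry into the buffer."""
        pooled = [sum(col) / len(frame_feats) for col in zip(*frame_feats)]
        self.entries.append((token_id, lang_tag, _l2_norm(pooled)))

    def bias(self, logits, current_vec, scale=2.0):
        """Directionally bias the current step's logits toward cached tokens
        whose acoustic vectors exceed the similarity threshold; the bias
        strength grows monotonically with the similarity."""
        out = list(logits)
        for token_id, _lang, vec in self.entries:
            sim = cosine(current_vec, vec)
            if sim > self.sim_threshold:
                out[token_id] += scale * sim
        return out
```

Under multilingual mixed decoding (claim 9), the same activation could additionally raise the prior weight of the whole sub-vocabulary matching the entry's language tag, superimposed on this per-token bias.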
Description
Method for constructing a multilingual speech recognition model
Technical Field
The invention relates to the field of computer technology, and in particular to a method for constructing a multilingual speech recognition model.
Background
In short-video creation and dissemination on social media, platforms typically provide automatic captioning, generated by speech recognition technology, to improve accessibility and retrieval efficiency. Short-video audio is often a mix of the user's spoken narration with background music (BGM); the narration and the BGM overlap over long intervals, and the BGM may contain vocal singing or lyrics, so the mixed signal contains both spoken speech and sung-vocal components. Existing multilingual speech recognition systems mostly adopt end-to-end modeling combined with large-scale pre-training or weakly supervised training to learn cross-language acoustic-to-text mappings. Under mixed conditions of narration and vocal BGM, however, the representations the model extracts tend to carry both spoken and lyric information, and when the narration language differs from the lyric language, or code switching is present, cross-language token mis-insertions, language drift, or semantic breaks become more likely in the decoding stage, reducing subtitle usability.
For such interference, some prior-art schemes improve robustness by adding music-containing mixed speech to the training data, or by introducing a front-end music-detection and speech-enhancement/separation module to suppress accompaniment and singing components. In short-video mixes with strong rhythm and tight coupling between accompaniment and singing, however, front-end separation may introduce distortion or residual interference, and an end-to-end model may still become confused at multilingual boundaries and at boundaries between vocal types, limiting the quality of multilingual subtitle generation.
Disclosure of Invention
The present invention has been made in view of the above problems in the prior art. The invention provides a method for constructing a multilingual speech recognition model, which solves the problem that, when spoken narration is mixed with BGM containing lyrics, lyric text is wrongly written into the subtitles and the semantics are broken. To solve this technical problem, the invention provides the following technical scheme.
An embodiment of the invention provides a method for constructing a multilingual speech recognition model, comprising the following steps: step S1, obtaining mixed audio containing spoken speech and background music, together with a target language label; step S2, constructing a speech recognition model, which comprises acquiring pre-training parameters, loading them into a shared speech encoder, a first decoder, and a second decoder, associating the first decoder with a first language model, and associating the second decoder with a second language model; step S3, extracting an acoustic feature sequence with the speech encoder, wherein the first decoder generates a first candidate text and first time-alignment information and the second decoder generates a second candidate text and second time-alignment information; step S4, evaluating the probability of both candidate texts within the same time window using the first language model to obtain a first fluency score and a second fluency score, and making a gating decision from the score difference and a threshold to determine the time windows to be suppressed, wherein a time window is a time interval obtained by dividing the mixed-audio time axis with a preset window length and step size; and step S5, deleting or placeholder-replacing the first candidate text within the suppressed time windows according to the first time-alignment information, and outputting the spoken-language recognition text of the target language, wherein placeholder replacement substitutes a placeholder token for the suppressed word segment, the placeholder token being a special vocabulary entry representing a suppressed non-spoken segment.
As a preferable scheme of the method for constructing the multilingual speech recognition model, the mixed audio is divided into a plurality of mutually overlapping time windows, and the first fluency score and the second fluency score are calculated independently, and the gating decision made, for each time window. As a preferable scheme of the method for constructing the multilingual speech recognition model, the first fluency score and the s