CN-121983034-A - Dynamic noise self-adaptive voice recognition method, device, equipment and medium based on large language model

Abstract

The application discloses a dynamic noise self-adaptive voice recognition method, device, equipment and medium based on a large language model, and relates to the technical field of computers. The method comprises: analyzing, in real time, short-time sub-signals intercepted from an original voice signal to identify the noise environment type and the signal-to-noise ratio level of the original voice signal; determining a target signal processing path from a plurality of preset signal processing paths based on the noise environment type and the signal-to-noise ratio level; processing the original voice signal based on the target signal processing path to obtain corresponding acoustic features; generating a corresponding environment perception prompt based on the noise environment type, the signal-to-noise ratio level and the target signal processing path; and decoding, with a preset large language model, the data obtained by a modal fusion operation on the environment perception prompt and the acoustic features, to obtain the target recognition text. The method can therefore sense the environment in real time and adaptively select the optimal signal processing strategy when performing voice recognition in complex and changeable noise environments.

Inventors

ZHOU SIZHONG
LIAN ZENING
PAN HONGPING
JIANG QI

Assignees

杭州大岳智擎科技有限公司

Dates

Publication Date: 2026-05-05
Application Date: 2026-04-02

Claims (10)

1. A method for dynamic noise adaptive speech recognition based on a large language model, comprising: intercepting an original voice signal to obtain a short-time sub-signal, and analyzing the short-time sub-signal in real time to identify a noise environment type and a signal-to-noise ratio level of the original voice signal, wherein the noise environment type is used for representing prior probability distribution of an interference source of the original voice signal, and the signal-to-noise ratio level is used for representing quality level of the original voice signal; determining a target signal processing path from a plurality of preset signal processing paths based on the noise environment type and the signal-to-noise ratio level, and processing the original voice signal based on the target signal processing path so as to obtain corresponding acoustic features based on the processed voice signal; and generating a corresponding environment perception prompt based on the noise environment type, the signal-to-noise ratio level and the target signal processing path, and decoding, by utilizing a preset large language model, the data obtained after the environment perception prompt and the acoustic features are subjected to a modal fusion operation, so as to obtain a target recognition text.
2. The large language model based dynamic noise adaptive speech recognition method according to claim 1, wherein the real-time analysis of the short-time sub-signals to recognize the noise environment type and signal-to-noise ratio level of the original speech signal comprises: extracting features of the short-time sub-signals to obtain sub-signal features; carrying out real-time analysis on the sub-signal features by using a preset lightweight classification network to obtain a corresponding probability distribution over a plurality of preset environment categories and a probability distribution over a plurality of quantized signal-to-noise ratio intervals; determining the noise environment type and the signal-to-noise ratio level of the original voice signal based on the probability distribution over the plurality of preset environment categories and the probability distribution over the plurality of quantized signal-to-noise ratio intervals; wherein the noise environment types comprise a preset stable reference environment type, a preset stable broadband noise type, a preset non-stable broadband noise type, a multi-speaker interference type, a preset high reverberation environment type and a composite noise environment, and the signal-to-noise ratio level is divided into a preset high signal-to-noise ratio and a preset low signal-to-noise ratio based on a preset signal-to-noise ratio threshold.
3. The large language model based dynamic noise adaptive speech recognition method of claim 2, wherein the determining a target signal processing path from a plurality of preset signal processing paths based on the noise environment type and the signal-to-noise ratio level comprises: determining a path index based on the noise environment type and the signal-to-noise ratio level by using a preset decision matrix, and determining a target signal processing path from a plurality of preset signal processing paths based on the path index; wherein the preset signal processing paths comprise a first preset signal processing path for the case where the signal-to-noise ratio level is the preset high signal-to-noise ratio and the noise environment type is the preset stable reference environment type, a second preset signal processing path for the case where the signal-to-noise ratio level is the preset high signal-to-noise ratio and the noise environment type is the preset stable broadband noise type, a third preset signal processing path for the case where the signal-to-noise ratio level is the preset low signal-to-noise ratio and the noise environment type is the preset stable broadband noise type, a fourth preset signal processing path for the preset non-stable broadband noise type, the multi-speaker interference type and the preset high reverberation environment type, and a fifth preset signal processing path for the composite noise environment.
4. The large language model based dynamic noise adaptive speech recognition method of claim 3, wherein said processing said original speech signal based on said target signal processing path comprises: if the target signal processing path is a first preset signal processing path, carrying out standardization processing on the original voice signal; if the target signal processing path is a second preset signal processing path, carrying out Wiener filtering processing on the original voice signal in a short-time Fourier transform domain; if the target signal processing path is a third preset signal processing path, performing spectral subtraction processing on the original voice signal in a power spectral domain; if the target signal processing path is a fourth preset signal processing path, performing signal separation processing on the original voice signal by using a preset time domain convolutional neural network; and if the target signal processing path is a fifth preset signal processing path, processing the original voice signal by using a preset two-stage serial processing architecture, wherein the preset two-stage serial processing architecture comprises a first-stage processing architecture for carrying out Wiener filtering processing or spectral subtraction processing on the original voice signal and a second-stage processing architecture for carrying out signal separation processing on the original voice signal, and the output end of the first-stage processing architecture is connected with the input end of the second-stage processing architecture.
5. The large language model based dynamic noise adaptive speech recognition method according to claim 4, wherein the obtaining of corresponding acoustic features based on the processed speech signal comprises: performing a feature extraction operation on the processed voice signal by using a preset feature alignment layer to obtain corresponding acoustic features with fixed dimensions.
6. The large language model based dynamic noise adaptive speech recognition method of claim 1, wherein the generating the corresponding environment perception prompt based on the noise environment type, the signal-to-noise ratio level, and the target signal processing path comprises: querying a preset structured prompt template library based on the noise environment type, the signal-to-noise ratio level and the target signal processing path to obtain a target template; filling the noise environment type, the signal-to-noise ratio level and the target signal processing path into the target template to generate a corresponding environment perception prompt; wherein the environment perception prompt is a natural language text prompt describing the current environment and the processing measures.
7. The large language model based dynamic noise adaptive speech recognition method according to claim 1, wherein the decoding of the data after the modal fusion operation of the environment perception prompt and the acoustic features by using a preset large language model to obtain the target recognition text comprises: converting the environment perception prompt into a semantic vector sequence, and splicing the semantic vector sequence and the acoustic features in a feature dimension to generate a fusion feature sequence; and inputting the fusion feature sequence into a preset large language model so as to utilize the preset large language model to perform a decoding operation to obtain the target recognition text.
8. A large language model based dynamic noise adaptive speech recognition device, comprising: a signal analysis module, used for intercepting an original voice signal to obtain a short-time sub-signal, and analyzing the short-time sub-signal in real time to identify the noise environment type and the signal-to-noise ratio level of the original voice signal, wherein the noise environment type is used for representing the prior probability distribution of an interference source of the original voice signal, and the signal-to-noise ratio level is used for representing the quality level of the original voice signal; a signal processing module, used for determining a target signal processing path from a plurality of preset signal processing paths based on the noise environment type and the signal-to-noise ratio level, and processing the original voice signal based on the target signal processing path so as to obtain corresponding acoustic features based on the processed voice signal; and a signal recognition module, used for generating a corresponding environment perception prompt based on the noise environment type, the signal-to-noise ratio level and the target signal processing path, and decoding, by utilizing a preset large language model, the data obtained after the environment perception prompt and the acoustic features are subjected to a modal fusion operation, so as to obtain a target recognition text.
9. An electronic device, comprising: a memory for storing a computer program; and a processor for executing the computer program to implement the large language model based dynamic noise adaptive speech recognition method according to any one of claims 1 to 7.
10. A computer readable storage medium for storing a computer program, wherein the computer program when executed by a processor implements the large language model based dynamic noise adaptive speech recognition method according to any one of claims 1 to 7.
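
The path selection of claim 3 and the prompt filling of claim 6 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the applicant's implementation: the names `Env`, `Snr`, `DECISION_MATRIX`, `TEMPLATE_LIBRARY`, `select_path` and `build_prompt` are hypothetical, the template wording is invented, and the single-entry template library stands in for the structured library that claim 6 queries. The sketch also assumes that the fourth and fifth paths apply at either signal-to-noise ratio level, and it raises an error for the one combination claim 3 does not assign (stable reference environment at low signal-to-noise ratio).

```python
from enum import Enum


class Env(Enum):
    """Noise environment types listed in claim 2 (member names are illustrative)."""
    STABLE_REFERENCE = "stable reference environment"
    STABLE_BROADBAND = "stable broadband noise"
    NON_STABLE_BROADBAND = "non-stable broadband noise"
    MULTI_SPEAKER = "multi-speaker interference"
    HIGH_REVERB = "high reverberation environment"
    COMPOSITE = "composite noise environment"


class Snr(Enum):
    """Two-level signal-to-noise ratio quantization from claim 2."""
    HIGH = "high"
    LOW = "low"


# Processing measure per path index, paraphrasing claim 4.
PATH_MEASURES = {
    1: "standardization only",
    2: "Wiener filtering in the short-time Fourier transform domain",
    3: "spectral subtraction in the power spectral domain",
    4: "signal separation with a time-domain convolutional neural network",
    5: "Wiener filtering or spectral subtraction followed by signal separation",
}

# Decision matrix of claim 3: (environment type, SNR level) -> path index.
DECISION_MATRIX = {
    (Env.STABLE_REFERENCE, Snr.HIGH): 1,
    (Env.STABLE_BROADBAND, Snr.HIGH): 2,
    (Env.STABLE_BROADBAND, Snr.LOW): 3,
}
# Assumption: paths 4 and 5 are chosen regardless of the SNR level.
for _env in (Env.NON_STABLE_BROADBAND, Env.MULTI_SPEAKER, Env.HIGH_REVERB):
    for _snr in Snr:
        DECISION_MATRIX[(_env, _snr)] = 4
for _snr in Snr:
    DECISION_MATRIX[(Env.COMPOSITE, _snr)] = 5

# Hypothetical template library with one entry; the wording is not from the source.
TEMPLATE_LIBRARY = {
    "default": "Environment: {env}; signal-to-noise ratio: {snr}; "
               "front-end processing applied: {measure}.",
}


def select_path(env, snr):
    """Look up the path index for an (environment type, SNR level) pair."""
    try:
        return DECISION_MATRIX[(env, snr)]
    except KeyError:
        raise ValueError(f"claim 3 assigns no path to {env.name} at {snr.name} SNR")


def build_prompt(env, snr, path):
    """Fill the template with the environment, SNR level and processing measure."""
    template = TEMPLATE_LIBRARY["default"]
    return template.format(env=env.value, snr=snr.value, measure=PATH_MEASURES[path])


if __name__ == "__main__":
    path = select_path(Env.STABLE_BROADBAND, Snr.LOW)
    print(path, build_prompt(Env.STABLE_BROADBAND, Snr.LOW, path))
```

Keeping the decision matrix as a plain lookup table mirrors the "path index" wording of claim 3 and makes the unassigned combination explicit instead of silently falling back to a default path.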

Description

Dynamic noise self-adaptive voice recognition method, device, equipment and medium based on large language model

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a medium for dynamic noise adaptive speech recognition based on a large language model.

Background

Speech recognition technology aims to automatically convert human speech into text, and is one of the key technologies supporting human-machine interaction. Its development has evolved from traditional statistical models to modern end-to-end deep learning models. A conventional speech recognition system is generally composed of several independent modules, such as an acoustic model, a pronunciation model and a language model, which respectively handle acoustic feature mapping, phoneme-to-word conversion and sentence probability evaluation. This modular architecture is complex to train, and because the modules have inconsistent optimization targets, errors tend to accumulate and degrade overall performance.

In recent years, end-to-end models based on deep learning have become the mainstream. Such a model integrates the speech recognition task into a single neural network and directly maps the acoustic feature sequence to a text sequence, which significantly simplifies the system flow and improves recognition accuracy. A typical end-to-end architecture employs an encoder-decoder framework in which the encoder (e.g., a Transformer or Conformer structure) extracts high-level semantic information from the speech features, and the decoder generates the corresponding text autoregressively. To further improve linguistic fluency and accuracy, current research has begun to introduce pre-trained large language models as decoders in order to exploit their powerful language generation and knowledge reasoning capabilities.

However, in a noisy environment the quality of the speech signal is severely degraded, so recognition accuracy drops sharply, and improving the noise robustness of speech recognition models has become a core challenge in the field. To cope with noise interference, the prior art mainly adopts the following two schemes.

The first scheme is the passive robust model based on large-scale data training, represented by OpenAI's Whisper model. This scheme trains the model end to end on massive audio data covering many noise conditions and accents, so that the model learns to resist various types of interference. However, its performance is highly dependent on the coverage of the training data: it generalizes poorly to noise environments that are extreme or under-represented in the training set, and it therefore faces the bottleneck of performance degradation under unknown noise.

The second scheme adopts the classical idea of separating noise reduction from recognition: the noisy speech is first enhanced by a signal processing module with a fixed strategy (such as spectral subtraction, Wiener filtering or a speech enhancement model), and the processed signal is then input into a standard speech recognition model. Because the front-end processing strategy and its intensity are preset and cannot be adjusted, the system struggles to reach its best performance in a dynamically changing noise environment, and often faces the trade-off between excessive and insufficient processing.

From the above, how to perform speech recognition in a complex and changeable noise environment by sensing the environment in real time and adaptively selecting the optimal signal processing strategy is a problem to be solved.
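
The trade-off of a fixed-strategy front end can be made concrete with a minimal sketch of power spectral subtraction. It uses a common textbook form (subtract a scaled noise estimate and clamp to a small spectral floor), which is an assumption here and not necessarily the form used in the prior-art systems or in this application. The function name `spectral_subtract` and the parameters `alpha` (over-subtraction factor) and `beta` (floor factor) are hypothetical, and the power spectra are passed in directly as plain lists instead of being computed from audio.

```python
def spectral_subtract(power_frames, noise_power, alpha=1.0, beta=0.01):
    """Subtract a noise power estimate from every frame of a power spectrogram.

    power_frames: list of frames, each a list of per-bin power values.
    noise_power:  per-bin noise power estimate (same length as each frame).
    alpha:        fixed over-subtraction factor (illustrative parameter).
    beta:         spectral floor as a fraction of the noise power (illustrative).
    """
    cleaned = []
    for frame in power_frames:
        cleaned.append([
            # Clamp to a small floor so the result never becomes negative.
            max(p - alpha * n, beta * n)
            for p, n in zip(frame, noise_power)
        ])
    return cleaned


if __name__ == "__main__":
    noisy = [[4.0, 1.0]]   # bin 0 holds speech plus noise, bin 1 holds noise only
    noise = [1.0, 1.0]
    for alpha in (0.5, 1.0, 4.0):
        print(alpha, spectral_subtract(noisy, noise, alpha=alpha))
```

In this toy frame, `alpha=0.5` leaves residual noise in the noise-only bin (insufficient processing), `alpha=1.0` removes it while keeping the speech bin, and `alpha=4.0` pushes the speech bin down to the floor as well (excessive processing). A preset `alpha` cannot suit all three situations, which is the limitation described above.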
Disclosure of Invention

In view of the above, the present invention aims to provide a method, a device, and a medium for dynamic noise adaptive speech recognition based on a large language model, which can sense an environment in real time and adaptively select an optimal signal processing strategy for speech recognition in a complex and variable noise environment. The specific scheme is as follows:

In a first aspect, the present application provides a method for dynamic noise adaptive speech recognition based on a large language model, including: intercepting an original voice signal to obtain a short-time sub-signal, and analyzing the short-time sub-signal in real time to identify a noise environment type and a signal-to-noise ratio level of the original voice signal, wherein the noise environment type is used for representing prior probability distribution of an interference source of the original voice signal, and the signal-to-noise ratio level is used for representing quality level of the original voice signal; determining a target signal processing path from a plurality of preset signal processing paths based on the noise environment type and the signal-to-noise ratio level, and processing the original voice signal based on the targ