CN-115881134-B - Voice recognition method, device, storage medium and equipment


Abstract

The application discloses a voice recognition method, device, storage medium and equipment. The method first extracts audio features from the acquired target voice and inputs them into a preset acoustic model to obtain an acoustic-state posterior score for each frame of audio, and constructs a general scene FSA network model from preset general scene FSA hotwords. A preset decoding algorithm then decodes a preset general scene model and the general scene FSA network model in parallel to obtain a first recognition result and a second recognition result. While the target voice is being recognized, special scene FSA hotwords are imported N times, and a special scene FSA network model is constructed from the special scene FSA hotwords and preset condition judgment logic. The special scene FSA network model is then decoded to obtain a third recognition result, and the final recognition result corresponding to the target voice is determined from the three recognition results. In this way, an impact on performance is avoided and the accuracy of the recognition result is effectively improved.

Inventors

LI MIN
WEI CHONGZHOU
LI YONGCHAO
FU ZHONGHUA

Assignees

西安讯飞超脑信息科技有限公司

Dates

Publication Date: 20260505
Application Date: 20221230

Claims (10)

1. A speech recognition method, comprising: preprocessing target voice and extracting audio features of the target voice; inputting the audio features of the target voice into a preset acoustic model to obtain an acoustic-state posterior score for each frame of audio in the target voice, and constructing a general scene FSA network model according to pre-imported general scene finite state acceptor (FSA) hotwords; performing, according to the acoustic-state posterior score of each frame of audio in the target voice, parallel decoding on a preset general scene model and the general scene FSA network model by using a preset decoding algorithm, to obtain a first recognition result and a second recognition result; after the target voice to be recognized is acquired, importing special scene FSA hotwords N times in the process of recognizing the target voice, and constructing a special scene FSA network model according to the special scene FSA hotwords and preset condition judgment logic, wherein N is a positive integer greater than 1; and decoding the special scene FSA network model to obtain a third recognition result, and determining a final recognition result corresponding to the target voice according to the first recognition result, the second recognition result and the third recognition result.
2. The method of claim 1, wherein preprocessing the target voice and extracting the audio features of the target voice comprises: preprocessing the target voice by using windowing and framing, Fourier transform and Mel cepstrum signal processing techniques, and extracting the audio features of the target voice.
3. The method of claim 1, wherein the preset decoding algorithm is a Viterbi algorithm.
4. The method of claim 1, wherein the preset condition judgment logic includes judging whether the number of imports of the special scene FSA hotwords reaches a limit condition, and wherein constructing the special scene FSA network model according to the special scene FSA hotwords and the preset condition judgment logic comprises: judging whether the number of imports of the special scene FSA hotwords reaches the limit condition; and if yes, constructing the special scene FSA network model by using the most recently imported special scene FSA hotwords.
5. The method of claim 1, wherein the preset condition judgment logic includes judging whether the target voice reaches audio of a preset duration during decoding, and wherein constructing the special scene FSA network model according to the special scene FSA hotwords and the preset condition judgment logic comprises: in the process of decoding the target voice, judging whether the target voice reaches audio of the preset duration; and if yes, constructing the special scene FSA network model by using the most recently imported special scene FSA hotwords.
6. The method of claim 1, wherein when the target voice is a single clause, constructing the special scene FSA network model according to the special scene FSA hotwords and the preset condition judgment logic comprises: when it is judged that the number of imports of the special scene FSA hotwords has not reached the limit condition and that the target voice has not reached audio of the preset duration during decoding, judging whether the last frame of the clause corresponding to the target voice is currently being decoded; and if yes, constructing the special scene FSA network model by using the most recently imported special scene FSA hotwords.
7. The method of claim 1, wherein when the target voice includes M clauses and M is a positive integer greater than 1, constructing the special scene FSA network model according to the special scene FSA hotwords and the preset condition judgment logic comprises: if it is judged that the number of imports of the special scene FSA hotwords has not reached the limit condition and that the target voice has not reached audio of the preset duration during decoding, then, when the last frame of the first clause in the target voice is decoded, constructing the special scene FSA network model by using the most recently imported special scene FSA hotwords; and for the subsequent M-1 clauses, sequentially repeating the judgment of whether the target voice reaches audio of the preset duration during decoding, and updating the special scene FSA network model according to the judgment result.
8. A speech recognition apparatus, comprising: an acquisition unit, configured to acquire target voice to be recognized, preprocess the target voice and extract audio features of the target voice; a first construction unit, configured to input the audio features of the target voice into a preset acoustic model to obtain an acoustic-state posterior score for each frame of audio in the target voice, and to construct a general scene FSA network model according to pre-imported general scene finite state acceptor (FSA) hotwords; a decoding unit, configured to perform, according to the acoustic-state posterior score of each frame of audio in the target voice, parallel decoding on a preset general scene model and the general scene FSA network model by using a preset decoding algorithm, to obtain a first recognition result and a second recognition result; a second construction unit, configured to import special scene FSA hotwords N times in the process of recognizing the target voice after the target voice to be recognized is acquired, and to construct a special scene FSA network model according to the special scene FSA hotwords and preset condition judgment logic, wherein N is a positive integer greater than 1; and a determining unit, configured to decode the special scene FSA network model to obtain a third recognition result, and to determine a final recognition result corresponding to the target voice according to the first recognition result, the second recognition result and the third recognition result.
9. A voice recognition device is characterized by comprising a processor, a memory and a system bus; the processor and the memory are connected through the system bus; the memory is for storing one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method of any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein instructions, which when run on a terminal device, cause the terminal device to perform the method of any of claims 1-7.
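
The condition judgment logic recited in claims 4 to 6 can be illustrated with a minimal sketch. It is not the claimed implementation: the names `RebuildPolicy`, `should_rebuild` and `latest_hotwords` are hypothetical, the concrete thresholds are placeholders, and reading "reaches the limit condition" as a greater-than-or-equal comparison is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RebuildPolicy:
    """Hypothetical thresholds; the claims name them only abstractly."""
    max_imports: int          # assumed form of the import-count limit condition
    max_audio_seconds: float  # assumed form of the preset audio duration


def should_rebuild(policy: RebuildPolicy,
                   import_count: int,
                   decoded_seconds: float,
                   at_last_frame_of_clause: bool) -> bool:
    """Return True when the special scene FSA network should be (re)built."""
    # Claim 4: the number of hotword imports has reached the limit condition.
    if import_count >= policy.max_imports:
        return True
    # Claim 5: the decoded audio has reached the preset duration.
    if decoded_seconds >= policy.max_audio_seconds:
        return True
    # Claim 6: neither condition holds, so fall back to the end of the clause.
    return at_last_frame_of_clause


def latest_hotwords(imports: List[List[str]]) -> Optional[List[str]]:
    """Pick the most recently imported hotword list, or None if none exists."""
    return imports[-1] if imports else None


policy = RebuildPolicy(max_imports=3, max_audio_seconds=2.0)
if should_rebuild(policy, import_count=1, decoded_seconds=0.5,
                  at_last_frame_of_clause=True):
    print("rebuild with", latest_hotwords([["song A"], ["song B", "song C"]]))
```

In this sketch the three checks are ordered as in the claims, so the end-of-clause check matters only when the first two conditions have not been met.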

Description

Voice recognition method, device, storage medium and equipment

Technical Field

The present application relates to the field of speech processing technologies, and in particular, to a speech recognition method, apparatus, storage medium, and device.

Background

With continuing breakthroughs in artificial intelligence technology and the growing popularity of intelligent terminal devices, interaction based on voice recognition has become a widely used service solution. In numerous application scenes, such as making calls, sending messages, querying weather information, navigation and positioning, and searching for music and videos, voice recognition technology helps users carry out a specific intended action. Voice recognition technology has therefore long been a research hot spot in the related art.

A conventional speech recognition method generally adopts the recognition flow shown in Fig. 1. Before speech is input, a number of specific hotwords are imported into the speech recognition system, and a command word network, namely a finite state acceptor (FSA) network, is constructed from the hotwords and specific sentence patterns. A weighted finite state transducer (WFST) is connected in parallel with the FSA, the two networks are decoded simultaneously to obtain their respective results, the scores are compared, and the result of the higher-scoring network is taken as the speech recognition result.
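
The score comparison at the end of the conventional flow can be sketched as follows. This is an illustrative sketch only: the `Hypothesis` type, the `pick_best` function, and the example texts and scores are hypothetical and are not taken from the application.

```python
from typing import NamedTuple, Sequence


class Hypothesis(NamedTuple):
    """One decoding result; all field names are illustrative assumptions."""
    network: str   # which network produced the result, e.g. "WFST" or "FSA"
    text: str      # recognized text
    score: float   # decoding score, assumed comparable across networks


def pick_best(hypotheses: Sequence[Hypothesis]) -> Hypothesis:
    """Return the hypothesis with the highest score."""
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    return max(hypotheses, key=lambda h: h.score)


results = [
    Hypothesis("WFST", "play some music", -42.0),
    Hypothesis("FSA", "play song A", -37.5),
]
print(pick_best(results).text)
```

The sketch assumes the scores of the two networks are directly comparable; any score normalization between networks is outside what the text describes.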
However, in practical applications, some FSAs need to support scene switching, and this speech recognition method cannot update the network as the scene switches during speaking. Two problems arise. First, for a short-sentence (single-clause) scene, if the user switches to an audio scene while speaking, the current speech recognition method cannot update the network during speaking. Second, for a long-sentence (multi-clause) scene, if the user speaks the first clause and then switches to an audio scene, the current speech recognition method cannot update the network for the subsequent clauses. As a result, the final speech recognition result may be inaccurate.

Disclosure of Invention

The main purpose of the embodiments of the application is to provide a voice recognition method, device, storage medium and equipment that can effectively improve the accuracy of the recognition result when voice recognition is performed.
An embodiment of the application provides a voice recognition method, which comprises the following steps: preprocessing target voice and extracting audio features of the target voice; inputting the audio features of the target voice into a preset acoustic model to obtain an acoustic-state posterior score for each frame of audio in the target voice, and constructing a general scene FSA network model according to pre-imported general scene finite state acceptor (FSA) hotwords; performing, according to the acoustic-state posterior score of each frame of audio in the target voice, parallel decoding on a preset general scene model and the general scene FSA network model by using a preset decoding algorithm, to obtain a first recognition result and a second recognition result; in the process of recognizing the target voice, importing special scene FSA hotwords N times, and constructing a special scene FSA network model according to the special scene FSA hotwords and preset condition judgment logic, wherein N is a positive integer greater than 1; and decoding the special scene FSA network model to obtain a third recognition result, and determining a final recognition result corresponding to the target voice according to the first recognition result, the second recognition result and the third recognition result.

In a possible implementation, preprocessing the target voice and extracting the audio features of the target voice includes: preprocessing the target voice by using windowing and framing, Fourier transform and Mel cepstrum signal processing techniques, and extracting the audio features of the target voice.

In a possible implementation, the preset decoding algorithm is a Viterbi algorithm.
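
As a rough illustration of how a Viterbi algorithm consumes per-frame acoustic-state scores, the sketch below decodes a toy state sequence. It is a textbook Viterbi search in the log domain, not the decoder of the application: the state names, the transition table, and the function `viterbi` with its parameters are all assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def viterbi(states: Sequence[str],
            log_init: Dict[str, float],
            log_trans: Dict[Tuple[str, str], float],
            frame_log_scores: List[Dict[str, float]]) -> Tuple[List[str], float]:
    """Return the best state path and its log score.

    frame_log_scores[t][s] is the (assumed) log acoustic score of state s
    at frame t; missing entries are treated as impossible (-inf).
    """
    if not frame_log_scores:
        return [], 0.0
    neg_inf = -math.inf
    # best[s]: best log score of any path ending in state s at the current frame
    best = {s: log_init.get(s, neg_inf) + frame_log_scores[0].get(s, neg_inf)
            for s in states}
    backpointers: List[Dict[str, str]] = []
    for frame in frame_log_scores[1:]:
        new_best: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for s in states:
            # Choose the predecessor that maximizes the accumulated score.
            prev, score = max(
                ((p, best[p] + log_trans.get((p, s), neg_inf)) for p in states),
                key=lambda item: item[1],
            )
            new_best[s] = score + frame.get(s, neg_inf)
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


states = ["a", "b"]
log_init = {"a": 0.0}  # decoding must start in state "a"
log_trans = {("a", "a"): math.log(0.5), ("a", "b"): math.log(0.5), ("b", "b"): 0.0}
frames = [
    {"a": math.log(0.9), "b": math.log(0.1)},
    {"a": math.log(0.2), "b": math.log(0.8)},
    {"a": math.log(0.1), "b": math.log(0.9)},
]
path, score = viterbi(states, log_init, log_trans, frames)
print(path, score)
```

A production decoder would search a WFST or FSA graph with pruning rather than a dense transition table; the dense form is used here only to keep the example short.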
In a possible implementation, the preset condition judgment logic includes judging whether the number of imports of the special scene FSA hotwords reaches a limit condition, and constructing the special scene FSA network model according to the special scene FSA hotwords and the preset condition judgment logic includes: judging whether the number of imports of the special scene FSA hotwords reaches the limit condition; and if yes, constructing the special scene FSA network model by using the most recently imported special scene FSA hot