CN-121999764-A - Speech recognition method, device, electronic equipment and storage medium
Abstract
The application relates to the technical field of artificial intelligence and provides a speech recognition method comprising the steps of: acquiring voice data of a target child user; performing perceptual noise reduction processing on the voice data to obtain preprocessed voice data; performing acoustic feature extraction processing on the preprocessed voice data to obtain an acoustic feature vector; performing speech recognition processing on the acoustic feature vector through a preset speech recognition model to obtain a first semantic vector corresponding to the voice data; correcting the first semantic vector by a context fusion mechanism in combination with a second semantic vector in a context memory bank to generate a target semantic vector; and outputting a corresponding target speech recognition result based on the target semantic vector. The application addresses the problems that context-logic jumps and repeated pronunciations frequently occur in children's conversations, that existing speech recognition methods therefore struggle to recognize children's speech accurately, and that the user experience is consequently poor.
Inventors
- CHENG BING
- MAO WEIPENG
Assignees
- 深圳市噜咔博士科技有限公司
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2025-12-18
Claims (10)
- 1. A speech recognition method, comprising the steps of: acquiring voice data of a target child user; performing perceptual noise reduction processing on the voice data to obtain preprocessed voice data; performing acoustic feature extraction processing on the preprocessed voice data to obtain an acoustic feature vector; performing speech recognition processing on the acoustic feature vector through a preset speech recognition model to obtain a first semantic vector corresponding to the voice data; correcting the first semantic vector by a context fusion mechanism in combination with a second semantic vector in a context memory bank to generate a target semantic vector; and outputting a corresponding target speech recognition result based on the target semantic vector.
- 2. The speech recognition method according to claim 1, wherein the performing acoustic feature extraction processing on the preprocessed voice data to obtain an acoustic feature vector comprises: performing fundamental frequency extraction processing on the preprocessed voice data through a preset acoustic model to obtain a fundamental frequency feature; performing formant extraction processing on the preprocessed voice data to obtain a formant feature; determining an age interval of the target child user based on the fundamental frequency feature and the formant feature; dynamically adjusting a parameter configuration of the preset acoustic model based on the age interval to obtain a parameter-adjusted acoustic model that matches the vocal frequency response characteristics of the target child user; and processing the preprocessed voice data through the parameter-adjusted acoustic model to obtain the acoustic feature vector.
- 3. The speech recognition method according to claim 1, wherein the performing speech recognition processing on the acoustic feature vector through the preset speech recognition model to obtain a first semantic vector corresponding to the voice data comprises: performing speech-to-text processing on the acoustic feature vector through the preset speech recognition model to generate a plurality of candidate recognition texts and corresponding confidence scores; re-scoring the confidence scores of the candidate recognition texts in combination with a current noise type, and taking the candidate recognition text with the highest re-scored value as target text data; and performing semantic extraction processing on the target text data to obtain the first semantic vector.
- 4. The speech recognition method according to claim 1, wherein before the performing speech recognition processing on the acoustic feature vector through the preset speech recognition model to obtain the first semantic vector corresponding to the voice data, the method further comprises: acquiring a training data set and a pre-trained speech recognition model, wherein the training data set comprises sample voice data and semantic annotation data of the sample voice data; performing noise-invariance training on the pre-trained speech recognition model to obtain a first pre-trained speech recognition model; and performing fine-tuning training on the first pre-trained speech recognition model through the training data set to obtain the preset speech recognition model after training is completed.
- 5. The speech recognition method according to claim 1, wherein the correcting the first semantic vector by the context fusion mechanism in combination with the second semantic vector in the context memory bank to generate the target semantic vector comprises: loading second semantic vectors of the last N rounds of dialogue from the context memory bank; performing grammatical structure analysis on the first semantic vector to obtain a grammatical structure analysis result; calculating a semantic relevance weight between the first semantic vector and each second semantic vector; determining a dialogue intent of the target child user in combination with the context of the interaction scene; and correcting the first semantic vector by the context fusion mechanism based on the grammatical structure analysis result, the semantic relevance weight, and the dialogue intent to generate the target semantic vector.
- 6. The speech recognition method according to claim 1, wherein after the outputting of the corresponding target speech recognition result based on the target semantic vector, the method further comprises: adding the target semantic vector to the context memory bank; and dynamically managing and updating the semantic vectors stored in the context memory bank based on a preset evaluation strategy.
- 7. The speech recognition method according to claim 1, wherein after the outputting of the corresponding target speech recognition result based on the target semantic vector, the method further comprises: when system recognition fails or the target child user actively makes a correction, acquiring the voice data and the correct label associated with the recognition failure or the user's active correction; and performing fine-tuning optimization on the speech recognition model based on the voice data and the correct label.
- 8. A speech recognition device, characterized in that the speech recognition device comprises: an acquisition module, configured to acquire voice data of a target child user; a first processing module, configured to perform perceptual noise reduction processing on the voice data to obtain preprocessed voice data; a second processing module, configured to perform acoustic feature extraction processing on the preprocessed voice data to obtain an acoustic feature vector; a third processing module, configured to perform speech recognition processing on the acoustic feature vector through a preset speech recognition model to obtain a first semantic vector corresponding to the voice data; an error correction module, configured to correct the first semantic vector by a context fusion mechanism in combination with a second semantic vector in a context memory bank to generate a target semantic vector; and an output module, configured to output a corresponding target speech recognition result based on the target semantic vector.
- 9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the speech recognition method according to any one of claims 1 to 7 when executing the computer program.
- 10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the speech recognition method according to any one of claims 1 to 7.
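The end-to-end flow of claim 1 can be sketched in code. The following is a minimal runnable illustration only, not the patent's implementation: every component is an assumed toy stand-in (a moving average for perceptual noise reduction, per-frame log energy for acoustic features, a frozen random projection for the preset recognition model, and a mean-blend for the context fusion mechanism).

```python
import numpy as np

def perceptual_denoise(audio):
    # Toy stand-in for perception noise reduction: 3-point moving average.
    kernel = np.ones(3) / 3.0
    return np.convolve(audio, kernel, mode="same")

def acoustic_features(audio, frame=64):
    # Toy stand-in for acoustic feature extraction: per-frame log energy.
    n = len(audio) // frame
    frames = audio[: n * frame].reshape(n, frame)
    return np.log1p((frames ** 2).sum(axis=1))

def first_semantic_vector(features, dim=8):
    # Toy stand-in for the preset speech recognition model: a frozen
    # random projection into a fixed-size "semantic" vector.
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((dim, len(features)))
    v = proj @ features
    return v / (np.linalg.norm(v) + 1e-9)

def context_fuse(first_vec, memory, alpha=0.7):
    # Toy stand-in for the context fusion mechanism: blend the first
    # semantic vector with the mean of the stored second semantic
    # vectors, then re-normalise.
    if not memory:
        return first_vec
    ctx = np.mean(memory, axis=0)
    fused = alpha * first_vec + (1 - alpha) * ctx
    return fused / (np.linalg.norm(fused) + 1e-9)

def recognize(audio, memory):
    # The six claimed steps, in order; returns the target semantic vector.
    denoised = perceptual_denoise(audio)
    feats = acoustic_features(denoised)
    first_vec = first_semantic_vector(feats)
    target_vec = context_fuse(first_vec, memory)
    memory.append(target_vec)  # claim 6: add the result to the memory bank
    return target_vec
```

A caller would keep one `memory` list per dialogue session and pass each utterance's samples through `recognize`; the list then plays the role of the context memory bank across turns.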
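Claim 2's idea of inferring an age interval from vocal characteristics can be illustrated with a fundamental-frequency estimate alone. The autocorrelation pitch tracker below is a standard textbook method, and the F0-to-age thresholds are illustrative assumptions (children's F0 is typically higher than adults'), not values from the patent; a real system would also use the formant features the claim describes.

```python
import numpy as np

def estimate_f0(signal, sr=16000, fmin=80, fmax=500):
    # Autocorrelation pitch estimate, searching only lags that
    # correspond to plausible speaking pitches [fmin, fmax].
    sig = signal - signal.mean()
    ac = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

def age_interval(f0_hz):
    # Illustrative mapping from F0 to a coarse age interval
    # (assumed thresholds, for demonstration only).
    if f0_hz >= 280:
        return "young child (approx. 3-6)"
    if f0_hz >= 220:
        return "older child (approx. 7-12)"
    return "adolescent/adult"
```

On a synthetic 300 Hz tone (a stand-in for a high-pitched child's voice), `estimate_f0` recovers a frequency near 300 Hz and `age_interval` maps it to the youngest bucket.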
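Claim 3's noise-aware re-scoring of candidate texts can be sketched as follows. The penalty table, its values, and the length-based heuristic (short hypotheses are more likely spurious under noise) are all assumptions for illustration; the patent does not specify the re-scoring formula.

```python
# Illustrative noise-type penalties (assumed values).
NOISE_PENALTY = {
    "quiet": 0.0,
    "toy_background_music": 0.10,
    "multi_speaker": 0.20,
}

def rescore(candidates, noise_type):
    # candidates: list of (text, confidence in [0, 1]) pairs from the
    # speech-to-text stage. Returns the best text after re-scoring.
    penalty = NOISE_PENALTY.get(noise_type, 0.0)

    def score(text, conf):
        # Toy heuristic: under noise, shorter hypotheses absorb a
        # larger share of the penalty.
        return conf - penalty / max(len(text.split()), 1)

    return max(candidates, key=lambda tc: score(*tc))[0]
```

For example, with candidates `[("play a song", 0.80), ("play", 0.84)]`, the raw confidence favours the short hypothesis, but under a `"multi_speaker"` noise type the re-scoring shifts the choice to the longer one.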
Description
Speech recognition method, device, electronic equipment and storage medium

Technical Field

The application belongs to the technical field of artificial intelligence, and in particular relates to a speech recognition method, a speech recognition device, electronic equipment, and a storage medium.

Background

An intelligent toy can provide question-and-answer interaction while accompanying a child: the user initiates a question, and the intelligent toy answers according to the content of the user's question. At present, existing speech recognition methods are mainly trained on adult speech data and therefore struggle to recognize children's speech accurately. Children's pronunciation typically suffers from unclear syllables, unstable speaking rate, irregular pauses, and incomplete semantic expression, so existing speech recognition systems achieve low accuracy on children's speech input. In addition, context-logic jumps and repeated pronunciations frequently occur in children's conversations, which increases the difficulty of semantic understanding and degrades the interactive experience of products such as intelligent toys and educational robots. A high-accuracy speech recognition method is therefore needed to address the problem that existing methods, faced with the context-logic jumps and repeated pronunciations common in children's conversations, struggle to recognize children's speech accurately, resulting in poor user experience.
Disclosure of Invention

The embodiment of the application provides a speech recognition method that addresses the problems that context-logic jumps and repeated pronunciations frequently occur in children's conversations, that existing speech recognition methods therefore struggle to recognize children's speech accurately, and that the user experience is poor. The method comprises: performing perceptual noise reduction processing on voice data of a target child user to obtain preprocessed voice data; performing acoustic feature extraction processing on the preprocessed voice data to obtain an acoustic feature vector; performing speech recognition processing on the acoustic feature vector through a preset speech recognition model to obtain a first semantic vector corresponding to the voice data; correcting the first semantic vector by a context fusion mechanism in combination with a second semantic vector in a context memory bank to generate a target semantic vector; and outputting a corresponding target speech recognition result based on the target semantic vector.
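The context-fusion correction described above can be illustrated with a small sketch, assuming the first and second semantic vectors are plain dense embeddings. The cosine-similarity relevance weights correspond to claim 5's "semantic relevance weights"; the softmax normalisation and the blending factor `alpha` are illustrative assumptions, not the patent's mechanism.

```python
import numpy as np

def relevance_weights(first_vec, memory):
    # Cosine similarity between the current first semantic vector and
    # each stored second semantic vector, normalised with a softmax so
    # the weights sum to 1.
    sims = np.array([
        float(v @ first_vec / (np.linalg.norm(v) * np.linalg.norm(first_vec)))
        for v in memory
    ])
    e = np.exp(sims - sims.max())
    return e / e.sum()

def fuse(first_vec, memory, alpha=0.6):
    # Blend the relevance-weighted context vector into the first
    # semantic vector to produce the target semantic vector
    # (assumed blending rule, for illustration).
    if not memory:
        return first_vec
    w = relevance_weights(first_vec, memory)
    ctx = sum(wi * vi for wi, vi in zip(w, memory))
    fused = alpha * first_vec + (1 - alpha) * ctx
    return fused / np.linalg.norm(fused)
```

Because the weights favour stored turns that point in the same direction as the current vector, a turn that matches the child's current topic pulls the corrected vector toward it, while unrelated turns contribute little.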
In a first aspect, an embodiment of the present application provides a speech recognition method, comprising the steps of: acquiring voice data of a target child user; performing perceptual noise reduction processing on the voice data to obtain preprocessed voice data; performing acoustic feature extraction processing on the preprocessed voice data to obtain an acoustic feature vector; performing speech recognition processing on the acoustic feature vector through a preset speech recognition model to obtain a first semantic vector corresponding to the voice data; correcting the first semantic vector by a context fusion mechanism in combination with a second semantic vector in a context memory bank to generate a target semantic vector; and outputting a corresponding target speech recognition result based on the target semantic vector. Optionally, the performing acoustic feature extraction processing on the preprocessed voice data to obtain an acoustic feature vector includes: performing fundamental frequency extraction processing on the preprocessed voice data through a preset acoustic model to obtain a fundamental frequency feature; performing formant extraction processing on the preprocessed voice data to obtain a formant feature; determining an age interval of the target child user based on the fundamental frequency feature and the formant feature; dynamically adjusting a parameter configuration of the preset acoustic model based on the age interval to obtain a parameter-adjusted acoustic model that matches the vocal frequency response characteristics of the target child user; and processing the preprocessed voice data through the parameter-adjusted acoustic model to obtain the acoustic feature vector. Optionally, the performing, through the preset speech recognition model, speech recognition processing on the acoustic feature vector to obtain a first sem