CN-115731919-B - Speech recognition method, device, storage medium and electronic equipment


Abstract

A speech recognition method, a speech recognition device, a storage medium and an electronic device are provided. The method comprises: obtaining voice information to be recognized; performing voice activity segment detection on the voice information to be recognized to obtain a plurality of voice activity segments corresponding to each channel; determining noise information of each channel based on cross information, in the time dimension, of the voice activity segments corresponding to different channels; and performing speech recognition on the voice information of the corresponding channel according to the noise information of each channel to obtain a speech recognition result corresponding to the voice information to be recognized. The method can greatly improve the accuracy of speech recognition.

Inventors

XIE ZHIPENG
WAN GENSHUN
PAN JIA
GAO JIANQING
LIU CONG
HU GUOPING
HU YU

Assignees

iFLYTEK Co., Ltd. (科大讯飞股份有限公司)

Dates

Publication Date: 2026-05-12
Application Date: 2022-09-07

Claims (11)

1. A speech recognition method, the method comprising: acquiring voice information to be recognized, wherein the voice information to be recognized comprises voice information of at least two sound channels; performing voice activity segment detection on the voice information to be recognized to obtain a plurality of voice activity segments corresponding to each channel; determining noise information of each channel based on cross information, in the time dimension, of the voice activity segments corresponding to different channels, wherein, in the case that the voice information to be recognized is two-channel voice information, the cross information of the voice activity segments in the time dimension indicates a time period in which only one channel has a voice activity segment, a time period in which both channels have voice activity segments, and a time period in which neither channel has a voice activity segment; and performing speech recognition on the voice information of the corresponding channel according to the noise information of each channel to obtain a speech recognition result corresponding to the voice information to be recognized.
2. The method of claim 1, wherein performing voice activity segment detection on the voice information to be recognized to obtain a plurality of voice activity segments corresponding to each channel comprises: performing voice activity endpoint detection on the voice information to be recognized to obtain a plurality of groups of voice activity endpoints corresponding to each channel; and determining the plurality of voice activity segments corresponding to each channel according to the plurality of groups of voice activity endpoints.
3. The method of claim 1, wherein the voice information to be recognized comprises two-channel voice information, and wherein determining the noise information of each channel based on the cross information, in the time dimension, of the voice activity segments corresponding to different channels comprises: when a voice activity segment is detected in only one channel in a target time period, determining that the voice information contained, in the target time period, in the channel in which no voice activity segment is detected is noise information; and when voice activity segments are detected in both channels in the target time period, determining noise information based on voiceprint detection results of the two voice activity segments detected in the target time period.
4. The method of claim 3, wherein determining noise information based on voiceprint detection results of the two voice activity segments detected in the target time period comprises: performing voiceprint detection separately on the voice activity segments of the two channels detected in the target time period; and when the voiceprint of any target voice activity segment is detected not to belong to a preset voiceprint, determining the voice information corresponding to the target voice activity segment as noise information.
5. The method according to any one of claims 1 to 4, wherein performing speech recognition on the voice information of the corresponding channel according to the noise information of each channel to obtain a speech recognition result corresponding to the voice information to be recognized comprises: acquiring a preset speech recognition model; inputting the voice information and the corresponding noise information of each channel into the preset speech recognition model to perform speech recognition, so as to obtain a speech recognition result of each channel; and determining the speech recognition result corresponding to the voice information to be recognized according to the speech recognition result of each channel.
6. The method of claim 5, wherein the method further comprises: acquiring a noise pool corresponding to each channel; and updating the corresponding noise pool based on the noise information of each channel; wherein inputting the voice information and the corresponding noise information of each channel into the preset speech recognition model to perform speech recognition, so as to obtain a speech recognition result of each channel, comprises: inputting the voice information of each channel and the noise information in the corresponding updated noise pool into the preset speech recognition model to perform speech recognition, so as to obtain the speech recognition result of each channel.
7. The method of claim 5, further comprising, prior to acquiring the preset speech recognition model: acquiring second training sample data, wherein the second training sample data comprises a plurality of pieces of second sample audio data, and a text label and a noise label corresponding to each piece of second sample audio data; and training the preset speech recognition model by taking the plurality of pieces of second sample audio data as input and taking the text label and the noise label corresponding to each piece of second sample audio data as output.
8. A speech recognition apparatus, comprising: an acquisition module, configured to acquire voice information to be recognized, wherein the voice information to be recognized comprises voice information of at least two sound channels; a detection module, configured to perform voice activity segment detection on the voice information to be recognized to obtain a plurality of voice activity segments corresponding to each channel; a determining module, configured to determine noise information of each channel based on cross information, in the time dimension, of the voice activity segments corresponding to different channels, wherein, in the case that the voice information to be recognized is two-channel voice information, the cross information of the voice activity segments in the time dimension indicates a time period in which only one channel has a voice activity segment, a time period in which both channels have voice activity segments, and a time period in which neither channel has a voice activity segment; and a recognition module, configured to perform speech recognition on the voice information of the corresponding channel according to the noise information of each channel to obtain a speech recognition result corresponding to the voice information to be recognized.
9. A storage medium having stored thereon a computer program, which when loaded by a processor performs the steps of the speech recognition method according to any of claims 1-7.
10. An electronic device comprising a processor and a memory, the memory storing a computer program, characterized in that the processor is adapted to perform the steps of the speech recognition method according to any of claims 1 to 7 by loading the computer program.
11. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps in the speech recognition method of any one of claims 1 to 7.
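
For illustration only, the per-channel noise pool of claim 6 might be sketched in Python as below. The claims do not specify how a pool is stored, how large it is, or how old entries are discarded; the class name `NoisePool`, the helper `update_pools`, the fixed capacity `max_clips`, the drop-oldest policy, and the representation of a noise clip as a list of float samples are all assumptions made for this sketch.

```python
from collections import deque
from typing import Deque, Dict, List, Sequence


class NoisePool:
    """Hypothetical bounded store of recent noise clips for one channel."""

    def __init__(self, max_clips: int = 8) -> None:
        # Assumption: once the pool is full, the oldest clip is dropped.
        self._clips: Deque[List[float]] = deque(maxlen=max_clips)

    def update(self, clip: Sequence[float]) -> None:
        # Ignore empty clips so the pool only holds usable noise samples.
        if len(clip) > 0:
            self._clips.append(list(clip))

    def snapshot(self) -> List[List[float]]:
        # Return copies so callers cannot mutate the pool's contents.
        return [list(c) for c in self._clips]


def update_pools(
    pools: Dict[str, NoisePool],
    noise_by_channel: Dict[str, List[Sequence[float]]],
) -> None:
    """Add each channel's newly determined noise clips to that channel's pool."""
    for channel, clips in noise_by_channel.items():
        pool = pools.setdefault(channel, NoisePool())
        for clip in clips:
            pool.update(clip)
```

Under these assumptions, the contents returned by `snapshot()` for a channel would be passed, together with that channel's voice information, to whatever speech recognition model is used.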

Description

Speech recognition method, device, storage medium and electronic equipment

Technical Field

The present application relates to the field of speech recognition technologies, and in particular to a speech recognition method, a device, a storage medium, and an electronic apparatus.

Background

With the continuous development of artificial intelligence technology, products based on it keep emerging and entering people's lives, bringing great convenience to daily life. Among these technologies, speech recognition, also known as automatic speech recognition (ASR), is an important branch of artificial intelligence. Its goal is to convert the lexical content of human speech into computer-readable input such as keys, binary codes, or character sequences.

Currently, in some scenarios, for example multi-channel speech recognition, the speech recognition result is inaccurate.

Disclosure of Invention

The application provides a speech recognition method, a speech recognition device, a storage medium and electronic equipment, which can improve the accuracy of speech recognition.

The speech recognition method provided by the application comprises: acquiring voice information to be recognized, wherein the voice information to be recognized comprises voice information of at least two sound channels; performing voice activity segment detection on the voice information to be recognized to obtain a plurality of voice activity segments corresponding to each channel; determining noise information of each channel based on cross information, in the time dimension, of the voice activity segments corresponding to different channels; and performing speech recognition on the voice information of the corresponding channel according to the noise information of each channel to obtain a speech recognition result corresponding to the voice information to be recognized.

The speech recognition device provided by the application comprises: an acquisition module, configured to acquire voice information to be recognized, wherein the voice information to be recognized comprises voice information of at least two sound channels; a detection module, configured to perform voice activity segment detection on the voice information to be recognized to obtain a plurality of voice activity segments corresponding to each channel; a determining module, configured to determine noise information of each channel based on cross information, in the time dimension, of the voice activity segments corresponding to different channels; and a recognition module, configured to perform speech recognition on the voice information of the corresponding channel according to the noise information of each channel to obtain a speech recognition result corresponding to the voice information to be recognized.
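
As an illustration only, the cross information for the two-channel case can be sketched in Python as follows. The representation of a voice activity segment as a (start, end) pair in seconds, the function names `cross_information` and `single_channel_noise`, and the period labels are assumptions made for this sketch and are not taken from the application. The second function applies only the rule that, when a single channel is active, the other channel's audio in that period is treated as noise; periods in which both channels are active require voiceprint detection and are not handled here.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]      # (start, end) in seconds; assumed format
Period = Tuple[float, float, str]  # (start, end, label)


def _active(segments: List[Segment], t: float) -> bool:
    """True if time t falls inside any voice activity segment."""
    return any(start <= t < end for start, end in segments)


def cross_information(ch_a: List[Segment], ch_b: List[Segment],
                      total: float) -> List[Period]:
    """Split [0, total] into periods labelled by which channels are active."""
    total = float(total)
    # Every segment boundary is a point where the label may change.
    cuts = {0.0, total}
    for start, end in ch_a + ch_b:
        cuts.add(min(max(float(start), 0.0), total))
        cuts.add(min(max(float(end), 0.0), total))
    points = sorted(cuts)

    periods: List[Period] = []
    for left, right in zip(points, points[1:]):
        mid = (left + right) / 2  # activity is constant between two cuts
        a, b = _active(ch_a, mid), _active(ch_b, mid)
        if a and b:
            label = "both"
        elif a:
            label = "only_a"
        elif b:
            label = "only_b"
        else:
            label = "neither"
        # Merge adjacent periods that carry the same label.
        if periods and periods[-1][2] == label:
            periods[-1] = (periods[-1][0], right, label)
        else:
            periods.append((left, right, label))
    return periods


def single_channel_noise(periods: List[Period]) -> Dict[str, List[Segment]]:
    """Periods in which each channel's audio is treated as noise because
    only the other channel has a voice activity segment."""
    noise: Dict[str, List[Segment]] = {"a": [], "b": []}
    for start, end, label in periods:
        if label == "only_a":
            noise["b"].append((start, end))
        elif label == "only_b":
            noise["a"].append((start, end))
    return noise
```

For example, with channel A active over (0, 2) and (5, 6) seconds and channel B active over (1, 3) seconds in a 7-second recording, the sketch marks 1 to 2 seconds as a period in which both channels are active and 3 to 5 seconds as a period in which neither is.
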
The present application provides a storage medium having stored thereon a computer program which, when loaded by a processor, performs the steps of the speech recognition method provided by the present application.

The electronic equipment provided by the application comprises a processor and a memory, wherein the memory stores a computer program, and the processor is configured to execute the steps of the speech recognition method provided by the application by loading the computer program.

The present application provides a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the speech recognition method provided by the present application.

In the present application, voice information to be recognized is obtained; voice activity segment detection is performed on the voice information to be recognized to obtain a plurality of voice activity segments corresponding to each channel; noise information of each channel is determined based on cross information, in the time dimension, of the voice activity segments corresponding to different channels; and speech recognition is performed on the voice information of the corresponding channel according to the noise information of each channel to obtain a speech recognition result corresponding to the voice information to be recognized.

Compared with the related art, the application infers the course of the voice dialogue by detecting the voice activity segments of the multi-channel voice information to be recognized, then identifies the noise information of each channel based on those voice activity segments, and suppresses the noise information of each channel accordingly, so that a more accurate speech recognition result is obtained. The method can greatly improve the noise immunity of the speech recognition system, thereby improving the accuracy of speech recognition.
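
The text above does not say how voice activity segments are detected. Purely as a stand-in, the sketch below uses a simple frame-energy threshold, which is a swapped-in technique and not the detector of the application: it finds the start and end points of runs of energetic frames on each channel and returns them as segments. The function names `energy_vad` and `detect_per_channel`, the parameters `frame_len` and `threshold`, and the use of sample indices for endpoints are assumptions of this sketch.

```python
from typing import Dict, List, Sequence, Tuple


def energy_vad(samples: Sequence[float], frame_len: int,
               threshold: float) -> List[Tuple[int, int]]:
    """Return (start_sample, end_sample) pairs for runs of frames whose
    mean energy exceeds the threshold. Trailing partial frames are ignored."""
    segments: List[Tuple[int, int]] = []
    start = None
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > threshold and start is None:
            start = i * frame_len                     # start endpoint
        elif energy <= threshold and start is not None:
            segments.append((start, i * frame_len))   # end endpoint
            start = None
    if start is not None:
        # Speech runs to the end of the last full frame.
        segments.append((start, n_frames * frame_len))
    return segments


def detect_per_channel(channels: Dict[str, Sequence[float]],
                       frame_len: int = 4,
                       threshold: float = 0.01
                       ) -> Dict[str, List[Tuple[int, int]]]:
    """Run the detector independently on every channel."""
    return {name: energy_vad(samples, frame_len, threshold)
            for name, samples in channels.items()}
```

The default `frame_len` and `threshold` values are chosen only to keep the example small; a practical detector would use frames of tens of milliseconds and a tuned or learned decision rule.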