CN-121983063-A - Speech recognition method, apparatus, device, storage medium, and program product

CN 121983063 A

Abstract

The application discloses a speech recognition method, apparatus, device, storage medium, and program product in the technical field of speech recognition. The method comprises: determining a first speech recognition result corresponding to speech data; determining a second speech recognition result corresponding to lip video data; determining a first target weight for the first speech recognition result according to at least one of the ambient noise, ambient light, head motion vector, and lip visibility corresponding to the object undergoing speech recognition; determining a second target weight for the second speech recognition result according to at least one of the same factors; and fusing the first and second speech recognition results based on the first and second target weights. By dynamically determining the weights of the two sets of recognition results from the key factors that affect recognition reliability, such as ambient noise, illumination, head motion, and lip visibility, the accuracy of speech recognition is improved.

Inventors

  • Li Daifan
  • Pan Xuanhua
  • Li Tianhui
  • Qin Shuan
  • Li Huiling

Assignees

  • SAIC-GM-Wuling Automobile Co., Ltd. (上汽通用五菱汽车股份有限公司)

Dates

Publication Date
2026-05-05
Application Date
2026-01-16

Claims (10)

  1. A speech recognition method, comprising: recognizing speech data to obtain a first speech recognition result; recognizing lip video data corresponding to the speech data to obtain a second speech recognition result; determining a first target weight corresponding to the first speech recognition result according to at least one of ambient noise, ambient light, a head motion vector, and lip visibility corresponding to an object to be recognized; determining a second target weight corresponding to the second speech recognition result according to at least one of the ambient noise, the ambient light, the head motion vector, and the lip visibility; and fusing the first speech recognition result and the second speech recognition result based on the first target weight and the second target weight to determine a target speech recognition result.
  2. The method of claim 1, wherein determining the first target weight corresponding to the first speech recognition result according to at least one of the ambient noise, the ambient light, the head motion vector, and the lip visibility corresponding to the object to be recognized comprises: normalizing the ambient noise to determine a first value, wherein the ambient noise and the first value are negatively correlated; determining a first correction proportion according to the first value and a first influence coefficient of the ambient noise on the first speech recognition result; and determining the product of the first correction proportion and a first initial weight corresponding to the first speech recognition result as the first target weight.
  3. The method of claim 2, wherein determining the second target weight corresponding to the second speech recognition result according to at least one of the ambient noise, the ambient light, the head motion vector, and the lip visibility comprises: normalizing the ambient light, the head motion vector, and the lip visibility respectively to determine a second value corresponding to the ambient light, a third value corresponding to the head motion vector, and a fourth value corresponding to the lip visibility, wherein the head motion vector and the third value are negatively correlated, and the lip visibility and the fourth value are positively correlated; determining a second correction proportion according to the second value and a second influence coefficient of the ambient light on the second speech recognition result; determining a third correction proportion according to the third value and a third influence coefficient of the head motion vector on the second speech recognition result; determining a fourth correction proportion according to the fourth value and a fourth influence coefficient of the lip visibility on the second speech recognition result; determining the product of the second, third, and fourth correction proportions as a target correction proportion; and determining the product of the target correction proportion and a second initial weight corresponding to the second speech recognition result as the second target weight.
  4. The method of claim 1, wherein determining the first target weight corresponding to the first speech recognition result according to at least one of the ambient noise, the ambient light, the head motion vector, and the lip visibility corresponding to the object to be recognized comprises: determining a first sub-weight corresponding to the first speech recognition result according to the ambient noise; determining a second sub-weight corresponding to the first speech recognition result according to the ambient light; determining a third sub-weight corresponding to the first speech recognition result according to the head motion vector; determining a fourth sub-weight corresponding to the first speech recognition result according to the lip visibility; and fusing the first, second, third, and fourth sub-weights to determine the first target weight.
  5. The method of claim 4, wherein determining the second target weight corresponding to the second speech recognition result according to at least one of the ambient noise, the ambient light, the head motion vector, and the lip visibility comprises: determining a fifth sub-weight corresponding to the second speech recognition result according to the ambient noise; determining a sixth sub-weight corresponding to the second speech recognition result according to the ambient light; determining a seventh sub-weight corresponding to the second speech recognition result according to the head motion vector; determining an eighth sub-weight corresponding to the second speech recognition result according to the lip visibility; and fusing the fifth, sixth, seventh, and eighth sub-weights to determine the second target weight.
  6. The method of any one of claims 1 to 5, wherein the first speech recognition result includes a plurality of predicted recognition results and corresponding confidence levels, and recognizing the lip video data corresponding to the speech data to obtain the second speech recognition result comprises: recognizing the lip video data to obtain the second speech recognition result when the maximum confidence in the first speech recognition result is less than a confidence threshold.
  7. A speech recognition apparatus, characterized in that the apparatus comprises: a first recognition module configured to recognize speech data to obtain a first speech recognition result; a second recognition module configured to recognize lip video data corresponding to the speech data to obtain a second speech recognition result; a first determining module configured to determine a first target weight corresponding to the first speech recognition result according to at least one of ambient noise, ambient light, a head motion vector, and lip visibility corresponding to an object to be recognized; a second determining module configured to determine a second target weight corresponding to the second speech recognition result according to at least one of the ambient noise, the ambient light, the head motion vector, and the lip visibility; and a fusion module configured to fuse the first speech recognition result and the second speech recognition result based on the first target weight and the second target weight to determine a target speech recognition result.
  8. A speech recognition device, characterized in that the device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program being configured to implement the steps of the speech recognition method according to any one of claims 1 to 6.
  9. A storage medium, characterized in that the storage medium is a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the speech recognition method according to any one of claims 1 to 6.
  10. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the steps of the speech recognition method according to any one of claims 1 to 6.
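As a concrete illustration, the confidence-gated fusion described in claims 1 and 6 can be sketched in Python. The dictionary representation of hypotheses, the additive score fusion, and the default threshold value are illustrative assumptions, not the patent's exact formulation:

```python
def fuse_results(audio_hyps, lip_hyps, w_audio, w_lip, conf_threshold=0.8):
    """Fuse audio and lip-reading hypotheses by weighted confidence.

    audio_hyps / lip_hyps: dicts mapping a candidate transcription to
    its confidence. Per claim 6, lip reading is only consulted when the
    best audio confidence falls below the threshold.
    """
    if max(audio_hyps.values()) >= conf_threshold:
        # Audio alone is reliable enough; skip the lip-reading branch.
        return max(audio_hyps, key=audio_hyps.get)

    # Weighted fusion of the two result sets (claim 1).
    scores = {}
    for text, conf in audio_hyps.items():
        scores[text] = scores.get(text, 0.0) + w_audio * conf
    for text, conf in lip_hyps.items():
        scores[text] = scores.get(text, 0.0) + w_lip * conf
    return max(scores, key=scores.get)
```

For example, with a low-confidence audio result `{"hello": 0.4, "hallo": 0.35}` and a lip-reading result `{"hallo": 0.8}`, equal weights of 0.5 select "hallo", since its fused score exceeds that of "hello".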

Description

Speech recognition method, apparatus, device, storage medium, and program product

Technical Field

The present application relates to the field of speech recognition technology, and in particular to a speech recognition method, apparatus, device, storage medium, and program product.

Background

Speech recognition, one of the core technologies of human-machine interaction, is widely applied in fields such as intelligent assistants, autonomous driving, intelligent customer service, and medical records. Its core aim is to accurately convert human speech signals into text to enable efficient information interaction and processing. How to improve the accuracy of speech recognition has therefore become an important research direction.

Disclosure of Invention

The main objective of the present application is to provide a speech recognition method, apparatus, device, storage medium, and program product, which aim to solve the technical problem of how to improve the accuracy of speech recognition.
To achieve the above objective, the present application provides a speech recognition method, comprising: recognizing speech data to obtain a first speech recognition result; recognizing lip video data corresponding to the speech data to obtain a second speech recognition result; determining a first target weight corresponding to the first speech recognition result according to at least one of ambient noise, ambient light, a head motion vector, and lip visibility corresponding to an object to be recognized; determining a second target weight corresponding to the second speech recognition result according to at least one of the ambient noise, the ambient light, the head motion vector, and the lip visibility; and fusing the first speech recognition result and the second speech recognition result based on the first target weight and the second target weight to determine a target speech recognition result. In some embodiments, determining the first target weight comprises: normalizing the ambient noise to determine a first value, wherein the ambient noise and the first value are negatively correlated; determining a first correction proportion according to the first value and a first influence coefficient of the ambient noise on the first speech recognition result; and determining the product of the first correction proportion and a first initial weight corresponding to the first speech recognition result as the first target weight.
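The noise-dependent audio weight described above can be sketched as follows. The linear normalization range (30–90 dB) and the blend form of the correction proportion, `(1 - influence) + influence * value`, are illustrative assumptions; the patent only specifies that the value is negatively correlated with noise and that the target weight is the product of a correction proportion and an initial weight:

```python
def first_target_weight(noise_db, initial_weight, influence=0.7,
                        noise_min=30.0, noise_max=90.0):
    """Audio-branch weight: louder ambient noise yields a smaller weight."""
    # Clamp, then normalize so noise and the value are negatively correlated.
    clamped = min(max(noise_db, noise_min), noise_max)
    value = 1.0 - (clamped - noise_min) / (noise_max - noise_min)  # in [0, 1]
    # Correction proportion blends the value by its influence coefficient.
    correction = (1.0 - influence) + influence * value
    return correction * initial_weight
```

With these assumed parameters, a quiet 30 dB environment leaves the initial weight unchanged, while a 90 dB environment scales it down to 30% of its initial value.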
In some embodiments, determining the second target weight corresponding to the second speech recognition result according to at least one of the ambient noise, the ambient light, the head motion vector, and the lip visibility comprises: normalizing the ambient light, the head motion vector, and the lip visibility respectively to determine a second value corresponding to the ambient light, a third value corresponding to the head motion vector, and a fourth value corresponding to the lip visibility, wherein the head motion vector and the third value are negatively correlated, and the lip visibility and the fourth value are positively correlated; determining a second correction proportion according to the second value and a second influence coefficient of the ambient light on the second speech recognition result; determining a third correction proportion according to the third value and a third influence coefficient of the head motion vector on the second speech recognition result; determining a fourth correction proportion according to the fourth value and a fourth influence coefficient of the lip visibility on the second speech recognition result; determining the product of the second, third, and fourth correction proportions as a target correction proportion; and determining the product of the target correction proportion and a second initial weight corresponding to the second speech recognition result as the second target weight.
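The lip-reading weight above multiplies three per-factor correction proportions. A sketch under the same assumed blend form as before, with the influence coefficients chosen arbitrarily for illustration; the inputs are assumed already normalized to [0, 1]:

```python
def second_target_weight(light, motion, visibility, initial_weight,
                         coeffs=(0.5, 0.6, 0.8)):
    """Lip-reading-branch weight from light, head motion, and lip visibility.

    light, motion, visibility: values already normalized to [0, 1].
    Head motion is mapped negatively (more motion -> lower value),
    light and visibility positively, matching the stated correlations.
    """
    values = (light, 1.0 - motion, visibility)
    correction = 1.0
    for coeff, value in zip(coeffs, values):
        # Each factor contributes its own correction proportion.
        correction *= (1.0 - coeff) + coeff * value
    return correction * initial_weight
```

Under ideal conditions (bright light, still head, fully visible lips) the initial weight passes through unchanged; as any factor degrades, its correction proportion shrinks the weight multiplicatively.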
In some embodiments, determining the first target weight corresponding to the first speech recognition result according to at least one of the ambient noise, the ambient light, the head motion vector, and the lip visibility corresponding to the object to be recognized comprises: determining a first sub-weight corresponding to the first speech recognition result according to the ambient noise; determining a second sub-weight corresponding to the first speech recognition result according to the ambient light; determining a third sub-weight corresponding to the first speech recognition result according to the head motion vector; determining a fourth sub-weight corresponding