CN-122024717-A - Speech recognition method, apparatus, device, storage medium, and program product

Abstract

The application discloses a speech recognition method, apparatus, device, storage medium, and program product, relating to the technical field of speech processing. The method comprises: in response to a user voice instruction containing a demonstrative pronoun, determining the gaze-directed area of the user at the time the user voice instruction is issued; in response to the gaze-directed area comprising at least two devices, determining the gaze intention of the user according to video data of the user at the time the user voice instruction is issued; in response to the gaze intention being intentional gazing at a device in the gaze-directed area, determining a target device from the at least two devices according to the video data; and replacing the demonstrative pronoun in the user voice instruction with the target device. The target device corresponding to the demonstrative pronoun can thus be determined accurately, the demonstrative pronoun can be replaced accurately, and accurate execution of the user voice instruction is ensured.

Inventors

LI DAIFAN
PAN XUANHUA
LI TIANHUI
QIN SHUAN
LI HUILING

Assignees

SAIC-GM-Wuling Automobile Co., Ltd. (上汽通用五菱汽车股份有限公司)

Dates

Publication Date: 2026-05-12
Application Date: 2026-01-16

Claims (10)

1. A speech recognition method, the method comprising: in response to a user voice instruction containing a demonstrative pronoun, determining a gaze-directed area of the user at the time the user voice instruction is issued; in response to the gaze-directed area comprising at least two devices, determining a gaze intention of the user according to video data of the user at the time the user voice instruction is issued; in response to the gaze intention being intentional gazing at a device in the gaze-directed area, determining a target device from the at least two devices according to the video data; and replacing the demonstrative pronoun in the user voice instruction with the target device.
2. The method of claim 1, wherein determining the gaze intention of the user according to video data of the user at the time the user voice instruction is issued comprises: determining target features according to the video data, wherein the target features comprise at least one of gaze point continuity, a pupil diameter change feature, a number of microsaccades, a number of blinks in a preset time period from a starting moment in the video data, and a head pose change; determining a target score according to the target features; and determining that the gaze intention is intentional gazing at a device in the gaze-directed area if the target score is greater than or equal to a score threshold, or determining that the gaze intention is not intentional gazing at a device in the gaze-directed area if the target score is less than the score threshold.
3. The method of claim 2, wherein determining a target score according to the target features comprises: determining a first score according to the gaze point continuity, wherein the first score is positively correlated with the gaze point continuity; determining a second score according to the pupil diameter change feature, wherein the second score is negatively correlated with the pupil diameter change feature; determining a third score according to the number of microsaccades, wherein the third score is negatively correlated with the number of microsaccades; determining a fourth score according to the number of blinks, wherein the fourth score is negatively correlated with the number of blinks; determining a fifth score according to the head pose change, wherein the fifth score is negatively correlated with the head pose change; and determining the target score according to at least one of the first score, the second score, the third score, the fourth score, and the fifth score.
4. The method of claim 1, wherein horizontally adjacent devices are present in the gaze-directed area, and wherein determining a target device from the at least two devices according to the video data comprises any one of: determining a first motion feature of the user's line of sight in the horizontal direction according to the video data, and determining the target device from the horizontally adjacent devices according to the first motion feature; or determining gaze points of the user according to the video data, and determining the target device from the horizontally adjacent devices according to the aggregation degree of the gaze points on each of the horizontally adjacent devices.
5. The method of claim 1, wherein vertically adjacent devices are present in the gaze-directed area, and wherein determining a target device from the at least two devices according to the video data comprises any one of: determining a second motion feature of the user's line of sight in the vertical direction according to the video data, and determining the target device from the vertically adjacent devices according to the second motion feature; or determining a set of gaze points of the user on each device according to the video data, and determining the target device from the vertically adjacent devices according to the distribution variance, in the vertical direction, of the set of gaze points corresponding to each device.
6. The method of claim 1, wherein spatially overlapping devices are present in the gaze-directed area, and wherein determining a target device from the at least two devices according to the video data comprises any one of: determining depth information corresponding to a gaze point of the user according to the video data, and determining the target device from the spatially overlapping devices according to the depth information; or determining a gaze duration of the user according to the video data, and determining the target device from the spatially overlapping devices according to the difference between the duration and a duration threshold corresponding to each device.
7. A speech recognition apparatus, characterized in that the speech recognition apparatus comprises: a first determining module, configured to determine, in response to a user voice instruction containing a demonstrative pronoun, a gaze-directed area of the user at the time the user voice instruction is issued; a second determining module, configured to determine, in response to the gaze-directed area comprising at least two devices, a gaze intention of the user according to video data of the user at the time the user voice instruction is issued; a third determining module, configured to determine, in response to the gaze intention being intentional gazing at a device in the gaze-directed area, a target device from the at least two devices according to the video data; and a replacing module, configured to replace the demonstrative pronoun in the user voice instruction with the target device.
8. A speech recognition device, characterized in that the device comprises a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program being configured to implement the steps of the speech recognition method according to any one of claims 1 to 6.
9. A storage medium, characterized in that the storage medium is a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the speech recognition method according to any one of claims 1 to 6.
10. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the steps of the speech recognition method according to any one of claims 1 to 6.
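
Read together, the steps of claim 1 can be sketched as the following Python control flow. This is only an illustrative sketch, not the applicant's implementation: the English pronoun list, the function and parameter names, and the behaviour in cases claim 1 does not address (no pronoun, fewer than two devices, unintentional gaze) are all assumptions.

```python
import re
from typing import Callable, Optional, Sequence

# Assumed English stand-ins for the demonstrative pronouns ("this", "that").
DEMONSTRATIVE = re.compile(r"\b(this|that)\b", re.IGNORECASE)


def resolve_instruction(
    instruction: str,
    devices_in_gaze_area: Sequence[str],
    gaze_is_intentional: bool,
    select_target: Callable[[Sequence[str]], str],
) -> Optional[str]:
    """Return the instruction with its pronoun replaced, or None if unresolved."""
    if DEMONSTRATIVE.search(instruction) is None:
        # No demonstrative pronoun: nothing to resolve (assumed behaviour).
        return instruction
    if len(devices_in_gaze_area) < 2:
        # Claim 1 only addresses areas containing at least two devices.
        return None
    if not gaze_is_intentional:
        # Claim 1 does not say what happens for unintentional gaze.
        return None
    # Hypothetical hook for the video-based selection of claims 4 to 6.
    target = select_target(devices_in_gaze_area)
    # Replace the first demonstrative pronoun with the target device name.
    return DEMONSTRATIVE.sub(lambda _match: target, instruction, count=1)
```

For example, `resolve_instruction("Open that", ["sunroof", "reading light"], True, lambda d: d[0])` yields `"Open sunroof"`, where the lambda stands in for whichever selection strategy is used.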

Description

Speech recognition method, apparatus, device, storage medium, and program product

Technical Field

The present application relates to the field of speech processing technology, and in particular to a speech recognition method, apparatus, device, storage medium, and program product.

Background

With the development of intelligent connected vehicles, in-vehicle speech recognition and in-cabin camera perception have gradually become important components of driving assistance and human-machine interaction. In the related art, a user in a vehicle can control vehicle components through voice commands, for example adjusting the air-conditioning temperature or opening the sunroof. However, when a user voice command refers to a device with a demonstrative pronoun such as "this" or "that", it is difficult to determine precisely which component the command is intended to control, so the command cannot be executed accurately.

Disclosure of Invention

The main object of the present application is to provide a speech recognition method, apparatus, device, storage medium, and program product, aiming to solve the technical problem that it is difficult to accurately determine the target component that a user voice command is intended to control when the command contains a demonstrative pronoun.
To achieve the above object, the present application provides a speech recognition method, which includes:
in response to a user voice instruction containing a demonstrative pronoun, determining a gaze-directed area of the user at the time the user voice instruction is issued;
in response to the gaze-directed area comprising at least two devices, determining a gaze intention of the user according to video data of the user at the time the user voice instruction is issued;
in response to the gaze intention being intentional gazing at a device in the gaze-directed area, determining a target device from the at least two devices according to the video data; and
replacing the demonstrative pronoun in the user voice instruction with the target device.
In some embodiments, determining the gaze intention of the user according to video data of the user at the time the user voice instruction is issued comprises:
determining target features according to the video data, wherein the target features comprise at least one of gaze point continuity, a pupil diameter change feature, a number of microsaccades, a number of blinks in a preset time period from a starting moment in the video data, and a head pose change;
determining a target score according to the target features; and
determining that the gaze intention is intentional gazing at a device in the gaze-directed area if the target score is greater than or equal to a score threshold, or determining that the gaze intention is not intentional gazing at a device in the gaze-directed area if the target score is less than the score threshold.
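
The feature-to-score-to-threshold decision above can be illustrated with a short Python sketch. The text fixes only the direction of each correlation (per claim 3) and the comparison against a score threshold; the linear sub-scores, the normalisation constants, the averaging rule, the default threshold of 0.6, and all names below are assumptions chosen for illustration.

```python
from typing import List, Optional

# Assumed normalisation constants; the source gives no numeric values.
MAX_MICROSACCADES = 10
MAX_BLINKS = 5


def _clamp01(value: float) -> float:
    """Clamp a value into the range [0, 1]."""
    return max(0.0, min(1.0, value))


def target_score(
    *,
    continuity: Optional[float] = None,        # assumed normalised to [0, 1]
    pupil_change: Optional[float] = None,      # assumed normalised to [0, 1]
    microsaccades: Optional[int] = None,       # count in the observed window
    blinks: Optional[int] = None,              # count in the preset time period
    head_pose_change: Optional[float] = None,  # assumed normalised to [0, 1]
) -> float:
    """Average the sub-scores of whichever features are available."""
    scores: List[float] = []
    if continuity is not None:
        scores.append(_clamp01(continuity))                # positive correlation
    if pupil_change is not None:
        scores.append(1.0 - _clamp01(pupil_change))        # negative correlation
    if microsaccades is not None:
        scores.append(1.0 - _clamp01(microsaccades / MAX_MICROSACCADES))
    if blinks is not None:
        scores.append(1.0 - _clamp01(blinks / MAX_BLINKS))
    if head_pose_change is not None:
        scores.append(1.0 - _clamp01(head_pose_change))
    if not scores:
        raise ValueError("at least one target feature is required")
    return sum(scores) / len(scores)


def is_intentional_gaze(score: float, threshold: float = 0.6) -> bool:
    """Intentional if the target score is greater than or equal to the threshold."""
    return score >= threshold
```

Under these assumptions a steady, low-movement gaze scores higher than a brief glance, and the `>=` comparison mirrors the "greater than or equal to" wording of the threshold test.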
In some embodiments, determining a target score according to the target features comprises:
determining a first score according to the gaze point continuity, wherein the first score is positively correlated with the gaze point continuity;
determining a second score according to the pupil diameter change feature, wherein the second score is negatively correlated with the pupil diameter change feature;
determining a third score according to the number of microsaccades, wherein the third score is negatively correlated with the number of microsaccades;
determining a fourth score according to the number of blinks, wherein the fourth score is negatively correlated with the number of blinks;
determining a fifth score according to the head pose change, wherein the fifth score is negatively correlated with the head pose change; and
determining the target score according to at least one of the first score, the second score, the third score, the fourth score, and the fifth score.
In some embodiments, horizontally adjacent devices are present in the gaze-directed area, and determining a target device from the at least two devices according to the video data comprises any one of:
determining a first motion feature of the user's line of sight in the horizontal direction according to the video data, and determining the target device from the horizontally adjacent devices according to the first motion feature; or
determining gaze points of the user according to the video data, and determining the target device from the horizontally adjacent devices according to the aggregation degree of the gaze points on each of the horizontally adjacent devices.
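
The gaze-point aggregation alternative for horizontally adjacent devices can be sketched as follows. The text does not define how the aggregation degree is measured or how it maps to the chosen device. This sketch assumes it is the mean distance of a device's gaze points from their centroid, and that the device with the tightest cluster is selected; the device names and coordinates are hypothetical.

```python
from math import hypot
from statistics import mean
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) gaze point in an assumed cabin coordinate frame


def spread(points: List[Point]) -> float:
    """Mean distance of gaze points from their centroid (smaller = more aggregated)."""
    cx = mean(x for x, _ in points)
    cy = mean(y for _, y in points)
    return mean(hypot(x - cx, y - cy) for x, y in points)


def pick_most_aggregated(points_by_device: Dict[str, List[Point]]) -> str:
    """Choose the device whose gaze points cluster most tightly.

    Assumes every device has at least one gaze point assigned to it.
    """
    return min(points_by_device, key=lambda name: spread(points_by_device[name]))
```

A tight cluster is taken here as a sign of deliberate fixation on one device, while scattered points suggest the gaze merely passed over its neighbour. A different measure, such as point count or density, would be an equally valid reading of "aggregation degree".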
In some embodiments, vertically adjacent devices are present in the gaze-directed area, and determining a target device from the at least two devices according to the video data comprises any one of: determining a second motion feature of the user's line of sight in the vertical direction according to the video data, and determining the t