EP-4002197-B1 - SIGN LANGUAGE RECOGNITION METHOD AND APPARATUS, COMPUTER-READABLE STORAGE MEDIUM, AND COMPUTER DEVICE
Inventors
- YANG, Zhaoyang
- SHEN, Xiaoyong
- TAI, Yuwing
- JIA, Jiaya
Dates
- Publication Date
- 20260506
- Application Date
- 20200624
Claims (12)
- A gesture language recognition method, applicable to a terminal, comprising: obtaining (S202) a to-be-recognized gesture language video; extracting (S204) a gesture feature from each image frame in the gesture language video by using a two-dimensional convolutional neural network model, to extract the gesture feature in each frame, for which a window of three frames is used; extracting (S206) a gesture change feature from each image frame in the gesture language video by using a three-dimensional convolutional neural network model, to extract the gesture change feature in each frame, for which a window of three frames is used; extracting (S208) gesture language word information from a fused feature obtained by fusing the gesture feature and the gesture change feature, comprising converting the fused feature obtained by fusing the gesture feature and the gesture change feature into a feature vector; combining feature vectors, which correspond to a plurality of consecutive image frames, among feature vectors obtained through conversion to obtain a feature vector group; and extracting the gesture language word information from the feature vector group by using a long short-term memory network, wherein the fusing the gesture feature and the gesture change feature includes summing up the gesture feature and the gesture change feature and averaging a result of the summing to obtain the fused feature; and combining (S210), by using a bidirectional long short-term memory network, the gesture language word information with other pieces of language word information into a gesture language sentence.
- The method according to claim 1, wherein the obtaining (S202) a to-be-recognized gesture language video comprises: filming a target object in an environment; detecting a waiting time period before the target object switches to a next gesture in a real-time manner during filming in a case that a video obtained by filming the target object comprises a human face feature and the gesture feature; and using the obtained video as the to-be-recognized gesture language video in a case that the waiting time period meets a pre-set condition.
- The method according to claim 2, further comprising: saving, in a case that the waiting time period does not meet the pre-set condition, the video obtained by filming the target object, and returning to the operation of detecting a waiting time period before the target object switches to a next gesture in a real-time manner during filming, until the waiting time period meets the pre-set condition; and using a current video obtained by filming the target object and the saved video as the to-be-recognized gesture language video.
- The method according to claim 1, further comprising: detecting, in a case that a video obtained by filming a target object comprises a human face feature and the gesture feature, the gesture feature of the target object in a real-time manner during filming; using the obtained video as the to-be-recognized gesture language video in a case that the detected gesture feature meets a gesture end point condition; saving the filmed video in a case that the detected gesture feature does not meet the gesture end point condition, and performing the operation of detecting the gesture feature of the target object in a real-time manner during filming, until the gesture feature meets the gesture end point condition; and using a current video obtained by filming the target object and the saved video as the to-be-recognized gesture language video.
- The method according to claim 1, wherein the terminal performs convolution on the fused feature obtained after pooling by using a two-dimensional convolutional neural network model in a second feature extraction unit to obtain a gesture feature, the two-dimensional convolutional neural network model having a convolution kernel size of 3×3, a stride of 1, and a channel quantity of 128; the terminal performs convolution on the fused feature obtained after the pooling by using a three-dimensional convolutional neural network model in the second feature extraction unit to obtain a gesture change feature, the three-dimensional convolutional neural network model having a convolution kernel size of 3×3×3, a stride of 1, and a channel quantity of 128; the terminal averages a sum of the gesture feature outputted by the two-dimensional convolutional neural network model in the second feature extraction unit and the gesture change feature outputted by the three-dimensional convolutional neural network model in the second feature extraction unit to obtain a fused feature of a second fusion; and then the terminal performs convolution on the fused feature of the second fusion by using the two-dimensional convolutional neural network model with a convolution kernel size of 1×1, a stride of 1, and a channel quantity of 128, performs pooling by using a max pooling layer, and uses a fused feature obtained after the pooling as an input of a third feature extraction unit.
- The method according to claim 1, wherein the converting the fused feature obtained by fusing the gesture feature and the gesture change feature into a feature vector comprises: performing convolution on the fused feature obtained by fusing the gesture feature and the gesture change feature; and performing global average pooling on the fused feature after the convolution, to obtain the feature vector corresponding to each image frame in the gesture language video.
- The method according to claim 1, further comprising: displaying prompt information on a displayed gesture language recognition operation page in a case that a new gesture language sentence is synthesized; adjusting, in a process that a historical gesture language sentence is moved from a first position to a second position on the gesture language recognition operation page, a presentation manner of the historical gesture language sentence, the historical gesture language sentence being a gesture language sentence synthesized before the new gesture language sentence is synthesized; and displaying the new gesture language sentence at the first position in a target presentation manner different from the presentation manner.
- The method according to claim 1, wherein a regularization term is introduced into a loss function of the long short-term memory network; and the regularization term is: $L_1 = -\sum_{n=1}^{N} P_{o,n} \log \frac{P_{o,n}}{P_{c,n}}$, wherein $N$ is a total quantity of vocabulary, $P_{o,n}$ is a probability of occurrence of an n-th word that is predicted during classification according to a sentence feature, and $P_{c,n}$ is a probability of occurrence of the n-th word that is determined according to a word feature.
- The method according to claim 8, wherein the bidirectional long short-term memory network adopts a connectionist temporal classification loss function; and the connectionist temporal classification loss function is configured to mark a gesture language word corresponding to an image frame comprising no gesture language word information as a null character, and delete the null character during synthesis of the gesture language sentence.
- The method according to claim 1, wherein the pre-set condition is a time period threshold, and the waiting time period meets the pre-set condition in a case that the waiting time period is greater than or equal to the time period threshold.
- A computer-readable storage medium, storing a computer program, the computer program, when executed by a processor, causing the processor to perform operations of the method according to any one of claims 1 to 11.
- A computer device, comprising a memory and a processor, the memory storing a computer program, the computer program, when executed by the processor, causing the processor to perform operations of the method according to any one of claims 1 to 11.
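The fusion and vectorization steps of claim 1 and claim 6 (sum the 2D-CNN gesture feature and the 3D-CNN gesture change feature, average the result, then apply global average pooling to obtain a per-frame feature vector, and group vectors of consecutive frames) can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation; all shapes, array values, and function names are assumptions chosen for demonstration.

```python
import numpy as np

# Hypothetical sketch of claim 1's fusion step: the gesture feature (2D CNN)
# and the gesture change feature (3D CNN) are summed and averaged element-wise.
def fuse_features(gesture_feat, change_feat):
    """Element-wise average of the two feature maps (sum, then divide by 2)."""
    return (gesture_feat + change_feat) / 2.0

# Claim 6: global average pooling collapses the spatial dimensions (H, W),
# leaving one feature vector per frame.
def global_average_pool(feature_map):
    return feature_map.mean(axis=(-2, -1))

# Toy example: 4 frames, 128 channels, 7x7 spatial maps (illustrative sizes).
frames, channels = 4, 128
gesture = np.ones((frames, channels, 7, 7))
change = np.full((frames, channels, 7, 7), 3.0)

fused = fuse_features(gesture, change)    # shape (4, 128, 7, 7), values 2.0
vectors = global_average_pool(fused)      # shape (4, 128), one vector per frame

# Group feature vectors of consecutive frames (a window of 3 frames, as in
# claim 1) into feature vector groups for the long short-term memory network.
window = 3
groups = [vectors[i:i + window] for i in range(frames - window + 1)]
```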
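The regularization term of claim 8 can be evaluated directly from the two word-probability distributions. A minimal sketch, assuming `p_o` and `p_c` hold the probabilities $P_{o,n}$ (predicted from the sentence feature) and $P_{c,n}$ (determined from the word feature) over the $N$-word vocabulary; the arrays and the epsilon safeguard are illustrative, not taken from the patent.

```python
import numpy as np

# L1 = -sum_{n=1}^{N} P_{o,n} * log(P_{o,n} / P_{c,n})  (claim 8)
def regularization_term(p_o, p_c, eps=1e-12):
    """Negative weighted log-ratio between the two word distributions.

    eps guards against division by zero / log(0) for zero-probability words.
    """
    p_o = np.asarray(p_o, dtype=float)
    p_c = np.asarray(p_c, dtype=float)
    return -np.sum(p_o * np.log((p_o + eps) / (p_c + eps)))

# When both distributions agree, the term is zero.
p = np.array([0.5, 0.3, 0.2])
l1_equal = regularization_term(p, p)

# When they differ, the term equals the negative Kullback-Leibler divergence
# -KL(P_o || P_c), which is strictly negative for distinct distributions.
q = np.array([0.2, 0.3, 0.5])
l1_diff = regularization_term(p, q)
```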
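The null-character handling of claim 9 follows the standard connectionist temporal classification (CTC) decoding convention: frames carrying no gesture language word information are labeled with a blank character, and during sentence synthesis repeated labels are merged and blanks are deleted. A small sketch of that collapse step; the `BLANK` marker and the label values are illustrative assumptions.

```python
# Claim 9 (sketch): per-frame labels are collapsed into a word sequence by
# merging consecutive duplicates and then deleting the blank ("null") label.
BLANK = "<null>"

def collapse_ctc_labels(frame_labels):
    """Merge consecutive duplicate labels, then drop blank labels."""
    sentence = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            sentence.append(label)
        prev = label
    return sentence

# Per-frame labels for a short clip; blanks separate repeated words.
labels = [BLANK, "I", "I", BLANK, "go", "go", BLANK, BLANK, "home", BLANK]
words = collapse_ctc_labels(labels)  # ["I", "go", "home"]
```

Note that a blank between two identical labels keeps them as separate words, which is why the blank is deleted only after duplicates are merged.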
Description
This application claims priority to Chinese Patent Application No. 2019106501590, filed with the National Intellectual Property Administration, PRC on July 18, 2019 and entitled "SIGN LANGUAGE RECOGNITION METHOD AND APPARATUS, COMPUTER-READABLE STORAGE MEDIUM, AND COMPUTER DEVICE".

FIELD
The present disclosure relates to the field of computer technologies, and in particular, to a gesture language recognition method, a gesture language recognition apparatus, a computer-readable storage medium, and a computer device.

BACKGROUND
For people with hearing impairment, gesture language is a common natural language for expressing thoughts to others. However, ordinary people know little about gesture language, which makes it difficult for them to communicate with people with hearing impairment. The emergence of gesture language recognition technology is therefore of great significance, as it can promote communication between ordinary people and those with hearing impairment. Directly recognizing a series of continuous gesture language expressions as words is challenging. In a conventional gesture language recognition solution, a bracelet or glove with sensors is used to obtain information such as distances and muscle activities, from which gesture language recognition is carried out. However, such solutions achieve only low recognition accuracy.

ZHU GUANGMING ET AL: "Continuous Gesture Segmentation and Recognition Using 3DCNN and Convolutional LSTM", IEEE TRANSACTIONS ON MULTIMEDIA, IEEE, USA, vol. 21, no. 4, 1 April 2019, pages 1011-1021, concerns a deep architecture for continuous gesture recognition.
SHUO WANG ET AL: "Connectionist Temporal Fusion for Sign Language Translation", MULTIMEDIA, ACM, 2 Penn Plaza, Suite 701, New York, NY 10121-0701, USA, 15 October 2018, pages 1483-1491, concerns a hybrid deep architecture which consists of a temporal convolution module (TCOV), a bidirectional gated recurrent unit module (BGRU), and a fusion layer module (FL) to address the continuous sign language translation (CSLT) problem.

SUMMARY
The invention is defined by the appended claims. Details of one or more embodiments of the present disclosure are provided in the subsequent accompanying drawings and descriptions. Other features and advantages of the present disclosure become obvious with reference to the specification, the accompanying drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram of an application environment of a gesture language recognition method according to an embodiment.
FIG. 2 is a flowchart of a gesture language recognition method according to an embodiment.
FIG. 3 is a schematic diagram of a gesture language recognition page according to an embodiment.
FIG. 4 is a schematic diagram of human face feature points according to an embodiment.
FIG. 5 is a schematic diagram of two-dimensional convolution and three-dimensional convolution according to an embodiment.
FIG. 6 is a schematic structural diagram of a feature extraction unit according to an embodiment.
FIG. 7 is a flowchart of a step of extracting gesture language word information according to an embodiment.
FIG. 8 is a flowchart of a step of displaying prompt information in a case that a new gesture language sentence is synthesized and displaying the new gesture language sentence in a preset presentation manner according to an embodiment.
FIG. 9 is a block diagram showing a structure of a machine learning model according to an embodiment.
FIG. 10 is a flowchart of a gesture language recognition method according to another embodiment.
FIG. 11 is a block diagram showing a structure of a gesture language recognition apparatus according to an embodiment.
FIG. 12 is a structural block diagram of a gesture language recognition apparatus according to another embodiment.
FIG. 13 is a block diagram showing the structure of a computer device according to an embodiment.

DESCRIPTION OF EMBODIMENTS
To make the objectives, technical solutions, and advantages of the present disclosure clearer and more comprehensible, the present disclosure is further elaborated in detail with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein are merely used for explaining the present disclosure and are not intended to limit the present disclosure.

FIG. 1 is a diagram of an application environment of a gesture language recognition method according to an embodiment. Referring to FIG. 1, the gesture language recognition method is applied to a gesture language recognition system. The gesture language recognition system includes a terminal 110 and a server 120, which are connected to each other through a network. The gesture language recognition method may be performed by the terminal 110, or may be performed by the terminal 110 in cooperation with the server 120. When the method is performed by the terminal 110,