KR-20260064997-A - SIGN LANGUAGE VIDEO TRANSLATION SYSTEM AND METHOD OF TRANSLATING SIGN LANGUAGE VIDEO USING THE SAME

Abstract

The sign language video translation system includes: a feature point extraction module that extracts user feature points from a target video; a feature point preprocessing module that normalizes body feature points among the user feature points, normalizes hand feature points among the body feature points, and generates preprocessed feature points by restoring, through interpolation, hand feature points for frames in which hand feature points were not extracted; a storage module that stores the preprocessed feature points; and an artificial intelligence neural network that recognizes and learns from the feature points stored in the storage module.
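The normalization step summarized above can be illustrated with a short, hedged sketch. The patent references head-center and neck feature points but does not reproduce its normalization formula in this text, so the choice of centering on the neck and scaling by the neck length below, as well as the function name `normalize_body_keypoints`, are assumptions for illustration only, written in Python with NumPy.

```python
import numpy as np

def normalize_body_keypoints(points, head_center, neck):
    """Hypothetical sketch of body feature point normalization.

    Centers an (N, 2) array of keypoint positions on the neck and
    scales by the neck length (the distance from the head center to
    the neck), so that users of different sizes and positions become
    comparable. The patent's actual formula is not reproduced here.
    """
    head_center = np.asarray(head_center, dtype=float)
    neck = np.asarray(neck, dtype=float)
    neck_length = np.linalg.norm(head_center - neck)
    if neck_length == 0.0:
        raise ValueError("head center and neck coincide; cannot scale")
    return (np.asarray(points, dtype=float) - neck) / neck_length
```

For example, with `head_center = (0, 2)` and `neck = (0, 0)`, the point `(1, 1)` maps to `(0.5, 0.5)`.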

Inventors

  • 박종철
  • 황의준
  • 조석민
  • 노경근
  • 이희제

Assignees

  • 한국과학기술원 (KAIST, Korea Advanced Institute of Science and Technology)

Dates

Publication Date
2026-05-08
Application Date
2024-10-30

Claims (20)

  1. A sign language video translation system comprising: a feature point extraction module that extracts user feature points from a target video; a feature point preprocessing module that normalizes body feature points among the user feature points, normalizes hand feature points among the body feature points, and generates preprocessed feature points by recovering, through interpolation, hand feature points for frames in which hand feature points were not extracted; a storage module that stores the preprocessed feature points; and an artificial intelligence neural network that recognizes and learns from the feature points stored in the storage module.
  2. The sign language video translation system of claim 1, wherein the feature point extraction module generates, frame by frame, an image representing the user's feature points from the target video, and the feature point preprocessing module positions the hand feature points at the center of the image.
  3. The sign language video translation system of claim 2, wherein the feature point preprocessing module positions a central body feature point at the center of the image.
  4. The sign language video translation system of claim 3, wherein the feature point preprocessing module scales the body feature points using the length of the neck.
  5. The sign language video translation system of claim 1, wherein the feature point preprocessing module normalizes the body feature points based on the following [Mathematical Formula], in which the terms denote, respectively, the positions of the normalized body feature points, the position of the head-center feature point, the position of the neck feature point, and the positions of the body feature points before normalization. [Mathematical Formula]
  6. The sign language video translation system of claim 1, wherein the feature point preprocessing module recovers the hand feature points through bilinear interpolation, using the hand feature points of a frame, among the frames preceding the frame in which the hand feature points were not extracted, in which the hand feature points were extracted, and the hand feature points of a frame, among the frames following the frame in which the hand feature points were not extracted, in which the hand feature points were extracted.
  7. The sign language video translation system of claim 6, wherein, among the frames preceding the frame in which the hand feature points were not extracted, the frame used is the nearest such frame in which the hand feature points were extracted (the frame with the largest such index), and, among the frames following the frame in which the hand feature points were not extracted, the frame used is the nearest such frame in which the hand feature points were extracted (the frame with the smallest such index), the feature point preprocessing module utilizing the hand feature points of these two frames.
  8. The sign language video translation system of claim 6, wherein the feature point preprocessing module recovers the hand feature points for the first frame of the target video by averaging the hand feature points extracted from the first frames of the videos other than the target video.
  9. The sign language video translation system of claim 8, wherein the feature point preprocessing module recovers the hand feature points for the last frame of the target video by averaging the hand feature points extracted from the last frames of the videos other than the target video.
  10. The sign language video translation system of claim 9, wherein the feature point preprocessing module recovers the hand feature points for all frames of the target video by recovering the hand feature points for the first frame, recovering the hand feature points for the last frame, and recovering, through bilinear interpolation, the hand feature points for the remaining frames in which hand feature points were not extracted.
  11. The sign language video translation system of claim 1, wherein the feature point preprocessing module recovers the hand feature points based on the following [Mathematical Formula], in which the terms denote, respectively, the recovered feature points of the k-th frame, the feature points of a frame, among the frames preceding the k-th frame, from which the hand feature points were extracted, and the feature points of a frame, among the frames following the k-th frame, from which the hand feature points were extracted. [Mathematical Formula]
  12. The sign language video translation system of claim 11, wherein, in the [Mathematical Formula], the index of the preceding frame has the maximum such value and the index of the following frame has the minimum such value.
  13. The sign language video translation system of claim 1, wherein the artificial intelligence neural network includes a transformer encoder-only model.
  14. The sign language video translation system of claim 1, wherein the target video includes sign language videos, and the feature point preprocessing module normalizes the number of frames of each of the sign language videos.
  15. The sign language video translation system of claim 1, wherein the interpolation is bilinear interpolation.
  16. A sign language video translation method comprising: extracting user feature points from a target video; preprocessing the feature points, including normalizing body feature points among the user feature points, normalizing hand feature points among the body feature points, and recovering, through interpolation, hand feature points for frames in which hand feature points were not extracted; storing the preprocessed feature points; and recognizing and learning from the stored feature points.
  17. The sign language video translation method of claim 16, wherein extracting the user feature points from the target video includes generating, frame by frame, an image representing the user's feature points from the target video, and normalizing the hand feature points includes positioning the hand feature points at the center of the image.
  18. The sign language video translation method of claim 17, wherein normalizing the body feature points includes positioning a central body feature point at the center of the image and scaling the body feature points using the length of the neck.
  19. The sign language video translation method of claim 16, wherein normalizing the body feature points is performed based on the following [Mathematical Formula], in which the terms denote, respectively, the normalized body feature points, the head-center feature point, the neck feature point, and the body feature points before normalization. [Mathematical Formula]
  20. The sign language video translation method of claim 16, wherein recovering the hand feature points includes recovering the hand feature points through bilinear interpolation, using the hand feature points of a preceding frame in which the hand feature points were extracted and the hand feature points of a following frame in which the hand feature points were extracted.
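The recovery procedure of claims 6 through 12 (interpolating missing hand feature points from the nearest preceding and following frames in which they were extracted, with dataset averages covering missing first and last frames per claims 8 and 9) can be sketched roughly as follows. The claims name the method bilinear interpolation but the formula is not reproduced in this text, so this Python sketch assumes simple per-coordinate linear interpolation in time, and the function name is hypothetical.

```python
import numpy as np

def recover_hand_keypoints(frames):
    """Hypothetical sketch of hand feature point recovery.

    `frames` is a list where each entry is either a (K, 2) array of hand
    keypoints or None for frames where extraction failed. Each missing
    frame is filled by interpolating, in time, between the nearest
    preceding and nearest following frames that have keypoints.
    """
    valid = [i for i, f in enumerate(frames) if f is not None]
    out = list(frames)
    for i, f in enumerate(frames):
        if f is not None:
            continue
        prev = max((j for j in valid if j < i), default=None)
        nxt = min((j for j in valid if j > i), default=None)
        if prev is None or nxt is None:
            # Missing first/last frames would instead be filled with
            # averages over other videos' first/last frames (claims 8
            # and 9); that step is omitted from this sketch.
            continue
        t = (i - prev) / (nxt - prev)
        out[i] = (1.0 - t) * frames[prev] + t * frames[nxt]
    return out
```

For instance, a frame missing halfway between two extracted frames receives the midpoint of their keypoint positions.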

Description

Sign Language Video Translation System and Method of Translating Sign Language Video Using the Same

The present invention relates to a sign language video translation system and a sign language video translation method using the same. More specifically, it relates to a sign language video translation system that extracts user feature points from a video, and a sign language video translation method using the same.

Sign language is a visual language that expresses meaning through hand movements, shapes, and directions. Although sign language is used not only among the hearing impaired but also in conversations with the hearing population, its complex expression methods can make it difficult for hearing people to learn. Consequently, translation systems are being developed that either visually convert natural language into sign language or recognize sign language and render it in natural language. For example, systems that use artificial intelligence neural networks to recognize sign language and translate it into natural language are under development.

To facilitate sign language recognition and learning, feature points of sign language users can be extracted and fed to an artificial intelligence neural network for training. However, training the neural network may not be easy, because height, movements, positions, and body types vary among sign language users, and data on sign language videos is scarce. Additionally, the neural network may have difficulty distinguishing sign language expressions with slightly different hand shapes, such as 'Need' and 'Yes,' and feature points may not be extracted when the user's hand movements are very fast. Furthermore, because the lengths of sign language videos vary, it may not be easy for the neural network to recognize the feature points of sign language users.

FIG. 1 is a block diagram showing a sign language video translation system according to one embodiment of the present invention. FIG. 2 is a flowchart illustrating a method of translating sign language videos using the sign language video translation system of FIG. 1. FIG. 3 is a block diagram showing the step of extracting feature points from a video, included in the sign language video translation method of FIG. 2. FIG. 4 is a flowchart showing the step of preprocessing feature points, included in the sign language video translation method of FIG. 2. FIG. 5 is a block diagram showing the steps of normalizing the body feature points and normalizing the hand feature points, included in the feature point preprocessing step of FIG. 4. FIG. 6 is a block diagram showing the step of recovering hand feature points for frames in which hand feature points were not extracted, included in the feature point preprocessing step of FIG. 4. FIG. 7 is a block diagram showing the step of recovering feature points for the first frame, included in the hand feature point recovery step of FIG. 6. FIG. 8 is a block diagram showing the step of recovering feature points for the last frame, included in the hand feature point recovery step of FIG. 6. FIG. 9 is a table showing the translation accuracy from sign language to natural language with and without each normalization.

With respect to the embodiments of the present invention disclosed herein, specific structural or functional descriptions are provided merely to explain the embodiments; the embodiments of the present invention may be implemented in various forms and should not be interpreted as limited to the embodiments described herein.
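The background above notes that varying video lengths hinder recognition, and claim 14 mentions normalizing the number of frames of each sign language video. One plausible, but assumed, realization is temporal resampling of the keypoint sequence to a fixed length, sketched below in Python; the function name and the use of linear interpolation along the time axis are not specified by the patent.

```python
import numpy as np

def normalize_frame_count(seq, target_len):
    """Hypothetical sketch of frame-count normalization (claim 14).

    Resamples a keypoint sequence of shape (T, K, 2) to `target_len`
    frames by linear interpolation along the time axis, so that all
    sign language videos feed the neural network at a uniform length.
    """
    seq = np.asarray(seq, dtype=float)
    t_src = np.linspace(0.0, 1.0, num=seq.shape[0])
    t_dst = np.linspace(0.0, 1.0, num=target_len)
    flat = seq.reshape(seq.shape[0], -1)  # (T, K*2)
    cols = [np.interp(t_dst, t_src, flat[:, c]) for c in range(flat.shape[1])]
    out = np.stack(cols, axis=1)          # (target_len, K*2)
    return out.reshape((target_len,) + seq.shape[1:])
```

Upsampling a two-frame sequence to three frames, for example, inserts the midpoint of the two frames between them.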
The present invention is capable of various modifications and may take various forms; specific embodiments are illustrated in the drawings and described in detail herein. However, this is not intended to limit the invention to the specific forms disclosed, and the invention should be understood to include all modifications, equivalents, and substitutions that fall within its spirit and scope. Terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms; the terms serve only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be named a second component and, similarly, a second component may be named a first component. When one component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but intervening components may also be present. Conversely, when it is stated that one componen