CN-122023851-A - Dynamic sign language recognition method

CN-122023851-A

Abstract

The invention provides a dynamic sign language recognition method built around an adaptive frame-sampling and spatio-temporal feature-fusion algorithm with background separation, enabling gesture-background separation and sign language recognition against complex backgrounds. The method comprises data preprocessing, adaptive frame sampling, construction of a spatio-temporal convolutional neural network, embedding of an SC feature attention module, and feature classification. The spatio-temporal feature-extraction method and the feature attention module improve the network's feature extraction, while foreground-background separation and the adaptive frame-sampling algorithm effectively screen out invalid and repeated data, reducing training overhead and improving training results.

Inventors

  • GAO QIZHI

Assignees

  • 无锡机电高等职业技术学校

Dates

Publication Date
2026-05-12
Application Date
2026-01-23

Claims (6)

  1. A dynamic sign language recognition method, characterized by comprising the following steps: S1, recording a sign language video, acquiring sign language gesture depth and image information through a depth camera, aligning the depth and image information at the pixel level, and separating the foreground from the background of each image frame to remove interfering features; S2, performing adaptive frame sampling on the preprocessed video through an adaptive frame-sampling algorithm, and screening out invalid and repeated data from the data set; S3, feeding the preprocessed data set into a spatio-temporal convolutional neural network for spatio-temporal feature extraction, performing feature screening on the low-dimensional features, and outputting high-dimensional gesture features; and S4, dividing the high-dimensional feature map extracted by the spatio-temporal convolutional feature-extraction network into a plurality of fixed-size feature blocks, flattening the blocks one-dimensionally and feeding them into a CIT network for feature extraction and screening, and finally outputting a category space that maps the features to the classes, so as to obtain the prediction probability of each class; the training model is saved, comparative experiments on public and self-built data sets verify the model's generality, and the optimal training model is retained.
  2. The method of claim 1, wherein in step S1 the depth image is first pseudo-colorized; multi-point sampling is then performed on the depth image and the color image, the color-depth offset is computed from the sampled points, and the pixels of the color image are shifted by that offset to achieve pixel alignment; finally, an AND operation is applied to the aligned depth and color images, so that wherever the depth image has a non-zero pixel value the color pixel at the corresponding position is retained, and otherwise it is discarded, thereby achieving foreground-background separation (a sketch of this masking step appears after the claims).
  3. The dynamic sign language recognition method of claim 1, wherein in step S2 an odd-sampled squared error (SRSE) between two adjacent image frames is computed as a similarity measure, by the formula $\mathrm{SRSE} = \sum_{(i,j)} \big(X(i,j) - Y(i+1,j+1)\big)^{2}$, where the sum runs over the odd-sampled positions, $X(i,j)$ is the pixel value of the current frame at coordinates $(i,j)$, and $Y(i+1,j+1)$ is the pixel value of the next frame at coordinates $(i+1,j+1)$; the SRSE value is compared with a preset decision threshold, the previous frame is saved or discarded according to the comparison result, and the comparison object is updated; this process is repeated until frame sampling is complete, and if the final frame count does not reach the preset minimum, the decision threshold is adjusted and frames are resampled (see the frame-sampling sketch after the claims).
  4. The method of claim 1, wherein in step S3 the spatio-temporal convolutional neural network is SFCNet, whose structure comprises: a plurality of three-dimensional convolution layers and pooling layers configured to extract and compress spatio-temporal features layer by layer from the video sequence; and an SC feature attention module embedded between the convolution layers, configured to screen and enhance the effective spatio-temporal features by modeling channel dependencies (claims 4 and 5 are sketched together after the claims).
  5. The dynamic sign language recognition method of claim 4, wherein the SC feature attention module operates by: processing the input feature map with a 1×1×C convolution operator to strengthen the temporal scale of the feature layer; and fusing the processed features with the original features to increase the influence of the temporal features, the output being computed as $\tilde{X} = X \oplus (V * X)$ with $(V * X)_c = v_c * X = \sum_{s=1}^{C} v_c^{s} * x^{s}$, where $*$ denotes convolution, $V = [v_1, v_2, \ldots, v_C]$, $X = [x^{1}, x^{2}, \ldots, x^{C}]$, and $v_c^{s}$ is a single 2D spatial kernel acting on the corresponding channel $x^{s}$ of $X$.
  6. The dynamic sign language recognition method of claim 1, wherein in step S4 the feature-conversion network is a CIT network based on the Transformer architecture, and the processing procedure comprises: normalizing the features of the input one-dimensional feature sequence; computing the feature correlations between different positions in the sequence with a multi-head attention mechanism; feeding the output of the multi-head attention mechanism into a multi-layer perceptron to introduce a non-linear transformation; and finally flattening the processed features and feeding them into a classifier to obtain the classification result (a sketch of this block appears after the claims).
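
To make the claim-2 preprocessing concrete, here is a minimal Python sketch assuming NumPy arrays for the two images; the function names, the pure-translation alignment model, and the wrap-around shift are illustrative assumptions, not the patented implementation.

```python
import numpy as np

def align_color_to_depth(color, dx, dy):
    """Shift the color image by the (dx, dy) color-depth offset estimated
    from multi-point sampling (claim 2). np.roll wraps at the borders,
    a simplification of a real pixel-offset correction."""
    return np.roll(color, shift=(dy, dx), axis=(0, 1))

def separate_foreground(color, depth):
    """AND-style masking: keep a color pixel wherever the aligned depth
    pixel is non-zero, zero it out everywhere else."""
    mask = depth > 0                 # non-zero depth marks the foreground
    out = np.zeros_like(color)
    out[mask] = color[mask]          # retain only foreground color pixels
    return out
```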
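The adaptive frame-sampling loop of claim 3 can be sketched as follows; the SRSE expression is the reconstruction given above, and the even/odd sampling stride, the threshold-halving schedule, and the grayscale-frame assumption are all illustrative.

```python
import numpy as np

def srse(prev, nxt):
    """Odd-sampled squared error between adjacent frames (reconstructed
    claim-3 similarity measure); frames are 2D grayscale arrays."""
    x = prev[::2, ::2].astype(np.float64)    # sampled positions (i, j) of frame t
    y = nxt[1::2, 1::2].astype(np.float64)   # positions (i+1, j+1) of frame t+1
    h, w = min(x.shape[0], y.shape[0]), min(x.shape[1], y.shape[1])
    return float(np.sum((x[:h, :w] - y[:h, :w]) ** 2))

def adaptive_sample(frames, threshold, min_frames):
    """Keep a frame only when it differs enough from the last kept frame;
    if too few frames survive, relax the threshold and resample."""
    while True:
        kept, ref = [frames[0]], frames[0]
        for frame in frames[1:]:
            if srse(ref, frame) > threshold:
                kept.append(frame)
                ref = frame
        if len(kept) >= min_frames or threshold <= 1e-6:
            return kept
        threshold *= 0.5                     # adjust threshold, re-sample
```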
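For claims 4 and 5, a minimal PyTorch sketch of one SFCNet-style stage with an SC-style attention block follows. The residual 1×1×1 channel-mixing convolution is one reading of the reconstructed claim-5 formula, and all layer shapes and names are assumptions.

```python
import torch
import torch.nn as nn

class SCAttention(nn.Module):
    """Assumed reading of the SC feature attention module (claim 5): a
    1x1x1 3D convolution mixes the C channels at every spatio-temporal
    position, and the result is fused with the input by addition."""
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Conv3d(channels, channels, kernel_size=1)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W); fuse processed features with the originals
        return x + self.mix(x)

class SFCNetStage(nn.Module):
    """One stage of an SFCNet-style backbone (claim 4): 3D convolution,
    activation, SC attention, then pooling to compress features."""
    def __init__(self, cin: int, cout: int):
        super().__init__()
        self.conv = nn.Conv3d(cin, cout, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)
        self.attn = SCAttention(cout)
        self.pool = nn.MaxPool3d(kernel_size=2)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.attn(self.act(self.conv(x))))

# e.g. a (1, 3, 16, 112, 112) RGB clip becomes a (1, 32, 8, 56, 56) feature map
feat = SFCNetStage(3, 32)(torch.randn(1, 3, 16, 112, 112))
```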
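Claim 6 describes a standard Transformer-encoder pipeline; the sketch below assumes the usual pre-norm arrangement (LayerNorm, multi-head attention, MLP, flatten, linear head) with illustrative hyper-parameters, since the patent text does not fix them.

```python
import torch
import torch.nn as nn

class CITBlock(nn.Module):
    """Sketch of the claim-6 steps, assuming a standard pre-norm
    Transformer encoder block followed by a flatten-and-classify head."""
    def __init__(self, dim: int, heads: int, num_tokens: int, num_classes: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                 # feature normalization
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                      # non-linear transformation
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.head = nn.Linear(num_tokens * dim, num_classes)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, num_tokens, dim) one-dimensional feature sequence
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # token correlations
        x = x + self.mlp(self.norm2(x))
        return self.head(x.flatten(1))                 # class logits

# e.g. 49 tokens of width 64 mapped to 100 sign classes
logits = CITBlock(64, 4, 49, 100)(torch.randn(2, 49, 64))
```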

Description

Dynamic sign language recognition method

Technical Field

The invention relates to the technical field of sign language recognition, and in particular to a dynamic sign language recognition method.

Background

Sign language is the main means of communication for hearing-impaired people, and developing sign language recognition technology is of great significance for facilitating their communication with non-hearing-impaired people. Sign language involves the collaboration of multiple cues, including hand gestures, facial expressions, and body posture. However, the transitions between motion gestures make it difficult for deep neural networks to automatically discover the implicit relationships between these visual cues.

Because sign language is continuous, the use of temporal and spatial features is especially important, and the gestures must be segmented in time. A common approach is to decompose sign language into isolated-word recognition problems. Inspired by long short-term memory networks, and building on structural information and feature attention mechanisms, several researchers have extended LSTM networks with hierarchical attention networks to perform sign language recognition. However, if the segmentation is inaccurate, problems such as semantic errors can arise in the subsequent recognition. To address this, some researchers use sensors to capture gesture motion information and assist the recognition task, but this requires the subject to wear motion-sensing equipment, which complicates sign language recognition in practice.

Existing methods also fall short in how sign language video data is preprocessed: data sets contain a large amount of invalid and repeated data, so the network training load is enormous and the extraction of temporal gesture features suffers. To improve the use of temporal features, some methods employ multiple sensors, such as a visible-light RGB camera, a depth camera, or a millimeter-wave radar, or compute additional channels such as optical flow. This improves performance, but it makes the model parameters enormous and the training load high, limiting application in real scenarios.

Gestures exhibit temporal correlation and spatial continuity, and this spatio-temporal dependence means that exploiting the spatial and temporal features of gestures is particularly important, yet existing methods are deficient in this regard. Fully exploiting the inter-frame temporal information and the per-frame encoding of hand position, shape, and orientation is essential for improving gesture-feature utilization and model recognition performance.

In summary, the prior art has obvious deficiencies in the preprocessing efficiency of sign language video data and in the utilization of spatio-temporal features. A new sign language recognition method that automatically and efficiently reduces the input data and deeply fuses spatio-temporal context is therefore urgently needed to achieve more accurate and practical recognition.
Disclosure of Invention

Aiming at the defects of the prior art, the application provides a dynamic sign language recognition method in which a spatio-temporal feature-extraction method and a feature attention module improve the network's feature extraction, while foreground-background separation and an adaptive frame-sampling algorithm effectively screen out invalid and repeated data, reducing training overhead and improving training results. The technical scheme adopted by the invention is as follows: a dynamic sign language recognition method, comprising the steps of:

S1, recording a sign language video, acquiring sign language gesture depth and image information through a depth camera, aligning the depth and image information at the pixel level, and separating the foreground from the background of each image frame to remove interfering features.

S2, performing adaptive frame sampling on the preprocessed video through an adaptive frame-sampling algorithm, and screening out invalid and repeated data from the data set.

S3, feeding the preprocessed data set into a spatio-temporal convolutional neural network for spatio-temporal feature extraction, performing feature screening on the low-dimensional features, and outputting high-dimensional gesture features.

S4, dividing the high-dimensional feature map extracted by the spatio-temporal convolutional feature-extraction network into a plurality of fixed-size feature blocks (size 16), flattening the blocks one-dimensionally and feeding them into the CIT network for feature extraction and screening, and finally outputting a category space that maps the features to the classes, so as to obtain the prediction probability of each class (see the patch-splitting sketch below).
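
As an illustration of step S4, the following sketch cuts a feature map into fixed-size blocks and flattens each block into a one-dimensional token; the 16×16 block shape is an assumption based on the "(size 16)" note, and `to_patch_tokens` is an illustrative name, not the patent's.

```python
import torch

def to_patch_tokens(feat: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split a (N, C, H, W) feature map into fixed-size blocks (assumed
    16x16) and flatten each block into a 1D token for the CIT network."""
    n, c, h, w = feat.shape
    blocks = feat.unfold(2, patch, patch).unfold(3, patch, patch)
    # (N, C, H/p, W/p, p, p) -> (N, num_blocks, C * p * p)
    return blocks.permute(0, 2, 3, 1, 4, 5).reshape(n, -1, c * patch * patch)

# e.g. a (1, 32, 64, 64) feature map yields 16 tokens of width 32*16*16
tokens = to_patch_tokens(torch.randn(1, 32, 64, 64))
```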