
CN-122024312-A - Cross-modal sign language recognition and real-time translation method

CN 122024312 A

Abstract

The invention discloses a cross-modal sign language recognition and real-time translation method. Multi-level sign language features (hand keypoints, hand shapes, motion trajectories, facial expressions and body postures) are fused and, by combining contextual semantic understanding with grammatical structure analysis, high-precision continuous sign language recognition and bidirectional translation are realized through sign language grammar parsing, a contextual semantic inference mechanism, continuous sign language segmentation and recognition, a sign language dialect knowledge base and a virtual sign language generator, improving the efficiency of communication between hearing-impaired people and hearing people. The invention can recognize isolated sign language vocabulary, understand the grammatical structure and context of sign language, realize bidirectional translation of sign language to text/speech and of text/speech to virtual sign language animation, and support recognition and conversion of sign language dialects from different regions.

Inventors

  • LI YIPENG
  • SHAO XINQING
  • WU HAO
  • ZHANG PING
  • ZHOU HONGWEI

Assignees

  • 江苏润和软件股份有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-01-22

Claims (7)

  1. A cross-modal sign language recognition and real-time translation method, characterized in that multi-level sign language features of hand keypoints, hand shapes, motion trajectories, facial expressions and body postures are fused and combined with contextual semantic understanding and grammatical structure analysis, so that high-precision continuous sign language recognition and bidirectional translation are realized through sign language grammar parsing, a contextual semantic inference mechanism, continuous sign language segmentation and recognition, a sign language dialect knowledge base and a virtual sign language generator, improving the efficiency of communication between hearing-impaired people and hearing people; the method specifically comprises the following steps:
     S1, multi-level sign language feature extraction: dedicated feature extractors are designed for the three information sources of hand motion, facial expression and body posture to construct a multi-level sign language feature representation;
     S2, multi-level feature fusion: after feature extraction at the hand, face and body-posture levels is completed, a spatio-temporal attention mechanism is designed to fuse the multi-level features into a multi-level fusion feature, making full use of the complementarity of features at different levels;
     S3, contextual semantic modeling and grammar analysis: after the multi-level fusion features are obtained, a sign language grammar parsing and contextual semantic inference mechanism is designed to understand the grammatical structure and contextual semantics of the sign language, yielding a semantic feature sequence containing grammatical and contextual information;
     S4, continuous sign language segmentation and recognition: after the semantic feature sequence containing grammatical and contextual information is obtained, end-to-end recognition is realized with a connectionist temporal classification (CTC) loss or an attention mechanism, achieving automatic segmentation and recognition of continuous sign language;
     S5, sign language dialect knowledge base: after sign language recognition is completed, a sign language dialect knowledge base is established to support recognition and conversion of sign language dialects from different regions;
     S6, virtual sign language generation: a virtual sign language generation model is designed to realize the reverse translation of text/speech into sign language.
  2. The cross-modal sign language recognition and real-time translation method according to claim 1, wherein step S1 specifically comprises:
     S11, hand keypoint and hand-shape feature extraction: for hand motion, a depth camera or a binocular camera is used to obtain the 3D coordinates of 21 hand keypoints; the input hand keypoint sequence is denoted P = {P_1, ..., P_T}, where P_t contains the 3D coordinates of the 21 keypoints in frame t and T is the number of frames; the relative position features of the hand keypoints are first computed by normalizing with the wrist as the reference point, i.e. subtracting the wrist coordinates of frame t from every keypoint of that frame; the hand skeleton is then modeled with a graph convolutional network to capture the spatial structure of the hand, the 21 keypoints being organized into a graph G = (V, E) with node set V and edge set E; a spatio-temporal graph convolutional network (ST-GCN) extracts hand structural features: graph convolution captures the spatial relations between keypoints and temporal convolution captures the dynamics over time, i.e. a spatial graph convolution over each keypoint and its neighbors first extracts local spatial features, a one-dimensional convolution along the temporal dimension then captures motion patterns, and stacked spatio-temporal graph convolution layers progressively extract higher-level semantic features, yielding the hand structural feature of each frame; for hand-shape classification, the 3D coordinates of the 21 keypoints are flattened into a 63-dimensional vector, transformed through several fully connected layers, and a Softmax layer outputs the probability distribution over the hand-shape categories;
     S12, hand motion trajectory feature extraction: to capture the dynamics of hand motion, the motion trajectory of the hand keypoints is computed; the motion velocity vector is defined as v_t = (P_t − P_{t−1}) / Δt, where Δt is the time interval; to capture motion more completely, a motion direction angle describing the change of direction of the hand is also computed; the velocity vector and direction angle are combined into a motion feature vector, and a bidirectional long short-term memory (BiLSTM) network extracts the temporal features of the trajectory: the forward and backward LSTMs process the sequence separately and their hidden states are concatenated to obtain the final temporal feature representation (an illustrative preprocessing sketch follows the claims);
     S13, facial expression feature extraction: for facial expressions, a facial action unit (AU) recognition system extracts expression features from the input facial image sequence, where each element is the face image of one frame; a facial keypoint detector locates the facial keypoints, and the AU extractor, implemented as a convolutional neural network, first extracts a feature representation of the face image with a pretrained CNN and then regresses the intensity value of each AU through fully connected layers; each AU corresponds to a specific facial muscle action, such as eyebrow raising, mouth-corner lowering or eye opening, and the regression network outputs intensity values in the range 0-5; to capture the temporal change of facial expressions, a temporal attention mechanism with a learnable weight matrix computes attention weights over the frames and produces a weighted facial expression feature;
     S14, body posture feature extraction: for body posture, the upper-body inclination angle and head motion are extracted from the body keypoint sequence, where each frame contains the body keypoints of that frame; the upper-body inclination angle is computed from the body keypoints, and a head motion encoder implemented as an LSTM network receives the coordinate sequence of the head keypoints, captures the temporal pattern of head motion through its memory mechanism, and encodes head motion features of the corresponding feature dimension in its hidden state, including nodding, head shaking and left-right rotation.
  3. The cross-modal sign language recognition and real-time translation method according to claim 2, wherein in step S2 the extracted hand structural, hand-shape, motion trajectory, facial expression and body posture features are received and deeply fused through projection and attention mechanisms; the features of the different levels are first projected into a unified feature space by learnable projection matrices and concatenated; a multi-head attention mechanism then divides the feature space into several subspaces, each head computes attention independently, and the outputs of all heads are concatenated; specifically, the hand features serve as queries and the facial and body features serve as keys and values, so that the degree to which the hand features attend to the facial and body features is computed and the semantic associations between features of different levels are captured, each attention head having its own projection matrix and an output projection matrix combining the heads; the final multi-level fusion feature is obtained by a weighted sum whose learnable fusion weight coefficients sum to 1 (an illustrative fusion sketch follows the claims).
  4. The cross-modal sign language recognition and real-time translation method according to claim 3, wherein in step S3 the fused feature sequence output by step S2 is taken as input and, through grammar parsing and context inference, a semantic representation containing grammatical information and contextual information is output, providing more accurate semantic features for subsequent sign language recognition, specifically comprising:
     S31, sign language grammatical structure analysis: a sign language grammar knowledge base is constructed and a set of grammar rules is defined; for the input sign language feature sequence, a bidirectional LSTM network performs grammar analysis: the parser encodes the sign language feature sequence with the bidirectional LSTM to capture context information, an attention mechanism focuses on the relevant rules in the grammar rule base, and a fully connected layer outputs grammatical structure features of the corresponding feature dimension, containing the grammatical role, tense and mood information of the current gesture (such as subject/object role, tense and mood at the current moment);
     S32, contextual semantic inference: to handle ambiguous gestures, a context attention mechanism is used to infer their exact meaning; a context window size is defined, the semantic similarity between the current gesture and the gestures in the context window is computed, and a context-weighted feature is formed using a distance attenuation coefficient controlled by an attenuation parameter; finally, the grammatical information and the contextual information are combined through a semantic fusion matrix into a semantic representation of the corresponding semantic feature dimension (an illustrative context-weighting sketch follows the claims).
  5. The cross-modal sign language recognition and real-time translation method according to claim 4, wherein in step S4 the semantic feature sequence is encoded into hidden representations, mapped into the vocabulary space, and the final text recognition result is output through CTC decoding or sequence-to-sequence translation, specifically comprising:
     S41, continuous recognition based on CTC: the semantic feature sequence is fed into a bidirectional LSTM encoder consisting of several stacked bidirectional LSTM layers; the forward and backward LSTMs process the context of the sequence separately and their hidden states are concatenated into an encoded representation containing full context information; a fully connected output layer, with its weight matrix and bias vector, then maps the encoded features of each frame into a probability distribution over the sign language vocabulary; training uses the CTC loss function, which achieves end-to-end recognition without manual segmentation: the CTC algorithm introduces a blank label to handle the length mismatch between input and output sequences, allows the model to output sequences of arbitrary length, and computes the probability of all possible alignment paths by dynamic programming; the CTC loss is defined as L_CTC = −log Σ_{π ∈ B⁻¹(y)} Π_t p(π_t | x), where y is the true label sequence, x is the model output sequence, π is an alignment path, B⁻¹(y) is the set of all alignment paths that can be mapped to the true label sequence, and π_t is the label of path π at time t (an illustrative CTC training sketch follows the claims);
     S42, sequence translation based on an attention mechanism: as an alternative to CTC, a Transformer architecture is used to implement sequence-to-sequence translation; the Transformer decoder is stacked from several decoder layers, each containing a self-attention mechanism, an encoder-decoder attention mechanism and a feed-forward neural network; in a specific implementation, the generated prefix sequence is first embedded and position-encoded and then processed through the decoder layers, each of which contains residual connections and layer normalization; the vocabulary prediction probability is finally computed from the decoder output through an output projection matrix and a Softmax function.
  6. The cross-modal sign language recognition and real-time translation method according to claim 5, wherein in step S5 the multi-level fusion features are received, the dialect type is identified by a dialect encoder, and a cross-dialect mapping matrix realizes feature conversion between different dialects, so as to support cross-dialect sign language recognition and translation, specifically comprising:
     S51, dialect feature encoding: for sign languages of different dialects, features are extracted with a dialect encoder implemented as a fully connected network; an independent encoder network with its own parameters is designed for each dialect, which receives the multi-level fusion features and maps the general features into a dialect-specific feature representation through several fully connected layers and nonlinear activation functions, capturing the expression patterns specific to that dialect;
     S52, cross-dialect mapping: a cross-dialect mapping matrix is designed to map features of one dialect into the feature space of another dialect, the matrix mapping from the source dialect to the target dialect (an illustrative dialect-mapping sketch follows the claims).
  7. The cross-modal sign language recognition and real-time translation method according to claim 6, wherein in step S6 a complete bidirectional translation system is formed together with the sign language recognition: the recognition module realizes sign-language-to-text translation and the generation module realizes text-to-sign-language translation; the generation module receives text input, obtains a semantic representation through a text encoder, then generates a gesture feature sequence, and finally converts it into a 3D virtual sign language animation comprising hand motions, facial expressions and body postures, specifically comprising:
     S61, text-to-gesture sequence generation: for the input text, the text encoder uses a pretrained language model to encode the text sequence into context-dependent word vector representations; the gesture decoder adopts a Transformer decoder architecture to process the text semantic sequence and generates the gesture feature sequence autoregressively: at each time step, conditioned on the already generated gesture sequence and the text representation from the text encoder, the decoder attends to the relevant parts of the text through an attention mechanism and generates the next gesture feature of the given gesture feature dimension;
     S62, 3D sign language animation generation: the gesture features are converted into a sequence of 3D hand keypoints; the keypoint decoder is implemented as a multi-layer fully connected network that maps the gesture features to the 3D coordinates of the 21 keypoints, gradually expanding the feature dimension through several fully connected layers and finally outputting a 21 × 3 coordinate matrix; the corresponding facial expression and body posture are generated simultaneously, the facial expression decoder being a fully connected network that maps the gesture features to AU feature vectors and the body posture decoder likewise being a fully connected network that maps the gesture features to the body inclination angle, each with its own decoder parameters; a motion smoothing algorithm with a smoothing parameter ensures the fluency of the animation (an illustrative smoothing sketch follows the claims).
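
As context for step S12 of claim 2, the following is a minimal sketch of wrist-relative normalization and velocity/direction-angle trajectory features. The 21-keypoint layout, wrist index 0, the 30 fps time interval and all function names are illustrative assumptions, not taken from the patent.

```python
# Minimal sketch of the hand-feature preprocessing in claim 2 (S11/S12).
# Assumptions: wrist is keypoint index 0; frames arrive at 30 fps.
import numpy as np

def wrist_relative(keypoints: np.ndarray) -> np.ndarray:
    """Normalize 3D hand keypoints relative to the wrist.
    keypoints: (T, 21, 3) per-frame 3D coordinates.
    Returns (T, 21, 3) coordinates with the wrist as origin."""
    wrist = keypoints[:, 0:1, :]          # (T, 1, 3)
    return keypoints - wrist

def motion_features(keypoints: np.ndarray, dt: float = 1 / 30) -> np.ndarray:
    """Velocity vectors and an in-plane direction angle per keypoint (S12).
    Returns (T-1, 21, 4): 3 velocity components plus the direction angle."""
    velocity = np.diff(keypoints, axis=0) / dt               # (T-1, 21, 3)
    angle = np.arctan2(velocity[..., 1], velocity[..., 0])   # x-y motion direction
    return np.concatenate([velocity, angle[..., None]], axis=-1)

# Example: 40 frames of 21 keypoints
kp = np.random.randn(40, 21, 3)
print(wrist_relative(kp).shape, motion_features(kp).shape)   # (40, 21, 3) (39, 21, 4)
```

The resulting motion feature vectors would then be fed to the BiLSTM mentioned in S12.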
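For the fusion of claim 3 (hand features as queries, face and body features as keys/values), the PyTorch sketch below shows one way such a projection-plus-attention module could look; the feature dimensions, head count and fusion-weight parameterization are assumptions rather than the patent's exact design.

```python
# Hedged sketch of multi-level fusion (claim 3): project hand, face and body features
# into a shared space, then let hand features attend over the face/body features.
import torch
import torch.nn as nn

class MultiLevelFusion(nn.Module):
    def __init__(self, d_hand=256, d_face=64, d_body=32, d_model=256, n_heads=4):
        super().__init__()
        self.proj_hand = nn.Linear(d_hand, d_model)
        self.proj_face = nn.Linear(d_face, d_model)
        self.proj_body = nn.Linear(d_body, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # two learnable fusion weights, softmax-normalized so they sum to 1
        self.alpha = nn.Parameter(torch.zeros(2))

    def forward(self, hand, face, body):
        q = self.proj_hand(hand)                                    # (B, T, d_model)
        kv = torch.cat([self.proj_face(face), self.proj_body(body)], dim=1)
        attended, _ = self.attn(q, kv, kv)                          # hand attends to face/body
        w = torch.softmax(self.alpha, dim=0)
        return w[0] * q + w[1] * attended                           # fused feature sequence

fusion = MultiLevelFusion()
f = fusion(torch.randn(2, 40, 256), torch.randn(2, 40, 64), torch.randn(2, 40, 32))
print(f.shape)   # torch.Size([2, 40, 256])
```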
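The context weighting of claim 4 (S32) could be realized along the following lines, combining cosine similarity with an exponential distance decay; the window size, decay rate and softmax normalization are assumed choices, since the patent only names the quantities involved.

```python
# Sketch of context-weighted semantic features (claim 4, S32): neighbouring gesture
# features are weighted by semantic similarity and an exponential distance decay.
import torch
import torch.nn.functional as F

def context_feature(feats: torch.Tensor, t: int, window: int = 5, decay: float = 0.5):
    """feats: (T, D) semantic features; returns the context-weighted feature at step t."""
    T, D = feats.shape
    idx = [i for i in range(max(0, t - window), min(T, t + window + 1)) if i != t]
    ctx = feats[idx]                                                   # (N, D)
    sim = F.cosine_similarity(feats[t].expand_as(ctx), ctx, dim=-1)    # semantic similarity
    dist = torch.tensor([abs(i - t) for i in idx], dtype=feats.dtype)  # frame distance
    w = torch.softmax(sim * torch.exp(-decay * dist), dim=0)           # decayed weights
    return (w.unsqueeze(-1) * ctx).sum(dim=0)                          # (D,)

feats = torch.randn(40, 128)
print(context_feature(feats, t=10).shape)   # torch.Size([128])
```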
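For the CTC branch of claim 5 (S41), the sketch below pairs a bidirectional LSTM encoder with a vocabulary projection and PyTorch's nn.CTCLoss; the vocabulary size, hidden size and blank index are illustrative assumptions.

```python
# Minimal sketch of CTC-based continuous recognition (claim 5, S41).
import torch
import torch.nn as nn

class CTCRecognizer(nn.Module):
    def __init__(self, d_in=256, hidden=256, vocab=1000):
        super().__init__()
        self.encoder = nn.LSTM(d_in, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab + 1)     # +1 for the CTC blank label
        self.ctc = nn.CTCLoss(blank=vocab, zero_infinity=True)

    def forward(self, feats, targets, feat_lens, target_lens):
        enc, _ = self.encoder(feats)                    # (B, T, 2*hidden)
        log_probs = self.out(enc).log_softmax(-1)       # (B, T, vocab+1)
        # nn.CTCLoss expects (T, B, C) log-probabilities
        return self.ctc(log_probs.transpose(0, 1), targets, feat_lens, target_lens)

model = CTCRecognizer()
loss = model(torch.randn(2, 40, 256),
             torch.randint(0, 1000, (2, 8)),
             torch.full((2,), 40, dtype=torch.long),
             torch.full((2,), 8, dtype=torch.long))
print(loss.item())
```

At inference time, greedy or beam-search CTC decoding would collapse repeated labels and remove blanks to obtain the recognized vocabulary sequence.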
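Claim 6's per-dialect encoders and cross-dialect mapping matrices could be organized as in the following sketch; the dialect names (CSL, ASL, BSL), layer sizes and dictionary-style module layout are assumptions for illustration.

```python
# Sketch of dialect encoding and cross-dialect mapping (claim 6, S51/S52):
# one small fully connected encoder per dialect plus a learned linear map
# between dialect feature spaces.
import torch
import torch.nn as nn

class DialectModule(nn.Module):
    def __init__(self, d_in=256, d_dialect=128, dialects=("CSL", "ASL", "BSL")):
        super().__init__()
        self.encoders = nn.ModuleDict({
            d: nn.Sequential(nn.Linear(d_in, d_dialect), nn.ReLU(),
                             nn.Linear(d_dialect, d_dialect))
            for d in dialects
        })
        # one mapping matrix per ordered dialect pair
        self.maps = nn.ModuleDict({
            f"{a}_to_{b}": nn.Linear(d_dialect, d_dialect, bias=False)
            for a in dialects for b in dialects if a != b
        })

    def encode(self, feats, dialect):
        return self.encoders[dialect](feats)

    def translate(self, dialect_feats, src, dst):
        return self.maps[f"{src}_to_{dst}"](dialect_feats)

m = DialectModule()
csl = m.encode(torch.randn(2, 40, 256), "CSL")
asl = m.translate(csl, "CSL", "ASL")
print(asl.shape)   # torch.Size([2, 40, 128])
```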
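Finally, the motion smoothing mentioned in claim 7 (S62) is sketched here as simple exponential smoothing of the generated keypoint sequence; the patent does not specify the exact smoothing algorithm, and the coefficient value is an assumption.

```python
# Sketch of motion smoothing for the generated 3D keypoints (claim 7, S62),
# so the virtual sign-language animation stays fluid between frames.
import numpy as np

def smooth_keypoints(frames: np.ndarray, alpha: float = 0.6) -> np.ndarray:
    """frames: (T, 21, 3) generated keypoints; returns the smoothed sequence
    following p_hat[t] = alpha * p[t] + (1 - alpha) * p_hat[t-1]."""
    out = np.empty_like(frames)
    out[0] = frames[0]
    for t in range(1, len(frames)):
        out[t] = alpha * frames[t] + (1 - alpha) * out[t - 1]
    return out

gen = np.random.randn(60, 21, 3)
print(smooth_keypoints(gen).shape)   # (60, 21, 3)
```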

Description

Cross-modal sign language recognition and real-time translation method

Technical Field

The invention relates to the fields of artificial intelligence, computer vision and natural language processing, and in particular to a cross-modal sign language recognition and real-time translation method suitable for application scenarios such as accessible public services, online education, video conferencing, emergency assistance, film and television subtitling, and sign language news broadcasting.

Background

With the rapid development of artificial intelligence technology, sign language recognition and translation serve as an important bridge for communication between hearing-impaired people and hearing people and have broad application prospects in many fields. Traditional sign language recognition methods mainly rely on single-modality information, such as gesture recognition based only on hand keypoints, static hand-shape classification based only on hand shapes, or dynamic gesture recognition based only on motion trajectories. However, sign language expression has multi-dimensional, multi-level and complex characteristics, and single-modality information often cannot capture the complete semantics of sign language comprehensively and accurately.

Sign language is a complete visual language system: its expression involves not only hand movements but also facial expressions, body postures, spatial position relations and other information. Facial expressions carry important linguistic and emotional information in sign language, such as the grammatical functions of questioning, emphasis and negation, while body postures and spatial position relations are used to express grammatical structures and semantic relations.

Conventional sign language recognition systems have the following problems:
1. Incomplete use of modalities: existing methods mainly focus on hand keypoints or hand-shape information and ignore important auxiliary information such as facial expressions and body postures, so their recognition accuracy for grammatical structures such as interrogative and negative sentences is low and they cannot understand the complete semantics of sign language.
2. Continuous sign language segmentation is difficult: sign language expression is continuous and, unlike spoken language, has no clear word boundaries. Most existing systems require manual segmentation or rely on simple temporal windows, cannot automatically identify vocabulary boundaries in a continuous sign language stream, and are therefore limited in practicality and real-time capability.
3. Lack of contextual semantic understanding: sign language contains many ambiguous gestures whose exact meaning must be judged in combination with context and grammatical structure. Existing models lack a dedicated grammar analysis module and context modeling mechanism, cannot handle ambiguous gestures effectively, and therefore produce recognition errors.
4. Insufficient cross-dialect recognition capability: the sign languages of different countries and regions differ significantly, for example Chinese Sign Language, American Sign Language and British Sign Language. Existing systems are usually trained on a single sign language and cannot recognize or convert sign languages from different regions, which limits their scope of application.
5. Limited bidirectional translation capability: most sign language recognition systems only realize one-way translation from sign language to text and cannot perform reverse translation from text/speech to sign language. Even when a reverse translation function exists, the generated virtual sign language animation is often not natural and fluent enough and lacks coordination of facial expressions and body postures.
6. Real-time requirements are hard to meet: in real-time interaction scenarios such as video conferencing and emergency assistance, the system must recognize sign language and respond quickly and accurately. Existing deep learning models have high computational complexity and large inference latency and struggle to meet real-time requirements.
7. High data annotation cost: sign language recognition requires large amounts of annotated data, and annotation must be done by professional sign language experts, so the cost is high and the cycle is long. Existing methods generally need large amounts of labeled data to achieve good performance, resulting in high training cost.

Therefore, a sign language recognition and translation system is needed that can make full use of multi-modal information, deeply fuse multi-level sign language features, accurately understand sign language grammatical structures and contextual semantics, and offer good real-time performance and cross-dialect adaptability. The invention solves the technical