
CN-122024719-A - Sign language identification and display method and system in double-screen integrated machine

CN 122024719 A

Abstract

The invention relates to the technical field of computers, and in particular to a sign language identification and display method and system in a double-screen integrated machine. The method comprises: collecting sign language gestures made by a deaf-mute through a camera to obtain sign language data to be recognized; preprocessing the sign language data to be recognized and inputting the preprocessed data into an AI intelligent sign language recognition model to output first semantic information corresponding to the sign language data to be recognized; converting the first semantic information into voice content and playing the voice content through a loudspeaker; collecting voice data of a user through a microphone to obtain voice data to be recognized; preprocessing the voice data to be recognized and inputting the preprocessed data into an automatic speech recognition model to output second semantic information corresponding to the voice data to be recognized; and generating a sign language expression model based on the second semantic information, driving a digital human to generate a sign language animation based on the sign language expression model, and displaying the sign language animation on a display screen. The invention can solve the problems of low communication efficiency, dependence on manual translation, and inconvenience caused by ordinary people generally not mastering sign language when deaf-mutes communicate with ordinary people.

Inventors

  • Du Yuyong
  • He Chen
  • Wang Mengfei
  • Quan Jianping
  • Zhou Wanglong
  • Tang Ziyu
  • Jiang Hai

Assignees

  • 深圳市华弘数库科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-01-30

Claims (10)

  1. A sign language recognition and display method in a dual-screen integrated machine, characterized in that the dual-screen integrated machine comprises a camera, a microphone, a loudspeaker and two display screens, one of the display screens being used for displaying information to a deaf-mute, and the method comprises: collecting sign language gestures made by the deaf-mute through the camera to obtain sign language data to be recognized; preprocessing the sign language data to be recognized and inputting the preprocessed data into an AI intelligent sign language recognition model to output first semantic information corresponding to the sign language data to be recognized; converting the first semantic information into voice content, and playing the voice content through the loudspeaker; collecting voice data of a user through the microphone to obtain voice data to be recognized; preprocessing the voice data to be recognized and inputting the preprocessed data into an automatic speech recognition model to output second semantic information corresponding to the voice data to be recognized; and generating a sign language expression model based on the second semantic information, driving a digital human to generate a sign language animation based on the sign language expression model, and displaying the sign language animation on a display screen.
  2. The method for recognizing and displaying sign language in a dual-screen integrated machine according to claim 1, wherein the AI intelligent sign language recognition model is obtained through the following training process: collecting a plurality of sign language sample videos covering a preset sign language vocabulary set, and manually labeling the sign language sample videos to form semantic labels, wherein the semantic labels comprise hand shape change information, hand orientation information, information on the position of the hands relative to the body, and hand motion track information related to sign language expression; and performing model training based on the sign language sample videos and the semantic labels by adopting a deep learning training strategy to obtain the AI intelligent sign language recognition model.
  3. The method for recognizing and displaying sign language in a dual-screen integrated machine according to claim 2, wherein the sign language sample videos satisfy at least one of the following conditions during collection: the selected samples in the sign language sample videos represent the characteristic hand shapes and action features of the target sign language words; the selected samples in the sign language sample videos are different from one another; the collection view angle of the sign language sample videos is consistent with the installation view angle of the camera, so as to reduce the influence of view angle differences on recognition; the sample set of the sign language sample videos covers changes of the target hand shape under different motion states; for the characteristic hand shape of a one-handed sign language word, the sign language sample videos comprise executions by both the left hand and the right hand; and the sign language sample videos comprise static sign language samples with the arms naturally hanging down and dynamic sign language samples comprising continuous actions, so that the AI intelligent sign language recognition model is applicable to both static sign language and continuous dynamic sign language recognition.
  4. The method for recognizing and displaying sign language in a dual-screen integrated machine according to claim 1, wherein the AI intelligent sign language recognition model is a continuous sign language recognition model, and the continuous sign language recognition model comprises: a data preprocessing module, used for performing image preprocessing, detection and segmentation, and target tracking on the sign language data to be recognized to obtain a temporally consistent sign language clip sequence; a feature extraction and fusion module, used for extracting spatio-temporal features of video clips from the sign language clip sequence, extracting limb motion track features, and performing multi-modal feature fusion on the spatio-temporal features and the limb motion track features to obtain fusion feature vectors; a spatial attention module, used for performing spatial attention weighting on the fusion feature vectors to highlight key region features related to sign language expression; a temporal modeling module, used for inputting the weighted fusion feature vectors into a bidirectional temporal network in chronological order for temporal modeling to obtain a long-term spatio-temporal feature representation; and a decoding module, used for inputting the long-term spatio-temporal feature representation into a connectionist temporal classification model for end-to-end decoding, and outputting semantic words corresponding to the sign language data to be recognized as the first semantic information.
  5. The method for recognizing and displaying sign language in a dual-screen integrated machine according to claim 4, wherein the feature extraction and fusion module comprises: a sampling unit, used for uniformly sampling the sign language clip sequence covering the sign language data to be recognized to form a video clip sequence arranged in chronological order; a spatio-temporal feature extraction unit, used for performing convolutional feature extraction on each video clip in the video clip sequence by adopting a spatio-temporal convolutional neural network to obtain spatio-temporal features respectively representing the different clips; a track feature extraction unit, used for extracting the motion tracks of hand key points or skeleton points in the video clip sequence based on the human body key points corresponding to the sign language data to be recognized, and performing feature encoding on the motion tracks to obtain the limb motion track features of the hands; and a fusion unit, used for performing feature fusion on the spatio-temporal features and the limb motion track features corresponding to the same temporal position to form the fusion feature vector representing that temporal position, and taking the fusion feature vectors arranged in chronological order as the input of the spatial attention module.
  6. The method for recognizing and displaying sign language in a dual-screen integrated machine according to claim 4, wherein the spatial attention module is configured to dynamically focus on motion regions related to sign language expression, and the spatial attention weighting of the spatial attention module at least comprises the following processes: performing a spatio-temporal convolution operation on the features of adjacent temporal frames based on the fusion feature vectors to obtain dynamic feature responses reflecting local motion changes; normalizing the dynamic feature responses and mapping them into a motion intensity distribution feature map through an activation function, so that the motion intensity distribution feature map represents the change intensity of local motion regions during sign language production; multiplying the motion intensity distribution feature map and the fusion feature vector element-wise to enhance dynamic motion information in the fusion feature vector and suppress the feature responses of regions irrelevant to sign language; calculating a weight vector over the enhanced feature dimensions based on the enhanced fusion feature vector, and normalizing the weight vector to obtain attention weights; and performing weighted fusion of the attention weights and the fusion feature vector to obtain the spatially attention-weighted fusion feature vector, and inputting it into the temporal modeling module.
  7. The method for recognizing and displaying sign language in a dual-screen integrated machine according to claim 1, wherein the AI intelligent sign language recognition model adopts a staged training strategy, comprising: in a pre-training stage, initializing and training model parameters based on a connectionist temporal classification training criterion, so as to improve the alignment probability between the label sequence output by the model and the reference label sequence; in a co-training stage, inputting training samples in a first input form into a first recognition branch to output a first branch prediction result, and determining a target segment based on the first branch prediction result; inputting the target segment in a second input form into a second recognition branch to output a second branch prediction result, and fusing the first branch prediction result and the second branch prediction result according to a preset fusion rule to obtain the final prediction output of the co-training stage; and constructing, in the co-training stage, a comprehensive loss for backward optimization, the comprehensive loss comprising a connectionist temporal classification loss for aligning the target sentence with the predicted sentence, an efficiency loss for balancing accuracy against computational efficiency, and an alignment loss for constraining the consistency of outputs at different resolutions or from different branches, wherein the efficiency loss is obtained by multiplying a sparse decision vector representing branch selection by a cost vector representing resource consumption, and the alignment loss is obtained by comparing the degree of difference between the softened output probability distributions of the different branches; the staged training strategy further comprises: adopting a self-distillation training strategy, selecting feature representations above a threshold level in the network as teacher features and feature representations below the threshold level as student features, calculating the mean square error between the teacher features and the student features to form a distillation loss, and combining the distillation loss and the comprehensive loss with weighting to form the final training loss; and adopting a model screening strategy based on word error rate, calculating the number of insertion, deletion and substitution operations required to convert the recognized sequence into the standard reference sequence, normalizing the operation count to obtain the word error rate, and selecting a model whose word error rate meets a preset screening condition as the AI intelligent sign language recognition model.
  8. The method for recognizing and displaying sign language in a dual-screen integrated machine according to claim 1, wherein generating the sign language expression model based on the second semantic information comprises: mapping the second semantic information into a sign language phoneme sequence according to a preset mapping relation, wherein the sign language phoneme sequence is encoded with a sign language notation system and converted into a markup language form processable by a computer; extracting sign language phoneme elements related to hand actions from the sign language phoneme sequence, wherein the sign language phoneme elements at least comprise hand shape elements, hand position elements, hand orientation elements and hand motion elements; performing sub-word level segmentation on the sign language phoneme elements to obtain a sign language phoneme label sequence, and introducing padding labels and unknown labels into the sign language phoneme label sequence to accommodate sign language phoneme inputs of different lengths; and mapping the sign language phoneme label sequence into an embedding sequence, wherein the embedding sequence is obtained by combining label embeddings and position embeddings and serves as the input representation for generating the sign language expression model.
  9. The method for recognizing and displaying sign language in a dual-screen integrated machine according to claim 8, wherein the sign language expression model is generated through an attention-based sequence generation network, the sequence generation network comprises an encoder and a decoder, and generating the sign language expression model comprises: the encoder performs self-attention computation and feed-forward network transformation on the embedding sequence to obtain an encoding result characterizing the context of the sign language phoneme sequence; the decoder runs as an autoregressive generation loop that, based on the encoding result and the already generated key point vector sequence, progressively predicts the key point vector at the next moment and synchronously outputs a stop flag indicating generation termination, wherein the key point vector represents a plurality of key point coordinates of the body and both hands; the autoregressive generation loop takes the key point vector corresponding to a preset static gesture as the initial input, and in each iteration feeds the key point vector predicted in the previous round back as the input of the next round until the stop flag meets a preset stop condition; when the sequence generation network is trained, a teacher-forcing training mode is adopted, in which the key point vector at the previous moment of the real key point sequence is taken as the decoder input to predict the key point vector at the next moment, a regression loss based on absolute error is adopted for the predicted key point output, and a binary classification loss is adopted for the stop flag; the real key point sequences used for training are padded and smoothed, including at least one of repeat padding, zero padding and mixed padding to a preset maximum sequence length, and interpolated key point frames are inserted between adjacent key point frames to reduce inter-frame differences, thereby obtaining a key point sequence representation for generating the sign language expression model; and the key point sequence representation is used as the sign language expression model, or converted into digital human skeleton driving parameters to drive the digital human to generate the sign language animation and display it on a display screen.
  10. A sign language recognition and display system in a dual-screen integrated machine, characterized by comprising a camera, a microphone, a loudspeaker, two display screens and a processor, wherein one of the display screens is used for displaying information to a deaf-mute, and the system comprises: a first acquisition module, used for collecting sign language gestures made by the deaf-mute through the camera to obtain sign language data to be recognized; a first semantic recognition module, used for preprocessing the sign language data to be recognized and inputting the preprocessed data into an AI intelligent sign language recognition model to output first semantic information corresponding to the sign language data to be recognized; a conversion module, used for converting the first semantic information into voice content and playing the voice content through the loudspeaker; a second acquisition module, used for collecting voice data of a user through the microphone to obtain voice data to be recognized; a second semantic recognition module, used for preprocessing the voice data to be recognized and inputting the preprocessed data into an automatic speech recognition model to output second semantic information corresponding to the voice data to be recognized; and a sign language generation module, used for generating a sign language expression model based on the second semantic information, driving a digital human to generate a sign language animation based on the sign language expression model, and displaying the sign language animation on a display screen.
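For illustration only (not part of the claims): the decoding module of claim 4 feeds the long-term spatio-temporal features into a connectionist temporal classification (CTC) model. The standard greedy CTC decoding step over per-frame class scores can be sketched as follows; the function name and score layout are hypothetical, not taken from the disclosure:

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Greedy CTC decoding: take the best label per frame, merge
    consecutive repeats, then drop blank labels."""
    # best label index for each frame (frame_scores: list of score lists)
    path = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]
    decoded, prev = [], None
    for label in path:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label  # repeats are only merged when adjacent
    return decoded
```

Note that a repeated word separated by a blank frame survives the merge, which is why CTC can emit the same label twice in a row.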
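The sampling unit of claim 5 uniformly samples the clip sequence into chronologically ordered video clips. A minimal sketch of evenly spaced clip windows, with a hypothetical helper that is not the patented implementation:

```python
def uniform_sample_clips(num_frames, num_clips, clip_len):
    """Return (start, end) frame windows of length clip_len, spread
    evenly over a video of num_frames frames."""
    if num_frames < clip_len:
        raise ValueError("not enough frames for a single clip")
    # spacing between clip starts so the first clip begins at frame 0
    # and the last clip ends at the final frame
    stride = (num_frames - clip_len) / max(num_clips - 1, 1)
    return [(round(i * stride), round(i * stride) + clip_len)
            for i in range(num_clips)]
```

Each window would then be passed to the spatio-temporal convolutional network of claim 5 for per-clip feature extraction.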
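The weighting steps of claim 6 (dynamic response from adjacent frames, activation into an intensity map, element-wise gating, normalized attention weights, reweighting) can be sketched on flat feature vectors. As a simplifying assumption, the spatio-temporal convolution is replaced by a plain frame difference, and the normalization step is folded into the softmax; all names are hypothetical:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def spatial_attention(prev_feat, cur_feat):
    """One spatial-attention step on flat feature vectors:
    1. motion response = frame difference (stand-in for the conv)
    2. sigmoid -> motion-intensity map
    3. element-wise gating of the current features
    4. softmax over gated features -> attention weights
    5. reweight the original features"""
    motion = [c - p for p, c in zip(prev_feat, cur_feat)]
    intensity = [sigmoid(m) for m in motion]
    gated = [i * c for i, c in zip(intensity, cur_feat)]
    weights = softmax(gated)
    return [w * c for w, c in zip(weights, cur_feat)]
```

In the claimed module the same gating and reweighting would run over spatial feature maps rather than flat vectors, before the result enters the temporal modeling module.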
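The model screening of claim 7 computes a word error rate from the insertion, deletion and substitution counts needed to turn the recognized sequence into the reference. A standard Levenshtein-distance sketch (illustrative, not the patent's code):

```python
def word_error_rate(reference, hypothesis):
    """Minimum edits (insert/delete/substitute) over words,
    normalized by the reference length."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)
```

A candidate model would be kept only if this rate meets the preset screening condition.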
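Claim 8 introduces padding and unknown labels so that sign language phoneme label sequences of different lengths share one input shape. A minimal sketch with hypothetical vocabulary helpers and invented label strings:

```python
PAD, UNK = "<pad>", "<unk>"

def build_vocab(label_sequences):
    """Map each distinct sign-phoneme label to an integer id;
    ids 0 and 1 are reserved for the padding and unknown labels."""
    vocab = {PAD: 0, UNK: 1}
    for seq in label_sequences:
        for label in seq:
            vocab.setdefault(label, len(vocab))
    return vocab

def encode(seq, vocab, max_len):
    """Truncate or pad a label sequence to max_len ids;
    labels missing from the vocabulary map to UNK."""
    ids = [vocab.get(label, vocab[UNK]) for label in seq[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))
```

The resulting id sequence is what the label-plus-position embedding of claim 8 would be looked up against.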
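Claim 9 inserts interpolated key point frames between adjacent frames to reduce inter-frame differences before training. A linear interpolation sketch over flat lists of keypoint coordinates (hypothetical helper, shown for one interpolation scheme only):

```python
def interpolate_frames(frames, steps=1):
    """Insert `steps` linearly interpolated keypoint frames between
    every pair of neighbouring frames (each frame: flat coord list)."""
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        for s in range(1, steps + 1):
            t = s / (steps + 1)  # fraction of the way from a to b
            out.append([pa + t * (pb - pa) for pa, pb in zip(a, b)])
    out.append(frames[-1])
    return out
```

The smoothed sequence can then be padded to the preset maximum length and used as the key point sequence representation that drives the digital human.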

Description

Sign language identification and display method and system in double-screen integrated machine

Technical Field

The invention relates to the technical field of computers, and in particular to a sign language identification and display method and system in a double-screen integrated machine.

Background

Receiving and giving language feedback are basic forms of interpersonal communication, but deaf-mutes cannot acquire and express information through sound because of their limited hearing, and instead communicate by means of sign language. Sign language expression depends on factors such as hand shape changes, hand orientation, position relative to the body and motion tracks, and some signs also require body posture and facial expression, so the learning threshold for ordinary people is high and communication between deaf-mutes and ordinary people is difficult. With the development of technologies such as artificial intelligence and computer vision, how to use intelligent devices to understand and express sign language information and make communication more convenient has become an urgent technical demand. In a real communication scenario, the real-time performance of communication, the intelligibility of expression and the continuity of the interaction process must all be considered, so that deaf-mutes and ordinary people can complete information transmission and feedback more naturally.

Disclosure of Invention

In view of the above technical problems, the invention provides a sign language identification and display method and system in a double-screen integrated machine, which aim to solve the problems of low communication efficiency, dependence on manual translation and inconvenience caused by ordinary people generally not mastering sign language when deaf-mutes communicate with ordinary people.
Other features and advantages of the present disclosure will become apparent from the following detailed description, or may be learned in part through practice of the disclosure. According to one aspect of the invention, a sign language recognition and display method in a double-screen integrated machine is provided, the method comprising the following steps: collecting sign language gestures made by a deaf-mute through a camera to obtain sign language data to be recognized; preprocessing the sign language data to be recognized and inputting the preprocessed data into an AI intelligent sign language recognition model to output first semantic information corresponding to the sign language data to be recognized; converting the first semantic information into voice content and playing the voice content through a loudspeaker; collecting voice data of a user through a microphone to obtain voice data to be recognized; preprocessing the voice data to be recognized and inputting the preprocessed data into an automatic speech recognition model to output second semantic information corresponding to the voice data to be recognized; and generating a sign language expression model based on the second semantic information, driving a digital human to generate a sign language animation based on the sign language expression model, and displaying the sign language animation on a display screen.
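The two directions of the method can be sketched as plain control flow. Every function below is a hypothetical placeholder injected as a callable (the disclosure names no APIs); the sketch only shows the order of the steps:

```python
def sign_to_speech(frames, preprocess, sign_model, tts, play):
    """Camera frames -> first semantic information -> spoken output."""
    semantics = sign_model(preprocess(frames))  # recognize preprocessed gestures
    play(tts(semantics))                        # speak via the loudspeaker
    return semantics

def speech_to_sign(samples, preprocess, asr_model, sign_generator, display):
    """Microphone audio -> second semantic information -> sign animation."""
    semantics = asr_model(preprocess(samples))   # recognize preprocessed speech
    display(sign_generator(semantics))           # animate the digital human
    return semantics
```

In the claimed system the two flows run on one device, with each display screen facing one side of the conversation.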
Further, the AI intelligent sign language recognition model is obtained through the following training process: collecting a plurality of sign language sample videos covering a preset sign language vocabulary set, and manually labeling the sign language sample videos to form semantic labels, wherein the semantic labels comprise hand shape change information, hand orientation information, information on the position of the hands relative to the body, and hand motion track information related to sign language expression; and performing model training based on the sign language sample videos and the semantic labels by adopting a deep learning training strategy to obtain the AI intelligent sign language recognition model. Further, the sign language sample videos satisfy at least one of the following conditions during collection: the selected samples in the sign language sample videos represent the characteristic hand shapes and action features of the target sign language words; the selected samples in the sign language sample videos are different from one another; the collection view angle of the sign language sample videos is consistent with the installation view angle of the camera, so as to reduce the influence of view angle differences on recognition; the sample set of the sign language sample videos covers changes of the target hand shape under different motion states; for the characteristic hand shape of a one-handed sign language word, the sign language sample videos comprise executions by both the left hand and the right hand