CN-121982769-A - Sign language vocabulary segmentation and recognition method based on continuous integral triggering
Abstract
The invention discloses a sign language vocabulary segmentation and recognition method based on continuous integral triggering, which comprises the steps of: (1) obtaining a sign language video frame sequence and extracting frame-level visual features to obtain a frame-level feature sequence; (2) inputting the frame-level feature sequence into a CIF (Continuous Integrate-and-Fire) module, which generates a feature vector sequence corresponding to the vocabulary sequence through a continuous integral triggering method and outputs the time segmentation point corresponding to each vocabulary; (3) inputting the feature vector sequence into a vocabulary decoder to output vocabulary recognition results; and (4) aligning the vocabulary recognition results with the segmentation time points to form a time-stamped vocabulary sequence output. The method can effectively solve the problems of high labeling cost and error accumulation in traditional two-stage methods and the lack of explicit segmentation capability in existing end-to-end methods.
Inventors
- ZHANG KEJUN
- ZHENG ZEHUI
Assignees
- ZHEJIANG UNIVERSITY (浙江大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20260122
Claims (8)
- 1. A sign language vocabulary segmentation and recognition method based on continuous integral triggering, characterized by comprising the following steps: (1) Acquiring a sign language video frame sequence, extracting frame-level visual features, and obtaining a frame-level feature sequence; (2) Inputting the frame-level feature sequence into a CIF module, the CIF module generating a feature vector sequence corresponding to the vocabulary sequence through a continuous integral triggering method and outputting the segmentation time point corresponding to each vocabulary; (3) Inputting the feature vector sequence into a vocabulary decoder, and outputting a vocabulary recognition result; (4) Aligning the vocabulary recognition result with the segmentation time points to form a time-stamped vocabulary sequence output (illustrative sketches of these steps are given after the claims).
- 2. The sign language vocabulary segmentation and recognition method based on continuous integral triggering of claim 1, wherein in step (1), when the frame-level visual features are extracted, a 3D CNN or a temporal Transformer network is adopted to extract spatio-temporal features, and frame-level features with dimension T x D are output.
- 3. The sign language vocabulary segmentation and recognition method based on continuous integral triggering of claim 1, wherein in step (2), the CIF module works as follows: first, a trigger weight is predicted for each frame feature by a linear projection layer and a Sigmoid activation function; the trigger weights are accumulated frame by frame; when the accumulated value reaches a preset threshold value, a feature vector for one vocabulary is generated by triggering once, and the accumulated value is reset to the remainder; and the trigger time is recorded as the segmentation point of the vocabulary (a minimal sketch of this integrate-and-fire loop is given after the claims).
- 4. The sign language vocabulary segmentation and recognition method based on continuous integral triggering of claim 3 wherein the predetermined threshold is 1.0.
- 5. The sign language vocabulary segmentation and recognition method based on continuous integral triggering of claim 1, wherein in step (3), the vocabulary decoder adopts a Transformer decoder or an LSTM, receives the feature vector sequence output by the CIF module, and outputs a vocabulary tag sequence, i.e. the vocabulary recognition result.
- 6. The sign language vocabulary segmentation and recognition method based on continuous integral triggering of claim 1, further comprising a training step, wherein when the CIF module and the vocabulary decoder are trained, weakly annotated data containing only vocabulary sequences is used, accurate time stamps are not needed, and end-to-end training is achieved by jointly optimizing the cross-entropy classification loss and the CIF length-constraint loss.
- 7. The sign language vocabulary segmentation and recognition method based on continuous integral triggering of claim 6, wherein the loss function used for training the CIF module and the vocabulary decoder is: L = L_CE + λ·L_len; wherein L_CE is the cross-entropy classification loss, L_len is the CIF length-constraint loss that constrains the number of triggered vectors to match N, N is the length of the vocabulary sequence annotated in the data, and λ is the length-constraint weight (an illustrative sketch of this loss is given after the claims).
- 8. The sign language vocabulary segmentation and recognition method based on continuous integral triggering of claim 1, wherein in step (4), the time-stamped vocabulary sequence output is formed in the following specific form: {(g_1, t_1), (g_2, t_2), ..., (g_N, t_N)}; wherein g_i represents the i-th sign language vocabulary in the input video, t_i represents the starting time point corresponding to the i-th sign language vocabulary, and N is the total number of sign language vocabularies to be output.
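The following is a minimal sketch, in PyTorch, of how the frame-level feature extractor of claim 2 and the LSTM variant of the vocabulary decoder of claim 5 could be wired together; the layer sizes, module names, and the single-convolution "3D CNN" backbone are illustrative assumptions, not taken from the patent.

```python
# Illustrative wiring of the feature extractor (claim 2) and vocabulary
# decoder (claim 5); sizes and names are assumptions for demonstration.
import torch
import torch.nn as nn

class FrameFeatureExtractor(nn.Module):
    """Maps a video clip (B, 3, T, H, W) to frame-level features (B, T, D)."""
    def __init__(self, dim=512):
        super().__init__()
        # A single 3D convolution stands in for a full 3D CNN backbone.
        self.backbone = nn.Conv3d(3, dim, kernel_size=(3, 7, 7),
                                  stride=(1, 2, 2), padding=(1, 3, 3))
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))  # keep the time axis

    def forward(self, video):
        x = self.backbone(video)          # (B, D, T, H', W')
        x = self.pool(x).flatten(2)       # (B, D, T)
        return x.transpose(1, 2)          # (B, T, D) frame-level feature sequence

class VocabularyDecoder(nn.Module):
    """LSTM variant of the vocabulary decoder: classifies each CIF-fired vector."""
    def __init__(self, dim=512, vocab_size=1000):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, fired_vectors):      # (B, N, D) vectors from the CIF module
        out, _ = self.lstm(fired_vectors)
        return self.head(out)              # (B, N, vocab_size) vocabulary logits
```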
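The integrate-and-fire loop of claims 3 and 4 can be sketched as follows. The Sigmoid-over-linear-projection trigger weight, the threshold of 1.0, and recording the trigger frame as the segmentation point follow the claims; the variable names and the omission of tail handling (a partially accumulated vocabulary at the end of the sequence) are assumptions.

```python
# Minimal sketch of the continuous integral triggering (integrate-and-fire)
# step of claims 3-4; frame features `frames` have shape (T, D) and the
# learned projection parameters `w`, `b` are assumed to be given.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cif_segment(frames, w, b, beta=1.0):
    """Integrate per-frame trigger weights and fire one vector per vocabulary.

    Returns a list of (feature_vector, trigger_frame_index) pairs."""
    acc = 0.0                                 # accumulated trigger weight
    pooled = np.zeros(frames.shape[1])        # weighted sum of frame features
    outputs = []
    for t, h_t in enumerate(frames):
        alpha = sigmoid(h_t @ w + b)          # per-frame trigger weight (claim 3)
        if acc + alpha < beta:                # keep integrating
            acc += alpha
            pooled += alpha * h_t
        else:                                 # threshold reached: fire (claim 4: beta = 1.0)
            used = beta - acc                 # part of alpha that completes this vocabulary
            pooled += used * h_t
            outputs.append((pooled, t))       # trigger time = segmentation point
            acc = alpha - used                # reset accumulator to the remainder
            pooled = acc * h_t                # remainder starts the next vocabulary
    return outputs                            # tail (unfired residue) handling omitted
```

Pairing each fired vector's decoded label with its recorded trigger frame is one way to obtain the time-stamped vocabulary sequence of claim 8.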
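A possible form of the joint training loss of claims 6 and 7 is sketched below, assuming the common CIF length (quantity) constraint |sum(alpha) - N|; the exact form of L_len is not spelled out in the text, so this is an illustrative reconstruction with assumed tensor names.

```python
# Illustrative joint loss L = L_CE + lambda * L_len for weak, sequence-level
# supervision (claims 6-7); requires PyTorch.
import torch
import torch.nn.functional as F

def cif_training_loss(logits, targets, alphas, lambda_len=1.0):
    """logits:  (N, V) decoder outputs for the N fired vocabulary vectors,
    targets: (N,) annotated vocabulary indices (weak, sequence-level labels),
    alphas:  (T,) per-frame trigger weights from the CIF module."""
    ce_loss = F.cross_entropy(logits, targets)        # cross-entropy classification loss
    n_gold = torch.tensor(float(targets.numel()))     # annotated sequence length N
    len_loss = torch.abs(alphas.sum() - n_gold)       # CIF length-constraint loss
    return ce_loss + lambda_len * len_loss            # L = L_CE + lambda * L_len
```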
Description
Sign language vocabulary segmentation and recognition method based on continuous integral triggering

Technical Field
The invention belongs to the intersection of computer vision and natural language processing, and particularly relates to a sign language vocabulary segmentation and recognition method based on continuous integral triggering.

Background
Sign language is a primary means of communication for hearing-impaired people, and its automatic recognition is an important research direction in the fields of human-computer interaction and information accessibility. In practical applications, sign language recognition tasks can generally be divided into three categories: single sign language recognition, continuous sign language recognition, and sign language translation. Single sign language recognition recognizes an individual sign language vocabulary, continuous sign language recognition recognizes multiple consecutive sign language vocabularies, and sign language translation reorders and polishes the recognized vocabularies into natural language.

The continuous sign language recognition task is generally divided into two subtasks: temporal segmentation and vocabulary recognition. Temporal segmentation divides the continuous sign language video stream into vocabulary (Gloss) segments with independent semantics, and vocabulary recognition classifies each segment into the corresponding sign language vocabulary (Gloss). Traditional methods adopt a two-stage pipeline: temporal action detection is first performed to determine the boundaries, and then each segment is classified. The two-stage approach relies on strongly labeled data (i.e., the start and end time stamps of each sign language Gloss), so the manual labeling cost is high, and separating the segmentation and recognition steps easily causes error accumulation.

For example, Chinese patent document CN113642422A discloses a continuous Chinese sign language recognition method that realizes segmentation and word-by-word training of continuous video and can recognize each word in the video. Chinese patent document CN112347826A discloses a video continuous sign language recognition method and system based on reinforcement learning, which first obtains the features of the sign language video, then uses a boundary detector to determine the semantic boundaries of video segments according to defined states, further extracts features of the semantically consistent video segments between each pair of boundaries by pooling, and recognizes sign language words based on the further extracted features.

In recent years, end-to-end sign language recognition methods have emerged, such as methods based on CTC (Connectionist Temporal Classification) or RNN-T (Recurrent Neural Network Transducer), which allow training with weakly labeled data (Gloss sequences only). CTC allows the input and output sequences to differ in length and addresses the alignment problem by introducing a blank label. However, CTC cannot directly output explicit time boundaries, and its alignment process is implicit and difficult to interpret.
RNN-T models the joint probability distribution of input and output through a joint network, supports streaming input, and can realize online real-time recognition, but its model structure is complex, its training computation is large, and its alignment process is likewise implicit and lacks interpretability. Moreover, although such end-to-end recognition methods do not require finely labeled data, their accuracy on classical datasets (WLASL, PHOENIX 2014, etc.) is generally lower than that of conventional two-stage methods.

Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a sign language vocabulary segmentation and recognition method based on continuous integral triggering, which can be trained end-to-end using only weak, vocabulary-sequence-level labels and can synchronously output the recognized vocabulary sequence together with its start and end time stamps in the video, effectively solving the high labeling cost and error accumulation of traditional two-stage methods and the lack of explicit segmentation capability of existing end-to-end methods.

A sign language vocabulary segmentation and recognition method based on continuous integral triggering comprises the following steps: (1) Acquiring a sign language video frame sequence, extracting frame-level visual features, and obtaining a frame-level feature sequence; (2) Inputting the frame-level feature sequence into a pre-trained CIF module, the CIF module generating a feature vector sequence corresponding to the vocabulary sequence through a continuous integral triggering method