
CN-121999318-A - Sign language recognition training data generation method and system based on multi-modal data enhancement

CN 121999318 A

Abstract

The invention relates to the technical field of image data reading, and in particular discloses a sign language recognition training data generation method and system based on multi-modal data enhancement. In the method, a sign language recognition component reads multi-modal data of a sample to be recognized from sampling equipment and first performs multi-modal synchronous resampling. After resampling, image quality is evaluated frame by frame and an image quality evaluation result is output; based on this result, a boundary candidate frame set is screened frame by frame to form a group of action slices of the sample to be recognized. Finally, an initial data enhancement strategy is matched to each action slice and optimized according to dynamic/static layered labeling results to obtain a data enhancement execution strategy, under which multi-modally consistent data enhancement is applied to the action slices, so that the enhancement type and intensity adapt to the quality profile and dynamic/static structure of each slice. The enhanced samples form a training data set for training the sign language recognition component.

Inventors

  • Zou Yan
  • Liang Jiaodong
  • Bian Hongli

Assignees

  • 山东开放大学 (Shandong Open University)
  • 山东特殊教育职业学院 (Shandong Special Education Vocational College)

Dates

Publication Date
2026-05-08
Application Date
2026-01-30

Claims (10)

  1. A sign language recognition training data generation method based on multi-modal data enhancement, characterized by comprising the following steps: a sign language recognition component calculates an inter-modal offset based on a multi-modal data set of a sample to be recognized and performs multi-modal synchronous resampling, and after resampling the sample to be recognized is subjected to image quality evaluation frame by frame to obtain an image quality evaluation result; a boundary candidate frame set is screened frame by frame based on the image quality evaluation result, static key segment detection is performed on the boundary candidate frame set, and the static key segments are used as correction constraints on the boundary candidate frame set to obtain an action slice group of the sample to be recognized; and an initial data enhancement strategy is matched to each action slice, each action slice is labeled hierarchically, the initial strategy is optimized based on the hierarchical labeling result to form a data enhancement execution strategy, multi-modal data enhancement is performed on the action slices of the sample to be recognized, and a training data set of the sign language recognition component is generated.
  2. The sign language recognition training data generation method based on multi-modal data enhancement according to claim 1, wherein: the multi-modal data set of the sample to be recognized comprises original image data of the sample to be recognized and auxiliary-modality data of the sample to be recognized; the original image data of the sample to be recognized is obtained by reading a sign language video with the sampling equipment and decoding the sign language video into a sequence of image frames ordered by timestamp; the auxiliary-modality data of the sample to be recognized comprises the acceleration and angular velocity of each key point in each frame of the sample to be recognized, which form a time series of auxiliary-modality data; and the key points are arranged on the main recognition objects to which the sample to be recognized belongs, the main recognition objects comprising a first main recognition object and a second main recognition object.
  3. The sign language recognition training data generation method based on multi-modal data enhancement according to claim 1, wherein the inter-modal offset is calculated as follows: an action intensity curve and an acceleration energy curve are extracted for the first main recognition object and for the second main recognition object, and each pair is sampled at the same set time step to obtain a discrete sequence pair for the first main recognition object and a discrete sequence pair for the second main recognition object; a sequence cross-correlation score is calculated from each discrete sequence pair, the time alignment offset of the first main recognition object and the time alignment offset of the second main recognition object are obtained from the cross-correlation scores, the two time alignment offsets are compared to take the maximum, and the maximum time alignment offset is recorded as the inter-modal offset of the sample to be recognized (a minimal code sketch of this calculation follows the claims).
  4. The sign language recognition training data generation method based on multi-modal data enhancement according to claim 3, wherein the multi-modal synchronous resampling specifically comprises: applying an overall translation correction to the time series of auxiliary-modality data of the sample to be recognized according to the inter-modal offset, performing interpolation resampling on the translation-corrected time series, and mapping it onto a set uniform time axis to obtain a resampled time series of the auxiliary-modality data of the sample to be recognized; and, taking the uniform time axis as a reference, determining the image frame index corresponding to each time point by nearest-frame matching, so as to align the image frame sequence to the uniform time axis and obtain a resampled image frame sequence of the sample to be recognized (see the resampling sketch following the claims).
  5. The sign language recognition training data generation method based on multi-modal data enhancement according to claim 1, wherein the frame-by-frame image quality evaluation of the sample to be recognized is specifically: extracting basic quality feature parameters of each image frame of the sample to be recognized and jointly evaluating the image quality of each image frame to obtain the image quality evaluation result; the image quality evaluation result reflects the image quality index of each image frame of the sample to be recognized.
  6. The sign language recognition training data generation method based on multi-modal data enhancement according to claim 1, wherein screening the boundary candidate frame set frame by frame based on the image quality evaluation result specifically comprises: screening from the sample to be recognized the image frames whose image quality index is greater than or equal to a predefined image quality index threshold and adding them to the boundary candidate frame set; and comparing the reference features of each image frame of the sample to be recognized against reference cue constraints, and adding an image frame to the boundary candidate frame set when its reference features satisfy any trigger condition of the reference cue constraints (a quality-threshold screening sketch follows the claims).
  7. The sign language recognition training data generation method based on multi-modal data enhancement according to claim 6, wherein the static key segment detection specifically comprises: for the boundary candidate frame set, calculating an action recognition feature quantity based on adjacent-frame differences and constructing a hold-posture judgment condition from the action recognition feature quantity; and screening from the boundary candidate frame set the boundary candidate frames that satisfy the hold-posture judgment condition, marking them as hold-posture candidate frames, merging hold-posture candidate frames that continuously satisfy the judgment condition into hold-posture time intervals according to their temporal adjacency, and recording the hold-posture time intervals as static key segments (see the static-segment sketch following the claims, which also covers the adaptive boundary probability of claim 8).
  8. The sign language recognition training data generation method based on multi-modal data enhancement according to claim 6, wherein the action slice group of the sample to be recognized is formed as follows: the boundary probability of each boundary candidate frame in the boundary candidate frame set is calculated frame by frame, a probability correction factor is obtained by matching against the image quality index of that frame, and the probability correction factor is multiplied by the boundary probability of each boundary candidate frame to obtain an adaptive boundary probability for each boundary candidate frame; and initial segmentation is performed based on the adaptive boundary probabilities to obtain an initial boundary set, and merging and/or moving and/or deletion corrections are applied to the initial boundary set under preset map constraints, with the static key segments serving as correction constraints so that no boundary is allowed to cut into a static key segment, thereby obtaining the action slice group of the sample to be recognized.
  9. The sign language recognition training data generation method based on multi-modal data enhancement according to claim 1, wherein the multi-modal data enhancement of the action slices of the sample to be recognized is specifically performed by: performing quality summarization on the action slices, calculating a slice-level quality profile from the frame-level image quality indexes within each slice, and matching an initial data enhancement mode group to each slice based on its slice-level quality profile; performing hierarchical labeling on the action slices to obtain interval labeling results for the dynamic segments, static segments and static key segments of the sample to be recognized, counting the proportion of static key segments in the sample to be recognized, and matching a quota proportion adjustment factor; and extracting the initial mode quota proportion of the initial data enhancement mode group, multiplying it by the quota proportion adjustment factor to obtain a mode quota adaptation proportion that reconfigures the initial mode quota proportion, thereby forming the data enhancement execution strategy, performing multi-modally consistent data enhancement on the action slices, and writing the enhanced multi-modal samples into the training sample set (a sketch of the quota adjustment follows the claims).
  10. A sign language recognition training data generation system based on multi-modal data enhancement, applying the sign language recognition training data generation method based on multi-modal data enhancement according to any one of claims 1 to 9, and comprising: an image quality evaluation module, used for the sign language recognition component to read the multi-modal data set of the sample to be recognized via the sampling equipment, calculate the inter-modal offset, perform multi-modal synchronous resampling, and evaluate the image quality of the sample to be recognized frame by frame after resampling to obtain the image quality evaluation result; a static key segment detection module, used for screening the boundary candidate frame set frame by frame based on the image quality evaluation result, performing static key segment detection on the boundary candidate frame set, and using the static key segments as correction constraints on the boundary candidate frame set to obtain the action slice group of the sample to be recognized; and a multi-modal data enhancement module, used for matching an initial data enhancement strategy to each action slice, performing hierarchical labeling on each action slice, optimizing the initial strategy based on the hierarchical labeling results to form the data enhancement execution strategy, performing multi-modal data enhancement on the action slices of the sample to be recognized, and generating the training data set of the sign language recognition component.
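
The following code sketches are reading aids and are not part of the patent text. First, the inter-modal offset of claim 3: a minimal NumPy sketch, assuming the action intensity and acceleration energy curves have already been extracted and sampled at the same time step; the synthetic curves and the simple normalization are illustrative assumptions, not definitions from the patent.

```python
import numpy as np

def alignment_offset(video_curve: np.ndarray, imu_curve: np.ndarray) -> int:
    """Signed lag (in samples) at which the cross-correlation of the two curves peaks."""
    v = (video_curve - video_curve.mean()) / (video_curve.std() + 1e-8)
    a = (imu_curve - imu_curve.mean()) / (imu_curve.std() + 1e-8)
    scores = np.correlate(v, a, mode="full")        # cross-correlation score for every lag
    lags = np.arange(-len(a) + 1, len(v))
    return int(lags[np.argmax(scores)])

# Toy curves for two main recognition objects (e.g. left and right hand), where the
# auxiliary-modality curve trails the video-derived curve by 3 and 5 samples.
rng = np.random.default_rng(0)
intensity_1 = rng.random(200)
energy_1 = np.roll(intensity_1, 3) + 0.05 * rng.random(200)
intensity_2 = rng.random(200)
energy_2 = np.roll(intensity_2, 5) + 0.05 * rng.random(200)

offsets = [alignment_offset(intensity_1, energy_1), alignment_offset(intensity_2, energy_2)]
inter_modal_offset = max(offsets, key=abs)          # keep the larger-magnitude offset, per claim 3
print(offsets, inter_modal_offset)
```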
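
Next, the multi-modal synchronous resampling of claim 4: a sketch assuming a one-dimensional auxiliary signal with its own timestamps and a list of image frame timestamps; the sampling rates and the 0.02 s offset in the usage example are made-up values.

```python
import numpy as np

def synchronous_resample(aux_t, aux_x, frame_t, offset, step):
    """Shift the auxiliary time axis by the inter-modal offset, interpolate the signal onto
    a uniform time axis, and pick the nearest image frame index for each time point."""
    aux_t = np.asarray(aux_t, dtype=float) + offset            # overall translation correction
    uniform_t = np.arange(aux_t[0], aux_t[-1], step)           # set uniform time axis
    aux_resampled = np.interp(uniform_t, aux_t, np.asarray(aux_x, dtype=float))
    # Nearest-frame matching: index of the frame timestamp closest to each time point.
    frame_t = np.asarray(frame_t, dtype=float)
    idx = np.clip(np.searchsorted(frame_t, uniform_t), 1, len(frame_t) - 1)
    nearest = np.where(np.abs(uniform_t - frame_t[idx - 1]) <= np.abs(frame_t[idx] - uniform_t),
                       idx - 1, idx)
    return uniform_t, aux_resampled, nearest

# Toy usage: IMU sampled at ~100 Hz, video at 25 fps, 0.02 s inter-modal offset.
aux_t = np.linspace(0.0, 2.0, 201)
aux_x = np.sin(2 * np.pi * aux_t)
frame_t = np.arange(0.0, 2.0, 0.04)
t, x, frame_idx = synchronous_resample(aux_t, aux_x, frame_t, offset=0.02, step=0.04)
```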
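
For the frame-level image quality index of claim 5 and the threshold screening of claim 6, the patent does not name the quality features; the sketch below uses a sharpness term (variance of a discrete Laplacian) plus mean brightness purely as placeholder features, and the weights and threshold are arbitrary.

```python
import numpy as np

def quality_index(frame: np.ndarray) -> float:
    """Toy per-frame quality index: sharpness (variance of a discrete Laplacian)
    combined with a brightness term; the 0.8/0.2 weighting is an assumption."""
    lap = (frame[2:, 1:-1] + frame[:-2, 1:-1] + frame[1:-1, 2:] + frame[1:-1, :-2]
           - 4 * frame[1:-1, 1:-1])
    return 0.8 * lap.var() + 0.2 * frame.mean()

def boundary_candidates(frames, threshold):
    """Claim 6, first screening rule: keep indices of frames whose quality index
    is greater than or equal to a predefined threshold."""
    return [i for i, f in enumerate(frames) if quality_index(f) >= threshold]

# Toy usage on synthetic grayscale frames with values in [0, 1].
rng = np.random.default_rng(2)
frames = [rng.random((32, 32)) for _ in range(10)]
print(boundary_candidates(frames, threshold=1.0))
```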
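
For the static key segment detection of claim 7 and the adaptive boundary probability of claim 8: here the mean absolute adjacent-frame difference stands in for the "action recognition feature quantity", and the fixed threshold, minimum segment length and quality-based correction factor are illustrative choices, not values from the patent.

```python
import numpy as np

def static_key_segments(frames, motion_thresh=2.0, min_len=3):
    """Mark frames whose mean absolute difference to the previous frame is small
    (hold posture) and merge temporally adjacent marks into (start, end) segments."""
    diffs = np.array([0.0] + [np.abs(frames[i] - frames[i - 1]).mean()
                              for i in range(1, len(frames))])
    hold = diffs < motion_thresh                      # hold-posture judgment condition
    segments, start = [], None
    for i, h in enumerate(hold):
        if h and start is None:
            start = i
        elif not h and start is not None:
            if i - start >= min_len:
                segments.append((start, i - 1))
            start = None
    if start is not None and len(hold) - start >= min_len:
        segments.append((start, len(hold) - 1))
    return segments

def adaptive_boundary_probability(boundary_prob, quality_index):
    """Claim 8: scale each boundary probability by a factor matched to frame quality
    (here simply the normalized quality index, an illustrative assumption)."""
    correction = quality_index / (quality_index.max() + 1e-8)
    return boundary_prob * correction

# Toy usage: frames 5-9 simulate a burst of motion (large adjacent-frame differences).
rng = np.random.default_rng(1)
frames = [rng.random((8, 8)) * (10 if 5 <= i <= 9 else 1) for i in range(20)]
print(static_key_segments(frames))
print(adaptive_boundary_probability(rng.random(20), rng.random(20) * 100))
```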
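
Finally, the quota adjustment of claim 9: the claim only states that the initial mode quota proportions are multiplied by a factor matched to the static key segment ratio; the matching table and mode names below are invented solely to make the arithmetic concrete.

```python
def execution_quotas(initial_quota, static_ratio):
    """Scale the initial augmentation-mode quota proportions by a factor matched to the
    static key segment ratio, then renormalize so the proportions sum to 1."""
    # Hypothetical matching table: more static content -> damp speed-perturbation modes.
    if static_ratio < 0.2:
        factors = {"speed_perturb": 1.2, "spatial_jitter": 1.0, "noise": 1.0}
    elif static_ratio < 0.5:
        factors = {"speed_perturb": 1.0, "spatial_jitter": 1.0, "noise": 1.1}
    else:
        factors = {"speed_perturb": 0.6, "spatial_jitter": 1.1, "noise": 1.2}
    adapted = {mode: p * factors.get(mode, 1.0) for mode, p in initial_quota.items()}
    total = sum(adapted.values())
    return {mode: p / total for mode, p in adapted.items()}

# Toy usage: a slice whose frames are 60% static key segments.
initial = {"speed_perturb": 0.4, "spatial_jitter": 0.3, "noise": 0.3}
print(execution_quotas(initial, static_ratio=0.6))
```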

Description

Sign language recognition training data generation method and system based on multi-modal data enhancement

Technical Field

The invention relates to the technical field of image data reading, and in particular to a sign language recognition training data generation method and system based on multi-modal data enhancement.

Background

Sign language recognition refers to a technology that acquires information such as hand actions, facial expressions and body posture using a camera or various sensors, and maps continuous gesture actions into understandable text, speech or semantic instructions through feature extraction and temporal modeling. Its characteristic is the joint representation and recognition of spatial gesture features and temporal motion patterns, enabling automatic understanding of dynamic sign language expression.

The existing sign language recognition training data generation process is generally based on multi-modal collection: sign language video/RGB frames, depth information, accelerometer and gyroscope sequences and other data collected by wearable devices are acquired synchronously; each modality is preprocessed separately (for example, video frame extraction and normalization, depth map filtering and coordinate normalization, inertial signal denoising and alignment); each modality is then augmented to expand sample diversity (for example, rotation/flipping, speed perturbation and segment cutting of the video, noise and background perturbation of the depth data, random perturbation and smooth deformation of the inertial sequences); the augmented multi-modal samples are fused into unified training samples according to timestamps or an alignment strategy (early, intermediate or late fusion is possible); and finally a labeled multi-modal training data set is formed, which is used to train a recognition model that learns spatial characteristics and temporal dependence simultaneously, so as to improve the generalization and robustness of the model under different shooting angles, illumination, occlusion, speed differences and sensor noise.

For example, the Chinese patent application with publication number CN113688685A discloses a sign language recognition method based on an interactive scene, comprising the following steps: constructing a dialogue text database of the interactive scene; constructing a sign language video database of the interactive scene; training an interactive-scene dialogue prediction model and a sign language video recognition network; obtaining the prediction results of the current dialogue and the keyword information recognized from the sign language video through the trained models; combining the dialogue templates of all prediction results with the sign language video recognition keywords by means of a similarity matching algorithm to obtain sign language keyword prediction sentences; performing cosine similarity calculation between the sign language keyword prediction sentences and the language-model prediction sentences; and returning the result with the highest similarity as the best match to the current sign language recognition keywords among the dialogue prediction results.
As another example, Chinese patent publication number CN105205449B discloses a sign language recognition method based on deep learning, comprising the steps of (1) dividing a database sample set, (2) collecting image blocks, (3) whitening the data, (4) training a sparse auto-encoding network, (5) obtaining convolution feature maps, (6) obtaining pooling feature maps, (7) training a classifier, and (8) testing the classification results.

The core idea of existing sign language recognition training data generation schemes is "controlled conditions combined with manual rules": sample labeling relies on collection organized around preset fixed templates, or labeling results are even spliced directly according to the similarity between keywords and templates, which only superficially expands the data scale; training segment extraction and sample enhancement rely on manual spatio-temporal rules such as "pause-based boundary judgment", "static frame deletion" and "fixed-window cutting". The core purpose of such rules is to simplify the image data extraction flow: in a controlled environment with fixed demonstrators, a single camera position, a clean background and a normal action rhythm, the picture is stable, the hands rarely leave the frame and interference is limited, so the rule assumptions hold, the basic accuracy of image extraction can be ensured, and the associated risks rarely appear. However, in practical application, sign language ide