
CN-122024324-A - Dynamic sign language time sequence modeling method and system based on attention mechanism

CN 122024324 A

Abstract

The application provides a dynamic sign language time sequence modeling method and system based on an attention mechanism, belonging to the technical fields of sign language recognition and human-computer interaction, and aimed at solving the problems in the related art of poor adaptability to sign language recognition scenes, insufficient personalization, and inaccurate capture of semantic features. The method synchronously collects hand spatial coordinates and inertial data and fuses them to generate a dual-domain time sequence tensor; constructs a four-dimensional fusion feature by combining rhythm features and scene priors; enhances semantic expression through three-level attention weighting; achieves scene adaptation through hierarchical meta-parameters; and achieves personalized optimization through hand feature fingerprints and full-link parameter feedback, thereby realizing accurate modeling of sign language semantics together with scene adaptation and individual adaptation.

Inventors

  • Wu Luying
  • Guo Min
  • Dang Zhimin
  • Meng Ziyi
  • Sun Zongcai

Assignees

  • 山东特殊教育职业学院

Dates

Publication Date
2026-05-12
Application Date
2026-02-03

Claims (10)

  1. A dynamic sign language time sequence modeling method based on an attention mechanism, characterized by: synchronously acquiring image data and inertial measurement data of dynamic sign language, and generating a dual-domain time sequence tensor adapted to the amplitude and speed characteristics of sign language actions; fusing the dual-domain time sequence tensor, a three-dimensional rhythm feature representing sign language semantic turns, and a scene prior feature adapted to sign language scene semantic constraints, to form a four-dimensional fusion feature; performing three-level attention weighting at the frame, segment, and sentence levels, matched to the semantic hierarchy of sign language, based on the four-dimensional fusion feature, to generate a semantic enhancement feature map; and loading three-level nested meta-parameters of major class, subdivision, and semantic unit, adapted to the hierarchical characteristics of the sign language scene, to perform scene adaptation on the semantic enhancement feature map.
  2. The method of claim 1, wherein fusing the dual-domain time sequence tensor, the three-dimensional rhythm feature, and the scene prior feature to form the four-dimensional fusion feature comprises: projecting the dual-domain time sequence tensor, the three-dimensional rhythm feature, and the scene prior feature into a preset high-dimensional space; computing attention associations that mine the internal correlations among sign language action, rhythm, and scene, and obtaining association weights among the multi-source features; and outputting dynamic coefficients through a dynamic gating adjustment mechanism that, according to the contributions of semantic turning frames, rhythm features, and scene features and to cross-scene switching, adjusts the contribution of the three feature types across different frames and scenes, completing construction of the four-dimensional fusion feature.
  3. The attention-mechanism-based dynamic sign language time sequence modeling method of claim 1, wherein performing the frame, segment, and sentence three-level attention weighting based on the four-dimensional fusion feature comprises: calculating a rhythm mutation degree representing sign language semantic turns from the mutation parameter in the three-dimensional rhythm feature; identifying semantic turning frames, such as sign language negations and sentence pauses, according to the rhythm mutation degree; and using the rhythm mutation degree as an adjustment factor for the frame-level attention weight, dynamically increasing the attention weight of semantic turning frames.
  4. The attention-mechanism-based dynamic sign language time sequence modeling method of claim 1, wherein performing the frame, segment, and sentence three-level attention weighting based on the four-dimensional fusion feature comprises: calculating a rhythm matching coefficient between the current action segment and the standard sign language rhythm of the corresponding scene, to ensure normalization of the sign language action; calculating an emotion rhythm coefficient based on statistical parameters of the segment's rhythm trend, to capture sign language emotional semantics; calculating a scene adaptation coefficient between the segment features and scene-specific semantic unit features, to constrain sign language contextual semantics; and outputting dynamic weights through a gated dynamic fusion mechanism that adjusts the contributions of the rhythm matching coefficient, the emotion rhythm coefficient, and the scene adaptation coefficient, generating the segment-level attention weights.
  5. The attention-mechanism-based dynamic sign language time sequence modeling method of claim 1, further comprising, before performing the frame, segment, and sentence three-level attention weighting: calculating the local rhythm density of each frame based on the three-dimensional rhythm feature, to describe the clustering characteristics of sign language action rhythm; setting an adaptive density threshold matched to the variation pattern of sign language rhythm, and selecting density peak points as action segment cluster centers; and dividing action segments by taking the midpoints between adjacent cluster centers as segment boundaries, combined with sign language action continuity constraints.
  6. The attention-mechanism-based dynamic sign language time sequence modeling method of claim 1, wherein loading the major class, subdivision, and semantic unit three-level nested meta-parameters for the semantic enhancement feature map comprises: constructing a global, major class, subdivision, and semantic unit four-level nested meta-parameter system adapted to the hierarchical characteristics of the sign language scene, using a storage structure that combines base parameters and incremental parameters; fine-tuning the meta-parameters online through a parameter evolution mechanism, based on a small support set of sign language action features; and, when crossing scenes, weighting and fusing the meta-parameters of different scenes based on the similarity of sign language semantic units.
  7. The attention-mechanism-based dynamic sign language time sequence modeling method of claim 1, further comprising: collecting a static image of the user's hand, extracting hand physiological features directly related to sign language actions, and generating a hand feature fingerprint; constructing a five-level storage architecture organized by user fingerprint, subdivision scene, semantic unit, and parameter type, to store personalized adaptation data; and matching personalized parameters through a fast indexing mechanism using the hand feature fingerprint as the index.
  8. The attention-mechanism-based dynamic sign language time sequence modeling method of claim 7, further comprising: updating a rhythm baseline that adapts to slow changes in the user's action habits, using an exponential moving average algorithm; introducing a historical loss constraint, and updating attention weights and meta-parameters with incremental step sizes; and feeding the updated personalized parameters back to the feature layer, the attention layer, and the meta-parameter layer respectively, realizing full-link parameter feedback.
  9. The attention-mechanism-based dynamic sign language time sequence modeling method of claim 1, further comprising, after generating the semantic enhancement feature map: constructing a dynamic sign language spatio-temporal graph, with the hand key points at the core of the sign language action as nodes, the physical connections among key points as edges, and consecutive frames linked in series; extracting sign language spatio-temporal features through a spatial graph convolution branch and a temporal convolution branch; and fusing the semantic enhancement feature map, the spatio-temporal features, and the rhythm alignment result to compute a personalized semantic matching similarity.
  10. A system for implementing the attention-mechanism-based dynamic sign language time sequence modeling method according to any one of claims 1 to 9, comprising a data acquisition module, a collaborative computing module, a distributed intelligent storage module, and a control module, the modules linked in real time through a data bus; the data acquisition module comprises a depth camera, an inertial measurement sensor, and an FPGA synchronization control unit, the FPGA synchronization control unit integrating a three-dimensional rhythm calculation unit adapted to sign language data preprocessing, a hand fingerprint extraction unit, and a parallel spatio-temporal graph node preprocessing unit; the collaborative computing module comprises a GPU, a CPU, and an integrated parameter scheduling center, the GPU providing parallel computing cores for spatio-temporal graph convolution and attention fusion adapted to sign language spatio-temporal modeling; the distributed intelligent storage module comprises a high-speed SSD cache and NAS storage, and builds an integrated index over user fingerprints, subdivision scenes, semantic units, and parameter types, adapted to the requirements of sign language personalization and scene change; and the control module executes global collaborative loss function calculation, and coordinates the modules to complete full-link closed-loop control of data acquisition, modeling, matching, and updating.
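Claims 2 and 4 both describe a gated dynamic fusion mechanism that adjusts the per-frame contribution of the three feature streams. The following is only an illustrative sketch, not the patent's implementation: it assumes each stream has already been projected to a common dimension, and the gate (a softmax over per-stream mean activations) is a stand-in for the learned gating network the claims refer to.

```python
import numpy as np

def gated_fusion(action, rhythm, scene):
    """Fuse three projected feature streams (each T x D) with a
    per-frame softmax gate, in the spirit of claims 2 and 4.
    The gate parameterization here is an illustrative assumption."""
    streams = np.stack([action, rhythm, scene], axis=0)   # (3, T, D)
    # Gate logits: mean activation of each stream per frame, a
    # stand-in for the learned gating network in the claims.
    logits = streams.mean(axis=2)                         # (3, T)
    exp = np.exp(logits - logits.max(axis=0, keepdims=True))
    weights = exp / exp.sum(axis=0, keepdims=True)        # columns sum to 1
    fused = (weights[:, :, None] * streams).sum(axis=0)   # (T, D)
    return fused, weights
```

The per-frame weights let a rhythm-heavy frame (for example, a semantic turn) draw more of its fused representation from the rhythm stream, while scene-stable frames lean on the action stream.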
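Claim 3 raises frame-level attention at rhythm mutations, and claim 8 updates a per-user rhythm baseline with an exponential moving average. A minimal combined sketch follows; the hyper-parameters `alpha` and `beta` and the mutation measure (absolute first difference) are illustrative assumptions, not values from the patent.

```python
import numpy as np

def frame_attention(scores, rhythm, baseline, alpha=0.1, beta=1.0):
    """Boost frame-level attention at rhythm mutations (claim 3) and
    EMA-update a per-user rhythm baseline (claim 8).
    scores, rhythm: shape (T,); baseline: scalar."""
    # Rhythm mutation degree: absolute change between adjacent frames.
    mutation = np.abs(np.diff(rhythm, prepend=rhythm[0]))
    # Raise the weight of semantic turning frames, then normalize.
    adjusted = scores * (1.0 + beta * mutation)
    weights = adjusted / adjusted.sum()
    # Exponential moving average tracks slow drift in the user's habits.
    new_baseline = (1 - alpha) * baseline + alpha * rhythm.mean()
    return weights, new_baseline
```

With uniform input scores, a frame where the rhythm jumps receives a strictly larger attention weight than its steady neighbors, which is the behavior claim 3 asks for.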
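Claim 5 segments the frame sequence by local rhythm density: density peaks above an adaptive threshold become cluster centers, and midpoints between adjacent centers become segment boundaries. The sketch below follows that recipe; the threshold rule (mean plus one standard deviation) is an illustrative assumption.

```python
import numpy as np

def segment_by_rhythm_density(density, threshold=None):
    """Split T frames into action segments as in claim 5.
    density: per-frame local rhythm density, shape (T,).
    Returns a list of (start, end) half-open frame spans."""
    T = len(density)
    if threshold is None:
        # Adaptive threshold: an assumed mean-plus-std rule.
        threshold = density.mean() + density.std()
    # Local maxima above the threshold are cluster centers.
    centers = [t for t in range(1, T - 1)
               if density[t] >= threshold
               and density[t] >= density[t - 1]
               and density[t] >= density[t + 1]]
    # Boundaries at midpoints between adjacent centers.
    bounds = [0] + [(a + b) // 2 for a, b in zip(centers, centers[1:])] + [T]
    return list(zip(bounds, bounds[1:]))
```

Each resulting span contains exactly one density peak, so every segment is organized around one burst of sign language action, with boundaries in the low-density gaps between bursts.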

Description

Dynamic sign language time sequence modeling method and system based on attention mechanism

Technical Field

The application relates to the field of sign language recognition and time sequence modeling, and in particular to a dynamic sign language time sequence modeling method and system based on an attention mechanism.

Background

With the development of barrier-free communication technology, dynamic sign language time sequence modeling has become a core step of sign language recognition: its modeling accuracy and adaptability directly determine the accuracy of sign language semantic understanding, and it has important application value for barrier-free interaction in scenarios such as campuses, communities, and government service halls. Current dynamic sign language time sequence modeling relies on computer vision and time series analysis, extracting image features of sign language actions and modeling their temporal correlations, and is gradually evolving toward multi-feature fusion and attention-based modeling. In the prior art, dynamic sign language time sequence modeling mostly adopts temporal modeling of single action features, or directly transfers a general-purpose temporal attention mechanism; some schemes attempt to introduce simple scene features as auxiliary inputs.
However, the prior art still has several defects. First, it ignores the characteristics peculiar to sign language, namely the strong binding between action and rhythm and the synchronization of semantics with rhythm mutations; modeling with action features alone easily misses core semantic information. Second, a general attention mechanism cannot adapt to the heterogeneity of sign language's multi-source features (action, rhythm, and scene), so feature fusion is poor. Third, a personalized adaptation mechanism for differences in users' action habits is lacking, and parameter faults arise when matching across scenes. Fourth, semantic matching focuses only on temporal features and ignores the spatio-temporal nature of sign language actions, so multi-user adaptation accuracy is low. Fifth, existing hardware systems are mostly built from general-purpose computer vision hardware that cannot meet the real-time and parallel computing requirements of sign language modeling, so hardware and algorithms cooperate poorly. These drawbacks leave existing modeling methods with insufficient accuracy and limited applicable scenarios, falling short of the requirements of practical barrier-free communication; a dynamic time sequence modeling scheme adapted to the specific characteristics of sign language and combining personalization with multi-scene support is therefore needed.

Disclosure of Invention

The application provides a dynamic sign language time sequence modeling method and system based on an attention mechanism, which can solve the problems of insufficient precision, poor adaptability to multiple users and scenes, and weak hardware-algorithm cooperation in conventional dynamic sign language time sequence modeling, realizing accurate, efficient, and personalized dynamic sign language time sequence modeling.
In a first aspect, the application provides a dynamic sign language time sequence modeling method based on an attention mechanism. The method comprises: synchronously collecting image data and inertial measurement data of dynamic sign language, and generating a dual-domain time sequence tensor adapted to the amplitude and speed characteristics of sign language actions; fusing the dual-domain time sequence tensor, three-dimensional rhythm features representing sign language semantic turns, and scene prior features adapted to sign language scene semantic constraints, to form a four-dimensional fusion feature; performing three-level attention weighting at the frame, segment, and sentence levels, matched to the semantic hierarchy of sign language, based on the four-dimensional fusion feature, to generate a semantic enhancement feature map; and loading three-level nested meta-parameters of major class, subdivision, and semantic unit to perform scene adaptation on the semantic enhancement feature map. With this technical scheme, the multi-source features of sign language action, rhythm, and scene are systematically integrated into time sequence modeling, and hierarchical attention matches the semantic hierarchy of sign language, effectively adapting to the rules of sign language semantic expression, improving the accuracy of dynamic sign language time sequence modeling, and providing a reliable basis for subsequent semantic understanding. Further, the method
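The frame, segment, and sentence weighting described above can be sketched as nested attention pooling: frames are pooled into segment vectors, and segment vectors into one sentence vector. This is a minimal illustration only; the scalar scoring function (mean activation) stands in for the learned attention layers of the method, and the segment boundaries are assumed given (for example, by the rhythm-density segmentation of claim 5).

```python
import numpy as np

def attend(features):
    """Pool an (N, D) block into one D-vector with scalar attention.
    The mean-activation score is an illustrative stand-in for a
    learned scoring layer."""
    scores = features.mean(axis=1)
    e = np.exp(scores - scores.max())
    weights = e / e.sum()          # attention weights, sum to 1
    return weights @ features      # weighted pooling, shape (D,)

def hierarchical_pool(frames, segment_bounds):
    """frames: (T, D); segment_bounds: list of (start, end) spans.
    Frame -> segment -> sentence pooling, mirroring the three levels."""
    segments = np.stack([attend(frames[s:e]) for s, e in segment_bounds])
    return attend(segments)        # sentence-level vector, shape (D,)
```

In the full method each level's weights would additionally be modulated by the rhythm mutation degree (frame level) and the matching, emotion, and scene coefficients (segment level) described in the claims.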