CN-121600602-B - Skeleton-based semantic enhancement pre-training sign language understanding framework method and system
Abstract
The application provides a skeleton-based semantic-enhancement pre-training sign language understanding framework method and system, relating to the technical field of sign language recognition. The method comprises: obtaining sign language video data, a skeleton sequence matched with the sign language video data, and text data; extracting a skeleton key-point sequence from the sign language video data and modeling it to form skeleton features; segmenting the text; inputting the skeleton features and the segmented text into a fusion network in a pre-training stage to generate bidirectionally enhanced features; aligning the bidirectionally enhanced features at two semantic levels to obtain a global similarity and a local similarity; calculating contrastive losses based on the similarities and obtaining a hierarchical loss by coordinating their weights with a balance parameter; executing a matching task and a language modeling task to obtain the corresponding losses; weighting and combining the three types of loss to form a pre-training total loss and adjusting the parameters to complete training; and finally, in a fine-tuning stage, optimizing part of the parameters according to the specific task type to realize enhanced understanding of sign language semantics. The application improves the accuracy of sign language understanding.
Inventors
- WEI CHENHAO
- YUAN HAIJIE
Assignees
- 小哆智能科技(北京)有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-29
Claims (9)
- 1. A method of skeleton-based semantic-enhancement pre-training of a sign language understanding framework, comprising: acquiring sign language video data, a matched skeleton sequence, and text data; extracting a skeleton key-point sequence from the sign language video data, modeling the skeleton key-point sequence with a spatio-temporal graph convolution network to form skeleton features, and performing word segmentation on the text data; in a pre-training stage, inputting the skeleton features and the segmented text data into a sign-language-aware early fusion network, generating text-guided visual features and vision-guided text features through a cross-attention mechanism, and performing dual-level semantic alignment on the visual features and the text features to determine a global similarity and a local similarity, wherein the sign-language-aware early fusion network comprises a multi-layer sign language encoder and a multi-layer text encoder, a first type of token being extracted from the enhanced visual features output by the last layer of the sign language encoder and a second type of token from the enhanced text features output by the last layer of the text encoder; determining a corresponding global contrastive loss and local contrastive loss based on the global similarity and the local similarity, and coordinating the weights of the global contrastive loss and the local contrastive loss through a preset balance parameter to obtain a hierarchical loss; based on the hierarchical loss, executing a sign-language-text matching task and a language modeling task to obtain a matching loss and a language modeling loss respectively, weighting and combining the hierarchical loss, the matching loss and the language modeling loss to obtain a pre-training total loss, and cooperatively adjusting the model parameters according to the pre-training total loss to complete the semantic-enhancement pre-training process; and in a fine-tuning stage, based on the adjusted model parameters and in combination with a target task type, fine-tuning part of the parameters of the semantic-enhancement pre-training sign language understanding framework to realize enhanced understanding of sign language semantics; wherein generating the local similarity based on the enhanced visual features output by the last layer of the sign language encoder and the enhanced text features output by the last layer of the text encoder comprises: inputting the enhanced visual features output by the last layer of the sign language encoder and the enhanced text features output by the last layer of the text encoder into a preset aggregator, extracting a plurality of word tokens from the enhanced text features output by the last layer of the text encoder, and extracting a plurality of visual feature tokens from the enhanced visual features output by the last layer of the sign language encoder; determining a clustering index for each word token; merging the feature vectors of word tokens that share the same clustering index to generate semantic clustering results; calculating the cosine similarity between each visual feature token and all semantic clustering results, and selecting the maximum cosine similarity for each visual feature token; and performing a weighted summation over the maximum cosine similarities of all visual feature tokens to obtain the local similarity.
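The local-similarity computation recited in claim 1 can be sketched as follows. The patent does not disclose the aggregator's clustering method or the summation weights, so merging tokens of the same cluster index by their mean and defaulting to uniform weights are assumptions of this sketch:

```python
import numpy as np

def local_similarity(visual_tokens, word_tokens, cluster_ids, weights=None):
    """Sketch of the local-similarity step in claim 1.

    visual_tokens: (V, d) visual feature tokens from the sign language encoder
    word_tokens:   (W, d) word tokens from the text encoder
    cluster_ids:   (W,)   clustering index of each word token
    weights:       (V,)   optional weights for the final weighted sum
    """
    # Merge word tokens that share a clustering index into semantic clusters
    # (mean-pooling is an assumption; the patent leaves the merge unspecified).
    clusters = np.stack([word_tokens[cluster_ids == c].mean(axis=0)
                         for c in np.unique(cluster_ids)])
    # Cosine similarity between every visual token and every semantic cluster.
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    c = clusters / np.linalg.norm(clusters, axis=1, keepdims=True)
    sim = v @ c.T                        # (V, C) cosine similarities
    best = sim.max(axis=1)               # max similarity per visual token
    if weights is None:                  # uniform weights by default
        weights = np.full(len(best), 1.0 / len(best))
    return float(np.dot(weights, best))  # weighted sum -> local similarity
```

In a trained model the cluster indices would come from the preset aggregator and the weights could themselves be learned; here they are fixed for illustration.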
- 2. The method of claim 1, wherein inputting the skeleton features and the segmented text data into the sign-language-aware early fusion network, generating text-guided visual features and vision-guided text features via a cross-attention mechanism, and performing dual-level semantic alignment of the visual features and the text features to determine the global and local similarities, comprises: in each predetermined layer of the sign language encoder, calculating a first attention weight of text to vision using the cross-attention mechanism, and generating text-guided visual features based on the first attention weight; in each predetermined layer of the text encoder, calculating a second attention weight of vision to text using the cross-attention mechanism, and generating vision-guided text features based on the second attention weight; and performing global semantic alignment and local semantic alignment on the visual features and the text features respectively to generate the corresponding global similarity and local similarity.
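The cross-guidance in claim 2 can be sketched with a single-head cross-attention in which each modality's tokens query the other modality. A real implementation would use learned query/key/value projections and multiple heads; omitting them here is a simplifying assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head cross-attention: queries attend to the other modality."""
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d))  # attention weights
    return attn @ keys_values                             # guided features

# Text-guided visual features: visual tokens query the word tokens
# (text-to-vision attention weights), and vice versa for the text side.
visual = np.random.rand(4, 8)   # 4 visual tokens, feature dim 8
text = np.random.rand(3, 8)     # 3 word tokens, feature dim 8
text_guided_visual = cross_attention(visual, text)   # shape (4, 8)
vision_guided_text = cross_attention(text, visual)   # shape (3, 8)
```

Per claim 3, each guided feature would then be added residually to the layer's original features before being passed to the next encoder layer.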
- 3. The method of claim 2, wherein performing the global semantic alignment and the local semantic alignment on the visual features and the text features respectively comprises: adding the text-guided visual features to the original visual features of the current layer to obtain enhanced visual features, and adding the vision-guided text features to the original text features of the current layer to obtain enhanced text features; and inputting the enhanced visual features and the enhanced text features to the next layer of the sign language encoder and the next layer of the text encoder respectively for processing, until all predetermined layers have been processed.
- 4. The method of claim 1, wherein extracting a skeleton key-point sequence from the sign language video data and modeling the skeleton key-point sequence with a spatio-temporal graph convolution network to form skeleton features comprises: acquiring a skeleton key-point sequence for each target body part from the sign language video data; for the skeleton key-point sequence of each part, using the spatio-temporal graph convolution network to perform joint-relation modeling in the spatial dimension and motion-trajectory modeling in the temporal dimension respectively, obtaining spatio-temporal dynamic features of each part; mapping the spatio-temporal dynamic features of each part into a unified dimensional space through a linear transformation to generate corresponding compact features; and concatenating the compact features of all parts along the feature dimension to form the skeleton features.
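The per-part pipeline of claim 4 can be sketched as below. A real spatio-temporal graph convolution network stacks learned spatial graph convolutions with temporal convolutions; here a single normalized-adjacency multiply stands in for the spatial step and a temporal mean for the motion summary, both assumptions of this sketch:

```python
import numpy as np

def part_feature(seq, adj, proj):
    """seq:  (T, J, c) key-point sequence of one body part over T frames;
    adj:  (J, J) joint adjacency of that part;
    proj: (c, d) linear map into the unified dimension space."""
    a = adj + np.eye(adj.shape[0])           # add self-loops
    a = a / a.sum(axis=1, keepdims=True)     # row-normalize the adjacency
    spatial = a @ seq                        # joint-relation modeling (space)
    temporal = spatial.mean(axis=0)          # crude motion summary over time
    return (temporal @ proj).reshape(-1)     # compact per-part feature

rng = np.random.default_rng(0)
parts = {"hand": (5, 3), "body": (7, 3)}     # (joints, channels) per part
feats = []
for name, (j, c) in parts.items():
    seq = rng.random((10, j, c))             # 10 frames of key points
    adj = (rng.random((j, j)) > 0.5).astype(float)
    proj = rng.random((c, 4))                # unified dimension d = 4
    feats.append(part_feature(seq, adj, proj))
skeleton_feature = np.concatenate(feats)     # concat along the feature dim
```

The part names, joint counts, and the unified dimension of 4 are all hypothetical values chosen for illustration.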
- 5. The method according to claim 1, wherein executing the sign-language-text matching task and the language modeling task based on the hierarchical loss to obtain the matching loss and the language modeling loss respectively, weighting and combining the hierarchical loss, the matching loss and the language modeling loss to obtain the pre-training total loss, and cooperatively adjusting the model parameters according to the pre-training total loss, comprises: based on the hierarchical loss, executing the matching task on a target matching path, wherein the enhanced text features output by the last layer of the text encoder and the enhanced visual features output by the last layer of the sign language encoder are input into multi-layer cross-attention modules for deep interaction to generate a matching probability, and the matching loss is calculated based on the matching probability; executing the language modeling task on a target modeling path, wherein the enhanced text features output by the last layer of the text encoder are input into a multi-layer self-attention module for autoregressive modeling to generate text-token prediction results, and the language modeling loss is calculated based on the text-token prediction results; linearly combining the hierarchical loss, the matching loss and the language modeling loss according to preset first, second and third weight coefficients to obtain the pre-training total loss; based on the pre-training total loss, simultaneously adjusting all trainable parameters of the sign-language-aware early fusion network, the target matching path and the target modeling path; and iteratively updating all trainable parameters over multiple rounds until the pre-training total loss reaches a preset convergence threshold.
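The loss combination and convergence check of claim 5 reduce to a weighted sum monitored against a threshold. The weight coefficients and threshold below are hypothetical values; the patent leaves them unspecified:

```python
def pretraining_total_loss(hier_loss, match_loss, lm_loss,
                           w1=1.0, w2=1.0, w3=1.0):
    """Linear combination of the three losses under preset weight
    coefficients (w1, w2, w3 are placeholder values)."""
    return w1 * hier_loss + w2 * match_loss + w3 * lm_loss

# Iterate until the total loss falls below a preset convergence threshold.
threshold = 0.5
history = [(0.9, 0.4, 0.6), (0.5, 0.2, 0.3), (0.2, 0.1, 0.1)]
for hier, match, lm in history:
    total = pretraining_total_loss(hier, match, lm, w1=0.5, w2=0.25, w3=0.25)
    if total < threshold:
        break  # convergence reached; stop the iterative update
```

In practice the three losses would come from the live forward pass of each training round rather than a fixed history list.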
- 6. The method according to claim 1, wherein determining the corresponding global contrastive loss and local contrastive loss based on the global similarity and the local similarity, and coordinating the weights of the global contrastive loss and the local contrastive loss through the preset balance parameter to obtain the hierarchical loss, includes: according to the global similarity, calculating a first global contrastive loss in the sign-language-to-text direction and a second global contrastive loss in the text-to-sign-language direction respectively; according to the local similarity, calculating a first local contrastive loss in the sign-language-to-text direction and a second local contrastive loss in the text-to-sign-language direction respectively; combining the first global contrastive loss and the second global contrastive loss into a global contrastive loss, and combining the first local contrastive loss and the second local contrastive loss into a local contrastive loss; distributing weights to the global contrastive loss and the local contrastive loss through the preset balance parameter; and adding the weighted global contrastive loss and the weighted local contrastive loss to obtain the hierarchical loss.
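The bidirectional contrastive structure of claim 6 can be sketched with an InfoNCE-style loss over a pairwise similarity matrix: each direction reads the matrix along one axis, and the balance parameter alpha coordinates the global and local levels. The temperature, the averaging of the two directions, and alpha's default are assumptions of this sketch:

```python
import numpy as np

def info_nce(sim, tau=0.07):
    """Contrastive loss over an (N, N) similarity matrix whose diagonal
    entries are the matched sign-language/text pairs."""
    logits = sim / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))  # negative log-likelihood of positives

def hierarchy_loss(global_sim, local_sim, alpha=0.5):
    """alpha is the preset balance parameter coordinating the two levels."""
    # sign-to-text (rows) and text-to-sign (transpose) directions, combined.
    g = 0.5 * (info_nce(global_sim) + info_nce(global_sim.T))
    l = 0.5 * (info_nce(local_sim) + info_nce(local_sim.T))
    return alpha * g + (1.0 - alpha) * l  # weighted sum -> hierarchical loss
```

With a perfectly aligned batch (identity similarity matrices) the loss approaches zero, which is the intended training target.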
- 7. A system for skeleton-based semantic-enhancement pre-training of a sign language understanding framework, comprising: an acquisition module for acquiring sign language video data, a matched skeleton sequence, and text data; an extraction module for extracting a skeleton key-point sequence from the sign language video data, modeling the skeleton key-point sequence with a spatio-temporal graph convolution network to form skeleton features, and performing word segmentation on the text data; an input module for inputting, in a pre-training stage, the skeleton features and the segmented text data into a sign-language-aware early fusion network, generating text-guided visual features and vision-guided text features through a cross-attention mechanism, and performing dual-level semantic alignment on the visual features and the text features to determine a global similarity and a local similarity, wherein the sign-language-aware early fusion network comprises a multi-layer sign language encoder and a multi-layer text encoder, a first type of token being extracted from the enhanced visual features output by the last layer of the sign language encoder and a second type of token from the enhanced text features output by the last layer of the text encoder; a determining module for determining a corresponding global contrastive loss and local contrastive loss based on the global similarity and the local similarity, and coordinating the weights of the global contrastive loss and the local contrastive loss through a preset balance parameter to obtain a hierarchical loss; a combination module for executing a sign-language-text matching task and a language modeling task based on the hierarchical loss to obtain a matching loss and a language modeling loss respectively, weighting and combining the hierarchical loss, the matching loss and the language modeling loss to obtain a pre-training total loss, and cooperatively adjusting the model parameters according to the pre-training total loss to complete the semantic-enhancement pre-training process; and a fine-tuning module for fine-tuning, in a fine-tuning stage, part of the parameters of the semantic-enhancement pre-training sign language understanding framework based on the adjusted model parameters and the target task type, to realize enhanced understanding of sign language semantics; wherein generating the local similarity based on the enhanced visual features output by the last layer of the sign language encoder and the enhanced text features output by the last layer of the text encoder comprises: inputting the enhanced visual features output by the last layer of the sign language encoder and the enhanced text features output by the last layer of the text encoder into a preset aggregator, extracting a plurality of word tokens from the enhanced text features output by the last layer of the text encoder, and extracting a plurality of visual feature tokens from the enhanced visual features output by the last layer of the sign language encoder; determining a clustering index for each word token; merging the feature vectors of word tokens that share the same clustering index to generate semantic clustering results; calculating the cosine similarity between each visual feature token and all semantic clustering results, and selecting the maximum cosine similarity for each visual feature token; and performing a weighted summation over the maximum cosine similarities of all visual feature tokens to obtain the local similarity.
- 8. An electronic device, comprising: a memory for storing a computer program; and a processor for implementing, when executing the computer program, the steps of the skeleton-based semantic-enhancement pre-training sign language understanding framework method of any one of claims 1 to 6.
- 9. A computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements the skeleton-based semantic-enhancement pre-training sign language understanding framework method of any one of claims 1 to 6.
Description
Skeleton-based semantic enhancement pre-training sign language understanding framework method and system
Technical Field
The application relates to the technical field of sign language recognition, and in particular to a skeleton-based semantic-enhancement pre-training sign language understanding framework method and system.
Background
Skeleton-based sign language understanding technology recognizes gesture meanings by analyzing the motion trajectories of human-body key points, and has broad prospects for building barrier-free communication environments and intelligent human-machine interaction systems. By extracting skeleton information from video, background interference can be avoided and the analysis can focus on the motion characteristics of core parts such as the hands and body, providing a technical basis for cross-language sign language translation and educational applications. In the prior art, skeleton-based sign language understanding methods generally process joint coordinate sequences with temporal modeling networks and perform cross-modal learning in combination with text information. For example, some schemes use graph convolution networks to capture the spatial relationships between skeleton joints and then model temporal dynamics with recurrent neural networks; others introduce attention mechanisms to correlate visual features with textual descriptions, using contrastive learning to pull the features of the two modalities closer during the pre-training stage. However, these methods are limited in their collaborative modeling of global semantics and local details, which affects a model's accuracy in understanding complex sign language statements. The prior art therefore suffers from insufficient depth of semantic understanding.
Disclosure of Invention
The application provides a skeleton-based semantic-enhancement pre-training sign language understanding framework method and system, intended to solve the problems of low accuracy and insufficient semantic coherence in sign language understanding in the prior art. To solve the above technical problem, in a first aspect the present application provides a skeleton-based semantic-enhancement pre-training sign language understanding framework method, which includes: acquiring sign language video data, a matched skeleton sequence, and text data; extracting a skeleton key-point sequence from the sign language video data, modeling the skeleton key-point sequence with a spatio-temporal graph convolution network to form skeleton features, and performing word segmentation on the text data; in the pre-training stage, inputting the skeleton features and the segmented text data into a sign-language-aware early fusion network, generating text-guided visual features and vision-guided text features through a cross-attention mechanism, and performing dual-level semantic alignment on the visual features and the text features to determine a global similarity and a local similarity; determining a corresponding global contrastive loss and local contrastive loss based on the global similarity and the local similarity, and coordinating the weights of the global contrastive loss and the local contrastive loss through a preset balance parameter to obtain a hierarchical loss; based on the hierarchical loss, executing a sign-language-text matching task and a language modeling task to obtain a matching loss and a language modeling loss respectively, weighting and combining the hierarchical loss, the matching loss and the language modeling loss to obtain a pre-training total loss, and cooperatively adjusting the model parameters according to the pre-training total loss to complete the semantic-enhancement pre-training process; and in the fine-tuning stage, based on the adjusted model parameters and in combination with the target task type, fine-tuning part of the parameters of the semantic-enhancement pre-training sign language understanding framework to realize enhanced understanding of sign language semantics. Optionally, inputting the skeleton features and the segmented text data into the sign-language-aware early fusion network, generating text-guided visual features and vision-guided text features through a cross-attention mechanism, and performing dual-level semantic alignment on the visual features and the text features to determine the global similarity and the local similarity includes: inputting the skeleton features and the segmented text data into the sign-language-aware early fusion network, wherein the sign-language-aware early fusion network comprises a multi-layer sign language encoder and a multi-layer text encoder; in each predetermined layer of the sign language encoder, calculating a first attention weight of text to vision using a cross-attention mechanism, generating a text-