CN-122025107-A - Autism stereotyped behavior recognition system based on a lightweight mask-enhanced spatio-temporal prompt framework
Abstract
The invention discloses an autism stereotyped behavior recognition system based on a lightweight mask-enhanced spatio-temporal prompt framework, applied to the field of automatic recognition and analysis of autism stereotyped behaviors. The system comprises a visual feature extraction module, a dilated-convolution temporal modeling module, an instance-category interaction module and an action recognition module. The visual feature extraction module generates instance-level representations in a semantic set based on a mask-guided spatial enhancement mechanism; the dilated-convolution temporal modeling module effectively models the temporal dependency between video frames based on the lightweight dilated temporal convolution network DTCN; the instance-category interaction module fuses multi-stage information from instance to text based on the visual-instance and text-category interaction network ICIM; and the action recognition module adopts the hidden state of the [EOS] token as the sentence-level representation of the text description and introduces a text-semantics-guided multi-modal attention aggregation mechanism to obtain a text-guided video representation. The invention effectively solves the problems of multi-modal information fusion and result interpretation in computer-assisted diagnosis of autism.
Inventors
- Guo Yanrong
- Chai Fuxiao
- Wu Jingjing
- Hao Shijie
- Hong Richang
Assignees
- Hefei University of Technology (合肥工业大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-13
Claims (7)
- 1. An autism stereotyped behavior recognition system based on a lightweight mask-enhanced spatio-temporal prompt framework, comprising: a visual feature extraction module, used for generating instance-level representations in a semantic set by focusing the model on the behavior subject region through an explicit semantic mask constraint, based on a mask-guided spatial enhancement mechanism; a dilated-convolution temporal modeling module, used for effectively modeling the temporal dependency between video frames based on the lightweight dilated temporal convolution network DTCN; an instance-category interaction module, used for fusing multi-stage information from instance-instance to instance-text based on the visual-instance and text-category interaction network ICIM; an action recognition module, used for adopting the hidden state of the [EOS] token as the sentence-level representation of the text description, stacking all normalized sentence-level vectors by category to obtain a category sentence-level representation matrix, and introducing a text-semantics-guided multi-modal attention aggregation mechanism to obtain a text-guided video representation; and a training module, whose training target is to keep the text-guided video representation and the corresponding category sentence-level representation matrix highly consistent in the shared embedding space, maximizing the score of the matched video-label pair while minimizing the scores of other unmatched pairs.
- 2. The autism stereotyped behavior recognition system based on the lightweight mask-enhanced spatio-temporal prompt framework of claim 1, wherein the mask-guided spatial enhancement mechanism is specifically: $N$ frames are uniformly sampled from the original video to obtain the frame sequence $V = \{x_1, x_2, \dots, x_N\}$, where $N$ is the number of frames; a multi-prompt-word mask generation strategy simultaneously extracts three types of masks, for the child subject, the interacted object and the interacting person; given the t-th frame image $x_t$ and a prompt word $p$, the probability mask is produced by the SEG output of the Sa2VA pre-trained model: $M_t^p = \mathrm{SEG}(x_t, p)$, where $M_{t,i}^p$ is the probability that position $i$ of the t-th frame image $x_t$ belongs to the region of prompt word $p$; based on this probability mask, a mask-pooling operation is performed on the patch-level features obtained after the video frames pass through the CLIP visual encoder: $f_t^p = \dfrac{\sum_{i=1}^{P} M_{t,i}^p\, v_{t,i}}{\sum_{i=1}^{P} M_{t,i}^p + \epsilon}$, where $f_t^p \in \mathbb{R}^{d_v}$ is the weighted feature of the prompt-word $p$ region in the t-th frame image, $\mathbb{R}$ is the real-number field, $d_v$ is the visual feature dimension, $P$ is the number of patch-level features into which the frame image is divided, $M_{t,i}^p$ is the probability that the i-th patch-level feature of the t-th frame image belongs to the prompt-word $p$ region, $v_{t,i}$ is the i-th patch-level feature of the t-th frame image, and $\epsilon$ is a constant preventing division by zero; a learnable mask token is introduced into the instance feature sequence and a linear transformation is applied to the enhanced instance features: $e_t^p = W f_t^p + b$, where $W \in \mathbb{R}^{d_t \times d_v}$ is a weight matrix, $b \in \mathbb{R}^{d_t}$ is a bias vector, and $d_t$ is the text-modality space dimension.
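The mask-pooling step of claim 2 can be sketched in plain Python. This is an illustrative toy, not the patented implementation: in the real system the patch features come from the CLIP visual encoder and the mask probabilities from Sa2VA; all names here are assumptions.

```python
def mask_pool(patch_feats, mask_probs, eps=1e-6):
    """Weighted average of patch-level features under a probability mask,
    with a small eps preventing division by zero (claim 2 sketch).
    patch_feats: P feature vectors; mask_probs: P per-patch probabilities."""
    dim = len(patch_feats[0])
    denom = sum(mask_probs) + eps
    pooled = [0.0] * dim
    for prob, feat in zip(mask_probs, patch_feats):
        for d in range(dim):
            pooled[d] += prob * feat[d]
    return [x / denom for x in pooled]

# Toy example: two 2-D patches, the first mostly inside the mask
print(mask_pool([[1.0, 0.0], [0.0, 1.0]], [0.8, 0.2]))
```

The eps term keeps the pooling stable for frames where the prompted region is absent and the mask probabilities are all near zero.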
- 3. The autism stereotyped behavior recognition system based on the lightweight mask-enhanced spatio-temporal prompt framework of claim 1, wherein the lightweight dilated temporal convolution network DTCN is specifically: divided into an instance-level temporal modeling part and a frame-level temporal modeling part; DTCN operates along the time dimension, and the dilated convolution of the l-th layer is formalized as $y_t^{(l)} = \sum_{j=0}^{k_l - 1} w_j^{(l)}\, x_{t - j \cdot d_l}^{(l-1)} + b^{(l)}$, where $k_l$ is the convolution kernel size of the l-th layer's dilated convolution, $d_l$ is the dilation rate of the l-th layer, $x^{(l-1)}$ is the input feature, $t \in \{1, \dots, N\}$ indexes the $N$ frames of the image sequence, and $b^{(l)}$ is a bias term; stacking the dilated convolution operations yields the total receptive field $R = 1 + \sum_{l=1}^{L} (k_l - 1)\, d_l$, where $L$ is the total number of DTCN layers.
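The receptive-field formula referenced in claim 3 can be checked with a small helper. The kernel sizes and dilation rates below are illustrative assumptions; the patent does not fix concrete values.

```python
def receptive_field(kernel_sizes, dilations):
    """Total receptive field of stacked stride-1 dilated 1-D convolutions,
    R = 1 + sum_l (k_l - 1) * d_l, as referenced in claim 3."""
    assert len(kernel_sizes) == len(dilations)
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))

# Four layers of kernel 3 with doubling dilation cover 31 frames
print(receptive_field([3, 3, 3, 3], [1, 2, 4, 8]))  # 31
```

Doubling the dilation rate per layer is the usual way such a stack covers a long clip with few parameters, which is what makes the temporal network lightweight.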
- 4. The autism stereotyped behavior recognition system based on the lightweight mask-enhanced spatio-temporal prompt framework of claim 1, wherein the visual-instance and text-category interaction network ICIM is specifically: composed of multiple stacked layers of multi-head attention, a feed-forward network and a Fusion MLP; the natural language description corresponding to each category label is processed and input to a pre-trained text encoder to obtain the word-level embedding $W_c \in \mathbb{R}^{L_c \times d}$ of each category label, and stacking them gives the word-level representation set of all class labels $\{W_c\}_{c=1}^{C}$, where $C$ is the number of category labels, $\mathbb{R}$ is the real-number field, $L_c$ is the text length of the c-th category label, and $d$ is the word-embedding dimension; the instance-level features output by the dilated-convolution temporal modeling module are first flattened into $E \in \mathbb{R}^{B \times N \times d}$ and input to ICIM, where $B$ is the batch size, $N$ is the size of the other (instance) dimension, and $d$ is the feature dimension; the self-attention layer first sets Key, Query and Value all from the flattened visual-instance features: $\mathrm{SA}(E) = \mathrm{softmax}\!\left(\tfrac{(E W_Q)(E W_K)^{\top}}{\sqrt{d}}\right) E W_V$, where $W_Q$, $W_K$, $W_V$ are respectively the linear transformation weights converting the flattened visual-instance features $E$ to the Query, Key and Value matrices; the instance-to-text cross-attention layer takes the visual instance-level features as Query and the text tokens as Key and Value: $\mathrm{CA}_{v \to t}(E, W_c) = \mathrm{softmax}\!\left(\tfrac{(E W_Q')(W_c W_K')^{\top}}{\sqrt{d}}\right) W_c W_V'$, where $W_Q'$, $W_K'$, $W_V'$ are respectively the linear transformation weights converting the visual instance features to the Query matrix, and the text token features to the Key and Value matrices; the text-to-instance reverse cross-attention layer lets the updated text features interact in the reverse direction: $\mathrm{CA}_{t \to v}(W_c, E) = \mathrm{softmax}\!\left(\tfrac{(W_c W_Q'')(E W_K'')^{\top}}{\sqrt{d}}\right) E W_V''$, where $W_Q''$, $W_K''$, $W_V''$ are respectively the linear transformation weights converting the text token features to the Query matrix, and the visual instance features to the Key and Value matrices; in each layer, after the multi-head attention operation, the output is processed by residual connection and layer normalization; finally, after the stacked layers, the model obtains the features $\tilde{E}$ after cross-modal semantic interaction; in the feature fusion stage, $\tilde{E}$ is first aggregated to reflect the semantic correlation between each instance and the language description, the aggregation weights being derived from the attention scores of the cross-modal semantic interaction module and softmax-normalized, representing the correlation of the n-th instance in the t-th frame of the b-th video sample to the semantic query; the frame-level features generated by the visual encoder at an early stage still retain the overall scene semantics, and the two features are fused by weighting to obtain the comprehensive representation; finally, the fused multimodal representation is output by the Fusion MLP.
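The cross-attention layers of ICIM in claim 4 reduce to standard scaled dot-product attention. A minimal single-head, single-example sketch follows; the learned linear projections and multi-head splitting are omitted, so this is an assumption-laden toy rather than the patented layer.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (single head): each query row
    attends over the key/value rows. With queries from visual instances
    and keys/values from text tokens this mimics the instance-to-text
    layer; swapping roles gives the text-to-instance reverse layer."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wj * v[i] for wj, v in zip(w, values))
                    for i in range(len(values[0]))])
    return out
```

A query aligned with the first key attends almost entirely to the first value, which is exactly the instance-text correlation the aggregation weights later reuse.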
- 5. The autism stereotyped behavior recognition system based on the lightweight mask-enhanced spatio-temporal prompt framework of claim 1, wherein the hidden state of the [EOS] token is used as the sentence-level representation of the text description, and all normalized sentence-level vectors are stacked by category to obtain the category sentence-level representation matrix, specifically: the vector of the last non-padding token in the c-th category label description is taken as the global semantic vector of that category: $s_c = \dfrac{w_{c,n_c}}{\lVert w_{c,n_c} \rVert}$, where $W_c$ is the word-embedding matrix corresponding to the c-th category label description, $n_c$ is the number of valid tokens in the c-th category label description, $w_{c,n_c} \in \mathbb{R}^{d}$ is the embedding vector of the $n_c$-th token of the c-th category, $\mathbb{R}$ is the real-number field, and $d$ is the token-embedding feature dimension; stacking all normalized sentence-level vectors by category gives the category sentence-level representation matrix $S = [s_1; s_2; \dots; s_C] \in \mathbb{R}^{C \times d}$, where $C$ is the number of category labels.
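Claim 5 amounts to picking the last valid token's embedding per category description and L2-normalizing it. A minimal sketch with illustrative names (in the real system the embeddings are the text encoder's hidden states):

```python
import math

def sentence_rep(token_embeds, n_valid):
    """Take the hidden state of the last valid ([EOS]) token as the
    sentence-level vector and L2-normalize it (claim 5 sketch)."""
    v = token_embeds[n_valid - 1]          # last non-padding token
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def category_matrix(all_token_embeds, valid_counts):
    """Stack the normalized sentence-level vectors of all C categories
    into the C x d category sentence-level representation matrix."""
    return [sentence_rep(e, n) for e, n in zip(all_token_embeds, valid_counts)]
```

Using the valid-token count rather than the sequence length is what skips padding tokens when descriptions of different lengths share one batch.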
- 6. The autism stereotyped behavior recognition system based on the lightweight mask-enhanced spatio-temporal prompt framework of claim 1, wherein the text-semantics-guided multi-modal attention aggregation mechanism for obtaining the text-guided video representation is specifically: for each category label $c$, the similarity of the t-th frame feature to each text token is computed: $\mathrm{sim}_{c,t,j} = \dfrac{h_t \cdot w_{c,j}}{\tau}$, where $\mathrm{sim}_{c,t,j}$ is the similarity between the t-th frame feature and the j-th text token under the c-th class label, $h_t$ is the fused feature vector of the t-th frame, $w_{c,j}$ is the feature vector of the j-th text token of the c-th category label, and $\tau$ is the temperature coefficient; normalizing over the time dimension gives the word-frame attention weight: $a_{c,t,j} = \dfrac{\exp(\mathrm{sim}_{c,t,j})}{\sum_{t'=1}^{N} \exp(\mathrm{sim}_{c,t',j})}$, where $a_{c,t,j}$ is the word-frame attention weight of the t-th frame and the j-th text token under the c-th category label, and $N$ is the total number of frames in the time dimension; averaging the attention over all tokens gives the frame-weight distribution of category label $c$ in the time dimension: $\beta_{c,t} = \dfrac{1}{n_c} \sum_{j=1}^{n_c} a_{c,t,j}$, where the frame weight $\beta_{c,t}$ reflects the importance of the t-th frame to the action semantics of category $c$, and $n_c$ is the number of valid tokens in the c-th category label description; finally, the text-guided video representation is obtained by weighted summation: $z_c = \sum_{t=1}^{N} \beta_{c,t}\, h_t$, where $z_c$ is the video feature representation for the c-th category label, $h_t$ is the fused feature vector of the t-th frame, $\{z_c\}_{c=1}^{C}$ is the set of video feature representations for all category labels, and $C$ is the total number of category labels.
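The word-frame aggregation of claim 6 can be sketched for one category in pure Python; `tau` plays the role of the temperature coefficient, and the value 0.07 is an assumed default, not specified by the patent.

```python
import math

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def text_guided_video_rep(frame_feats, token_feats, tau=0.07):
    """Claim 6 sketch for one category: word-frame similarities are
    softmax-normalized over the time dimension, the attention is averaged
    over all tokens to get frame weights, and the video representation is
    the weighted sum of the fused frame features."""
    T, D = len(frame_feats), len(frame_feats[0])
    frame_w = [0.0] * T
    for tok in token_feats:
        sims = [sum(f_d * t_d for f_d, t_d in zip(f, tok)) / tau for f in frame_feats]
        attn = softmax(sims)                  # normalize over time
        for t in range(T):
            frame_w[t] += attn[t] / len(token_feats)
    return [sum(frame_w[t] * frame_feats[t][d] for t in range(T)) for d in range(D)]
```

Because the softmax runs over time rather than over tokens, each word votes for the frames where its semantics appear, which is what makes the resulting frame weights interpretable.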
- 7. The autism stereotyped behavior recognition system based on the lightweight mask-enhanced spatio-temporal prompt framework of claim 1, wherein the training target keeps the text-guided video representation and the corresponding category sentence-level representation matrix highly consistent in the shared embedding space, maximizing the score of the matched video-label pair while minimizing the scores of other unmatched pairs, specifically: in each training batch, the model first computes the matching similarity of the i-th video to the semantic embedding of each category: $p_{i,c} = \dfrac{\exp(\cos(z_i, s_c)/\tau)}{\sum_{c'=1}^{C} \exp(\cos(z_i, s_{c'})/\tau)}$, where $p_{i,c}$ is the matching similarity score between the i-th video and the c-th category semantic embedding, $z_i$ is the enhanced visual representation, $s_c$ is the text semantic representation of the c-th category label, $\cos(\cdot,\cdot)$ is the cosine similarity function, and $\tau$ is a temperature hyper-parameter used to adjust the smoothness of the distribution; a cross-entropy loss function minimizes the loss between the semantically guided video representation and the true category label description: $\mathcal{L} = -\dfrac{1}{B} \sum_{i=1}^{B} \log p_{i, y_i}$, where $\mathcal{L}$ is the visual-text semantic matching training loss value, $B$ is the training batch size, $z_i$ is the enhanced visual representation of the i-th video, $y_i$ is the true category of the i-th video with $s_{y_i}$ its corresponding text semantic representation, and $C$ is the total number of category labels.
Description
Autism stereotyped behavior recognition system based on a lightweight mask-enhanced spatio-temporal prompt framework

Technical Field

The invention relates to the field of automatic recognition and analysis of autism stereotyped behaviors, and in particular to an autism stereotyped behavior recognition system based on a lightweight mask-enhanced spatio-temporal prompt framework.

Background

Autism spectrum disorder (Autism Spectrum Disorder, ASD) is a neurodevelopmental disorder characterized by social communication deficits, non-verbal communication disorders, and restricted and repetitive behaviors. Such disorders often appear in the early stages of childhood development and may have a long-term negative impact on an individual's quality of life. However, current diagnostic procedures rely largely on manual evaluation by clinical professionals through tools such as the Autism Diagnostic Observation Schedule (ADOS), which makes diagnosis slow and difficult to scale up. In recent years, Artificial Intelligence (AI) technology has been widely applied in the medical field. Driven by these advances and the diagnostic potential of AI, researchers have begun to explore methods for early detection of ASD using multi-modal cues, including facial features, functional magnetic resonance imaging (fMRI), and eye-tracking data, among others. In addition to these physiological and neural signals, stereotyped behaviors (SBDs), the restricted and repetitive movements common in children with ASD (such as flapping an arm or repeatedly tapping the head with a hand), are also a key indicator in clinical diagnosis. To facilitate early screening, some computer-vision-based studies have attempted to automatically identify these behavioral patterns from video.
However, most existing methods rely on frame-level or short-segment-level supervision, making it difficult to capture the subtle spatio-temporal differences of stereotyped behaviors; moreover, directly performing full-parameter fine-tuning of large-scale pre-trained Visual Language Models (VLMs) on small-scale ASD datasets is not only computationally expensive but may also destroy their original visual-text alignment capability, resulting in performance degradation. Therefore, how to provide an autism stereotyped behavior recognition system based on a lightweight mask-enhanced spatio-temporal prompt framework, one that can effectively solve the problems of multi-modal information fusion and result interpretation in existing computer-assisted autism diagnosis, is a problem to be solved by persons skilled in the art.

Disclosure of the Invention

In view of the above, the invention provides an autism stereotyped behavior recognition system based on a lightweight mask-enhanced spatio-temporal prompt framework. To achieve the above purpose, the invention adopts the following technical scheme: an autism stereotyped behavior recognition system based on a lightweight mask-enhanced spatio-temporal prompt framework, comprising: a visual feature extraction module, used for generating instance-level representations in a semantic set by focusing the model on the behavior subject region through an explicit semantic mask constraint, based on a mask-guided spatial enhancement mechanism; a dilated-convolution temporal modeling module, used for effectively modeling the temporal dependency between video frames based on the lightweight dilated temporal convolution network DTCN; an instance-category interaction module, used for fusing multi-stage information from instance-instance to instance-text based on the visual-instance and text-category interaction network ICIM; an action recognition module, used for adopting the hidden state of the [EOS] token as the sentence-level representation of the text description, stacking all normalized sentence-level vectors by category to obtain a category sentence-level representation matrix, and introducing a text-semantics-guided multi-modal attention aggregation mechanism to obtain a text-guided video representation; and a training module, whose training target is to keep the text-guided video representation and the corresponding category sentence-level representation matrix highly consistent in the shared embedding space, maximizing the score of the matched video-label pair while minimizing the scores of other unmatched pairs. Optionally, the mask-guided spatial enhancement mechanism is specifically: $N$ frames are uniformly sampled from the original video to obtain the frame sequence $V = \{x_1, \dots, x_N\}$, where $N$ is the number of frames; a multi-prompt-word mask generation strategy simultaneously extracts three types of masks, for the child subject, the interacted object and the interacting person; given the t-th frame image $x_t$ and a prompt word $p$, the probability mask is produced by the SEG output of the Sa2VA pre-trained model: $M_t^p = \mathrm{SEG}(x_t, p)$, where $M_{t,i}^p$ is the probability that position $i$ of the t-th frame image belongs to the prompt-word $p$ region