
CN-122024308-A - Action recognition system and method based on the cooperation of large and small models

CN122024308A

Abstract

The invention discloses an action recognition system and method based on the cooperation of large and small models. The system comprises a large model, a small model, and a gating network. The large model comprises a shallow visual encoder, a text encoder, a deep visual encoder, a multi-modal fusion module, and a language reasoning head; the small model comprises a lightweight visual encoder and a deep recognition structure. The gating network judges the semantic complexity of an input sample: when the complexity score is higher than a threshold value, the large model is called to perform the subsequent reasoning and output a prediction result; when the complexity score is lower than the threshold value, the small model performs the subsequent reasoning and outputs the prediction result. The method effectively improves the accuracy of complex action recognition and the reasoning efficiency of the model, and can be widely applied in fields such as video surveillance, intelligent security, human-machine interaction, and sports analysis.

Inventors

  • Wen Jingxin
  • Zhong Daqing
  • Tang Shangyi
  • Tang Boyue
  • Yan Ying

Assignees

  • Sichuan University Jinjiang College (四川大学锦江学院)

Dates

Publication Date
2026-05-12
Application Date
2025-12-12

Claims (9)

  1. An action recognition system based on the cooperation of large and small models, characterized by comprising a large model, a small model, and a gating network, wherein: the large model comprises a shallow visual encoder, a text encoder, a deep visual encoder, a multi-modal fusion module, and a language reasoning head; the shallow visual encoder extracts first visual features from an input video sequence; the text encoder encodes the text input to obtain prompt texts; the deep visual encoder performs semantic enhancement on the first visual features to obtain deep visual features; the multi-modal fusion module performs multi-modal fusion of the deep visual features and the prompt texts; and the language reasoning head outputs action category probabilities and multi-modal semantic embeddings; the small model comprises a lightweight visual encoder and a deep recognition structure; the lightweight visual encoder extracts second visual features from the input video sequence, and the deep recognition structure outputs a prediction result from the second visual features; the gating network obtains a complexity score from the first visual features and the second visual features; when the complexity score is higher than a threshold value, the large model is invoked to carry out the subsequent reasoning and output the prediction result, and when the complexity score is lower than the threshold value, the subsequent reasoning is carried out only through the small model and the prediction result is output.
  2. The action recognition system based on the cooperation of large and small models of claim 1, wherein the gating network G(·) analyzes the dual features {F_L, F_S} to produce a complexity score α ∈ [0,1]: α = G([F_L, F_S]), where [·,·] represents feature concatenation along the channel dimension; the score α reflects the perceptual and semantic complexity of the current input and determines whether to activate the deeper large-model pathway; F_L and F_S are the first visual features and the second visual features, respectively; the gating network is optimized through training: during training, each sample is labeled with a binary target y_gate, generated by a teacher-forced policy: y_gate = 1 if |ŷ_S − y| > δ, otherwise y_gate = 0; wherein δ is a margin controlling the tolerance of the prediction error, ŷ_S is the small model's prediction, and y is the ground-truth label; the gating network is trained by minimizing the binary cross-entropy loss: L_gate = −[y_gate·log α + (1 − y_gate)·log(1 − α)]; during inference, the α output by the gating network is compared with a threshold τ to decide whether a sample undergoes the subsequent reasoning by the large model or the small model.
  3. The action recognition system based on the cooperation of large and small models of claim 2, wherein the system makes the small model inherit the fine-grained action understanding and contextual reasoning of the large model through knowledge distillation, and the total loss of the small model includes a classification loss and a distillation loss: L_S = L_CE + λ·L_KD; wherein L_CE is the cross-entropy classification loss, λ is a hyperparameter controlling the distillation contribution, and the distillation loss L_KD is defined by the KL divergence between the softmax outputs of the large and small models: L_KD = T²·KL(σ(z_L/T) ∥ σ(z_S/T)); wherein σ(·) is the softmax function, T is a temperature parameter, KL denotes the KL divergence, and z_L is the prediction output of the large model (z_S being that of the small model).
  4. The action recognition system based on the cooperation of large and small models of claim 1, further comprising a prompt-based adaptation module that introduces a set of learnable prompt vectors P, injected as task-specific prior information into the Transformer or attention layers of the large model and the small model; the adaptive feature is expressed as: F_adapt = [P; F_fuse]; wherein [P; F_fuse] represents the concatenation of the prompt vectors with the multi-modal fused features; in few-shot tasks, the parameters of the backbone networks of the large model and the small model are kept frozen, and only the prompt parameters P are updated: P* = argmin_P L(P); the adaptation module transfers cross-modal knowledge from the large model to the small model through a prompt-aligned embedding space, thereby realizing knowledge migration between the large model and the small model.
  5. An action recognition method based on the cooperation of large and small models, characterized by comprising the following steps: Step 1, constructing and training a collaborative action recognition framework, wherein the framework comprises a large model, a small model, and a gating network; the large model comprises a shallow visual encoder, a text encoder, a deep visual encoder, a multi-modal fusion module, and a language reasoning head; the shallow visual encoder extracts first visual features from an input video sequence; the text encoder encodes texts to obtain prompt texts; the deep visual encoder performs semantic enhancement on the first visual features to obtain deep visual features; the multi-modal fusion module performs multi-modal fusion of the deep visual features and the prompt texts; the language reasoning head outputs action category probabilities and multi-modal semantic embeddings; and the small model comprises a lightweight visual encoder that extracts second visual features from the input video sequence and a deep recognition structure that outputs a prediction result from the second visual features; Step 2, recognizing the input video sequence with the collaborative action recognition framework, wherein the gating network obtains a complexity score from the first visual features and the second visual features; when the complexity score is higher than a threshold value, the large model is invoked to carry out the subsequent reasoning and output a prediction result, and when the complexity score is lower than the threshold value, the subsequent reasoning is carried out only through the small model and the prediction result is output.
  6. The action recognition method based on the cooperation of large and small models of claim 5, wherein the gating network G(·) analyzes the dual features {F_L, F_S} to produce a complexity score α ∈ [0,1]: α = G([F_L, F_S]), where [·,·] represents feature concatenation along the channel dimension; the score α reflects the perceptual and semantic complexity of the current input and determines whether to activate the deeper large-model path; F_L and F_S are the first visual features and the second visual features, respectively; the gating network is optimized through training: during training, a binary routing label y_gate is generated for each sample based on the error between the small model's output and the ground-truth label, by a teacher-forced policy: y_gate = 1 if |ŷ_S − y| > δ, otherwise y_gate = 0; wherein δ is a margin controlling the tolerance of the prediction error, ŷ_S is the small model's prediction, and y is the ground-truth label; the routing labels and the complexity scores output by the gating network are trained with the binary cross-entropy loss: L_gate = −[y_gate·log α + (1 − y_gate)·log(1 − α)]; during inference, the α output by the gating network is compared with a threshold τ to decide whether a sample undergoes the subsequent reasoning by the large model or the small model.
  7. The action recognition method based on the cooperation of large and small models of claim 6, wherein in step 1, the collaborative action recognition framework alternates between three phases during training: (1) Pre-training, wherein the large model is trained independently on a large-scale data set so that it acquires global semantic understanding capability; (2) Distillation, wherein the small model is trained under the joint supervision of the ground truth and the large model's outputs; (3) Gating, wherein the large-model and small-model reasoning paths are frozen and the gating network G(·) is trained to realize efficient dynamic routing.
  8. The action recognition method based on the cooperation of large and small models of claim 7, wherein in step 1, soft labels are constructed from the class probabilities output by the large model and by the small model, and the KL divergence between the temperature-smoothed soft labels is used as the distillation loss; the total loss of the small model includes a classification loss and a distillation loss: L_S = L_CE + λ·L_KD; wherein L_CE is the cross-entropy classification loss, λ is a hyperparameter controlling the distillation contribution, and the distillation loss L_KD is defined as the KL divergence between the softmax outputs of the large and small models: L_KD = T²·KL(σ(z_L/T) ∥ σ(z_S/T)); wherein σ(·) is the softmax function, T is a temperature parameter, KL denotes the KL divergence, and z_L is the prediction output of the large model (z_S being that of the small model).
  9. The action recognition method based on the cooperation of large and small models of claim 7, wherein the collaborative action recognition framework further comprises a prompt-based adaptation module that introduces a set of learnable prompt vectors P, injected as task-specific prior information into the Transformer or attention layers of the large model and the small model; the adaptive feature is expressed as: F_adapt = [P; F_fuse]; wherein [P; F_fuse] represents the concatenation of the prompt vectors with the multi-modal fused features; in few-shot tasks, the parameters of the backbone networks of the large model and the small model are kept frozen, and only the prompt parameters P are updated: P* = argmin_P L(P); through prompt-guided feature adjustment, cross-modal knowledge is transferred from the large model to the small model, thereby realizing knowledge migration between the large and small models.
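The gating and distillation objectives described in claims 2-3 and 6-8 can be illustrated in a few lines. The following is a minimal NumPy sketch, not the patent's implementation: the linear sigmoid gating head, the function names (gate_score, routing_label, kd_loss), the feature dimensions, and the margin value are all illustrative assumptions; the patent does not specify them.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-smoothed softmax sigma(z / T)."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def gate_score(f_large, f_small, w, b):
    """alpha = G([F_L, F_S]): concatenate features along the channel
    dimension and apply an assumed linear head with a sigmoid."""
    x = np.concatenate([f_large, f_small])
    return float(1.0 / (1.0 + np.exp(-(w @ x + b))))

def routing_label(p_small, y_true, delta=0.3):
    """Teacher-forced policy: label 1 (route to the large model) when the
    small model's error on the true class exceeds the margin delta."""
    return 1.0 if (1.0 - p_small[y_true]) > delta else 0.0

def bce_loss(alpha, y_gate, eps=1e-9):
    """Binary cross-entropy used to train the gating network."""
    return -(y_gate * np.log(alpha + eps)
             + (1.0 - y_gate) * np.log(1.0 - alpha + eps))

def kd_loss(z_large, z_small, T=2.0):
    """Distillation loss: T^2 * KL(softmax(z_L/T) || softmax(z_S/T))."""
    p, q = softmax(z_large, T), softmax(z_small, T)
    return (T ** 2) * float(np.sum(p * np.log(p / q)))

# Inference-time routing: large model if alpha > tau, else small model.
rng = np.random.default_rng(0)
f_L, f_S = rng.normal(size=16), rng.normal(size=8)
w, b = rng.normal(size=24), 0.0
alpha, tau = gate_score(f_L, f_S, w, b), 0.5
branch = "large" if alpha > tau else "small"
```

The T² factor in the distillation term is the conventional scaling that keeps the gradient magnitude of the soft-label loss comparable across temperatures; the patent text only states that the KL divergence is taken between the temperature-smoothed softmax outputs.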

Description

Action recognition system and method based on the cooperation of large and small models

Technical Field

The invention relates to the technical field of action recognition, and in particular to a system and a method for action recognition based on the cooperation of large and small models.

Background

Human action recognition (HAR) is an important research direction in the field of computer vision, whose core task is to recognize and understand human behavior patterns by analyzing video or sensor data. The technology is widely applied in fields such as security monitoring, intelligent healthcare, sports analysis, and human-machine interaction. Accurately understanding human actions at different levels, from simple gestures to complex group behaviors (such as sports games), is an important basis for constructing intelligent video analysis systems. In recent years, with the rapid development of artificial intelligence, large language models (LLMs) have made breakthrough progress. Building on this, researchers have further proposed multimodal large language models (MLLMs), which combine visual encoders with language models to give the models the ability to "see" and "understand" image and video content. Such models are typically trained on large-scale image-text pairs, can achieve zero-shot and few-shot learning in tasks such as visual question answering, video summarization, and image understanding, and exhibit powerful general perception and reasoning performance. This multimodal understanding capability is particularly important for action recognition, because the task often lacks large-scale annotated data, and annotation is costly and labor-intensive.
However, there are still a number of problems with existing action recognition schemes based on large multimodal models. First, the computational complexity and inference latency are high: large models have huge parameter counts and significant reasoning overhead, and are difficult to deploy on terminal devices with limited resources or strict real-time requirements. Second, fine-grained temporal modeling is inadequate: although multimodal large models excel at global semantic understanding, when processing long video sequences they easily ignore local fine motion information and temporal structure, resulting in blurred action boundaries. Third, action categories are diverse and complex: human actions exhibit significant intra-class differences and inter-class similarities; the same action can vary greatly across subjects, viewpoints, and speeds, and fine-grained action recognition also needs to distinguish subtle human-object interactions and local posture changes. These problems mean that a single model has difficulty combining depth of semantic understanding with computational efficiency. When the human visual system perceives dynamic information, it often relies on the cooperation of two cognitive mechanisms: local rapid perception and global deep understanding. In view of the foregoing, the prior art still lacks an action recognition framework that balances accuracy and efficiency and has dynamic routing and cross-modal semantic migration capabilities.

Disclosure of Invention

The technical problem to be solved by the embodiments of the invention is to provide an action recognition system and method based on the cooperation of large and small models, so that high-performance, low-latency dynamic action recognition can be realized in diverse and complicated video recognition scenarios.
To solve the above technical problems, the embodiment of the invention provides an action recognition system based on the cooperation of large and small models, comprising a large model, a small model, and a gating network, wherein: the large model comprises a shallow visual encoder, a text encoder, a deep visual encoder, a multi-modal fusion module, and a language reasoning head; the shallow visual encoder extracts first visual features from an input video sequence; the text encoder encodes the text input to obtain prompt texts; the deep visual encoder performs semantic enhancement on the first visual features to obtain deep visual features; the multi-modal fusion module performs multi-modal fusion of the deep visual features and the prompt texts; and the language reasoning head outputs action category probabilities and multi-modal semantic embeddings; the small model comprises a lightweight visual encoder and a deep recognition structure; the lightweight visual encoder extracts second visual features from the input video sequence, and the deep recognition structure outputs a prediction result from the second visual features; the gating network obtains a complexity score from the first and second visual features; when the complexity score is higher than a threshold value, the large model is invoked to carry out the subsequent reasoning and output the prediction result, and when the complexity score is lower than the threshold value, the subsequent reasoning is carried out only through the small model and the prediction result is output.
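The prompt-based adaptation described in claims 4 and 9 amounts to prepending learnable prompt vectors to the fused features and, in few-shot tasks, updating only those vectors while the backbone stays frozen. The following is a minimal NumPy sketch under assumed dimensions, with a toy quadratic objective standing in for the patent's task loss:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 4, 8, 10                     # k prompt vectors of dim d, n fused tokens (assumed)
prompts = rng.normal(size=(k, d))      # learnable prompt parameters P (the only trainable part)
fused = rng.normal(size=(n, d))        # multi-modal fused features F_fuse (frozen backbone output)

def adapt(prompts, fused):
    """F_adapt = [P; F_fuse]: concatenate the prompts with the fused features."""
    return np.concatenate([prompts, fused], axis=0)

def toy_loss(feats):
    """Stand-in objective; the patent's task loss L(P) would go here."""
    return float(np.mean(feats ** 2))

# Few-shot adaptation step: the gradient touches only `prompts`;
# `fused` (the frozen backbone path) is never updated.
lr = 0.1
grad_prompts = 2.0 * prompts / adapt(prompts, fused).size  # d(toy_loss)/d(prompts)
before = toy_loss(adapt(prompts, fused))
prompts = prompts - lr * grad_prompts
after = toy_loss(adapt(prompts, fused))
```

Because only the k·d prompt parameters receive gradients, this is the parameter-efficient regime the claims describe: the large and small backbones keep their pre-trained weights, and cross-modal knowledge flows through the shared prompt-aligned embedding space.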