
CN-122023926-A - Fine-grained identification method, system and application based on multimodal large-model collaborative distillation

CN 122023926 A

Abstract

The invention discloses a fine-grained recognition method, system and application based on multimodal large-model collaborative distillation. A multimodal large model serves as the expert teacher network: its visual encoder extracts a local attention distribution map of the input image, and its language encoder generates deep semantic embedding features for the fine-grained image. In the training stage, a fine-grained student network comprising a backbone recognition network, a classification head, a quality-evaluation head and an attribute-analysis head is constructed, and the knowledge and local-perception capability of the pre-trained multimodal large model are migrated into the student network through a cross-modal semantic-alignment loss and an attention distillation loss. In the inference stage, fine-grained image recognition comparison, image quality evaluation and fine-grained attribute analysis can be realized using only the backbone recognition network constructed in the training stage. The method can reduce data-annotation cost, enhance the generalization capability of the model in complex scenes, and simultaneously meet the real-time and lightweight requirements of actual system deployment.

Inventors

  • WU XIANG
  • SU XIAOSHENG
  • ZHAN DONGHUI

Assignees

  • 厦门瑞为信息技术股份有限公司 (Xiamen Reconova Information Technology Co., Ltd.)

Dates

Publication Date
2026-05-12
Application Date
2026-02-04

Claims (10)

  1. A fine-grained recognition system based on multimodal large-model collaborative distillation, characterized by comprising an expert teacher network module, a student network module, a training module and an inference module. The expert teacher network module adopts a multimodal large model that has at least visual understanding and language interaction capability, contains a visual encoder and a language encoder, and is used for extracting a local attention distribution map of an input image and generating deep semantic embedding features. The student network module is a single-modal vision model comprising a backbone recognition network, a classification head, a quality-assessment head and an attribute-analysis head; it receives the knowledge and local-perception capability migrated from the expert teacher network module in the training stage, and realizes fine-grained image recognition comparison, image quality assessment and fine-grained attribute analysis in the inference stage. The training module transfers knowledge from the expert teacher network module to the student network module through a cross-modal semantic-alignment loss and an attention distillation loss, and trains the student network module with the Adam optimization algorithm. The inference module calls the backbone recognition network in the student network module to complete the tasks related to fine-grained recognition in the inference stage.
  2. The fine-grained recognition system based on multimodal large-model collaborative distillation according to claim 1, wherein the multimodal large model uses the Qwen2.5-VL-7B multimodal large model and the student network uses the Tiny-ViT-21M single-modal vision model.
  3. A fine-grained identification method of multimodal large-model collaborative distillation, based on the fine-grained identification system of multimodal large-model collaborative distillation according to claim 1 or 2, characterized by comprising the steps of: S1, taking the multimodal large model as an expert teacher network, extracting a local attention distribution map of an input image through the visual encoder of the expert teacher network, and generating deep semantic embedding features for the fine-grained image using the language encoder of the expert teacher network; S2, constructing a fine-grained student network comprising a backbone identification network, a classification head, a quality-evaluation head and an attribute-analysis head; S3, in the training stage, migrating the knowledge and local-perception capability of the pre-trained multimodal large model into the student network through a cross-modal semantic-alignment loss and an attention distillation loss; and S4, in the inference stage, using only the backbone recognition network constructed in the training stage to complete fine-grained image recognition comparison, image quality evaluation and fine-grained attribute analysis.
  4. The fine-grained recognition method based on multimodal large-model collaborative distillation according to claim 3, wherein in step S1 the multimodal large model serves as the expert teacher network T, and its visual feature output F_T is:
     F_T = V_T(x) ∈ R^(N_T × D_T)
     where V_T is the visual network of the multimodal large model, N_T is the number of local feature modules of the expert teacher network, D_T is the feature dimension of the visual features, and x is the input image. The semantic embedding feature E_T of the multimodal large model is:
     E_T = L_T(p, x) ∈ R^(D_E)
     where L_T is the language-model network of the multimodal large model, p is the prompt word, D_E is the feature dimension of the semantic embedding, and x is the input image.
  5. The fine-grained recognition method based on multimodal large-model collaborative distillation according to claim 3, wherein in step S2 the student network is used for fine-grained image recognition, quality evaluation and attribute recognition tasks, and its output feature map F_S is:
     F_S = V_S(x) ∈ R^(N_S × D_S)
     where N_S is the number of local feature modules of the student network and D_S is the feature dimension of the student output feature map. The visual embedding feature of the student network is e_S ∈ R^(d_S), where d_S is the visual embedding dimension of the student network. For quality assessment, the quality-score branch of the student network outputs q_S ∈ [0, 1]; its core function is to perform multi-dimensional quality analysis of the input fine-grained image and output a normalized composite quality score. For fine-grained attribute identification, the attribute output branch of the student network is a_S ∈ R^K, where K is the size of the predefined attribute dictionary; for the k-th attribute, the label is 1 if the attribute is present and 0 if it is not.
  6. The fine-grained recognition method based on multimodal large-model collaborative distillation according to claim 3, wherein, in order to migrate the visual local-perception capability of the multimodal large model in step S1 into the student network of step S2, a local-perception distillation loss is provided to align the student network with the feature-map output of the visual network in the multimodal large model. Because the teacher's visual feature output F_T ∈ R^(N_T × D_T) and the student's output feature map F_S ∈ R^(N_S × D_S) do not match in the number of local feature modules or in feature dimension, average pooling is used to match the number of local feature modules, and a projection W maps the student feature map into the teacher's visual feature space. The local-perception distillation loss L_local is:
     L_local = (1/N_T) Σ_{i=1}^{N_T} || Pool(F_S W)_i − F_{T,i} ||²
     where Pool(·) is the average-pooling operation, V_T is the visual network of the multimodal large model, N_T is the number of local feature modules of the expert teacher network, D_T is the feature dimension of the visual features, x is the input image, N_S is the number of local feature modules of the student network, and D_S is the feature dimension of the student output feature map.
     To introduce the logical knowledge of the multimodal large model into the student network, a cross-modal semantic-alignment distillation loss is provided to align the teacher's semantic embedding feature E_T = L_T(p, x) (where p is the prompt word, x the input image, and L_T the language-model network of the multimodal large model) with the student's visual embedding feature e_S. A linear projection matrix W_p maps e_S into the hidden embedding space of the teacher's language model, and semantic alignment is performed with cosine similarity, so the semantic-alignment distillation loss L_sem is:
     L_sem = 1 − cos(W_p e_S, E_T)
     where e_S is the visual embedding feature of the student network.
     A quality-evaluation module is introduced: the multimodal large model performs multi-dimensional decomposed evaluation of fine-grained image quality, scoring each fine-grained recognition sample via prompt words with output in JSON format {"sharpness": x1, "illumination": y1, "occlusion": z1}, where x1 is the sharpness score, y1 the illumination score, and z1 the occlusion score. The three raw JSON scores are parsed and mapped to [0, 1]: q_c = x1, q_i = y1, q_o = z1. The composite quality score q_T of the expert teacher network is then defined with weighting coefficients:
     q_T = α q_c + β q_i + γ q_o
     where α, β, γ are weighting parameters satisfying α + β + γ = 1. The quality-assessment loss L_quality is:
     L_quality = (q_S − q_T)²
     where q_S is the composite quality score output by the student network's quality-assessment head.
     For fine-grained attribute analysis, the multimodal large model converts unstructured visual features into structured multi-label semantic features, with prompt words describing the fine-grained visual attributes of the target object. A global attribute dictionary of size K is defined, and the output of the multimodal large model is mapped to a K-dimensional binary vector y, where for the k-th attribute y_k = 1 if the attribute is present and y_k = 0 otherwise. The multi-label attribute-prediction classification loss L_attr is:
     L_attr = −(1/K) Σ_{k=1}^{K} [ y_k log σ(a_k) + (1 − y_k) log(1 − σ(a_k)) ]
     where K is the size of the predefined attribute dictionary; σ is the sigmoid activation function, whose role is to map the student network's attribute prediction output a_k to the [0, 1] interval, converting it into the probability that the k-th attribute is present; the sum runs from attribute 1 to attribute K, covering all predefined fine-grained attributes; y_k is the true label (0 or 1) of the k-th attribute output by the expert teacher network; a_k is the raw prediction of the k-th attribute from the student network's attribute output branch; and log is the logarithmic function, which log-transforms the sigmoid-activated probability and amplifies the loss when a prediction deviates from its true label.
     In summary, the total loss function for network optimization is:
     L_total = λ1 L_local + λ2 L_sem + λ3 L_quality + λ4 L_attr
     where λ1, λ2, λ3, λ4 are loss-weighting coefficients. Based on this loss function, during the training of step S3 the Adam optimization algorithm performs back-propagation with the expert teacher network frozen, optimizing only the student network, to obtain the final training result.
  7. The fine-grained recognition method based on multimodal large-model collaborative distillation according to claim 3, wherein in step S1 the attention distribution map is obtained by the visual encoder of the multimodal large model, which extracts hierarchical features of the input image and computes, through a self-attention mechanism, the association weight of each pixel with its surrounding pixels, generating a two-dimensional attention-weight matrix whose dimensions correspond to the resolution of the input image, namely the local attention distribution map.
  8. The fine-grained recognition method based on multimodal large-model collaborative distillation according to claim 3, wherein in step S3, when training with the Adam optimization algorithm, an initial learning rate and a weight-decay coefficient are set; the number of training iterations is 200-300, the learning rate is decayed every 30-50 epochs, and the decay coefficient is 0.5.
  9. The fine-grained recognition method based on multimodal large-model collaborative distillation according to claim 3, wherein in step S4 the inference flow of fine-grained recognition is: the score output by the quality-assessment head is first used to screen the image; when the score meets the preset quality threshold, the backbone recognition network outputs a classification result and an attribute vector for the screened image, and the final recognition result is output by combining the classification confidence and the attribute matching degree.
  10. An application of the fine-grained recognition method based on multimodal large-model collaborative distillation, characterized in that the fine-grained recognition method based on multimodal large-model collaborative distillation according to any one of claims 3-8 is applied to any of the following scenarios: pedestrian re-identification, face recognition, vehicle model recognition, or biological species recognition.
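The four distillation losses of claim 6 and their weighted combination can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the patented implementation: the mean-squared-error forms of the local and quality terms, the projection matrices `w` and `w_p`, and all tensor shapes are assumptions for illustration.

```python
import numpy as np

def local_perception_loss(f_s, f_t, w):
    # Project the student feature map (N_S x D_S) into the teacher's visual
    # feature space via an assumed linear map w, then average-pool the N_S
    # student modules down to the teacher's N_T modules (claim 6's pooling).
    n_t, d_t = f_t.shape
    proj = f_s @ w                                    # (N_S, D_T)
    pooled = proj.reshape(n_t, -1, d_t).mean(axis=1)  # (N_T, D_T)
    return float(np.mean((pooled - f_t) ** 2))        # assumed MSE form

def semantic_alignment_loss(e_s, e_t, w_p):
    # Map the student visual embedding into the teacher's language embedding
    # space and align by cosine similarity: loss = 1 - cos.
    z = e_s @ w_p
    cos = float(z @ e_t / (np.linalg.norm(z) * np.linalg.norm(e_t) + 1e-12))
    return 1.0 - cos

def quality_loss(q_s, sub_scores, weights):
    # Teacher composite score q_T = alpha*q_c + beta*q_i + gamma*q_o
    # (weights sum to 1), compared to the student quality-head output.
    q_t = float(np.dot(weights, sub_scores))
    return (q_s - q_t) ** 2                           # assumed squared error

def attribute_loss(a_s, y):
    # Sigmoid + binary cross-entropy over the K predefined attributes.
    p = 1.0 / (1.0 + np.exp(-a_s))
    return float(-np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12)))

def total_loss(losses, lambdas):
    # Weighted sum of the four terms (lambda coefficients are assumed).
    return float(np.dot(lambdas, losses))

rng = np.random.default_rng(0)
f_t = rng.normal(size=(4, 8))        # teacher feature map: N_T=4, D_T=8
f_s = rng.normal(size=(16, 6))      # student feature map: N_S=16, D_S=6
w = rng.normal(size=(6, 8))         # student-to-teacher projection (assumed)
e_t = rng.normal(size=32)           # teacher semantic embedding
e_s = rng.normal(size=12)           # student visual embedding
w_p = rng.normal(size=(12, 32))     # projection into language space (assumed)

l1 = local_perception_loss(f_s, f_t, w)
l2 = semantic_alignment_loss(e_s, e_t, w_p)
l3 = quality_loss(0.8, np.array([0.9, 0.7, 0.6]), np.array([0.5, 0.3, 0.2]))
l4 = attribute_loss(rng.normal(size=10), rng.integers(0, 2, size=10))
total = total_loss([l1, l2, l3, l4], [1.0, 1.0, 1.0, 1.0])
```

With these toy shapes, the pooling step reduces 16 student modules to 4 teacher modules by averaging groups of 4; in practice the grouping and the projection matrices would be learned or chosen to fit the actual networks.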

Description

Fine-grained identification method, system and application based on multimodal large-model collaborative distillation

Technical Field

The invention relates to the technical field of computer vision, in particular to a fine-grained identification method, system and application based on multimodal large-model collaborative distillation.

Background

Fine-grained recognition (Fine-Grained Visual Classification, FGVC) is one of the important research directions in the current field of computer vision, aimed at distinguishing highly similar subclasses belonging to the same general class, as in face recognition, vehicle model recognition, and biological species recognition. The current mainstream fine-grained recognition methods are based on deep convolutional neural networks or vision Transformers, and strengthen the model's ability to discriminate and perceive key region details through local attention mechanisms and metric learning.

However, the existing fine-grained recognition technology has notable defects. First, fine-grained recognition depends heavily on component-level or attribute-level data annotation; such annotated data is extremely expensive to acquire, which limits the generalization capability of the model. Second, traditional models based on convolutional neural networks or vision Transformers lack explicit semantic constraints, so performance can degrade markedly in complex scenes or under shifts in data distribution. Third, in practical applications, quality evaluation, attribute analysis and classification are usually handled by independent models; the lack of feature sharing across the tasks limits recognition accuracy and increases the storage and computation burden of edge-side deployment.

Therefore, how to reduce the traditional methods' heavy dependence on component-level or attribute-level annotation, how to bridge the disconnect between purely visual features and deep semantic information that causes insufficient generalization, and how to meet the real-time and lightweight deployment requirements of an actual system while using a multimodal large model to improve performance, have become the key problems to be solved urgently in current fine-grained identification technology.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a fine-grained identification method, system and application based on multimodal large-model collaborative distillation, which can reduce data-annotation cost, enhance the generalization capability of the model in complex scenes, and simultaneously meet the real-time and lightweight requirements of actual system deployment.

In order to achieve the above object, the solution of the present invention is: a fine-grained recognition system based on multimodal large-model collaborative distillation comprises an expert teacher network module, a student network module, a training module and an inference module. The expert teacher network module adopts a multimodal large model that has at least visual understanding and language interaction capability, contains a visual encoder and a language encoder, and is used for extracting a local attention distribution map of an input image and generating deep semantic embedding features. The student network module is a single-modal vision model comprising a backbone recognition network, a classification head, a quality-assessment head and an attribute-analysis head; it receives the knowledge and local-perception capability migrated from the expert teacher network module in the training stage, and realizes fine-grained image recognition comparison, image quality assessment and fine-grained attribute analysis in the inference stage. The training module transfers knowledge from the expert teacher network module to the student network module through a cross-modal semantic-alignment loss and an attention distillation loss, and trains the student network module with the Adam optimization algorithm. The inference module calls the backbone recognition network in the student network module to complete the tasks related to fine-grained recognition in the inference stage.

Further, the multimodal large model uses the Qwen2.5-VL-7B multimodal large model, and the student network uses the Tiny-ViT-21M single-modal vision model.

The fine-grained recognition method of multimodal large-model collaborative distillation, based on the above fine-grained recognition system, comprises the following steps: S1, taking the multimodal large model as the expert teacher network, extracting a local attention distribution map of the input image through a visual encoder
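The inference flow of claim 9 (quality screening first, then backbone recognition combining classification confidence with attribute matching) can be sketched as follows. The threshold `tau`, the head interfaces, and the 50/50 score-combination rule are illustrative assumptions, not the patented implementation:

```python
import numpy as np

def infer(image_feat, quality_head, backbone, attr_template, tau=0.5):
    # Step 1: the quality head's composite score screens the image;
    # images below the (assumed) threshold tau are rejected outright.
    q = quality_head(image_feat)
    if q < tau:
        return None

    # Step 2: the backbone outputs class probabilities and an attribute
    # vector for the screened image.
    cls_probs, attrs = backbone(image_feat)

    # Step 3: combine classification confidence with attribute matching
    # degree (cosine to a reference template) -- an assumed combination.
    conf = float(np.max(cls_probs))
    match = float(attrs @ attr_template /
                  (np.linalg.norm(attrs) * np.linalg.norm(attr_template) + 1e-12))
    score = 0.5 * conf + 0.5 * match
    return int(np.argmax(cls_probs)), score

# Dummy heads for illustration only.
feat = np.ones(4)
quality_head = lambda x: 0.9
backbone = lambda x: (np.array([0.1, 0.7, 0.2]), np.array([1.0, 0.0, 1.0]))
result = infer(feat, quality_head, backbone, np.array([1.0, 0.0, 1.0]))
rejected = infer(feat, lambda x: 0.2, backbone, np.array([1.0, 0.0, 1.0]))
```

The early quality gate is what lets deployment skip the heavier recognition step for unusable frames, matching the claim's requirement that only the backbone runs at inference time.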