CN-121982466-A - Multi-mode feature dynamic alignment fusion-based image large model optimization method and system

CN121982466A

Abstract

The invention discloses an image large-model optimization method and system based on dynamic alignment and fusion of multi-modal features, relating to the technical fields of artificial intelligence and computer vision. The method comprises: obtaining multi-modal input data and extracting cross-modal features; achieving token-level fine-grained feature alignment through a dynamic alignment module that adaptively adjusts the attention temperature coefficient and the alignment weights; adopting a hierarchical dynamic fusion mechanism that combines a gated attention unit with feature-importance evaluation to achieve complementary fusion of modality information; and training the image large model on the optimized fused features and outputting the optimization result. Through the collaborative design of the dynamic alignment and fusion strategies, the invention markedly improves the model's generalization ability and feature-fusion precision in complex scenes, reduces the impact of noisy data on model performance, can be widely applied in fields such as medical image analysis, autonomous-driving perception, and e-commerce image-text retrieval, and meets the requirements of both cloud and edge deployment.

Inventors

  • Mei Shuhuan
  • Zhou Jiaqi
  • Xu Long

Assignees

  • 合肥星河矩阵科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-01-30

Claims (10)

  1. An image large-model optimization method based on dynamic alignment and fusion of multi-modal features, characterized by comprising the following steps: S1, acquiring multi-modal input data comprising at least image-modality data and text-modality data; after preprocessing the multi-modal input data, extracting image-modality features and text-modality features respectively through a dual-branch feature extraction network, the dual-branch feature extraction network supporting adaptive deployment of both a lightweight model and a complex model; S2, inputting the image-modality features and text-modality features into a dynamic alignment module, constructing a cross-modal correlation matrix via token-level cosine-similarity computation, achieving fine-grained feature alignment in combination with a bipartite-graph matching algorithm, while adaptively adjusting the attention temperature coefficient based on training progress and a feature-consistency index and dynamically optimizing the alignment-weight distribution; S3, performing hierarchical dynamic fusion on the dynamically aligned cross-modal features, in which a local feature-fusion layer and a global feature-fusion layer work in concert and a gated attention unit evaluates the task contribution of each modality's features, outputting adaptive fused features; S4, constructing a multi-loss joint optimization system comprising an output-layer classification loss, an intermediate-layer fine-grained alignment loss, and a fused-feature consistency loss, dynamically adjusting the weight of each loss term based on the model's training state, and updating model parameters via back-propagation; S5, feeding the adaptive fused features into the downstream task layer of the image large model, completing training optimization, and outputting a prediction result for the target task.
  2. The image large-model optimization method based on dynamic alignment and fusion of multi-modal features according to claim 1, wherein the preprocessing in step S1 comprises data cleaning, normalization, and modality completion; a label-softening strategy is adopted to adjust the matched label values for noisy data; and the dual-branch feature extraction network comprises a visual branch and a text branch, the visual branch adopting a ResNet-series or Swin Transformer-series model and the text branch adopting a BERT-series model.
  3. The image large-model optimization method based on dynamic alignment and fusion of multi-modal features according to claim 1, wherein the temperature-coefficient adjustment rule of the dynamic alignment module in step S2 is: a larger temperature coefficient is set at the start of training so that the attention distribution is close to uniform; as the number of training iterations grows, the temperature coefficient is gradually reduced so that attention focuses on the correlations of core features; and when the feature-consistency index falls below a preset threshold, the temperature coefficient is temporarily raised to restart global alignment.
  4. The image large-model optimization method based on dynamic alignment and fusion of multi-modal features according to claim 1, wherein in step S3 the gated attention unit computes the modality feature weights through a formula.
  5. The image large-model optimization method based on dynamic alignment and fusion of multi-modal features according to claim 1, wherein the total loss function of the multi-loss joint optimization system in step S4 is L_total = ω1·L_cls + ω2·L_align + ω3·L_cons, where ω1, ω2, and ω3 are dynamically adjusted weights, L_cls is a cross-entropy loss, L_align is a mean-squared-error loss, and L_cons is a contrastive loss.
  6. An image large-model optimization system based on dynamic alignment and fusion of multi-modal features, characterized by comprising: a data preprocessing and feature extraction module for acquiring multi-modal input data, performing preprocessing, and extracting image-modality and text-modality features through a dual-branch feature extraction network; a dynamic alignment module for receiving the cross-modal features and achieving fine-grained cross-modal feature alignment through token-level similarity computation, bipartite-graph matching, and adaptive temperature-coefficient adjustment; a hierarchical dynamic fusion module comprising a local feature-fusion layer and a global feature-fusion layer, with a built-in gated attention unit that evaluates each modality's feature contribution and outputs adaptive fused features; a multi-loss joint optimization module for constructing the multi-loss function system, dynamically adjusting the loss weights, and updating model parameters via back-propagation; and a model training and inference module for receiving the adaptive fused features, completing training optimization and downstream-task inference of the image large model, and outputting a prediction result.
  7. The image large-model optimization system based on dynamic alignment and fusion of multi-modal features according to claim 6, wherein the dynamic alignment module comprises a similarity-computation unit, a bipartite-graph matching unit, and a temperature-coefficient adaptation unit; the temperature-coefficient adaptation unit monitors the feature-consistency index in real time and dynamically adjusts the temperature coefficient within the range 0.05-0.5 according to a preset rule.
  8. The image large-model optimization system based on dynamic alignment and fusion of multi-modal features according to claim 6, wherein the local feature-fusion layer of the hierarchical dynamic fusion module adopts a 1D-CNN to extract local cross-modal correlation features, the global feature-fusion layer adopts a Transformer encoder to capture long-range dependencies, and the gated attention unit works in concert with the two fusion layers to achieve feature screening and enhancement.
  9. The image large-model optimization system based on dynamic alignment and fusion of multi-modal features according to claim 6, wherein a weight adjuster is built into the multi-loss joint optimization module and dynamically allocates the weights of the loss terms according to training accuracy and convergence speed; the combined weight of ω2 and ω3 is not lower than 60% early in training, and the weight of ω1 is gradually raised to 50%-60% late in training.
  10. The image large-model optimization system based on dynamic alignment and fusion of multi-modal features according to claim 6, characterized in that the system supports both cloud and edge deployment; for edge deployment, a lightweight MobileNet-V2 model is adopted as the visual branch, and inference latency is kept within 150 ms through model pruning and quantization.
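
The label-softening strategy of claim 2 is not given in closed form in the claims. A minimal sketch, assuming the common uniform-smoothing variant (the epsilon value and the uniform spread are illustrative assumptions, not taken from the patent):

```python
def soften_labels(one_hot, epsilon=0.1):
    """Shave `epsilon` off the hard target and spread it uniformly over all
    classes, so a noisy (mismatched) image-text pair cannot push the model
    toward full confidence. `one_hot` is a list of 0/1 floats."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]
```

With epsilon = 0.1 and four classes, a hard target `[1, 0, 0, 0]` becomes `[0.925, 0.025, 0.025, 0.025]`, still summing to 1.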
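
The token-level alignment of claims 1 and 6 (cosine-similarity correlation matrix plus bipartite-graph matching) can be sketched as follows; the soft temperature-scaled attention and the use of the Hungarian algorithm for the hard matching are reasonable readings of the claims, not a reproduction of the patent's exact procedure:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def token_alignment(img_tokens, txt_tokens, temperature=0.1):
    """Align image and text tokens.

    img_tokens: (N, d) image token features; txt_tokens: (M, d) text tokens.
    Returns matched (row, col) index pairs and the soft attention matrix.
    """
    # L2-normalize so the dot product equals cosine similarity
    img = img_tokens / np.linalg.norm(img_tokens, axis=1, keepdims=True)
    txt = txt_tokens / np.linalg.norm(txt_tokens, axis=1, keepdims=True)
    sim = img @ txt.T                              # cross-modal correlation matrix (N, M)

    # Soft alignment weights: temperature-scaled softmax over text tokens
    logits = sim / temperature
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)

    # Hard one-to-one matching: maximize total similarity (Hungarian algorithm)
    row_idx, col_idx = linear_sum_assignment(-sim)
    return row_idx, col_idx, attn
```

A lower temperature sharpens the soft attention toward the best-matching token, which is the lever the adaptive temperature rule of claim 3 operates on.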
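
The temperature rule of claims 3 and 7 (start high for uniform attention, decay with training, snap back when consistency degrades, range 0.05-0.5) can be sketched as a small schedule; the linear decay and the consistency threshold value are assumptions:

```python
def temperature_schedule(step, total_steps, consistency, *,
                         t_max=0.5, t_min=0.05, consistency_threshold=0.6):
    """Adaptive attention temperature.

    Starts at t_max (near-uniform attention), decays linearly toward t_min as
    training progresses, and reverts to t_max whenever the feature-consistency
    index falls below the preset threshold, restarting global alignment.
    """
    if consistency < consistency_threshold:
        return t_max                                # re-open global alignment
    progress = min(step / total_steps, 1.0)
    return t_max - (t_max - t_min) * progress
```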
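
Claim 4 references a weight formula that the published text does not reproduce. One plausible form of a gated attention unit, shown purely for illustration (the sigmoid gate over concatenated features is an assumption, not the patent's formula):

```python
import numpy as np

def gated_fusion(img_feat, txt_feat, W_g, b_g):
    """Score the image modality's contribution with a sigmoid gate computed
    from both feature vectors, then mix the two modalities convexly.

    img_feat, txt_feat: (d,) features; W_g: (2d,) gate weights (learned in
    practice); b_g: scalar gate bias.
    """
    z = np.concatenate([img_feat, txt_feat])       # joint descriptor (2d,)
    g = 1.0 / (1.0 + np.exp(-(W_g @ z + b_g)))     # gate value in (0, 1)
    fused = g * img_feat + (1.0 - g) * txt_feat    # convex modality mix
    return fused, g
```

Because the mix is convex, neither modality can be discarded outright, which is one way to counter the "modality dominance" problem described in the background.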
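
Claims 5 and 9 together fix the loss form (L_total = ω1·L_cls + ω2·L_align + ω3·L_cons) and the weight trajectory (ω2 + ω3 at least 60% early, ω1 rising to 50-60% late). A sketch satisfying both constraints; the linear schedule and the 60/40 split of the remaining mass between ω2 and ω3 are assumptions:

```python
def loss_weights(progress, *, late_w1=0.55):
    """Dynamic loss weights for training progress in [0, 1]: the
    classification weight grows from 0.3 to ~0.55 while alignment and
    consistency carry the rest (>= 60% combined at the start)."""
    w1 = 0.3 + (late_w1 - 0.3) * progress
    rest = 1.0 - w1
    w2, w3 = 0.6 * rest, 0.4 * rest
    return w1, w2, w3

def total_loss(l_cls, l_align, l_cons, progress):
    """Weighted sum per claim 5's L_total formula."""
    w1, w2, w3 = loss_weights(progress)
    return w1 * l_cls + w2 * l_align + w3 * l_cons
```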

Description

Multi-mode feature dynamic alignment fusion-based image large model optimization method and system

Technical Field

The invention relates to the technical fields of artificial intelligence, computer vision, and multi-modal fusion, and in particular to an image large-model optimization method and system based on dynamic alignment and fusion of multi-modal features.

Background

With the rapid development of artificial-intelligence technology, large multi-modal image models are increasingly applied in fields such as medical image analysis, autonomous driving, and e-commerce retrieval. By fusing multiple modalities of information such as images and text, a multi-modal image model can break through the limitations of single-modality data and improve its understanding of complex scenes. However, existing large multi-modal image models still face several technical bottlenecks in feature alignment and fusion, which severely constrain further performance gains.

In the prior art, multi-modal feature alignment mostly adopts static strategies, such as a fixed attention temperature coefficient and preset alignment weights, which cannot adapt to dynamically changing input data and differences between modality features. For example, mainstream models such as CLIP compute the loss on output-layer features only, neglecting the alignment of fine-grained intermediate-layer features; when the input data are not strictly corresponding (for instance, similar texts describing different images), the model is easily disturbed by alignment noise, resulting in poor generalization.
Meanwhile, static attention mechanisms often suffer from "modality dominance": one modality's features are over-emphasized while the other's information is diluted; for example, the spatial position information of the visual modality can be drowned out by the semantic information of the language modality, leading to a high rate of cross-modal semantic mismatch. In feature fusion, traditional methods mostly adopt a single strategy of either early fusion or late fusion: early fusion preserves more detail but is computationally expensive and prone to amplifying noise, while late fusion reduces computation but loses complementary information between modalities, making it hard to reconcile fusion precision with inference efficiency. In addition, existing loss-function designs focus on output-layer optimization and lack constraints on intermediate-layer alignment quality and fused-feature consistency, so training tends to converge slowly or fall into local optima. Some prior work attempts to improve performance through intermediate-layer fusion and multi-loss optimization, such as the intermediate-layer feature fusion and triple-loss mechanism proposed in the Hua patent, but that scheme still uses a fixed attention-distribution strategy, adapts poorly to dynamic noisy data, and offers insufficient inference-efficiency optimization in lightweight deployment scenarios. How to devise an image large-model optimization method and system based on dynamic alignment and fusion of multi-modal features that addresses these problems therefore remains an open question for those skilled in the art.
Disclosure of the Invention

The invention aims to provide an image large-model optimization scheme that achieves dynamic feature alignment, an adaptive fusion strategy, and multi-dimensional loss optimization, solving the problems identified in the background above.

In a first aspect, an embodiment of the present application provides an image large-model optimization method based on dynamic alignment and fusion of multi-modal features, comprising the following steps:

S1, acquiring multi-modal input data comprising at least image-modality data and text-modality data; after preprocessing the multi-modal input data, extracting image-modality features and text-modality features respectively through a dual-branch feature extraction network, which supports adaptive deployment of both a lightweight model and a complex model. In a specific embodiment, the preprocessing in step S1 comprises data cleaning, normalization, and modality completion; a label-softening strategy adjusts the matched label values for noisy data; and the dual-branch feature extraction network comprises a visual branch and a text branch, the visual branch adopting a ResNet-series or Swin Transformer-series model and the text branch adopting a BERT-series model.

S2, inputting the image-modality features and text-modality features into a dynamic alignment module, constructing a cross-modal correlation matrix through