
CN-121981171-A - Training method, device, equipment and medium for multi-modal large language model

CN121981171A

Abstract

The invention discloses a training method, device, equipment, and medium for a multi-modal large language model, applied to financial and medical scenarios. An acquired multi-modal data training set containing sample images and corresponding sample texts is preprocessed to obtain a target multi-modal data training set. A first training structure, formed by connecting a visual encoder and the large language model in series, is evaluated on the target training set to obtain a language modeling loss. A second training structure, formed by connecting a visual-representation alignment middle layer, a visual foundation model, a vision-language projection module, and an alignment loss module in series, is evaluated on the target training set to obtain an alignment loss. The multi-modal large language model is then trained on the language modeling loss and the alignment loss until a training end condition is reached, yielding the target multi-modal large language model. This improves the alignment of multi-modal information and the model's understanding of the visual information in images.

Inventors

  • Wang Jianzong
  • Zhang Xulong
  • Bao Xikun

Assignees

  • Ping An Technology (Shenzhen) Co., Ltd. (平安科技(深圳)有限公司)

Dates

Publication Date
2026-05-05
Application Date
2026-01-20

Claims (10)

  1. A method of training a multi-modal large language model, the multi-modal large language model comprising a visual encoder, a large language model, a vision-language projection module, a visual foundation model, and an alignment loss module, the method comprising: acquiring a multi-modal data training set containing sample images and corresponding sample texts, and preprocessing the multi-modal data training set to obtain a target multi-modal data training set; connecting the visual encoder and the large language model in series to obtain a first training structure, and performing a loss calculation on the first training structure according to the target multi-modal data training set to obtain a language modeling loss; determining a visual-representation alignment middle layer in the multi-modal large language model, connecting the visual-representation alignment middle layer, the visual foundation model, the vision-language projection module, and the alignment loss module in series to obtain a second training structure, and performing a loss calculation on the second training structure according to the target multi-modal data training set to obtain an alignment loss; and training the multi-modal large language model based on the language modeling loss and the alignment loss until a model training end condition is reached, so as to obtain a target multi-modal large language model.
  2. The method of training a multi-modal large language model according to claim 1, wherein performing a loss calculation on the visual encoder and the large language model in the first training structure according to the target multi-modal data training set to obtain a language modeling loss comprises: inputting the target multi-modal data training set into the visual encoder to obtain first visual features corresponding to the sample images; inputting the first visual features into the large language model to obtain text features corresponding to the first visual features; and performing a loss calculation on the first visual features corresponding to each sample image and the text features corresponding to each sample text to obtain the language modeling loss (see the first sketch following the claims).
  3. The method of training a multi-modal large language model according to claim 1, wherein determining a visual-representation alignment middle layer in the multi-modal large language model comprises: acquiring the Transformer structure of the multi-modal large language model, the Transformer structure comprising a plurality of network layers; for each of the network layers, modeling a probability distribution over the visual representation output by that layer, and calculating the information entropy of that visual representation from the probability distribution; calculating a preliminary similarity between each network layer's visual representation and preset visual features; determining a first weight coefficient for the information entropy and a second weight coefficient for the preliminary similarity, and calculating a composite score for each network layer from the information entropy, the preliminary similarity, and the two weight coefficients; and selecting the K network layers with the highest composite scores as the visual-representation alignment middle layers for the current iteration period, where K is a preset positive integer and the number of visual-representation alignment middle layers is one or more (see the second sketch following the claims).
  4. The method of training a multi-modal large language model according to claim 1, wherein performing a loss calculation on the second training structure according to the target multi-modal data training set to obtain an alignment loss comprises: inputting the target multi-modal data training set into the visual foundation model for feature extraction to obtain second visual features, the number of visual foundation models being one or more; mapping the output of the visual-representation alignment middle layer into the same feature space as the second visual features through the vision-language projection module to obtain mapped middle-layer visual representation features; and performing a loss calculation on the second visual features and the middle-layer visual representation features through the alignment loss module to obtain the alignment loss (see the third sketch following the claims).
  5. The method of training a multi-modal large language model according to claim 1, wherein training the multi-modal large language model based on the language modeling loss and the alignment loss until a model training end condition is reached to obtain a target multi-modal large language model comprises: determining a total loss function based on the language modeling loss and the alignment loss; training the multi-modal large language model by back propagation according to the total loss function, inputting the target multi-modal data training set into the trained model, and outputting a prediction result for the target multi-modal data; calculating the accuracy of the multi-modal large language model based on the prediction result and the ground-truth result; and repeating this test until the accuracy of the multi-modal large language model reaches a preset value, so as to obtain the target multi-modal large language model; wherein the total loss function is L_total = L_LM + λ·L_align, where L_LM denotes the language modeling loss, L_align denotes the alignment loss, and λ is the weight coefficient of the alignment loss (see the fourth sketch following the claims).
  6. The method of training a multi-modal large language model according to claim 1, wherein preprocessing the multi-modal data training set to obtain a target multi-modal data training set comprises: performing size normalization, pixel standardization, and data augmentation on the sample images in the multi-modal data training set to obtain a processed sample image dataset; performing word segmentation, mask encoding, and length truncation on the sample texts in the multi-modal data training set to obtain a processed sample text dataset; and fusing the processed sample image dataset and the processed sample text dataset to obtain the target multi-modal data training set (see the fifth sketch following the claims).
  7. A training device for a multi-modal large language model, the multi-modal large language model comprising a visual encoder, a large language model, a vision-language projection module, a visual foundation model, and an alignment loss module, the training device comprising: an acquisition module configured to acquire a multi-modal data training set containing sample images and corresponding sample texts, and to preprocess the multi-modal data training set to obtain a target multi-modal data training set; a calculation module configured to connect the visual encoder and the large language model in series to obtain a first training structure, and to perform a loss calculation on the first training structure according to the target multi-modal data training set to obtain a language modeling loss; a determining module configured to determine a visual-representation alignment middle layer in the multi-modal large language model, connect the visual-representation alignment middle layer, the visual foundation model, the vision-language projection module, and the alignment loss module in series to obtain a second training structure, and perform a loss calculation on the second training structure according to the target multi-modal data training set to obtain an alignment loss; and a training module configured to train the multi-modal large language model based on the language modeling loss and the alignment loss until a model training end condition is reached, so as to obtain a target multi-modal large language model.
  8. The training device of a multi-modal large language model according to claim 7, wherein the calculation module is specifically configured to: input the target multi-modal data training set into the visual encoder to obtain first visual features corresponding to the sample images; input the first visual features into the large language model to obtain text features corresponding to the first visual features; and perform a loss calculation on the first visual features corresponding to each sample image and the text features corresponding to each sample text to obtain the language modeling loss.
  9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the training method of the multi-modal large language model according to any one of claims 1 to 6.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the training method of the multi-modal large language model according to any one of claims 1 to 6.
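Below is a minimal, hedged sketch of the first training structure of claim 2: the visual encoder chained in series with the language model, with next-token cross-entropy over the sample text serving as the language modeling loss. The sizes and the tiny encoder and Transformer stand-ins (d_model, vocab_size, the one-token visual encoder) are illustrative assumptions, not the patent's concrete architecture.

```python
# Sketch of claim 2 (assumed PyTorch implementation; all sizes illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 256, 1000

# Stand-in visual encoder: one 32x32 RGB image -> one visual token.
visual_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, d_model))
# Stand-in "large language model": a small causal Transformer.
llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
tok_emb = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size)

def language_modeling_loss(images, token_ids):
    """Cross-entropy of next-token prediction conditioned on visual features."""
    vis = visual_encoder(images).unsqueeze(1)            # (B, 1, d_model)
    txt = tok_emb(token_ids[:, :-1])                     # teacher-forcing inputs
    seq = torch.cat([vis, txt], dim=1)                   # (B, T, d_model)
    mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
    hidden = llm(seq, mask=mask)
    logits = lm_head(hidden)                             # position i predicts token i
    return F.cross_entropy(logits.reshape(-1, vocab_size), token_ids.reshape(-1))

images = torch.randn(4, 3, 32, 32)
token_ids = torch.randint(0, vocab_size, (4, 8))
print(language_modeling_loss(images, token_ids).item())
```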
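The layer-selection procedure of claim 3 can be sketched as follows. The patent does not fix how the per-layer probability distribution is modeled; a softmax over the feature dimension is one plausible choice and is an assumption here, as are the cosine similarity against the preset feature and the equal weight coefficients.

```python
# Sketch of claim 3: score each network layer by information entropy plus
# similarity to a preset visual feature, then keep the top-K layers.
import torch
import torch.nn.functional as F

def select_alignment_layers(layer_reps, preset_feat, w1=0.5, w2=0.5, k=2):
    """layer_reps: one (num_tokens, d) visual representation per network layer.
    preset_feat: (d,) preset visual feature. Returns the top-k layer indices."""
    scores = []
    for rep in layer_reps:
        probs = F.softmax(rep, dim=-1)                   # modeled distribution
        entropy = -(probs * probs.clamp_min(1e-9).log()).sum(-1).mean()
        sim = F.cosine_similarity(rep.mean(dim=0), preset_feat, dim=0)
        scores.append(w1 * entropy + w2 * sim)           # composite score
    return torch.topk(torch.stack(scores), k).indices.tolist()

# Toy usage: six layers, each emitting 16 visual tokens of width 32.
layers = [torch.randn(16, 32) for _ in range(6)]
print(select_alignment_layers(layers, torch.randn(32)))
```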
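Claim 4's second training structure can be sketched like this. The claim says only "loss calculation"; the cosine-based alignment loss and the frozen one-layer stand-in for the visual foundation model are assumptions.

```python
# Sketch of claim 4: project a middle-layer representation into the feature
# space of a (frozen) visual foundation model and score their agreement.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_llm, d_vis = 256, 384

vision_foundation = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, d_vis))
for p in vision_foundation.parameters():      # foundation model stays frozen
    p.requires_grad_(False)

vl_projection = nn.Linear(d_llm, d_vis)       # vision-language projection module

def alignment_loss(mid_layer_rep, images):
    """mid_layer_rep: (B, d_llm) pooled middle-layer visual representation."""
    target = vision_foundation(images)        # the "second visual feature"
    mapped = vl_projection(mid_layer_rep)     # mapped into the shared space
    return 1.0 - F.cosine_similarity(mapped, target, dim=-1).mean()

print(alignment_loss(torch.randn(4, d_llm), torch.randn(4, 3, 32, 32)).item())
```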
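The joint objective of claim 5, L_total = L_LM + λ·L_align, then drives a single backward pass. This sketch reuses language_modeling_loss and alignment_loss from the two sketches above; the optimizer, learning rate, and λ = 0.1 are illustrative choices, not values from the patent.

```python
# Sketch of claim 5: combine both losses and update all trainable parameters.
import torch

trainable = [p for m in (visual_encoder, llm, tok_emb, lm_head, vl_projection)
             for p in m.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
lambda_align = 0.1                            # weight coefficient of L_align

def training_step(images, token_ids, mid_layer_rep):
    # In the full model, mid_layer_rep would be taken from the selected
    # Transformer layer during the same forward pass.
    l_lm = language_modeling_loss(images, token_ids)
    l_align = alignment_loss(mid_layer_rep, images)
    total = l_lm + lambda_align * l_align     # L_total = L_LM + lambda * L_align
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```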
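Finally, claim 6's preprocessing can be sketched as below. The 224x224 size, the 0.5/0.5 normalization constants, the horizontal-flip augmentation, and the whitespace tokenizer are all placeholders; the claim names only the operation types (size normalization, pixel standardization, data augmentation; word segmentation, mask encoding, length truncation; fusion).

```python
# Sketch of claim 6: preprocess images and texts, then fuse them into
# (image_tensor, token_ids, attention_mask) training examples.
import torch
from torchvision import transforms

image_pipeline = transforms.Compose([
    transforms.Resize((224, 224)),                        # size normalization
    transforms.RandomHorizontalFlip(),                    # data augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),  # pixel standardization
])

def encode_text(text, vocab, max_len=32):
    """Word segmentation + id encoding + length truncation + attention mask."""
    ids = [vocab.get(w, vocab["<unk>"]) for w in text.split()][:max_len]
    mask = [1] * len(ids) + [0] * (max_len - len(ids))    # real tokens vs. padding
    ids = ids + [vocab["<pad>"]] * (max_len - len(ids))
    return torch.tensor(ids), torch.tensor(mask)

def fuse_example(pil_image, text, vocab):
    ids, mask = encode_text(text, vocab)
    return image_pipeline(pil_image), ids, mask           # one fused example
```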

Description

Training method, device, equipment and medium for multi-modal large language model

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a training method, device, equipment, and medium for a multi-modal large language model.

Background

The multi-modal large language model serves as a basic research task of general artificial intelligence: given a picture and a question description, it aims to return the answer that best matches the question, thereby helping users understand the picture content or generate text descriptions related to the picture. With remarkable progress in recent years, achieved by scaling up data and model size, these large language models have exhibited striking latent capabilities; for example, intelligent dialogue systems based on multi-modal large language models are now widely used in fields such as financial customer service and medical consultation. However, existing multi-modal large language models still underperform on vision-centric tasks such as object counting and spatial-relationship reasoning. They rely mainly on text-output supervision: the visual pathway is supervised only indirectly through the language task, and a direct supervision mechanism for visual representations is lacking. As a result, intermediate visual representations easily lose fine-grained, spatial, and structural information, which degrades performance on vision-centric tasks and leaves the multi-modal large language model unable to properly understand image features and achieve modality alignment. How to improve the alignment of multi-modal information, and thereby the model's understanding of image visual information, is therefore a pressing technical problem.

Disclosure of Invention

Based on the above, it is necessary to solve the aforementioned technical problems. Embodiments of the present invention provide a training method, apparatus, device, and medium for a multi-modal large language model that can improve the alignment of multi-modal information and thereby the model's understanding of image visual information.
A first aspect of an embodiment of the present application provides a training method of a multi-modal large language model, the multi-modal large language model including a visual encoder, a large language model, a vision-language projection module, a visual foundation model, and an alignment loss module, the training method including: acquiring a multi-modal data training set containing sample images and corresponding sample texts, and preprocessing the multi-modal data training set to obtain a target multi-modal data training set; connecting the visual encoder and the large language model in series to obtain a first training structure, and performing a loss calculation on the first training structure according to the target multi-modal data training set to obtain a language modeling loss; determining a visual-representation alignment middle layer in the multi-modal large language model, connecting the visual-representation alignment middle layer, the visual foundation model, the vision-language projection module, and the alignment loss module in series to obtain a second training structure, and performing a loss calculation on the second training structure according to the target multi-modal data training set to obtain an alignment loss; and training the multi-modal large language model based on the language modeling loss and the alignment loss until a model training end condition is reached, so as to obtain a target multi-modal large language model.

A second aspect of an embodiment of the present application provides a training apparatus for a multi-modal large language model, the multi-modal large language model including a visual encoder, a large language model, a vision-language projection module, a visual foundation model, and an alignment loss module, the training apparatus comprising: an acquisition module configured to acquire a multi-modal data training set containing sample images and corresponding sample texts, and to preprocess the multi-modal data training set to obtain a target multi-modal data training set; a calculation module configured to connect the visual encoder and the large language model in series to obtain a first training structure, and to perform a loss calculation on the first training structure according to the target multi-modal data training set to obtain a language modeling loss; a determining module configured to determine a visual-representation alignment middle layer in the multi-modal large language model, connect the visual-representation alignment middle layer, the visual foundation model, the vision-language projection module, and the alignment loss module in series to obtain a second training structure, and perform a loss calculation on the second training structure according to the target multi-modal data training set to obtain an alignment loss; and a training module configured to train the multi-modal large language model based on the language modeling loss and the alignment loss until a model training end condition is reached, so as to obtain a target multi-modal large language model.