CN-122021923-A - Multi-mode large model construction method and device, electronic equipment and storage medium


Abstract

The present disclosure provides a method, an apparatus, an electronic device, and a storage medium for constructing a multi-modal large model, and relates to artificial intelligence fields such as multi-modal large models and model construction. The method includes: determining identification information of each modal component according to configuration requirement information of a target multi-modal large model; extracting, from a hierarchical configuration management platform, the independent configuration parameters corresponding to each piece of identification information, and fusing the independent configuration parameters to obtain complete configuration parameters; configuring an encoding process, using a multi-modal encoding model, according to the encoding parameters corresponding to each modality in the complete configuration parameters, and configuring a decoding process, using a multi-modal decoding model, according to the decoding parameters corresponding to each modality in the complete configuration parameters, to obtain a configured encoder and decoder for each modality; and splicing, using a multi-modal splicing model, the configured encoders and decoders of each modality with a large language model according to the data processing sequence, to obtain the target multi-modal large model. The method makes it convenient to construct the required multi-modal large model.

Inventors

SHEN KUN
XIAO ZHIWEN
YE XIAOCHUAN
CHEN YOUWEI
JIANG HAICHENG
SUN YUEHANG
ZHANG HENGHUA
SHEN DOU
LI SHIYONG
WANG YANPENG

Assignees

Beijing Baidu Netcom Science and Technology Co., Ltd. (北京百度网讯科技有限公司)

Dates

Publication Date: 2026-05-12
Application Date: 2026-02-13

Claims (13)

1. A method for constructing a multi-modal large model, comprising: determining identification information of each modal component constituting a target multi-modal large model according to acquired configuration requirement information of the target multi-modal large model; extracting, from a preset hierarchical configuration management platform, independent configuration parameters corresponding to each piece of identification information, and fusing the independent configuration parameters to obtain complete configuration parameters corresponding to the target multi-modal large model; configuring an encoding process according to the encoding parameters corresponding to each modality in the complete configuration parameters by using a preset multi-modal encoding model, and configuring a decoding process according to the decoding parameters corresponding to each modality in the complete configuration parameters by using a preset multi-modal decoding model, to obtain a configured encoder and decoder for each modality; and splicing, by using a preset multi-modal splicing model, the configured encoders and decoders of each modality with a large language model serving as a base according to a data processing sequence, to obtain the constructed target multi-modal large model.
2. The method of claim 1, wherein fusing the independent configuration parameters to obtain the complete configuration parameters corresponding to the target multi-modal large model comprises: determining an association relationship between the modal components according to the order in which data is processed across the modal components; determining, according to the association relationship, a target modal component pair having an input-output relationship, and aligning the input-output formats of the independent configuration parameters of the target modal component pair to obtain aligned configuration parameters; and fusing the aligned configuration parameters with the independent configuration parameters of the modal components that do not belong to the target modal component pair, to obtain the complete configuration parameters.
3. The method of claim 1, wherein configuring the encoding process according to the encoding parameters corresponding to each modality in the complete configuration parameters by using the preset multi-modal encoding model comprises: extracting, by using the multi-modal encoding model, original encoding parameters corresponding to each modality from the complete configuration parameters; adjusting, by using the multi-modal encoding model, the original encoding parameters corresponding to each modality according to the overall parameter scale of each modality model described in the complete configuration parameters, to obtain adjusted encoding parameters matched to that scale, wherein the larger the overall parameter scale of the corresponding modality model, the greater the number of parameters in the adjusted encoding parameters for that modality, the more complex the encoder structure, and the longer the output encoded features; and configuring, by using the multi-modal encoding model, the parameters of the encoder for each modality according to the adjusted encoding parameters.
4. The method of claim 3, further comprising: after the configured encoders of each modality are obtained, fusing the features output by the encoders of each modality by using the fusion strategy parameters in the complete configuration parameters that correspond to multi-modal feature fusion, to obtain fused multi-modal features, wherein the fusion strategy parameters comprise a fusion type and weight coefficients of the features of each modality.
5. The method of claim 1, wherein configuring the decoding process according to the decoding parameters corresponding to each modality in the complete configuration parameters by using the preset multi-modal decoding model comprises: extracting, by using the multi-modal decoding model, original decoding parameters corresponding to each modality from the complete configuration parameters; adjusting, by using the multi-modal decoding model, the original decoding parameters corresponding to each modality according to the overall parameter scale of each modality model described in the complete configuration parameters, to obtain adjusted decoding parameters matched to that scale, wherein the larger the overall parameter scale of the corresponding modality model, the greater the number of parameters in the adjusted decoding parameters for that modality, the more complex the decoder structure, and the larger the storage space occupied by the output decoded data; and configuring, by using the multi-modal decoding model, the parameters of the decoder for each modality according to the adjusted decoding parameters.
6. The method of claim 1, wherein splicing the configured encoders and decoders of each modality with the large language model serving as a base according to the data processing sequence by using the preset multi-modal splicing model, to obtain the constructed target multi-modal large model, comprises: splicing, by using the multi-modal splicing model, the output end of the configured encoder of each modality to the input end of the large language model; and splicing, by using the multi-modal splicing model, the output end of the large language model to the input end of the configured decoder of each modality, to obtain the constructed target multi-modal large model.
7. The method of claim 6, further comprising: extracting parallel strategy parameters corresponding to the identification information from the hierarchical configuration management platform, wherein the parallel strategy parameters comprise at least one of a data parallel scale, a model parallel scale, or a pipeline parallel scale; and guiding, by using the parallel strategy parameters, the parallel processing mode in at least one of the encoding process, the decoding process, and a training process.
8. The method of claim 1, further comprising: layering, in advance, all components of each modality model according to a preset abstract classification standard, wherein the hierarchies available under the abstract classification standard comprise a base model class and a model component class; and determining the independent configuration parameters corresponding to the components under each hierarchy, and constructing the hierarchical configuration management platform, which independently manages the independent configuration parameters of the components under the hierarchy to which they belong.
9. The method of any of claims 1-8, further comprising: in response to receiving configuration adjustment information for the target multi-modal large model, determining, according to the configuration adjustment information, differential configuration information relative to the configuration requirement information; determining, according to the differential configuration information, an original modal component to be replaced and a new modal component serving as its replacement; extracting new independent configuration parameters from the hierarchical configuration management platform by using the identification information of the new modal component; updating the complete configuration parameters based on the new independent configuration parameters to obtain new complete configuration parameters; and constructing a new target multi-modal large model corresponding to the configuration adjustment information based on the new complete configuration parameters, the multi-modal encoding model, the multi-modal decoding model, and the multi-modal splicing model.
10. An apparatus for constructing a multi-modal large model, comprising: an identification information determining unit configured to determine identification information of each modal component constituting a target multi-modal large model according to acquired configuration requirement information of the target multi-modal large model; a configuration parameter extraction and fusion unit configured to extract, from a preset hierarchical configuration management platform, independent configuration parameters corresponding to each piece of identification information, and to fuse the independent configuration parameters to obtain complete configuration parameters corresponding to the target multi-modal large model; an encoding and decoding configuration unit configured to configure an encoding process according to the encoding parameters corresponding to each modality in the complete configuration parameters by using a preset multi-modal encoding model, and to configure a decoding process according to the decoding parameters corresponding to each modality in the complete configuration parameters by using a preset multi-modal decoding model, so as to obtain a configured encoder and decoder for each modality; and a multi-modal large model construction unit configured to splice, by using a preset multi-modal splicing model, the configured encoders and decoders of each modality with a large language model serving as a base according to a data processing sequence, to obtain the constructed target multi-modal large model.
11. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for constructing a multi-modal large model according to any of claims 1-9.
12. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method for constructing a multi-modal large model according to any of claims 1-9.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method for constructing a multi-modal large model according to any of claims 1-9.
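
As an illustration of the extraction and fusion steps recited in claims 1 and 2, the following is a minimal Python sketch. It is not the claimed implementation: the `CONFIG_PLATFORM` dictionary, the component identifiers, the `in_dim` and `out_dim` keys, and the alignment rule (recording a projection whenever an upstream output width differs from the downstream input width) are all assumptions made for illustration, since the claims do not fix any data format.

```python
from copy import deepcopy

# Hypothetical stand-in for the hierarchical configuration management platform:
# independent configuration parameters keyed by component identification info.
CONFIG_PLATFORM = {
    "vision_encoder_v1": {"role": "encoder", "modality": "image", "out_dim": 1024},
    "llm_base_v1": {"role": "base", "in_dim": 4096, "out_dim": 4096},
    "audio_decoder_v1": {"role": "decoder", "modality": "audio", "in_dim": 512},
}


def extract_independent_configs(component_ids):
    """Look up each component's independent configuration by its identifier."""
    return {cid: deepcopy(CONFIG_PLATFORM[cid]) for cid in component_ids}


def fuse_configs(independent, component_pairs):
    """Fuse independent configs into one complete configuration.

    component_pairs lists (upstream_id, downstream_id) pairs, i.e. components
    with an input-output relationship. The assumed alignment rule records a
    projection whenever the upstream output width and the downstream input
    width differ; components outside any pair are carried over unchanged.
    """
    complete = {"components": deepcopy(independent), "projections": []}
    for upstream, downstream in component_pairs:
        out_dim = complete["components"][upstream]["out_dim"]
        in_dim = complete["components"][downstream]["in_dim"]
        if out_dim != in_dim:
            complete["projections"].append(
                {"from": upstream, "to": downstream,
                 "in_features": out_dim, "out_features": in_dim}
            )
    return complete


ids = ["vision_encoder_v1", "llm_base_v1", "audio_decoder_v1"]
pairs = [("vision_encoder_v1", "llm_base_v1"), ("llm_base_v1", "audio_decoder_v1")]
complete = fuse_configs(extract_independent_configs(ids), pairs)
print(complete["projections"])
```

Keeping the per-component entries untouched and recording alignment separately is one possible design; it leaves the platform's stored parameters reusable when a component is later swapped, as in claim 9.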

Description

Multi-mode large model construction method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of computer technologies, in particular to artificial intelligence technologies such as multi-modal large models and model construction, and more particularly to a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for constructing a multi-modal large model.

Background

With the continued development of artificial intelligence technology, large models capable of processing and understanding information in multiple modalities, such as text, image, audio, and video, have become an important trend. However, building such a multi-modal large model faces significant challenges. Conventional model construction is generally an integrated design for a fixed, preset model architecture, in which the processing components of each modality (such as a visual encoder and a language model) are deeply coupled to the overall framework, so that the configuration parameters are intermingled and difficult to adjust independently. When a component needs to be replaced to adapt to a new modality, upgrade the model's capability, or optimize for a specific task, the data flow and parameter system of the whole model often need to be redesigned and verified, resulting in a long development cycle, low technical reusability, and high trial-and-error cost. In addition, because unified, fine-grained parameter management for heterogeneous components is lacking, it is difficult to allocate computing resources flexibly according to the characteristics of each component, which limits both the efficiency of model construction and the upper limit of final performance.

Disclosure of Invention

The embodiments of the present disclosure provide a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for constructing a multi-modal large model.
In a first aspect, an embodiment of the present disclosure proposes a method for constructing a multi-modal large model, including: determining identification information of each modal component constituting a target multi-modal large model according to acquired configuration requirement information of the target multi-modal large model; extracting, from a preset hierarchical configuration management platform, independent configuration parameters corresponding to each piece of identification information, and fusing the independent configuration parameters to obtain complete configuration parameters corresponding to the target multi-modal large model; configuring an encoding process according to the encoding parameters corresponding to each modality in the complete configuration parameters by using a preset multi-modal encoding model, and configuring a decoding process according to the decoding parameters corresponding to each modality in the complete configuration parameters by using a preset multi-modal decoding model, to obtain a configured encoder and decoder for each modality; and splicing, by using a preset multi-modal splicing model, the configured encoders and decoders of each modality with a large language model serving as a base according to a data processing sequence, to obtain the constructed target multi-modal large model.
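
The configure-then-splice flow described above can be sketched in Python as follows. Everything in this sketch is hypothetical: `configure_encoder`, `configure_decoder`, `splice`, the `feature_len` and `output_scale` keys, and the toy `language_model` merely stand in for the preset multi-modal encoding, decoding, and splicing models and the base large language model, whose internals are not specified at this point in the disclosure. Concatenating encoder outputs is likewise an assumed fusion choice.

```python
from typing import Callable, Dict, List

Stage = Callable[[List[float]], List[float]]


def configure_encoder(params: dict) -> Stage:
    """Return a toy encoder whose output length follows its configuration."""
    length = params["feature_len"]

    def encode(raw: List[float]) -> List[float]:
        # Pad with zeros or truncate so the feature has the configured length.
        return (raw + [0.0] * length)[:length]

    return encode


def configure_decoder(params: dict) -> Stage:
    """Return a toy decoder that rescales the base model's output."""
    scale = params["output_scale"]

    def decode(hidden: List[float]) -> List[float]:
        return [scale * h for h in hidden]

    return decode


def language_model(features: List[float]) -> List[float]:
    # Placeholder for the base large language model (assumption).
    return [sum(features)]


def splice(encoders: Dict[str, Stage], base: Stage, decoders: Dict[str, Stage]):
    """Wire encoder outputs into the base, and the base output into decoders."""

    def model(inputs: Dict[str, List[float]]) -> Dict[str, List[float]]:
        fused: List[float] = []
        for modality, encoder in encoders.items():
            fused.extend(encoder(inputs[modality]))  # assumed: concatenation
        hidden = base(fused)
        return {modality: dec(hidden) for modality, dec in decoders.items()}

    return model


# Hypothetical complete configuration parameters for two input modalities
# and two output modalities.
complete_config = {
    "encoding": {"image": {"feature_len": 4}, "text": {"feature_len": 2}},
    "decoding": {"text": {"output_scale": 1.0}, "audio": {"output_scale": 0.5}},
}
encoders = {m: configure_encoder(p) for m, p in complete_config["encoding"].items()}
decoders = {m: configure_decoder(p) for m, p in complete_config["decoding"].items()}
model = splice(encoders, language_model, decoders)
result = model({"image": [1.0, 2.0], "text": [3.0, 4.0, 5.0]})
print(result)  # {'text': [10.0], 'audio': [5.0]}
```

Because each stage is built only from its own entry in the configuration, replacing one modality's encoder or decoder in this sketch requires changing only that entry and re-running `splice`.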
In a second aspect, an embodiment of the present disclosure proposes an apparatus for constructing a multi-modal large model, including: an identification information determining unit configured to determine identification information of each modal component constituting a target multi-modal large model according to acquired configuration requirement information of the target multi-modal large model; a configuration parameter extraction and fusion unit configured to extract, from a preset hierarchical configuration management platform, independent configuration parameters corresponding to each piece of identification information, and to fuse the independent configuration parameters to obtain complete configuration parameters corresponding to the target multi-modal large model; an encoding and decoding configuration unit configured to configure an encoding process according to the encoding parameters corresponding to each modality in the complete configuration parameters by using a preset multi-modal encoding model, and to configure a decoding process according to the decoding parameters corresponding to each modality in the complete configuration parameters by using a preset multi-modal decoding model, to obtain a configured encoder and decoder for each modality; and a multi-modal large model construction unit configured to splice, by using a preset multi-modal splicing model, the configured encoders and decoders of each modality with a large language model serving as a base according to a data processing sequence, to obtain the constructed target multi-modal large model. In a third aspect, an embodiment of the present disclosure provides an electronic device, including at least one processor, and a memory communicatively coupled to the at least one processor, where the memory stores instructions executa