CN-121686166-B - Multi-mode large model construction method for water conservancy scene
Abstract
The invention discloses a multimodal large-model construction method for water conservancy scenes, comprising the following steps: data acquisition; data preprocessing; construction of a water-conservancy multimodal dataset; training of a reference multimodal large model; saving of the model weights in the BF16 data format; and 4-bit k-quantization of the attention-mechanism key tensors. The invention applies a quantized weight-decomposed low-rank adapter (QDoRA) for supervised fine-tuning, which effectively improves the performance of the reference multimodal large model on specific tasks in the water conservancy field, reduces the computational cost of training through a parameter-efficient fine-tuning strategy, and facilitates subsequent iterative optimization.
Inventors
- ZHOU WEI
- ZHANG JUN
- NIE WANG
- XU HAIXIA
- ZHAO SILIANG
- YAN CHONG
- REN YIHE
Assignees
- Xiangtan University (湘潭大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20260206
Claims (8)
- 1. A multimodal large-model construction method for water conservancy scenes, characterized by comprising the following steps: Step S1, data acquisition: acquiring raw data with an unmanned-aerial-vehicle-mounted camera to obtain image data of the entire shooting area; Step S2, data preprocessing: downsampling the acquired image data and selecting images related to water conservancy; Step S3, constructing a water-conservancy multimodal dataset: taking a general multimodal large model as the reference multimodal large model, classifying the preprocessed image dataset by subject using the reference model's zero-shot learning capability, and thereby constructing the water-conservancy multimodal dataset; Step S4, training the reference multimodal large model in the water-conservancy domain to obtain HydroMLLM, a multimodal large model oriented to water-conservancy scenes. The specific process of step S4 is as follows: S41, performing supervised fine-tuning with a quantized weight-decomposed low-rank adapter (QDoRA), which quantizes the model parameters and updates the model weights through low-rank matrices. The specific process of step S41 is as follows: the original weights W0 are first quantized to k bits, giving the quantized weights W0q = Q(W0); at forward computation these are dynamically dequantized to approximate the full-precision weights, Ŵ0 = D(W0q), where Ŵ0 denotes the dequantized original weights and D(·) is the dequantization function. In the overall QDoRA weight update, the fine-tuned weight W′ is the dequantized original weight Ŵ0 plus an incremental matrix ΔW, decomposed into a magnitude vector and a direction component computed from two low-rank adaptation matrices; the specific expression of W′ is: W′ = m ⊙ (Ŵ0 + BA) / ‖Ŵ0 + BA‖F, where ‖·‖F denotes the Frobenius norm used to normalize Ŵ0 + BA; A and B are the low-rank adaptation matrices introduced by QDoRA, with ΔW = BA; m is a learnable magnitude vector, m ∈ R^d, where R denotes the real-number domain and d the output dimension of the weight matrix; and ⊙ denotes element-wise multiplication. For any quantizable parameter p, the quantization expression is q = round(p / s), where s is the quantization scaling factor computed for the parameter block, p is the original floating-point parameter, and round is the rounding function. During computation, the quantized parameters are dequantized in real time as p̂ = D(q) = s · q, where p̂ is the dequantized result of q. The forward propagation of the fine-tuned reference multimodal large model is expressed as h = W′x, given the input activation x and output activation h in each layer of the reference multimodal large model during fine-tuning; in back-propagation, only A, B, and m receive gradients and are updated, while the quantized parameters remain frozen. S42, constructing a water-conservancy preference dataset for direct preference optimization (DPO) alignment training: designing a data engine to construct the water-conservancy preference dataset, then combining the data engine with DPO alignment training to align the reference multimodal large model to the water-conservancy domain, obtaining the water-conservancy-oriented multimodal large model HydroMLLM; S43, designing a water-conservancy chain-of-thought (Water-CoT) prompting method to improve HydroMLLM's performance in the water-conservancy field at the reasoning stage; and Step S5, saving the HydroMLLM model weights in the BF16 data format and applying 4-bit k-quantization to the attention-mechanism key tensors in HydroMLLM to obtain the final water-conservancy-oriented multimodal large model.
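The QDoRA update in step S41 can be sketched numerically as follows. This is a minimal NumPy illustration, assuming per-tensor absmax scaling for the k-bit quantization and row-wise normalization with the magnitude vector sized to the output dimension, as the claim describes; all variable names are illustrative, not the patent's.

```python
import numpy as np

def quantize(p, k=4):
    # q = round(p / s): linear k-bit quantization with an absmax scaling factor s
    s = np.abs(p).max() / (2 ** (k - 1) - 1)
    return np.round(p / s).astype(np.int8), s

def dequantize(q, s):
    # real-time dequantization at forward computation: p_hat = s * q
    return q.astype(np.float32) * s

def qdora_forward(x, W0, A, B, m, k=4):
    # W' = m ⊙ (Ŵ0 + B A) / ‖Ŵ0 + B A‖, then h = W' x; only A, B, m are trainable
    q, s = quantize(W0, k)
    V = dequantize(q, s) + B @ A          # dequantized weight plus low-rank update
    W_prime = m * V / np.linalg.norm(V, axis=1, keepdims=True)
    return W_prime @ x

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 16, 2
W0 = rng.normal(size=(d_out, d_in)).astype(np.float32)
A = rng.normal(size=(r, d_in)).astype(np.float32) * 0.1
B = np.zeros((d_out, r), dtype=np.float32)        # B = 0 at init, so BA = 0 initially
m = np.linalg.norm(W0, axis=1, keepdims=True)     # magnitude initialized from W0's norms
h = qdora_forward(rng.normal(size=d_in).astype(np.float32), W0, A, B, m)
```

Only A, B, and m would receive gradients during training; the quantized copy of W0 stays frozen, which is what keeps the fine-tuning parameter-efficient.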
- 2. The method for constructing a multimodal large model for water conservancy scenes according to claim 1, wherein the specific process of step S2 is as follows: S21, image downsampling: uniformly downsampling all original images to a standard size of 600×400 pixels; S22, screening the original images: screening the original images and selecting those related to water conservancy.
- 3. The method for constructing a multimodal large model for water conservancy scenes according to claim 1, wherein in step S3 the water-conservancy multimodal dataset is constructed as follows: S31, zero-shot image classification: inputting the preprocessed image data into the reference multimodal large model and guiding it to analyze and identify the image content with designed prompt words; S32, classifying and correcting the images using the zero-shot reasoning capability of the reference multimodal large model to obtain classified and corrected category labels; S33, setting a theme-judging mechanism: considering that an actual image may have blurry content, contain multiple themes, or match no preset category, a judging mechanism is set so that the reference multimodal large model outputs a 'no obvious image theme' label if no theme in the image matches the preset categories; S34, manual correction and verification: organizing a verification team of at least two water-conservancy professionals to spot-check the classified and corrected category labels and the 'no obvious image theme' labels, at a sampling proportion of no less than 30% of the total data volume; the checks cover how well the labels match the actual image content and the accuracy of the category judgment; if mislabeled samples are found, the labels are corrected and the misjudgment type recorded, and for any category with a misjudgment rate above 5%, all images in that category undergo full manual verification to ensure label accuracy; and S35, forming an 'image file name-category' pair from each manually verified category label and image file name, yielding the water-conservancy multimodal dataset.
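The theme-judging mechanism of step S33 and the pairing of step S35 can be sketched as follows. The category names are hypothetical (the patent does not enumerate the preset categories), and the substring match stands in for the reference model's zero-shot judgment:

```python
# Hypothetical preset water-conservancy categories; the patent does not list them.
PRESET_CATEGORIES = ("reservoir", "dam", "river channel", "sluice gate", "embankment")

FALLBACK_LABEL = "no obvious image theme"

def resolve_label(model_answer: str) -> str:
    # S33's judging mechanism: if the reference model's zero-shot answer matches
    # no preset category, output the "no obvious image theme" label instead.
    answer = model_answer.strip().lower()
    for category in PRESET_CATEGORIES:
        if category in answer:
            return category
    return FALLBACK_LABEL

def to_record(filename: str, model_answer: str) -> tuple:
    # S35: each verified sample becomes an ("image file name", "category") pair
    return (filename, resolve_label(model_answer))
```

In the full pipeline these labels would then pass through the S34 spot-check (at least 30% sampling, full re-check for categories above a 5% misjudgment rate) before entering the dataset.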
- 4. The method for constructing a multimodal large model for water conservancy scenes according to claim 3, wherein in step S42 representative errors produced when the reference multimodal large model answers water-conservancy-related questions are used as negative samples in the water-conservancy preference dataset, a negative sample representing an answer that should not be generated; an accurate corrected version is provided for each representative error and treated as the corresponding positive sample, representing the answer that should be generated; the data engine supports repeated iteration, gradually accumulating preference pairs of diversified error modes and their correct answers through multiple rounds of the model output-correction-data collection-DPO alignment training cycle, finally building the water-conservancy preference dataset.
- 5. The method for constructing a multimodal large model for water conservancy scenes according to claim 4, wherein in step S42 the data engine is a training closed loop that iteratively optimizes the water-conservancy preference dataset, with the following core flow: ① train the reference multimodal large model to convergence on the existing water-conservancy preference dataset; ② use the trained reference multimodal large model for inference: on one hand, collect representative errors in its answers to water-conservancy questions as negative samples of the first type; on the other hand, compare the model outputs against the original dataset labels and flag labels inconsistent with the actual image content as negative samples of the second type, i.e., misjudged dataset labels; ③ manually correct both types of negative samples, providing accurate answers for the model's erroneous responses and corrected labels for the misjudged dataset labels, merge the negative-positive sample pairs into the water-conservancy preference dataset, and simultaneously update the erroneous labels in the original water-conservancy multimodal dataset; ④ run one round of DPO alignment training on the QDoRA-trained reference multimodal large model using the water-conservancy preference dataset to obtain an updated reference model; ⑤ return to step ② and continue the loop until the reference model's performance no longer improves and the label accuracy of the original water-conservancy multimodal dataset reaches at least 99%.
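The DPO alignment rounds in this closed loop optimize the standard direct-preference-optimization objective over (positive, negative) answer pairs. A minimal sketch of the per-pair loss follows; β and the log-probability inputs are illustrative, and in practice they would come from the policy model and a frozen reference copy:

```python
import math

def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    # L = -log sigmoid(beta * [(logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)])
    # logp_*     : policy log-likelihoods of the positive / negative answer
    # ref_logp_* : frozen reference-model log-likelihoods of the same answers
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At a zero margin the loss equals log 2; it falls as the policy puts more probability on the corrected (positive) answer relative to the representative error, which is exactly the behavior each loop iteration reinforces.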
- 6. The method for constructing a multimodal large model for water conservancy scenes according to claim 4, wherein in step S43 the water-conservancy chain-of-thought (Water-CoT) prompting method introduces a region-of-interest intelligent cropping and focusing stage before the reference multimodal large model's question-answer interaction flow, as follows: 1) use a prompt combined with water-conservancy domain knowledge to determine, from the user's original question, the target region the user is focusing on; 2) crop the original image around the target region according to the localization coordinates of the identified target object fed back by the reference multimodal large model in the first stage, generating one or more region-of-interest image slices containing only a clear target object; 3) in the subsequent dialogue or analysis task, submit the region-of-interest image slices containing the target details, together with the user's specific question or analysis instruction, to the reference multimodal large model for in-depth analysis, attribute identification, and intelligent resolution.
- 7. The method for constructing a multimodal large model for water conservancy scenes according to claim 4, wherein in step S5 the model weights are stored in the BF16 data format as follows: BF16 fully retains the exponent portion of the full-precision representation and compresses only the mantissa, in the mathematical form x̂ = BF16(x), which keeps the sign bit and all 8 exponent bits of x and truncates the 23-bit mantissa to 7 bits; wherein x denotes the input original full-precision floating-point number and x̂ the floating-point number after conversion to the BF16 data format.
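A minimal sketch of the BF16 conversion described in claim 7, implemented by truncating the FP32 bit pattern (production converters typically round to nearest-even rather than truncate):

```python
import struct

def fp32_to_bf16(x: float) -> float:
    # BF16 keeps FP32's sign bit and all 8 exponent bits, and truncates the
    # 23-bit mantissa to 7 bits: equivalently, zero the low 16 bits of the word.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]
```

Because the exponent field is unchanged, BF16 preserves FP32's dynamic range while halving storage, which is why it suits weight storage here.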
- 8. The method of claim 4, wherein in step S5 the attention-mechanism key tensors in HydroMLLM use a 4-bit k-quantization technique that adopts a weight organization based on a super-block structure: the weights are organized hierarchically, each super-block containing 8 blocks and each block containing 32 weight elements. The weight recovery formula of the 4-bit k-quantization is ŵ = d · q + m, where ŵ is the restored weight value, q is the integer value quantized to 4 bits, d is the block scale factor, and m is the block minimum. The quantization process first divides the weight matrix into blocks, then computes each block's statistical characteristics, and then linearly quantizes each weight by the formula q = round((w − m) / d), mapping the floating-point weight w into the 4-bit integer space; the block scale factors and block minimums are themselves quantized to further reduce memory use; finally, the quantized values and metadata are packed into a compact memory layout, completing the quantization process.
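A minimal per-block sketch of the 4-bit quantization and recovery formulas in claim 8, for one 32-weight block; the packing of 8 such blocks into the super-block memory layout, and the secondary quantization of the scales and minimums, are omitted:

```python
import numpy as np

BLOCK_SIZE = 32   # each block holds 32 weights; 8 blocks form one super-block

def quantize_block(w):
    # q = clip(round((w - m) / d), 0, 15): map floats into the 4-bit integer
    # space, with block scale factor d and block minimum m as the statistics
    m = float(w.min())
    d = (float(w.max()) - m) / 15.0
    if d == 0.0:
        d = 1.0   # degenerate constant block
    q = np.clip(np.round((w - m) / d), 0, 15).astype(np.uint8)
    return q, d, m

def dequantize_block(q, d, m):
    # claim 8's recovery formula: w_hat = d * q + m
    return q.astype(np.float32) * d + m

rng = np.random.default_rng(1)
w = rng.normal(size=BLOCK_SIZE).astype(np.float32)
q, d, m = quantize_block(w)
w_hat = dequantize_block(q, d, m)
```

Storing a min alongside the scale makes the mapping asymmetric, so the full [min, max] range of each block is covered by the 16 integer levels; per-element error is bounded by half the block scale.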
Description
Multimodal large model construction method for water conservancy scenes
Technical Field
The invention relates to the field of water conservancy, in particular to a multimodal large-model construction method for water conservancy scenes.
Background
The decision-making mode of traditional water-conservancy information systems has bottlenecks in data acquisition, analysis, and risk assessment. On one hand, experts must spend a great deal of time extracting key information from multi-source heterogeneous data such as historical records, meteorological data, hydrological data, and engineering-structure information, a process that is tedious and inefficient; on the other hand, expert experience is easily influenced by subjective factors, so risk-assessment results may be biased and the objectivity and accuracy of decisions are hard to guarantee. The traditional decision-making mode also responds relatively slowly to sudden water-conservancy safety problems: in the face of extreme weather events or sudden engineering accidents, expert consultation and plan formulation are time-consuming, and decision delays can amplify disaster losses. With the rapid development of information technology, massive multi-source heterogeneous data have accumulated in the water-conservancy field, but the traditional decision-making mode struggles to process and analyze these data, and the information they contain is difficult to fully mine and utilize.
Disclosure of Invention
To solve the above technical problems, the invention provides a multimodal large-model construction method for water conservancy scenes that is simple in algorithm and highly practical.
The technical scheme adopted to solve the above technical problems is a multimodal large-model construction method for water conservancy scenes comprising the following steps: Step S1, data acquisition: acquiring raw data with an unmanned-aerial-vehicle-mounted camera to obtain image data of the entire shooting area; Step S2, data preprocessing: downsampling the acquired image data and selecting images related to water conservancy; Step S3, constructing a water-conservancy multimodal dataset: taking a general multimodal large model as the reference multimodal large model, classifying the preprocessed image dataset by subject using the reference model's zero-shot learning capability, and thereby constructing the water-conservancy multimodal dataset; Step S4, training the reference multimodal large model in the water-conservancy domain to obtain HydroMLLM, a multimodal large model oriented to water-conservancy scenes; and Step S5, saving the HydroMLLM model weights in the BF16 data format and applying 4-bit k-quantization to the attention-mechanism key tensors in HydroMLLM to obtain the final water-conservancy-oriented multimodal large model. The specific process of step S2 is as follows: S21, image downsampling: uniformly downsampling all original images to a standard size of 600×400 pixels; S22, screening the original images: screening the original images and selecting those related to water conservancy.
In the above method for constructing a multimodal large model for water conservancy scenes, in step S3 the water-conservancy multimodal dataset is constructed as follows: S31, zero-shot image classification: inputting the preprocessed image data into the reference multimodal large model and guiding it to analyze and identify the image content with designed prompt words; S32, classifying and correcting the images using the zero-shot reasoning capability of the reference multimodal large model to obtain classified and corrected category labels; S33, setting a theme-judging mechanism: considering that an actual image may have blurry content, contain multiple themes, or match no preset category, a judging mechanism is set so that the reference multimodal large model outputs a 'no obvious image theme' label if no theme in the image matches the preset categories; S34, manual correction and verification: organizing a verification team of at least two water-conservancy professionals to spot-check the classified and corrected category labels and the 'no obvious image theme' labels, at a sampling proportion of no less than 30% of the total data volume; the checks cover how well the labels match the actual image content and the accuracy of the category judgment; if mislabeled samples are found, the labels are corrected and the misjudgment type recorded, and for any category with a misjudgment rate above 5%, all images in that category undergo full manual verification so as to ensure label accuracy