CN-121982440-A - Cloud edge collaborative training deployment method for visual language model
Abstract
The invention discloses a cloud-edge collaborative training and deployment method for a visual language model, applied to edge devices. The method first loads a pre-trained visual language base model that uses the torch.bfloat16 mixed-precision format and enables gradient checkpointing, configures a low-rank adaptation (LoRA) adapter, freezes the base model parameters, and fine-tunes the adapter on a multimodal training data set. The fine-tuned adapter weights are fused with the base model by matrix addition to obtain a complete model, which is compiled and quantized with an MLIR tool chain to generate a model file in a dedicated format for the edge device. An edge inference engine with a shared-memory mechanism is then initialized, the model is loaded, configuration parameters are adapted to the device resources, and real-time image and text data are processed to generate detection results for the object to be detected. The invention greatly reduces GPU memory usage and training cost, compresses the model size, improves inference efficiency on edge devices, preserves data privacy, and is suitable for scenarios such as industrial quality inspection and intelligent monitoring.
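A minimal sketch of the cloud-side fine-tuning setup summarized above, assuming the Hugging Face transformers and peft APIs and Qwen2.5-VL-style module names (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj); the model identifier and LoRA hyper-parameters are illustrative assumptions, not values taken from the patent.

```python
# Illustrative sketch only: load the base model in bfloat16 with gradient
# checkpointing, attach a LoRA adapter to the attention and feed-forward
# projection layers, and keep the base weights frozen.
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

base = AutoModelForVision2Seq.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",      # assumed base model identifier
    torch_dtype=torch.bfloat16,          # torch.bfloat16 mixed-precision load
)
base.gradient_checkpointing_enable()     # gradient checkpointing mechanism

lora_cfg = LoraConfig(
    r=16,                                # assumed rank
    lora_alpha=32,                       # assumed scaling factor
    lora_dropout=0.05,
    target_modules=[                     # attention projections + FFN gate/up/down
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
model = get_peft_model(base, lora_cfg)   # base parameters stay frozen
model.print_trainable_parameters()       # only the LoRA weights are trainable
```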
Inventors
- LI XING
- HE WENSHAN
- XIN HAITAO
- WANG DUO
- ZHANG DAJIAN
- SUN JIAN
- ZHANG YI
Assignees
- 西安长远电子工程有限责任公司 (Xi'an Changyuan Electronic Engineering Co., Ltd.)
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-11-30
Claims (10)
- 1. A cloud-edge collaborative training deployment method for a visual language model, characterized by being applied to an edge device and comprising the following steps: loading a pre-trained visual language base model, wherein the base model uses the torch.bfloat16 mixed-precision format and enables a gradient checkpointing mechanism; configuring a low-rank adaptation (LoRA) adapter whose target modules comprise the projection layers of the attention layer and the gate, up, and down projection layers of the feed-forward network layer, inserting the LoRA adapter into the base model, freezing the base model parameters, and fine-tuning the LoRA adapter parameters with an acquired multimodal training data set to obtain a fine-tuned adapter; merging the weight matrices of the fine-tuned adapter and the base model by matrix addition to obtain a complete model, saving the complete model in the safetensors serialization format (an illustrative merge-and-export sketch follows the claims), and compiling and quantizing the saved complete model with an MLIR tool chain to generate a model file in a dedicated format adapted to the edge device; initializing an edge inference engine that uses a shared-memory mechanism, loading the dedicated-format model file, and configuring a maximum visual-sequence proportion parameter and a KV-cache step parameter according to the memory capacity and computing power of the edge device to obtain an adaptive inference model; and processing image and text data input in real time with the adaptive inference model to generate a detection result for the object to be detected.
- 2. The cloud-edge collaborative training deployment method of the visual language model of claim 1, wherein before loading the pre-trained visual language base model, the method further comprises: reading an annotated multimodal training data set containing image-text paired data on a cloud server, and dividing the multimodal training data set into a training set and a validation set at a ratio of 8:2.
- 3. The cloud-edge collaborative training deployment method of the visual language model of claim 1, wherein the parameters of the low-rank adaptation LoRA adapter comprise: the per-device batch size, the number of gradient accumulation steps, the number of training epochs, and the learning rate, and the number of gradient accumulation steps is set so that the equivalent batch size reaches a preset value.
- 4. The cloud-edge collaborative training deployment method of the visual language model of claim 1, wherein the MLIR tool chain comprises a model conversion module, a graph optimization module, and a quantization module; and compiling and quantizing the complete model in the safetensors serialization format with the MLIR tool chain to generate the model file in the dedicated format adapted to the edge device comprises: converting the complete model in the safetensors serialization format into an MLIR intermediate representation through the model conversion module; performing operator fusion and memory optimization on the MLIR intermediate representation based on the computation graph through the graph optimization module to obtain an optimized model; and performing INT4 quantization on the optimized model through the quantization module and compiling it into the BModel format to obtain the model file adapted to the edge device.
- 5. The cloud-edge collaborative training deployment method of the visual language model according to claim 1, wherein processing the image and text data input in real time with the adaptive inference model to generate and output the detection result of the object to be detected comprises: receiving image and text data input in real time, preprocessing the image and the text data respectively, and correspondingly generating a visual token sequence and a text token sequence; dynamically configuring the visual sequence length according to the memory capacity obtained from the edge inference engine; and concatenating the visual token sequence and the text token sequence, inputting the concatenated sequence into the adaptive inference model, processing the visual sequence with the windowed attention mechanism of the adaptive inference model to obtain historical computation results, reusing the historical computation results through the KV cache mechanism, and performing autoregressive generation on the reused historical computation results to produce the detection result of the object to be detected, wherein the object to be detected comprises industrial equipment or a monitored target in a surveillance image.
- 6. The cloud-edge collaborative training deployment method of the visual language model of claim 5, wherein processing the visual sequence with the windowed attention mechanism of the adaptive inference model comprises: calculating the total number of visual tokens, and when the total exceeds a preset maximum sequence length, scaling each dimension of the image grid so that the adjusted total number of visual tokens does not exceed the memory limit of the edge device, thereby obtaining an adjusted visual token sequence; and dividing the adjusted visual token sequence into a plurality of windows, performing global attention inside each window, and performing sparse attention computation between windows (an illustrative windowed-attention sketch follows the claims).
- 7. The cloud-edge collaborative training deployment method of the visual language model of claim 5, wherein preprocessing the image and text data respectively and correspondingly generating the visual token sequence and the text token sequence comprises: resizing the input image to a preset resolution and processing it with a visual encoder to generate the corresponding visual token sequence; and obtaining the corresponding text token sequence from the prompt tokens corresponding to the text query.
- 8. A cloud-edge collaborative training deployment device for a visual language model, characterized by comprising: a model loading module, configured to load a pre-trained visual language base model, wherein the base model uses the torch.bfloat16 mixed-precision format and enables a gradient checkpointing mechanism; an adapter training module, configured to configure a low-rank adaptation LoRA adapter whose target modules comprise the projection layers of the attention layer and the gate, up, and down projection layers of the feed-forward network layer, insert the LoRA adapter into the base model, freeze the base model parameters, and fine-tune the LoRA adapter parameters with an acquired multimodal training data set to obtain a fine-tuned adapter; a model compiling module, configured to merge the weight matrices of the fine-tuned adapter and the base model by matrix addition to obtain a complete model, save the complete model in the safetensors serialization format, and compile and quantize the saved complete model with an MLIR tool chain to generate a model file in a dedicated format adapted to the edge device; a model adaptation module, configured to initialize an edge inference engine that uses a shared-memory mechanism, load the dedicated-format model file, and configure a maximum visual-sequence proportion parameter and a KV-cache step parameter according to the memory capacity and computing power of the edge device to obtain an adaptive inference model; and a detection module, configured to process image and text data input in real time with the adaptive inference model and generate a detection result for the object to be detected.
- 9. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the cloud-edge collaborative training deployment method of the visual language model of any one of claims 1-7.
- 10. An electronic device, characterized in that the electronic device comprises: at least one processor, a memory, and an input/output unit; the memory is used for storing a computer program, and the processor is used for calling the computer program stored in the memory to execute the cloud-edge collaborative training deployment method of the visual language model according to any one of claims 1-7.
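Referenced from claim 1 above, a minimal sketch of the matrix-addition fusion and safetensors export step, assuming the Hugging Face transformers and peft APIs; the model identifier and adapter path are illustrative assumptions.

```python
# Illustrative sketch only: fuse the fine-tuned LoRA weights into the frozen
# base weights by matrix addition and save the full model as safetensors.
import torch
from transformers import AutoModelForVision2Seq
from peft import PeftModel

base = AutoModelForVision2Seq.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16)  # assumed base model
model = PeftModel.from_pretrained(base, "output/lora_adapter")  # assumed adapter path
model = model.merge_and_unload()     # per layer: W = W0 + (alpha / r) * B @ A
model.save_pretrained("merged_model", safe_serialization=True)  # .safetensors shards
```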
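Referenced from claim 6 above, a minimal, self-contained sketch of the windowed attention idea: scale the visual token grid when it exceeds a preset maximum, then attend globally inside each fixed-size window. The inter-window sparse attention step is omitted, and all shapes, window sizes, and thresholds are illustrative assumptions.

```python
# Illustrative sketch only: cap the visual token count, then run full attention
# inside fixed-size windows over the visual token sequence.
import math
import torch
import torch.nn.functional as F

def fit_visual_grid(h: int, w: int, max_tokens: int) -> tuple[int, int]:
    """Scale each grid dimension so that h * w does not exceed max_tokens."""
    if h * w <= max_tokens:
        return h, w
    scale = math.sqrt(max_tokens / (h * w))
    return max(1, int(h * scale)), max(1, int(w * scale))

def windowed_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                       window: int = 64) -> torch.Tensor:
    """q, k, v: (seq_len, dim) visual token projections; full attention per window."""
    seq_len, dim = q.shape
    pad = (-seq_len) % window                        # pad up to a multiple of window
    if pad:
        q, k, v = (F.pad(t, (0, 0, 0, pad)) for t in (q, k, v))
    n_win = (seq_len + pad) // window
    qw, kw, vw = (t.view(n_win, window, dim) for t in (q, k, v))
    out = F.scaled_dot_product_attention(qw, kw, vw)  # global attention inside windows
    return out.reshape(-1, dim)[:seq_len]

# Example: a 48x48 token grid capped at 1024 tokens, then windowed attention.
h, w = fit_visual_grid(48, 48, max_tokens=1024)
tokens = torch.randn(h * w, 128)
output = windowed_attention(tokens, tokens, tokens)
print(h, w, output.shape)
```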
Description
Cloud edge collaborative training deployment method for visual language model

Technical Field

The application relates to the technical fields of artificial intelligence, edge computing, and multimodal large models, and in particular to a cloud-edge collaborative training deployment method, device, medium, and equipment for a visual language model.

Background

With the rapid development of deep learning technology, large language models (LLMs) represented by GPT, LLaMA, and Qwen have made breakthrough progress in the field of natural language processing. On this basis, visual language models (VLMs) developed by combining computer vision technology, such as CLIP, Flamingo, and Qwen-VL, can understand image and text information simultaneously, show strong capability in tasks such as image description generation, visual question answering, and multimodal content understanding, and have become a research hotspot in the field of artificial intelligence. These models are widely applied in scenarios such as intelligent monitoring, industrial quality inspection, medical image analysis, and autonomous driving, and greatly improve the intelligence of human-machine interaction. However, visual language models typically have a large parameter scale; for example, the Qwen2.5-VL-3B model has about 3 billion parameters and a model file exceeding 6 GB. On the one hand, full fine-tuning must update all model parameters, the GPU memory occupied during training exceeds 60 GB, and training takes several days, which ordinary enterprises and research institutions can hardly afford. On the other hand, such large models usually require high-performance GPU servers, and the cloud deployment mode suffers from high network latency, high bandwidth consumption, and data privacy leakage, making it difficult to meet the real-time, privacy, and low-power requirements of edge scenarios (such as intelligent security, industrial inspection, and mobile terminals). In recent years, edge computing has emerged as a computing paradigm that pushes data processing down to edge devices close to the data source, which can effectively reduce latency, save bandwidth, and protect data privacy. However, a cloud-trained model (such as a safetensors file) must go through complex conversion steps such as ONNX export, quantization and compression, and chip-specific compilation; the technical threshold is high, and precision loss easily occurs.

Disclosure of Invention

The application mainly aims to provide a cloud-edge collaborative training deployment method, device, medium, and equipment for a visual language model, in order to solve the key technical problems of efficiently fine-tuning a large-scale visual language model in the cloud and rapidly deploying it to edge devices, thereby reducing training cost, automating model format conversion, and optimizing edge inference performance.
In order to achieve the above object, the application provides a cloud-edge collaborative training deployment method for a visual language model, applied to an edge device, which includes: loading a pre-trained visual language base model, wherein the base model uses the torch.bfloat16 mixed-precision format and enables a gradient checkpointing mechanism; configuring a low-rank adaptation (LoRA) adapter whose target modules comprise the projection layers of the attention layer and the gate, up, and down projection layers of the feed-forward network layer, inserting the LoRA adapter into the base model, freezing the base model parameters, and fine-tuning the LoRA adapter parameters with an acquired multimodal training data set to obtain a fine-tuned adapter; merging the weight matrices of the fine-tuned adapter and the base model by matrix addition to obtain a complete model, saving the complete model in the safetensors serialization format, and compiling and quantizing the saved complete model with an MLIR tool chain to generate a model file in a dedicated format adapted to the edge device; initializing an edge inference engine that uses a shared-memory mechanism, loading the dedicated-format model file, and configuring a maximum visual-sequence proportion parameter and a KV-cache step parameter according to the memory capacity and computing power of the edge device to obtain an adaptive inference model (see the conceptual configuration sketch below); and processing image and text data input in real time with the adaptive inference model to generate a detection result for the object to be detected. Optionally, before loading the pre-trained visual language base mode
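The device-adaptive configuration step restated above (choosing the maximum visual-sequence proportion and the KV-cache step from the edge device's resources) could look like the following conceptual sketch; every threshold and value here is an illustrative assumption rather than a parameterization given in the patent.

```python
# Illustrative sketch only: map the edge device's free memory to a maximum
# visual-sequence proportion and a KV-cache step before initializing inference.
def adapt_inference_config(free_mem_mb: int, max_context: int = 4096) -> dict:
    if free_mem_mb >= 8192:        # assumed threshold for well-resourced devices
        visual_ratio, kv_step = 0.75, 64
    elif free_mem_mb >= 4096:
        visual_ratio, kv_step = 0.50, 32
    else:                          # tightly constrained devices
        visual_ratio, kv_step = 0.25, 16
    return {
        "max_visual_tokens": int(max_context * visual_ratio),  # visual-sequence cap
        "kv_cache_step": kv_step,                              # KV cache growth step
    }

print(adapt_inference_config(free_mem_mb=4096))
# {'max_visual_tokens': 2048, 'kv_cache_step': 32}
```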