CN-121996399-A - Model loading method, device, system, electronic equipment and storage medium

CN121996399A

Abstract

Embodiments of the application disclose a model loading method, device, electronic device, and storage medium. The method includes: receiving a first inference request and acquiring a low-rank adaptation parameter from the first inference request; determining a target GPU among multiple GPUs according to the low-rank adaptation parameter, wherein a base model and an initial low-rank adaptation model are preloaded on each GPU; and executing the inference task requested by the first inference request through the target GPU. Embodiments of the application can improve the utilization of GPU resources.

Inventors

  • WANG JIANDONG
  • GONG LI
  • HU YI
  • WANG WEI

Assignees

  • 北京罗克维尔斯科技有限公司

Dates

Publication Date
2026-05-08
Application Date
2024-11-07

Claims (16)

  1. A method of model loading, comprising: receiving a first inference request, and acquiring a low-rank adaptation parameter from the first inference request; determining a target GPU among a plurality of GPUs according to the low-rank adaptation parameter, wherein a base model and an initial low-rank adaptation model are preloaded on each GPU; and executing the inference task requested by the first inference request through the target GPU.
  2. The method of claim 1, wherein determining a target GPU among the plurality of GPUs according to the low-rank adaptation parameter comprises: if a target low-rank adaptation model corresponding to the low-rank adaptation parameter exists among the initial low-rank adaptation models on the plurality of GPUs, taking the GPU holding the target low-rank adaptation model as the target GPU.
  3. The method of claim 1, wherein determining a target GPU among the plurality of GPUs according to the low-rank adaptation parameter comprises: if no low-rank adaptation model corresponding to the low-rank adaptation parameter exists among the initial low-rank adaptation models on the plurality of GPUs, randomly selecting one GPU as the target GPU.
  4. The method of claim 1 or 3, wherein executing, by the target GPU, the inference task requested by the first inference request comprises: loading a first low-rank adaptation model corresponding to the low-rank adaptation parameter from a CPU memory into the target GPU, wherein a plurality of low-rank adaptation models are preloaded in the CPU memory; constructing a first inference network comprising the base model and the first low-rank adaptation model; and executing the inference task requested by the first inference request through the first inference network.
  5. The method of claim 4, further comprising, after executing the inference task requested by the first inference request through the first inference network: receiving a second inference request, and, if the low-rank adaptation parameter acquired from the second inference request is the same as the low-rank adaptation parameter in the first inference request, executing the inference task requested by the second inference request through the first inference network.
  6. The method of claim 4, further comprising, before loading the first low-rank adaptation model corresponding to the low-rank adaptation parameter from the CPU memory into the target GPU: if the resources of the target GPU are insufficient to load the first low-rank adaptation model, releasing the initial low-rank adaptation model from the target GPU.
  7. The method of claim 4, further comprising, before receiving the first inference request and acquiring the low-rank adaptation parameter from the first inference request: loading a plurality of low-rank adaptation models into a CPU (Central Processing Unit) memory, loading a base model into the plurality of GPUs, and loading one initial low-rank adaptation model into each GPU respectively.
  8. The method of claim 7, wherein loading the base model into the plurality of GPUs and loading one initial low-rank adaptation model into each GPU comprises: loading, through each worker object, the base model into the GPU corresponding to that worker object, and loading, through each worker object, the initial low-rank adaptation model into the GPU corresponding to that worker object.
  9. A method of model loading, comprising: receiving a first inference request, and acquiring a low-rank adaptation parameter from the first inference request; loading a first low-rank adaptation model corresponding to the low-rank adaptation parameter from a CPU memory into a target GPU among a plurality of GPUs, wherein the plurality of low-rank adaptation models are preloaded in the CPU memory, and a base model is preloaded on each GPU; constructing a first inference network comprising the base model and the first low-rank adaptation model; and executing the inference task requested by the first inference request through the first inference network.
  10. A model loading apparatus, comprising: a first parameter acquisition module, configured to receive a first inference request and acquire a low-rank adaptation parameter from the first inference request; a target GPU determination module, configured to determine a target GPU among a plurality of GPUs according to the low-rank adaptation parameter, wherein a base model and an initial low-rank adaptation model are preloaded on each GPU; and a first task execution module, configured to execute the inference task requested by the first inference request through the target GPU.
  11. A model loading apparatus, comprising: a second parameter acquisition module, configured to receive a first inference request and acquire a low-rank adaptation parameter from the first inference request; a first model switching module, configured to load a first low-rank adaptation model corresponding to the low-rank adaptation parameter from a CPU memory into a target GPU among a plurality of GPUs, wherein the plurality of low-rank adaptation models are preloaded in the CPU memory, and a base model is preloaded on each GPU; an inference network construction module, configured to construct a first inference network comprising the base model and the first low-rank adaptation model; and a third task execution module, configured to execute the inference task requested by the first inference request through the first inference network.
  12. A model loading system, comprising: a CPU, configured to load a plurality of low-rank adaptation models into memory; a dynamic loading module running on the CPU, configured to receive a first inference request, acquire a low-rank adaptation parameter from the first inference request, and load a first low-rank adaptation model corresponding to the low-rank adaptation parameter from the CPU into a target GPU among a plurality of GPUs; and the plurality of GPUs, each configured to load a base model, wherein the target GPU among the GPUs is further configured to construct a first inference network comprising the base model and the first low-rank adaptation model, and to execute the inference task requested by the first inference request through the first inference network.
  13. The system of claim 12, wherein the dynamic loading module comprises a worker manager and a plurality of worker objects, each worker object corresponding to a respective one of the GPUs; the worker manager is configured to schedule the plurality of worker objects, and, when the first inference request is received, to acquire the low-rank adaptation parameter from the first inference request, determine the target GPU among the plurality of GPUs, and call the target worker object corresponding to the target GPU; each worker object is configured to control loading of the base model in the GPU corresponding to that worker object; and the target worker object is further configured to load the first low-rank adaptation model corresponding to the low-rank adaptation parameter from the CPU into the target GPU.
  14. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the model loading method of any one of claims 1 to 8 or the model loading method of claim 9.
  15. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the model loading method of any one of claims 1 to 8 or the steps of the model loading method of claim 9.
  16. A computer program product comprising a computer program or computer instructions which, when executed by a processor, implement the steps of the model loading method of any one of claims 1 to 8 or the steps of the model loading method of claim 9.
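Claims 1 to 3 together describe an adapter-affinity routing rule: send a request to a GPU that already holds the requested low-rank adaptation (LoRA) model, and fall back to a random GPU otherwise. The following is a minimal Python sketch of that rule; the function and variable names (`select_target_gpu`, `gpu_adapters`, the string adapter ids) are illustrative assumptions, not details from the patent:

```python
import random

def select_target_gpu(lora_id, gpu_adapters):
    """Pick a target GPU for a request carrying LoRA parameter `lora_id`.

    `gpu_adapters` maps a GPU index to the set of adapter ids resident on it.
    If some GPU already holds the adapter, reuse that GPU (claim 2);
    otherwise choose a GPU at random (claim 3).
    """
    for gpu_id, adapters in sorted(gpu_adapters.items()):
        if lora_id in adapters:
            return gpu_id                         # adapter already resident
    return random.choice(list(gpu_adapters))      # no match: random fallback

# Example: three GPUs, each preloaded with one initial adapter.
gpus = {0: {"lora-a"}, 1: {"lora-b"}, 2: {"lora-c"}}
print(select_target_gpu("lora-b", gpus))  # → 1 (GPU 1 already holds lora-b)
```

A request whose adapter is not resident anywhere lands on an arbitrary GPU, which is what lets load spread instead of queuing behind one fixed GPU-adapter binding.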

Description

Model loading method, device, system, electronic equipment and storage medium

Technical Field

Embodiments of the application relate to the field of computer technology, and in particular to a model loading method, a model loading device, a model loading system, an electronic device, and a storage medium.

Background

When an online inference service is deployed, a single-machine, single-card loading scheme is generally adopted: the complete inference network formed by a base model and a low-rank adaptation model is statically and fixedly loaded into one GPU (Graphics Processing Unit), occupying a fixed portion of that GPU's memory. Under this scheme, because the low-rank adaptation models targeted by requests are unevenly distributed in a distributed environment, some model services sit idle while others must queue, so the utilization of GPU resources is unbalanced and overall utilization is low.

Disclosure of Invention

Embodiments of the application provide a model loading method, device, system, electronic device, and storage medium, which help improve the utilization of GPU resources and balance their use.

To solve the above problem, in a first aspect, an embodiment of the present application provides a model loading method, including: receiving a first inference request, and acquiring a low-rank adaptation parameter from the first inference request; determining a target GPU among a plurality of GPUs according to the low-rank adaptation parameter, wherein a base model and an initial low-rank adaptation model are preloaded on each GPU; and executing the inference task requested by the first inference request through the target GPU.
In a second aspect, an embodiment of the present application provides a model loading method, including: receiving a first inference request, and acquiring a low-rank adaptation parameter from the first inference request; loading a first low-rank adaptation model corresponding to the low-rank adaptation parameter from a CPU memory into a target GPU among a plurality of GPUs, wherein the plurality of low-rank adaptation models are preloaded in the CPU memory, and a base model is preloaded on each GPU; constructing a first inference network comprising the base model and the first low-rank adaptation model; and executing the inference task requested by the first inference request through the first inference network.

In a third aspect, an embodiment of the present application provides a model loading apparatus, including: a first parameter acquisition module, configured to receive a first inference request and acquire a low-rank adaptation parameter from the first inference request; a target GPU determination module, configured to determine a target GPU among a plurality of GPUs according to the low-rank adaptation parameter, wherein a base model and an initial low-rank adaptation model are preloaded on each GPU; and a first task execution module, configured to execute the inference task requested by the first inference request through the target GPU.
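The dynamic loading step of the second aspect (fetch an adapter from CPU memory into the target GPU, then run the combined base-plus-adapter network), together with the eviction rule of claim 6, can be sketched as follows. `GpuSlot`, `cpu_pool`, and the one-adapter capacity are illustrative assumptions, not details from the patent:

```python
class GpuSlot:
    """Toy stand-in for one GPU: a preloaded base model plus limited
    room for LoRA adapters."""

    def __init__(self, capacity=1):
        self.base = "base-model"
        self.adapters = []        # resident adapter ids, oldest first
        self.capacity = capacity

# Adapters preloaded in (simulated) CPU memory.
cpu_pool = {"lora-a": "weights-a", "lora-b": "weights-b"}

def run_inference(gpu, lora_id, prompt):
    """Ensure `lora_id` is resident on `gpu`, evicting the oldest adapter
    when the GPU is full (the release step of claim 6), then run the
    combined base+adapter inference network on `prompt`."""
    if lora_id not in gpu.adapters:
        if len(gpu.adapters) >= gpu.capacity:
            gpu.adapters.pop(0)           # release an old adapter
        weights = cpu_pool[lora_id]       # fetch adapter from CPU memory
        gpu.adapters.append(lora_id)      # "load" it onto the GPU
    return f"{gpu.base}+{lora_id}({prompt})"

gpu = GpuSlot()
print(run_inference(gpu, "lora-a", "q1"))  # loads lora-a, then runs
print(run_inference(gpu, "lora-a", "q2"))  # reuses the resident network (claim 5)
print(run_inference(gpu, "lora-b", "q3"))  # evicts lora-a, loads lora-b
```

The second call skips the load entirely, which mirrors claim 5: a follow-up request with the same adaptation parameter runs on the already-constructed inference network.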
In a fourth aspect, an embodiment of the present application provides a model loading apparatus, including: a second parameter acquisition module, configured to receive a first inference request and acquire a low-rank adaptation parameter from the first inference request; a first model switching module, configured to load a first low-rank adaptation model corresponding to the low-rank adaptation parameter from a CPU memory into a target GPU among a plurality of GPUs, wherein the plurality of low-rank adaptation models are preloaded in the CPU memory, and a base model is preloaded on each GPU; an inference network construction module, configured to construct a first inference network comprising the base model and the first low-rank adaptation model; and a third task execution module, configured to execute the inference task requested by the first inference request through the first inference network.

In a fifth aspect, an embodiment of the present application further provides a model loading system, including: a CPU, configured to load a plurality of low-rank adaptation models into memory; a dynamic loading module running on the CPU, configured to receive a first inference request, acquire a low-rank adaptation parameter from the first inference request, and load a first low-rank adaptation model corresponding to the low-rank adaptation parameter from the CPU into a target GPU among a plurality of GPUs; and the plurality of GPUs, each configured to load a base model, wherein the target GPU is further configured to construct a first inference network comprising the base model and the first low-rank adaptation model, and to execute the inference task requested by the first inference request through the first inference network.
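The fifth-aspect system (elaborated in claim 13) splits the dynamic loading module into a worker manager that routes requests and one worker object per GPU that performs the actual loading. A hypothetical sketch of that split; all class and method names here are assumptions for illustration:

```python
import random

class Worker:
    """One worker object bound to one GPU (claim 13)."""

    def __init__(self, gpu_id):
        self.gpu_id = gpu_id
        self.resident = set()     # LoRA adapter ids loaded on this GPU
        self.base_loaded = False

    def load_base(self):
        """Control loading of the base model into this worker's GPU."""
        self.base_loaded = True

    def load_adapter(self, lora_id):
        """Copy an adapter from CPU memory into this worker's GPU."""
        self.resident.add(lora_id)

class WorkerManager:
    """Schedules worker objects and routes each request to a target GPU."""

    def __init__(self, n_gpus):
        self.workers = [Worker(i) for i in range(n_gpus)]
        for w in self.workers:
            w.load_base()         # base model preloaded on every GPU

    def dispatch(self, lora_id):
        for w in self.workers:    # prefer a GPU already holding the adapter
            if lora_id in w.resident:
                return w
        w = random.choice(self.workers)   # otherwise pick one at random
        w.load_adapter(lora_id)
        return w

mgr = WorkerManager(n_gpus=2)
first = mgr.dispatch("lora-a")
again = mgr.dispatch("lora-a")
print(first.gpu_id == again.gpu_id)  # → True: repeat requests stick to one GPU
```

Keeping routing state in the manager while each worker touches only its own GPU is what lets one CPU-resident adapter pool serve many GPUs without duplicating every adapter everywhere.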