CN-121996354-A - Large model training and reasoning oriented containerized operation method and computing equipment

CN121996354A

Abstract

The invention discloses a containerized operation method and computing device for large model training and reasoning. The method comprises: receiving a start-training request or a start-reasoning request for a large model, and determining training parameters corresponding to the start-training request or reasoning parameters corresponding to the start-reasoning request; for a start-training request, creating a trainer instance and passing in the training parameters, so that the trainer instance starts a training container based on the training parameters and the training container runs a training program to perform a large model training task; and for a start-reasoning request, creating a reasoner instance and passing in the reasoning parameters, so that the reasoner instance starts a reasoning container based on the reasoning parameters and the reasoning container runs a reasoning program to perform a large model reasoning task. With this method, training and reasoning tasks for large models can be executed efficiently and conveniently in a containerized environment, an embeddable containerized operation capability can be provided to a model development platform, and other platforms can rapidly integrate large model development capability.
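The abstract describes a simple dispatch: a start request carries either training or reasoning parameters, and a matching runner instance is created to launch the corresponding container. The sketch below is illustrative only, not from the patent text; all names (`Trainer`, `Reasoner`, `dispatch_start`) and the parameter key `model_path` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Trainer:
    """Hypothetical trainer instance created for a start-training request."""
    params: dict

    def start(self) -> str:
        # In the patent this would start a training container via a
        # container client; here we return a placeholder container id.
        return f"train:{self.params['model_path']}"


@dataclass
class Reasoner:
    """Hypothetical reasoner instance created for a start-reasoning request."""
    params: dict

    def start(self) -> str:
        return f"infer:{self.params['model_path']}"


def dispatch_start(request: str, params: dict) -> str:
    """Create the matching instance for a start request and start its container."""
    if request == "start_training":
        return Trainer(params).start()
    if request == "start_reasoning":
        return Reasoner(params).start()
    raise ValueError(f"unknown request: {request}")
```

The point of the indirection is that the trainer and reasoner instances expose the same life-cycle surface (start, and later stop and query), so an embedding platform only dispatches on the request type.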

Inventors

  • Chen Jian
  • Qiao Nan
  • Zhai Xiaogeng

Assignees

  • 北京并行科技股份有限公司 (Beijing Paratera Technology Co., Ltd.)
  • 北京北龙超级云计算有限责任公司 (Beijing Beilong Super Cloud Computing Co., Ltd.)

Dates

Publication Date
2026-05-08
Application Date
2026-01-23

Claims (11)

  1. A containerized operation method for large model training and reasoning, executed in a computing device, wherein an operating system of the computing device is adapted to deploy a training container for running a training program and a reasoning container for running a reasoning program, the method comprising: receiving a start-training request or a start-reasoning request for a large model, and determining training parameters corresponding to the start-training request or reasoning parameters corresponding to the start-reasoning request; for the start-training request, creating a trainer instance and passing in the training parameters, so as to start a training container based on the training parameters through the trainer instance and run a training program through the training container, thereby executing a large model training task based on the training program; and for the start-reasoning request, creating a reasoner instance and passing in the reasoning parameters, so as to start a reasoning container based on the reasoning parameters through the reasoner instance and run a reasoning program through the reasoning container, thereby executing a large model reasoning task based on the reasoning program.
  2. The method of claim 1, wherein a container client is deployed on the operating system; starting, by the trainer instance, a training container based on the training parameters comprises: starting, by the trainer instance, a training container via the container client based on the training parameters; and starting, by the reasoner instance, a reasoning container based on the reasoning parameters comprises: starting, by the reasoner instance, a reasoning container via the container client based on the reasoning parameters.
  3. The method of claim 2, wherein starting, by the trainer instance, a training container via the container client based on the training parameters comprises: the trainer instance converts the training parameters into first start parameters of the container client, calls a start-container interface of the container client, and passes in the first start parameters; the container client converts the first start parameters into a corresponding first start string array, and transmits a training container start command and the first start string array to the container client program through an operation interface of a program-calling tool, so that the training container is started through the container client program; and starting, by the reasoner instance, a reasoning container via the container client based on the reasoning parameters comprises: the reasoner instance converts the reasoning parameters into second start parameters of the container client, calls the start-container interface of the container client, and passes in the second start parameters; the container client converts the second start parameters into a corresponding second start string array, and transmits a reasoning container start command and the second start string array to the container client program through the operation interface of the program-calling tool, so that the reasoning container is started through the container client program.
  4. The method of any one of claims 1 to 3, further comprising: after the training container is started based on the training parameters through the trainer instance, acquiring a training container number of the training container through the trainer instance; and after the reasoning container is started based on the reasoning parameters through the reasoner instance, acquiring a reasoning container number of the reasoning container through the reasoner instance.
  5. The method of any one of claims 1 to 4, further comprising: receiving a stop-training request or a stop-reasoning request for a large model, and acquiring a training container number corresponding to the stop-training request or a reasoning container number corresponding to the stop-reasoning request; for the stop-training request, creating a trainer instance, and stopping the corresponding training container through the container client based on the training container number by the trainer instance, so as to stop the training program; and for the stop-reasoning request, creating a reasoner instance, and stopping the corresponding reasoning container through the container client based on the reasoning container number by the reasoner instance, so as to stop the reasoning program.
  6. The method of claim 5, wherein stopping, by the trainer instance, the corresponding training container via the container client based on the training container number comprises: the trainer instance calls a stop-container interface of the container client and passes in the training container number; the container client transmits a training container stop command corresponding to the training container number to the container client program through the operation interface of the program-calling tool, so as to stop the training container corresponding to the training container number through the container client program; and stopping, by the reasoner instance, the corresponding reasoning container via the container client based on the reasoning container number comprises: the reasoner instance calls the stop-container interface of the container client and passes in the reasoning container number; the container client transmits a reasoning container stop command corresponding to the reasoning container number to the container client program through the operation interface of the program-calling tool, so as to stop the reasoning container corresponding to the reasoning container number through the container client program.
  7. The method of any one of claims 1 to 6, further comprising: receiving a training query request or a reasoning query request for a large model, and acquiring a training container number corresponding to the training query request or a reasoning container number corresponding to the reasoning query request, wherein the training query request comprises at least one of a training state query request, a training log query request, a training progress query request, and a training checkpoint query request, and the reasoning query request comprises at least one of a reasoning state query request, a reasoning log query request, a reasoning progress query request, and a reasoning result query request; for the training query request, creating a trainer instance, and acquiring a training state value, a training log, training progress information, or a training checkpoint path of the corresponding training container based on the training container number through the trainer instance; and for the reasoning query request, creating a reasoner instance, and acquiring a reasoning state value, a reasoning log, reasoning progress information, or a reasoning result of the corresponding reasoning container based on the reasoning container number through the reasoner instance.
  8. The method of any one of claims 1 to 7, wherein the training parameters comprise one or more of an image name, a large model path, a dataset path, and a training program parameter corresponding to the training container; and the reasoning parameters comprise one or more of an image name, a large model path, a dataset path, a reasoning program path, and a reasoning program parameter corresponding to the reasoning container.
  9. The method of any one of claims 1 to 8, wherein a Web service adapted to perform the method is deployed on the operating system of the computing device, and receiving a start-training request or a start-reasoning request for a large model comprises: the Web service receives a start-training request or a start-reasoning request for a large model sent by clicking a start-training button or a start-reasoning button on an interface of a Web client.
  10. A computing device, comprising: at least one processor; and a memory storing program instructions, wherein the program instructions are configured to be executed by the at least one processor, the program instructions comprising instructions for performing the method of any one of claims 1 to 9.
  11. A computer program product comprising computer program instructions which, when executed by a processor, implement the method of any one of claims 1 to 9.
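Claims 2 to 4 describe a two-step conversion: structured start parameters become container-client start parameters, which are flattened into a string array and handed as a start command to the container client program, whose output yields the container number. A minimal sketch under stated assumptions: Docker-style CLI flags, and illustrative parameter keys (`image`, `mounts`, `program_args`) that are not named in the patent.

```python
import subprocess


def build_run_command(params: dict) -> list[str]:
    """Flatten start parameters into a 'docker run' string array.

    The keys used here are illustrative assumptions; "mirror name" in the
    machine-translated claim text is the container image name.
    """
    argv = ["docker", "run", "-d"]
    # Mount the model and dataset paths into the container.
    for host_path, container_path in params.get("mounts", {}).items():
        argv += ["-v", f"{host_path}:{container_path}"]
    argv.append(params["image"])
    # Trailing arguments become the training/reasoning program invocation.
    argv += params.get("program_args", [])
    return argv


def start_container(params: dict) -> str:
    """Hand the string array to the container client program and return the
    container id it prints (the 'container number' of claim 4)."""
    argv = build_run_command(params)
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Keeping the string-array construction separate from the process invocation mirrors the claimed split between the container client (parameter conversion) and the container client program (actual execution).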
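Claims 5 to 7 manage the rest of the container life cycle using the container number alone: stop commands and state or log queries are again issued through the container client's command interface. A hedged sketch of the command construction, assuming Docker CLI semantics (the patent does not name a specific container runtime, and the progress, checkpoint-path, and result queries, which would read files produced inside the container, are omitted):

```python
def stop_command(container_id: str) -> list[str]:
    """Stop command for the container with the given number (claims 5-6)."""
    return ["docker", "stop", container_id]


def query_command(container_id: str, what: str) -> list[str]:
    """Query commands for the state-value and log queries of claim 7."""
    if what == "state":
        # Prints e.g. "running" or "exited" for the given container.
        return ["docker", "inspect", "--format", "{{.State.Status}}", container_id]
    if what == "logs":
        return ["docker", "logs", container_id]
    raise ValueError(f"unsupported query: {what}")
```

Because every post-start operation is keyed by the container number, an embedding platform only needs to persist that one identifier per task to drive the whole life cycle.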

Description

Large model training and reasoning oriented containerized operation method and computing equipment

Technical Field

The invention relates to the technical field of artificial intelligence model development, in particular to a containerized operation method and computing device for large model training and reasoning.

Background

With the rapid development of artificial intelligence technology, model training and reasoning tasks depend on increasingly complex running environments. Different projects often require different versions of Python, deep learning frameworks (e.g., PyTorch, TensorFlow), and their dependent libraries. Traditional deployment suffers from environment conflicts, complex configuration, and poor reproducibility. Although container technologies such as Docker can isolate environments, ordinary users still face a high barrier to use in building images, configuring startup parameters, mounting data volumes, and managing container life cycles. Existing platforms (such as Kubeflow and MLflow) support containerized tasks, but their interfaces are complex and tightly coupled, making them difficult for other systems to embed as a lightweight base component. In view of this, there is a need for a standardized, low-coupling, embeddable containerized operation method oriented to large model training and reasoning that solves the above problems.

Disclosure of Invention

To this end, the present invention provides a containerized operation method oriented to large model training and reasoning to solve, or at least alleviate, the above problems.
According to one aspect of the invention, a containerized operation method for large model training and reasoning is provided, executed in a computing device whose operating system is adapted to deploy a training container for running a training program and a reasoning container for running a reasoning program. The method comprises: receiving a start-training request or a start-reasoning request for a large model, and determining training parameters corresponding to the start-training request or reasoning parameters corresponding to the start-reasoning request; for the start-training request, creating a trainer instance and passing in the training parameters, so as to start a training container based on the training parameters through the trainer instance and run a training program through the training container, thereby executing a large model training task based on the training program; and for the start-reasoning request, creating a reasoner instance and passing in the reasoning parameters, so as to start a reasoning container based on the reasoning parameters through the reasoner instance and run a reasoning program through the reasoning container, thereby executing a large model reasoning task based on the reasoning program. Optionally, in the containerized operation method oriented to large model training and reasoning, a container client is deployed on the operating system; the training container is started by the trainer instance via the container client based on the training parameters, and the reasoning container is started by the reasoner instance via the container client based on the reasoning parameters.
Optionally, in the containerized operation method oriented to large model training and reasoning, starting a training container via the container client based on the training parameters through the trainer instance comprises: the trainer instance converts the training parameters into first start parameters of the container client, calls a start-container interface of the container client, and passes in the first start parameters; the container client converts the first start parameters into a corresponding first start string array, and transmits a training container start command and the first start string array to the container client program through an operation interface of a program-calling tool, so that the training container is started through the container client program. Starting a reasoning container via the container client based on the reasoning parameters through the reasoner instance comprises: the reasoner instance converts the reasoning parameters into second start parameters of the container client, calls the start-container interface of the container client, and passes in the second start parameters; the container client converts the second start parameters into a corresponding second start string array, and transmits a reasoning container start command and the second start string array to the container client program through the operation interface of the program-calling tool, so that the reasoning container is started through the container client program.