CN-122019009-A - Method, apparatus, device and medium for starting a model inference service instance
Abstract
The disclosure provides a method, an apparatus, a device and a medium for starting a model inference service instance, relates to the technical fields of artificial intelligence and cloud computing, and in particular to inference service scaling technology. The method comprises: determining a first processing unit for running a first service instance to be started, wherein the first service instance is used for executing inference of a target model; establishing a data transmission channel between the first processing unit and a second processing unit that is already running a second service instance for inference of the target model, wherein a storage space corresponding to the second processing unit stores loading information corresponding to the target model, and the loading information comprises at least one of weights of the target model, model structure configuration information of the target model, or compilation cache information of the target model; reading the loading information corresponding to the target model from the storage space corresponding to the second processing unit based on the data transmission channel; and initializing based on the loading information corresponding to the target model to start the first service instance.
Inventors
- WANG HAO
- ZHAO YINGZHUO
- PEI CHAOHAN
- YANG HONGXING
- FANG ZHOU
- XIAO SONG
Assignees
- 北京百度网讯科技有限公司
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-28
Claims (13)
- 1. A method for starting a model inference service instance, comprising: determining a first processing unit for running a first service instance to be started, wherein the first service instance is used for executing inference of a target model; establishing a data transmission channel between the first processing unit and a second processing unit that is already running a second service instance for inference of the target model, wherein a storage space corresponding to the second processing unit stores loading information corresponding to the target model, and the loading information comprises at least one of weights of the target model, model structure configuration information of the target model, or compilation cache information of the target model; reading the loading information corresponding to the target model from the storage space corresponding to the second processing unit based on the data transmission channel; and initializing based on the loading information corresponding to the target model so as to start the first service instance.
- 2. The method of claim 1, wherein the first processing unit comprises a plurality of processors, wherein the establishing a data transmission channel between the first processing unit and the second processing unit that is already running the second service instance for inference of the target model comprises: establishing a data transmission channel between each of the plurality of processors and the second processing unit; and wherein the reading, based on the data transmission channel, the loading information corresponding to the target model from the storage space corresponding to the second processing unit comprises: reading the loading information corresponding to the target model from the storage space corresponding to the second processing unit by using the plurality of processors in parallel.
- 3. The method of claim 2, wherein the data transmission channel comprises a communication connection based on remote direct memory access (RDMA) technology or a communication connection based on an inter-chip high-speed interconnect technology.
- 4. The method of any of claims 1-3, further comprising: in response to determining that the target model comprises a plurality of computational graphs, determining at least one target computational graph from the plurality of computational graphs based on memory footprint specifications of the plurality of computational graphs; acquiring the at least one target computational graph; and allocating GPU memory for the first service instance based on the memory footprint specification corresponding to each of the at least one target computational graph.
- 5. The method of claim 4, further comprising: after the first service instance is started, acquiring the computational graphs of the plurality of computational graphs other than the at least one target computational graph.
- 6. The method of claim 5, wherein the acquiring, after the first service instance is started, the computational graphs of the plurality of computational graphs other than the at least one target computational graph comprises: in response to receiving, after startup, a request to perform inference of the target model with the first service instance, acquiring the computational graphs of the plurality of computational graphs other than the at least one target computational graph.
- 7. The method of any of claims 1-6, wherein the determining a first processing unit for running a first service instance to be started comprises: determining, as the first processing unit, a target processing unit running a daemon instance corresponding to the target model, wherein the daemon instance is obtained by pre-starting an initial service instance for inference of the target model; and wherein the initializing based on the loading information corresponding to the target model comprises: backfilling the loading information corresponding to the target model into the daemon instance to obtain the initialized first service instance.
- 8. The method of claim 7, wherein the daemon instance comprises a main process for performing inference of the target model and a loading information management process corresponding to the target model, the loading information management process being used for performing the operation of reading the loading information corresponding to the target model.
- 9. The method of any of claims 1-8, wherein the initializing based on the loading information corresponding to the target model so as to start the first service instance comprises: after a main process of the first service instance is started, in response to determining that the main process has not created a hardware interaction context for the first processing unit, creating a sub-process of the first service instance for performing model inference by copying context information of the main process.
- 10. An apparatus for starting a model inference service instance, comprising: a determining unit configured to determine a first processing unit for running a first service instance to be started, wherein the first service instance is used for executing inference of a target model; an establishing unit configured to establish a data transmission channel between the first processing unit and a second processing unit that is already running a second service instance for inference of the target model, wherein a storage space corresponding to the second processing unit stores loading information corresponding to the target model, and the loading information comprises at least one of weights of the target model, model structure configuration information of the target model, or compilation cache information of the target model; a reading unit configured to read the loading information corresponding to the target model from the storage space corresponding to the second processing unit based on the data transmission channel; and a starting unit configured to initialize based on the loading information corresponding to the target model so as to start the first service instance.
- 11. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
- 12. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.
- 13. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method according to any of claims 1-9.
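The core mechanism in claims 1 and 3 can be sketched in code: a new service instance reads the target model's loading information (weights, structure configuration, compilation cache) directly from the memory of an already-running peer instance over a fast channel, instead of reloading it from remote storage. The sketch below is a plain-Python simulation under assumed names; a real system would implement `DataChannel` with RDMA or an inter-chip high-speed interconnect, which the claims name but do not implement here.

```python
# Hypothetical simulation of the startup flow in claims 1 and 3.
# All class and function names are illustrative, not from the patent.
from dataclasses import dataclass, field

@dataclass
class LoadingInfo:
    weights: bytes            # model weights
    structure_config: dict    # model structure configuration information
    compile_cache: bytes      # compilation cache information

@dataclass
class ProcessingUnit:
    unit_id: str
    storage: dict = field(default_factory=dict)  # model name -> LoadingInfo

class DataChannel:
    """Stand-in for an RDMA / inter-chip link into a peer unit's memory."""
    def __init__(self, peer: ProcessingUnit):
        self.peer = peer
    def read(self, model: str) -> LoadingInfo:
        # Simulates reading loading information from the peer's storage space.
        return self.peer.storage[model]

def start_instance(first: ProcessingUnit, second: ProcessingUnit, model: str) -> dict:
    channel = DataChannel(second)   # establish channel to the running peer
    info = channel.read(model)      # read loading information over the channel
    first.storage[model] = info     # initialize: first unit now holds the info
    return {"unit": first.unit_id, "model": model, "ready": True}
```

Reading from a peer's device memory avoids the cold-start cost of pulling weights from object storage and recompiling, which is the capacity-expansion speedup the disclosure targets.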
Description
Method, apparatus, device and medium for starting a model inference service instance

Technical Field

The present disclosure relates to the fields of artificial intelligence and cloud computing technologies, in particular to inference service scaling technology, and specifically to a method and apparatus for starting a model inference service instance, an electronic device, a computer-readable storage medium, and a computer program product.

Background

Artificial intelligence is the discipline that studies making computers simulate certain human mental processes and intelligent behaviors (e.g., learning, reasoning, thinking, and planning). Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing. Cloud computing refers to a technical system that accesses an elastically extensible pool of shared physical or virtual resources through a network, where the resources may include servers, operating systems, networks, software, applications, and storage devices, and may be deployed and managed in an on-demand, self-service manner. Cloud computing technology can provide efficient and powerful data processing capability for technical applications such as artificial intelligence and model inference.

With the rapid development of artificial intelligence technology, large language models (LLMs) are widely used in fields such as natural language processing, content generation, and intelligent question answering. To support the operation of large-scale models, it is often necessary to build large-scale model inference clusters, such as cloud computing clusters, deploying and managing model inference services using containerization technology and heterogeneous computing resources.
In such inference clusters, in order to cope with fluctuations in traffic or to implement fault tolerance, a system often needs to dynamically scale out or restart an inference service instance.

The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.

Disclosure of Invention

The present disclosure provides a method, apparatus, electronic device, computer-readable storage medium, and computer program product for starting a model inference service instance.

According to one aspect of the disclosure, a method for starting a model inference service instance is provided, comprising: determining a first processing unit for running a first service instance to be started, wherein the first service instance is used for executing inference of a target model; establishing a data transmission channel between the first processing unit and a second processing unit that is already running a second service instance for inference of the target model, wherein a storage space corresponding to the second processing unit stores loading information corresponding to the target model, and the loading information comprises at least one of weights of the target model, model structure configuration information of the target model, or compilation cache information of the target model; reading the loading information corresponding to the target model from the storage space corresponding to the second processing unit based on the data transmission channel; and initializing based on the loading information corresponding to the target model to start the first service instance.
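The read over the data transmission channel can be accelerated as in claim 2: when the first processing unit has several processors, each one reads a distinct shard of the loading information concurrently, so transfer time scales with the largest shard rather than the whole payload. Below is a minimal sketch under assumed names; `read_shard` stands in for one processor's read of a slice of peer memory, and no claim is made that the patent uses this exact sharding policy.

```python
# Illustrative parallel sharded read (claim 2), simulated with threads.
from concurrent.futures import ThreadPoolExecutor

def read_shard(peer_memory: bytes, start: int, end: int) -> bytes:
    # Stand-in for one processor's channel read of peer memory [start, end).
    return peer_memory[start:end]

def parallel_read(peer_memory: bytes, num_processors: int) -> bytes:
    size = len(peer_memory)
    step = -(-size // num_processors)  # ceiling division: bytes per shard
    bounds = [(i, min(i + step, size)) for i in range(0, size, step)]
    with ThreadPoolExecutor(max_workers=num_processors) as pool:
        # Each shard is fetched concurrently, then reassembled in order.
        shards = list(pool.map(lambda b: read_shard(peer_memory, *b), bounds))
    return b"".join(shards)
```

Mapping shards in submission order keeps reassembly trivial: `pool.map` returns results in input order regardless of which thread finishes first.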
According to another aspect of the disclosure, an apparatus for starting a model inference service instance is provided, comprising: a determining unit configured to determine a first processing unit for running a first service instance to be started, wherein the first service instance is used for executing inference of a target model; an establishing unit configured to establish a data transmission channel between the first processing unit and a second processing unit that is already running a second service instance for inference of the target model, wherein a storage space corresponding to the second processing unit stores loading information corresponding to the target model, and the loading information comprises at least one of weights of the target model, model structure configuration information of the target model, or compilation cache information of the target model; a reading unit configured to read the loading information corresponding to the target model from the storage space corresponding to the second processing unit based on the data transmission channel; and a starting unit configured to initialize based on the loading information corresponding to the target model so as to start the first service instance.
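The two-phase graph loading of claims 4 to 6 can also be sketched: at startup only some computational graphs are acquired and GPU memory is reserved according to their memory footprint specifications, while the remaining graphs are fetched lazily, for example on the first inference request that needs them. The selection policy below (largest graphs first within a budget) is one possible choice for illustration; the claims fix no particular policy, and all names are assumptions.

```python
# Rough simulation of startup-time vs. lazy computational-graph loading
# (claims 4-6). Footprints are abstract units, not real byte counts.
def select_target_graphs(graphs: dict, budget: int) -> list:
    """Pick graphs whose combined footprint fits the startup memory budget,
    trying larger graphs first (an illustrative policy only)."""
    chosen, used = [], 0
    for name, footprint in sorted(graphs.items(), key=lambda kv: -kv[1]):
        if used + footprint <= budget:
            chosen.append(name)
            used += footprint
    return chosen

class Instance:
    def __init__(self, graphs: dict, budget: int):
        self.all_graphs = graphs
        # Phase 1: acquire only the target graphs and reserve their memory.
        self.loaded = set(select_target_graphs(graphs, budget))
    def infer(self, graph: str) -> str:
        # Phase 2: a graph not loaded at startup is fetched on first use.
        if graph not in self.loaded:
            self.loaded.add(graph)
        return f"ran {graph}"
```

Deferring the non-target graphs lets the instance begin serving before every graph is resident, trading a one-time latency hit on the first request that touches a deferred graph for a faster scale-out.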