CN-122027693-A - Inference request scheduling method and computing device
Abstract
The application relates to the field of computing, and in particular to an inference request scheduling method and a computing device. The method comprises: receiving a plurality of inference requests; dividing the plurality of inference requests into independent inference requests and/or at least one shared group, wherein a shared group comprises multiple inference requests with the same prefix and an independent inference request is one not divided into any shared group; selecting a scheduling mode from a prefix-group priority mode and a DP-ratio priority mode according to load characteristics of the plurality of inference requests, wherein the prefix-group priority mode preferentially executes the inference operation on the shared prefix of a shared group, and the DP-ratio priority mode orders the inference objects by DP ratio and then executes their inference operations in sequence; and scheduling the independent inference requests and/or the at least one shared group based on the selected scheduling mode.
Inventors
- YANG HAO
Assignees
- xFusion Digital Technologies Co., Ltd. (超聚变数字技术股份有限公司)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-01-16
Claims (12)
- 1. An inference request scheduling method, comprising: receiving a plurality of inference requests; dividing the plurality of inference requests into independent inference requests and/or at least one shared group, wherein a shared group comprises a plurality of inference requests with the same prefix, an independent inference request is an inference request not divided into any shared group, and the prefix is the first sentence fragment obtained after the input text of an inference request is segmented; selecting a scheduling mode from a prefix-group priority mode and a DP-ratio priority mode according to load characteristics of the plurality of inference requests, wherein the prefix-group priority mode preferentially executes the inference operation on the shared prefix of a shared group, the DP-ratio priority mode orders inference objects by DP ratio and then executes their inference operations in sequence, an inference object is any inference request and/or shared group, and the DP ratio is the ratio of the output text length to the input text length; and scheduling the independent inference requests and/or the at least one shared group based on the scheduling mode.
- 2. The method of claim 1, wherein the load characteristics include a shared-group coverage rate, the shared-group coverage rate being the proportion of the plurality of inference requests that are divided into the at least one shared group, and the selecting a scheduling mode from a prefix-group priority mode and a DP-ratio priority mode according to the load characteristics of the plurality of inference requests comprises: determining the prefix-group priority mode as the scheduling mode when the shared-group coverage rate is greater than or equal to a coverage threshold; and determining the DP-ratio priority mode as the scheduling mode when the shared-group coverage rate is less than the coverage threshold.
- 3. The method according to claim 1 or 2, wherein the scheduling the independent inference requests and/or at least one shared group based on the scheduling mode comprises: when the scheduling mode is the prefix-group priority mode, scheduling the independent inference requests and/or the at least one shared group according to the prefix-group priority mode.
- 4. The method according to claim 3, wherein the scheduling the independent inference requests and/or at least one shared group in the prefix-group priority mode comprises: re-partitioning the independent inference requests and/or the at least one shared group into at least one execution window, wherein an execution window comprises shared groups and/or independent inference requests; and scheduling the shared groups and/or independent inference requests within each execution window, wherein shared groups have a higher priority than independent inference requests.
- 5. The method of claim 4, wherein scheduling the shared groups and/or independent inference requests in any execution window comprises: ordering the inference requests within the shared groups and/or the independent inference requests in the execution window according to the DP-ratio priority mode to obtain an execution order for each inference request within the shared groups and/or each independent inference request; and scheduling the inference requests and/or independent inference requests sequentially according to that execution order.
- 6. The method according to claim 1 or 2, wherein the scheduling the independent inference requests and/or at least one shared group based on the scheduling mode comprises: when the scheduling mode is the DP-ratio priority mode, scheduling the independent inference requests and/or the at least one shared group according to the DP-ratio priority mode.
- 7. The method of claim 6, wherein the scheduling the independent inference requests and/or at least one shared group in the DP-ratio priority mode comprises: dividing the shared groups and/or independent inference requests with the most similar DP ratios into the same execution group, according to the first DP ratio of each of the at least one shared group and/or the DP ratio of each independent inference request, to obtain one or more execution groups; ordering the one or more execution groups according to a second DP ratio of each execution group to obtain an execution order of the execution groups, wherein the second DP ratio of an execution group is calculated as the average of the first DP ratios of the shared groups and/or the DP ratios of the independent inference requests within that group; and scheduling the shared groups and/or independent inference requests within each execution group in turn according to the execution order of the execution groups.
- 8. The method of claim 7, wherein scheduling the shared groups and/or independent inference requests within any execution group comprises: according to the prefix-group priority mode, executing the inference operations of the shared groups in the execution group first, and/or executing the inference operations of the independent inference requests last.
- 9. The method according to any one of claims 1-8, wherein calculating the DP ratio of an inference request comprises: querying a historical request database for a target historical request whose input text has the highest semantic similarity to the input text of the inference request, wherein the historical request database comprises at least one historical request; predicting the output text length of the inference request from the output text length corresponding to the target historical request; and dividing the predicted output text length of the inference request by the input text length of the inference request to obtain the DP ratio of the inference request.
- 10. The method of claim 1, wherein the dividing the plurality of inference requests into independent inference requests and/or at least one shared group comprises: determining the prefix corresponding to each of the plurality of inference requests; dividing the inference requests having the same prefix into the same shared group to obtain the at least one shared group; and determining the inference requests not divided into any shared group as independent inference requests.
- 11. The method of claim 10, wherein the dividing the inference requests having the same prefix into the same shared group to obtain at least one shared group comprises: constructing a shared prefix tree, wherein the shared prefix tree comprises one or more shared prefixes, the one or more shared prefixes being child nodes of the root node of the shared prefix tree; traversing the plurality of inference requests and, for the current inference request, querying the shared prefix tree for a target shared prefix identical to the prefix of that request; inserting the suffix of the current inference request into the shared prefix tree as a child node of the target shared prefix, wherein the suffix is the second sentence fragment obtained after the input text of the inference request is segmented; and dividing the inference requests corresponding to the child nodes under each shared prefix into the same shared group according to the shared prefix tree, to obtain the at least one shared group.
- 12. A computing device comprising a processor and a memory, the memory storing a computer program that the processor invokes to perform the inference request scheduling method of any one of claims 1-11.
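The grouping and mode-selection steps of claims 1, 2, and 10 can be sketched as follows. This is an illustrative approximation, not the patented implementation: the sentence-based `split_prefix` segmentation and the `COVERAGE_THRESHOLD` value are assumptions, since the claims do not fix a concrete segmentation rule or threshold.

```python
# Sketch of prefix-based grouping and coverage-driven mode selection.
# All names (split_prefix, COVERAGE_THRESHOLD, ...) are hypothetical.
from collections import defaultdict

COVERAGE_THRESHOLD = 0.5  # assumed example value; the claims leave it open


def split_prefix(text):
    # Hypothetical segmentation: the prefix is the first sentence fragment,
    # the suffix is the remainder (claims 1 and 11).
    head, sep, tail = text.partition(".")
    return head + sep, tail


def group_requests(requests):
    # Requests sharing a prefix form a shared group; singletons stay independent.
    by_prefix = defaultdict(list)
    for req in requests:
        prefix, _suffix = split_prefix(req)
        by_prefix[prefix].append(req)
    shared_groups = [g for g in by_prefix.values() if len(g) > 1]
    independent = [g[0] for g in by_prefix.values() if len(g) == 1]
    return shared_groups, independent


def select_mode(shared_groups, total):
    # Shared-group coverage rate = requests inside shared groups / all requests.
    covered = sum(len(g) for g in shared_groups)
    coverage = covered / total if total else 0.0
    return "prefix_group_priority" if coverage >= COVERAGE_THRESHOLD else "dp_ratio_priority"


requests = [
    "Summarize the report. Focus on Q3.",
    "Summarize the report. Focus on risks.",
    "Translate this sentence.",
]
groups, independents = group_requests(requests)
mode = select_mode(groups, len(requests))
```

Here two of the three requests share the prefix "Summarize the report.", so the coverage rate is 2/3 and the prefix-group priority mode would be selected.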
Description
Inference request scheduling method and computing device

Technical Field

The present application relates to the field of computing, and in particular to a method and a computing device for scheduling inference requests.

Background

With the rapid development of artificial intelligence technology, online inference services based on large language models are becoming increasingly widespread. Such services are typically implemented by an inference engine: a user initiates an inference request through the inference engine, and a background server processes the request with a trained large language model and feeds the inference result back to the user. To improve efficiency, large language models handle large numbers of inference requests in batches. However, the processing time of different requests varies, so a single excessively slow request can prolong the scheduling time of the entire batch and lower batch-processing efficiency.

Disclosure of Invention

Embodiments of the application provide an inference request scheduling method and a computing device that combine shared-prefix grouping with flexible scheduling-mode selection for batched inference requests, achieving a dual optimization of inference efficiency and resource management.
According to a first aspect of an embodiment of the present application, there is provided an inference request scheduling method, comprising: receiving a plurality of inference requests; dividing the plurality of inference requests into independent inference requests and/or at least one shared group, wherein a shared group comprises a plurality of inference requests with the same prefix and an independent inference request is one not divided into any shared group; selecting a scheduling mode from a prefix-group priority mode and a DP-ratio priority mode according to load characteristics of the plurality of inference requests, wherein the prefix-group priority mode preferentially executes the inference operation on the shared prefix of a shared group, the DP-ratio priority mode orders inference objects by DP ratio and then executes their inference operations in sequence, an inference object is any inference request and/or shared group, and the DP ratio is the ratio of the output text length to the input text length; and scheduling the independent inference requests and/or the at least one shared group based on the scheduling mode. In the embodiment of the application, for a batch of inference requests, the prefix corresponding to each request can be determined, and the requests divided into independent inference requests and/or at least one shared group. Because a shared group comprises multiple inference requests with the same prefix, and an independent inference request is one not assigned to any shared group, a reasonable grouping is achieved.
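The DP-ratio ordering described above can be sketched as follows. This is a simplified illustration: the keyword lookup in `HISTORY` is a toy stand-in for the semantic-similarity search over a historical request database, and the ascending sort direction (so requests expected to produce short outputs run first) is an assumption, since the application does not fix the direction here.

```python
# Sketch of the DP-ratio priority mode: DP ratio = predicted output text
# length / input text length; inference objects are executed in DP-ratio order.
HISTORY = {  # hypothetical historical requests -> observed output lengths
    "summarize": 40,
    "translate": 120,
}


def predict_output_len(input_text):
    # Toy stand-in for the semantic-similarity lookup: match on a keyword
    # instead of a real embedding search over historical requests.
    for key, out_len in HISTORY.items():
        if key in input_text.lower():
            return out_len
    return len(input_text)  # fallback: assume output roughly matches input


def dp_ratio(input_text):
    return predict_output_len(input_text) / max(len(input_text), 1)


def dp_ratio_order(objects):
    # objects: (name, input_text) pairs standing in for requests/shared groups;
    # ascending DP-ratio order is an assumed policy.
    return [name for name, text in sorted(objects, key=lambda o: dp_ratio(o[1]))]


order = dp_ratio_order([
    ("a", "Translate this paragraph"),   # high DP ratio (long expected output)
    ("b", "Summarize the minutes"),      # low DP ratio (short expected output)
])
```

Under these assumptions, request "b" is scheduled before "a" because its predicted output is short relative to its input.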
Grouping the inference requests allows the inference operation on the shared prefix within a shared group to be executed once, reducing repeated computation and improving cache utilization. The choice between the prefix-group priority mode and the DP-ratio priority mode provides a more flexible scheduling scheme. Because the load characteristics of the plurality of inference requests are consulted when selecting the scheduling mode, the selected mode matches the global load while exploiting shared-prefix reuse, ensuring a globally optimal allocation of resources and significantly improving the overall throughput, resource utilization, and response stability of the large-language-model inference service, thereby achieving a dual optimization of inference efficiency and resource management. With reference to the first aspect, in some implementations of the first aspect, the load characteristics include a shared-group coverage rate, the shared-group coverage rate being the proportion of the plurality of inference requests that are divided into the at least one shared group, and selecting a scheduling mode from the prefix-group priority mode and the DP-ratio priority mode according to the load characteristics of the plurality of inference requests comprises: determining the prefix-group priority mode as the scheduling mode when the shared-group coverage rate is greater than or equal to a coverage threshold; and in the case that the