CN-122019114-A - Dense LLM reasoning parallelization configuration decision method and system based on machine learning

CN122019114A

Abstract

The invention discloses a machine learning-based dense LLM reasoning parallelization configuration decision method and system. The method receives a query request containing the input length and output length of a dense large language model to be configured, enumerates all feasible parallelization strategies of the target model, inputs them into a trained hybrid prediction model, and predicts the throughput of each parallelization strategy under the constraints of the query request. Each parallelization strategy comprises a tensor parallelism degree, a pipeline parallelism degree, and a GPU count; the optimal strategy is selected as the parallelization strategy for target model reasoning. By learning behavioral rules from real system operation data, the invention avoids complex and inaccurate analytical modeling, supports intelligent decision making under varying resource constraints, and outputs an optimal parallelization strategy for dense large language model reasoning.

Inventors

  • QIAN XIAOYAN
  • HAN LEI
  • XIAO FU
  • SUN PEIJIE
  • HU YIXIAO
  • GU YAN
  • WEI SHUWEN
  • ZHU XIYAO

Assignees

  • Nanjing University of Aeronautics and Astronautics (南京航空航天大学)
  • Nanjing University of Posts and Telecommunications (南京邮电大学)

Dates

Publication Date
2026-05-12
Application Date
2026-04-14

Claims (10)

  1. A machine learning-based dense LLM reasoning parallelization configuration decision method, characterized by comprising the following steps: receiving a query request of a user, wherein the query request comprises an input length and an output length of a target model, and the target model is a dense large language model to be configured; enumerating all feasible parallelization strategies of the target model, inputting the parallelization strategies into a trained hybrid prediction model, and predicting the throughput of each parallelization strategy under the constraints of the query request, wherein each parallelization strategy comprises a tensor parallelism degree, a pipeline parallelism degree, and a GPU count; and selecting an optimal parallelization strategy as the parallelization strategy for target model reasoning according to the throughput prediction results.
  2. The machine learning-based dense LLM reasoning parallelization configuration decision method of claim 1, wherein the hybrid prediction model comprises a first sub-model and a second sub-model; the first sub-model is a regression model used to obtain a first predicted throughput from the parallelization strategy; the second sub-model is a decision tree model used to predict the residual of the first predicted throughput, yielding a first predicted residual; and the first predicted throughput and the first predicted residual are added to obtain the throughput prediction result.
  3. The machine learning-based dense LLM reasoning parallelization configuration decision method of claim 2, wherein the training method of the hybrid prediction model comprises: acquiring a plurality of configuration combinations of a sample model and the actual running throughput under each configuration combination, wherein each configuration combination comprises an input length, an output length, a tensor parallelism degree, a pipeline parallelism degree, and a GPU count, and the sample model comprises a plurality of dense large language models; inputting a feature polynomial constructed from the configuration combination into the first sub-model to train it, the first sub-model outputting a second predicted throughput; and inputting the feature polynomial constructed from the configuration combination into the second sub-model, training the second sub-model with the difference between the second predicted throughput and the actual running throughput as its target, the second sub-model outputting a second predicted residual; the second predicted throughput and the second predicted residual are added to obtain the throughput prediction result.
  4. The machine learning-based dense LLM reasoning parallelization configuration decision method of claim 3, wherein acquiring a plurality of configuration combinations of the sample model and the actual running throughput under each configuration combination comprises: dividing the sample model according to the tensor parallelism degree and pipeline parallelism degree in the configuration combination to obtain model shards; provisioning GPUs according to the GPU count in the configuration combination and loading the model shards; constructing a random token-ID sequence according to the input length and output length in the configuration combination, executing autoregressive generation, and recording the total number of tokens and the total token-computation time; and calculating the actual running throughput from the total number of tokens and the total token-computation time.
  5. The machine learning-based dense LLM reasoning parallelization configuration decision method of claim 3, wherein the first sub-model is a Ridge regression model and the second sub-model is a gradient boosting decision tree model.
  6. The machine learning-based dense LLM reasoning parallelization configuration decision method of claim 1, wherein the query request further comprises a maximum GPU count; a feasible parallelization strategy satisfies that the product of the tensor parallelism degree and the pipeline parallelism degree is not greater than the maximum GPU count, and the tensor parallelism degree and the pipeline parallelism degree are each a power of 2.
  7. The machine learning-based dense LLM reasoning parallelization configuration decision method of claim 1 or 6, wherein the query request further comprises a minimum throughput; when the query request contains a minimum throughput requirement, the parallelization strategy with the smallest GPU count is selected, from among the strategies whose throughput prediction results meet the minimum throughput requirement, as the optimal parallelization strategy; otherwise, the parallelization strategy corresponding to the maximum predicted throughput is selected as the optimal parallelization strategy.
  8. A machine learning-based dense LLM reasoning parallelization configuration decision system, comprising: a request receiving unit for receiving a query request of a user, wherein the query request comprises an input length and an output length of a target model; and a parallelization strategy decision unit for enumerating all feasible parallelization strategies of the target model, inputting them into a trained hybrid prediction model, predicting the throughput of each parallelization strategy under the constraints of the query request, wherein each parallelization strategy comprises a tensor parallelism degree, a pipeline parallelism degree, and a GPU count, and selecting the optimal parallelization strategy as the parallelization strategy for target model reasoning according to the throughput prediction results.
  9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the computer program, when loaded into the processor, implements the machine learning-based dense LLM reasoning parallelization configuration decision method according to any of claims 1-7.
  10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the machine learning-based dense LLM reasoning parallelization configuration decision method according to any of claims 1-7.
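The strategy-space and selection rules of claims 6 and 7 can be sketched directly; the function and field names below are illustrative choices, not part of the patent:

```python
from itertools import product

def enumerate_strategies(max_gpus):
    """Feasible (TP, PP, GPU-count) triples: both degrees are powers of 2
    and their product does not exceed the maximum GPU count (claim 6)."""
    powers = [1 << i for i in range(max_gpus.bit_length())]
    return [(tp, pp, tp * pp) for tp, pp in product(powers, powers)
            if tp * pp <= max_gpus]

def select_optimal(predictions, min_throughput=None):
    """predictions: list of (tp, pp, gpus, predicted_throughput) tuples.
    With a minimum-throughput requirement, pick the feasible strategy that
    uses the fewest GPUs; otherwise maximize predicted throughput (claim 7)."""
    if min_throughput is not None:
        feasible = [p for p in predictions if p[3] >= min_throughput]
        return min(feasible, key=lambda p: p[2]) if feasible else None
    return max(predictions, key=lambda p: p[3])
```

With `max_gpus=8`, for example, the enumeration yields ten candidates, from (1, 1, 1) up to (8, 1, 8), so the search space stays small enough for exhaustive prediction.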

Description

Dense LLM reasoning parallelization configuration decision method and system based on machine learning

Technical Field

The invention relates to the fields of distributed computing and machine learning, and in particular to a machine learning-based dense LLM reasoning parallelization configuration decision method and system.

Background

With the explosive growth of large language models (Large Language Model, LLM) and their computing demands, LLM reasoning services have become a core application consuming enormous computing resources. To meet low-latency, high-throughput reasoning requirements, distributed parallel techniques, in particular tensor parallelism (Tensor Parallelism, TP) and pipeline parallelism (Pipeline Parallelism, PP), have become the standard paradigm for deploying LLM reasoning services. However, in practical production environments, deciding the parallelization strategy for a specific LLM is a very challenging problem. Currently, two main types of methods exist in industry and academia: (1) Simulation-based search systems, which build a fine-grained cost model and enumerate or heuristically search the strategy space. The cost model, however, is often based on simplifying assumptions and struggles to adapt to the memory management and scheduling strategies specific to different reasoning engines, so its decisions deviate from real-system behavior; moreover, the online search overhead may be large. (2) Rule-based expert experience or grid search, in which engineers determine a set of "relatively good" fixed configurations empirically or through small-scale testing. This approach cannot adapt to different workloads; finding the globally optimal solution requires a large number of time-consuming real benchmark tests, the trial-and-error cost is high, and the approach does not scale.
Disclosure of Invention

The invention aims to provide a machine learning-based dense LLM reasoning parallelization configuration decision method and system capable of accurately, efficiently, and automatically recommending the optimal parallelization strategy and hardware resource configuration for dense large language model reasoning services.

The invention discloses a machine learning-based dense LLM reasoning parallelization configuration decision method, comprising the following steps: receiving a query request of a user, wherein the query request comprises an input length and an output length of a target model, and the target model is a dense large language model to be configured; enumerating all feasible parallelization strategies of the target model, inputting them into a trained hybrid prediction model, and predicting the throughput of each parallelization strategy under the constraints of the query request, wherein each parallelization strategy comprises a tensor parallelism degree, a pipeline parallelism degree, and a GPU count; and selecting an optimal parallelization strategy as the parallelization strategy for target model reasoning according to the throughput prediction results.

Further, the hybrid prediction model includes a first sub-model and a second sub-model. The first sub-model is a regression model used to obtain a first predicted throughput from the parallelization strategy; the second sub-model is a decision tree model used to predict the residual of the first predicted throughput, yielding a first predicted residual; the first predicted throughput and the first predicted residual are added to obtain the throughput prediction result.
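The two-stage predictor described above (a Ridge regression first sub-model over a feature polynomial, corrected by gradient-boosting trees fitted to its residuals, per claims 2, 3, and 5) can be sketched with scikit-learn; the library choice, the synthetic training data, and all names below are illustrative assumptions, not from the patent:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
# Columns: input length, output length, TP degree, PP degree, GPU count.
X = rng.uniform([128, 128, 1, 1, 1], [4096, 1024, 8, 8, 8], size=(200, 5))
# Stand-in for measured throughput (tokens/s); real data comes from benchmarks.
y = 1e5 / (X[:, 0] + X[:, 1]) * X[:, 4] + rng.normal(0, 1, 200)

# First sub-model: Ridge regression on a feature polynomial of the configuration.
poly = PolynomialFeatures(degree=2, include_bias=False)
Xp = poly.fit_transform(X)
ridge = Ridge(alpha=1.0).fit(Xp, y)

# Second sub-model: gradient-boosting trees trained on the Ridge residuals.
residuals = y - ridge.predict(Xp)
gbdt = GradientBoostingRegressor(random_state=0).fit(Xp, residuals)

def predict_throughput(config):
    """Hybrid prediction: Ridge estimate plus GBDT residual correction."""
    f = poly.transform(np.asarray(config, dtype=float).reshape(1, -1))
    return float(ridge.predict(f)[0] + gbdt.predict(f)[0])
```

The residual-correction stage lets the smooth regression capture the broad throughput trend while the trees absorb engine-specific nonlinearities the polynomial cannot express.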
Further, the training method of the hybrid prediction model comprises the following steps: acquiring a plurality of configuration combinations of a sample model and the actual running throughput under each configuration combination, wherein each configuration combination comprises an input length, an output length, a tensor parallelism degree, a pipeline parallelism degree, and a GPU count, and the sample model comprises a plurality of dense large language models; inputting a feature polynomial constructed from the configuration combination into the first sub-model to train it, the first sub-model outputting a second predicted throughput; and inputting the feature polynomial constructed from the configuration combination into the second sub-model, training the second sub-model with the difference between the second predicted throughput and the actual running throughput as its target, the second sub-model outputting a second predicted residual; the second predicted throughput and the second predicted residual are added to obtain the throughput prediction result.

Further, the method for acquiring the actual running throughput comprises: acquiring a plurality of configuration combinations of the sample model and the actual running throughput under each configuration combination includes: dividing the sample model according to tensor parallelism and pipeline
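The throughput measurement this training data relies on (claim 4) reduces to the total number of generated tokens divided by total generation time; a minimal harness sketch, where the `generate_step` stub stands in for one decode step of a real model partitioned by TP/PP:

```python
import time

def measure_throughput(generate_step, prompt_len, output_len, batch=1):
    """Run `output_len` autoregressive steps and return tokens per second.
    `prompt_len` would size the random token-ID prefill sequence in a real
    benchmark; it is unused by this stub."""
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(output_len):
        generate_step()          # one token per sequence in the batch
        total_tokens += batch
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed

# Example with a dummy step; real use would bind the loaded model shards.
tput = measure_throughput(lambda: None, prompt_len=128, output_len=200)
```

Each (configuration, measured throughput) pair from runs like this becomes one training sample for the hybrid predictor.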