
CN-121996390-A - Model inference scheduling optimization method, system, terminal and storage medium

CN 121996390 A

Abstract

The application discloses a model inference scheduling optimization method, system, terminal and storage medium, relating to the technical field of model inference scheduling. The method performs scheduling control on a target hardware platform that carries the inference computation process of a preset model, the preset model being based on an encoder-decoder architecture. The method includes: determining a current inference stage from a plurality of preset inference stages corresponding to the preset model; and, according to the current inference stage, obtaining a target scheduling policy corresponding to that stage and controlling the target hardware platform to execute the inference computation process corresponding to the current inference stage according to the policy, wherein the target scheduling policy controls the data access mode of the target hardware platform. In this way, the inference computation efficiency of the model can be improved.

Inventors

  • Li Ang
  • Yu Hao
  • Peng Haoxiang
  • Liu Hang

Assignees

  • Southern University of Science and Technology (南方科技大学)

Dates

Publication Date
2026-05-08
Application Date
2026-04-07

Claims (10)

  1. A model inference scheduling optimization method, characterized in that the method is used for performing scheduling control on a target hardware platform, the target hardware platform carries an inference computation process of a preset model, and the preset model is a model based on an encoder-decoder architecture, the method comprising: obtaining inference-stage judgment data from the target hardware platform, wherein the inference-stage judgment data comprises at least one of model input data used by the preset model for inference computation at the current moment, indication information of cross-attention key-value data corresponding to the preset model, and an inference-stage status signal corresponding to the preset model; determining a current inference stage from a plurality of preset inference stages corresponding to the preset model according to the inference-stage judgment data; and, according to the current inference stage, obtaining a target scheduling policy corresponding to the current inference stage and controlling the target hardware platform to execute the inference computation process corresponding to the current inference stage according to the target scheduling policy, wherein the target scheduling policy is used for controlling a data access mode of the target hardware platform.
  2. The model inference scheduling optimization method according to claim 1, wherein the plurality of preset inference stages corresponding to the preset model comprise an encoding stage, a cross-attention data initialization stage, and a decoding stage.
  3. The model inference scheduling optimization method according to claim 2, wherein determining the current inference stage from the plurality of preset inference stages corresponding to the preset model according to the inference-stage judgment data comprises: if the data dimension of the model input data matches a preset first dimension, determining that the current inference stage is the encoding stage; if the data dimension of the model input data matches a preset second dimension and the indication information indicates that the cross-attention key-value data does not exist, determining that the current inference stage is the cross-attention data initialization stage; and if the inference-stage status signal indicates that a cross-attention-buffer-initialization-completed state has been entered, determining that the current inference stage is the decoding stage.
  4. The model inference scheduling optimization method according to claim 2, wherein obtaining, according to the current inference stage, the target scheduling policy corresponding to the preset model comprises: if the current inference stage is the encoding stage, taking a preset throughput-priority policy as the target scheduling policy corresponding to the preset model; if the current inference stage is the cross-attention data initialization stage, taking a preset memory-priority policy as the target scheduling policy corresponding to the preset model; and if the current inference stage is the decoding stage, taking a preset latency-priority policy as the target scheduling policy corresponding to the preset model.
  5. The model inference scheduling optimization method according to claim 4, wherein the target hardware platform comprises a high-bandwidth memory for storing data and a chip for executing the computation process, the high-bandwidth memory is communicatively connected to the chip, and an input data storage area for storing the model input data is arranged in the high-bandwidth memory; if the target scheduling policy is the throughput-priority policy, controlling the target hardware platform to execute the inference computation process corresponding to the current inference stage according to the target scheduling policy comprises: taking the model input data stored in the input data storage area of the high-bandwidth memory as the input data for the chip, controlling the chip to execute the inference computation process corresponding to the encoding stage, and overwriting the encoding-stage output data computed by the chip into the input data storage area.
  6. The model inference scheduling optimization method according to claim 5, wherein the chip comprises a computation module for executing the computation process and an on-chip cache for caching data; the target scheduling policy is further used for indicating a cache-region division rule corresponding to the on-chip cache and for controlling the target hardware platform to dynamically divide the on-chip cache into cache regions for caching different data according to the cache-region division rule.
  7. The model inference scheduling optimization method according to any one of claims 1 to 6, wherein the target hardware platform is a hardware platform based on a field-programmable gate array, and the preset model is a speech recognition model.
  8. A model inference scheduling optimization system, characterized by comprising a central processing unit and a target hardware platform that are communicatively connected, wherein the target hardware platform carries an inference computation process of a preset model, and the preset model is a model based on an encoder-decoder architecture; the central processing unit is used for performing scheduling control on the target hardware platform according to the model inference scheduling optimization method of any one of claims 1 to 7.
  9. A terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the model inference scheduling optimization method of any one of claims 1 to 7.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the model inference scheduling optimization method according to any one of claims 1 to 7.

Description

Model inference scheduling optimization method, system, terminal and storage medium

Technical Field

The application relates to the technical field of model inference scheduling, and in particular to a model inference scheduling optimization method, system, terminal and storage medium.

Background

With the development of science and technology, in particular deep learning, various models are increasingly widely used, for example models based on an encoder-decoder (codec) architecture. In some application scenarios, a model may be deployed on a hardware platform, with the inference computation of the model implemented by the computing power of that platform. In the prior art, a model is generally deployed directly on a hardware platform, the entire inference computation process of the model is treated as a single whole, a fixed data access mode is preset for the model, and that same data access mode is used throughout the entire inference computation process. The problem with the prior art is that adopting a fixed data access mode for the entire inference computation process is not conducive to improving the inference computation efficiency of the model. Accordingly, the related art still needs to be improved and developed.

Disclosure of Invention

The application mainly aims to provide a model inference scheduling optimization method, system, terminal and storage medium, so as to solve the technical problem in the related art that adopting a fixed data access mode for the entire model inference computation process does not improve the model inference computation efficiency. To achieve the above object, a first aspect of the application provides a model inference scheduling optimization method, where the method is used for performing scheduling control on a target hardware platform, the target hardware platform carries an inference computation process of a preset model, and the preset model is a model based on an encoder-decoder architecture. The method includes: obtaining inference-stage judgment data from the target hardware platform, wherein the inference-stage judgment data comprises at least one of model input data used by the preset model for inference computation at the current moment, indication information of cross-attention key-value data corresponding to the preset model, and an inference-stage status signal corresponding to the preset model; determining a current inference stage from a plurality of preset inference stages corresponding to the preset model according to the inference-stage judgment data; and, according to the current inference stage, obtaining a target scheduling policy corresponding to the current inference stage and controlling the target hardware platform to execute the inference computation process corresponding to the current inference stage according to the target scheduling policy, wherein the target scheduling policy is used for controlling a data access mode of the target hardware platform.

Optionally, the plurality of preset inference stages corresponding to the preset model include an encoding stage, a cross-attention data initialization stage and a decoding stage.
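A minimal sketch of this control flow, assuming hypothetical platform methods, field names and helper functions chosen only for readability (none of them are taken from the application), might look like the following Python:

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class StageJudgmentData:
        # The three kinds of inference-stage judgment data named in the first aspect;
        # the field names are illustrative assumptions.
        model_input_shape: tuple      # model input data used at the current moment
        has_cross_attention_kv: bool  # indication information for cross-attention key-value data
        status_signal: str            # inference-stage status signal

    def schedule_step(platform,
                      determine_stage: Callable[[StageJudgmentData], str],
                      policy_for_stage: Callable[[str], dict]) -> None:
        """One scheduling-control iteration: read judgment data from the target
        hardware platform, decide the current preset inference stage, fetch the
        matching scheduling policy, and run that stage's computation under the
        policy's data access mode."""
        judgment = platform.read_stage_judgment_data()   # hypothetical platform API
        stage = determine_stage(judgment)
        policy = policy_for_stage(stage)
        platform.configure_data_access(policy)           # the policy controls the data access mode
        platform.run_inference_stage(stage)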
Optionally, determining the current inference stage from the plurality of preset inference stages corresponding to the preset model according to the inference-stage judgment data includes: if the data dimension of the model input data matches a preset first dimension, determining that the current inference stage is the encoding stage; if the data dimension of the model input data matches a preset second dimension and the indication information indicates that the cross-attention key-value data does not exist, determining that the current inference stage is the cross-attention data initialization stage; and if the inference-stage status signal indicates that the cross-attention-buffer-initialization-completed state has been entered, determining that the current inference stage is the decoding stage.

Optionally, obtaining, according to the current inference stage, the target scheduling policy corresponding to the preset model includes: if the current inference stage is the encoding stage, taking a preset throughput-priority policy as the target scheduling policy corresponding to the preset model; if the current inference stage is the cross-attention data initialization stage, taking a preset memory-priority policy as the target scheduling policy corresponding to the preset model; and if the current inference stage is the decoding stage, taking a preset latency-priority policy as the target scheduling policy corresponding to the preset model.

Optionally, the target hardware platform includes a high-bandwidth memory for storing data and a chip for executing the computation process, where the high-bandwidth memory is communicatively connected to the chip, and an input data storage area for storing the model input data is arranged in the high-bandwidth memory.
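The optional stage-determination rules, the stage-to-policy mapping, and the throughput-priority reuse of the input data storage area (claim 5) can be pictured with a similar minimal sketch; the concrete dimensions, the status-signal string, the cache-split proportions standing in for claim 6's cache-region division rule, and the hbm/chip interfaces are all assumptions made for illustration only.

    from enum import Enum, auto

    class Stage(Enum):
        ENCODING = auto()
        CROSS_ATTENTION_INIT = auto()
        DECODING = auto()

    # Assumed "preset dimensions": an encoder pass sees a full feature sequence,
    # while a decoder-side pass sees one token per step.  Real values depend on the model.
    ENCODER_INPUT_NDIM = 3   # e.g. (batch, frames, features) -> preset first dimension
    DECODER_INPUT_NDIM = 2   # e.g. (batch, 1)                -> preset second dimension

    def determine_stage(input_ndim: int, has_cross_attention_kv: bool, status_signal: str) -> Stage:
        """First-dimension match -> encoding stage; second-dimension match with no
        cross-attention key-value data -> cross-attention data initialization stage;
        'buffer initialized' status signal -> decoding stage."""
        if input_ndim == ENCODER_INPUT_NDIM:
            return Stage.ENCODING
        if input_ndim == DECODER_INPUT_NDIM and not has_cross_attention_kv:
            return Stage.CROSS_ATTENTION_INIT
        if status_signal == "cross_attention_buffer_initialized":
            return Stage.DECODING
        raise ValueError("judgment data matches no preset inference stage")

    # Stage-to-policy table: encoding favours throughput, cross-attention data
    # initialization favours memory, decoding favours latency.  The cache_split
    # values illustrate a cache-region division rule and are invented proportions.
    POLICY_FOR_STAGE = {
        Stage.ENCODING:             {"name": "throughput_priority", "cache_split": {"input": 0.5, "weights": 0.5}},
        Stage.CROSS_ATTENTION_INIT: {"name": "memory_priority",     "cache_split": {"kv_cache": 0.75, "weights": 0.25}},
        Stage.DECODING:             {"name": "latency_priority",    "cache_split": {"kv_cache": 0.5, "weights": 0.5}},
    }

    def run_encoding_with_throughput_priority(hbm, chip):
        """Schematic of claim 5: feed the chip from the input data storage area in
        high-bandwidth memory, then overwrite that same area with the encoding-stage
        output so no second buffer is needed (hbm/chip interfaces are hypothetical)."""
        features = hbm.read("input_data_area")
        encoded = chip.run_encoder(features)
        hbm.write("input_data_area", encoded)   # overwrite the input with the encoder output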