CN-122021859-A - Model reasoning method, system and terminal equipment

Abstract

Disclosed are a model inference method, a system, and a terminal device, relating to the technical field of AI. After receiving an inference request, a processor runs a model to process the input sequence included in the inference request and obtain an inference result corresponding to the input sequence. While the processor runs the model, the processor is controlled to run a first stage of the large model at a first frequency and a second stage of the large model at a second frequency. The resource occupancy rate of the first stage on the processor is greater than that of the second stage, and the first frequency is greater than the second frequency. Running different inference stages (the first stage or the second stage) of the model at different frequencies dynamically adjusts the processor frequency, so that the processor provides different computing power in different inference stages. This maintains the inference efficiency of the large model while mitigating the high power consumption and heating caused by high computing power.

Inventors

  • WANG LI
  • TAN DING
  • HU TIANHAO
  • YAN LEI
  • ZHANG ZEHAO
  • XU SHU
  • ZANG HUI
  • WANG YAO

Assignees

  • Huawei Technologies Co., Ltd. (华为技术有限公司)

Dates

Publication Date
2026-05-12
Application Date
2025-04-08

Claims (20)

  1. A model inference method, comprising: receiving an inference request, the inference request comprising an input sequence; running a model to process the input sequence and obtain an inference result corresponding to the input sequence; and, while the model is running, controlling a processor to run a first stage of the model at a first frequency and to run a second stage of the model at a second frequency, wherein the resource occupancy rate of the first stage on the processor is greater than the resource occupancy rate of the second stage on the processor, and the first frequency is greater than the second frequency.
  2. The method of claim 1, wherein the processor has first energy efficiency data when running the first stage and second energy efficiency data when running the second stage, the first energy efficiency data and the second energy efficiency data each indicate the utilization rate of the processor when running the model, and the first energy efficiency data differs from the second energy efficiency data.
  3. The method of claim 2, wherein the first energy efficiency data indicates that the model is running in a prefill stage, and the second energy efficiency data indicates that the model is running in a decode stage.
  4. The method according to any one of claims 1 to 3, further comprising: while the model is running, controlling the processor to run a third stage of the model at a third frequency, wherein the resource occupancy rate of the third stage on the processor is greater than that of the second stage, and the third frequency is greater than the second frequency; or, while the model is running, controlling the processor to run a fourth stage of the model at a fourth frequency, wherein the resource occupancy rate of the fourth stage on the processor is less than that of the first stage, and the fourth frequency is less than the first frequency.
  5. The method of claim 4, wherein the processor has third energy efficiency data when running the third stage, the third energy efficiency data indicates the utilization rate of the processor when running the third stage, and the third energy efficiency data indicates that the model is running in a prefill stage; or the processor has fourth energy efficiency data when running the fourth stage, the fourth energy efficiency data indicates the utilization rate of the processor when running the fourth stage, and the fourth energy efficiency data indicates that the model is running in a decode stage.
  6. The method of claim 4 or 5, wherein the third frequency is derived from the first frequency, the resource occupancy rate of the first stage on the processor, and the resource occupancy rate of the third stage on the processor; or the fourth frequency is derived from the second frequency, the resource occupancy rate of the second stage on the processor, and the resource occupancy rate of the fourth stage on the processor.
  7. The method according to any one of claims 1 to 6, further comprising: while the model is running, obtaining energy efficiency data of the processor; and determining, according to the energy efficiency data of the processor, a target inference stage in which the processor is running the model, wherein the target inference stage comprises the first stage or the second stage.
  8. The method according to any one of claims 1 to 7, further comprising: adjusting the frequency of the processor to the first frequency by calling an interface based on determining that the processor enters the first stage; or adjusting the frequency of the processor to the second frequency by calling the interface based on determining that the processor enters the second stage.
  9. The method according to any one of claims 1 to 8, further comprising: controlling a memory to operate at a fifth frequency based on determining that the processor enters the first stage; controlling the memory to operate at a sixth frequency based on determining that the processor enters the second stage; wherein the memory access bandwidth of the processor in the first stage is less than the memory access bandwidth of the processor in the second stage, and the fifth frequency is less than the sixth frequency.
  10. The method according to any one of claims 1 to 9, wherein the first frequency and the second frequency are related to the type of the input sequence, the type of the input sequence comprising text, image, video, or audio; or the first frequency and the second frequency are related to the type of the model.
  11. An operating system, the system comprising: a communication module, configured to receive an inference request, the inference request comprising an input sequence; and a processing module, configured to run a model to process the input sequence and obtain an inference result corresponding to the input sequence, and, while the model is running, to control a processor to run a first stage of the model at a first frequency and a second stage of the model at a second frequency, wherein the resource occupancy rate of the first stage on the processor is greater than the resource occupancy rate of the second stage on the processor, and the first frequency is greater than the second frequency; wherein the communication module is further configured to output the inference result.
  12. A terminal device, comprising a processor and an interface circuit, wherein the interface circuit is configured to receive an inference request, the inference request comprising an input sequence; the processor is configured to run a model to process the input sequence and obtain an inference result corresponding to the input sequence; and, while the model is running, the processor is controlled to run a first stage of the model at a first frequency and to run a second stage of the model at a second frequency, wherein the resource occupancy rate of the first stage on the processor is greater than the resource occupancy rate of the second stage on the processor, and the first frequency is greater than the second frequency.
  13. The terminal device of claim 12, wherein the processor has first energy efficiency data when running the first stage and second energy efficiency data when running the second stage, the first energy efficiency data and the second energy efficiency data each indicate the utilization rate of the processor when running the model, and the first energy efficiency data differs from the second energy efficiency data.
  14. The terminal device of claim 13, wherein the first energy efficiency data indicates that the model is running in a prefill stage, and the second energy efficiency data indicates that the model is running in a decode stage.
  15. The terminal device according to any one of claims 12 to 14, wherein, while the model is running, the processor is controlled to run a third stage of the model at a third frequency, the resource occupancy rate of the third stage on the processor being greater than that of the second stage, and the third frequency being greater than the second frequency; or the processor is controlled to run a fourth stage of the model at a fourth frequency, the resource occupancy rate of the fourth stage on the processor being less than that of the first stage, and the fourth frequency being less than the first frequency.
  16. The terminal device of claim 15, wherein the processor has third energy efficiency data when running the third stage, the third energy efficiency data indicates the utilization rate of the processor when running the third stage, and the third energy efficiency data indicates that the model is running in a prefill stage; or the processor has fourth energy efficiency data when running the fourth stage, the fourth energy efficiency data indicates the utilization rate of the processor when running the fourth stage, and the fourth energy efficiency data indicates that the model is running in a decode stage.
  17. The terminal device according to claim 15 or 16, wherein the third frequency is derived from the first frequency, the resource occupancy rate of the first stage on the processor, and the resource occupancy rate of the third stage on the processor; or the fourth frequency is derived from the second frequency, the resource occupancy rate of the second stage on the processor, and the resource occupancy rate of the fourth stage on the processor.
  18. The terminal device according to any one of claims 12 to 17, wherein the processor is further configured to obtain energy efficiency data of the processor while the model is running, and to determine, according to the energy efficiency data of the processor, a target inference stage in which the processor is running the model, wherein the target inference stage comprises the first stage or the second stage.
  19. The terminal device according to any one of claims 12 to 18, wherein the processor is further configured to adjust the frequency of the processor to the first frequency by calling an interface based on determining that the processor enters the first stage, or to adjust the frequency of the processor to the second frequency by calling the interface based on determining that the processor enters the second stage.
  20. The terminal device according to any one of claims 12 to 19, further comprising a memory and a memory controller, wherein the processor is further configured to send a first instruction to the memory controller based on determining that the processor enters the first stage of the model, or to send a second instruction to the memory controller based on determining that the processor enters the second stage of the model; and the memory controller is configured to control the memory to run at a fifth frequency in response to the first instruction, or to control the memory to run at a sixth frequency in response to the second instruction, wherein the memory access bandwidth of the processor to the memory in the first stage is less than that in the second stage, and the fifth frequency is less than the sixth frequency.
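
Illustrative note (not part of the claims): claims 6 and 17 state that the third or fourth frequency is derived from a reference frequency and two resource occupancy rates, but they do not give a formula. The sketch below assumes one simple possibility, proportional scaling with clamping. The function name `derive_frequency`, the integer percentage representation of occupancy, the frequency limits, and all numeric values are hypothetical.

```python
def derive_frequency(ref_freq_mhz: int, ref_occupancy_pct: int,
                     new_occupancy_pct: int,
                     f_min_mhz: int = 300, f_max_mhz: int = 2400) -> int:
    """Assumed rule: scale the reference frequency by the occupancy ratio.

    The source only says the new frequency is derived from these three
    inputs; proportional scaling is an assumption made for illustration.
    """
    if not (0 < ref_occupancy_pct <= 100 and 0 <= new_occupancy_pct <= 100):
        raise ValueError("occupancy must be a percentage between 0 and 100")
    scaled = ref_freq_mhz * new_occupancy_pct // ref_occupancy_pct
    # Clamp to hypothetical hardware limits.
    return max(f_min_mhz, min(scaled, f_max_mhz))


# Hypothetical values: first stage at 1800 MHz with 90% occupancy,
# second stage at 1000 MHz with 40% occupancy.
first_freq, second_freq = 1800, 1000

# Third stage: occupancy (75%) above the second stage, derived from the first.
third_freq = derive_frequency(first_freq, ref_occupancy_pct=90, new_occupancy_pct=75)

# Fourth stage: occupancy (30%) below the first stage, derived from the second.
fourth_freq = derive_frequency(second_freq, ref_occupancy_pct=40, new_occupancy_pct=30)
```

With these example inputs the derived values respect the orderings required by claim 4: the third frequency exceeds the second frequency, and the fourth frequency is below the first frequency.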

Description

Model reasoning method, system and terminal equipment

Technical Field

The application relates to the technical field of artificial intelligence (AI), and in particular to a model inference method, a model inference system, and a terminal device.

Background

End-side large model inference is the process of executing a large machine learning model on a terminal device. Compared with cloud-side large model inference, it can reduce latency and protect the security of user data, and it allows inference while the terminal device is offline. However, limited by the computing power and memory of the terminal device, end-side large models suffer from inference latency, low inference speed, and heavy resource occupation.

At present, the computing power and memory access bandwidth available to an end-side large model at run time are improved mainly through model pruning, which increases the inference speed of the end-side large model. However, this approach can increase the resource occupation of the terminal device during end-side large model inference, raise the power consumption of the terminal device, and cause the terminal device to heat up.

Disclosure of Invention

The application provides a model inference method, a system, a terminal device, and a program product, which address the high power consumption and heating caused by high computing power during end-side large model inference.

In a first aspect, the application provides a model inference method performed by a processor. After receiving an inference request, the processor runs the model and uses it to process the input sequence included in the inference request, obtaining an inference result corresponding to the input sequence.
While the processor runs the model, the processor is controlled to run a first stage of the large model at a first frequency and a second stage of the large model at a second frequency. The resource occupancy rate of the first stage on the processor is greater than that of the second stage, and the first frequency is greater than the second frequency. Optionally, the resource occupancy rate indicates the ratio between the computing resources that the processor needs to consume when running the first stage or the second stage of the model and the total computing resources provided by the processor.

Based on the first aspect, while the model is running, the processor runs different inference stages (the first stage or the second stage) at different frequencies rather than using the same frequency for every stage. Dynamically adjusting the processor frequency according to the inference stage ensures that the computing power provided by the processor matches the computing power required by each stage, makes full use of processor resources, reduces processor idle time, and thereby improves processor resource utilization.

Because the resource occupancy rate of the first stage is greater than that of the second stage, the processor frequency in the first stage is controlled to be greater than in the second stage. The processor thus runs at a higher frequency under higher resource occupancy, which provides more computing power, speeds up processing of the first stage, shortens the time the processor spends in the first stage, and in turn increases inference speed.
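
The stage-dependent frequency control described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the `FrequencyGovernor` class, the `set_frequency` callback (a stand-in for whatever platform interface adjusts the processor frequency), the frequency values, and the placeholder decode arithmetic are all hypothetical. Treating the first stage as prefill and the second stage as decode follows claim 3.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Stage(Enum):
    PREFILL = "prefill"  # first stage: higher processor resource occupancy
    DECODE = "decode"    # second stage: lower processor resource occupancy


# Hypothetical frequencies in MHz; the source only requires first > second.
STAGE_FREQ_MHZ = {Stage.PREFILL: 1800, Stage.DECODE: 1000}


@dataclass
class FrequencyGovernor:
    """Switches the processor frequency when the inference stage changes."""
    set_frequency: Callable[[int], None]  # hypothetical platform hook
    current_mhz: int = 0

    def enter(self, stage: Stage) -> None:
        target = STAGE_FREQ_MHZ[stage]
        if target != self.current_mhz:  # avoid redundant interface calls
            self.set_frequency(target)
            self.current_mhz = target


def run_inference(prompt_tokens: List[int], max_new_tokens: int,
                  governor: FrequencyGovernor) -> List[int]:
    # First stage: the whole input sequence is processed at the higher frequency.
    governor.enter(Stage.PREFILL)
    context = list(prompt_tokens)  # placeholder for real prefill work

    # Second stage: tokens are generated one at a time at the lower frequency.
    governor.enter(Stage.DECODE)
    output: List[int] = []
    for step in range(max_new_tokens):
        next_token = (sum(context) + step) % 100  # placeholder decode step
        context.append(next_token)
        output.append(next_token)
    return output


# Usage: record the requested frequencies instead of touching real hardware.
requested: List[int] = []
governor = FrequencyGovernor(set_frequency=requested.append)
tokens = run_inference([1, 2, 3], max_new_tokens=4, governor=governor)
```

Here the governor issues one frequency change per stage transition, so the decode loop does not repeatedly call the platform interface while the stage stays the same.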
In addition, because the processor runs the second stage of the model at a lower frequency than the first stage, the computing power provided by the processor changes with the inference stage instead of both stages running at the higher frequency. This reduces the power consumption of the processor during model inference and mitigates the high power consumption and heating caused by running every inference stage at high computing power.

In an alternative implementation, the processor has first energy efficiency data when running the first stage and second energy efficiency data when running the second stage. The first energy efficiency data and the second energy efficiency data each indicate the utilization rate of the processor when the model is running, and the first energy efficiency data differs from the second energy efficiency data. Optionally, the energy efficiency data includes one or more of frequency, bandwidth, memory bit width, or