CN-122021896-A - Dynamic voltage frequency adjustment method and system for large language model reasoning

Abstract

The application relates to the technical field of artificial intelligence chip control, and in particular to a dynamic voltage frequency adjustment method and system for large language model reasoning. The method comprises: identifying, based on detection signals related to the progress of an inference task, whether the current working phase is a computation-intensive Prefill phase or a memory-bandwidth-intensive Decode phase; in response to the identified working phase, determining a target core frequency of a computing core and a target memory frequency of an off-chip memory interface corresponding to the current working phase; and reconfiguring the frequency and voltage of the computing core and the off-chip memory interface to the target core frequency and the target memory frequency in accordance with a preset safety protocol. The method eliminates the redundant power consumption of non-bottleneck resources in each phase of a large language model inference task and greatly improves the computing efficiency per unit of energy consumed.

Inventors

  • TANG XING

Assignees

  • 上海深明奥思半导体科技有限公司

Dates

Publication Date
20260512
Application Date
20260123

Claims (16)

  1. A dynamic voltage frequency adjustment method for large language model reasoning, applied to a processor comprising a computing core and an off-chip memory interface, comprising: identifying, based on a detection signal related to the progress of an inference task, the current working phase as a computation-intensive Prefill phase or a memory-bandwidth-intensive Decode phase; determining, in response to the identified working phase, a target core frequency of the computing core and a target memory frequency of the off-chip memory interface corresponding to the current working phase; and reconfiguring the frequencies and voltages of the computing core and the off-chip memory interface to the target core frequency and the target memory frequency in accordance with a preset safety protocol.
  2. The dynamic voltage frequency adjustment method for large language model reasoning according to claim 1, wherein identifying the current working phase as a computation-intensive Prefill phase or a memory-bandwidth-intensive Decode phase based on the detection signal related to the progress of the inference task comprises: monitoring, in real time, detection signals related to the progress of the inference task, wherein the detection signals comprise at least the value of a decoding step counter, and the decoding step counter is set to 0 when the inference task starts, set to 1 when the inference task generates the probability distribution of the first token, incremented by 1 each time a new token is generated, and reset to 0 when the inference task outputs a stop marker or reaches the maximum resource limit; when the value of the decoding step counter is less than 1, determining that the current working phase is the Prefill phase; and when the value of the decoding step counter is greater than or equal to 1, determining that the current working phase is the Decode phase.
  3. The dynamic voltage frequency adjustment method for large language model reasoning of claim 1, further comprising: monitoring, in real time, the access request sequence of a memory controller or of the off-chip memory interface; analyzing the address continuity or burst transfer length of the access request sequence; when the memory access requests show high address continuity and the average burst length is greater than a first preset threshold, determining, or tending to determine, that the current working phase is the Prefill phase; and when the memory access requests show low address continuity and the average burst length is less than a second preset threshold, determining, or tending to determine, that the current working phase is the Decode phase.
  4. The dynamic voltage frequency adjustment method for large language model reasoning according to claim 1, wherein determining the target core frequency of the computing core and the target memory frequency of the off-chip memory interface corresponding to the current working phase in response to the identified working phase comprises: if the current working phase is identified as the Prefill phase, configuring the target memory frequency as a memory pre-calibrated frequency value and configuring the target core frequency as a maximum core frequency value, wherein the memory pre-calibrated frequency value is a pre-calibrated minimum memory frequency capable of meeting the data supply required by the peak computing power of the computing core; and if the current working phase is identified as the Decode phase, configuring the target memory frequency as a maximum memory frequency value higher than the memory pre-calibrated frequency value and configuring the target core frequency as a core pre-calibrated frequency value lower than the maximum core frequency value, wherein the core pre-calibrated frequency value is a pre-calibrated frequency at which the processing capacity of the computing core matches the maximum memory bandwidth.
  5. The dynamic voltage frequency adjustment method for large language model reasoning of claim 4, further comprising: dynamically calculating, based on the computation density parameter of the current inference task and the memory bandwidth corresponding to the maximum memory frequency value, the core pre-calibrated frequency value at which the processing capacity of the computing core matches the memory bandwidth.
  6. The dynamic voltage frequency adjustment method for large language model reasoning according to claim 5, wherein dynamically calculating the core pre-calibrated frequency value at which the processing capacity of the computing core matches the memory bandwidth, based on the computation density parameter of the current inference task and the memory bandwidth corresponding to the maximum memory frequency value, uses the following formula: [formula not reproduced]; wherein the symbols of the formula denote, in order, the core pre-calibrated frequency value, the memory bandwidth corresponding to the maximum memory frequency value, the computation density, the clock frequency of the computing core, and the processing capacity per clock of the computing core.
  7. The dynamic voltage frequency adjustment method for large language model reasoning according to claim 2, wherein the preset safety protocol comprises: when a frequency-raising operation is executed, first raising the voltage of the power domain of the corresponding computing core or off-chip memory interface, and raising the clock frequency after the voltage has stabilized; and when a frequency-lowering operation is executed, first lowering the clock frequency, and lowering the voltage of the power domain of the corresponding computing core or off-chip memory interface after the frequency has stabilized.
  8. The dynamic voltage frequency adjustment method for large language model reasoning of claim 1, further comprising: performing closed-loop fine-tuning of the current frequency of the computing core based on the real-time bandwidth utilization of the off-chip memory interface.
  9. The dynamic voltage frequency adjustment method for large language model reasoning according to claim 1, wherein performing closed-loop fine-tuning of the current frequency of the computing core based on the real-time bandwidth utilization of the off-chip memory interface comprises: monitoring the current bandwidth utilization of the off-chip memory interface in real time; calculating the error between the current bandwidth utilization and a preset target bandwidth utilization, taking the target bandwidth utilization as the reference; and fine-tuning the current frequency of the computing core in the direction that counteracts the error, based on the direction and magnitude of the error.
  10. The dynamic voltage frequency adjustment method for large language model reasoning of claim 1, further comprising: when the chip temperature or instantaneous power consumption is detected to exceed a safety threshold, forcibly reducing the frequency and voltage of the computing core to a safe level.
  11. A dynamic voltage frequency adjustment system for large language model reasoning, characterized by comprising: at least one computing core for performing inference tasks of a large language model; an off-chip memory interface for connecting an off-chip memory; and a power consumption management unit, connected to the computing core and to the off-chip memory interface respectively, for executing the dynamic voltage frequency adjustment method for large language model reasoning according to any one of claims 1-10 so as to adjust the frequencies of the computing core and the off-chip memory interface.
  12. The dynamic voltage frequency adjustment system for large language model reasoning of claim 11, wherein the power consumption management unit comprises: a phase monitor for outputting a corresponding working phase identification signal according to the detection signal related to the progress of the inference task; a configuration policy module, connected to the phase monitor, for generating corresponding frequency configuration parameters for the computing core and the memory interface from the working phase identification signal according to a preset configuration policy; and a clock and voltage control interface, connected to the configuration policy module, for converting the frequency configuration parameters into specific clock adjustment commands and voltage adjustment sequences.
  13. The system of claim 12, wherein the configuration policy module includes a bandwidth calculation model for dynamically calculating, when the working phase identification signal indicates the Decode phase, a core pre-calibrated frequency value at which the processing capacity of the computing core matches the memory bandwidth, according to the computation density parameter of the current inference task and the memory bandwidth corresponding to the maximum memory frequency value.
  14. The dynamic voltage frequency adjustment system for large language model reasoning of claim 13, wherein the configuration policy module further comprises: an adaptive frequency controller that performs closed-loop fine-tuning of the current frequency of the computing core based on the real-time bandwidth utilization of the off-chip memory interface.
  15. A processor comprising the dynamic voltage frequency adjustment system for large language model reasoning of any of claims 11-14.
  16. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the dynamic voltage frequency adjustment method for large language model reasoning of any of claims 1-10.
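
The control flow of claims 2, 4 and 7 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the applicant's implementation: every name (`FreqTable`, `identify_phase`, `target_frequencies`, `transition_steps`, `plan_reconfiguration`) and every MHz value is hypothetical, and the "steps" are plain strings standing in for real clock and voltage regulator commands.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    PREFILL = "prefill"
    DECODE = "decode"


@dataclass(frozen=True)
class FreqTable:
    """Hypothetical calibration values in MHz; real values are chip-specific."""
    core_max: int = 1800
    core_precal: int = 900   # core frequency matched to peak memory bandwidth
    mem_max: int = 3200
    mem_precal: int = 2000   # lowest memory frequency that still feeds peak compute


def identify_phase(decode_step_counter: int) -> Phase:
    """Claim 2: counter < 1 means Prefill, counter >= 1 means Decode."""
    return Phase.PREFILL if decode_step_counter < 1 else Phase.DECODE


def target_frequencies(phase: Phase, table: FreqTable) -> dict:
    """Claim 4: run the bottleneck resource at maximum, throttle the other one."""
    if phase is Phase.PREFILL:
        return {"core": table.core_max, "mem": table.mem_precal}
    return {"core": table.core_precal, "mem": table.mem_max}


def transition_steps(domain: str, current_mhz: int, target_mhz: int) -> list:
    """Claim 7: raise voltage before frequency; lower frequency before voltage."""
    if target_mhz > current_mhz:
        return [f"{domain}: raise voltage",
                f"{domain}: wait for voltage to settle",
                f"{domain}: raise clock to {target_mhz} MHz"]
    if target_mhz < current_mhz:
        return [f"{domain}: lower clock to {target_mhz} MHz",
                f"{domain}: wait for clock to settle",
                f"{domain}: lower voltage"]
    return []  # already at the target frequency


def plan_reconfiguration(decode_step_counter: int, current: dict,
                         table: FreqTable = FreqTable()):
    """Return (phase, targets, ordered steps) for the core and memory domains."""
    phase = identify_phase(decode_step_counter)
    targets = target_frequencies(phase, table)
    steps = []
    for domain in ("core", "mem"):
        steps += transition_steps(domain, current[domain], targets[domain])
    return phase, targets, steps


if __name__ == "__main__":
    # First token has just been produced: the counter moves from 0 to 1.
    phase, targets, steps = plan_reconfiguration(1, {"core": 1800, "mem": 2000})
    print(phase, targets)
    for step in steps:
        print(" ", step)
```

In this sketch the transition from Prefill to Decode lowers the core clock before its voltage and raises the memory voltage before its clock, which is the ordering the preset safety protocol requires.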

Description

Dynamic voltage frequency adjustment method and system for large language model reasoning

Technical Field

The application relates to the technical field of artificial intelligence chip control, and in particular to a dynamic voltage frequency adjustment method and system for large language model reasoning.

Background

With the widespread use of large language models (LLMs), their inference tasks have become an important component of the artificial intelligence computing load. The LLM inference process typically includes two distinct core phases: a Prefill phase and a Decode phase. In the Prefill phase, the system processes the complete input sequence and performs computation-intensive matrix-matrix multiplications; utilization of the computing core is extremely high, and the phase places peak demands on the computing power of the processor. In the subsequent Decode phase, the system generates output tokens one by one in an autoregressive manner and mainly performs memory-bandwidth-intensive matrix-vector multiplications. This places extremely strict requirements on the data throughput of off-chip memory (e.g. DDR, HBM), and the computing core is often idle or inefficient while it waits for weight data.

Currently, the mainstream hardware platforms that carry LLM inference tasks, such as general-purpose graphics processing units (GPUs) or general-purpose artificial intelligence (AI) accelerator chips, focus their power management strategies on global thermal throttling or on dynamic voltage and frequency scaling (DVFS) based on average load. These strategies are typically coarse-grained and reactive; for example, they uniformly adjust the frequency and voltage of all compute units and memory interfaces according to the overall chip temperature or power envelope.
However, this one-size-fits-all power management approach fails to adapt accurately to the resource bottleneck, which shifts sharply between the two phases of LLM inference, resulting in significant energy efficiency loss. Specifically, in the Prefill phase, keeping the memory interface at high frequency and high voltage at all times in order to match the compute peak creates a large amount of redundant static and dynamic power consumption. Conversely, in the Decode phase, although memory bandwidth has become the bottleneck of system performance, the computing core still operates at a high frequency and its processing speed far exceeds what the memory can supply, so the core frequently idles and generates wasted dynamic power, while the memory interface may still not be operating at its most energy-efficient point. This mismatch between resources and load not only results in unnecessary energy consumption, but also limits the sustained performance of the chip and may affect reliability owing to long-term high-temperature operation.

Disclosure of Invention

In order to overcome the shortcomings of the prior art, the application aims to provide a dynamic voltage frequency adjustment method and system for large language model reasoning, which adjust voltage and frequency phase by phase during large language model inference so as to reduce the overall power consumption of the processor.
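
The Decode-phase mismatch described in the Background can be made concrete with a rough roofline-style estimate. The Python sketch below is an illustration only: the function names, the layer size (4096 x 4096 weights), the 2-byte element size, the bandwidth and FLOPs-per-clock figures, and the relation f = bandwidth x intensity / FLOPs-per-clock are all assumptions. They are not taken from the application, whose own formula is not reproduced in this text.

```python
def arithmetic_intensity(tokens: int, d_in: int, d_out: int,
                         bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for a (tokens x d_in) @ (d_in x d_out) product.

    Counts one multiply and one add per weight per token, and assumes the
    activations, the weights and the outputs each cross the memory
    interface exactly once (a simplification).
    """
    flops = 2 * tokens * d_in * d_out
    bytes_moved = bytes_per_elem * (tokens * d_in + d_in * d_out + tokens * d_out)
    return flops / bytes_moved


def balanced_core_frequency_hz(bandwidth_bytes_per_s: float,
                               intensity_flops_per_byte: float,
                               flops_per_clock: float) -> float:
    """Core clock at which compute throughput equals what memory can feed.

    Above this frequency the core would only wait for data, so the extra
    clock rate costs power without adding throughput.
    """
    return bandwidth_bytes_per_s * intensity_flops_per_byte / flops_per_clock


if __name__ == "__main__":
    prefill = arithmetic_intensity(tokens=1024, d_in=4096, d_out=4096)
    decode = arithmetic_intensity(tokens=1, d_in=4096, d_out=4096)
    print(f"Prefill intensity: {prefill:.1f} FLOPs/byte")
    print(f"Decode intensity:  {decode:.3f} FLOPs/byte")

    # Hypothetical chip: 1 TB/s memory bandwidth, 4096 FLOPs per core clock.
    f_balanced = balanced_core_frequency_hz(1e12, decode, 4096)
    print(f"Balanced Decode core clock: {f_balanced / 1e6:.0f} MHz")
```

Under these assumptions a single-token matrix-vector product performs about one FLOP per byte of weights read, while a 1024-token Prefill batch performs several hundred, which is why the balanced core clock in the Decode phase falls far below the clock that the Prefill phase can use.
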
To achieve the above object, the present application provides a dynamic voltage frequency adjustment method for large language model reasoning, applied to a processor including a computing core and an off-chip memory interface, comprising: identifying, based on a detection signal related to the progress of an inference task, the current working phase as a computation-intensive Prefill phase or a memory-bandwidth-intensive Decode phase; determining, in response to the identified working phase, a target core frequency of the computing core and a target memory frequency of the off-chip memory interface corresponding to the current working phase; and reconfiguring the frequency and voltage of the computing core and the off-chip memory interface to the target core frequency and the target memory frequency in accordance with a preset safety protocol.

Further, identifying the current working phase as a computation-intensive Prefill phase or a memory-bandwidth-intensive Decode phase based on the detection signal related to the progress of the inference task comprises: monitoring, in real time, detection signals related to the progress of the inference task, wherein the detection signals comprise at least the value of a decoding step counter, the decoding step counter is configured to be 0 when the inference task starts, configured to be 1 when the inference task generates a probability distribution of a first token, and configured to be 0 when each new token is generated, the decoding step counter is adde