
CN-122018796-A - Hot and cold data exchange method, device, system, medium, and product for a key-value cache


Abstract

The invention discloses a hot and cold data exchange method, device, system, medium, and product for a key-value cache. The method comprises the following steps. A computing core determines the target historical key-value cache of each target historical token required for generating a current token, and sends the corresponding key-value identification list to a data access control unit. When the data access control unit determines, according to the key-value identification list, that a first memory does not store all of the target historical key-value caches and that its remaining space cannot fully hold the missing ones, it evicts a plurality of historical key-value caches and transfers the missing target historical key-value caches from a second memory to the first memory. When the computing core detects that the current system time has reached the generation time point of the current token, it acquires each target historical key-value cache from the first memory at high speed to generate and output the current token. This technical scheme keeps the cost of storing the target historical key-value cache data low while ensuring that the computing core has sufficiently high bandwidth when accessing the target historical key-value caches.

Inventors

  • MA WENCHAO
  • ZHU JIANQIU
  • ZHANG YALIN
  • LUO WEI

Assignees

  • 上海云燧科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-01-23

Claims (14)

  1. A hot and cold data exchange method for a key-value cache, characterized in that it is applied to a scenario in which a dedicated data processing chip performs a large-model inference task, wherein the dedicated data processing chip is communicatively connected to a first memory, a second memory, and a data access control unit, respectively; the communication bandwidth of the first memory is greater than that of the second memory; the storage capacity of the first memory is smaller than that of the second memory; and the dedicated data processing chip includes a computing core, the method comprising: when the computing core determines the target historical key-value cache of each target historical token required for generating a current token, sending a key-value identification list of the target historical key-value caches to the data access control unit; when the data access control unit determines, according to the key-value identification list, that the first memory does not store all of the target historical key-value caches and that the remaining space of the first memory cannot fully hold the missing target historical key-value caches, controlling the first memory to evict a plurality of historical key-value caches and then transferring the missing target historical key-value caches from the second memory to the first memory; and when the computing core detects that the current system time has reached the generation time point of the current token, acquiring, by the computing core, each target historical key-value cache from the first memory at high speed, and generating and outputting the current token according to the target historical key-value caches.
  2. The method of claim 1, wherein the time point at which the target historical key-value cache of each target historical token required for generating the current token is determined is earlier than the generation time point of the current token, and the determination time point and the generation time point are separated by a time interval of a preset order of magnitude.
  3. The method according to claim 1, wherein, when the data access control unit determines, according to the key-value identification list, that the first memory does not store all of the target historical key-value caches and that the remaining space of the first memory cannot fully hold the missing target historical key-value caches, controlling the first memory to evict a plurality of historical key-value caches and then transferring the missing target historical key-value caches from the second memory to the first memory comprises: comparing, by the data access control unit, each key-value identifier in the key-value identification list with the historical key-value caches stored in the first memory, and identifying the target historical key-value caches missing from the first memory; detecting, by the data access control unit, the remaining space of the first memory, and, upon determining that the remaining space cannot fully hold the missing target historical key-value caches, evicting a plurality of historical key-value caches from the first memory according to a preset data eviction policy; and generating, by the data access control unit, a data transfer instruction matching the missing target historical key-value caches, and transferring the missing target historical key-value caches from the second memory to the first memory by executing the data transfer instruction.
  4. The method according to claim 3, wherein evicting, by the data access control unit, a plurality of historical key-value caches from the first memory according to a preset data eviction policy comprises: calculating, by the data access control unit, the amount of space required by the missing target historical key-value caches; sorting, by the data access control unit, the historical key-value caches in the first memory according to their historical access counts and/or data lifetimes; and evicting, by the data access control unit and according to the sorting result, a plurality of historical key-value caches from the first memory whose total size matches the required amount of space.
  5. The method of claim 1, further comprising, after the computing core sends the key-value identification list of the target historical key-value caches to the data access control unit: when the data access control unit determines, according to the key-value identification list, that the first memory already stores all of the target historical key-value caches, performing no data transfer from the second memory to the first memory; and when the data access control unit determines, according to the key-value identification list, that the first memory does not store all of the target historical key-value caches and that the remaining space of the first memory can fully hold the missing target historical key-value caches, transferring the missing target historical key-value caches directly from the second memory to the first memory.
  6. The method according to any one of claims 1-5, wherein acquiring, by the computing core, each target historical key-value cache from the first memory at high speed upon detecting that the current system time has reached the generation time point of the current token specifically comprises: upon detecting that the current system time has reached the generation time point of the current token, sending, by the computing core, a plurality of data read requests for the target historical key-value caches to the data access control unit; and acquiring, by the data access control unit and according to the received data read requests, each target historical key-value cache from the first memory at high speed, and returning each target historical key-value cache to the computing core.
  7. The method according to claim 6, wherein acquiring, by the data access control unit and according to the received data read requests, each target historical key-value cache from the first memory at high speed and returning each target historical key-value cache to the computing core comprises: acquiring, by the data access control unit and according to the received data read requests, the target historical key-value cache corresponding to each data read request from the first memory at high speed; if the data access control unit determines that the current target historical key-value cache matching a current data read request is not currently stored in the first memory, temporarily holding the current data read request; and, upon receiving the current target historical key-value cache returned by the second memory, providing, by the data access control unit, the current target historical key-value cache directly to the computing core as the response to the current data read request, and simultaneously transferring the current target historical key-value cache to the first memory.
  8. The method of any one of claims 1-5, further comprising, after the computing core generates and outputs the current token according to the target historical key-value caches: storing, by the computing core, the current key-value cache matching the current token into both the first memory and the second memory as the latest historical key-value cache, for use in generating the next token.
  9. The method of any one of claims 1-5, further comprising, after the computing core generates and outputs the current token according to the target historical key-value caches: after a complete large-model inference process has finished, identifying, by the data access control unit, all associated historical key-value caches that are stored in the first memory and belong to that large-model inference process; and marking, by the data access control unit, each associated historical key-value cache as invalid in a local management data set, wherein the cache space occupied by data marked as invalid can be reclaimed.
  10. The method of claim 1, wherein there are a plurality of first memories and a plurality of second memories, and each first memory is divided in advance into a non-exchangeable data area and an exchangeable data area, wherein the historical key-value caches are stored in the exchangeable data area of each first memory.
  11. A hot and cold data exchange device for a key-value cache, characterized in that it is applied to a scenario in which a dedicated data processing chip performs a large-model inference task, wherein the dedicated data processing chip is communicatively connected to a first memory, a second memory, and a data access control unit, respectively; the communication bandwidth of the first memory is greater than that of the second memory; the storage capacity of the first memory is smaller than that of the second memory; and the dedicated data processing chip includes a computing core, the device comprising: a key-value cache confirming module, configured to send a key-value identification list of the target historical key-value caches to the data access control unit when the computing core determines the target historical key-value cache of each target historical token required for generating a current token; a key-value cache transfer module, configured to, when the data access control unit determines, according to the key-value identification list, that the first memory does not store all of the target historical key-value caches and that the remaining space of the first memory cannot fully hold the missing target historical key-value caches, control the first memory to evict a plurality of historical key-value caches and then transfer the missing target historical key-value caches from the second memory to the first memory; and a token generating module, configured to, when the computing core detects that the current system time has reached the generation time point of the current token, acquire each target historical key-value cache from the first memory at high speed, and generate and output the current token according to the target historical key-value caches.
  12. A hot and cold data exchange system for a key-value cache, characterized by comprising a dedicated data processing chip, a first memory, a second memory, and a data access control unit, wherein the dedicated data processing chip is communicatively connected to the first memory, the second memory, and the data access control unit, respectively; and wherein the hot and cold data exchange method for a key-value cache according to any one of claims 1-10 is implemented through the coordinated execution of the computing core and the data access control unit.
  13. A computer-readable storage medium storing computer instructions for causing a processor to perform the hot and cold data exchange method for a key-value cache according to any one of claims 1-10.
  14. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the hot and cold data exchange method for a key-value cache according to any one of claims 1-10.
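
The miss detection, eviction, and transfer steps recited in claims 3 to 5 can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the patent: the class `DataAccessControlUnit`, its methods `prepare` and `read`, the abstract size units, and the logical clock. The sort order (fewest accesses first, then oldest) is one possible reading of "historical access counts and/or data lifetimes", and protecting the currently requested entries from eviction is an added assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    size: int
    access_count: int = 0
    stored_at: int = 0  # logical time at which the entry entered the first memory


@dataclass
class DataAccessControlUnit:
    """Hypothetical model of the data access control unit (not the patent's design)."""
    capacity: int  # capacity of the first (fast, small) memory, in abstract units
    slow: dict     # second (slow, large) memory: kv_id -> size; assumed to hold every cache
    fast: dict = field(default_factory=dict)  # first memory: kv_id -> Entry
    clock: int = 0

    def free_space(self):
        return self.capacity - sum(e.size for e in self.fast.values())

    def prepare(self, kv_ids):
        """Process a key-value identification list; return (evicted_ids, transferred_ids)."""
        self.clock += 1
        # Step 1: compare the list with the first memory to find the missing caches.
        missing = [i for i in dict.fromkeys(kv_ids) if i not in self.fast]
        needed = sum(self.slow[i] for i in missing)
        evicted = []
        # Step 2: evict only when the remaining space cannot hold the missing caches.
        if needed > self.free_space():
            protected = set(kv_ids)  # assumption: never evict what was just requested
            victims = sorted(
                (i for i in self.fast if i not in protected),
                key=lambda i: (self.fast[i].access_count, self.fast[i].stored_at),
            )
            for i in victims:
                if needed <= self.free_space():
                    break
                del self.fast[i]
                evicted.append(i)
        if needed > self.free_space():
            raise MemoryError("requested caches do not fit in the first memory")
        # Step 3: transfer the missing caches from the second memory to the first.
        for i in missing:
            self.fast[i] = Entry(size=self.slow[i], stored_at=self.clock)
        return evicted, missing

    def read(self, kv_id):
        """Serve a read from the first memory and record the access."""
        self.fast[kv_id].access_count += 1
        return kv_id


dacu = DataAccessControlUnit(capacity=4, slow={t: 1 for t in range(8)})
print(dacu.prepare([0, 1, 2, 3]))  # ([], [0, 1, 2, 3]): enough space, direct transfer
dacu.read(0)
dacu.read(1)
print(dacu.prepare([0, 1, 4, 5]))  # ([2, 3], [4, 5]): never-read entries are evicted first
```

The first call corresponds to the branch in claim 5 where the remaining space suffices, and the second to the eviction branch in claims 3 and 4. A call whose caches are all resident returns two empty lists, matching the no-transfer branch.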

Description

Hot and cold data exchange method, device, system, medium, and product for a key-value cache
Technical Field
The invention relates to the technical field of artificial intelligence (AI), and in particular to a hot and cold data exchange method, device, system, medium, and product for a key-value cache.
Background
With the continuous evolution of large language model (LLM) technology, the ability of models to output long sequences has steadily improved, and the demands that inference scenarios place on context understanding keep rising. When a large language model infers and outputs the current token, it relies on the key-value caches (KV caches) of all tokens generated before the current token to compute attention among tokens. However, the storage footprint of the key-value cache grows quadratically with the length of the output token sequence, so key-value cache data grows explosively in long-sequence generation scenarios, which poses new challenges for the capacity and bandwidth of the storage medium.
To address the explosive growth of the key-value cache, some models adopt a key-value cache sparsification scheme for model inference. In the stage of determining the current token, such a scheme no longer uses the key-value caches of all historical tokens; instead, it selects the key-value caches of a fixed number of historical tokens that are highly relevant to the current token and performs the computation on those.
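
The sparsified selection described above can be sketched as a top-k choice over the historical tokens. The source does not say how relevance to the current token is measured; the dot product between a query vector and each historical key is an assumption made purely for illustration, and the function name `select_target_tokens` is hypothetical.

```python
import heapq


def select_target_tokens(query, history_keys, k):
    """Return the indices of the k historical tokens whose keys score highest
    against the query (assumed relevance measure: dot product)."""
    scores = [sum(q * x for q, x in zip(query, key)) for key in history_keys]
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(top)


# Five historical tokens with 2-dimensional keys; keep a fixed number (k = 2).
history_keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0], [0.5, 0.5]]
query = [1.0, 0.0]
target_ids = select_target_tokens(query, history_keys, k=2)
print(target_ids)  # [0, 2]
```

The resulting index list plays the role of the key-value identification list: only those entries need to be resident in fast memory when the current token is generated.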
In the course of realizing the invention, the inventors found that applying the key-value cache sparsification scheme directly on the storage system of an existing artificial intelligence chip either incurs a large economic cost or introduces a large inference latency, so the technical advantages of the sparsification scheme cannot be fully realized.
Disclosure of Invention
The embodiments of the invention provide a hot and cold data exchange method, device, system, medium, and product for a key-value cache, which can achieve both high-bandwidth computation and low-cost storage in scenarios where large-model inference is executed with a key-value cache sparsification scheme.
According to one aspect of the embodiments of the present invention, there is provided a hot and cold data exchange method for a key-value cache, applied to a scenario in which a dedicated data processing chip performs a large-model inference task, wherein the dedicated data processing chip is communicatively connected to a first memory, a second memory, and a data access control unit, respectively; the communication bandwidth of the first memory is greater than that of the second memory; the storage capacity of the first memory is smaller than that of the second memory; and the dedicated data processing chip includes a computing core, the method comprising:
when the computing core determines the target historical key-value cache of each target historical token required for generating a current token, sending a key-value identification list of the target historical key-value caches to the data access control unit;
when the data access control unit determines, according to the key-value identification list, that the first memory does not store all of the target historical key-value caches and that the remaining space of the first memory cannot fully hold the missing target historical key-value caches, controlling the first memory to evict a plurality of historical key-value caches and then transferring the missing target historical key-value caches from the second memory to the first memory; and
when the computing core detects that the current system time has reached the generation time point of the current token, acquiring, by the computing core, each target historical key-value cache from the first memory at high speed, and generating and outputting the current token according to the target historical key-value caches.
According to another aspect of the embodiments of the present invention, there is provided a hot and cold data exchange device for a key-value cache, applied to a scenario in which a dedicated data processing chip performs a large-model inference task, wherein the dedicated data processing chip is communicatively connected to a first memory, a second memory, and a data access control unit, respectively; the communication bandwidth of the first memory is greater than that of the second memory; the storage capacity of the first memory is smaller than that of the second memory; and the dedicated data processing chip includes a computing core, the device comprising:
a key-value cache confirming module, configured to send a key-value identification list of each target historical key-value cache to the data access control unit when determining the