CN-122021565-A - Method and device for generating text by using large language model
Abstract
The embodiments of this specification provide a method and a device for generating text with a large language model. In the thinking stage of the large language model, the method determines whether the sequence index t of the t-th token to be generated satisfies a preset interval condition. If the interval condition is satisfied, importance scores corresponding to each historical token are obtained, the key vectors and value vectors corresponding to a first number of historical tokens are read from the historical KV cache in descending order of importance score to form recall data, and the key vectors and value vectors of the reference historical tokens are formed based on the recall data. If the interval condition is not satisfied, the most recent recall data is reused to form the key vectors and value vectors of the reference historical tokens. Attention scores between the current user question and the reference historical tokens are then computed to generate the t-th token, the t-th token is appended to the generated thinking text, and the key vector and value vector of the t-th token are stored in the historical KV cache.
Inventors
- DONG WEI
- ZHU JIANGCAI
- SHAO KAILAI
- CHEN CHAO
- HU HAIXIANG
Assignees
- 支付宝(杭州)数字服务技术有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260116
Claims (13)
- 1. A method of generating text using a large language model, comprising: in the thinking stage of the large language model, determining whether the sequence index t of the t-th token to be generated satisfies a preset interval condition; if the interval condition is satisfied, obtaining importance scores corresponding to each historical token, reading key vectors and value vectors corresponding to a first number of historical tokens from a historical KV cache in descending order of importance score to form recall data, and forming key vectors and value vectors of reference historical tokens based on the recall data; if the interval condition is not satisfied, reusing the most recent recall data to form the key vectors and value vectors of the reference historical tokens; and computing attention scores between the current user question and the reference historical tokens so as to generate the t-th token, appending the t-th token to the generated thinking text, and storing the key vector and value vector of the t-th token in the historical KV cache.
- 2. The method of claim 1, further comprising: in the answer generation stage of the large language model, for each token position, reading the key vectors and value vectors corresponding to the full set of historical tokens from the historical KV cache, computing attention scores between the current user question and the full set of historical tokens so as to generate the next token, and storing the key vector and value vector of the next token in the historical KV cache.
- 3. The method of claim 1, wherein the historical tokens include the current user question and the generated thinking text.
- 4. The method of claim 1, wherein the preset interval condition comprises: the relationship between the sequence index t and a preset interval parameter conforming to a preset relationship.
- 5. The method of claim 4, wherein the preset relationship comprises: the sequence index t being an integer multiple of the interval parameter.
- 6. The method of claim 1, wherein forming the key vectors and value vectors of the reference historical tokens based on the recall data comprises: reading the key vectors and value vectors corresponding to the most recent second number of historical tokens as latest window data; and forming the key vectors and value vectors of the reference historical tokens based on the recall data and the latest window data.
- 7. The method of claim 1, wherein reusing the most recent recall data to form the key vectors and value vectors of the reference historical tokens comprises: reading the key vectors and value vectors corresponding to the most recent second number of historical tokens as latest window data; and forming the key vectors and value vectors of the reference historical tokens based on the most recent recall data and the latest window data.
- 8. The method of claim 1, wherein obtaining the importance scores corresponding to each historical token comprises: obtaining each attention score between any first historical token and the other, second historical tokens; and averaging these attention scores to obtain the importance score of the first historical token.
- 9. The method of claim 1, wherein computing the attention scores between the current user question and the reference historical tokens so as to generate the t-th token comprises: computing a first query vector corresponding to the current user question; computing each attention score from the first query vector and the key vectors corresponding to the reference historical tokens; and determining the representation vector of the t-th token from the attention scores and the value vectors corresponding to the reference historical tokens, so as to determine the t-th token.
- 10. The method of claim 2, wherein computing the attention scores between the current user question and the full set of historical tokens so as to generate the next token comprises: computing a first query vector corresponding to the current user question; computing each attention score from the first query vector and the key vectors corresponding to the full set of historical tokens read from the historical KV cache; and determining the representation vector of the next token from the attention scores and the value vectors corresponding to the full set of historical tokens, so as to generate the next token.
- 11. An apparatus for generating text using a large language model, comprising: a determination unit configured to determine, in the thinking stage of the large language model, whether the sequence index t of the t-th token to be generated satisfies a preset interval condition; a recall unit configured to, if the determination unit determines that the interval condition is satisfied, obtain importance scores corresponding to each historical token and read key vectors and value vectors corresponding to a first number of historical tokens from the historical KV cache in descending order of importance score to form recall data; a reuse unit configured to, if the determination unit determines that the interval condition is not satisfied, reuse the most recent recall data from the recall unit to form the key vectors and value vectors of the reference historical tokens; and a generation unit configured to compute attention scores between the current user question and the reference historical tokens obtained by the recall unit or the reuse unit so as to generate the t-th token, append the t-th token to the generated thinking text, and store the key vector and value vector of the t-th token in the historical KV cache.
- 12. A computer-readable storage medium having a computer program stored thereon which, when executed in a computer, causes the computer to perform the method of any one of claims 1-10.
- 13. A computing device comprising a memory storing executable code and a processor which, when executing the executable code, implements the method of any one of claims 1-10.
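The contrast at the heart of claims 1 and 2 — a small reference subset of the KV cache during the thinking stage versus the full cache during the answer generation stage — can be illustrated with a toy numpy sketch. All names (`step_attention`, the index set `ref`) are hypothetical and not taken from the patent:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def step_attention(q, keys, values):
    """One decoding step: attend the current query over the given KV set."""
    return softmax(keys @ q / np.sqrt(q.shape[0])) @ values

d = 4
rng = np.random.default_rng(0)
k_cache = rng.standard_normal((10, d))   # full historical KV cache
v_cache = rng.standard_normal((10, d))
q = rng.standard_normal(d)

# Thinking stage (claim 1): attention over a small reference subset only,
# e.g. recalled high-importance tokens plus the latest window.
ref = np.array([0, 3, 8, 9])             # hypothetical recall + window indices
ctx_think = step_attention(q, k_cache[ref], v_cache[ref])

# Answer generation stage (claim 2): the full cache is read at every position.
ctx_answer = step_attention(q, k_cache, v_cache)
```

In the thinking stage only four of the ten cached rows are touched per step, while the answer stage reads all ten; both produce a context vector of the same shape.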
Description
Method and device for generating text by using large language model

Technical Field

One or more embodiments of the present specification relate to the field of computers, and more particularly, to a method and apparatus for generating text using a large language model.

Background

With the rapid development of artificial intelligence technology, generative artificial intelligence represented by the large language model (large language model, LLM) exhibits strong capabilities in the field of natural language processing. Large language models can understand complex instructions, generate high-quality text, and conduct multi-round conversations, and they play key roles in many application scenarios such as question answering, content creation, and code generation. However, large language models face significant computational and memory bottlenecks during inference, which is particularly apparent in long-sequence generation tasks. In the prior art, using the key-value (KV) cache as the core optimization of the decoding stage greatly reduces repeated computation in the attention mechanism, but the storage and computation overhead of the KV cache grows linearly with sequence length, which leads to increased latency and excessive resource consumption. Accordingly, there is a need for an improved solution that reduces latency and resource consumption and achieves a balance between speed and quality.

Disclosure of Invention

One or more embodiments of the present specification describe a method and apparatus for generating text using a large language model, which can reduce latency and resource consumption and achieve a balance between speed and quality.
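The linear growth described in the Background can be seen in a minimal sketch of one step of KV-cached attention (numpy, illustrative only; not the patent's implementation):

```python
import numpy as np

def attend_with_cache(q, k_cache, v_cache):
    """One decoding step of attention against a KV cache.

    q: (d,) query of the current token; k_cache, v_cache: (t, d) for the t
    tokens seen so far. Both the dot products below and the cache itself
    scale linearly with t, which is the bottleneck described above.
    """
    scores = k_cache @ q / np.sqrt(q.shape[0])   # (t,) one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over the whole history
    return weights @ v_cache                     # (d,) context vector

# The cache gains one row per generated token, so step t costs O(t):
d = 8
k_cache = np.zeros((0, d))
v_cache = np.zeros((0, d))
for _ in range(16):
    q = np.random.randn(d)
    k_new, v_new = np.random.randn(d), np.random.randn(d)
    k_cache = np.vstack([k_cache, k_new])        # cache grows with the sequence
    v_cache = np.vstack([v_cache, v_new])
    ctx = attend_with_cache(q, k_cache, v_cache)
```

After 16 steps the cache holds 16 rows per layer and head; for long sequences this per-token growth is exactly what the interval-conditioned recall below avoids.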
In a first aspect, a method for generating text using a large language model is provided, comprising: in the thinking stage of the large language model, determining whether the sequence index t of the t-th token to be generated satisfies a preset interval condition; if the interval condition is satisfied, obtaining importance scores corresponding to each historical token, and reading key vectors and value vectors corresponding to a first number of historical tokens from a historical KV cache in descending order of importance score to form recall data; if the interval condition is not satisfied, reusing the most recent recall data to form the key vectors and value vectors of the reference historical tokens; and computing attention scores between the current user question and the reference historical tokens so as to generate the t-th token, appending the t-th token to the generated thinking text, and storing the key vector and value vector of the t-th token in the historical KV cache.

In one possible embodiment, the method further comprises: in the answer generation stage of the large language model, for each token position, reading the key vectors and value vectors corresponding to the full set of historical tokens from the historical KV cache, computing attention scores between the current user question and the full set of historical tokens so as to generate the next token, and storing the key vector and value vector of the next token in the historical KV cache.

In one possible implementation, the historical tokens include the current user question and the generated thinking text.

In one possible embodiment, the preset interval condition includes: the relationship between the sequence index t and a preset interval parameter conforming to a preset relationship. Further, the preset relationship includes: the sequence index t being an integer multiple of the interval parameter.
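The interval-conditioned recall described above can be sketched as follows. The hyperparameter names (`interval`, `top_k`, `window`) and the mean-attention importance score are illustrative assumptions, not the patent's exact implementation:

```python
import numpy as np

class ThinkingStageRecall:
    """Sketch of interval-conditioned KV recall for the thinking stage.

    Every `interval` steps, the top-`top_k` historical tokens by importance
    are re-read from the KV cache (the "recall data"); between those steps
    the previous recall data is reused. The most recent `window` tokens are
    always appended as the "latest window data".
    """

    def __init__(self, interval=4, top_k=2, window=2):
        self.interval, self.top_k, self.window = interval, top_k, window
        self.recall = None  # (K, V) pair from the most recent recall

    def reference_kv(self, t, k_cache, v_cache, attn_matrix):
        if self.recall is None or t % self.interval == 0:
            # Importance = mean attention each historical token receives
            # from the others (one possible scoring, per the claims).
            scores = attn_matrix.mean(axis=0)
            idx = np.argsort(-scores)[: self.top_k]     # descending importance
            self.recall = (k_cache[idx], v_cache[idx])  # fresh recall data
        k_r, v_r = self.recall                          # reused between intervals
        k_ref = np.concatenate([k_r, k_cache[-self.window:]])
        v_ref = np.concatenate([v_r, v_cache[-self.window:]])
        return k_ref, v_ref
```

However many tokens the cache holds, `k_ref` has only `top_k + window` rows, so the per-step attention cost in the thinking stage stays constant rather than growing with the sequence length.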
In one possible implementation, forming the key vectors and value vectors of the reference historical tokens based on the recall data includes: reading the key vectors and value vectors corresponding to the most recent second number of historical tokens as latest window data; and forming the key vectors and value vectors of the reference historical tokens based on the recall data and the latest window data.

In one possible implementation, reusing the most recent recall data to form the key vectors and value vectors of the reference historical tokens includes: reading the key vectors and value vectors corresponding to the most recent second number of historical tokens as latest window data; and forming the key vectors and value vectors of the reference historical tokens based on the most recent recall data and the latest window data.

In a possible implementation, obtaining the importance scores corresponding to each historical token includes: obtaining each attention score between any first historical token and the other, second historical tokens; and averaging these attention scores to obtain the importance score of the first historical token.

In one possible implementation, the computing the attention score between the current user question and t