CN-122021922-A - Compression method and apparatus for system prompts of a large language model, electronic device, storage medium, and program product
Abstract
The disclosure relates to a compression method and apparatus for system prompts of a large language model, an electronic device, a storage medium, and a program product. The method comprises: obtaining at least one structured content block in a system prompt of a large language model and a placeholder for each of the at least one structured content block; constructing a mapping table between the at least one structured content block and the corresponding placeholders; replacing the at least one structured content block in the system prompt with the corresponding placeholders to obtain a first prompt; compressing the first prompt to obtain a second prompt; and replacing the placeholders in the second prompt with the corresponding structured content blocks based on the mapping table, to obtain a compressed prompt of the system prompt.
Inventors
- LV BO
Assignees
- Beijing Dajia Internet Information Technology Co., Ltd. (北京达佳互联信息技术有限公司)
Dates
- Publication Date
- 20260512
- Application Date
- 20260213
Claims (15)
- 1. A method for compressing a system prompt of a large language model, comprising: acquiring at least one structured content block in a system prompt of a large language model and a placeholder corresponding to each of the at least one structured content block; replacing the at least one structured content block in the system prompt with the corresponding placeholders to obtain a first prompt; constructing a mapping table between the at least one structured content block and the corresponding placeholders; compressing the first prompt to obtain a second prompt; and replacing the placeholders in the second prompt with the corresponding structured content blocks based on the mapping table, to obtain a compressed prompt of the system prompt.
- 2. The compression method of claim 1, wherein compressing the first prompt to obtain the second prompt comprises: dividing the first prompt into a plurality of paragraphs; allocating a target token count to each paragraph by compressing the plurality of paragraphs uniformly, to obtain a reference retention ratio of each paragraph, wherein the target token count is the target token count of the compressed prompt of the system prompt; obtaining an offset retention ratio of each paragraph based on the perplexity of each of the plurality of paragraphs and a preset offset intensity coefficient; obtaining a target retention ratio of each paragraph based on the reference retention ratio and the offset retention ratio of that paragraph; and compressing each paragraph according to its target retention ratio to obtain the second prompt.
- 3. The compression method of claim 2, wherein compressing each paragraph according to its target retention ratio to obtain the second prompt comprises, for each paragraph: in response to the target retention ratio of the current paragraph being smaller than a first compression threshold, clipping the current paragraph in a sentence-level clipping mode to obtain a compressed paragraph of the current paragraph, wherein the sentence-level clipping mode clips the paragraph in units of sentences; in response to the target retention ratio of the current paragraph being greater than or equal to the first compression threshold and less than or equal to a second compression threshold, clipping the current paragraph in a phrase-level clipping mode to obtain the compressed paragraph of the current paragraph, wherein the phrase-level clipping mode clips the paragraph in units of phrases; in response to the target retention ratio of the current paragraph being greater than the second compression threshold, clipping the current paragraph in a token-level clipping mode to obtain the compressed paragraph of the current paragraph, wherein the token-level clipping mode clips the paragraph in units of tokens; and obtaining the second prompt based on the compressed paragraphs of the paragraphs.
- 4. The compression method of claim 3, wherein clipping the current paragraph in the sentence-level clipping mode to obtain the compressed paragraph of the current paragraph comprises: acquiring the self-information of each token in the current paragraph; determining the self-information of each sentence in the current paragraph based on the self-information of each token; and, for the sentences in the current paragraph whose self-information is smaller than a clipping threshold, retaining only the sentence-initial guide words of a first preset number of those sentences to obtain the compressed paragraph of the current paragraph, wherein the first preset number is determined based on the target retention ratio of the current paragraph.
- 5. The compression method of claim 3, wherein clipping the current paragraph in the phrase-level clipping mode to obtain the compressed paragraph of the current paragraph comprises: acquiring the self-information of each token in the current paragraph; determining the self-information of each phrase in the current paragraph based on the self-information of each token; and, from the phrases in the current paragraph whose self-information is smaller than a clipping threshold, deleting a second preset number of phrases that do not belong to core grammatical components to obtain the compressed paragraph of the current paragraph, wherein the second preset number is determined based on the target retention ratio of the current paragraph.
- 6. The compression method of claim 3, wherein clipping the current paragraph in the token-level clipping mode to obtain the compressed paragraph of the current paragraph comprises: acquiring the self-information of each token in the current paragraph; and, from the tokens in the current paragraph whose self-information is smaller than a clipping threshold, directly deleting a third preset number of tokens to obtain the compressed paragraph of the current paragraph, wherein the third preset number is determined based on the target retention ratio of the current paragraph.
- 7. The compression method of claim 2, wherein dividing the first prompt into a plurality of paragraphs comprises: dividing the first prompt at paragraph boundaries or sentence boundaries to obtain a plurality of text blocks; acquiring a semantic vector of each of the plurality of text blocks; adding the first text block of the plurality of text blocks to the current initial paragraph; performing the following predetermined processing on the text blocks after the first text block, in their order in the system prompt: obtaining a semantic similarity based on the semantic vector of the current initial paragraph and the semantic vector of the current text block, wherein the semantic vector of the current initial paragraph is the average of the semantic vectors of all text blocks in the current initial paragraph; in response to the semantic similarity being greater than or equal to a first similarity threshold and the total length of the current initial paragraph and the current text block being less than or equal to a first length threshold, adding the current text block to the current initial paragraph, taking the next text block as the new current text block, and returning to the step of obtaining the semantic similarity; in response to the semantic similarity being less than the first similarity threshold, or the total length of the current initial paragraph and the current text block being greater than the first length threshold, adding the current text block to a next initial paragraph, taking the next initial paragraph as the current initial paragraph, and performing the predetermined processing on the text blocks after the current text block in their order in the system prompt; and deriving the plurality of paragraphs based on the plurality of initial paragraphs.
- 8. The compression method of claim 7, wherein deriving the plurality of paragraphs based on the plurality of initial paragraphs comprises, for each initial paragraph that contains only a single text block: obtaining the semantic similarity between the current initial paragraph and each of the preceding and following initial paragraphs; and adding the current initial paragraph to either of the preceding and following initial paragraphs in response to the semantic similarity corresponding to that initial paragraph being greater than a second similarity threshold and the total length of that initial paragraph and the current initial paragraph being less than or equal to a second length threshold.
- 9. The compression method of claim 7, wherein acquiring the semantic vector of each of the plurality of text blocks comprises, for each of the plurality of text blocks: in response to the token count of the current text block being smaller than a token-count threshold, inputting the current text block into a semantic coding model to obtain the semantic vector of the current text block; and in response to the token count of the current text block being greater than or equal to the token-count threshold, segmenting the current text block using the token-count threshold as a window to obtain a plurality of sub-text blocks, wherein adjacent sub-text blocks overlap by a preset number of tokens, inputting each sub-text block into the semantic coding model to obtain a semantic vector of each sub-text block, and obtaining the semantic vector of the current text block based on the semantic vectors of the sub-text blocks.
- 10. The compression method of claim 2, further comprising, after obtaining the compressed prompt of the system prompt: performing a grammar check and an integrity check on the compressed prompt; in response to the check result meeting a first error level, taking the compressed prompt as the final prompt, wherein check results are divided in advance into three error levels according to error severity, the first error level representing slight errors, the second error level representing moderate errors, and the third error level representing severe errors; in response to the check result meeting the second error level, increasing the target token count and compressing the system prompt based on the increased target token count to obtain the final prompt; and in response to the check result meeting the third error level, taking the system prompt as the final prompt.
- 11. The compression method of claim 10, further comprising, prior to increasing the target token count: collecting the compression rate and the accuracy of each of multiple compression processes; dividing the multiple compression processes into a plurality of subsets according to scene; for each of the plurality of subsets, obtaining the optimal compression process of the current subset, the optimal compression process being the one whose compression rate and accuracy are largest; fitting the compression rates and accuracies of the optimal compression processes of the plurality of subsets to obtain fitting information, the fitting information comprising the relation between compression rate and accuracy; and obtaining a maximum compression rate based on a preset minimum accuracy and the fitting information; wherein increasing the target token count comprises increasing the target token count within a preset interval between a minimum compression rate and the maximum compression rate.
- 12. A compression apparatus for a system prompt of a large language model, comprising: an acquisition unit configured to acquire at least one structured content block in a system prompt of a large language model and a placeholder corresponding to each of the at least one structured content block; a replacing unit configured to replace the at least one structured content block in the system prompt with the corresponding placeholders to obtain a first prompt; a construction unit configured to construct a mapping table between the at least one structured content block and the corresponding placeholders; a compression unit configured to compress the first prompt to obtain a second prompt; and a recovery unit configured to replace the placeholders in the second prompt with the corresponding structured content blocks based on the mapping table to obtain a compressed prompt of the system prompt.
- 13. An electronic device, comprising: at least one processor; and at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the method for compressing a system prompt of any one of claims 1 to 11.
- 14. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the method for compressing a system prompt of any one of claims 1 to 11.
- 15. A computer program product comprising computer instructions which, when executed by a processor, implement the method for compressing a system prompt of any one of claims 1 to 11.
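By way of illustration only, the following is a minimal sketch of the self-information scoring and token-level clipping of claims 4 to 6, assuming a causal language model scored with the Hugging Face transformers library; the "gpt2" model choice and the retention logic are assumptions. Sentence-level (claim 4) and phrase-level (claim 5) scores are the sums of the token scores within each sentence or phrase.

```python
# Sketch of self-information scoring (claims 4-6). Model choice ("gpt2")
# and the cutoff logic are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def token_self_information(text: str):
    """Self-information of each token: -log p(token | preceding tokens)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)
    info = -logp.gather(1, ids[0, 1:, None]).squeeze(1)
    tokens = tokenizer.convert_ids_to_tokens(ids[0, 1:])
    return list(zip(tokens, info.tolist()))

def token_level_clip(scored, keep_ratio: float):
    """Claim 6: delete the lowest self-information tokens, keeping order."""
    n_keep = max(1, int(len(scored) * keep_ratio))
    cutoff = sorted(s for _, s in scored)[len(scored) - n_keep]
    return [tok for tok, s in scored if s >= cutoff]
```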
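A minimal sketch of the similarity-and-length grouping of claim 7 follows; `embed` is an assumed sentence encoder, and the similarity and length thresholds are illustrative. Claim 8 then merges any single-block initial paragraph into a neighbour under analogous similarity and length tests.

```python
# Sketch of the block grouping of claim 7. `embed` and the thresholds
# are assumptions.
import numpy as np
from typing import Callable, List

def group_blocks(blocks: List[str],
                 embed: Callable[[str], np.ndarray],
                 sim_threshold: float = 0.75,
                 max_len: int = 512) -> List[List[str]]:
    paragraphs: List[List[str]] = [[blocks[0]]]
    vectors = [embed(blocks[0])]   # vectors of the current initial paragraph
    for block in blocks[1:]:
        v = embed(block)
        centroid = np.mean(vectors, axis=0)   # average semantic vector
        sim = float(v @ centroid /
                    (np.linalg.norm(v) * np.linalg.norm(centroid)))
        cur_len = sum(len(b) for b in paragraphs[-1])
        if sim >= sim_threshold and cur_len + len(block) <= max_len:
            paragraphs[-1].append(block)      # extend the current paragraph
            vectors.append(v)
        else:
            paragraphs.append([block])        # open the next initial paragraph
            vectors = [v]
    return paragraphs
```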
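A minimal sketch of the windowed encoding of claim 9, where long text blocks are split into overlapping windows whose vectors are mean-pooled; `embed_tokens` is an assumed encoder interface, and the window size (the token-count threshold) and overlap are illustrative.

```python
# Sketch of the sliding-window encoding of claim 9. `embed_tokens`,
# window size, and overlap are assumptions.
import numpy as np
from typing import Callable, List

def block_vector(tokens: List[str],
                 embed_tokens: Callable[[List[str]], np.ndarray],
                 window: int = 256,
                 overlap: int = 32) -> np.ndarray:
    if len(tokens) < window:        # short block: encode directly
        return embed_tokens(tokens)
    step = window - overlap         # adjacent windows share `overlap` tokens
    vecs = [embed_tokens(tokens[i:i + window])
            for i in range(0, len(tokens) - overlap, step)]
    return np.mean(vecs, axis=0)    # pool the window vectors
```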
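A minimal sketch of the three-level post-check of claim 10; the grading function and the 25% budget increase are assumptions, since the claim only fixes the three outcomes (keep the compressed prompt, retry with a larger budget, fall back to the original).

```python
# Sketch of the post-check fallback of claim 10. `grade`, `recompress`,
# and the +25% budget step are assumptions.
from enum import Enum
from typing import Callable

class ErrorLevel(Enum):
    SLIGHT = 1     # keep the compressed prompt
    MODERATE = 2   # recompress with a larger token budget
    SEVERE = 3     # fall back to the original system prompt

def finalize(system_prompt: str, compressed: str, target_tokens: int,
             grade: Callable[[str], ErrorLevel],
             recompress: Callable[[str, int], str]) -> str:
    level = grade(compressed)      # grammar check + integrity check
    if level is ErrorLevel.SLIGHT:
        return compressed
    if level is ErrorLevel.MODERATE:
        return recompress(system_prompt, int(target_tokens * 1.25))
    return system_prompt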
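A minimal sketch of the fitting step of claim 11, assuming a linear relation between compression rate and accuracy; the disclosure does not fix the fitting model.

```python
# Sketch of the compression-rate/accuracy fitting of claim 11; the linear
# model is an assumption.
import numpy as np

def max_compression_rate(rates, accuracies, min_accuracy: float) -> float:
    """Fit accuracy over the per-scene optimal runs, then solve the fit at
    the preset minimum accuracy to bound the compression rate."""
    slope, intercept = np.polyfit(rates, accuracies, deg=1)
    return (min_accuracy - intercept) / slope
```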
Description
Compression method and apparatus for system prompts of a large language model, electronic device, storage medium, and program product

Technical Field

The disclosure relates to the field of artificial intelligence, and in particular to a compression method and apparatus for system prompts of a large language model, an electronic device, a storage medium, and a program product.

Background

The length of the system prompt (System Prompt) of a large language model directly affects the inference speed of the model. For example, when the system prompt grows from 1000 tokens to 3000 tokens, the time to first token (Time To First Token, TTFT) increases by 40%-60%, and the total response time (Time To Last Token, TTLT) increases by 30%-50%. In real-time interactive scenarios (e.g., customer-service conversations, content review), delays of more than 2 seconds significantly degrade the user experience. Lengthy system prompts are therefore compressed, for example with LLMLingua or Selective Context. These methods compress the system prompt as a whole, which easily deletes key field definitions in structured content: brackets, quotation marks, colons and the like are treated as highly predictable tokens and deleted first, YAML indentation levels are disturbed, Markdown code-block markers disappear, nesting levels are broken, and so on. The compressed system prompts produced by such related-art methods speed up inference, but because key fields are missing, the inference accuracy of the large language model is low and the whole pipeline of the large language model may even fail.

Disclosure of Invention

The disclosure provides a compression method and apparatus for system prompts of a large language model, an electronic device, a storage medium, and a program product, to at least solve the problem that system prompts compressed by the related art cannot guarantee the accuracy of the large language model. According to a first aspect of the embodiments of the disclosure, a compression method for a system prompt of a large language model is provided, comprising: obtaining at least one structured content block in a system prompt of a large language model and a placeholder for each of the at least one structured content block; constructing a mapping table between the at least one structured content block and the corresponding placeholders; replacing the at least one structured content block in the system prompt with the corresponding placeholders to obtain a first prompt; compressing the first prompt to obtain a second prompt; and replacing the placeholders in the second prompt with the corresponding structured content blocks based on the mapping table, to obtain a compressed prompt of the system prompt.
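As an illustrative, non-normative sketch of this first aspect, the protect-compress-restore flow could look as follows; the block-extraction patterns, the placeholder format "<BLK_i>", and the `compress` callback are assumptions, with `compress` standing in for any of the compression schemes described below.

```python
# Sketch of the protect / compress / restore pipeline. Patterns and
# placeholder format are illustrative assumptions.
import re
from typing import Callable, Dict, Tuple

# Hypothetical patterns for structured content blocks: fenced code blocks
# and flat JSON-like objects.
STRUCTURED_PATTERNS = [
    re.compile(r"`{3}.*?`{3}", re.DOTALL),
    re.compile(r"\{[^{}]*\}", re.DOTALL),
]

def protect(system_prompt: str) -> Tuple[str, Dict[str, str]]:
    """Replace each structured content block with a unique placeholder and
    record the block in a mapping table so it can be restored verbatim."""
    mapping: Dict[str, str] = {}
    text, idx = system_prompt, 0

    def _sub(match: re.Match) -> str:
        nonlocal idx
        placeholder = f"<BLK_{idx}>"
        mapping[placeholder] = match.group(0)
        idx += 1
        return placeholder

    for pattern in STRUCTURED_PATTERNS:
        text = pattern.sub(_sub, text)
    return text, mapping  # text is the "first prompt"

def restore(second_prompt: str, mapping: Dict[str, str]) -> str:
    """Swap every placeholder back for its original structured block."""
    for placeholder, block in mapping.items():
        second_prompt = second_prompt.replace(placeholder, block)
    return second_prompt

def compress_system_prompt(system_prompt: str,
                           compress: Callable[[str], str]) -> str:
    first_prompt, mapping = protect(system_prompt)
    second_prompt = compress(first_prompt)   # any text compressor
    return restore(second_prompt, mapping)   # compressed system prompt
```

Because the placeholders carry almost no information of their own, the compressor never sees the brackets, quotation marks, or indentation of the structured content, so none of it can be deleted or reordered.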
In some embodiments, compressing the first prompt to obtain the second prompt comprises: dividing the first prompt into a plurality of paragraphs; allocating a target token count to each paragraph by compressing the plurality of paragraphs uniformly, to obtain a reference retention ratio of each paragraph, wherein the target token count is the target token count of the compressed prompt of the system prompt; obtaining an offset retention ratio of each paragraph based on the perplexity of each of the plurality of paragraphs and a preset offset intensity coefficient; obtaining a target retention ratio of each paragraph based on the reference retention ratio and the offset retention ratio of that paragraph; and compressing each paragraph according to its target retention ratio to obtain the second prompt.

In some embodiments, compressing each paragraph according to its target retention ratio comprises: in response to the target retention ratio of the current paragraph being smaller than a first compression threshold, clipping the current paragraph in a sentence-level clipping mode, which clips the paragraph in units of sentences, to obtain the compressed paragraph of the current paragraph; in response to the target retention ratio of the current paragraph being greater than or equal to the first compression threshold and less than or equal to a second compression threshold, clipping the current paragraph in a phrase-level clipping mode, which clips the paragraph in units of phrases, to obtain the compressed paragraph of the current paragraph; and in response to the target retention ratio of the current paragraph being greater than the second compression threshold, clipping the current paragraph in a token-level clipping mode, which clips the paragraph in units of tokens, to obtain the compressed paragraph of the current paragraph.
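A minimal sketch of this allocation and of the granularity dispatch follows, assuming the offset scales with each paragraph's perplexity relative to the mean; the exact offset formula and the two compression thresholds (0.3 and 0.7 here) are not fixed by the disclosure.

```python
# Sketch of the per-paragraph budget allocation and granularity dispatch.
# The offset formula and the thresholds are assumptions.
from typing import List

def target_ratios(paragraph_lengths: List[int],
                  perplexities: List[float],
                  target_tokens: int,
                  offset_strength: float = 0.2) -> List[float]:
    """Reference ratio from uniform compression, shifted per paragraph so
    that high-perplexity (hard to predict) paragraphs keep more tokens."""
    base = target_tokens / sum(paragraph_lengths)  # reference retention ratio
    mean_ppl = sum(perplexities) / len(perplexities)
    return [min(1.0, max(0.0, base + offset_strength * (p / mean_ppl - 1.0)))
            for p in perplexities]

def clipping_mode(ratio: float, low: float = 0.3, high: float = 0.7) -> str:
    """Pick the clipping granularity from the target retention ratio."""
    if ratio < low:
        return "sentence"   # heavy compression: drop whole sentences
    if ratio <= high:
        return "phrase"     # moderate: drop non-core phrases
    return "token"          # light: drop individual tokens
```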