
CN-122019710-A - Method and device for generating text

CN122019710A

Abstract

Embodiments of this specification provide a method and device for generating text. The method comprises: in a chain-of-thought (CoT) generation stage, routing a current user request to a first quantized model instance to generate the thinking text of that stage; and in an answer generation stage, routing the current user request and the thinking text to a first non-quantized model instance to generate the answer text of that stage, wherein the parameter precision of the first quantized model instance is lower than that of the first non-quantized model instance. This effectively balances model performance, resource efficiency, and response speed.

Inventors

  • WANG XIMING
  • ZHU JIANGCAI
  • SHAO KAILAI
  • CHEN CHAO
  • HU HAIXIANG

Assignees

  • 蚂蚁胜信(上海)信息技术有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-01-15

Claims (12)

  1. A method of generating text, comprising: in a chain-of-thought (CoT) generation stage, routing a current user request to a first quantized model instance to generate the thinking text of that stage; and in an answer generation stage, routing the current user request and the thinking text to a first non-quantized model instance to generate the answer text of that stage, wherein the parameter precision of the first quantized model instance is lower than that of the first non-quantized model instance.
  2. The method of claim 1, wherein the first quantized model instance is one of a first number of deployed quantized model instances, and the first non-quantized model instance is one of a second number of deployed non-quantized model instances.
  3. The method of claim 2, wherein the first number is determined based on an expected length of the thinking text and the second number is determined based on an expected length of the answer text.
  4. The method of claim 1, wherein the current user request includes a user input sentence and a preset prompt sentence, the prompt sentence including knowledge text from a target domain to which the user input sentence belongs.
  5. The method of claim 1, further comprising: in the CoT generation stage, routing the current user request to the first non-quantized model instance for first processing, wherein the first processing comprises computing, based on an attention mechanism, a key matrix and a value matrix corresponding to a first token sequence in the current user request, and caching the key matrix and the value matrix.
  6. The method of claim 5, wherein the generation of the thinking text by the first quantized model instance is executed in parallel with the first processing by the first non-quantized model instance.
  7. The method of claim 5, wherein routing the current user request to the first non-quantized model instance for first processing comprises: inputting the current user request into the first non-quantized model instance, which is configured to generate only one output token.
  8. The method of claim 5, wherein generating the answer text of that stage comprises: computing, based on an attention mechanism, a key matrix and a value matrix corresponding to a second token sequence in the thinking text; computing, based on an attention mechanism, using the cached key matrix and value matrix corresponding to the first token sequence together with the key matrix and value matrix corresponding to the second token sequence, to obtain relevance scores for the input tokens in the current user request and the thinking text; and obtaining each output token based on the relevance scores among the input tokens to form the answer text.
  9. The method of claim 2, wherein a ratio between the first number and the second number is a preset value.
  10. An apparatus for generating text, comprising: a first generation unit configured to, in the chain-of-thought (CoT) generation stage, route a current user request to a first quantized model instance to generate the thinking text of that stage; and a second generation unit configured to, in the answer generation stage, route the current user request and the thinking text generated by the first generation unit to a first non-quantized model instance to generate the answer text of that stage, wherein the parameter precision of the first quantized model instance is lower than that of the first non-quantized model instance.
  11. A computer-readable storage medium having stored thereon a computer program which, when executed on a computer, causes the computer to perform the method of any of claims 1-9.
  12. A computing device comprising a memory having executable code stored therein and a processor which, when executing the executable code, implements the method of any of claims 1-9.
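The two-stage routing described in claims 1-3 can be sketched as follows. This is a minimal, hypothetical illustration, not the patented implementation: `ModelInstance`, the two instance pools, and the length-based routing rule are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class ModelInstance:
    """Hypothetical handle to a deployed model instance."""
    name: str
    precision_bits: int  # e.g. 4 for a quantized instance, 16 for full precision

    def generate(self, prompt: str) -> str:
        # Stand-in for real LLM inference.
        return f"[{self.name}/{self.precision_bits}-bit] output for: {prompt}"


def generate_text(user_request: str,
                  quantized_pool: list,
                  non_quantized_pool: list) -> str:
    # CoT stage: route the request to one of the quantized instances
    # (a toy deterministic routing rule; real routers would load-balance).
    cot_instance = quantized_pool[len(user_request) % len(quantized_pool)]
    thinking_text = cot_instance.generate(user_request)
    # Answer stage: route the request plus the thinking text to a
    # higher-precision, non-quantized instance.
    ans_instance = non_quantized_pool[len(user_request) % len(non_quantized_pool)]
    return ans_instance.generate(user_request + "\n" + thinking_text)
```

Per claims 3 and 9, the relative sizes of the two pools would be tuned to the expected lengths of the thinking and answer texts, or fixed at a preset ratio.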

Description

Method and device for generating text

Technical Field

One or more embodiments of the present specification relate to the field of computers, and more particularly to a method and apparatus for generating text.

Background

With the rapid development of artificial intelligence technology, generative artificial intelligence, represented by large language models (LLMs), exhibits strong capability in the field of natural language processing. Large language models can understand complex instructions, generate high-quality text, conduct multi-round conversations, and play key roles in many application scenarios such as question answering, content creation, and code generation. However, the powerful capabilities of large language models are accompanied by significant computational resource consumption and inference latency challenges. In the prior art, the inference process by which a large model generates text typically involves a large number of floating-point operations and memory accesses. Especially when dealing with complex tasks using chain-of-thought (CoT) prompting techniques, the model must generate intermediate reasoning steps, which significantly increases the computational load and the generation length, resulting in longer response times and higher resource overhead. Conventional solutions tend to deploy high-precision models to guarantee model performance, and thus output quality, but this conflicts with the demands of resource efficiency and response speed. Accordingly, there is a need for an improved solution that balances model performance, resource efficiency, and response speed.

Disclosure of Invention

One or more embodiments of the present specification describe a method and apparatus for generating text that can effectively balance model performance, resource efficiency, and response speed.
In a first aspect, a method for generating text is provided, including: in a chain-of-thought (CoT) generation stage, routing a current user request to a first quantized model instance to generate the thinking text of that stage; and in an answer generation stage, routing the current user request and the thinking text to a first non-quantized model instance to generate the answer text of that stage, wherein the parameter precision of the first quantized model instance is lower than that of the first non-quantized model instance. In one possible implementation, the first quantized model instance is one of a first number of deployed quantized model instances, and the first non-quantized model instance is one of a second number of deployed non-quantized model instances. Further, the first number is determined according to an expected length of the thinking text, and the second number is determined according to an expected length of the answer text. In one possible implementation, the current user request includes a user input sentence and a preset prompt sentence, where the prompt sentence includes knowledge text from a target domain to which the user input sentence belongs. In one possible implementation, the method further comprises: in the CoT generation stage, routing the current user request to the first non-quantized model instance for first processing, wherein the first processing comprises computing, based on an attention mechanism, a key matrix and a value matrix corresponding to a first token sequence in the current user request, and caching the key matrix and the value matrix. Further, the generation of the thinking text by the first quantized model instance is executed in parallel with the first processing by the first non-quantized model instance.
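The parallel arrangement described above, where the quantized instance generates the CoT text while the non-quantized instance performs the first processing (prefilling and caching K/V for the request), might look like the following toy sketch. The hidden size, weight matrices, and thread-based parallelism are all assumptions for illustration; a real serving system would rely on the inference engine's own scheduler and KV-cache machinery.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

D = 8  # hypothetical hidden size
rng = np.random.default_rng(0)
Wk = rng.standard_normal((D, D))  # toy key projection
Wv = rng.standard_normal((D, D))  # toy value projection


def prefill_kv(request_embeddings: np.ndarray):
    """First processing: compute K/V for the request's token sequence (to be cached)."""
    return request_embeddings @ Wk, request_embeddings @ Wv


def generate_cot(request_text: str) -> str:
    # Stand-in for the quantized model producing the thinking text.
    return f"step-by-step reasoning for: {request_text}"


def cot_stage(request_text: str, request_embeddings: np.ndarray):
    # Run CoT generation and full-precision prefill concurrently,
    # so the K/V cache is already warm when the answer stage starts.
    with ThreadPoolExecutor(max_workers=2) as pool:
        cot_future = pool.submit(generate_cot, request_text)
        kv_future = pool.submit(prefill_kv, request_embeddings)
        thinking_text = cot_future.result()
        k_cache, v_cache = kv_future.result()
    return thinking_text, k_cache, v_cache
```

Limiting the non-quantized instance to a single output token during this stage (as in one implementation above) is what turns its run into a pure prefill: the forward pass populates the cache without producing a full generation.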
Further, routing the current user request to the first non-quantized model instance for first processing comprises: inputting the current user request into the first non-quantized model instance, which is configured to generate only one output token. Further, generating the answer text of that stage includes: computing, based on an attention mechanism, a key matrix and a value matrix corresponding to a second token sequence in the thinking text; computing, based on an attention mechanism, using the cached key matrix and value matrix corresponding to the first token sequence together with the key matrix and value matrix corresponding to the second token sequence, to obtain relevance scores for the input tokens in the current user request and the thinking text; and obtaining each output token based on the relevance scores among the input tokens to form the answer text. Further, the ratio between the first number and the second number is a preset value. In a second aspect, there is provided an apparatus for generating text, comprising: a first generation unit configured to, in the chain-of-thought (CoT) generation stage, route the current user request to a first quantized model instance to generate the thinking text of that stage; an
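The answer-stage attention over the cached request K/V concatenated with the thinking-text K/V can be sketched as ordinary single-head scaled dot-product attention. This is a generic illustration under assumed shapes, not the patented kernel; the point is only that cached and freshly computed keys/values are used together.

```python
import numpy as np


def answer_step(q: np.ndarray,
                k_req: np.ndarray, v_req: np.ndarray,
                k_think: np.ndarray, v_think: np.ndarray) -> np.ndarray:
    """One decoding step: attend over request tokens (cached K/V) plus
    thinking-text tokens (newly computed K/V)."""
    # Reuse the cached request K/V and append the thinking-text K/V.
    K = np.concatenate([k_req, k_think], axis=0)
    V = np.concatenate([v_req, v_think], axis=0)
    # Relevance score of the query against every input token.
    scores = (K @ q) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()  # softmax over all input tokens
    return weights @ V  # context vector used to produce the next output token
```

Because `k_req`/`v_req` were cached during the parallel prefill, the answer stage only has to compute K/V for the thinking-text tokens before decoding begins.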