CN-121981275-A - Large language model inference method, device and storage medium
Abstract
The application provides a large language model inference method, device, and storage medium. The inference architecture comprises a plurality of functionally decoupled, independently scalable microservices that communicate and cooperate through a predefined interface protocol; each service can be scaled horizontally on its own according to actual load and resource demand, improving the scalability and reliability of the system. At inference time, an orchestration and scheduling service receives and parses the inference request, decomposes it into a series of standardized inference tasks, and dynamically allocates appropriate computing resources to each task, realizing global resource scheduling and task orchestration. The microservices are then scheduled to perform inference cooperatively according to the inference task execution graph. Because the microservices are loosely coupled, different instances can execute inference tasks in parallel, significantly shortening request response time and improving the throughput and resource utilization of the whole system under high-concurrency scenarios.
Inventors
- YANG JIANMING
- HUANG ZHENHUA
- YANG YAO
- LIN WEI
- ZHOU YING
- GAO FENG
Assignees
- 之江实验室 (Zhejiang Lab)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-03-27
Claims (11)
- 1. A large language model inference method, applied to a large language model inference architecture, the architecture comprising a plurality of microservices that communicate and cooperate with one another through a predefined interface protocol, the method comprising: receiving and parsing an inference request through an orchestration and scheduling service to generate a corresponding inference task execution graph; scheduling, according to the inference task execution graph, at least two functionally decoupled and independently scalable microservices to perform inference computation cooperatively, wherein the plurality of microservices comprise an inference computation service responsible for executing forward computation of a neural network and a key-value (KV) cache service responsible for centrally managing inference intermediate states; and orchestrating the execution results of the microservices through the orchestration and scheduling service to generate a response to the inference request.
- 2. The large language model inference method of claim 1, wherein the inference computation service is decoupled into a prefill service and a decoding service, and wherein scheduling the at least two functionally decoupled and independently scalable microservices to perform inference computation according to the inference task execution graph comprises: invoking the prefill service to perform a single forward computation on the input text of the inference request, generating an initial output token and a corresponding initial KV cache, and writing the initial KV cache into the KV cache service; scheduling the decoding service to execute multiple decoding iterations, wherein each iteration comprises acquiring the currently required KV cache from the KV cache service, executing a model forward computation to generate an output token, and writing the newly generated KV cache back into the KV cache service; and generating an output token sequence through the multiple iterations.
- 3. The large language model inference method of claim 2, wherein the large language model is a mixture-of-experts (MoE) model and the plurality of microservices further include a routing service, and wherein scheduling the decoding service to execute multiple decoding iterations comprises: invoking the routing service to select at least one target expert of the MoE model based on the input data of the current iteration, the input data of the current iteration being determined from the output token generated in the previous iteration; and scheduling at least one expert inference service corresponding to the at least one target expert to execute the model forward computation on the input data together with the KV cache data required by the current iteration, generating the output token of the current iteration, wherein each expert inference service is an independently scalable microservice.
- 4. The large language model inference method of claim 3, wherein invoking the routing service to select at least one target expert of the MoE model based on the input data of the current iteration comprises: acquiring the original expert-selection probabilities of the MoE model for the input data of the current iteration; and modulating the original gating probabilities of the MoE model according to the real-time load state of each expert category so as to select the at least one target expert.
- 5. The large language model inference method of claim 4, wherein modulating the original gating probabilities of the MoE model according to the real-time load state of each expert category to select the at least one target expert comprises: acquiring the real-time load state of each expert category, and computing a dynamic capacity factor for each expert category based on that load state; computing a joint score for each expert category from its original expert-selection probability and its dynamic capacity factor; and selecting, in descending order of joint score, a preset number of expert categories as the at least one target expert (a sketch of this load-aware selection appears after the claims).
- 6. The large language model inference method of claim 1, wherein KV cache management in the KV cache service comprises: organizing the KV cache data generated during inference into a plurality of data blocks for storage and management; monitoring and computing the access hotness of each data block; and migrating the data blocks between at least two storage tiers with different access speeds according to the access hotness (a sketch of this tiered migration appears after the claims).
- 7. The large language model inference method of claim 6, wherein migrating the data blocks between at least two storage tiers with different access speeds according to the access hotness comprises: if the access hotness of a current data block exceeds a set hotness threshold, determining the post-migration storage tier of the current data block according to its access hotness, wherein at least two different hotness values map to different storage tiers, and the lower of any two different hotness values maps to a post-migration tier with the lower access speed.
- 8. The large language model inference method of claim 6, wherein migrating the data blocks between at least two storage tiers with different access speeds according to the access hotness comprises: when the access hotness of a current data block increases, migrating the current data block from its current storage tier to a storage tier with a faster access speed; or, when the KV cache service receives an access request for a current data block, judging whether the current data block is stored in at least one preset storage tier with the highest access speed, and if not, migrating the data block to a storage tier with an access speed higher than that of its current storage tier.
- 9. The large language model inference method of any one of claims 6 to 8, wherein migration operations of the data blocks between different storage tiers are performed asynchronously by the KV cache service.
- 10. An electronic device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, implements the large language model inference method of any one of claims 1 to 9.
- 11. A computer-readable storage medium having stored thereon a computer program which, when executed, implements the large language model inference method of any one of claims 1 to 9.
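Claims 4 and 5 describe load-aware expert selection: the gate's original selection probabilities are modulated by a per-expert-category dynamic capacity factor computed from real-time load, and a preset number of experts with the highest joint scores are chosen. The patent gives no formulas, so the sketch below assumes the capacity factor decays linearly with load utilization and that the joint score is the product of probability and factor; all names (`select_experts`, `gate_probs`, `expert_loads`) are hypothetical.

```python
import numpy as np

def select_experts(gate_probs: np.ndarray,
                   expert_loads: np.ndarray,
                   capacity: np.ndarray,
                   k: int = 2) -> np.ndarray:
    """Pick k target experts by modulating the gate's original
    selection probabilities with a load-dependent capacity factor.

    gate_probs   -- original gating probabilities, shape (num_experts,)
    expert_loads -- current in-flight tokens per expert (real-time load)
    capacity     -- nominal token capacity per expert
    The linear decay of the dynamic capacity factor is an assumption;
    the patent does not specify how the factor is computed.
    """
    utilization = np.clip(expert_loads / capacity, 0.0, 1.0)
    dynamic_capacity_factor = 1.0 - utilization      # 1 when idle, 0 when saturated
    joint_score = gate_probs * dynamic_capacity_factor
    # Highest joint scores first; take the preset number of experts.
    return np.argsort(joint_score)[::-1][:k]

# Example: expert 1 has the highest gate probability but is nearly
# saturated, so the joint score steers traffic to experts 0 and 3.
probs = np.array([0.30, 0.40, 0.10, 0.20])
loads = np.array([10, 95, 50, 20])
cap   = np.array([100, 100, 100, 100])
print(select_experts(probs, loads, cap, k=2))   # -> [0 3]
```

The design intent is that a heavily loaded expert loses traffic even when the gate prefers it, which spreads decode work across expert inference service instances.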
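Claims 6 to 9 describe the KV cache service organizing cache data into blocks, tracking each block's access hotness, and migrating blocks asynchronously between storage tiers of different access speed. The sketch below assumes a three-tier hierarchy (GPU memory, host memory, SSD), an exponentially decayed access counter as the hotness metric, and fixed promotion/demotion thresholds; none of these specifics are stated in the patent, and all names are hypothetical.

```python
import threading, queue, time
from collections import defaultdict

TIERS = ["hbm", "dram", "ssd"]          # fastest to slowest (assumed hierarchy)

class TieredKVCache:
    """Toy KV-cache block manager: tracks access hotness per block and
    migrates blocks between tiers on a background (asynchronous) thread,
    as in claims 6-9. Hotness decay and thresholds are assumptions."""

    def __init__(self, hot_threshold=5.0, cold_threshold=1.0):
        self.tier_of = {}                       # block_id -> tier name
        self.hotness = defaultdict(float)       # block_id -> decayed access count
        self.hot_threshold = hot_threshold
        self.cold_threshold = cold_threshold
        self._jobs = queue.Queue()
        threading.Thread(target=self._migrator, daemon=True).start()

    def put(self, block_id, tier="hbm"):
        self.tier_of[block_id] = tier

    def access(self, block_id):
        # Exponential moving average of access frequency (assumed metric).
        self.hotness[block_id] = 0.9 * self.hotness[block_id] + 1.0
        tier = self.tier_of[block_id]
        if tier != "hbm" and self.hotness[block_id] > self.hot_threshold:
            # Hotness rose: promote one level toward the fastest tier (claim 8).
            self._jobs.put((block_id, TIERS[TIERS.index(tier) - 1]))
        return tier

    def demote_cold(self):
        # Periodic sweep: cold blocks sink toward slower tiers (claim 7).
        for block_id, heat in list(self.hotness.items()):
            tier = self.tier_of[block_id]
            if heat < self.cold_threshold and tier != "ssd":
                self._jobs.put((block_id, TIERS[TIERS.index(tier) + 1]))

    def _migrator(self):
        # Migrations run off the request path (claim 9: asynchronous).
        while True:
            block_id, target = self._jobs.get()
            time.sleep(0.01)                    # stands in for the actual data copy
            self.tier_of[block_id] = target

cache = TieredKVCache()
cache.put("blk-0")                  # new block starts in the fastest tier
for _ in range(8):
    cache.access("blk-0")           # repeated access keeps it hot
cache.demote_cold()                 # cold blocks would sink one tier
```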
Description
Large language model inference method, device and storage medium

Technical Field

The present application relates to the field of artificial intelligence, and in particular to a large language model inference method, device, and storage medium.

Background

With the wide application of large language models (LLMs) in scenarios such as code generation, intelligent question answering, and content production, model parameter scales keep growing, and some models have reached the scale of billions of parameters or more. To improve computational efficiency and modeling capability, the mixture-of-experts (MoE) structure has been widely adopted in mainstream large models: a gating mechanism activates sparse computation on demand, obtaining a larger parameter scale at a lower computational cost. However, the MoE structure also introduces new inference complexity and system-scheduling challenges. For example, existing large-model inference systems still mainly adopt a monolithic architecture; as model scale and the volume of concurrent requests grow, a monolithic inference service struggles to meet the real-time inference requirements of very large models (such as MoE models) under high concurrency, exposing the insufficient scalability of the inference architecture.

Disclosure of Invention

To overcome the problems in the related art, the present specification provides a large language model inference method, device, and storage medium. In a first aspect, a large language model inference method is provided, applied to a large language model inference architecture comprising a plurality of microservices that communicate and cooperate through a predefined interface protocol. The method comprises: receiving and parsing an inference request through an orchestration and scheduling service to generate a corresponding inference task execution graph; scheduling, according to the inference task execution graph, at least two functionally decoupled and independently scalable microservices to perform inference computation cooperatively, wherein the plurality of microservices comprise an inference computation service responsible for executing forward computation of a neural network and a KV cache service responsible for centrally managing inference intermediate states; and orchestrating the execution results of the microservices through the orchestration and scheduling service to generate a response to the inference request.
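To make the first aspect concrete, the following is a minimal sketch of how an orchestration and scheduling service might parse a request into a task execution graph and dispatch the tasks to the microservices. The patent does not publish code or an API; `Task`, `build_execution_graph`, `run`, and `dispatch` are hypothetical names, and the linear prefill-then-decode chain is an assumed simple case of the graph.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    service: str                 # target microservice, e.g. "prefill", "decode"
    deps: list = field(default_factory=list)

def build_execution_graph(request: dict) -> list[Task]:
    """Parse an inference request into a standardized task graph.
    This sketch ignores the request contents and always emits the same
    prefill -> decode -> assemble chain; the patent allows arbitrary graphs."""
    prefill = Task("prefill", service="prefill")
    decode = Task("decode", service="decode", deps=[prefill])
    respond = Task("assemble_response", service="orchestrator", deps=[decode])
    return [prefill, decode, respond]

def run(graph: list[Task], dispatch) -> dict:
    """Execute tasks once their dependencies finish; `dispatch` stands in
    for the predefined inter-service interface protocol."""
    results = {}
    for task in graph:                       # graph is topologically ordered here
        inputs = [results[d.name] for d in task.deps]
        results[task.name] = dispatch(task.service, task.name, inputs)
    return results
```

Because each task names only a target service and its dependencies, any service in the graph can be scaled out independently without changing the orchestration logic.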
According to the large language model inference method provided by this application, the inference computation service is decoupled into a prefill service and a decoding service, and scheduling the at least two functionally decoupled and independently scalable microservices to perform inference computation cooperatively according to the inference task execution graph comprises: invoking the prefill service to perform a single forward computation on the input text of the inference request, generating an initial output token and a corresponding initial KV cache, and writing the initial KV cache into the KV cache service; scheduling the decoding service to execute multiple decoding iterations, wherein each iteration comprises acquiring the currently required KV cache from the KV cache service, executing a model forward computation to generate an output token, and writing the newly generated KV cache back into the KV cache service; and generating an output token sequence through the multiple iterations.

According to the large language model inference method provided by this application, the large language model is a mixture-of-experts model, the plurality of microservices further comprise a routing service, and scheduling the decoding service to execute multiple decoding iterations comprises: invoking the routing service to select at least one target expert of the MoE model based on the input data of the current iteration, the input data of the current iteration being determined from the output token generated in the previous iteration; and scheduling at least one expert inference service corresponding to the at least one target expert to execute the model forward computation on the input data together with the KV cache data required by the current iteration, generating the output token of the current iteration, wherein each expert inference service is an independently scalable microservice. A sketch of this prefill/decode split with a central KV cache appears below.

According to the large language model reasoning method provi
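The prefill/decode decoupling restated above can be sketched as two functions that share state only through the central KV cache service, with the routing-service call slotted into each decode iteration. This is a sketch under assumed interfaces: `kv_store`, `prefill_model`, `decode_step`, and `route` are placeholders, not the patent's actual components.

```python
def prefill(prompt_ids, kv_store, session, prefill_model):
    """One forward pass over the full prompt: emit the first token and
    write the initial KV cache to the central cache service (claim 2)."""
    first_token, kv_blocks = prefill_model(prompt_ids)
    kv_store.write(session, kv_blocks)
    return first_token

def decode(first_token, kv_store, session, decode_step, route, max_new=64, eos=0):
    """Iterative decoding: each step reads the needed KV cache, asks the
    routing service for target experts (claim 3), runs the forward pass,
    and writes the newly produced KV entries back."""
    tokens = [first_token]
    while len(tokens) < max_new and tokens[-1] != eos:
        kv = kv_store.read(session)
        experts = route(tokens[-1])                  # routing-service call
        next_token, new_kv = decode_step(tokens[-1], kv, experts)
        kv_store.append(session, new_kv)
        tokens.append(next_token)
    return tokens
```

Keeping the KV cache in a dedicated service is what lets prefill and decode instances scale independently: a decode replica only needs the session's cache blocks, not the prefill worker that produced them.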