CN-121998071-A - Large model inference system, and inference request processing method and device
Abstract
The application discloses a large model inference system and a method and device for processing inference requests, relating to the technical field of artificial intelligence. In the large model inference system provided by the application, the large model inference function is realized cooperatively through interaction between a client and an inference node. When a first client executes an inference request, it sends the inference request to the inference node, and the inference node sends the intermediate data of one or more first inference units, among the inference units corresponding to the inference request, to the first client. The first client therefore does not need to consume local computing resources to generate the intermediate data of the first inference units; the intermediate data of inference units is reused during client-side large model inference, which effectively improves large model inference efficiency.
Inventors
- LI BINGSHUAI
- SHAO YUNFENG
- QI MEIYU
- ZHANG XIAOXIAO
Assignees
- HUAWEI TECHNOLOGIES CO., LTD.
Dates
- Publication Date: 2026-05-08
- Application Date: 2024-11-08
Claims (20)
- 1. A large model inference system, comprising a first client and an inference node; the inference node is configured to: receive an inference request sent by the first client; determine a first inference unit among the inference units corresponding to the inference request, wherein the inference units corresponding to the inference request are the basic data units used when performing inference with a large model; and send the intermediate data of the first inference unit to the first client; the first client is configured to execute the inference request based on the intermediate data of the first inference unit and the large model.
- 2. The system of claim 1, wherein the inference node is configured to: search a storage system based on the inference units corresponding to the inference request, wherein the storage system is used to store intermediate data of inference units generated using the large model; if a second inference unit and its intermediate data are retrieved from the storage system, determine the first inference unit based on the second inference unit and the inference units corresponding to the inference request, wherein the second inference unit matches one of the inference units corresponding to the inference request and the first inference unit comprises the second inference unit; and if no inference unit matching the inference units corresponding to the inference request is retrieved from the storage system, take the inference units corresponding to the inference request as the first inference unit.
- 3. The system of claim 2, wherein the inference node is configured to: determine reference information based on the second inference unit and the inference units corresponding to the inference request, wherein the reference information indicates the number of inference units, beyond the second inference unit, for which intermediate data must be generated to execute the inference request; if the reference information satisfies a condition, take the second inference unit as the first inference unit; or, if the reference information does not satisfy the condition, determine a third inference unit based on the inference units corresponding to the inference request and the second inference unit, wherein the first inference unit comprises the second inference unit and the third inference unit, and the third inference unit is an inference unit, among the inference units corresponding to the inference request, other than the second inference unit.
- 4. The system of claim 3, wherein the condition is that the number of inference units indicated by the reference information is less than a threshold.
- 5. The system of any one of claims 1 to 4, wherein the inference units corresponding to the inference request further comprise a fourth inference unit different from the first inference unit, and the first client is further configured to: generate intermediate data of the fourth inference unit using the large model; and execute the inference request based on the intermediate data of the first inference unit and the intermediate data of the fourth inference unit.
- 6. The system of any one of claims 1 to 5, wherein the inference node is a cloud node and the large model inference system further comprises an edge node, the inference node being configured to do either of the following: send the intermediate data of the first inference unit to the first client through the edge node; or control the first client to obtain the intermediate data of the first inference unit from the edge node.
- 7. The system of any one of claims 1 to 5, wherein the inference node is an edge node, the inference node being further configured to: generate the intermediate data of the first inference unit using the large model; or, in the case that the large model inference system further comprises a cloud node, obtain the intermediate data of the first inference unit from the cloud node.
- 8. The system of any one of claims 1 to 5, wherein the inference node is a terminal device, the inference node being further configured to: generate the intermediate data of the first inference unit using the large model; or obtain the intermediate data of the first inference unit through a network node.
- 9. The system of any one of claims 1 to 8, wherein the inference node is further configured to: search a storage system based on a received fifth inference unit, the storage system being used to store intermediate data of inference units generated using the large model; and if the intermediate data of the fifth inference unit is not retrieved from the storage system, send a data transmission instruction to a second client in the large model inference system, wherein the data transmission instruction instructs the second client to send the intermediate data of the fifth inference unit to the inference node, and the second client stores the intermediate data of the fifth inference unit.
- 10. The system of claim 9, wherein, if the inference node receives the fifth inference unit sent by a plurality of third clients in the large model inference system, the inference node is further configured to: determine the second client from the plurality of third clients based on one or more of a network transmission status, a data transmission delay, and a location of each third client.
- 11. The system of claim 9 or 10, wherein the inference node is further configured to: receive the intermediate data of the fifth inference unit sent by the second client; and store the intermediate data of the fifth inference unit in the storage system.
- 12. The system of any one of claims 9 to 11, wherein the inference node is a cloud node and the large model inference system further comprises an edge node; the second client is configured to send the intermediate data of the fifth inference unit to the edge node; and the edge node is configured to merge the received intermediate data of inference units and send the merged intermediate data to the inference node.
- 13. A method of processing an inference request, performed by an inference node in a large model inference system, the large model inference system further comprising a first client, the method comprising: receiving an inference request sent by the first client; determining a first inference unit among the inference units corresponding to the inference request, wherein the inference units corresponding to the inference request are the basic data units used when performing inference with a large model; and sending the intermediate data of the first inference unit to the first client, wherein the first client is configured to execute the inference request based on the intermediate data of the first inference unit and the large model.
- 14. The method of claim 13, wherein the determining a first inference unit among the inference units corresponding to the inference request comprises: searching a storage system based on the inference units corresponding to the inference request, wherein the storage system is used to store intermediate data of inference units generated using the large model; if a second inference unit and its intermediate data are retrieved from the storage system, determining the first inference unit based on the second inference unit and the inference units corresponding to the inference request, wherein the second inference unit matches one of the inference units corresponding to the inference request and the first inference unit comprises the second inference unit; and if no inference unit matching the inference units corresponding to the inference request is retrieved from the storage system, taking the inference units corresponding to the inference request as the first inference unit.
- 15. The method of claim 14, wherein the determining the first inference unit based on the second inference unit and the inference units corresponding to the inference request comprises: determining reference information based on the second inference unit and the inference units corresponding to the inference request, wherein the reference information indicates the number of inference units, beyond the second inference unit, for which intermediate data must be generated to execute the inference request; if the reference information satisfies a condition, taking the second inference unit as the first inference unit; or, if the reference information does not satisfy the condition, determining a third inference unit based on the inference units corresponding to the inference request and the second inference unit, wherein the first inference unit comprises the second inference unit and the third inference unit, and the third inference unit is an inference unit, among the inference units corresponding to the inference request, other than the second inference unit.
- 16. The method of any one of claims 13 to 15, wherein the inference node is a cloud node and the large model inference system further comprises an edge node, and the sending the intermediate data of the first inference unit to the first client comprises either of the following: sending the intermediate data of the first inference unit to the first client through the edge node; or controlling the first client to obtain the intermediate data of the first inference unit from the edge node.
- 17. The method of any one of claims 13 to 15, wherein the inference node is an edge node, the method further comprising: generating the intermediate data of the first inference unit using the large model; or, in the case that the large model inference system further comprises a cloud node, obtaining the intermediate data of the first inference unit from the cloud node.
- 18. The method of any one of claims 13 to 15, wherein the inference node is a terminal device, the method further comprising: generating the intermediate data of the first inference unit using the large model; or obtaining the intermediate data of the first inference unit through a network node.
- 19. The method of any one of claims 13 to 18, further comprising: searching a storage system based on a received fifth inference unit, the storage system being used to store intermediate data of inference units generated using the large model; and if the intermediate data of the fifth inference unit is not retrieved from the storage system, sending a data transmission instruction to a second client in the large model inference system, wherein the data transmission instruction instructs the second client to send the intermediate data of the fifth inference unit to the inference node, and the second client stores the intermediate data of the fifth inference unit.
- 20. The method of claim 19, wherein, if the inference node receives the fifth inference unit sent by a plurality of third clients in the large model inference system, the method further comprises: determining the second client from the plurality of third clients based on one or more of a network transmission status, a data transmission delay, and a location of each third client.
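To make the interaction recited in claims 1 to 5 (and mirrored in method claims 13 to 15) concrete, the following is a minimal Python sketch. Every identifier here (InferenceNode, kv_store, generate_kv, the threshold value, and the representation of inference units and intermediate data as plain strings) is an illustrative assumption, not a term or implementation from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class InferenceNode:
    # Storage system mapping a tuple of inference units to cached
    # intermediate data (the KV cache), as in claim 2.
    kv_store: dict = field(default_factory=dict)
    threshold: int = 4  # condition on the "reference information", claim 4

    def handle_request(self, units: tuple):
        """Return (first_units, intermediate_data) for a client's request."""
        # Search the storage system for the longest cached prefix of the
        # request's units (the "second inference unit(s)", claim 2).
        matched = ()
        for prefix in self.kv_store:
            if units[:len(prefix)] == prefix and len(prefix) > len(matched):
                matched = prefix
        if not matched:
            # No match: the request's own units are the first inference
            # units, and the node generates their intermediate data.
            return units, self.generate_kv(units)
        # "Reference information" (claim 3): how many units beyond the
        # matched ones still need intermediate data to serve the request.
        remaining = len(units) - len(matched)
        if remaining < self.threshold:
            # Condition met: return only the cached data; the client
            # handles the rest as "fourth inference units" (claim 5).
            return matched, list(self.kv_store[matched])
        # Condition not met: also generate data for the remaining
        # ("third") units, so first units = second + third units.
        extra = self.generate_kv(units[len(matched):])
        return units, list(self.kv_store[matched]) + extra

    def generate_kv(self, units):
        # Stand-in for running the large model's prefill over `units`.
        return [f"kv({u})" for u in units]


def client_execute(request_units, node):
    first_units, kv = node.handle_request(tuple(request_units))
    # Units not covered by the returned data are the fourth inference
    # units; the client computes their intermediate data locally.
    fourth = request_units[len(first_units):]
    kv = kv + [f"kv({u})" for u in fourth]
    return f"decoded output over {len(kv)} cached units"  # stand-in decode


node = InferenceNode()
node.kv_store[("A", "B", "C")] = node.generate_kv(("A", "B", "C"))
print(client_execute(["A", "B", "C", "D"], node))  # reuses the cached prefix
```

Under these assumptions, the longest-prefix match plays the role of the storage-system retrieval in claims 2 and 14; a real deployment would presumably key the store on hashes of unit sequences rather than raw tuples.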
Description
Large model inference system, and inference request processing method and device

Technical Field

The application relates to the technical field of artificial intelligence, and in particular to a large model inference system, an inference request processing method, and an inference request processing device.

Background

The demand for performing large model inference at a client is growing. During the execution of large model inference requests, a client with limited storage capacity may discard the KV cache of an inference unit, that is, the key matrix and value matrix generated for that unit, as soon as it has been generated and used, in order to free storage space. As a result, the client must consume a large amount of computing resources to regenerate the KV cache each time it executes an inference request, which seriously affects inference efficiency. How to improve the inference efficiency of client-side large model inference is therefore a problem to be solved urgently.

Disclosure of Invention

The application provides a large model inference system and an inference request processing method and device, which enable the intermediate data of inference units to be reused during client-side large model inference, thereby effectively improving large model inference efficiency. The large model inference system comprises a first client and an inference node. The inference node is configured to receive an inference request sent by the first client, determine a first inference unit among the inference units corresponding to the inference request, where the inference units corresponding to the inference request are the basic data units used when performing inference with a large model, and send the intermediate data of the first inference unit to the first client. The first client is configured to execute the inference request based on the intermediate data of the first inference unit and the large model.

The inference node can be implemented in various ways; for example, it may be a cloud node on a cloud platform, a terminal device, an edge node, or a public network node, and the application is not limited in this respect. The client and the inference node cooperatively realize the large model inference function through interaction: when the first client executes an inference request, it sends the request to the inference node, and the inference node sends the intermediate data of one or more first inference units, among the inference units corresponding to the request, to the first client. The first client therefore does not need to consume local computing resources to generate the intermediate data of those inference units; the intermediate data of inference units is reused during client-side large model inference, which effectively improves large model inference efficiency. An inference unit corresponding to an inference request is a basic data unit used when performing inference with the large model; that is, the large model uses inference units as its basic data processing units.
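As a concrete illustration of the "intermediate data" discussed above, the following sketch shows the per-unit key and value matrices (the KV cache) produced by a single attention projection, and why a cached prefix removes recomputation. The hidden size, weight names, and random embeddings are made-up assumptions, not values from the patent.

```python
import numpy as np

d = 8                              # hidden size (assumed)
rng = np.random.default_rng(0)
W_k = rng.standard_normal((d, d))  # key projection weights (illustrative)
W_v = rng.standard_normal((d, d))  # value projection weights (illustrative)

def kv_for_units(unit_embeddings):
    """Intermediate data for a run of inference units: one key row and one
    value row per unit. Once cached, these rows never need recomputing."""
    return unit_embeddings @ W_k, unit_embeddings @ W_v

prompt = rng.standard_normal((5, d))  # 5 inference units (e.g. tokens)
K, V = kv_for_units(prompt)           # prefill: generate and cache the KV data

# Reuse: a later request sharing this 5-unit prefix appends only the new
# units' rows instead of recomputing K and V for the whole prompt.
new_units = rng.standard_normal((2, d))
K_new, V_new = kv_for_units(new_units)
K, V = np.vstack([K, K_new]), np.vstack([V, V_new])
print(K.shape, V.shape)               # (7, 8) (7, 8)
```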
For example, in a text inference scenario, the inference units are the word elements (tokens) obtained by segmenting the text; in an image inference scenario, the inference units are the image blocks obtained by segmenting the image, each image block being one inference unit; in an audio inference scenario, the inference units are the audio segments obtained by segmenting the audio, each audio segment being one inference unit; in a video inference scenario, the inference units are the image blocks obtained by segmenting each video frame, each image block being one inference unit; and so on.

In some embodiments, the inference node is configured to search a storage system based on the inference units corresponding to the inference request, where the storage system stores intermediate data of inference units generated using the large model. If a second inference unit and its intermediate data are retrieved from the storage system, the inference node determines the first inference unit based on the second inference unit and the inference units corresponding to the inference request, where the second inference unit matches one of the inference units corresponding to the inference request and the first inference unit includes the second inference unit. If no inference unit matching the inference units corresponding to the inference request is retrieved from the storage system, the inference node takes the inference units corresponding to the inference request as the first inference unit. In this way, the inference node searches the storage system based on the inference units corresponding to the inference request and then determines the first inference unit according to the retrieval result.
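The following sketch illustrates how each modality's input could be divided into inference units, following the examples above. The concrete splitting rules (whitespace tokenization, square image patches, fixed-length audio frames) are assumptions for illustration; the patent does not specify them.

```python
def text_units(text):
    # Text scenario: word elements obtained by segmenting the text.
    return text.split()

def image_units(image, patch=2):
    # Image scenario: square blocks obtained by segmenting the image
    # (represented here as a list of pixel rows).
    h, w = len(image), len(image[0])
    return [[row[x:x + patch] for row in image[y:y + patch]]
            for y in range(0, h, patch) for x in range(0, w, patch)]

def audio_units(samples, rate=16000, seconds=1.0):
    # Audio scenario: fixed-length segments of the sample stream.
    step = int(rate * seconds)
    return [samples[i:i + step] for i in range(0, len(samples), step)]

def video_units(frames, patch=2):
    # Video scenario: image blocks of every frame, in frame order.
    return [u for frame in frames for u in image_units(frame, patch)]

print(text_units("large model inference"))  # 3 word-element units
print(len(image_units([[0] * 4] * 4)))      # a 4x4 image -> 4 block units
```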