
CN-121981258-A - Hybrid reasoning method, model reasoning system and device for a target large model

CN 121981258 A

Abstract

The application relates to the technical field of artificial-intelligence acceleration and storage-computation fusion, and provides a hybrid reasoning method, a model reasoning system, and a device for a target large model. The method comprises: in response to a reasoning instruction sent by a second processor, acquiring a first operator set and a second operator set corresponding to a target reasoning task; performing an inference operation, using an in-memory accelerator in the storage unit, on the parameters in the parameter set corresponding to the first operator set, to obtain a first operation result of the first operator set; performing an inference operation, using a first processor, on the parameters corresponding to the second operator set, to obtain a second operation result of the second operator set; processing the first operation result and the second operation result to obtain a target reasoning result corresponding to the target reasoning task; and sending the target reasoning result to the second processor to instruct the second processor to determine the output content of the target large model according to the target reasoning result. The scheme can improve model reasoning efficiency.

Inventors

  • LUO TING
  • LIN YIN
  • WU DAWEI
  • CHEN QIANG
  • WANG JIANLI

Assignees

  • 得一微电子股份有限公司 (YEESTOR Microelectronics Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2025-12-29

Claims (10)

  1. A hybrid reasoning method for a target large model, characterized by being applied to a flash memory module, wherein the flash memory module comprises a first processor and a storage unit, a parameter set of the target large model is stored in the storage unit, and the flash memory module is communicatively connected with a second processor, the method comprising: in response to a reasoning instruction sent by the second processor, acquiring a first operator set and a second operator set corresponding to a target reasoning task, wherein the first operator set comprises operators capable of parallel operation, the second operator set comprises operators requiring sequential operation, and the reasoning instruction instructs the flash memory module to determine a target reasoning result of the target reasoning task of the target large model and return the target reasoning result; performing an inference operation, using an in-memory accelerator in the storage unit, on the parameters in the parameter set corresponding to the first operator set, to obtain a first operation result of the first operator set; performing an inference operation, using the first processor, on the parameters in the parameter set corresponding to the second operator set, to obtain a second operation result of the second operator set; processing the first operation result and the second operation result to obtain the target reasoning result corresponding to the target reasoning task; and sending the target reasoning result to the second processor to instruct the second processor to determine the output content of the target large model according to the target reasoning result.
  2. The method of claim 1, wherein the target reasoning task comprises one or more reasoning subtasks corresponding to the input content of the target large model, the one or more reasoning subtasks being determined according to a load condition of the second processor and a load condition of the flash memory module; the first operator set comprises first-type operators among the operators corresponding to the one or more reasoning subtasks, and the second operator set comprises second-type operators among those operators, the first-type operators being operators capable of parallel operation and the second-type operators being operators requiring sequential operation.
  3. The method of claim 1 or 2, wherein, in response to the reasoning instruction sent by the second processor, acquiring the first operator set and the second operator set corresponding to the target reasoning task comprises: acquiring an operator total set and a model reasoning stage corresponding to the target reasoning task, and acquiring a first real-time load of the in-memory accelerator and a second real-time load of the first processor, wherein the model reasoning stage represents either the prefill stage or the decoding stage of the target large model's reasoning process; determining, according to the first real-time load and the second real-time load, an operator allocation proportion between the in-memory accelerator and the first processor, wherein the operator allocation proportion represents the relation between the total amount of operators processed by the in-memory accelerator and the total amount of operators processed by the first processor; and, according to the model reasoning stage, selecting first-type operators conforming to the operator allocation proportion from the operator total set to form the first operator set, and forming the second operator set from the operators of the operator total set not placed in the first operator set.
  4. The method of claim 3, wherein selecting, according to the model reasoning stage, first-type operators conforming to the operator allocation proportion from the operator total set to form the first operator set comprises: when the model reasoning stage is the prefill stage, selecting first-type operators conforming to the operator allocation proportion from the operator total set in descending order of memory-access frequency to form the first operator set; or, when the model reasoning stage is the decoding stage, selecting first-type operators conforming to the operator allocation proportion from the operator total set in descending order of compute amount to form the first operator set.
  5. The method of claim 1 or 2, wherein the data transfer bandwidth between the in-memory accelerator and the storage unit is greater than the data transfer bandwidth between the first processor and the storage unit.
  6. The method of claim 5, further comprising: in response to a deployment instruction for the target large model sent by the second processor, acquiring a computational graph of the target large model, wherein the computational graph represents the association relations among the operators of the target large model; and classifying each operator of the target large model according to the input tensor and output tensor of the operator corresponding to each node in the computational graph, to obtain the first-type operators and second-type operators corresponding to the target large model.
  7. A model reasoning system, comprising a flash memory module and a second processor, wherein: the flash memory module comprises a storage unit and a first processor and is configured to store a parameter set of a target large model in the storage unit; the second processor is configured to send a reasoning instruction to the flash memory module, wherein the reasoning instruction instructs the flash memory module to determine a target reasoning result of a target reasoning task of the target large model and return the target reasoning result; the flash memory module is further configured to, in response to the reasoning instruction, acquire a first operator set and a second operator set corresponding to the target reasoning task, wherein the first operator set comprises operators capable of parallel operation and the second operator set comprises operators requiring sequential operation; to perform an inference operation, using an in-memory accelerator in the storage unit, on the parameters in the parameter set corresponding to the first operator set to obtain a first operation result of the first operator set; to perform an inference operation, using the first processor, on the parameters corresponding to the second operator set to obtain a second operation result of the second operator set; and to process the first operation result and the second operation result to obtain the target reasoning result corresponding to the target reasoning task; and the second processor is further configured to determine the output content of the target large model according to the target reasoning result.
  8. An electronic device comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, causes the electronic device to implement the method of any one of claims 1 to 6.
  9. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 6.
  10. A computer program product comprising a computer program which, when executed, causes the method of any one of claims 1 to 6 to be performed.
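Claims 3 and 4 describe a load-aware split: an allocation proportion is derived from the real-time loads of the two compute units, and candidate operators are ranked by memory-access frequency during the prefill stage and by compute amount during the decoding stage. The sketch below illustrates that selection logic; all names, data fields, and the load-to-proportion formula are assumptions for illustration, since the claims do not fix a concrete formula:

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    parallelizable: bool   # first-type (parallel) vs second-type (sequential)
    mem_accesses: int      # memory-access frequency, ranking key in prefill
    flops: int             # compute amount, ranking key in decode

def allocation_proportion(accel_load: float, cpu_load: float) -> float:
    """Fraction of operators routed to the in-memory accelerator.

    Assumed formula: route work in proportion to each unit's free capacity;
    the claims only require *some* proportion derived from the two loads.
    """
    free_accel = 1.0 - accel_load
    free_cpu = 1.0 - cpu_load
    return free_accel / max(free_accel + free_cpu, 1e-9)

def split_operators(ops, stage, accel_load, cpu_load):
    """Return (first_set, second_set) according to the model reasoning stage."""
    proportion = allocation_proportion(accel_load, cpu_load)
    candidates = [op for op in ops if op.parallelizable]
    # Prefill is memory-bound: prefer operators with high memory-access
    # frequency; decode is compute-bound: prefer high-compute operators.
    key = (lambda op: op.mem_accesses) if stage == "prefill" else (lambda op: op.flops)
    candidates.sort(key=key, reverse=True)
    quota = round(proportion * len(ops))
    first_set = candidates[:quota]
    second_set = [op for op in ops if op not in first_set]
    return first_set, second_set
```

With a lightly loaded accelerator (load 0.2) and a busy processor (load 0.8), the proportion comes out at 0.8, so most parallel-capable operators land on the accelerator and the remainder, including all sequential operators, fall to the first processor.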

Description

Hybrid reasoning method, model reasoning system and device for a target large model

Technical Field

The application belongs to the technical field of artificial-intelligence acceleration and storage-computation fusion, and in particular relates to a hybrid reasoning method for a target large model, a model reasoning system, an electronic device, a computer-readable storage medium, and a computer program product.

Background

With the dramatic increase in the parameter scale of artificial-intelligence models, model parameters are often stored in flash memory outside the artificial-intelligence (Artificial Intelligence, AI) chip when deploying models, particularly large-scale models with huge parameter counts (large models for short). During model reasoning, the AI chip reads the required model parameters from the flash memory to determine the model's reasoning result. However, this AI-chip-centric architecture, in which storage and computation are separated, is limited by the small data transmission bandwidth between the flash memory and the AI chip: the data transmission speed is low, which degrades model reasoning efficiency. How to improve the reasoning efficiency of large models has therefore become a technical problem to be solved.

Disclosure of Invention

The embodiments of the application provide a hybrid reasoning method for a target large model, a model reasoning system, an electronic device, a computer-readable storage medium, and a computer program product, which address the problem of improving the reasoning efficiency of large models.
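The bottleneck described in the Background can be made concrete with a back-of-envelope model: streaming all parameters of a large model once over the flash-to-chip link versus accessing them through the flash module's wider internal path. The figures below (7B INT8 parameters, 4 GB/s external link, 64 GB/s internal path) are illustrative assumptions, not values from the patent:

```python
def transfer_time_s(model_bytes: float, bandwidth_gbps: float) -> float:
    """Seconds to stream all parameters once at the given bandwidth (GB/s)."""
    return model_bytes / (bandwidth_gbps * 1e9)

model_bytes = 7e9                                # 7B parameters at 1 byte each (INT8)
external = transfer_time_s(model_bytes, 4.0)     # assumed flash-to-AI-chip link
internal = transfer_time_s(model_bytes, 64.0)    # assumed in-flash internal path
```

In a memory-bound decoding stage, generating each token requires reading the full weight set once, so this transfer time directly bounds tokens per second; computing near the data, where the internal bandwidth is available, lifts that bound by the bandwidth ratio.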
In a first aspect, an embodiment of the present application provides a hybrid reasoning method for a target large model, applied to a flash memory module, where the flash memory module includes a first processor and a storage unit, the storage unit stores a parameter set of the target large model, and the flash memory module is communicatively connected with a second processor. The method includes: in response to a reasoning instruction sent by the second processor, acquiring a first operator set and a second operator set corresponding to a target reasoning task, where the first operator set includes operators capable of parallel operation, the second operator set includes operators requiring sequential operation, and the reasoning instruction instructs the flash memory module to determine a target reasoning result of the target reasoning task of the target large model and return the target reasoning result; performing an inference operation, using an in-memory accelerator in the storage unit, on the parameters in the parameter set corresponding to the first operator set, to obtain a first operation result of the first operator set; performing an inference operation, using the first processor, on the parameters corresponding to the second operator set, to obtain a second operation result of the second operator set; processing the first operation result and the second operation result to obtain the target reasoning result corresponding to the target reasoning task; and sending the target reasoning result to the second processor to instruct the second processor to determine the output content of the target large model according to the target reasoning result.
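The first aspect amounts to a fork-join pipeline inside the flash module: the parallel-capable operator set runs on the in-memory accelerator, the sequential set runs in order on the module's own processor, and the two partial results are merged before being returned to the host. The toy simulation below shows only that control flow; the executors and the placeholder arithmetic are stand-ins, not real hardware interfaces:

```python
from concurrent.futures import ThreadPoolExecutor

def run_on_in_memory_accelerator(op, params):
    # Stand-in for the near-data compute unit handling parallel operators.
    return sum(params[op])          # placeholder "inference operation"

def run_on_first_processor(op, params):
    # Stand-in for the flash module's embedded processor handling sequential operators.
    return max(params[op])          # placeholder "inference operation"

def hybrid_inference(first_set, second_set, params):
    """Fork-join: the first set runs concurrently, the second strictly in order."""
    with ThreadPoolExecutor() as pool:   # accelerator operators may overlap
        first_results = list(pool.map(
            lambda op: run_on_in_memory_accelerator(op, params), first_set))
    second_results = [run_on_first_processor(op, params) for op in second_set]
    # "Processing the first and second operation results" -- here, a simple merge
    # that the host-side second processor would consume.
    return {"first": first_results, "second": second_results}

params = {"matmul": [1, 2, 3], "attention": [4, 5], "softmax": [0.1, 0.9]}
result = hybrid_inference(["matmul", "attention"], ["softmax"], params)
```

`ThreadPoolExecutor.map` preserves input order, so the merged result keeps the operator ordering even though the first-set operators execute concurrently.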
In some embodiments, the target reasoning task includes one or more reasoning subtasks corresponding to the input content of the target large model, where the one or more reasoning subtasks are determined according to the load condition of the second processor and the load condition of the flash memory module; the first operator set includes first-type operators among the operators corresponding to the one or more reasoning subtasks, and the second operator set includes second-type operators among those operators, the first-type operators being operators capable of parallel operation and the second-type operators being operators requiring sequential operation. In some embodiments, in response to the reasoning instruction sent by the second processor, acquiring the first operator set and the second operator set corresponding to the target reasoning task includes: acquiring an operator total set and a model reasoning stage corresponding to the target reasoning task, and acquiring a first real-time load of the in-memory accelerator and a second real-time load of the first processor, where the model reasoning stage represents either the prefill stage or the decoding stage of the target large model's reasoning process; determining, according to the first real-time load and the second real-time load, an operator allocation proportion between the in-memory accelerator and the first processor, where the operator allocation proportion represents the relation between the total amount of operators processed by the in-memory accelerator and the total amo