CN-122021895-A - Deep learning model reasoning method and system
Abstract
A deep learning model reasoning method and system for realizing cloud-end collaborative reasoning are disclosed. The method comprises: determining, from a plurality of candidate operator allocation schemes, a candidate scheme whose estimated delay time meets a delay time constraint as the current operator allocation scheme, according to the operator set and input length of the deep learning model in the current iteration and the current network bandwidth; and executing the reasoning task of the current iteration according to the current operator allocation scheme, wherein the cloud side and the end side each execute the operators allocated to them in the operator set of the current iteration. In this way, under diversified configurations and dynamic reasoning loads, cloud-end task partitioning and scheduling can be performed adaptively at the operator level according to real-time load demands and resource availability, so that large-model reasoning is carried out efficiently while both the Service Level Objective (SLO) and reasoning cost control requirements are taken into account.
Inventors
- Request for anonymity
- Request for anonymity
Assignees
- 北京无问芯穹科技有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260122
Claims (10)
- 1. A deep learning model reasoning method, in which a part of a reasoning task is performed on a terminal device and another part is performed on a server, with collaborative computation performed through a network, the method comprising: determining, from a plurality of candidate operator allocation schemes, a candidate scheme whose estimated delay time meets a delay time constraint as the current operator allocation scheme, according to the operator set and input length of the deep learning model in the current iteration and the current network bandwidth; and executing the reasoning task of the current iteration according to the current operator allocation scheme, wherein the terminal device and the server each execute the operators allocated to them in the operator set of the current iteration.
- 2. The method of claim 1, wherein determining, as the current operator allocation scheme, a candidate operator allocation scheme whose estimated delay time meets the delay time constraint from among the plurality of candidate operator allocation schemes comprises: calculating the estimated reasoning delay time of the plurality of candidate operator allocation schemes under the current iteration according to the operator set and input length of the deep learning model under the current iteration and the current network bandwidth; and selecting, as the current operator allocation scheme, one operator allocation scheme from the plurality of candidate operator allocation schemes whose estimated delay time is less than the delay time constraint, wherein the estimated delay time is determined based on the estimated reasoning delay time.
- 3. The method of claim 1, wherein determining, as the current operator allocation scheme, a candidate operator allocation scheme whose estimated delay time meets the delay time constraint from among the plurality of candidate operator allocation schemes comprises: querying a pre-established operator-allocation estimated-delay table; determining, according to the operator set and input length of the current iteration and the current network bandwidth, at least one entry in the operator-allocation estimated-delay table that matches the operator set, the input length and the current network bandwidth of the current iteration; and determining the current operator allocation scheme based on the candidate operator allocation scheme corresponding to the at least one entry.
- 4. The method of claim 1, wherein determining, as the current operator allocation scheme, a candidate operator allocation scheme whose estimated delay time meets the delay time constraint from among the plurality of candidate operator allocation schemes comprises: acquiring the dependency relationships between operators in the operator set of the current iteration; determining the transmission delay time incurred when a pair of preceding and succeeding operators with a dependency relationship are allocated to different ends; and determining, as the current operator allocation scheme, a candidate operator allocation scheme from the plurality of candidate operator allocation schemes whose estimated delay time, which incorporates the transmission delay time, meets the delay time constraint.
- 5. The method of any of claims 1-4, wherein determining, from among the plurality of candidate operator allocation schemes, a candidate operator allocation scheme whose estimated delay time meets the delay time constraint as the current operator allocation scheme comprises: monitoring the iterative execution condition in real time to determine whether a scheme switching condition is satisfied; and in response to the scheme switching condition being met, performing the operation of determining, as the current operator allocation scheme, a candidate operator allocation scheme from the plurality of candidate operator allocation schemes whose estimated delay time meets the delay time constraint.
- 6. The method of claim 5, wherein, in response to the scheme switching condition being met, performing the operation of determining, as the current operator allocation scheme, a candidate operator allocation scheme from a plurality of candidate operator allocation schemes whose estimated delay time meets a delay time constraint comprises: determining the estimated reasoning delay time of the plurality of candidate operator allocation schemes under the current iteration according to the operator set and input length of the deep learning model under the current iteration and the current network bandwidth; calculating, for each of the plurality of candidate operator allocation schemes, a scheme switching time for switching from the previous operator allocation scheme used in the previous iteration to that scheme; and taking the sum of the scheme switching time and the estimated reasoning delay time as the estimated delay time, and selecting, as the current operator allocation scheme, one candidate operator allocation scheme from the plurality of candidate operator allocation schemes whose estimated delay time meets the delay time constraint.
- 7. The method of claim 6, wherein calculating, for each of the plurality of candidate operator allocation schemes of the current iteration, a scheme switching time for switching from the previous operator allocation scheme used in the previous iteration to that scheme comprises: identifying the set of migration operators that need to be migrated in order to switch from the previous operator allocation scheme to that scheme; calculating the data transmission time required for each operator in the set of migration operators; and determining the scheme switching time for switching from the previous operator allocation scheme to that scheme based on the sum of the data transmission times required for the operators.
- 8. The method of claim 7, wherein determining the scheme switching time for switching from the previous operator allocation scheme to that scheme based on the sum of the data transmission times required for the operators comprises: multiplying the sum of the data transmission times by an overlap coefficient to obtain the scheme switching time for switching from the previous operator allocation scheme to that scheme, wherein the value of the overlap coefficient is determined based on the relationship between the remaining computation time of the last batch of operators of the previous operator allocation scheme and the theoretical transmission time of the data to be migrated in the set of migration operators.
- 9. The method of claim 5, wherein monitoring the iterative execution condition in real time to determine whether a scheme switching condition is met comprises: monitoring, in real time, the real-time network bandwidth and the actual delay time of the previous iteration's execution, wherein the scheme switching condition includes at least one of: the difference between the delay time constraint and the actual delay time being less than a first threshold; and the difference between the real-time network bandwidth and the network bandwidth used by the previous operator allocation scheme used in the previous iteration being greater than a second threshold.
- 10. A deep learning model reasoning system comprising a terminal device and a server, wherein a part of the reasoning task is performed on the terminal device and another part is performed on the server, with collaborative computation performed over a network, the system performing the method of any of claims 1 to 9.
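As a rough, non-authoritative sketch of the procedure in claims 1-2, 4 and 6-8, the following Python code estimates a per-scheme delay (per-end compute time plus transmission time wherever dependent operators land on different ends), adds a migration cost scaled by an overlap coefficient, and picks the first candidate satisfying the delay constraint. Every name (`OpInfo`, `estimate_delay`, `select_scheme`, ...) and the linear cost model itself are illustrative assumptions, not taken from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class OpInfo:
    """Hypothetical per-operator profile (names and cost model are assumptions)."""
    name: str
    device_ms: float          # estimated compute time on the terminal device
    server_ms: float          # estimated compute time on the cloud server
    out_bytes: int            # size of the activation handed to successors
    deps: list = field(default_factory=list)  # names of predecessor operators

def _tx_ms(nbytes: int, bandwidth_bps: float) -> float:
    # Transmission time in milliseconds for nbytes over the given bandwidth.
    return nbytes * 8 / bandwidth_bps * 1000

def estimate_delay(ops, placement, bandwidth_bps):
    """Estimated reasoning delay of one allocation scheme (claims 2 and 4):
    per-end compute time plus transmission time whenever an operator and
    one of its dependencies are allocated to different ends."""
    by_name = {op.name: op for op in ops}
    compute = sum(op.device_ms if placement[op.name] == "device" else op.server_ms
                  for op in ops)
    transmission = sum(_tx_ms(by_name[dep].out_bytes, bandwidth_bps)
                       for op in ops for dep in op.deps
                       if placement[dep] != placement[op.name])
    return compute + transmission

def switch_time_ms(prev, cand, ops, bandwidth_bps, overlap=1.0):
    """Scheme switching time (claims 7-8): sum of transfer times for operators
    whose placement changes, scaled by an overlap coefficient that models
    transfer hidden behind the previous scheme's residual compute."""
    migrated = [op for op in ops if prev.get(op.name) != cand[op.name]]
    return overlap * sum(_tx_ms(op.out_bytes, bandwidth_bps) for op in migrated)

def select_scheme(ops, candidates, prev, bandwidth_bps, slo_ms, overlap=1.0):
    """Return the first candidate whose estimated delay (reasoning delay plus
    switching time, claim 6) meets the delay constraint, or None if none does."""
    for cand in candidates:
        total = (estimate_delay(ops, cand, bandwidth_bps)
                 + switch_time_ms(prev, cand, ops, bandwidth_bps, overlap))
        if total < slo_ms:
            return cand
    return None

# Toy usage: two chained operators, 100 Mbit/s link, 50 ms delay constraint.
ops = [OpInfo("attn", device_ms=30, server_ms=5, out_bytes=1_000_000),
       OpInfo("mlp", device_ms=40, server_ms=6, out_bytes=1_000_000, deps=["attn"])]
all_device = {"attn": "device", "mlp": "device"}
all_server = {"attn": "server", "mlp": "server"}
chosen = select_scheme(ops, [all_device, all_server], prev=all_device,
                       bandwidth_bps=100e6, slo_ms=50.0, overlap=0.1)
```

For the table-driven variant of claim 3, a function like `estimate_delay` could be evaluated offline over buckets of operator sets, input lengths and bandwidths, with the results stored in a lookup table that is queried at run time instead of recomputed per iteration.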
Description
Deep learning model reasoning method and system
Technical Field
The disclosure relates to the technical field of artificial intelligence, and in particular to a deep learning model reasoning method and system.
Background
With the rapid evolution of artificial intelligence technology, deep learning models are widely used in fields such as image recognition, speech recognition and natural language processing. Large language models (LLMs), as a key direction in the development of deep learning, show significant advantages in understanding and generating complex text. However, as LLM parameter counts continue to scale up, the storage bandwidth and computation consumed by the reasoning process grow exponentially, making efficient deployment a new challenge. The cloud has abundant computing resources but incurs additional computing cost; its reasoning process depends on the network connection, is easily affected by bandwidth and stability fluctuations, and therefore struggles to guarantee that the reasoning delay always meets the Service Level Objective (SLO), that is, that the reasoning delay stays within a preset upper limit. The terminal device avoids network round-trip overhead but is limited by local computing power and may not be able to complete large-model reasoning within the specified time limit. Cloud-end collaborative reasoning, which shares the computational load between the two sides, provides a viable path for balancing performance against cost.
Disclosure of Invention
Different large language models differ in structural design and model size, the computation and communication requirements of their operators vary, and the network state and end-side load are dynamic at run time, all of which make operator partitioning and scheduling between the cloud and the end more complex; existing allocation strategies therefore struggle to meet SLO and cost control requirements.
Therefore, the present disclosure proposes a deep learning model reasoning method and system that, under diversified configurations and dynamic reasoning loads, can adaptively perform cloud-end task partitioning and scheduling at the operator level according to real-time load requirements and resource availability, so as to perform large-model reasoning efficiently while taking both SLO and reasoning cost control requirements into account. According to a first aspect of the present disclosure, there is provided a deep learning model reasoning method, wherein a part of a reasoning task is executed on a terminal device, another part is executed on a server, and collaborative computation is performed through a network. The method includes: determining, from a plurality of candidate operator allocation schemes, a candidate scheme whose estimated delay time meets a delay time constraint as the current operator allocation scheme, according to the operator set and input length of the current iteration of the deep learning model and the current network bandwidth; and executing the reasoning task of the current iteration according to the current operator allocation scheme, wherein the terminal device and the server each execute the operators allocated to them in the operator set of the current iteration.
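The monitoring-driven re-selection described for the optional embodiments (claims 5 and 9) can be sketched as a small predicate. The function name and the threshold defaults below are hypothetical, chosen purely for illustration:

```python
def needs_switch(slo_ms, actual_ms, bandwidth_bps, scheme_bandwidth_bps,
                 slack_threshold_ms=10.0, bandwidth_threshold_bps=20e6):
    """Scheme switching condition (claim 9): trigger re-selection when the SLO
    slack of the previous iteration shrinks below a first threshold, or when
    the measured bandwidth drifts from the bandwidth the current scheme was
    planned for by more than a second threshold."""
    slack_too_small = (slo_ms - actual_ms) < slack_threshold_ms
    bandwidth_drifted = abs(bandwidth_bps - scheme_bandwidth_bps) > bandwidth_threshold_bps
    return slack_too_small or bandwidth_drifted
```

Per iteration, a runtime following this scheme would evaluate the predicate with the measured bandwidth and the previous iteration's actual delay, and re-run the candidate-selection step only when it returns True, so that selection overhead is paid only when the environment actually changes.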
Optionally, determining a candidate operator allocation scheme whose estimated delay time meets the delay time constraint from a plurality of candidate operator allocation schemes as the current operator allocation scheme comprises: calculating the estimated reasoning delay time of the plurality of candidate operator allocation schemes under the current iteration according to the operator set and input length of the deep learning model under the current iteration and the current network bandwidth, and selecting, as the current operator allocation scheme, one operator allocation scheme from the plurality of candidate operator allocation schemes whose estimated delay time, determined based on the estimated reasoning delay time, is smaller than the delay time constraint. Optionally, determining a candidate operator allocation scheme whose estimated delay time meets the delay time constraint from a plurality of candidate operator allocation schemes as the current operator allocation scheme comprises: querying a pre-established operator-allocation estimated-delay table, determining, according to the operator set and input length of the current iteration and the current network bandwidth, at least one entry in the table that matches the operator set, the input length and the current network bandwidth of the current iteration, and determining the current operator allocation scheme based on the candidate operator allocation scheme corresponding to the at least one entry. Optionally, determining a candidate operator allocation scheme whose estimated delay time conforms to the delay time constraint from a plurality of candidate operator al