CN-121998107-A - End-side multi-model collaborative inference method, end-side device and storage medium
Abstract
The application discloses an end-side multi-model collaborative inference method, an end-side device, and a storage medium, relating to the field of computer technology. The method dynamically collects hardware resource state data of the end-side device, the data comprising at least the compute utilization, memory occupancy, and running power consumption of each heterogeneous computing unit; acquires static metadata and dynamic inference performance data of each AI model; dynamically allocates computing resources of the heterogeneous computing units to each AI model based on the hardware resource state data, the static metadata, and the dynamic inference performance data; determines an inference priority and an execution-parallelism strategy for each AI model to generate a resource scheduling strategy; and constructs and executes a system-level inference task pipeline according to the resource scheduling strategy and the inter-model dependencies to complete collaborative inference. The application improves the resource utilization and inference efficiency of end-side multi-model inference schemes.
Inventors
- XIE CHEN
Assignees
- 深圳市歌尔泰克科技有限公司 (Shenzhen Goertek Technology Co., Ltd.)
Dates
- Publication Date
- 20260508
- Application Date
- 20251230
Claims (12)
- 1. An end-side multi-model collaborative inference method applied to an end-side device comprising a plurality of heterogeneous computing units, wherein a plurality of AI models are deployed on the end-side device, the method being characterized by comprising the following steps: dynamically collecting hardware resource state data of the end-side device, wherein the hardware resource state data comprises at least the compute utilization, memory occupancy, and running power consumption of each heterogeneous computing unit; acquiring static metadata and dynamic inference performance data of each AI model, wherein the static metadata comprises a computation graph structure, operator types, and input/output tensor dimensions parsed from the AI model, and the dynamic inference performance data comprises the actual latency, throughput, and resource occupancy on each heterogeneous computing unit during inference by the AI model; dynamically allocating computing resources of the heterogeneous computing units to each AI model based on the hardware resource state data, the static metadata, and the dynamic inference performance data, and determining an inference priority and an execution-parallelism strategy for each AI model to generate a resource scheduling strategy, wherein the execution-parallelism strategy indicates whether models without dependencies are executed in parallel and whether a single model is computed in parallel on a plurality of computing units, and the inference priority and the execution-parallelism strategy are decided in combination with the model dependencies among the AI models; and constructing and executing a system-level inference task pipeline according to the resource scheduling strategy and the model dependencies to complete collaborative inference, wherein constructing the system-level inference task pipeline comprises determining an execution order according to the model dependencies, constructing a directed acyclic computation graph representing the data flow between models, and performing operator fusion on adjacent, operator-compatible model nodes in the directed acyclic computation graph.
- 2. The end-side multi-model collaborative inference method of claim 1, wherein after the step of constructing and executing a system-level inference task pipeline according to the resource scheduling strategy and the model dependencies to complete collaborative inference, the method further comprises: collecting performance metrics during execution of the system-level inference task pipeline, and feeding the performance metrics back to update the dynamic inference performance data, so as to optimize the resource scheduling strategy and the inference task pipeline of the next round.
- 3. The end-side multi-model collaborative inference method of claim 1 or 2, wherein the step of dynamically collecting hardware resource state data of the end-side device comprises: collecting the hardware resource state data at a dynamically adjustable collection period, wherein the adjustment of the collection period is based on the current system load or the urgency of the inference tasks of the AI models: when the system load is above a preset threshold, or a task whose inference urgency is above a preset urgency value exists, the collection period is shortened to increase the monitoring frequency; and when the system load is below the preset threshold and no task whose inference urgency is above the preset urgency value exists, the collection period is lengthened to reduce system overhead (see the collection-period sketch after the claims).
- 4. The end-side multi-model collaborative inference method of claim 2, wherein the plurality of heterogeneous computing units includes at least two of a CPU, a GPU, and an NPU, and the step of dynamically allocating computing resources to each of the AI models based on the hardware resource state data, the static metadata, and the dynamic inference performance data comprises: establishing and maintaining a global resource pool that uniformly abstracts, quantifies, and manages the state of at least two of the CPU, GPU, and NPU computing resources; and dynamically allocating computing resources to each AI model based on the hardware resource state data, the static metadata, and the dynamic inference performance data, in combination with the real-time available resources of the global resource pool (see the resource-pool sketch after the claims).
- 5. The end-side multi-model collaborative inference method of claim 4, wherein the step of determining the inference priority of each model and the execution-parallelism strategy comprises: predicting, based on the historical sequence of the dynamic inference performance data and using an adaptive load-balancing algorithm, the change in compute demand of each AI model over the next time window; and computing and dynamically adjusting the inference priority of each model by combining the predicted change in compute demand, the model dependencies, and a preset task-urgency index (see the demand-prediction sketch after the claims).
- 6. The end-side multi-model collaborative inference method of claim 5, wherein the step of using the performance-metric feedback to update the dynamic inference performance data so as to optimize the resource scheduling strategy of the next round comprises: comparing the collected actual latency and throughput with the performance expected under the current resource scheduling strategy to generate a prediction error; and adjusting, based on the prediction error and using an online learning mechanism, the internal weight parameters of the adaptive load-balancing algorithm used to predict the change in compute demand (see the feedback sketch after the claims).
- 7. The end-side multi-model collaborative inference method of claim 5, wherein the method further comprises a failure-recovery mechanism: monitoring the health state of each heterogeneous computing unit in real time; and when the heterogeneous computing unit to which the computing resources of any AI model are allocated is detected to have failed or to have severely degraded performance, migrating, based on the real-time available resources of the global resource pool, the affected model's inference task to another available heterogeneous computing unit for continued execution, or reducing the inference precision of the affected model to lower its compute demand (see the failure-recovery sketch after the claims).
- 8. The end-side multi-model collaborative inference method of claim 7, wherein the step of performing operator fusion on adjacent, operator-compatible model nodes in the directed acyclic computation graph comprises: judging whether the output operator of an upstream model node in the directed acyclic computation graph matches the input operator of a downstream model node in data precision, data dimensions, and computation type; and if the degree of matching exceeds a preset fusion threshold, fusing the output operator and the input operator into a composite operator, the composite operator being executed continuously in one computing unit so as to reduce the number of movements of intermediate data between memory and the different heterogeneous computing units (see the operator-fusion sketch after the claims).
- 9. The end-side multi-model collaborative inference method of claim 8, wherein the step of constructing and executing a system-level inference task pipeline further comprises: mapping the operators bearing computing tasks in the AI models onto the heterogeneous computing units corresponding to the global resource pool, based on the resource scheduling strategy, the model dependencies, the model nodes in the directed acyclic computation graph, and the post-fusion operator types; and executing the system-level inference task pipeline using pipeline parallelism, so that different AI models at different stages of the directed acyclic computation graph can process different batches or different portions of data at the same time, improving overall hardware utilization and system throughput (see the pipeline sketch after the claims).
- 10. The end-side multi-model collaborative inference method of claim 1 or 2, wherein the method further comprises: providing a visual monitoring interface for displaying, in real time, the hardware resource state data, the dynamic inference performance data of each AI model, the inference priorities, the resource scheduling strategy, and the execution state of the system-level inference task pipeline; and receiving, through the visual monitoring interface, instructions for external manual intervention, wherein the manual intervention comprises adjusting the inference priority of an AI model, triggering or terminating operator fusion, and reallocating computing resources to an AI model.
- 11. An end-side device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the end-side multi-model collaborative inference method of any of claims 1-10.
- 12. A storage medium, characterized in that the storage medium is a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the end-side multi-model collaborative inference method of any of claims 1-10.
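Illustrative sketches (non-normative)
The sketches below are editorial illustrations of the claimed mechanisms, written in Python under stated assumptions; they are not part of the patent disclosure. First, a minimal sketch of the claim-3 adaptive collection period: shorten the sampling period under high load or when urgent tasks exist, lengthen it otherwise. The thresholds, period bounds, and the load/urgency probes are illustrative assumptions.
```python
import time

LOAD_THRESHOLD = 0.75     # stands in for the claim's "preset threshold"
URGENCY_THRESHOLD = 0.8   # stands in for the "preset urgency value"
MIN_PERIOD_S, MAX_PERIOD_S = 0.05, 2.0   # assumed bounds on the period

def next_collection_period(current_period: float,
                           system_load: float,
                           max_task_urgency: float) -> float:
    """Shorten the period under high load or urgent tasks; lengthen otherwise."""
    if system_load > LOAD_THRESHOLD or max_task_urgency > URGENCY_THRESHOLD:
        return max(MIN_PERIOD_S, current_period / 2)   # monitor more often
    return min(MAX_PERIOD_S, current_period * 2)       # reduce system overhead

def monitor_loop(read_load, read_urgency, collect):
    """collect() samples utilization, memory occupancy, and power consumption."""
    period = 0.5
    while True:
        collect()
        period = next_collection_period(period, read_load(), read_urgency())
        time.sleep(period)
```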
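Next, a minimal sketch of the claim-4 global resource pool, assuming CPU/GPU/NPU capacity can be abstracted into a single compute quantum so that allocation is decided uniformly; the capacity figures and the headroom-based placement rule are assumptions.
```python
from dataclasses import dataclass, field

@dataclass
class ComputeUnit:
    name: str           # e.g. "cpu", "gpu", "npu"
    capacity: float     # abstract compute quanta (assumed unit)
    allocated: float = 0.0

    @property
    def available(self) -> float:
        return self.capacity - self.allocated

@dataclass
class GlobalResourcePool:
    units: dict[str, ComputeUnit] = field(default_factory=dict)

    def register(self, unit: ComputeUnit) -> None:
        self.units[unit.name] = unit

    def try_allocate(self, demand: float) -> str | None:
        """Place a demand on the unit with the most headroom, or report failure."""
        best = max(self.units.values(), key=lambda u: u.available, default=None)
        if best is None or best.available < demand:
            return None          # caller may queue, degrade, or reschedule
        best.allocated += demand
        return best.name

# Assumed capacities, for illustration only.
pool = GlobalResourcePool()
for name, cap in [("cpu", 4.0), ("gpu", 8.0), ("npu", 16.0)]:
    pool.register(ComputeUnit(name, cap))
```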
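A sketch of claim 5's demand prediction and priority computation. An exponentially weighted moving average stands in for the patent's unspecified "adaptive load-balancing algorithm", and the priority weights are assumptions.
```python
def predict_next_demand(history: list[float], alpha: float = 0.5) -> float:
    """EWMA over the historical demand sequence; recent samples weigh most."""
    if not history:
        return 0.0
    estimate = history[0]
    for sample in history[1:]:
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate

def inference_priority(predicted_demand: float,
                       dependency_depth: int,
                       urgency: float,
                       w: tuple = (0.4, 0.3, 0.3)) -> float:
    # Upstream models (small depth) and urgent tasks rank higher; heavier
    # predicted demand is scheduled earlier so it does not stall the pipeline.
    return (w[0] * predicted_demand
            + w[1] / (1 + dependency_depth)
            + w[2] * urgency)
```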
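A sketch of the claim-6 feedback loop: the error between observed and expected performance nudges the predictor's internal weight (here the EWMA smoothing factor from the previous sketch). The update rule and learning rate are assumptions, not disclosed values.
```python
def update_predictor_weight(alpha: float,
                            predicted: float,
                            observed: float,
                            lr: float = 0.05) -> float:
    """Online adjustment of the smoothing factor from the prediction error."""
    error = observed - predicted
    # Under-prediction: trust recent samples more (raise alpha);
    # over-prediction: smooth more aggressively (lower alpha).
    alpha += lr * error / max(abs(observed), 1e-6)
    return min(0.95, max(0.05, alpha))   # keep the weight in a sane range
```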
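A sketch of the claim-7 failure recovery, reusing the GlobalResourcePool sketch above: prefer migrating the affected model to a unit with enough headroom; otherwise assume a lower-precision variant (e.g. fp16 or int8) that shrinks the demand. The degradation factor is an assumption.
```python
def recover(pool: GlobalResourcePool,
            failed_unit: str,
            model_id: str,
            demand: float,
            degrade_factor: float = 0.5) -> str:
    healthy = [u for u in pool.units.values() if u.name != failed_unit]
    # Prefer migration: continue at full precision on another unit.
    for unit in sorted(healthy, key=lambda u: u.available, reverse=True):
        if unit.available >= demand:
            unit.allocated += demand
            return f"migrated {model_id} to {unit.name}"
    # Otherwise reduce inference precision to shrink the compute demand.
    reduced = demand * degrade_factor
    for unit in healthy:
        if unit.available >= reduced:
            unit.allocated += reduced
            return f"degraded {model_id} and placed it on {unit.name}"
    return f"{model_id} queued: no capacity even after degradation"
```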
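A sketch of the claim-8 fusion test: score the match between an upstream output operator and a downstream input operator on data precision, dimensions, and computation type, and fuse above the threshold. The scoring weights and the 0.8 threshold are assumptions.
```python
from dataclasses import dataclass

@dataclass
class OperatorPort:
    dtype: str               # data precision, e.g. "fp16"
    shape: tuple             # data dimensions
    compute_kind: str        # computation type, e.g. "conv", "matmul"

def fusion_score(out_port: OperatorPort, in_port: OperatorPort) -> float:
    return (0.4 * (out_port.dtype == in_port.dtype)
            + 0.4 * (out_port.shape == in_port.shape)
            + 0.2 * (out_port.compute_kind == in_port.compute_kind))

def maybe_fuse(out_port: OperatorPort, in_port: OperatorPort,
               threshold: float = 0.8) -> bool:
    # A fused composite operator runs start-to-finish on one computing unit,
    # saving round trips of intermediate tensors through memory.
    return fusion_score(out_port, in_port) >= threshold
```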
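A sketch of the claim-9 pipeline parallelism: stages (models in dependency order) are chained by queues so that different stages process different batches concurrently. The threading layout and sentinel protocol are assumptions; a real deployment would bind each stage to its scheduled computing unit.
```python
from queue import Queue
import threading

def run_pipeline(stages, batches):
    """stages: callables in dependency order; batches: iterable of inputs."""
    queues = [Queue() for _ in range(len(stages) + 1)]

    def worker(stage, q_in, q_out):
        while True:
            item = q_in.get()
            if item is None:          # sentinel: propagate shutdown downstream
                q_out.put(None)
                return
            q_out.put(stage(item))    # stages overlap on different batches

    threads = [threading.Thread(target=worker, args=(s, queues[i], queues[i + 1]))
               for i, s in enumerate(stages)]
    for t in threads:
        t.start()
    for b in batches:
        queues[0].put(b)
    queues[0].put(None)
    results = []
    while (out := queues[-1].get()) is not None:
        results.append(out)
    for t in threads:
        t.join()
    return results

# e.g. run_pipeline([detect, recognize, understand], camera_batches)
```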
Description
End-side multi-model collaborative inference method, end-side device and storage medium

Technical Field

The present application relates to the field of computer technology, and in particular to an end-side multi-model collaborative inference method, an end-side device, and a storage medium.

Background

With the rapid development of artificial intelligence technology, end-side devices (such as smartphones, Internet of Things terminals, and edge computing units for autonomous driving) are taking on increasingly complex intelligent tasks. To meet the needs of diverse applications, multiple different AI models (for example, models for simultaneous object detection, speech recognition, and semantic understanding) typically need to be deployed and run cooperatively on a single end-side device. These models typically have different computation graph structures, operator types, and resource requirements, and may have data dependencies on one another. However, the computing resources of the end-side device (heterogeneous computing units such as the CPU, GPU, and NPU) are severely limited in compute, memory, and power.

At present, common multi-model inference schemes mostly adopt a static or semi-static resource allocation and scheduling strategy: fixed computing resources are pre-allocated to each model according to experience or offline profiling before deployment, and a fixed execution order or parallel mode is set. Such a scheme has significant drawbacks. First, the resource allocation is rigid and cannot adapt to dynamic loads. The running environment of the end-side device is complex and changeable (user operations, network conditions, background tasks, and the like), so the hardware resources (CPU/GPU/NPU utilization, memory occupancy, power consumption) and the inference load of each model (latency, throughput) fluctuate dynamically. A static allocation strategy cannot adjust to the real-time state, so resource contention causes delays at high load while idle resources are wasted at low load. Second, the scheduling strategy lacks global collaborative optimization. Existing schemes often consider the performance of a single model in isolation, or schedule only by simple priority rules, failing to fully consider the dependencies among multiple models, the characteristics of the heterogeneous computing units, and operator-level compatibility between models. As a result, a globally optimal inference pipeline cannot be constructed, making cross-model operator fusion, data-flow optimization, and efficient parallel execution on heterogeneous hardware difficult to achieve. Therefore, on resource-limited end-side devices, improving the resource utilization and inference efficiency of end-side multi-model inference has become an urgent technical problem.

Disclosure of Invention

The main purpose of the present application is to provide an end-side multi-model collaborative inference method, an end-side device, and a storage medium, aiming to solve the technical problems of low resource utilization and insufficient inference efficiency in end-side multi-model inference schemes.
To achieve the above purpose, the present application provides an end-side multi-model collaborative inference method applied to an end-side device comprising a plurality of heterogeneous computing units, where a plurality of AI models are deployed on the end-side device, the method comprising: dynamically collecting hardware resource state data of the end-side device, wherein the hardware resource state data comprises at least the compute utilization, memory occupancy, and running power consumption of each heterogeneous computing unit; acquiring static metadata and dynamic inference performance data of each AI model, wherein the static metadata comprises a computation graph structure, operator types, and input/output tensor dimensions parsed from the AI model, and the dynamic inference performance data comprises the actual latency, throughput, and resource occupancy on each heterogeneous computing unit during inference by the AI model; dynamically allocating computing resources of the heterogeneous computing units to each AI model based on the hardware resource state data, the static metadata, and the dynamic inference performance data, and determining an inference priority and an execution-parallelism strategy for each AI model to generate a resource scheduling strategy, wherein the execution-parallelism strategy indicates whether models without dependencies are executed in parallel and whether a single model is computed in parallel on a plurality of computing units, and the inference priority and the execution-parallelism strategy are decided in combination with the model dependencies among the AI models; and constructing and executing a system-level inference task pipeline according to the resource scheduling strategy and the model dependencies to complete collaborative inference.
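As an editorial aid, a minimal end-to-end sketch of one scheduling round of the disclosed flow, reusing predict_next_demand, inference_priority, and the pool from the sketches after the claims: build the directed acyclic graph from the model dependencies, derive a topological execution order, rank models by priority, and place them on the pooled heterogeneous units. All concrete inputs and the depth approximation are assumptions.
```python
from graphlib import TopologicalSorter

def schedule_round(models, dependencies, pool, perf_history, urgency):
    """models: {name: demand}; dependencies: {name: set of upstream names}."""
    # Execution order from the model dependencies (the DAG of claim 1).
    order = list(TopologicalSorter(dependencies).static_order())
    # Claim-5 priority; the topological index approximates dependency depth.
    ranked = sorted(order, key=lambda m: -inference_priority(
        predict_next_demand(perf_history[m]), order.index(m), urgency[m]))
    # Claim-4 placement: map each model onto the pooled heterogeneous units.
    placement = {m: pool.try_allocate(models[m]) for m in ranked}
    return order, placement
```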