
CN-121981280-A - AI large model lightweight reasoning service construction method and system based on edge collaboration

CN 121981280 A

Abstract

The invention belongs to the technical field of model construction in computer information science and relates to an AI large model lightweight inference service construction method and system based on edge collaboration. The method comprises: performing knowledge distillation of a teacher model, guided by semantic clustering, to generate dedicated student sub-models and deploying them on edge nodes; upon receiving an inference request, analyzing the request with a pre-trained lightweight scheduling model and generating several candidate execution blueprints that define the sub-model calling relations; calculating the collaborative inference cost of each candidate execution blueprint and selecting the lowest-cost one as the optimal execution blueprint; scheduling the corresponding edge nodes to execute the inference task collaboratively; backtracking and updating the conditional transition probabilities in the instruction path graph according to the actual execution path and node response times; and generating a feedback signal from the performance indicators to optimize the scheduling model parameters online. The method achieves semantically lightweight deployment of a large model and globally optimal scheduling under multi-dimensional constraints.
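As a reading aid for the cost-based blueprint selection described above (and specified in claim 4), the following minimal Python sketch scores a few candidate execution blueprints by a weighted collaborative inference cost and picks the cheapest. All identifiers, delay and load numbers, weights, and the delay normalization are illustrative assumptions; the patent publishes no code.

```python
# A blueprint is an ordered tuple of student sub-model IDs, one per edge node.
CANDIDATES = [("s1", "s3"), ("s1", "s2", "s3"), ("s2", "s3")]

# Assumed per-node load in [0, 1] and per-hop one-way delay in milliseconds.
NODE_LOAD = {"s1": 0.2, "s2": 0.7, "s3": 0.4}
HOP_DELAY_MS = {("s1", "s2"): 8.0, ("s1", "s3"): 5.0, ("s2", "s3"): 3.0}

# Prior execution probability of each blueprint, read off the instruction path
# graph as the product of conditional transition probabilities (claim 4).
PRIOR = {CANDIDATES[0]: 0.5, CANDIDATES[1]: 0.3, CANDIDATES[2]: 0.2}

def cost(bp, w=(0.4, 0.3, 0.3)):
    """Weighted sum of probability, communication, and computation costs."""
    c_prob = 1.0 - PRIOR[bp]                               # cheap if likely
    c_comm = sum(HOP_DELAY_MS[h] for h in zip(bp, bp[1:])) / 10.0  # crude scaling
    c_comp = sum(NODE_LOAD[n] for n in bp) / len(bp)       # average load
    return w[0] * c_prob + w[1] * c_comm + w[2] * c_comp

best = min(CANDIDATES, key=cost)
print("optimal execution blueprint:", best, "cost:", round(cost(best), 3))
```

With these assumed numbers the selected blueprint is ("s1", "s3"): its high prior probability and low hop delay outweigh its moderate node load, which is exactly the trade-off the weighted sum is meant to arbitrate.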

Inventors

  • WANG LONG
  • HUO KE
  • JIN WEI

Assignees

  • 西安明赋云计算股份有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-04-09

Claims (10)

  1. The method for constructing an AI large model lightweight inference service based on edge collaboration is characterized by comprising the following steps: constructing an instruction path graph formed by instruction step nodes and edges carrying conditional transition probabilities; dividing the instruction step nodes into different semantic clusters through a semantic clustering algorithm based on the instruction path graph; for each semantic cluster, generating a dedicated student sub-model by knowledge distillation of the teacher model's behavior on the corresponding tasks, and deploying the student sub-models on a plurality of edge nodes; when a user inference request is received, analyzing the request with a pre-trained lightweight scheduling model to generate at least one candidate execution blueprint defining the student sub-model calling relations, and calculating the collaborative inference cost of each candidate execution blueprint, wherein the collaborative inference cost integrates the blueprint's prior execution probability determined from the instruction path graph, the estimated communication delay among the involved edge nodes, and the current computational load of each edge node; selecting the optimal execution blueprint as the one with the lowest collaborative inference cost, and scheduling the corresponding edge nodes to execute the inference task collaboratively according to the optimal execution blueprint; after the task is executed, backtracking and updating the conditional transition probabilities of the corresponding edges in the instruction path graph according to the actual execution path and node response times of the optimal execution blueprint; and simultaneously generating a feedback signal by combining the actual execution delay with the confidence of the inference result, and optimizing the parameters of the lightweight scheduling model online.
  2. The AI large model lightweight inference service construction method based on edge collaboration according to claim 1, wherein the process of dividing the instruction step nodes into different semantic clusters through a semantic clustering algorithm is as follows: converting the text representation of each instruction step node into a high-dimensional semantic vector using a pre-trained language model; then adopting a K-Means clustering algorithm with cosine similarity between the high-dimensional semantic vectors as the distance metric, and iteratively partitioning all the high-dimensional semantic vectors into N semantic clusters, where N is a preset number of clusters (a clustering sketch follows the claims).
  3. The method for constructing the AI large model lightweight inference service based on edge collaboration according to claim 1, wherein for each semantic cluster the constructed student sub-model is a lightweight feedforward neural network comprising a plurality of hidden layers; in the distillation process the joint loss function is $L = (1-\lambda)\,L_{CE} + \lambda\,T^{2}\,D_{KL}(p_T \,\|\, p_S)$, wherein $L_{CE}$ is the cross-entropy loss between the student sub-model output and the true label, $D_{KL}(p_T \,\|\, p_S)$ is the KL divergence between the soft targets output by the student model under temperature coefficient $T$ and the soft targets of the teacher model, and $\lambda$ is a weight hyperparameter (sketched numerically after the claims).
  4. The AI large model lightweight inference service construction method based on edge collaboration according to claim 1, wherein the collaborative inference cost of each candidate execution blueprint is calculated as follows: for any candidate execution blueprint, its collaborative inference cost $C$ is calculated as the weighted sum of three dimensionless cost components: $C = w_1\,C_{prob} + w_2\,C_{comm} + w_3\,C_{comp}$; wherein $C_{prob}$ is the probability cost, with $C_{prob} = 1 - P_{prior}$, and $P_{prior}$, the prior execution probability of the candidate execution blueprint, is obtained from the conditional transition probabilities of all consecutive edges in the instruction path graph; $C_{comm}$ is the normalized communication cost calculated from the sum of the estimated communication delays between the involved edge nodes; $C_{comp}$ is the normalized computation cost calculated as the average of the current computational loads of the involved edge nodes; and $w_1$, $w_2$, $w_3$ are preset weight coefficients with $w_1 + w_2 + w_3 = 1$ (cf. the cost sketch after the abstract).
  5. The AI large model lightweight inference service construction method based on edge collaboration according to claim 1, wherein the candidate execution blueprint generation process is as follows: inputting the user inference request text into a lightweight scheduling model based on the Transformer architecture; the lightweight scheduling model adopts a sequence-to-sequence generation mode and outputs a sequence composed of unique identifiers of student sub-models; the sequence is a candidate execution blueprint whose order defines the calling order of the student sub-models, and a plurality of candidate execution blueprints are generated through a beam search decoding strategy.
  6. The method for constructing the edge-collaboration-based AI large model lightweight inference service according to claim 1, wherein the process of backtracking and updating the conditional transition probabilities of the corresponding edges in the instruction path graph is as follows: for any actually executed node transition in the optimal execution blueprint, if the source instruction step node is A and the target instruction step node actually jumped to is B, the conditional transition probabilities of all directed edges starting from the source instruction step node A are updated; for the actually executed directed edge, i.e. the edge pointing from the source instruction step node A to the target instruction step node B, the updated conditional transition probability is $P'_{A \to B} = P_{A \to B} + \eta\,(1 - P_{A \to B})$; for each other directed edge that was not actually executed, i.e. an edge pointing from the source instruction step node A to another instruction step node K, the updated conditional transition probability is $P'_{A \to K} = (1 - \eta)\,P_{A \to K}$; wherein $P_{A \to B}$ and $P_{A \to K}$ are the conditional transition probabilities before the update and $\eta$ is the learning rate (sketched after the claims).
  7. The method for constructing the AI large model lightweight inference service based on edge collaboration according to claim 1, wherein the optimization method for the parameters of the lightweight scheduling model is as follows: according to the actual total delay $T_{act}$ of the current inference and the confidence score $S$ of the inference result, calculating a normalized reward signal $r = \beta_1\,(1 - T_{act}/T_{tgt}) + \beta_2\,S$, wherein $T_{tgt}$ is a predetermined target delay, and $\beta_1$ and $\beta_2$ are weight coefficients respectively controlling the importance of delay and of confidence; and adopting a proximal policy optimization (PPO) algorithm with the reward signal $r$ as feedback, iteratively updating the parameters of the lightweight scheduling model by maximizing the cumulative reward (sketched after the claims).
  8. The AI large model lightweight inference service construction system based on edge collaboration is characterized by comprising the following modules: a deployment module, a computing module, an execution module, and an optimization module; the deployment module is used for constructing an instruction path graph formed by instruction step nodes and edges carrying conditional transition probabilities, dividing the instruction step nodes into different semantic clusters through a semantic clustering algorithm based on the instruction path graph, generating a dedicated student sub-model for each semantic cluster by knowledge distillation of the teacher model's behavior on the corresponding tasks, and deploying the student sub-models on a plurality of edge nodes; the computing module is used for analyzing a user inference request with a pre-trained lightweight scheduling model when the request is received, generating at least one candidate execution blueprint defining the student sub-model calling relations, and calculating the collaborative inference cost of each candidate execution blueprint, wherein the collaborative inference cost integrates the blueprint's prior execution probability determined from the instruction path graph, the estimated communication delay among the involved edge nodes, and the current computational load of each edge node; the execution module is used for selecting the optimal execution blueprint as the one with the lowest collaborative inference cost and scheduling the corresponding edge nodes to execute the inference task collaboratively according to the optimal execution blueprint; and the optimization module is used for, after the task is executed, backtracking and updating the conditional transition probabilities of the corresponding edges in the instruction path graph according to the actual execution path and node response times of the optimal execution blueprint, and simultaneously generating a feedback signal by combining the actual execution delay with the confidence of the inference result to optimize the parameters of the lightweight scheduling model online.
  9. The AI large model lightweight inference service construction system based on edge collaboration of claim 8, wherein the process of partitioning the instruction step nodes into different semantic clusters by a semantic clustering algorithm is as follows: converting the text representation of each instruction step node into a high-dimensional semantic vector using a pre-trained language model; then adopting a K-Means clustering algorithm with cosine similarity between the high-dimensional semantic vectors as the distance metric, and iteratively partitioning all the high-dimensional semantic vectors into N semantic clusters, where N is a preset number of clusters.
  10. The AI large model lightweight inference service construction system based on edge collaboration of claim 8, wherein for each semantic cluster the constructed student sub-model is a lightweight feedforward neural network comprising a plurality of hidden layers; in the distillation process the joint loss function is $L = (1-\lambda)\,L_{CE} + \lambda\,T^{2}\,D_{KL}(p_T \,\|\, p_S)$, wherein $L_{CE}$ is the cross-entropy loss between the student sub-model output and the true label, $D_{KL}(p_T \,\|\, p_S)$ is the KL divergence between the soft targets output by the student model under temperature coefficient $T$ and the soft targets of the teacher model, and $\lambda$ is a weight hyperparameter.
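
The semantic clustering of claims 2 and 9 amounts to K-Means over language-model embeddings with cosine similarity as the distance metric, i.e. spherical k-means. Below is a minimal sketch, assuming unit-normalized vectors (so cosine similarity reduces to a dot product); the random vectors merely stand in for embeddings that a pre-trained language model would produce, and N = 3 clusters is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(12, 16))                    # 12 instruction steps, dim 16
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # unit norm: cosine == dot

def spherical_kmeans(x, n_clusters=3, iters=50):
    """K-Means with cosine similarity on unit vectors (claim 2's metric)."""
    centers = x[rng.choice(len(x), n_clusters, replace=False)]
    for _ in range(iters):
        labels = (x @ centers.T).argmax(axis=1)    # assign to most-similar center
        for k in range(n_clusters):
            members = x[labels == k]
            if len(members):                       # guard against empty clusters
                c = members.mean(axis=0)
                centers[k] = c / np.linalg.norm(c) # re-project centroid onto sphere
    return labels

print("cluster of each instruction step:", spherical_kmeans(emb))
```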

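The formulas reconstructed above for claims 3, 6, and 7 (the original equation images were lost in extraction, so the exact forms are assumptions) can be exercised numerically. In this sketch the names lambda_, T, eta, beta1, and beta2 follow the reconstructed notation; note that the claim 6 update keeps the outgoing probabilities of a node summing to one, which supports that reconstruction.

```python
import numpy as np

def softmax(z, t=1.0):
    z = np.asarray(z, dtype=float) / t
    e = np.exp(z - z.max())
    return e / e.sum()

# Claim 3 (reconstructed): L = (1 - lambda)*CE + lambda*T^2*KL(p_T || p_S).
def distill_loss(student_logits, teacher_logits, true_label, lambda_=0.5, T=2.0):
    ce = -np.log(softmax(student_logits)[true_label])   # hard-label cross entropy
    p_s, p_t = softmax(student_logits, T), softmax(teacher_logits, T)
    kl = float(np.sum(p_t * np.log(p_t / p_s)))         # KL(teacher || student)
    return (1 - lambda_) * ce + lambda_ * T**2 * kl

# Claim 6 (reconstructed): update all edges leaving node A after a transition.
def update_transitions(p_out, taken, eta=0.1):
    """p_out maps target node -> P(A -> target); `taken` is the executed edge."""
    return {k: p + eta * (1 - p) if k == taken else (1 - eta) * p
            for k, p in p_out.items()}                  # total stays 1.0

# Claim 7 (reconstructed): normalized reward from latency and confidence.
def reward(t_actual, t_target, confidence, beta1=0.5, beta2=0.5):
    return beta1 * (1 - t_actual / t_target) + beta2 * confidence

print("joint loss:", round(distill_loss([2.0, 0.5, -1.0], [1.5, 1.0, -0.5], 0), 3))
print("updated P(A->.):", update_transitions({"B": 0.6, "K": 0.4}, taken="B"))
print("reward:", reward(t_actual=80.0, t_target=100.0, confidence=0.9))
```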
Description

AI large model lightweight reasoning service construction method and system based on edge collaboration

Technical Field

The invention belongs to the technical field of model construction in computer information science, and particularly relates to an AI large model lightweight inference service construction method and system based on edge collaboration. It is intended for lightweight deployment of AI large models in resource-constrained edge computing scenarios, for distributed collaborative inference, and for inference service optimization, and can improve the execution efficiency and operational stability of large model inference services in edge environments.

Background

With the rapid development of artificial intelligence technology, large AI models have shown excellent performance in fields such as natural language processing and computer vision by virtue of their huge parameter counts and complex network structures, but at the price of extremely high computation and storage resource requirements. The traditional inference service mode transmits all user requests to a cloud data center for centralized processing. This centralized architecture not only incurs significant network communication delay from long-distance data transmission, directly degrading the user experience, but also places enormous computational pressure on the cloud servers, which in turn drives up operation and maintenance costs. More importantly, transmitting and processing data in the cloud carries a risk of privacy leakage, making it difficult to meet the data security requirements of some scenarios. How to make large AI models run efficiently in resource-constrained edge environments has therefore become a key problem for the practical deployment of the technology.

To achieve lightweight deployment of large models, model compression techniques have evolved. Knowledge distillation is one of the mainstream means: a lightweight student model is trained to imitate the inference behavior of a complex teacher model, reducing model size and computational complexity while preserving as much of the original model's performance as possible. Model partitioning and edge collaboration techniques divide a large model into several sub-modules, deploy them on distributed edge nodes, and complete inference tasks through cooperation among the nodes, effectively reducing the resource load on any single node.

However, the prior art still has obvious shortcomings. On one hand, existing model partitioning and collaborative inference schemes do not adequately consider the diversity of user requests at the level of task semantics; the partitioned sub-modules often lack task specificity and can hardly adapt to the individual requirements of different task types, which limits inference efficiency. For example, the cloud collaborative inference and personalized learning method disclosed in Chinese patent CN117786236B proposes distributing lightweight small models to edge nodes for nearby inference to relieve cloud pressure, but it relies mainly on a single, static deployment mode and lacks flexibility when facing multi-source heterogeneous instructions beyond the preset experimental scenario. As another example, Chinese patent application CN116134454A discloses means for compressing model volume using knowledge distillation, but does not effectively combine complex semantic features with the dynamic cooperation of distributed heterogeneous edge nodes. On the other hand, most existing task scheduling strategies are static or semi-dynamic and lack the ability to sense and adapt to dynamic changes in the edge network environment, so they cannot flexibly cope with sudden conditions such as node load fluctuation and network delay variation. For example, Chinese patent application CN112650585A discloses an edge computing task scheduling method based on an ant colony cooperative scheduling algorithm; although an optimization algorithm is introduced, its core scheduling logic still offloads tasks to the cloud when edge node computing power is insufficient, and this edge-cloud two-level scheduling mode does not realize lateral cooperation among multiple edge nodes, limiting flexibility and resource utilization in task processing. Meanwhile, such systems generally lack a closed-loop feedback mechanism that learns and optimizes from actual execution effects, so the scheduling strategy and model performance can hardly be improved iteratively from real operating data, and the overall efficiency of the edge collaborative inference system is difficult to improve continuously.