
CN-121979667-A - HITL method and system for offloading large-model inference tasks in a heterogeneous GPU computing power cloud

CN 121979667 A

Abstract

The invention discloses a human-in-the-loop (HITL) method and system for offloading large-model inference tasks in a heterogeneous GPU computing power cloud. The method comprises: constructing a system model of the heterogeneous GPU computing power cloud; designing a multi-agent-based HITL system architecture; introducing human expert feedback through the HITL method to correct the large model's recognition of user intent; mapping user intent to a utility function through interaction among multiple agents in a multi-agent cooperation module; and decomposing the large model's thinking process with an optimized prompt method, improving the accuracy of intent recognition through chain-of-thought reasoning. The method and system optimize computing resource management and the dynamic scheduling of large-model inference tasks, and improve service quality and user experience in heterogeneous GPU computing power cloud environments.

Inventors

  • GAO YUQIANG
  • CHENG SIZHE
  • YU JUN
  • MU JUN
  • LIU WENSONG
  • WANG ZHAO
  • ZHOU BAOKANG
  • LIU ZHUFENG
  • LIN FANZHAO
  • CHEN BIN
  • CHEN XI

Assignees

  • 国电南瑞科技股份有限公司 (NARI Technology Co., Ltd.)
  • 南京南瑞瑞腾科技有限责任公司 (Nanjing NARI Ruiteng Technology Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2025-12-24

Claims (10)

  1. A HITL method for offloading large-model inference tasks in a heterogeneous GPU computing power cloud, characterized by comprising the following steps: (1) constructing a system model of the heterogeneous GPU computing power cloud, the system model comprising a user access layer, an edge cloud computing layer, and a center cloud computing layer; (2) designing a multi-agent-based HITL system architecture, the HITL system architecture comprising: an agent model as the basic unit of the architecture, adopting a modular design and comprising a main agent together with a memory module, an action module, a perception module, and a brain module that are interactively integrated with the main agent; an HITL module interacting with the brain module and perception module of the agent model and providing a human-expert intervention channel for the architecture; definition and management of multiple role agents, including an operation-and-maintenance engineer agent, a dispatcher agent, an analyst agent, and a client agent, each agent encapsulating a role-specific tool set and knowledge base and conducting multi-round, structured dialogue and information exchange; and a prompt engineering module that, based on an improved chain-of-thought method, structures prompts into "know-analyze-output" stages and, for each stage, dynamically embeds the current task context, multi-agent interaction round, designated interaction object, enforced interaction mode, and structured output format; (3) introducing human expert feedback through the HITL method to correct the large model's recognition of user intent, wherein the human expert is a person with specialized knowledge of the specific domain and system operation authority who monitors, evaluates, and corrects the artificial-intelligence decision process; (4) mapping subjective user intent to an objective utility function through multi-round cooperation among multiple agents in the multi-agent cooperation module; (5) decomposing the large model's thinking process with an optimized prompt method, improving the accuracy of intent recognition through chain-of-thought reasoning.
  2. The HITL method for offloading large-model inference tasks in heterogeneous GPU computing power clouds of claim 1, wherein in step (1), the user access layer handles lightweight inference tasks, the edge cloud computing layer accepts complex tasks, and the center cloud computing layer handles computationally intensive loads.
  3. The HITL method for offloading large-model inference tasks in heterogeneous GPU computing power clouds of claim 1, wherein in step (2), the memory module is divided into long-term, medium-term, and short-term memory, the medium-term memory extracting key information through word slots; the action module integrates tool invocation and action execution, the large model invoking tools in real time; the perception module comprises the system model and prompt design and dynamically generates input text; and the brain module integrates multiple large language models and evaluates and adjusts QoSE targets.
  4. The HITL method for offloading large-model inference tasks in heterogeneous GPU computing power clouds of claim 1, wherein in step (2), the multi-agent collaboration involves an operation-and-maintenance engineer agent, a dispatcher agent, an analyst agent, and a client agent, dynamic optimization of task scheduling being achieved through a multi-round interaction flow.
  5. The HITL method for offloading large-model inference tasks in heterogeneous GPU computing power clouds of claim 4, wherein the operation-and-maintenance engineer agent obtains system network conditions and gathers information, the dispatcher agent performs task scheduling based on the gathered information and analysis results, the analyst agent analyzes the scheduling results and provides feedback, and the client agent gives adjustment suggestions.
  6. The HITL method for offloading large-model inference tasks in heterogeneous GPU computing power clouds of claim 1, wherein the prompt method of step (5) is based on chain-of-thought improvement and comprises: structuring the reasoning process into three stages of knowing, analyzing, and outputting; adding role interaction information, including interaction objects, interaction modes, and interaction formats; and adding current interaction round-number information.
  7. A HITL system for offloading large-model inference tasks in heterogeneous GPU computing power clouds, comprising the following modules: a heterogeneous GPU computing power cloud resource management module for performing unified abstraction, discovery, monitoring, and pooled management of the heterogeneous GPU resources of the user access layer, edge cloud computing layer, and center cloud computing layer, and providing a resource-state query interface and a task-offloading execution interface to upper layers; an agent engine module comprising an agent instantiation unit, an agent kernel, a memory module, an action module, a perception module, and a brain module, the agent instantiation unit dynamically creating or invoking agent instances with different roles such as operation-and-maintenance engineer, dispatcher, analyst, and client according to task requirements; a human-machine collaborative interaction module for providing an expert operation interface, actively suspending the automatic flow when an agent's decision confidence is low or a preset rule is triggered, visually presenting the decision context, reasoning chain, and candidate schemes to a human expert, and receiving the expert's correction instructions or direct decisions; a multi-agent cooperation module defining the communication protocol, interaction flow, and conflict-resolution mechanism among the agents, and driving and managing the agents of different roles to cooperate according to a set flow, jointly completing the full closed loop from understanding user intent to generating an optimal scheduling scheme; a dynamic prompt engineering and optimization module for maintaining a structured prompt template library and dynamically assembling and optimizing the prompts sent to the brains of the agents according to the current task type, cooperation stage, interaction round, and expert feedback from the HITL module; a task scheduling and offloading execution module that receives the scheduling strategy finally generated by the multi-agent cooperation module and confirmed by the HITL module and converts it into executable instructions; and a system support module, in which a processor executes the logic computation and scheduling of each module, and a storage medium persistently stores system configuration, the agents' long-term/medium-term memory, historical interaction logs, prompt templates, the expert knowledge base, task data, and the like.
  8. The HITL system of claim 7, wherein the brain module integrates and manages calls to multiple large-language-model APIs and performs the core reasoning.
  9. A computer device, comprising: a memory for storing executable instructions, system configuration data generated by the multi-agent-based HITL system architecture, agent model parameters, a prompt template library, and interaction logs; and a processor coupled to the memory and configured to execute the executable instructions, invoke and run the prompt engineering module to perform dynamic optimization and assembly of prompts, coordinate interaction and communication among the multiple agents of the multi-agent cooperation module, and manage the interaction flow with human experts through the interface provided by the HITL module.
  10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method steps of any one of claims 1-6.
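The "know-analyze-output" prompt structure of claim 6, with the interaction object, interaction mode, output format, and round number embedded per stage, can be illustrated with a minimal sketch. All field names and wording below are hypothetical; the patent's actual prompt templates are not disclosed in this text.

```python
def build_prompt(task_ctx: str, round_no: int, peer: str,
                 mode: str, out_format: str) -> str:
    """Assemble a structured 'know-analyze-output' prompt for one agent turn.

    task_ctx   -- current task context to embed
    round_no   -- current multi-agent interaction round
    peer       -- designated interaction object (which agent to address)
    mode       -- enforced interaction mode (e.g. broadcast, direct)
    out_format -- required structured output format
    """
    return "\n".join([
        f"[Round {round_no}] You are interacting with: {peer} (mode: {mode}).",
        "KNOW: Restate the current task context and constraints.",
        f"Context: {task_ctx}",
        "ANALYZE: Reason step by step over the context and peer messages.",
        f"OUTPUT: Reply strictly in the format: {out_format}.",
    ])
```

A dispatcher-agent turn might then be driven by `build_prompt("schedule GPU inference task", 2, "analyst agent", "direct", "JSON")`, with the round number incremented on each exchange.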

Description

HITL method and system for offloading large-model inference tasks in a heterogeneous GPU computing power cloud

Technical Field

The invention relates to the technical fields of cloud computing, large-model inference, and heterogeneous GPU computing resource scheduling and optimization, and in particular to a HITL method and system for offloading large-model inference tasks in heterogeneous GPU computing power clouds.

Background

With the development of artificial intelligence technology, and in particular the wide application of large pre-trained models, the computing power demanded by AI inference tasks has grown exponentially. The inference process of large language models often involves computation over billions or even hundreds of billions of parameters, and its consumption of computing resources, especially high-performance GPU resources, is extremely significant. In this context, heterogeneous GPU computing power clouds are increasingly becoming the core infrastructure carrying large-model inference tasks. A heterogeneous GPU computing power cloud achieves optimal utilization of resources by dynamically scheduling inference tasks across GPU resources of different models and performance levels, and has become an important means of improving AI inference efficiency and service quality. In large-model inference, task offloading technology has become key to improving system performance and resource utilization because of the huge scale and high resource consumption of the tasks. However, in actual deployment, the task-offloading decision is affected not only by the availability of computing resources, task characteristics, and network conditions, but must also comprehensively consider the actual demands and goals of users, that is, the user intent.
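The three-layer offloading model described above (claim 2: lightweight tasks at the user access layer, complex tasks at the edge cloud, compute-intensive loads at the center cloud) can be sketched as a simple tier-selection function. The FLOPs thresholds here are illustrative placeholders, not values from the patent, which derives scheduling decisions through multi-agent cooperation rather than fixed cutoffs.

```python
from enum import Enum

class Tier(Enum):
    USER_ACCESS = "user_access"    # lightweight inference tasks
    EDGE_CLOUD = "edge_cloud"      # complex tasks
    CENTER_CLOUD = "center_cloud"  # computationally intensive loads

def offload_tier(flops: float,
                 light_max: float = 1e9,
                 complex_max: float = 1e12) -> Tier:
    """Route an inference task to a tier by estimated compute demand (FLOPs).

    Tasks at or below light_max stay at the access layer; tasks up to
    complex_max go to the edge cloud; everything heavier goes to the
    center cloud.
    """
    if flops <= light_max:
        return Tier.USER_ACCESS
    if flops <= complex_max:
        return Tier.EDGE_CLOUD
    return Tier.CENTER_CLOUD
```

In the patent's architecture this static rule would be replaced by the dispatcher agent's dynamically optimized decision, informed by network conditions and user intent.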
For large-model inference services, user intent typically affects task priority, resource scheduling policies, the way inference results are generated, and so on. Particularly in heterogeneous GPU environments, user intent directly affects the execution strategy of inference tasks (e.g., model selection, scheduling priority, and whether parallel or distributed inference is used). Accurately identifying user intent is therefore key to making large-model inference services intelligent and personalized. However, current user-intent recognition relies mainly on traditional methods, small models, or mathematical methods, which usually depend on rules, statistical models, or training on small-scale data sets. They can make inferences within the range of their training data, but owing to their low model complexity their generalization ability is poor, and they struggle to adapt to the variety and complexity of user needs. Large models can learn more complex patterns from massive amounts of data and exhibit strong reasoning and predictive capabilities. However, current large-model methods cannot resolve understanding of the physical world and lack deep understanding of dynamic changes in equipment and environment, so user intent cannot be accurately identified, which limits their application in complex scenarios. In addition, owing to the large model's understanding errors, limited reasoning capability, context-management problems, and the like, a gap exists between the actual execution of task scheduling and the user's intent during offloading of large-model inference tasks. How to accurately identify and align with user intent has therefore become one of the core challenges of current large-model inference task offloading.
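The mapping from subjective user intent to an objective utility function (step 4 of the method) can be sketched as a weighted scoring over normalized scheduling metrics. The intent keywords, weight presets, and metric axes below are hypothetical illustrations; in the patent this mapping emerges from multi-round dialogue among the role agents with expert correction, not from a fixed lookup table.

```python
def intent_to_weights(intent: str) -> dict:
    """Map a coarse intent label to metric weights (hypothetical presets)."""
    presets = {
        "low_latency": {"latency": 0.7, "cost": 0.1, "accuracy": 0.2},
        "low_cost":    {"latency": 0.1, "cost": 0.7, "accuracy": 0.2},
        "default":     {"latency": 1/3, "cost": 1/3, "accuracy": 1/3},
    }
    return presets.get(intent, presets["default"])

def utility(scores: dict, weights: dict) -> float:
    """Weighted utility of a candidate scheduling scheme.

    scores -- per-axis scores normalized to [0, 1], higher = better
    (so a cost score of 1.0 means cheapest, not most expensive).
    """
    return sum(weights[k] * scores[k] for k in weights)
```

A dispatcher could then rank candidate schemes by `utility(scheme_scores, intent_to_weights(user_intent))` and pick the maximum; e.g. scores of 0.9 latency, 0.5 cost, 0.8 accuracy under the "low_latency" preset yield a utility of 0.84.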
Disclosure of Invention

The invention aims to provide a HITL method and system for offloading large-model inference tasks in a heterogeneous GPU computing power cloud, which accurately map user intent to an executable utility function by introducing human expert feedback and a multi-agent cooperation mechanism combined with an optimized chain-of-thought prompt technique, solve the intent-recognition deviation caused by existing large models' lack of understanding of the physical resource environment, narrow the gap between task-scheduling execution results and the user's actual intent, and improve the accuracy, personalization, and scheduling efficiency of large-model inference services in heterogeneous computing environments. The technical scheme is as follows. The HITL method for offloading large-model inference tasks in a heterogeneous GPU computing power cloud comprises the following steps: (1) constructing a system model of the heterogeneous GPU computing power cloud, the system model comprising a user access layer, an edge cloud computing layer, and a center cloud computing layer; (2) designing a multi-agent-based HITL system architecture, the HITL system architecture comprising: an agent model as a basic unit of the fr