CN-121985019-A - Intelligent computing job submission management method based on mobile terminal

Abstract

The invention discloses a mobile-terminal-based intelligent computing job submission management method comprising the following steps: S1, session establishment and authentication, in which a mobile terminal APP initiates a connection request to a Web service gateway deployed at the front end of an HPC cluster. By replacing traditional Linux command-line operation with graphical interaction and templated design on the mobile terminal, the invention greatly lowers the technical threshold, so that researchers without a computing background can manage jobs conveniently. Real-time monitoring, instant response, and full-workflow mobile coverage also free users from a fixed desktop environment, so that they can track job status in real time and handle abnormal tasks quickly in any setting, which significantly improves the flexibility and efficiency of intelligent computing job management.

Inventors

  • LUO DANDAN
  • YAO WENSHENG
  • LI LINGFENG
  • WU YUMIAO
  • CHEN XIAOLIANG

Assignees

  • 福建省数字福建云计算运营有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-12-08

Claims (8)

  1. A mobile-terminal-based intelligent computing job submission management method, characterized in that a Web system is deployed on the open-source Open OnDemand platform and packaged into a mobile terminal APP, so as to realize visual one-stop management of an HPC platform, the method comprising the following steps:
     S1, session establishment and authentication: the mobile terminal APP initiates a connection request to a Web service gateway deployed at the front end of the HPC cluster, and the user enters credentials through an authentication interface integrated in the mobile terminal APP to complete identity verification, thereby establishing a secure session;
     S2, resource discovery and selection: the mobile terminal APP calls a cluster state query API to obtain and visually display the resource states of the currently available computing nodes and queues, the resource states comprising CPU/GPU load, memory utilization, and queue waiting conditions, and the user selects a target computing queue or node on a graphical interface based on the resource states;
     S3, job preparation and file management: through a file manager module of the mobile terminal APP, the user accesses his or her HPC home directory or project directory graphically and performs, under that directory, at least one of creating an input file, uploading a local file to the cluster, editing the content of an existing file, or organizing a data set required by the job;
     S4, job configuration and submission: the user configures the computing job parameters through a job submission form of the mobile terminal APP, the parameters comprising a job name, an execution command, the required number of cores, the memory size, the expected run time, and the dependent software environment, and the mobile terminal APP packages the configured job parameters into a standard job description script and submits it to the selected queue through a job scheduling system interface;
     S5, real-time job status monitoring and interaction: the mobile terminal APP periodically polls the job scheduling system, or receives status updates from it over a WebSocket connection, and displays the job status on the mobile terminal interface in real time as a list or chart, the job status comprising "queued", "running", "completed", or "error", and also provides user interaction controls that allow the user to terminate a specified job, re-queue it, or view its detailed logs; and
     S6, result acquisition and post-processing: when the job status is detected to have changed to "completed", the mobile terminal APP automatically notifies the user, and the user browses, previews, or downloads to the mobile device the standard output, standard error, and result files generated by the job through the file manager module.
  2. The mobile-terminal-based intelligent computing job submission management method of claim 1, wherein in step S1 the mobile terminal APP establishes the connection through the Web service interface of the encapsulated Open OnDemand platform, and the secure session uses the platform's identity authentication system, which supports linked verification with the unified identity authentication system of the HPC cluster.
  3. The mobile-terminal-based intelligent computing job submission management method of claim 1, wherein the specific logic steps of S2 are as follows:
     S201, the mobile terminal APP starts the resource discovery function and automatically completes the permission and filter parameter configuration and the interface and security configuration;
     S202, the mobile terminal APP sends a resource query request to the Web service gateway of the HPC cluster, and the Web service gateway pulls data from the monitoring node and returns it, the returned data containing the fields required by the formula calculations;
     S203, after validating the raw data, the mobile terminal APP calculates key indicators by formula to form "user-readable resource state data", comprising:
       S2031, calculating the CPU/GPU load rate and memory utilization: CPU load rate = (usedCPU / totalCPU) × 100%; GPU load rate = (usedGPU / totalGPU) × 100%; memory utilization = (usedMem / totalMem) × 100%;
       S2032, calculating the average queue waiting time: current timestamp = 1716825600, n = 3 is the number of waiting jobs; average waiting time = [(1716825600 - 1716825480) + (1716825600 - 1716825360) + (1716825600 - 1716825240)] / 3 = (120 + 240 + 360) / 3 = 240 seconds;
       S2033, calculating the remaining available resources of a node: remaining CPU = totalCPU - usedCPU; remaining GPU = totalGPU - usedGPU; remaining memory = totalMem - usedMem;
       S2034, calculating the queue resource utilization: queue total CPU = totalNodesInQueue × totalCPU; queue used CPU = sum of the used CPUs of each node; queue CPU utilization = (queue used CPU / queue total CPU) × 100%;
       S2035, converting the calculation results into a readable format and storing them as a visualization data set;
     S204, the mobile terminal APP presents the resource state graphically according to the calculation results of S203 to help the user judge quickly, the presented content comprising: (1) node cards, each displaying "CPU load rate", "GPU load rate", "memory utilization", and "remaining CPU", with a load rate above 80% shown in red, 50%-80% in yellow, and below 50% in green, so that node busyness can be distinguished at a glance; (2) charts, in which a bar chart displays the average waiting time of the GPU-HIGH queue and of the GPU-NORMAL queue and a ring chart displays the CPU utilization of the GPU-HIGH queue, and the user can click a chart to view the calculation details; (3) filter controls, providing buttons including but not limited to "filter GPU load rate < 50%" and "filter average waiting time < 5 minutes", which after being clicked automatically filter out resources that do not meet the condition and display only the nodes/queues whose calculation results qualify;
     S205, the user clicks "GPU_001 node + GPU_HIGH queue", and the APP pops up a confirmation box displaying the key results calculated in step S203; at the same time, the mobile terminal APP compares the estimated requirements of the job to be submitted with the "remaining resources" calculated in step S203, namely remaining CPU ≥ job requirement, remaining GPU ≥ job requirement, and remaining memory ≥ job requirement; if the check passes, the selection proceeds, and if it fails, the APP pops up a prompt such as "the remaining CPU of the selected node is insufficient; the GPU_003 node with 60 remaining CPU cores is recommended" and displays the calculation results of the recommended node;
     S206, after the user confirms the selection, the mobile terminal APP caches the GPU_001 node ID, the GPU_HIGH queue ID, and the corresponding calculation results, and calls the resource locking API to send the scheduling system a pre-application request for 8 CPU cores and 1 GPU; the scheduling system judges allocability based on the remaining resources calculated in step S203 and returns a receipt confirming the pre-application, ensuring that the resources are not fully occupied before the user submits the job.
  4. The mobile-terminal-based intelligent computing job submission management method of claim 1, wherein in S3 the file management module operations include at least one of creating, viewing, deleting, renaming, moving, downloading, copying, and pasting files or directories; editing existing file content is realized by calling a Web code editor integrated in the APP or by using a command-line file editor through an embedded Web Shell terminal; the file management module supports executing file and directory operations in command-line mode through the Web Shell terminal integrated in the mobile terminal APP, and the Web Shell terminal is linked in real time with the terminal service of the Open OnDemand platform, so that operation logs are retained synchronously.
  5. The mobile-terminal-based intelligent computing job submission management method of claim 1, wherein in S4 the job submission form integrates a job template library function, the template library is built on the Job Composer module of the Open OnDemand platform, and the user can select a platform-preset intelligent computing job template or a saved user-defined job parameter template, which is synchronized to the platform cloud; the specific logic steps of S4 are as follows:
     S401, after the user opens the "job configuration" page of the mobile terminal APP, the system automatically connects to the back-end template library and displays two types of templates: platform-preset templates and user-defined templates previously saved by the user; if the user has previously submitted a job using a template, the system automatically reads the history and prefills the current job form with the default parameters of the last-used template, reducing repeated input;
     S402, the user filters templates by scenario tag according to the current job requirements and clicks any template to view its detailed parameters; if the template's default parameters do not meet the requirements, the user can modify them directly, and modified parameters are marked in a distinct color so that the user can check and adjust them easily;
     S403, if no suitable template is available, the user manually fills in all required parameters in the form, including the job name, execution command, required number of cores and memory, expected run time, and dependent software environment; after filling in the parameters, the user clicks the "save as template" button, enters a template name and tags, and may also select the template permissions; the system synchronizes the template to the cloud for later reuse, reports "template created successfully" after saving, and generates a unique template ID, so that the template can subsequently be found directly in the template library without repeated configuration;
     S404, the system automatically validates the form: it confirms that the required items (job name, execution command, target queue, and required number of cores) are all filled in and, if one is missing, shows a prompt such as "please select the queue to which the job is submitted"; it uses the remaining node resources determined in the earlier "resource discovery and selection" step to judge whether the resources required by the current job are within the allocatable range; and it checks whether the selected "dependent software environment" is installed on the target node and, if not, prompts that the software is not installed on the node and recommends switching to a node on which it is installed;
     S405, after the parameters pass validation, the system automatically converts the user-configured parameters into a standard "job description file" according to the operating rules of the target queue, the file containing the basic job information, the software environment loading command, the job execution command, and the log storage path;
     S406, after the user clicks the "submit job" button, the system displays a "job submission confirmation page" listing the key information (job name, target queue, required resources, and expected duration); after confirming that it is correct, the user clicks "confirm submission", the job description file is sent to the scheduling system of the HPC cluster, and a unique job ID is assigned; the system then reports the submission result, prompting "job submitted successfully, job ID: xxxx, currently queued" on success or the specific reason on failure, and saves the job ID, submission time, parameter configuration, and template usage record both locally and in the cloud, to facilitate later checking of job status and reuse of parameters or templates.
  6. The mobile-terminal-based intelligent computing job submission management method of claim 1, wherein in S5 the job status monitoring further comprises management of interactive job sessions, and the mobile terminal APP can display information on interactive sessions created through the Open OnDemand platform and provide controls for deleting and renewing sessions.
  7. The mobile-terminal-based intelligent computing job submission management method of claim 1, wherein in S6 the result acquisition and post-processing supports opening the visualization applications of the Open OnDemand platform directly through the mobile terminal APP, so that visual charts and engineering simulation image result files generated by the job can be previewed online without being downloaded locally.
  8. The mobile-terminal-based intelligent computing job submission management method of claim 1, wherein the specific logic steps of S5 are as follows:
     S501, after the user completes the job submission, the mobile terminal APP automatically starts the job monitoring function and detects the current network delay in real time: a delay greater than or equal to 1000 ms is judged as an unstable network, a delay greater than or equal to 300 ms and less than or equal to 1000 ms as a stable network, and a delay less than or equal to 300 ms as a good network;
     S502, when the network is stable or good, a WebSocket long connection is used and status updates are received in real time without polling; when the network is unstable, the APP automatically switches to periodic polling and calculates the actual polling interval, using the following formula:
     S503, the mobile terminal APP associates all jobs under the user account, pulls their initial states from the scheduling system as the monitoring baseline, and presents the acquired state data visually so that the user can check it conveniently and intuitively;
     S504, after receiving a state change, the mobile terminal APP processes it and reminds the user according to priority, the priority rule being: when the state changes to "completed" or "error", the user is reminded through a pop-up window and a notification bar message whose content comprises the job name, the state, and a suggested action;
     S505, the mobile terminal APP provides interaction controls that support the user in operating on a job, specifically:
       S5051, displaying the corresponding controls according to the current state of the job: a "terminate" button for "queued/running" jobs, a "re-queue" button for "error/terminated" jobs, and a "view log" button;
       S5052, after the user clicks "terminate", the mobile terminal APP sends a termination request to the scheduling system and, on success, updates the state to "terminated"; when the user clicks "re-queue", the mobile terminal APP reuses the original job configuration to submit a request and updates the state to "queued"; when the user clicks "view log", the execution log, error log, and system log can be viewed, keyword search is supported, and the matching degree is calculated by formula and displayed side by side:
       S5053, after each operation, the mobile terminal APP shows the result in a pop-up prompt so that the user knows the operation state;
     S506, the user can manually adjust the monitoring settings and view historical data: clicking "pause monitoring" in "settings" stops the mobile terminal APP from receiving status updates; clicking "resume monitoring" re-establishes the connection and pulls the latest state; clicking "history" makes the mobile terminal APP display the timeline and resource usage statistics of completed jobs, helping the user review how the jobs ran.
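
The indicator formulas of claim 3 (S2031 to S2034), the colour thresholds of S204, and the resource check of S205 can be sketched in a few lines of Python. This is only an illustrative sketch: the `NodeState` record and all function names are hypothetical and are not taken from the patent or from Open OnDemand. It assumes that a load rate of exactly 50% or 80% counts as yellow and that a total of zero yields a rate of 0.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class NodeState:
    """Hypothetical record of the per-node fields the gateway returns."""
    total_cpu: int
    used_cpu: int
    total_gpu: int
    used_gpu: int
    total_mem: float  # unit is an assumption, e.g. GB
    used_mem: float


def load_rate(used: float, total: float) -> float:
    """S2031: (used / total) x 100%. Returns 0.0 when total is 0 (assumption)."""
    return used / total * 100.0 if total else 0.0


def average_wait(now: int, submit_times: Sequence[int]) -> float:
    """S2032: mean of (now - submit time) over the waiting jobs, in seconds."""
    if not submit_times:
        return 0.0
    return sum(now - t for t in submit_times) / len(submit_times)


def remaining(node: NodeState) -> Dict[str, float]:
    """S2033: total minus used for CPU, GPU, and memory."""
    return {
        "cpu": node.total_cpu - node.used_cpu,
        "gpu": node.total_gpu - node.used_gpu,
        "mem": node.total_mem - node.used_mem,
    }


def queue_cpu_utilization(nodes: Sequence[NodeState]) -> float:
    """S2034: queue used CPU over queue total CPU, as a percentage.

    Summing per-node totals equals totalNodesInQueue x totalCPU
    when every node in the queue has the same CPU count.
    """
    used = sum(n.used_cpu for n in nodes)
    total = sum(n.total_cpu for n in nodes)
    return load_rate(used, total)


def load_color(rate: float) -> str:
    """S204: above 80% red, 50%-80% yellow, below 50% green."""
    if rate > 80:
        return "red"
    if rate >= 50:
        return "yellow"
    return "green"


def fits(node: NodeState, cpu: int, gpu: int, mem: float) -> bool:
    """S205: every remaining resource must be >= the job requirement."""
    free = remaining(node)
    return free["cpu"] >= cpu and free["gpu"] >= gpu and free["mem"] >= mem
```

With the timestamps given in S2032, `average_wait(1716825600, [1716825480, 1716825360, 1716825240])` reproduces the 240-second figure. A node with only 4 free cores fails `fits(node, 8, 1, 32.0)` for the 8-core, 1-GPU request of S206, which is the situation in which the claim recommends another node.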

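The transport selection of claim 8 (S501, S502), the reminder rule of S504, and the state-dependent controls of S5051 can be sketched as follows. The function names and the English state strings are hypothetical. The claim's delay ranges overlap at exactly 300 ms and 1000 ms, so this sketch assumes the slower class wins at each boundary. It also assumes the "view log" button is shown for every state, which the claim wording leaves open. The polling-interval formula is not reproduced in the source text, so it is not modelled here.

```python
from typing import List


def classify_network(delay_ms: float) -> str:
    """S501: map a measured delay to 'unstable', 'stable', or 'good'.

    Assumption: at exactly 1000 ms the network counts as unstable and
    at exactly 300 ms as stable, since the claim's ranges overlap there.
    """
    if delay_ms >= 1000:
        return "unstable"
    if delay_ms >= 300:
        return "stable"
    return "good"


def choose_transport(delay_ms: float) -> str:
    """S502: WebSocket when stable or good, periodic polling when unstable."""
    if classify_network(delay_ms) == "unstable":
        return "polling"
    return "websocket"


def needs_alert(state: str) -> bool:
    """S504: pop-up and notification-bar reminder on 'completed' or 'error'."""
    return state in {"completed", "error"}


def controls_for(state: str) -> List[str]:
    """S5051: buttons to show for a job in the given state."""
    buttons: List[str] = []
    if state in {"queued", "running"}:
        buttons.append("terminate")
    if state in {"error", "terminated"}:
        buttons.append("re-queue")
    # Assumption: log viewing is offered regardless of state.
    buttons.append("view log")
    return buttons
```

Keeping the classification separate from the transport choice means the thresholds can be changed in one place without touching the code that opens the connection.
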
Description

Intelligent computing job submission management method based on mobile terminal

Technical Field

The invention relates to the technical field of job submission management, and in particular to a mobile-terminal-based intelligent computing job submission management method.

Background

High-performance computing (HPC) and intelligent computing have become indispensable infrastructure in fields such as scientific research, engineering simulation, and artificial intelligence model training. Traditional HPC job submission and management relies heavily on command-line terminals and desktop workstations: users must interact with the job scheduling system through specific commands to write job scripts, transfer files, submit tasks, and monitor status.

However, as computing tasks grow more complex and application scenarios diversify, the traditional approach shows significant limitations: 1. operation has a high technical threshold, since users must be familiar with the Linux operating system and various scheduler commands, which is a barrier for researchers without a computing background; 2. users are tied to a fixed desktop environment, cannot conveniently manage long-running computing tasks while mobile, and cannot respond to job status in real time (for example, when a failed task needs to be resubmitted immediately), so flexibility is lacking.

Disclosure of Invention

To address the technical problems described in the background, the invention provides a mobile-terminal-based intelligent computing job submission management method.
The invention provides a mobile-terminal-based intelligent computing job submission management method in which a Web system is deployed on the open-source Open OnDemand platform and packaged into a mobile terminal APP, so as to realize visual one-stop management of an HPC platform. The method comprises the following steps:

S1, session establishment and authentication: the mobile terminal APP initiates a connection request to a Web service gateway deployed at the front end of the HPC cluster, and the user enters credentials through an authentication interface integrated in the mobile terminal APP to complete identity verification, thereby establishing a secure session;

S2, resource discovery and selection: the mobile terminal APP calls a cluster state query API to obtain and visually display the resource states of the currently available computing nodes and queues, the resource states comprising CPU/GPU load, memory utilization, and queue waiting conditions, and the user selects a target computing queue or node on a graphical interface based on the resource states;

S3, job preparation and file management: through a file manager module of the mobile terminal APP, the user accesses his or her HPC home directory or project directory graphically and performs, under that directory, at least one of creating an input file, uploading a local file to the cluster, editing the content of an existing file, or organizing a data set required by the job;

S4, job configuration and submission: the user configures the computing job parameters through a job submission form of the mobile terminal APP, the parameters comprising a job name, an execution command, the required number of cores, the memory size, the expected run time, and the dependent software environment, and the mobile terminal APP packages the configured job parameters into a standard job description script and submits it to the selected queue through a job scheduling system interface;

S5, real-time job status monitoring and interaction: the mobile terminal APP periodically polls the job scheduling system, or receives status updates from it over a WebSocket connection, and displays the job status on the mobile terminal interface in real time as a list or chart, the job status comprising "queued", "running", "completed", or "error", and also provides user interaction controls that allow the user to terminate a specified job, re-queue it, or view its detailed logs; and

S6, result acquisition and post-processing: when the job status is detected to have changed to "completed", the mobile terminal APP automatically notifies the user, and the user browses, previews, or downloads to the mobile device the standard output, standard error, and result files generated by the job through the file manager module.

Preferably, in step S1, the mobile terminal APP establishes the connection through the Web service interface of the encapsulated Open OnDemand platform, and the secure session uses the identity authentication system of the platform to support a