US-12626157-B2 - Identifying idle-cores in data centers using machine-learning (ML)

US12626157B2

Abstract

Apparatuses, systems, and techniques to determine a number of idle cores of a computing device using a machine learning (ML) model based on a set of processes executed by the computing device are described. One method determines a set of processes executed by the computing device and determines, using an ML model, a number of cores of the computing device to be powered down based at least on the set of processes. The method updates a first mode of the number of cores to a second mode in which the number of cores consumes less power than in the first mode.
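The core computation described in the abstract can be sketched as follows: predict per-process core demand, subtract the total from the available cores, and move the remainder to a lower-power mode. This is an illustrative sketch only; all function and variable names are hypothetical, not taken from the patent.

```python
# Illustrative sketch of the abstract's method. The per-process
# predictions play the role of the ML model's output; names here
# are hypothetical, not from the patent.

def cores_to_power_down(predictions, total_available):
    """Number of cores that can enter a low-power state, given
    predicted core counts (one per running process)."""
    needed = sum(predictions)
    return max(total_available - needed, 0)

def update_power_mode(core_ids, idle_count):
    """Move the last `idle_count` cores from the first (active) mode
    to a second, lower-power mode. Returns a core-to-mode map."""
    modes = {core: "active" for core in core_ids}
    for core in core_ids[len(core_ids) - idle_count:]:
        modes[core] = "low-power"
    return modes

if __name__ == "__main__":
    # Two processes predicted to need 3 and 2 cores on an 8-core device.
    idle = cores_to_power_down([3, 2], total_available=8)
    modes = update_power_mode(list(range(8)), idle)
    print(idle, sum(1 for m in modes.values() if m == "low-power"))  # 3 3
```

Clamping at zero matters: if predicted demand exceeds the available cores, no core should be powered down.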

Inventors

  • Yogesh Dangi
  • Manas Ranjan Jagadev
  • Sandip Kumar
  • Kiran Sutar

Assignees

  • NVIDIA CORPORATION

Dates

Publication Date
2026-05-12
Application Date
2022-09-29

Claims (18)

  1. A method comprising: determining, using a computing device comprising a plurality of cores, a set of processes executed by the computing device; predicting, using a machine learning (ML) model, a first number of cores to be utilized by a first process of the set of processes and a second number of cores to be utilized by a second process of the set of processes; determining a number of cores of the plurality of cores to be placed in a lower power state based at least on subtracting the first number and the second number from a total number of available cores; and updating a first mode of the number of cores to a second mode in which the number of cores consumes less power than in the first mode.
  2. The method of claim 1, wherein the ML model is trained using at least historical core utilization data for at least one process of the set of processes, and wherein the ML model is deployed to a second computing device operatively coupled to the computing device.
  3. The method of claim 1, wherein the ML model is trained using at least historical core utilization data for at least one process of the set of processes during a first amount of time, and wherein the number of cores is updated to the second mode for a second amount of time, wherein the first amount of time and the second amount of time comprise the same duration of time.
  4. The method of claim 1, wherein the total number of available cores is less than a total number of the plurality of cores.
  5. The method of claim 1, wherein: the predicting the first number of cores comprises predicting the first number of cores to be utilized by the first process for a next time period; the predicting the second number of cores comprises predicting the second number of cores to be utilized by the second process for the next time period; the number of cores is updated to the second mode for the next time period; and the ML model is trained using historical core utilization data for the first process and the second process during one or more previous time periods.
  6. The method of claim 1, further comprising: determining, using the computing device, a second set of processes executed by the computing device at a second time subsequent to the updating the number of cores to the second mode; determining, using the ML model, a second number of cores of the plurality of cores to be placed in a lower power state based on the second set of processes; and updating the first mode of the second number of cores to the second mode.
  7. The method of claim 1, further comprising: collecting, using the computing device, core utilization data for each process executed by the computing device during a first time period, the core utilization data comprising, for each process, a process identifier, a count of utilized cores, and a timestamp; and storing the core utilization data for each process as historical core utilization data in a database operatively coupled to the computing device, wherein the ML model is trained using the historical core utilization data for each process of the set of processes.
  8. The method of claim 1, wherein the ML model is at least one of a random forest regression model or a support vector machine (SVM) model.
  9. The method of claim 1, wherein the set of processes comprises at least one of an application, a job, a task, or a routine executed by the plurality of cores.
  10. A computing device comprising: a plurality of processing units to: determine a set of processes executed by the computing device; predict, using a machine learning (ML) model, a first number of cores to be utilized by a first process of the set of processes and a second number of cores to be utilized by a second process of the set of processes; determine a number of processing units of the plurality of processing units to be placed in a lower power state based at least on subtracting the first number and the second number from a total number of available cores; and update a first mode of the number of processing units to a second mode in which the number of processing units consumes less power than in the first mode.
  11. The computing device of claim 10, wherein the plurality of processing units is to: collect utilization data for each process executed by the computing device during a first time period, the utilization data comprising, for each process, a process identifier, a count of utilized processing units, and a timestamp; and store the utilization data for each process as historical utilization data in a database operatively coupled to the computing device, wherein the ML model is trained using the historical utilization data for each process of the set of processes, and wherein the ML model is deployed to a second computing device operatively coupled to the computing device.
  12. The computing device of claim 10, wherein the ML model is trained using at least historical utilization data for at least one process of the set of processes during a first amount of time, and wherein the number of processing units is updated to the second mode for a second amount of time, wherein the first amount of time and the second amount of time comprise the same duration of time.
  13. The computing device of claim 10, wherein the total number of available processing units is less than a total number of the plurality of processing units.
  14. The computing device of claim 10, wherein the plurality of processing units is further to: determine a second set of processes executed by the computing device at a second time subsequent to updating the number of processing units to the second mode; determine, using the ML model, a second number of processing units of the plurality of processing units to be placed in a lower power state based on the second set of processes; and update the first mode of the second number of processing units to the second mode.
  15. A system comprising: a first device comprising a plurality of cores; a second device hosting a machine learning (ML) model trained to predict a number of cores of the plurality of cores to be placed in a lower power state based at least on a given set of processes executed by the first device; and a database to store historical core utilization data for at least one process of the set of processes executed by the first device, wherein the first device is to: determine a first set of processes executed by the first device; predict, using the ML model, a first number of cores to be utilized by a first process of the first set of processes and a second number of cores to be utilized by a second process of the first set of processes; determine a first subset of cores of the plurality of cores to be powered down based at least on subtracting the first number and the second number from a total number of available cores; and update a first mode of the first subset of cores to a second mode in which the first subset of cores consumes less power than in the first mode.
  16. The system of claim 15, wherein the first device is further to: determine a second set of processes executed by the first device at a subsequent time to the first set of processes; determine, using the ML model, a second subset of cores of the plurality of cores to be placed in a lower power state based at least on the second set of processes; and update a first mode of the second subset of cores to a second mode of the second subset of cores in which the second subset of cores consumes less power than in the first mode.
  17. The system of claim 15, wherein the ML model is at least one of a random forest regression model or a support vector machine (SVM) model, and wherein the first set of processes comprises at least one of an application, a job, a task, or a routine executed by the plurality of cores.
  18. The system of claim 15, wherein the system comprises one or more of: a control system for an autonomous or semi-autonomous machine; a perception system for an autonomous or semi-autonomous machine; a system for performing simulation operations; a system for performing deep learning operations; a system for generating synthetic data; a system for generating multi-dimensional assets using a collaborative content platform; a system implemented using an edge device; a system implemented using a robot; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
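Claims 7 and 8 describe the training side: collect per-process records (process identifier, count of utilized cores, timestamp), store them as historical data, and train a model on them. The sketch below uses a per-process historical mean as a dependency-free stand-in for the random forest regression or SVM model named in claim 8; the class and function names are illustrative, not from the patent.

```python
# Sketch of the data-collection and prediction loop of claims 7 and 8.
# A per-process historical mean stands in for the random forest / SVM
# model of claim 8 so no ML library is required. Record fields follow
# claim 7: process identifier, count of utilized cores, timestamp.
from collections import defaultdict
from math import ceil
from statistics import mean

class CoreUsageStore:
    """In-memory stand-in for the database of claim 7."""
    def __init__(self):
        # process_id -> list of (timestamp, utilized_cores)
        self._records = defaultdict(list)

    def record(self, process_id, utilized_cores, timestamp):
        self._records[process_id].append((timestamp, utilized_cores))

    def predict(self, process_id):
        """Predicted core demand for the next period: mean of the
        history, rounded up so a process is never under-provisioned."""
        history = self._records[process_id]
        if not history:
            return 0
        return ceil(mean(cores for _, cores in history))

def plan_power_down(store, process_ids, total_available):
    """Cores to place in the lower-power state: available cores minus
    the summed per-process predictions (claims 1, 10, and 15)."""
    needed = sum(store.predict(pid) for pid in process_ids)
    return max(total_available - needed, 0)

if __name__ == "__main__":
    store = CoreUsageStore()
    store.record("job-a", 4, 1000)
    store.record("job-a", 2, 1060)
    store.record("job-b", 1, 1000)
    # job-a predicted at ceil(mean(4, 2)) = 3 cores, job-b at 1,
    # so 8 - 4 = 4 cores can be placed in the lower-power state.
    print(plan_power_down(store, ["job-a", "job-b"], total_available=8))  # 4
```

Rounding predictions up is a conservative choice: powering a core back up on a misprediction costs latency, so the stand-in model errs toward keeping cores active.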

Description

TECHNICAL FIELD

At least one embodiment pertains to processing resources used to perform and facilitate artificial intelligence. For example, at least one embodiment pertains to processors or computing systems used to train and use machine learning (ML) to identify and power down idle cores.

BACKGROUND

In multi-computing platforms and environments, such as data centers, supercomputers, high-performance computing (HPC) environments, cluster computing environments, or cloud computing environments, it is important to find idle or underutilized computing devices so that the usage of these computing devices can be more efficiently allocated by taking corrective actions. In a data center or cloud environment, it is important to save the power (energy) consumed by a server. The applications or jobs executing on a server may not consume all available central processing unit (CPU) cores on the server. Each CPU core, however, still consumes power, so the unutilized CPU cores result in wasted power.

BRIEF DESCRIPTION OF DRAWINGS

Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:

FIG. 1A is a block diagram of an idle-core identification system for identifying and powering down idle cores in an exemplary data center, according to at least one embodiment.
FIG. 1B is a block diagram of an idle-core identification system for identifying and powering down idle cores, according to at least one embodiment.
FIG. 2 illustrates a method of identifying and powering down idle cores, in accordance with one embodiment.
FIG. 3 illustrates a method of identifying and powering down idle cores, in accordance with one embodiment.
FIG. 4 is a block diagram of a data center with multiple servers, each with a service to update CPU cores in a power-saving mode, according to at least one embodiment.
FIG. 5 is a block diagram of a data center with multiple servers, each with an agent to collect CPU usage data, according to at least one embodiment.
FIG. 6 illustrates a table with core usage data, according to at least one embodiment.
FIG. 7 is an example data flow diagram of a process for identifying and powering down idle cores, according to at least one embodiment.
FIG. 8 is a table of power consumption savings from powering down cores in the power-saving mode, according to various embodiments.
FIG. 9 illustrates the training and deployment of a neural network, according to at least one embodiment.
FIG. 10 is a flow diagram of a method of identifying idle cores, according to at least one embodiment.
FIG. 11 is a flow diagram of a method of training a machine learning (ML) model for predicting a CPU core requirement of a computing device, according to at least one embodiment.
FIG. 12A illustrates inference and/or training logic, according to at least one embodiment.
FIG. 12B illustrates inference and/or training logic, according to at least one embodiment.
FIG. 13 illustrates an example data center system, according to at least one embodiment.
FIG. 14 illustrates a computer system, according to at least one embodiment.
FIG. 15 illustrates a computer system, according to at least one embodiment.
FIG. 16 illustrates at least portions of a graphics processor, according to one or more embodiments.
FIG. 17 illustrates at least portions of a graphics processor, according to one or more embodiments.
FIG. 18 is an example data flow diagram for an advanced computing pipeline, in accordance with at least one embodiment.
FIG. 19 is a system diagram for an example system for training, adapting, instantiating, and deploying machine learning models in an advanced computing pipeline, in accordance with at least one embodiment.
FIG. 20A and FIG. 20B illustrate a data flow diagram for a process to train a machine learning model, as well as a client-server architecture to enhance annotation tools with pre-trained annotation models, in accordance with at least one embodiment.

DETAILED DESCRIPTION

Idle-Core Identification Systems

Embodiments described herein are directed to determining idle cores using ML-based techniques to save power in data centers. Multiple cores can be located in a computing device. The computing devices can be CPUs, graphics processing units (GPUs), data processing units (DPUs), or the like. The cores of these computing devices can also be implemented as components in devices such as (for example and without limitation) machines, computers, servers, or network devices. These computing devices are important resources in a data center or cloud environment, and efficient, effective monitoring and management of these resources is essential. As described above, it is important to identify idle cores so that their usage can be enhanced by taking corrective actions. Saving the power (energy) consumed by computing devices, such as servers, is a priority in the data center environment. In conventional systems, operating systems can use power-savin