JP-2026076225-A - Techniques for modifying cluster computing environments
Abstract
[Problem] To provide systems, devices, and methods for intelligently adjusting a set of worker nodes within a computing cluster. [Solution] A computing device or service monitors performance metrics of a set of worker nodes in a computing cluster and, upon detecting a performance metric below a performance threshold, performs a first adjustment to the number of worker nodes in the cluster. It obtains training data based at least in part on the first adjustment and uses that data with a supervised learning technique to train a machine learning model to predict future performance changes in the cluster. It then provides subsequent performance metrics and/or cluster metadata to the machine learning model to obtain an output indicating a predicted performance change, and performs an additional adjustment to the number of worker nodes based at least in part on this output. [Selected Drawing] Figure 1
Inventors
- Akinapelli, Sandeep (アキナペッリ,サンディープ)
- Das, Devaraj (ダス,ディーバラジ)
- Kavali, Devarajulu (カバリ,ディーバラジュル)
- Jaiswal, Puneet (ジェイスワル,プニート)
- Radanovic, Velimir (ラダノビック,ベリミール)
Assignees
- Oracle International Corporation (オラクル・インターナショナル・コーポレイション)
Dates
- Publication Date: 2026-05-11
- Application Date: 2026-01-16
- Priority Date: 2020-11-10
Claims (20)
- A computer-implemented method comprising: monitoring, by a computing service, one or more performance metrics of a set of worker nodes in a computing cluster; detecting, by the computing service, that a performance metric falls below a performance threshold; in response to detecting that the performance metric falls below the performance threshold, performing, by the computing service, a first adjustment to the number of worker nodes in the set of worker nodes of the computing cluster; obtaining, by the computing service, training data for a machine learning model based at least in part on performing the first adjustment; training, by the computing service, the machine learning model using the training data and a supervised machine learning algorithm; obtaining, by the computing service, an output indicating a predicted performance change in the computing cluster, the output being obtained based at least in part on providing one or more subsequent performance metrics of the computing cluster as input to the machine learning model; and performing, by the computing service, a second adjustment to the set of worker nodes based at least in part on the output indicating the predicted performance change in the computing cluster.
- The computer-implemented method according to claim 1, wherein adjusting the set of worker nodes further includes the computing service generating a scaling task, the scaling task being performed by a computing process, and the computing process updating metadata associated with the computing cluster upon completion of the scaling task.
- The computer-implemented method according to claim 1, wherein the output indicating the predicted performance change indicates how many worker nodes will be used to compute a task at a later time, the later time occurring within a predetermined period in the future.
- The computer-implemented method according to claim 1, wherein performing the first adjustment or the second adjustment includes increasing the number of worker nodes or decreasing the number of worker nodes.
- The computer-implemented method according to claim 1, wherein performing the first adjustment includes provisioning a certain number of additional worker nodes to the set of worker nodes in the computing cluster.
- The computer-implemented method according to claim 5, further comprising determining that provisioning the certain number of additional worker nodes resulted in a subsequent performance metric exceeding the performance threshold, wherein the training data is generated in response to determining that the certain number of additional worker nodes resulted in the subsequent performance metric.
- The computer-implemented method according to claim 6, wherein the training data includes the one or more performance metrics, the subsequent performance metric, and the certain number of additional worker nodes provisioned during the first period.
- The computer-implemented method according to claim 1, wherein the one or more performance metrics include at least one of: a number of pending queries, a number of pending tasks, a latency measurement, processing utilization, or memory utilization.
- A computing device comprising: one or more processing devices communicatively coupled to a computer-readable medium; and the computer-readable medium, which stores non-transitory computer-executable program instructions that, when executed by the one or more processing devices, cause the computing device to perform operations comprising: monitoring performance metrics of a set of worker nodes in a computing cluster; detecting that a performance metric falls below a performance threshold; in response to detecting that the performance metric falls below the performance threshold, performing a first adjustment to the number of worker nodes in the set of worker nodes of the computing cluster; obtaining training data for a machine learning model based at least in part on the first adjustment; training the machine learning model using the training data and a supervised machine learning algorithm; obtaining an output indicating a predicted performance change in the computing cluster, the output being obtained based at least in part on providing subsequent performance metrics of the computing cluster as input to the machine learning model; and performing a second adjustment to the set of worker nodes based at least in part on the output indicating the predicted performance change in the computing cluster.
- The computing device according to claim 9, wherein adjusting the set of worker nodes further includes the computing service generating a scaling task, the scaling task being performed by a computing process, and the computing process updating metadata associated with the computing cluster upon completion of the scaling task.
- The computing device according to claim 10, wherein the output indicating the predicted performance change indicates how many worker nodes will be used to compute a task at a later time, the later time occurring within a predetermined period in the future.
- The computing device according to claim 9, wherein performing the first adjustment or the second adjustment includes increasing or decreasing the number of worker nodes.
- The computing device according to claim 9, wherein performing the first adjustment includes provisioning a certain number of additional worker nodes to the set of worker nodes in the computing cluster, wherein the operations further comprise determining that provisioning the certain number of additional worker nodes resulted in a subsequent performance metric exceeding the performance threshold, and wherein the training data is generated in response to determining that the certain number of additional worker nodes resulted in the subsequent performance metric.
- The computing device according to claim 13, wherein the training data includes the one or more performance metrics, the subsequent performance metric, and the certain number of additional worker nodes provisioned during the first period.
- A non-transitory computer-readable storage medium storing computer-executable program instructions that, when executed by a processing device of a computing device, cause the computing device to perform operations comprising: monitoring performance metrics of a set of worker nodes in a computing cluster; detecting that a performance metric falls below a performance threshold; in response to detecting that the performance metric falls below the performance threshold, performing a first adjustment to the number of worker nodes in the set of worker nodes of the computing cluster; obtaining training data for a machine learning model based at least in part on the first adjustment; training the machine learning model using the training data and a supervised machine learning algorithm; obtaining an output indicating a predicted performance change in the computing cluster, the output being obtained based at least in part on providing subsequent performance metrics of the computing cluster as input to the machine learning model; and performing a second adjustment to the set of worker nodes based at least in part on the output indicating the predicted performance change in the computing cluster.
- The non-transitory computer-readable storage medium according to claim 15, wherein adjusting the set of worker nodes further includes the computing service generating a scaling task, the scaling task being performed by a computing process, and the computing process updating metadata associated with the computing cluster upon completion of the scaling task.
- The non-transitory computer-readable storage medium according to claim 16, wherein the output indicating the predicted performance change indicates how many worker nodes will be used to compute a task at a later time, the later time occurring within a predetermined period in the future.
- The non-transitory computer-readable storage medium according to claim 15, wherein performing the first adjustment or the second adjustment includes increasing or decreasing the number of worker nodes.
- The non-transitory computer-readable storage medium according to claim 15, wherein performing the first adjustment includes provisioning a certain number of additional worker nodes to the set of worker nodes in the computing cluster, wherein the operations further comprise determining that provisioning the certain number of additional worker nodes resulted in a subsequent performance metric exceeding the performance threshold, and wherein the training data is generated in response to determining that the certain number of additional worker nodes resulted in the subsequent performance metric.
- The non-transitory computer-readable storage medium according to claim 19, wherein the training data includes the one or more performance metrics, the subsequent performance metric, and the certain number of additional worker nodes provisioned during the first period.
Description
Reference to Related Applications
This application claims priority to U.S. Patent Application No. 17/094,715, filed November 10, 2020, entitled “Techniques for Modifying Cluster Computing Environments,” the disclosure of which is incorporated herein by reference in its entirety for all purposes.
Background
Distributed computing systems are becoming increasingly prevalent. These systems may include computing clusters of connected nodes (e.g., computers, servers, virtual machines, etc.) that work together to handle various requests (e.g., requests to store and/or retrieve data in a system that maintains a database). As the number of tasks increases or decreases, the number of connected nodes may become suboptimal. For example, if the number of tasks decreases, the number of nodes may be greater than needed for the pending tasks, leaving some nodes idle. Conversely, if the number of tasks increases, the number of nodes may be less than needed to handle the pending tasks efficiently, which increases the latency of executing those tasks. Improvements may be made to how conventional systems manage the number of nodes in a computing cluster. Embodiments of this disclosure address these and other issues individually and collectively.
Overview
This invention provides techniques (e.g., methods, systems, and non-transitory computer-readable media storing code or instructions executable by one or more processors) for adjusting the number of nodes in a computing cluster in response to actual and/or predicted changes in one or more performance metrics of the computing cluster. This specification describes various embodiments, including methods, systems, and non-transitory computer-readable storage media storing programs, code, or instructions executable by one or more processors.
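As a rough illustration of the mismatch described in the Background, the following minimal sketch classifies a cluster as over- or under-provisioned for its pending work. It is not taken from the disclosure: the function name `classify_capacity` and the fixed `tasks_per_node` capacity are assumptions made purely for illustration.

```python
def classify_capacity(pending_tasks: int, worker_nodes: int,
                      tasks_per_node: int = 10) -> str:
    """Compare the current node count with the count the pending work needs.

    `tasks_per_node` is a hypothetical capacity figure, not a value from
    the disclosure.
    """
    # Ceiling division: nodes needed to hold all pending tasks (at least one).
    needed = max(1, -(-pending_tasks // tasks_per_node))
    if worker_nodes > needed:
        return "over-provisioned"   # some nodes would sit idle
    if worker_nodes < needed:
        return "under-provisioned"  # pending tasks would see added latency
    return "balanced"
```

Under these assumptions, a cluster of 5 nodes with 100 pending tasks is classified as under-provisioned, and the same cluster with 20 pending tasks as over-provisioned.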
One embodiment relates to a method for adjusting the number of nodes in a computing cluster based at least in part on actual and/or predicted changes in one or more performance metrics of the computing cluster. The method may include a computing service monitoring one or more performance metrics of a set of worker nodes in the computing cluster. The method may further include the computing service detecting that a performance metric falls below a performance threshold. In response to detecting that the performance metric is below the performance threshold, the method may further include the computing service performing a first adjustment to the number of worker nodes in the set of worker nodes of the computing cluster. The method may further include the computing service obtaining training data for a machine learning model based at least in part on performing the first adjustment. The method may further include the computing service training the machine learning model using the training data and a supervised machine learning algorithm. The method may further include the computing service obtaining an output indicating a predicted performance change in the computing cluster. In some embodiments, the output is obtained based at least in part on providing one or more subsequent performance metrics of the computing cluster as input to the machine learning model. The method may further include the computing service performing a second adjustment to the set of worker nodes based at least in part on the output indicating the predicted performance change in the computing cluster. In some embodiments, adjusting the set of worker nodes further includes the computing service generating a scaling task, which is executed by a computing process, and the computing process updates metadata associated with the computing cluster upon completion of the scaling task.
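The sequence of steps in this embodiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the performance metric is assumed to be a single normalized score where higher is better, the first adjustment adds a fixed `step` of nodes, and a one-parameter least-squares regression stands in for the unspecified supervised machine learning algorithm. All names (`TrainingRecord`, `first_adjustment`, `fit_gain_per_node`, `predict_change`, `second_adjustment`) are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrainingRecord:
    """One labeled example produced by an observed adjustment (assumed layout)."""
    metric_before: float  # performance metric that triggered the adjustment
    nodes_added: int      # number of additional worker nodes provisioned
    metric_after: float   # subsequent performance metric (the label)


def first_adjustment(nodes: int, metric: float, threshold: float,
                     step: int) -> Tuple[int, int]:
    """Reactive rule: add `step` worker nodes when the metric is below the threshold.

    Returns the new node count and the number of nodes added.
    """
    if metric < threshold:
        return nodes + step, step
    return nodes, 0


def fit_gain_per_node(records: List[TrainingRecord]) -> float:
    """Least-squares fit through the origin of metric improvement per added node."""
    num = sum(r.nodes_added * (r.metric_after - r.metric_before) for r in records)
    den = sum(r.nodes_added ** 2 for r in records)
    return num / den if den else 0.0


def predict_change(gain_per_node: float, nodes_added: int) -> float:
    """Predicted performance change if `nodes_added` nodes were provisioned."""
    return gain_per_node * nodes_added


def second_adjustment(metric: float, threshold: float,
                      gain_per_node: float) -> int:
    """Use the fitted model to pick how many nodes to add to reach the threshold."""
    if metric >= threshold or gain_per_node <= 0:
        return 0
    return math.ceil((threshold - metric) / gain_per_node)
```

For example, given two records in which adding 2 and 4 nodes improved the metric by roughly 0.2 and 0.4, the fitted gain is about 0.1 per node, so a subsequent metric of 0.55 against a threshold of 0.8 yields a second adjustment of 3 additional nodes. A simple linear model is used here only to keep the sketch self-contained; the disclosure does not limit the supervised algorithm to this choice.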
In some embodiments, the output indicating the predicted performance change indicates how many worker nodes will be used to compute a task at a later time, where the later time occurs within a predetermined period in the future. In some embodiments, performing the first adjustment or the second adjustment includes increasing or decreasing the number of worker nodes in the set of worker nodes. In some embodiments, performing the first adjustment includes provisioning a certain number of additional worker nodes to the set of worker nodes in the computing cluster. In some embodiments, the method may further include determining that provisioning the certain number of additional worker nodes resulted in a subsequent performance metric exceeding the performance threshold, and the training data is generated in response to determining that the certain number of additional worker nodes resulted in the subsequent performance metric. In some embodiments, the training data includes the one or more performance metrics, the subsequent performance metric, and the certain number of additional worker nodes provisioned during the first period. In some embodiments, the one or more performance metrics include at least one of the following: a number of pending queries, a number of pending tasks, a latency measurement, processing utilization, or memory utilization. Another embodiment relates