KR-20260067527-A - A MULTIPLE SUBSURROGATE APPROACH FOR CONFIGURATION TUNING OF APACHE SPARK APPLICATIONS
Abstract
A multiple subsurrogate approach for tuning the configuration of an Apache Spark application is disclosed. According to an embodiment of the present invention, a configuration tuning method for an application, performed by a computing device including at least a processor and utilizing Bayesian optimization, comprises the steps of: sampling candidate evaluation points for a configuration setting of a target function of the application, the application including a plurality of jobs; predicting an observation value for each of the candidate evaluation points; calculating an acquisition function value for each of the candidate evaluation points using the predicted observation values; determining the candidate evaluation point having the highest acquisition function value among the candidate evaluation points as the next evaluation point; performing the target function at the next evaluation point; and including the result of performing the target function in the observation values.
Inventors
- 정연돈
- 윤현식
Assignees
- 고려대학교 산학협력단
Dates
- Publication Date: 2026-05-13
- Application Date: 2024-11-06
Claims (9)
- A configuration tuning method for an application, performed by a computing device including at least a processor and utilizing Bayesian Optimization (BO), the method comprising: sampling candidate evaluation points for a configuration setting of a target function, the target function being an application including a plurality of jobs; predicting an observation value for each of the candidate evaluation points; calculating an acquisition function value for each of the candidate evaluation points using the predicted observation values; determining the candidate evaluation point having the highest acquisition function value among the candidate evaluation points as the next evaluation point; performing the target function at the next evaluation point; and including a result of performing the target function in the observation values.
- The configuration tuning method of claim 1, wherein the target function is an Apache Spark application.
- The configuration tuning method of claim 2, wherein the predicting of the observation value comprises predicting the observation value using a plurality of partial surrogate models, each of which represents at least one of the plurality of jobs, wherein the jobs represented by the respective partial surrogate models do not overlap with one another, and each of the plurality of jobs is represented by one of the plurality of partial surrogate models.
- The configuration tuning method of claim 3, wherein the predicting of the observation value comprises: estimating, using the plurality of partial surrogate models, the execution time of each surrogated job; calculating a mean and a variance of the estimated execution times; and estimating a mean and a variance of the execution time of the application using the calculated means and variances.
- The configuration tuning method of claim 4, wherein the calculating of the acquisition function value for each of the candidate evaluation points comprises calculating an Expected Improvement (EI), a Probability of Improvement (PI), or an Upper Confidence Bound (UCB) for each of the candidate evaluation points.
- The configuration tuning method of claim 5, wherein the steps from the sampling to the including in the observation values are performed repeatedly until a termination condition is reached.
- The configuration tuning method of claim 6, wherein the application is a machine learning application comprising at least one of resource allocation, data loading, data cleaning, feature extraction, training, and result aggregation.
- The configuration tuning method of claim 7, further comprising, prior to the sampling, initializing the plurality of partial surrogate models.
- The configuration tuning method of claim 8, further comprising, after the including in the observation values, outputting a configuration setting value exhibiting the shortest execution time.
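The claimed tuning loop (claims 1, 4, 5, 6, and 9) can be sketched as follows. This is a minimal, self-contained illustration, not the patented implementation: the class and function names are hypothetical, the toy sub-model stands in for a real probabilistic surrogate (e.g., a Gaussian process), and the aggregation assumes independent jobs so that per-job means and variances are summed, following the combination described in claim 4.

```python
import math
import random

class ToySubModel:
    """Hypothetical partial surrogate for one Spark job (illustrative only).
    A real implementation would fit a probabilistic model to observations."""
    def __init__(self, weight):
        self.weight = weight
        self.observations = []          # (config, time) pairs seen so far

    def predict(self, config):
        # Stand-in for a posterior mean of the job's execution time.
        return self.weight * (config - 3.0) ** 2 + 1.0

    def variance(self, config):
        # Stand-in for posterior variance; shrinks as observations accumulate.
        return 1.0 / (1.0 + len(self.observations))

def predict_application(config, sub_models):
    """Claim 4: combine per-job estimates into an application-level estimate
    (assumption: independent jobs, so means and variances are summed)."""
    mean = sum(m.predict(config) for m in sub_models)
    var = sum(m.variance(config) for m in sub_models)
    return mean, var

def expected_improvement(mean, var, best):
    """Claim 5: EI acquisition for minimizing execution time."""
    std = math.sqrt(var)
    if std == 0.0:
        return max(best - mean, 0.0)
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    return (best - mean) * cdf + std * pdf

def tune(target, sub_models, iterations=10, seed=0):
    """Claims 1 and 6: repeat sample -> predict -> acquire -> evaluate ->
    record, until the termination condition (here, an iteration budget)."""
    rng = random.Random(seed)
    observations = []                   # (config, total_time) pairs
    best = float("inf")
    for _ in range(iterations):
        candidates = [rng.uniform(0.0, 6.0) for _ in range(20)]
        scores = []
        for c in candidates:
            mean, var = predict_application(c, sub_models)
            scores.append(expected_improvement(mean, var, best))
        x_next = candidates[scores.index(max(scores))]
        y = target(x_next)              # run the application once at x_next
        observations.append((x_next, y))
        best = min(best, y)
        for m in sub_models:            # claim 1: fold the result back in
            m.observations.append((x_next, y))
    # Claim 9: output the configuration with the shortest observed time.
    return min(observations, key=lambda o: o[1])
```

Here a one-dimensional configuration stands in for Spark's many-dimensional configuration space; the loop structure is unchanged by dimensionality.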
Description
A Multiple Subsurrogate Approach for Configuration Tuning of Apache Spark Applications

The present invention relates to Bayesian Optimization (BO), and more specifically, to a method for improving tuning performance by using multiple surrogate models in Bayesian Optimization when tuning the configuration of Apache Spark applications.

In the era of big data, a massive amount of data is generated every second. Big data analysis systems such as Apache Spark (see Non-Patent Literature 1 and 2) and Apache Hadoop (see Non-Patent Literature 3) have been adopted to extract hidden insights from big data or to train machine learning models. Apache Spark is a widely used big data analysis system due to its fast processing speed and its high-level libraries such as SparkSQL (see Non-Patent Literature 4), Spark Streaming (see Non-Patent Literature 5), and SparkML (see Non-Patent Literature 6).

In each execution of a Spark application, configuration settings (e.g., the number of executors, memory configuration, level of parallelism, etc.) can be submitted along with the application. The execution time of an application depends significantly on these configuration settings. In particular, configuration tuning is especially important for periodic applications, which account for a significant portion of the jobs executed in Spark (see Non-Patent Literature 7, 8). However, because Spark's configuration space is vast, manual tuning requires not only expert-level system knowledge but also a considerable amount of time. Furthermore, since it is impractical to attempt numerous test runs on a real cluster, configuration tuning must be performed quickly and online. To address this, several online tuning approaches have been proposed (see Non-Patent Literature 7, 8, 10-16).
Among them, Bayesian Optimization (BO)-based tuning is the most widely used because it does not require a large amount of training data, is tolerant of uncertainty, and is non-parametric in nature. BO iteratively performs adaptive sampling of a configuration value x_next and evaluates x_next. Sampling is performed using a surrogate model and an acquisition function. The surrogate model is a probabilistic model that estimates the execution time of the application at a given configuration value. The acquisition function is defined on top of the surrogate model and is used to recommend x_next. The surrogate model's estimates become more accurate as more evaluation results (observations) are accumulated.

On the other hand, relying on a single surrogate model in BO can lead to several problems, such as the following.

Problem 1 [Complex Black Box]. Spark applications consist of various execution operations such as shuffle, caching, map, and reduce, and configuration settings affect all of these operations. Additionally, due to the nature of distributed systems, the evaluation results may contain noise. Consequently, estimating execution time in Spark places a heavy burden on a single surrogate model, which can lead to inaccurate estimates in the early stages of BO.

Problem 2 [Evaluation Failure]. When BO evaluates various configuration values, an evaluation may fail (i.e., the application may fail). If the application fails midway, the execution log still contains the partial evaluation results obtained up to the failure. However, since a single surrogate model requires the entire execution time, most existing studies either cannot use partial results or treat failed evaluations as a predefined upper bound on execution time (see Non-Patent Literature 18). These approaches do not fully utilize the evaluation results and may impair estimation accuracy.
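The partial-result idea behind Problem 2 can be illustrated with a short, hypothetical sketch (the names and data structures are illustrative, not from the patent): because each job has its own sub-model, a run that fails after some jobs complete can still update the sub-models of the jobs that finished, whereas a single monolithic surrogate would have to discard the run entirely.

```python
class JobModel:
    """Hypothetical per-job partial surrogate: it only needs that one job's
    execution time, not the whole application's end-to-end time."""
    def __init__(self):
        self.samples = []               # observed execution times for this job

    def observe(self, seconds):
        self.samples.append(seconds)

def record_run(job_times, sub_models):
    """Fold a (possibly partial) run into the per-job sub-models.
    `job_times` maps job index -> measured time; jobs missing from the
    mapping did not complete before the failure."""
    updated = 0
    for i, model in enumerate(sub_models):
        if i in job_times:              # this job finished before the failure
            model.observe(job_times[i])
            updated += 1
    return updated
```

For example, if an application with four jobs fails during its third job, the first two sub-models still gain one observation each from the execution log.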
Problem 3 [Application Modification]. After optimization, the application may be modified. Consider a machine learning application whose pipeline consists of data loading, data cleaning, feature extraction, and iterative training. If the user decides to change the number of features, the application's execution time may be affected. Even if only the feature extraction step is modified, the target function changes, so evaluation results must be collected again from scratch.

To address these problems, the present invention proposes a novel Multiple Subsurrogate (MSS) approach that surrogates a Spark application F with multiple models rather than a single model. Furthermore, the invention introduces a method for solving the aforementioned Problems 1-3 through MSS-based Bayesian Optimization (BO), along with Spark application configuration tuning using MSS-based BO. MSS-based BO can be easily integrated with existing BO-based configuration tuning approaches. The contributions of the present invention are summarized as follows.

- We propose a new