US-12619598-B2 - Data allocation with user interaction in a machine learning system

Abstract

Various embodiments are provided for providing enhanced data allocation for machine learning operations in a computing environment by one or more processors in a computing system. One or more data sampling strategies may be determined based on a dataset. One or more enhanced training data allocations may be suggested for machine learning operations in a cloud computing environment based on the one or more data sampling strategies.

Inventors

  • Bei Chen
  • Massimiliano MATTETTI
  • Rahul Nair
  • Elizabeth Daly
  • Oznur ALKAN

Assignees

  • INTERNATIONAL BUSINESS MACHINES CORPORATION

Dates

Publication Date: 2026-05-05
Application Date: 2021-11-02

Claims (20)

  1. A method for improving machine learning operations within a cloud computing environment when training a machine learning model for execution within the cloud computing environment, the method comprising: determining one or more training data sampling strategies based on a dataset hosted using a cloud computing service in a cloud computing environment; predicting a cost in resource utilization of each of the one or more training data sampling strategies in terms of a cost of data storage in a cloud object and in terms of a cost of training a machine learning model in the cloud computing environment to consume data provided by each of the one or more data sampling strategies; predicting a degree of impact to accuracy of the one or more training data sampling strategies; suggesting as part of a data pre-processing step of an automated machine learning model prediction service one or more enhanced training data allocations for machine learning operations in a cloud computing environment based on the one or more data sampling strategies, the one or more enhanced training data allocations each having one or more portions of data removed from the cloud computing environment to minimize cost of training the machine learning model while adhering to constraints of accuracy and run time; and training the machine learning model utilizing the suggested enhanced training data allocations from the dataset, the machine learning model trained with the one or more portions of data removed from the dataset prior to training of the machine learning model to minimize the cost of training the machine learning model while adhering to the constraints of accuracy and run time.
  2. The method of claim 1, further including receiving, as the dataset, a plurality of data types and data features, wherein the plurality of data types includes at least tabular data and timeseries data and the data features include at least a change point, seasonality data, and clustered data.
  3. The method of claim 1, further including: applying a forward allocation as a data sampling strategy for tabular data; applying a backward allocation as a data sampling strategy for time series data; applying a stratified sampling as a data sampling strategy for clustered data; applying constraint sampling to include a defined time period as a data sampling strategy for seasonal data; and using a change point detection as a data sampling strategy for abnormal data.
  4. The method of claim 1, further including collecting feedback data based on the one or more data sampling strategies.
  5. The method of claim 1, further including providing one or more t-shirt size options, data storage options, and the one or more data sampling strategies for suggesting one or more enhanced training data allocations.
  6. The method of claim 1, further including providing a projected learning curve for the machine learning operations and benefit tradeoffs for each of the one or more enhanced training data allocations.
  7. The method of claim 1, further including predicting a degree of impact on the dataset for each of the one or more enhanced training data allocations based on a training accuracy, training time, the dataset, computing hardware configurations, and the one or more data sampling strategies.
  8. A system for improving machine learning operations within a cloud computing environment when training a machine learning model for execution within the cloud computing environment, the system comprising: one or more computers with executable instructions that when executed cause the system to: determine one or more training data sampling strategies based on a dataset hosted using a cloud computing service in a cloud computing environment; predicting a cost in resource utilization of each of the one or more training data sampling strategies in terms of a cost of data storage in a cloud object and in terms of a cost of training a machine learning model in the cloud computing environment to consume data provided by each of the one or more data sampling strategies; predicting a degree of impact to accuracy of the one or more training data sampling strategies; suggest as part of a data pre-processing step of an automated machine learning model prediction service one or more enhanced training data allocations for machine learning operations in a cloud computing environment based on the one or more data sampling strategies, the one or more enhanced training data allocations each having one or more portions of data removed from the cloud computing environment to minimize cost of training the machine learning model while adhering to constraints of accuracy and run time; and training the machine learning model utilizing the suggested enhanced training data allocations from the dataset, the machine learning model trained with the one or more portions of data removed from the dataset prior to training of the machine learning model to minimize the cost of training the machine learning model while adhering to constraints of accuracy and run time.
  9. The system of claim 8, wherein the executable instructions when executed cause the system to receive, as the dataset, a plurality of data types and data features, wherein the plurality of data types includes at least tabular data and timeseries data and the data features include at least a change point, seasonality data, and clustered data.
  10. The system of claim 8, wherein the executable instructions when executed cause the system to: apply a forward allocation as a data sampling strategy for tabular data; apply a backward allocation as a data sampling strategy for time series data; apply a stratified sampling as a data sampling strategy for clustered data; apply constraint sampling to include a defined time period as a data sampling strategy for seasonal data; and use a change point detection as a data sampling strategy for abnormal data.
  11. The system of claim 8, wherein the executable instructions when executed cause the system to collect feedback data based on the one or more data sampling strategies.
  12. The system of claim 8, wherein the executable instructions when executed cause the system to provide one or more t-shirt size options, data storage options, and the one or more data sampling strategies for suggesting one or more enhanced training data allocations.
  13. The system of claim 8, wherein the executable instructions when executed cause the system to provide a projected learning curve for the machine learning operations and benefit tradeoffs for each of the one or more enhanced training data allocations.
  14. The system of claim 8, wherein the executable instructions when executed cause the system to predict a degree of impact on the dataset for each of the one or more enhanced training data allocations based on a training accuracy, training time, the dataset, computing hardware configurations, and the one or more data sampling strategies.
  15. A computer program product for improving machine learning operations within a cloud computing environment when training a machine learning model for execution within the cloud computing environment, the computer program product comprising: one or more computer readable storage media, and program instructions collectively stored on the one or more computer readable storage media, the program instructions comprising: program instructions to determine one or more training data sampling strategies based on a dataset hosted using a cloud computing service in a cloud computing environment; program instructions to predict a cost in resource utilization of each of the one or more training data sampling strategies in terms of a cost of data storage in a cloud object and in terms of a cost of training a machine learning model in the cloud computing environment to consume data provided by each of the one or more data sampling strategies; program instructions to suggest as part of a data pre-processing step of an automated machine learning model prediction service one or more enhanced training data allocations for machine learning operations in a cloud computing environment based on the one or more data sampling strategies, the one or more enhanced training data allocations each having one or more portions of data removed from the cloud computing environment to minimize cost of training the machine learning model while adhering to constraints of accuracy and run time; and program instructions to train the machine learning model utilizing the suggested enhanced training data allocations from the dataset, the machine learning model trained with the one or more portions of data removed from the dataset prior to training of the machine learning model to minimize the cost of training the machine learning model while adhering to the constraints of accuracy and run time.
  16. The computer program product of claim 15, further including program instructions to receive, as the dataset, a plurality of data types and data features, wherein the plurality of data types includes at least tabular data and timeseries data and the data features include at least a change point, seasonality data, and clustered data.
  17. The computer program product of claim 15, further including program instructions to: apply a forward allocation as a data sampling strategy for tabular data; apply a backward allocation as a data sampling strategy for time series data; apply a stratified sampling as a data sampling strategy for clustered data; apply constraint sampling to include a defined time period as a data sampling strategy for seasonal data; and use a change point detection as a data sampling strategy for abnormal data.
  18. The computer program product of claim 15, further including program instructions to collect feedback data based on the one or more data sampling strategies.
  19. The computer program product of claim 15, further including program instructions to: provide one or more t-shirt size options, data storage options, and the one or more data sampling strategies for suggesting one or more enhanced training data allocations; and provide a projected learning curve for the machine learning operations and benefit tradeoffs for each of the one or more enhanced training data allocations.
  20. The computer program product of claim 15, further including program instructions to predict a degree of impact on the dataset for each of the one or more enhanced training data allocations based on a training accuracy, training time, the dataset, computing hardware configurations, and the one or more data sampling strategies.
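
As an illustration only, the sampling strategies recited in claims 3, 10, and 17 and the cost-minimizing suggestion step recited in claims 1, 8, and 15 might be sketched in Python as follows. This is not the patented implementation. The claims do not define the strategies, so the sketch assumes that forward allocation keeps rows from the start of the dataset and that backward allocation keeps the most recent rows. All names (`Allocation`, `forward_allocation`, `backward_allocation`, `stratified_sample`, `suggest_allocation`) are hypothetical, and the predicted cost, accuracy, and run time values are treated as inputs because the claims do not state how they are computed.

```python
import random
from dataclasses import dataclass
from typing import Hashable, List, Optional, Sequence


def forward_allocation(rows: Sequence, fraction: float) -> List:
    """Assumed meaning: keep the first `fraction` of the rows."""
    keep = max(1, int(len(rows) * fraction))
    return list(rows[:keep])


def backward_allocation(rows: Sequence, fraction: float) -> List:
    """Assumed meaning: keep the last (most recent) `fraction` of the rows."""
    keep = max(1, int(len(rows) * fraction))
    return list(rows[-keep:])


def stratified_sample(rows: Sequence, labels: Sequence[Hashable],
                      fraction: float, seed: int = 0) -> List:
    """Keep `fraction` of the rows from each cluster label (at least one each)."""
    rng = random.Random(seed)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    sample = []
    for group in groups.values():
        keep = max(1, int(len(group) * fraction))
        sample.extend(rng.sample(group, keep))
    return sample


@dataclass
class Allocation:
    """One candidate training data allocation with hypothetical predictions."""
    strategy: str
    fraction: float            # share of the dataset kept for training
    predicted_cost: float      # storage plus training cost, arbitrary units
    predicted_accuracy: float  # predicted model accuracy, 0..1
    predicted_runtime: float   # predicted training time, arbitrary units


def suggest_allocation(candidates: Sequence[Allocation],
                       min_accuracy: float,
                       max_runtime: float) -> Optional[Allocation]:
    """Return the cheapest candidate that meets both constraints, or None."""
    feasible = [
        c for c in candidates
        if c.predicted_accuracy >= min_accuracy
        and c.predicted_runtime <= max_runtime
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.predicted_cost)
```

In this sketch a caller would build one `Allocation` per strategy and retained fraction, fill in the predicted values from whatever cost and impact predictors are available, and pass the list to `suggest_allocation` with the user's accuracy and run time limits.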

Description

BACKGROUND

The present invention relates in general to computing systems, and more particularly, to various embodiments for providing enhanced data allocation for machine learning operations in a computing environment in a cloud computing system using a computing processor.

SUMMARY

According to an embodiment of the present invention, a method for providing enhanced data allocation for machine learning operations in a computing environment in a computing system is provided. One or more data sampling strategies may be determined based on a dataset. One or more enhanced training data allocations may be suggested for machine learning operations in a cloud computing environment based on the one or more data sampling strategies.

An embodiment includes a computer usable program product. The computer usable program product includes a computer-readable storage device, and program instructions stored on the storage device.

An embodiment includes a computer system. The computer system includes a processor, a computer-readable memory, and a computer-readable storage device, and program instructions stored on the storage device for execution by the processor via the memory.

Thus, in addition to the foregoing exemplary method embodiments, other exemplary system and computer product embodiments are provided.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram depicting an exemplary cloud computing node according to an embodiment of the present invention.
FIG. 2 is an additional block diagram depicting an exemplary cloud computing environment according to an embodiment of the present invention.
FIG. 3 is an additional block diagram depicting abstraction model layers according to an embodiment of the present invention.
FIG. 4 is an additional block diagram depicting an exemplary functional relationship between various aspects of the present invention.
FIG. 5 is a block diagram depicting exemplary operations for providing enhanced data allocation for machine learning operations in which aspects of the present invention may be realized.
FIGS. 6A-6B are block diagrams depicting exemplary operations for providing enhanced data allocation for machine learning operations in different data sampling modules in which aspects of the present invention may be realized.
FIG. 7 is a graph diagram depicting an exemplary operation for various results of an impact prediction component in which aspects of the present invention may be realized.
FIG. 8 is a flowchart diagram depicting an exemplary method for providing enhanced data allocation for machine learning operations in different data sampling modules in a computing environment by a processor, again in which aspects of the present invention may be realized.

DETAILED DESCRIPTION OF THE DRAWINGS

The present invention relates generally to the field of artificial intelligence (“AI”) such as, for example, machine learning and/or deep learning. Machine learning allows for an automated processing system (a “machine”), such as a computer system or specialized processing circuit, to develop generalizations about particular datasets and use the generalizations to solve associated problems by, for example, classifying new data. Once a machine learns generalizations from (or is trained using) known properties from the input or training data, it can apply the generalizations to future data to predict unknown properties.

Moreover, machine learning is a form of AI that enables a system to learn from data rather than through explicit programming. A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data, and more efficiently train machine learning models and pipelines. However, machine learning is not a simple process.
As the algorithms ingest training data, it is then possible to produce more precise models based on that data (“data” as used herein may be construed singularly or plurally). A machine-learning model is the output generated when a machine-learning algorithm is trained with data. After training, input is provided to the machine learning model which then generates an output. For example, a predictive algorithm may create a predictive model. Then, the predictive model is provided with data and a prediction is then generated (e.g., “output”) based on the data that trained the model.

Machine learning enables machine learning models to train on datasets before being deployed. Some machine-learning models are online and continuous. This iterative process of online models leads to an improvement in the types of associations made between data elements. Different conventional techniques exist to create machine learning models and neural network models. The basic prerequisites across existing approaches include having a dataset, as well as basic knowledge of machine learning model synthesis, neural network architecture synthesis and coding skills.

In addition, as used herein, cloud computing refers to the