US-20260127055-A1 - DYNAMIC MAINTENANCE OF CLOUD INFRASTRUCTURE FOR MITIGATING PREDICTED OUTAGES

Abstract

A plurality of log entries for a respective plurality of modules of a cloud computing platform are processed with a machine-learned Large Foundational Model (LFM) to obtain a prediction output. The prediction output is indicative of a predicted outage event for the cloud computing platform and a corresponding degree of severity. Based on the prediction output, a target module of the plurality of modules of the cloud computing platform is identified. A plurality of modifications is generated for a configuration of the target module with the machine-learned LFM. The plurality of modifications is configured to mitigate the predicted outage event. The plurality of modifications is based at least in part on the degree of severity. The plurality of modifications is deployed to the configuration of the target module.

Inventors

  • Ramandeep Singh
  • Khaled Elbehiery

Assignees

  • CHARTER COMMUNICATIONS OPERATING, LLC

Dates

Publication Date: 2026-05-07
Application Date: 2024-11-04

Claims (20)

  1. A method, comprising: processing, by a computing system comprising one or more processor devices, a plurality of log entries for a respective plurality of modules of a cloud computing platform with a machine-learned Large Foundational Model (LFM) to obtain a prediction output, wherein the prediction output is indicative of a predicted outage event for the cloud computing platform and a corresponding degree of severity; identifying, by the computing system based on the prediction output, a target module of the plurality of modules of the cloud computing platform; and for the target module: generating, by the computing system with the machine-learned LFM, a plurality of modifications for a configuration of the target module, wherein the plurality of modifications is configured to mitigate the predicted outage event, wherein the plurality of modifications comprises a modification to a variable of the configuration for the target module that controls a maximum number of actions of a particular type performed by the target module, and wherein the plurality of modifications is based at least in part on the degree of severity; and deploying, by the computing system, the plurality of modifications to the configuration of the target module.
  2. The method of claim 1, wherein generating the plurality of modifications for the configuration of the target module further comprises: executing, by the computing system, a test suite associated with the target module to validate the plurality of modifications to the configuration of the target module.
  3. The method of claim 1, wherein the machine-learned LFM comprises one of a plurality of machine-learned LFMs, and wherein processing the plurality of log entries for the respective plurality of modules of the cloud computing platform with the machine-learned LFM to obtain the prediction output comprises: processing, by the computing system with a first machine-learned LFM of the plurality of machine-learned LFMs, a first log entry of the plurality of log entries for a first module of the plurality of modules of the cloud computing platform to obtain a first prediction sub-output; processing, by the computing system with a second machine-learned LFM of the plurality of machine-learned LFMs, a second log entry of the plurality of log entries for a second module of the plurality of modules of the cloud computing platform to obtain a second prediction sub-output; and generating, by the computing system, the prediction output based on the first prediction sub-output and the second prediction sub-output.
  4. The method of claim 3, wherein the first machine-learned LFM of the plurality of machine-learned LFMs comprises a first instance of the machine-learned LFM prompted with a first prompt comprising contextual information associated with a function of the first module of the plurality of modules of the cloud computing platform; and wherein the second machine-learned LFM of the plurality of machine-learned LFMs comprises the first instance of the machine-learned LFM prompted with a second prompt comprising contextual information associated with a function of the second module of the plurality of modules of the cloud computing platform.
  5. The method of claim 4, wherein the function of the first module comprises: a compute function; a storage function; a network and security function; a virtualization function; or a cloud platform configuration function.
  6. The method of claim 3, wherein, prior to processing the first log entry of the plurality of log entries with the first machine-learned LFM of the plurality of machine-learned LFMs, the method comprises: training, by the computing system, the first machine-learned LFM based at least in part on contextual information associated with a function of the first module of the plurality of modules of the cloud computing platform.
  7. The method of claim 3, wherein generating the prediction output based on the first prediction sub-output and the second prediction sub-output comprises: processing, by the computing system, the first prediction sub-output and the second prediction sub-output with the machine-learned LFM to obtain the prediction output.
  8. The method of claim 1, wherein identifying the target module of the plurality of modules of the cloud computing platform comprises: obtaining, by the computing system based on the prediction output, module mapping information descriptive of existing relationships between the plurality of modules of the cloud computing platform; and identifying, by the computing system, the target module based on the prediction output and the module mapping information.
  9. The method of claim 8, wherein the module mapping information comprises source code for the target module.
  10. The method of claim 8, wherein the module mapping information comprises technical documentation associated with the target module.
  11. The method of claim 1, wherein generating the plurality of modifications for the configuration of the target module comprises: generating, by the computing system with the machine-learned LFM, the plurality of modifications, wherein the plurality of modifications comprises a modification to a unit of software instructions that implements the target module, wherein the modification is configured to mitigate the predicted outage event.
  12. The method of claim 1, wherein deploying the plurality of modifications to the configuration of the target module comprises: deploying, by the computing system, the plurality of modifications to the configuration of the target module prior to occurrence of the predicted outage event.
  13. The method of claim 1, wherein the target module comprises an impacted module impacted by the predicted outage event, and wherein the modifications mitigate an impact of the predicted outage event prior to occurrence of the predicted outage event.
  14. The method of claim 13, wherein the target module comprises a causative module that is causative of the predicted outage event.
  15. A computing system comprising: one or more processor devices to: process a plurality of log entries for a respective plurality of modules of a cloud computing platform with a machine-learned Large Foundational Model (LFM) to obtain a prediction output, wherein the prediction output is indicative of a predicted outage event for the cloud computing platform and a corresponding degree of severity; identify, based on the prediction output, a target module of the plurality of modules of the cloud computing platform; and for the target module: generate, with the machine-learned LFM, a plurality of modifications for a configuration of the target module, wherein the plurality of modifications comprises a modification to a variable of the configuration for the target module that controls a maximum number of actions of a particular type performed by the target module, wherein the plurality of modifications is configured to mitigate the predicted outage event, and wherein the plurality of modifications is based at least in part on the degree of severity; and deploy the plurality of modifications to the configuration of the target module.
  16. The computing system of claim 15, wherein, to generate the plurality of modifications for the configuration of the target module, the one or more processor devices are to: execute a test suite associated with the target module to validate the plurality of modifications to the configuration of the target module.
  17. The computing system of claim 15, wherein the machine-learned LFM comprises one of a plurality of machine-learned LFMs, and wherein, to process the plurality of log entries for the respective plurality of modules of the cloud computing platform with the machine-learned LFM to obtain the prediction output, the one or more processor devices are to: process, with a first machine-learned LFM of the plurality of machine-learned LFMs, a first log entry of the plurality of log entries for a first module of the plurality of modules of the cloud computing platform to obtain a first prediction sub-output; process, with a second machine-learned LFM of the plurality of machine-learned LFMs, a second log entry of the plurality of log entries for a second module of the plurality of modules of the cloud computing platform to obtain a second prediction sub-output; and generate the prediction output based on the first prediction sub-output and the second prediction sub-output.
  18. The computing system of claim 17, wherein the first machine-learned LFM of the plurality of machine-learned LFMs comprises a first instance of the machine-learned LFM prompted with a first prompt comprising contextual information associated with a function of the first module of the plurality of modules of the cloud computing platform; and wherein the second machine-learned LFM of the plurality of machine-learned LFMs comprises the first instance of the machine-learned LFM prompted with a second prompt comprising contextual information associated with a function of the second module of the plurality of modules of the cloud computing platform.
  19. The computing system of claim 18, wherein the function of the first module comprises: a compute function; a storage function; a network and security function; a virtualization function; or a cloud platform configuration function.
  20. A non-transitory computer-readable storage medium that includes executable instructions to cause one or more processor devices to: process a plurality of log entries for a respective plurality of modules of a cloud computing platform with a machine-learned Large Foundational Model (LFM) to obtain a prediction output, wherein the prediction output is indicative of a predicted outage event for the cloud computing platform and a corresponding degree of severity; identify, based on the prediction output, a target module of the plurality of modules of the cloud computing platform; and for the target module: generate, with the machine-learned LFM, a plurality of modifications for a configuration of the target module, wherein the plurality of modifications comprises a modification to a variable of the configuration for the target module that controls a maximum number of actions of a particular type performed by the target module, wherein the plurality of modifications is configured to mitigate the predicted outage event, and wherein the plurality of modifications is based at least in part on the degree of severity; and deploy the plurality of modifications to the configuration of the target module.
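
The flow recited in claim 1 (predict, identify a target module, generate severity-dependent modifications, deploy) can be sketched in Python as follows. This is a minimal, hypothetical illustration and not the claimed implementation: the LFM is replaced by simple stand-in functions (stub_lfm_predict and stub_lfm_generate), and the 1-to-5 severity scale, the max_concurrent_requests and retry_limit configuration variables, and the causes mapping are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Prediction:
    outage: str          # description of the predicted outage event
    severity: int        # assumed scale: 1 (minor) to 5 (critical)
    suspect_module: str  # module whose logs triggered the prediction


@dataclass
class Modification:
    variable: str  # configuration variable to change
    new_value: int


def stub_lfm_predict(logs: Dict[str, List[str]]) -> Prediction:
    """Stand-in for the LFM (NOT a real model): counts 'timeout' log lines."""
    counts = {module: sum("timeout" in line for line in lines)
              for module, lines in logs.items()}
    worst = max(counts, key=lambda module: counts[module])
    severity = min(5, max(1, counts[worst]))
    return Prediction("request backlog", severity, worst)


def identify_target(pred: Prediction, causes: Dict[str, str]) -> str:
    """Pick a causative module from hypothetical mapping info, if one is known."""
    return causes.get(pred.suspect_module, pred.suspect_module)


def stub_lfm_generate(config: Dict[str, int], severity: int) -> List[Modification]:
    """Stand-in for LFM-generated modifications: higher severity, tighter caps."""
    cap = max(1, config["max_concurrent_requests"] * (6 - severity) // 6)
    retries = max(0, config["retry_limit"] - severity)
    return [Modification("max_concurrent_requests", cap),
            Modification("retry_limit", retries)]


def deploy(config: Dict[str, int], mods: List[Modification]) -> Dict[str, int]:
    """Apply the modifications to a copy of the configuration."""
    updated = dict(config)
    for mod in mods:
        updated[mod.variable] = mod.new_value
    return updated


def mitigate(logs: Dict[str, List[str]],
             configs: Dict[str, Dict[str, int]],
             causes: Dict[str, str]) -> Tuple[str, Dict[str, int]]:
    """Run the predict, identify, generate, deploy steps in order."""
    pred = stub_lfm_predict(logs)
    target = identify_target(pred, causes)
    mods = stub_lfm_generate(configs[target], pred.severity)
    return target, deploy(configs[target], mods)
```

In this sketch, max_concurrent_requests plays the role of the variable that "controls a maximum number of actions of a particular type", and the size of the reduction depends on the predicted severity. A real system would replace both stubs with calls to an actual model and would likely validate the changes (for example, with a test suite as in claim 2) before deployment.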

Description

BACKGROUND

“Cloud computing” refers to the provision of computing services over the internet, such as hosting, storage, databases, networking, software, and analytics. Cloud computing platforms provide real-time, on-demand access to these resources without requiring an investment in physical infrastructure, which enables scalability, flexibility, and cost savings. Cloud computing platforms are often modular and can be scaled dynamically to meet demand.

Cloud computing platforms generally provide access to much larger quantities of computing resources than would otherwise be available to most organizations. For example, assume that one organization hosts online services using a local on-premises server device, and another organization hosts services via a cloud computing service. Further assume that the services provided by both organizations experience substantial spikes in demand. If the demand exceeds the capacity of the local on-premises server, the performance of the services can be severely degraded. However, if the demand exceeds the current capacity provided by the cloud computing platform, the cloud computing platform can dynamically allocate additional capacity to mitigate performance degradation.

SUMMARY

Cloud computing platforms can experience outages due to faults or the like at certain cloud modules. Log entries from such platforms can be processed with a machine-learned model to obtain a prediction output indicating a predicted outage event for the cloud platform. Based on the prediction output, a target cloud module can be identified (e.g., a causative module, an impacted module, etc.). A plurality of modifications can be generated for a configuration of the target module to mitigate the outage event. The modifications can be deployed to the target module.

In one implementation, a method is provided.
The method includes processing, by a computing system comprising one or more processor devices, a plurality of log entries for a respective plurality of modules of a cloud computing platform with a machine-learned Large Foundational Model (LFM) to obtain a prediction output, wherein the prediction output is indicative of a predicted outage event for the cloud computing platform and a corresponding degree of severity. The method further includes identifying, by the computing system based on the prediction output, a target module of the plurality of modules of the cloud computing platform. The method further includes, for the target module, generating, by the computing system with the machine-learned LFM, a plurality of modifications for a configuration of the target module, wherein the plurality of modifications is configured to mitigate the predicted outage event, and wherein the plurality of modifications is based at least in part on the degree of severity. The method further includes, for the target module, deploying, by the computing system, the plurality of modifications to the configuration of the target module.

In another implementation, a computing system is provided. The computing system includes a memory, and one or more processor devices coupled to the memory. The one or more processor devices are to process a plurality of log entries for a respective plurality of modules of a cloud computing platform with a machine-learned Large Foundational Model (LFM) to obtain a prediction output, wherein the prediction output is indicative of a predicted outage event for the cloud computing platform and a corresponding degree of severity. The one or more processor devices are further to identify, based on the prediction output, a target module of the plurality of modules of the cloud computing platform.
The one or more processor devices are further to, for the target module, generate, with the machine-learned LFM, a plurality of modifications for a configuration of the target module, wherein the plurality of modifications is configured to mitigate the predicted outage event, and wherein the plurality of modifications is based at least in part on the degree of severity. The one or more processor devices are further to, for the target module, deploy the plurality of modifications to the configuration of the target module.

In another implementation, a non-transitory computer-readable storage medium is provided. The non-transitory computer-readable storage medium includes executable instructions to cause one or more processor devices to process a plurality of log entries for a respective plurality of modules of a cloud computing platform with a machine-learned Large Foundational Model (LFM) to obtain a prediction output, wherein the prediction output is indicative of a predicted outage event for the cloud computing platform and a corresponding degree of severity. The instructions further cause the one or more processor devices to identify, based on the prediction output, a target module of the plurality of modules of the cloud computing platform. The instructions further cause the one or more pro