CA-3060105-C - TECHNOLOGY SYSTEM AUTO-RECOVERY AND OPTIMALITY ENGINE AND TECHNIQUES
Abstract
Disclosed are hardware and techniques for correcting computer process faults by identifying risk associated with correcting a computer process fault and computer processes that may depend on the corrected computer process. The interdependent computer processes in a network may be determined by evaluating a stream of process break flags from a monitoring component coupled to the network. Each computer process break flag in the stream of computer process break flags indicates a process fault detected by the monitoring component and is correlated to a corrective response. The break flag and the corrective response are assigned a risk. A risk matrix accounts for interdependencies between computer processes and identified corrective actions. A final response strategy that corrects the computer process faults is determined using the assigned risk and computer system interdependence. A runbook stores the final response strategy, which may be updated based on changing computer process interdependencies and assigned risk.
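The flow summarized in the abstract (break flag, correlated corrective responses, assigned risk, risk matrix, final response strategy) can be illustrated with a minimal sketch. This is not the patented implementation. The fault name, response names, numeric risk values, 0-to-1 scales, and weighted scoring rule below are all hypothetical assumptions chosen for illustration; the source specifies only that a rule set is applied to the risk matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """A corrective response stored in the runbook (hypothetical structure)."""
    name: str
    fix_risk: float          # assumed scale: 0.0 = very likely to fix, 1.0 = unlikely
    interdependency: float   # assumed scale: 0.0 = isolated, 1.0 = many dependent processes


# Hypothetical runbook: maps a break flag to its known corrective responses.
RUNBOOK = {
    "db_conn_timeout": [
        Response("restart_connection_pool", fix_risk=0.2, interdependency=0.3),
        Response("restart_database", fix_risk=0.1, interdependency=0.9),
        Response("increase_timeout", fix_risk=0.6, interdependency=0.1),
    ],
}

# Hypothetical break risk per flag (likelihood that the fault leads to a break).
BREAK_RISK = {"db_conn_timeout": 0.8}


def build_risk_matrix(flags):
    """Create one matrix row per (break flag, candidate response) pair."""
    rows = []
    for flag in flags:
        for response in RUNBOOK.get(flag, []):
            rows.append({
                "flag": flag,
                "break_risk": BREAK_RISK.get(flag, 0.5),
                "response": response.name,
                "fix_risk": response.fix_risk,
                "interdependency": response.interdependency,
            })
    return rows


def score(row, weight=0.5):
    """Assumed rule: lower is better; penalize fixes that touch many dependents."""
    return row["fix_risk"] + weight * row["interdependency"]


def final_strategy(matrix, flag):
    """Return the responses for one flag, ordered from lowest to highest score."""
    candidates = [row for row in matrix if row["flag"] == flag]
    return [row["response"] for row in sorted(candidates, key=score)]


matrix = build_risk_matrix(["db_conn_timeout"])
strategy = final_strategy(matrix, "db_conn_timeout")
print(strategy)
```

Under these assumed values, a database restart has the lowest fix risk but ranks second, because its high interdependency rating makes it more likely to disturb dependent processes. That trade-off is the kind the risk matrix is meant to expose.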
Inventors
- Bhavik Gudka
- Surya Avirneni
- Eric Barnum
- Milind Patel
Assignees
- CAPITAL ONE SERVICES, LLC
Dates
- Publication Date
- 20260505
- Application Date
- 20191024
- Priority Date
- 20181127
Claims (1)
- What is claimed is: 1. An apparatus, comprising: a memory storing programming code; and a triage processing component, coupled to the memory and, via a communication interface, to a monitoring component that monitors operation of computer implemented processes of a network, operable to execute the stored programming code, that when executed causes the triage processing component to perform functions, including functions to: receive, from the monitoring component, a first process break event indicating a symptom of a potential operational breakdown of a computer implemented process; evaluate the received first process break event for a correlation to a possible cause of the potential operational breakdown of the computer process; based on the correlation to the possible cause of the potential operational breakdown of the computer process, identify possible corrective actions that can be implemented to fix the computer implemented process to prevent the potential operational breakdown; assign a break risk assessment value indicating a likelihood of occurrence of the potential operational breakdown of the computer implemented process; assign a respective fix risk assessment value to each of the identified possible corrective actions; populate a risk assessment matrix with the assigned break risk assessment value and the fix risk assessment value assigned to each of the identified possible corrective actions, wherein the risk assessment matrix has elements representing the computer implemented process, a plurality of other computer implemented processes, and an interdependency rating that quantifies a level of interdependence of each of the plurality of the other computer implemented processes on the computer implemented process; access a runbook including a plurality of corrective actions that correct potential operational breakdowns of computer implemented processes of the network; obtain a list of corrective actions correlated to the first process break event 
from the runbook; and modify the list of corrective actions based on a rule set applied to the risk assessment matrix, wherein the modified list of corrective actions includes at least one of the identified possible corrective actions as an optimal corrective action. 2. The apparatus of claim 1, wherein: the assigned break event risk assessment value has a range from a value indicating the potential operational breakdown has a high likelihood of occurring to a value indicating the potential operational breakdown has a low likelihood of occurring; and the respective fix risk assessment value assigned to each of the identified possible corrective actions has a range from a value indicating the potential operational breakdown has a high likelihood of being fixed to a value indicating the potential operational breakdown has a low likelihood of being fixed by the respective identified possible corrective action. 3. The apparatus of claim 1, wherein the memory further comprises: programming code that causes the triage processing component to perform further functions when modifying the list of corrective actions in the runbook, including functions to: assign an interdependency rating to each of the possible corrective actions in the list of corrective actions, wherein the interdependency rating quantifies a level of interdependence of each of the computer implemented processes that may be affected by application of each of the possible corrective actions in the list of corrective actions; populate the risk assessment matrix with the interdependency rating of each of the possible corrective actions in the list of corrective actions; evaluate the risk assessment matrix, based on the assigned interdependency rating of each of the possible corrective actions in the list of corrective actions to one another; and in response to the evaluation of the risk assessment matrix, flag a respective corrective action from the list of corrective actions as the optimal corrective 
action. 4. The apparatus of claim 1, wherein the memory further comprises: programming code that causes the triage processing component to perform further functions prior to the runbook being modified, including functions to: identify interdependency risk patterns in the risk assessment matrix populated with the assigned break risk assessment value and the fix risk assessment value assigned for each of the identified corrective actions, wherein the identified interdependency risk patterns indicate risks related to procedures in the runbook and effects of implementing procedures on the computer implemented processes in the network; and generate, based on the identified interdependency risk patterns, a response strategy incorporating at least one of the procedures from the list of corrective actions. 5. The apparatus of claim 1, wherein the memory further comprises: programming code that causes the triage processing component to perform further functions, including functions to: receive an additional process break event indicating an additional symptom of another or the same potential operational breakdown of the computer implemented process. 6. The apparatus of claim 5, wherein the memory further comprises: programming code that causes the triage processing component to perform further functions, including functions to: update correlations to possible causes of the potential operational breakdown of the computer implemented processes by analyzing the received additional process break event in conjunction with the first process break event; based on the updated correlations, update the list of corrective actions; and generate updated break risk assessment values for the potential operational breakdown of the computer implemented process and updated fix risk assessment values for each corrective action in the updated list of corrective actions. 7. 
The apparatus of claim 6, wherein the triage processing component is coupled to receive one or more process break events from multiple monitoring circuits that monitor computer implemented processes in the network; and the memory further comprises programming code that causes the triage processing component to perform further functions, including functions to: receive subsequent process break events from one or more of the multiple monitoring circuits coupled to the triage processing component; generate, based on the received subsequent break events, break risk assessment values and fix risk assessment values; populate the risk assessment matrix using the generated break risk assessment values and fix risk assessment values; identify one procedure in a revised list of procedures for implementing one corrective action to fix the potential operational breakdowns indicated by the subsequent break events; and modify the runbook to include the identified one procedure as the procedure to implement when the potential operational breakdown requires fixing. 8. 
The apparatus of claim 7, wherein the memory further comprises programming code that causes the triage processing component to perform further functions, including functions to: produce a copy of the populated risk assessment matrix; receive successive process break events that follow the subsequent process break events from the one or more of the multiple monitoring circuits coupled to the triage processing component; generate, based on the received successive process break events, break risk assessment values and fix risk assessment values of the successive process break events; populate the copy of the risk assessment matrix using the generated break risk assessment values and fix risk assessment values to produce a revised risk assessment matrix; analyze the break risk assessment values and the fix risk assessment values of the subsequent process break events in the populated risk assessment matrix to the break risk assessment values and the fix risk assessment values of the successive process break events in the revised risk assessment matrix; and update, based on results of the analysis, the modified runbook to identify one procedure in the list of procedures for implementing the one corrective action to fix the potential operational breakdown. 9. 
The apparatus of claim 8, wherein the memory further comprises programming code that causes the triage processing component to perform functions, including functions to: produce a simulation copy of the populated risk assessment matrix, wherein the simulation copy of the populated risk assessment matrix has elements including previously determined break risk assessment values and fix risk assessment values; obtain simulated process break events as received process break events; determine, based on the simulated process break events, break risk assessment values and fix risk assessment values of the simulated process break events; populate the simulation copy of the risk assessment matrix using the determined break risk assessment values and fix risk assessment values to produce a revised simulated risk assessment matrix; compare break risk assessment values and fix risk assessment values of the simulated process break events in the simulation copy of the populated risk assessment matrix to break risk assessment values and fix risk assessment values of the simulated process break events in the revised risk assessment matrix; produce a simulation copy of the modified runbook; revise, based on results of the comparing, the simulation copy of the modified runbook to identify one procedure in the list of procedures for implementing the one corrective action to fix the potential operational breakdown; and evaluate using statistical analysis the revised simulation copy of the modified runbook against the simulation copy of the modified runbook to determine whether the revised simulation copy of the modified runbook includes the identified one procedure as the procedure to implement when the potential operational breakdown requires fixing. 10. 
A method, comprising: receiving, by a triage processing component coupled to a plurality of monitoring circuits and a network environment, a stream of computer process break flags from one or more of the plurality of monitoring circuits coupled to the network environment, each computer process break flag in the stream of computer process break flags indicating a process fault detected by a respective one or more monitoring circuits that generated the received stream of computer process break flags; generating a dynamic runbook containing fixes known to correct known possible causes of process faults based on each computer process break flag in the received stream of computer process break flags; extracting individual computer process break flags from the received stream of computer process break flags; evaluating each individual computer process break flag extracted from the received stream of computer process break flags by: correlating a respective individual computer process break flag currently being evaluated to known possible causes of the indicated process fault; assigning a break event risk assessment value to the respective computer process break flag based on the respective computer process break flag’s correlation to the known possible causes of the process fault indicated by the respective computer process break flag; accessing the dynamic runbook containing known responses that correct one or more of the known possible causes of each process fault of a number of process faults indicated by the respective break flag; identifying a list of possible responses from the known responses contained in the dynamic runbook, wherein a possible response in the list of possible responses corrects at least one of the known possible causes of the indicated process faults; determining an interdependency rating of each possible response in the list of possible responses, wherein the interdependency rating of each possible response quantifies a level of 
interdependence of computer implemented processes affected by an application of each of the possible responses in the list of possible responses to the network; assigning a fix event risk assessment value to each possible response in the list of possible responses; as each respective individual computer process break flag is evaluated: populating a risk matrix to include the assigned break risk assessment value of an evaluated individual computer process break flag, each fix event risk assessment value assigned to each of the possible responses that corrects the evaluated individual computer process break flag, and the interdependency rating of each possible response in the list of possible responses; assessing the risk matrix to generate a preliminary response strategy to be implemented to correct the process fault indicated by each individual computer process break flag that has been evaluated in the received stream of computer process break flags; reevaluating previously received and newly received individual computer process break flags extracted from the received stream of computer process break flags; modifying the generated preliminary response strategy based on results of the reevaluation; in response to no further modifications being performed on the generated preliminary response strategy after multiple reevaluations, generating a final response strategy, wherein the final response strategy identifies a response that corrects a computer process fault indicated by the evaluated computer process break flags; and applying the response identified in the final response strategy to the computer implemented process experiencing the computer process fault in the network environment. 11. The method of claim 10, further comprising: prior to applying the response identified in the final response strategy to the network environment, modifying the dynamic runbook to include the generated final response strategy. 12. 
The method of claim 10, wherein: the identified response in the final response strategy is an ordered series of multiple responses that are applied serially to the network environment, and the multiple responses in the ordered series are ordered according to a respective fix event risk assessment value of each response of the multiple responses in the final response strategy. 13. The method of claim 10, wherein: the assigned break event risk assessment value assigned to each break event risk assessment has a range from a value indicating the process fault has a high likelihood of causing a process break to a value indicating the process fault has a low likelihood of causing a process break; and the fix event risk assessment value assigned to each respective possible response has a range from a value indicating the process fault has a high likelihood of being corrected by the respective possible response to a value indicating the process fault has a low likelihood of being corrected by the respective possible response. 14. The method of claim 10, further comprising: monitoring, by the plurality of monitoring circuits, status of multiple computer implemented processes of the network environment, the multiple computer implemented processes being interdependent upon one another for operation of the network environment; and monitoring a next iteration of the risk matrix to determine an impact of the applied final response strategy on interdependent computer implemented processes. 15. The method of claim 10, wherein determining the interdependency rating further comprises: determining a level of interdependence of the computer implemented process that may be affected by application of each possible response in the list of possible responses; and assigning an interdependency rating to each possible response in the list of possible responses, based on the determined level of interdependence. 16. 
The method of claim 10, further comprising: identifying interdependency risk patterns in the risk matrix populated with the assigned break event risk assessment value and the fix event risk assessment value assigned for each of the possible responses, wherein the identified interdependency risk patterns indicate risks related to responses in the runbook and effects of implementing a response on respective computer implemented processes in the network environment; and updating, based on the identified interdependency risk patterns, the final response strategy. 17. A non-transitory computer-readable storage medium storing computer-readable program code executable by a processor, wherein execution of the computer-readable program code causes the processor to: receive, via a coupling to one or more monitoring circuits coupled to a network environment, a plurality of computer process break flags from the one or more monitoring circuits, wherein each of the plurality of computer process break flags indicates a process fault in the network environment; generate a break event risk assessment value indicating a risk of an occurrence of a computer process break attributable to the process fault indicated by each of the computer process break flags; determine, for each computer process break flag, a correlation to one or more possible root causes of the process fault indicated by each of the computer process break flags; identify, for each correlated one or more possible root causes of the process fault indicated by a respective computer process break flag, a known fix for each correlated possible root cause of the process fault indicated by the respective computer process break flag; for each identified known fix, determine a respective risk of each identified known fix adversely affecting other computer processes in the network environment; assign a fix event risk assessment value to each identified known fix; generate a risk assessment matrix with the 
generated break risk assessment value and the fix risk assessment value assigned for each of the known fixes; and modify a runbook based on a rule set applied to the risk matrix, wherein the runbook is a list of procedures for implementing the known fixes of the indicated process fault to one or more computer implemented processes associated with at least one of the respective process break flags. 18. The non-transitory computer-readable storage medium of claim 17, wherein: the generated break event risk assessment value has a value range from a value indicating the process attribute has a high likelihood of causing a process break to a value indicating the process attribute has a low likelihood of causing a process break; and the fix event risk assessment value assigned to each of the known fixes has a value range from a value indicating the process fault has a high likelihood of being corrected by a respective known fix to a value indicating the process fault has a low likelihood of being corrected by the respective known fix. 19. The non-transitory computer-readable storage medium of claim 17, further comprising computer-readable program code that when executed causes the processor to: generate a procedure that implements a final response strategy as an ordered series of multiple known fixes that are applied serially to the network environment, and the order of the multiple known fixes in the ordered series is according to a respective fix risk assessment value of each known fix of the one or more computer implemented processes associated with at least one of the respective process break flags. 20. The non-transitory computer-readable storage medium of claim 19, wherein the multiple known fixes of the final response strategy generated procedure of the final response strategy is an ordered series of multiple responses that are applied serially to the network environment. 21. 
An apparatus, comprising: a memory storing programming code; and a triage processing component, coupled to the memory and, via a communication interface, to a monitoring component that monitors operation of computer implemented processes of a network, operable to execute the stored programming code, that when executed causes the triage processing component to perform functions, including functions to: evaluate a first process break event received from a monitoring component for a correlation to a possible cause of a potential operational breakdown of a computer process of the computer implemented processes; based on the correlation to the possible cause of the potential operational breakdown of the computer process, identify corrective actions implementable to fix a computer implemented process exhibiting a symptom of the potential operational breakdown; populate a risk assessment matrix with a break risk assessment value and a fix risk assessment value assigned to each of the identified corrective actions; obtain a list of corrective actions correlated to the first process break event from a runbook, wherein the runbook includes a plurality of corrective actions that correct potential operational breakdowns of the computer implemented processes of the network; and modify the list of corrective actions based on a rule set applied to the risk assessment matrix, wherein the modified list of corrective actions includes at least one of the identified corrective actions as an optimal corrective action. 22. The apparatus of claim 21, wherein the memory further comprises: programming code that causes the triage processing component to perform further functions, including functions to: assign a break risk assessment value indicating a likelihood of occurrence of the potential operational breakdown of the computer implemented process; and assign a respective fix risk assessment value to each of the identified corrective actions. 23. 
The apparatus of claim 22, wherein: the assigned break risk assessment value has a range from a value indicating the potential operational breakdown has a high likelihood of occurring to a value indicating the potential operational breakdown has a low likelihood of occurring; and the respective fix risk assessment value assigned to each of the identified corrective actions has a range from a value indicating the potential operational breakdown has a high likelihood of being fixed to a value indicating the potential operational breakdown has a low likelihood of being fixed by a respective identified corrective action. 24. The apparatus of claim 22, wherein the memory further comprises: programming code that causes the triage processing component to perform further functions prior to the runbook being modified, including functions to: identify interdependency risk patterns in the risk assessment matrix populated with the assigned break risk assessment value and the fix risk assessment value assigned for each of the identified corrective actions, wherein the identified interdependency risk patterns indicate risks related to each corrective action in the runbook and effects of implementing each corrective action on the computer implemented processes in the network; and generate, based on the identified interdependency risk patterns, a response strategy incorporating at least one of the corrective actions from the list of corrective actions. 25. 
The apparatus of claim 21, wherein the memory further comprises: programming code that causes the triage processing component to perform further functions when modifying the list of corrective actions, including functions to: assign an interdependency rating to each corrective action in the list of corrective actions, wherein the interdependency rating quantifies a level of interdependence of each of the computer implemented processes that may be affected by application of each corrective action in the list of corrective actions; populate the risk assessment matrix with the assigned interdependency rating of each corrective action in the list of corrective actions; evaluate the risk assessment matrix, based on the assigned interdependency rating of each corrective action in the list of corrective actions to one another; and in response to the evaluation of the risk assessment matrix, flag a respective corrective action from the list of corrective actions as the optimal corrective action. 26. The apparatus of claim 21, wherein the risk assessment matrix has elements representing a plurality of computer implemented processes including the computer implemented process. 27. 
The apparatus of claim 26, wherein the triage processing component is coupled to receive one or more process break events from multiple monitoring circuits that monitor computer implemented processes in the network; and the memory further comprises programming code that causes the triage processing component to perform further functions, including functions to: receive subsequent process break events from one or more of the multiple monitoring circuits coupled to the triage processing component; generate, based on the received subsequent process break events, break risk assessment values and fix risk assessment values; populate the risk assessment matrix using the generated break risk assessment values and fix risk assessment values; identify one corrective action in a revised list of corrective actions for implementing one corrective action to fix potential operational breakdowns indicated by the subsequent break events; and modify the runbook to include the identified one corrective action as the corrective action to implement when the potential operational breakdowns require fixing. 28. 
The apparatus of claim 27, wherein the memory further comprises programming code that causes the triage processing component to perform further functions, including functions to: produce a copy of the populated risk assessment matrix; receive successive process break events that follow the subsequent process break events from the one or more of the multiple monitoring circuits coupled to the triage processing component; generate, based on the received successive process break events, break risk assessment values and fix risk assessment values of the successive process break events; populate the copy of the risk assessment matrix using the generated break risk assessment values and the generated fix risk assessment values to produce a revised risk assessment matrix; analyze the break risk assessment values and the fix risk assessment values of the subsequent process break events in the populated risk assessment matrix to the break risk assessment values and the fix risk assessment values of the successive process break events in the revised risk assessment matrix; and update, based on results of the analysis, the modified runbook to identify one corrective action in the list of corrective actions for implementing the one corrective action to fix the potential operational breakdown. 29. 
The apparatus of claim 28, wherein the memory further comprises programming code that causes the triage processing component to perform functions, including functions to: produce a simulation copy of the populated risk assessment matrix, wherein the simulation copy of the populated risk assessment matrix has elements including previously determined break risk assessment values and fix risk assessment values; obtain simulated process break events as received process break events; determine, based on the simulated process break events, break risk assessment values and fix risk assessment values of the simulated process break events; populate the simulation copy of the risk assessment matrix using the determined break risk assessment values and fix risk assessment values to produce a revised simulated risk assessment matrix; compare break risk assessment values and fix risk assessment values of the simulated process break events in the simulation copy of the populated risk assessment matrix to break risk assessment values and fix risk assessment values of the simulated process break events in the revised risk assessment matrix; produce a simulation copy of the modified runbook; revise, based on results of the comparing, the simulation copy of the modified runbook to identify one corrective action in the list of corrective actions for implementing the one corrective action to fix the potential operational breakdown; and evaluate, using statistical analysis, the revised simulation copy of the modified runbook against the simulation copy of the modified runbook to determine whether the revised simulation copy of the modified runbook includes the identified one corrective action as the corrective action to implement when the potential operational breakdowns require fixing. 30. 
A method, comprising: receiving, by a triage processing component coupled to a plurality of monitoring circuits and a network environment, a stream of computer process break flags from one or more of the plurality of monitoring circuits coupled to the network environment, each computer process break flag in the stream of computer process break flags indicating a process fault; extracting individual computer process break flags from the received stream of computer process break flags; evaluating each individual computer process break flag extracted from the received stream of computer process break flags by: correlating a respective individual computer process break flag currently being evaluated to known possible causes of the indicated process fault; assigning a break event risk assessment value to the respective computer process break flag based on the respective computer process break flag’s correlation to the known possible causes of the process fault indicated by the respective computer process break flag; accessing a runbook containing known responses that correct one or more of the known possible causes of each process fault of a number of process faults indicated by the respective break flag; identifying a list of possible responses from the known responses contained in the runbook, wherein a possible response in the list of possible responses corrects at least one of the known possible causes of the indicated process faults; determining an interdependency rating of each possible response in the list of possible responses; assigning a fix event risk assessment value to each possible response in the list of possible responses; as each respective individual computer process break flag is evaluated: populating a risk matrix to include the assigned break risk assessment value of an evaluated individual computer process break flag, each fix event risk assessment value assigned to each of the possible responses that corrects the evaluated individual 
computer process break flag, and the interdependency rating of each possible response in the list of possible responses; assessing the risk matrix to identify a final response strategy to be implemented to correct the process fault indicated by each individual computer process break flag that has been evaluated in the received stream of computer process break flags, wherein the final response strategy identifies a response that corrects a computer process fault indicated by the evaluated computer process break flags; and applying the response identified in the final response strategy to a respective individual computer implemented process experiencing the process fault associated with the respective individual computer process break flag in the network environment. 31. The method of claim 30, further comprising: prior to applying the response identified of the final response strategy to the network environment, modifying the runbook to include the final response strategy. 32. The method of claim 30, wherein: the identified response in the final response strategy is an ordered series of multiple responses that are applied serially to the network environment, and the order of the multiple responses in the ordered series is according to a respective fix event risk assessment value of each response of the multiple responses in the final response strategy. 33. 
The method of claim 30, wherein: the assigned break risk assessment value assigned to each break event risk assessment has a range from a value indicating the process fault has a high likelihood of causing a process break to a value indicating the process fault has a low likelihood of causing a process break; and the fix event risk assessment value assigned to each respective possible response has a range from a value indicating the process fault has a high likelihood of being corrected by the respective possible response to a value indicating the process fault has a low likelihood of being corrected by the respective possible response. 34. The method of claim 30, wherein the interdependency rating of each possible response quantifies a level of interdependence of each of the respective individual computer implemented processes affected by an application of each of the possible responses in the list of possible responses to the network. 35. The method of claim 30, wherein determining the interdependency rating further comprises: determining a level of interdependence of each of the computer implemented processes that may be affected by application of each of the possible responses in the list of possible responses; and assigning an interdependency rating to each of the possible responses in the list of possible responses, based on the determined level of interdependence. 36. The method of claim 30, further comprising: identifying interdependency risk patterns in the risk matrix populated with the assigned break event risk assessment value and the fix event risk assessment value assigned for each of the possible responses, wherein the identified interdependency risk patterns indicate risks related to responses in the runbook and effects of implementing a response on respective computer implemented processes in the network environment; and updating, based on the identified interdependency risk patterns, the final response strategy. 37.
A non-transitory computer-readable storage medium storing computer-readable program code executable by a processor, wherein execution of the computer-readable program code causes the processor to: receive, via a coupling to one or more monitoring components, a plurality of computer process break flags, wherein each of the plurality of computer process break flags indicates a process fault in a network environment; generate a break event risk assessment value indicating a risk of an occurrence of a computer process break attributable to the process fault indicated by each of the computer process break flags; determine, for each computer process break flag, one or more possible root causes of the process fault indicated by each of the computer process break flags; identify, for each one or more possible root causes of the process fault indicated by a respective computer process break flag, a known fix for each possible root cause of the process fault indicated by the respective computer process break flag; for each identified known fix, determine a respective risk of each identified known fix adversely affecting other computer processes in the network environment; assign a fix event risk assessment value to each identified known fix; and generate a risk assessment matrix with the generated break risk assessment value attributable to the process fault indicated by each of the computer process break flags and the fix risk assessment value assigned for each of the known fixes. 38.
The non-transitory computer-readable storage medium of claim 37, wherein: the generated break event risk assessment value has a value range from a value indicating a process attribute has a high likelihood of causing a process break to a value indicating the process attribute has a low likelihood of causing a process break; and the fix event risk assessment value assigned to each of the known fixes has a value range from a value indicating the process fault has a high likelihood of being corrected by a respective possible response to a value indicating the process fault has a low likelihood of being corrected by the respective possible response. 39. The non-transitory computer-readable storage medium of claim 37, further comprising computer-readable program code that when executed causes the processor to: generate a procedure that implements a final response strategy as an ordered series of multiple known fixes that are applied serially to the network environment, and the order of the multiple known fixes in the ordered series is according to a respective fix risk assessment value of each known fix of each possible root cause associated with at least one of the respective process break flags. 40. The non-transitory computer-readable storage medium of claim 39, wherein the multiple known fixes in the generated procedure of the final response strategy is an ordered series of multiple responses that are applied serially to the network environment. 41. 
An apparatus, comprising: a memory storing programming code; and a triage processing component, coupled to the memory and, via a communication interface, to a monitoring component that monitors operation of computer processes of a network, operable to execute the stored programming code, that when executed causes the triage processing component to perform functions to: populate a risk assessment matrix with a break risk assessment value and a fix risk assessment value both assigned to one or more corrective actions identified as being able to correct a possible cause of a potential operational breakdown of a computer process; identify interdependency risk patterns in the risk assessment matrix populated with the break risk assessment value and the fix risk assessment value; obtain, from a runbook, a list of corrective actions correlated to a first process break flag, wherein the runbook includes a plurality of corrective actions that correct potential operational breakdowns of the computer process of the network and the first process break flag indicating a symptom of the potential operational breakdown of one or more computer-implemented processes; and generate, based on the identified interdependency risk patterns, a response strategy incorporating at least one corrective action from the list of corrective actions. 42. The apparatus of claim 41, wherein the identified interdependency risk patterns indicate risks related to each corrective action in the runbook and an effect of applying each corrective action on the computer process in the network. 43.
The apparatus of claim 41, wherein the memory further comprises programming code that causes the triage processing component to perform further functions to: assign the break risk assessment value indicating a likelihood of occurrence of the potential operational breakdown of the computer process in the network; and assign a respective fix risk assessment value to each of the identified one or more corrective actions. 44. The apparatus of claim 43, wherein: the assigned break risk assessment value has a range from a value indicating the potential operational breakdown has a high likelihood of occurring to another value indicating the potential operational breakdown has a low likelihood of occurring; and the respective fix risk assessment value assigned to each of the identified corrective actions has a range from a value indicating the potential operational breakdown has a high likelihood of being fixed to a different value indicating the potential operational breakdown has a low likelihood of being fixed by the respective identified corrective action. 45. The apparatus of claim 41, wherein the memory further comprises programming code that causes the triage processing component to perform further functions to: assign an interdependency rating to each corrective action in the list of corrective actions, wherein the interdependency rating quantifies a level of interdependence of the computer process on other computer processes potentially affected by application of each of the one or more corrective actions in the list of corrective actions; populate the risk assessment matrix with the assigned interdependency rating of each corrective action in the list of corrective actions; and evaluate the risk assessment matrix, based on the assigned interdependency rating of each corrective action in the list of corrective actions to one another. 46.
The apparatus of claim 41, wherein the memory further comprises programming code that causes the triage processing component to perform further functions to: in response to an evaluation of the risk assessment matrix, flag a respective corrective action from the list of corrective actions as an optimal corrective action for use in the response strategy. 47. The apparatus of claim 46, wherein the memory further comprises programming code that causes the triage processing component to perform further functions to: apply the flagged respective corrective action to the computer process experiencing a process fault associated with a computer process break flag in the network. 48. The apparatus of claim 41, wherein the memory further comprises programming code that causes the triage processing component to perform further functions to: receive successive process break flags that follow previous process break flags from a monitoring circuit coupled to the triage processing component; generate, based on the process break flags, break risk assessment values and fix risk assessment values of the process break flags; populate a copy of the risk assessment matrix using the generated break risk assessment values and the generated fix risk assessment values to produce a revised risk assessment matrix; analyze the break risk assessment values and the fix risk assessment values of the previous process break flags in the populated risk assessment matrix with reference to the break risk assessment values and the fix risk assessment values of the successive process break flags in the revised risk assessment matrix; and update, based on results of the analysis of the break risk assessment values and the fix risk assessment values, the runbook to identify one corrective action in the list of corrective actions for implementing the one corrective action to fix the potential operational breakdown. 49.
A method, comprising: populating, by a triage processing component coupled to a plurality of monitoring circuits and a network environment, a risk assessment matrix with a break risk assessment value and a fix risk assessment value assigned to one or more corrective actions identified to correct a possible cause of a potential operational breakdown of a computer process; identifying interdependency risk patterns in the risk assessment matrix populated with the assigned break risk assessment value and the fix risk assessment value assigned for each of the one or more identified corrective actions; obtaining a list of corrective actions correlated to a first process break flag from a runbook, wherein the runbook includes a plurality of corrective actions that correct potential operational breakdowns of the computer process; and generating, based on the identified interdependency risk patterns, a response strategy incorporating at least one corrective action from the list of corrective actions. 50. The method of claim 49, further comprising: applying the at least one corrective action in the response strategy to the computer process experiencing a process fault associated with a computer process break flag in the network environment. 51. The method of claim 49, wherein the identified interdependency risk patterns indicate risks related to each corrective action in the runbook and an effect of applying each corrective action on the computer process in the network. 52.
The method of claim 49, further comprising: assigning, by the triage component, the break risk assessment value indicating a likelihood of occurrence of the potential operational breakdown of the computer process; assigning a respective fix risk assessment value to each of the identified corrective actions; and populating the risk assessment matrix with the assigned break risk assessment value of the computer process to each of the identified corrective actions and the assigned fix risk assessment value to each of the identified corrective actions. 53. The method of claim 49, wherein: the assigned break risk assessment value has a range from a value indicating the potential operational breakdown has a high likelihood of occurring to another value indicating the potential operation breakdown has a low likelihood of occurring; and the respective fix risk assessment value assigned to each of the identified corrective actions has a range from a value indicating the potential operational breakdown has a high likelihood of being fixed to a different value indicating the potential operation breakdown has a low likelihood of being fixed by a respective identified corrective action. 54. 
The method of claim 49, further comprising: assigning an interdependency rating to each corrective action in the list of corrective actions, wherein the interdependency rating quantifies a level of interdependence of each of the computer processes potentially affected by application of each corrective action in the list of corrective actions; populating the risk assessment matrix with the assigned interdependency rating of each corrective action in the list of corrective actions; evaluating the risk assessment matrix, based on the assigned interdependency rating of each corrective action in the list of corrective actions to one another; and in response to the evaluation of the risk assessment matrix, flagging a respective corrective action from the list of corrective actions as an optimal corrective action. 55. The method of claim 54, wherein the interdependency rating assigned to each corrective action quantifies a level of interdependence of each respective individual computer process affected by an application of each corrective action in the list of corrective actions. 56.
A non-transitory computer-readable storage medium storing computer-readable program code executable by a processor, wherein execution of the computer-readable program code causes the processor to: populate by a triage processing component coupled to a plurality of monitoring circuits and a network environment, a risk assessment matrix with a break risk assessment value and a fix risk assessment value assigned to one or more corrective actions identified to correct a possible cause of a potential operational breakdown of a computer process; identify interdependency risk patterns in the risk assessment matrix populated with the assigned break risk assessment value and the fix risk assessment value assigned for each of the identified corrective actions; obtain a list of corrective actions correlated to a first process break flag from a runbook, wherein the runbook includes a plurality of corrective actions that correct potential operational breakdowns of computer implemented processes of the network; and generate, based on the identified interdependency risk patterns, a response strategy incorporating at least one corrective action from the list of corrective actions. 57. The non-transitory computer-readable storage medium of claim 56, wherein: apply the response strategy to a respective individual computer process experiencing a process fault associated with the first process break flag in the network environment. 58. The non-transitory computer-readable storage medium of claim 56, further comprising: computer-readable program code that when executed causes the processor to: assign, by the triage component, the break risk assessment value indicating a likelihood of occurrence of the potential operational breakdown of the computer process; and assign a respective fix risk assessment value to each of the identified corrective actions. 59. 
The non-transitory computer-readable storage medium of claim 58, wherein the identified interdependency risk patterns indicate risks related to each corrective action in the runbook and an effect of applying each corrective action on the computer process in the network. 60. The non-transitory computer-readable storage medium of claim 56, further comprising computer-readable program code that when executed causes the processor to: assign an interdependency rating to each corrective action in the list of corrective actions, wherein the interdependency rating quantifies a level of interdependence of each of the computer processes potentially affected by application of each corrective action in the list of corrective actions; populate the risk assessment matrix with the assigned interdependency rating of each corrective action in the list of corrective actions; evaluate the risk assessment matrix, based on the assigned interdependency rating of each corrective action in the list of corrective actions to one another; and in response to the evaluation of the risk assessment matrix, flag a respective corrective action from the list of corrective actions as an optimal corrective action for use in the response strategy.
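By way of non-limiting illustration only, the risk matrix population and final response strategy recited in the method claims above might be sketched as follows. The claims do not prescribe any scoring scale, data structure, or selection rule, so everything named here is an assumption made for the example: the names Response, MatrixRow, populate_risk_matrix, and final_response_strategy are hypothetical, values are assumed to lie between 0 and 1, a higher fix value is assumed to mean a higher likelihood of correcting the fault, and a higher interdependency rating is assumed to mean more other processes are affected.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """A known response from the runbook (hypothetical structure)."""
    name: str
    fix_risk: float         # assumed: higher = more likely to correct the fault
    interdependency: float  # assumed: higher = more other processes affected


@dataclass
class MatrixRow:
    """One evaluated break flag and its candidate responses."""
    flag: str
    break_risk: float       # assumed: higher = more likely to cause a break
    responses: list[Response]


def populate_risk_matrix(flags, runbook, break_risk):
    """Build one matrix row per break flag extracted from the stream."""
    return [
        MatrixRow(flag, break_risk[flag], list(runbook.get(flag, [])))
        for flag in flags
    ]


def final_response_strategy(matrix):
    """Order flags by break risk, and each flag's responses by fix value.

    Ties on fix value are broken in favor of the response with the lower
    interdependency rating. This ordering rule is an assumption of the
    sketch, not a rule stated in the claims.
    """
    strategy = {}
    for row in sorted(matrix, key=lambda r: r.break_risk, reverse=True):
        ordered = sorted(
            row.responses,
            key=lambda r: (-r.fix_risk, r.interdependency),
        )
        strategy[row.flag] = [r.name for r in ordered]
    return strategy
```

In this sketch the returned dictionary lists the break flags from highest to lowest break risk, and each flag maps to an ordered series of response names that could be applied serially. A real triage processing component would likely weigh the interdependency rating more heavily than a simple tie-break.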
Description
TECHNOLOGY SYSTEM AUTO-RECOVERY AND OPTIMALITY ENGINE AND TECHNIQUES
BACKGROUND
[0001] The current state of technology remediation is that, when a computer process, computer hardware, or software breaks, people gather resources and execute fail-safes and contingency plans to recover the broken technology (i.e., the broken computer process, computer hardware, or software). Workarounds and typical break-fix activities are the mainstays of technology remediation and make up the best practices for how to recover technological services when something goes awry. The aim of these recovery plans is to address three metrics commonly used to indicate the efficacy of a technology remediation system: mean time to detect (MTTD); mean time to repair (MTTR); and mean time between failures (MTBF). An effective technology remediation system implements processes that reduce MTTD and MTTR, while increasing the MTBF.
[0002] There are several commercial systems with offerings, such as Zabbix™, that allow a computer system "break-fix" to be paired with a "Response." These commercial offerings, however, tend to require specific break events to trigger a single response. The evolution of technology services (e.g., computer systems that implement services and applications) means that the technological environments, technology, and their frameworks are becoming increasingly complex. Moreover, the identification of any single "root causing break event" may be complicated by cloud-based services such as Amazon™ Web Services (AWS), Microsoft Azure™, Oracle Cloud™, Apache Hadoop™, or Google Cloud™ platform, cross connections with physical hardware-based networks, and the many development frameworks and different coding languages that make up even simple applications. Presently, determining the root-cause source of a technology problem is substantially driven by human experience, and humans are slow providers of "production system support" and "incident triage."
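As an illustration of the three metrics named in paragraph [0001], the following sketch computes MTTD, MTTR, and MTBF from a list of incident timestamps. The Incident structure and the remediation_metrics function are hypothetical, times are assumed to be expressed in hours, and the definitions used (detection delay measured from fault start, repair time measured from detection, time between failures measured from one repair to the next fault) are one common convention among several rather than definitions given in this disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Timestamps for one fault, in hours (hypothetical structure)."""
    started: float   # when the fault began
    detected: float  # when monitoring detected it
    repaired: float  # when service was restored


def _mean(values):
    return sum(values) / len(values)


def remediation_metrics(incidents):
    """Return (MTTD, MTTR, MTBF) under the conventions assumed above."""
    ordered = sorted(incidents, key=lambda i: i.started)
    mttd = _mean([i.detected - i.started for i in ordered])
    mttr = _mean([i.repaired - i.detected for i in ordered])
    # Uptime between one repair and the start of the next fault.
    gaps = [nxt.started - cur.repaired for cur, nxt in zip(ordered, ordered[1:])]
    mtbf = _mean(gaps) if gaps else None  # undefined with fewer than two incidents
    return mttd, mttr, mtbf
```

Under these assumptions, a remediation system that detects faults sooner lowers the first value, one that applies fixes faster lowers the second, and one that addresses root causes rather than symptoms would be expected to raise the third.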
[0003] As a result, different types of production system support methodologies have been implemented to compensate for the human shortfalls. Across tech organizations, production system support functions use manually-generated and manually-maintained document libraries, called Runbooks, that are used to identify a problem via integrated monitoring and deploy a fix. These production system support functions are siloed to specific applications that have such documentation.
[0004] For example, one production system support process may be termed "Fix on the go." In a fix on the go process, engineers may make weekly/monthly rotations to support issues for 24 hours, 7 days a week. In response to detection of an application-specific issue, a support team member pages one of the engineers in the "on call" group. The engineer on call will access, via a graphical user interface, an incident bridge that lists issues, attempt to understand the issue, and implement a fix using an emergency change process. This is a slow, labor-intensive process and does not help reduce MTTR.
[0005] Another production support process utilizes document-based (where a "document" can be an actual document, an online file, help system or some other reference source) operational runbooks in which a development team/support team documents steps to fix known/recurring issues. The document operational runbooks save some time but are not a significant improvement, as an engineer needs to understand the procedure during an issue and implement the steps of fixing the known/recurring issues. There is always a chance of human error with either the understanding of the procedure or the implementation of the steps fixing the known/recurring issues. Related production support processes that automate the runbook (keep the scripts on some server/repo) offer slight improvement, but these processes still rely on a human to find a cause and trigger a fix from the corresponding runbook.
[0006] Some automated systems rely heavily on operator expertise to correctly identify the problem, its solution, and deploy it as quickly as possible. When expertise, understanding of the broken system, and/or ability to execute the fix are lacking, the brokenness escalates throughout a computing system and begins to impact upstream and downstream systems as well. This chain of upstream and downstream systems is called "interdependency."
[0007] Time is of the essence in nearly all remediation instances, but without proper resources, technology systems are subjected to lengthy and investigative triage. The fix is typically done in a silo of the impacted system first, which places the interdependencies at risk of ongoing impact and delay in restoration of service. This siloed focus on a single break event complicates identification of the root cause in the interdependent system chain and can lead to false positives where any single failure is fixed, but a root cause remains unaddressed in a systemically broken set of technology services.
[0008] The evolution of cloud-bas