US-12619460-B2 - Remotely healing crashed processes
Abstract
A method of repairing crashed applications includes detecting a crash in an application operating in a host computing device. The application is migrated to a remote computer server. The remote computer server provisions computing resources to the application, while the application is resident in the remote computer server. Resumed operation of the application is executed, using the provisioned computing resources, in the remote computer server. Execution results are generated from the application, in the remote computer server. The generated execution results are migrated from the application back to the host computing device.
Inventors
- Marco Aurelio Stelmar Netto
- Bruno Silva
- RENATO LUIZ DE FREITAS CUNHA
- Renan Francisco Santos Souza
- Lucas Correia Villa Real
Assignees
- INTERNATIONAL BUSINESS MACHINES CORPORATION
Dates
- Publication Date: 2026-05-05
- Application Date: 2021-09-20
Claims (18)
- 1. A computer-implemented method of repairing crashed applications, the computer-implemented method comprising: upon detecting a crash in an application operating in a host computing device, identified by a trap handler module in an operating system kernel of the host computing device as an illegal instruction, determining computing resources of the host computing device; receiving, by a resource limitation identifier module, data of the crashed application from the host computing device, wherein the data includes a copy of the application and a current state of operation of the application prior to the crash; upon determining, by a remote module, that the crash in the application was caused by a resource constraint of the host computing device, determining, by the remote module, a threshold computing resources to resume an execution of a remaining operation of the crashed application on a remote computer server, wherein the determining of the threshold computing resources is based on the received data; determining whether to repair the crashed application by: comparing a time to migrate execution of the application from a point prior to the crash with a time to restart the application from start until a crashed point of the execution; and comparing each of an amount of the threshold computing resources and a processing time for repairing the crashed application with a threshold cutoff value; migrating the crashed application to the remote computer server upon determining that the remote computer server has at least the threshold computing resources and upon determining to repair the crashed application; provisioning, by the remote computer server, the threshold computing resources to the application, during the application being resident in the remote computer server; executing the remaining operation of the application, using the provisioned threshold computing resources, in the remote computer server; generating execution results from the application, in the remote computer server; and migrating the generated execution results from the application back to the host computing device, wherein the host computing device is a local computing device that is separate and remote from the remote computer server, and the host computing device remains operational and transparent to a user of the host computing device during the provisioning to the remote computer server.
- 2. The computer-implemented method of claim 1, further comprising identifying one or more computing resource types for execution of the application, wherein: a lack of the identified one or more computing resource types caused the crash, and the provisioning of computing resources includes supplying, by the remote computer server, the identified one or more computing resource types to the execution of the application.
- 3. The computer-implemented method of claim 2, wherein: the identified one or more computing resource types includes access to a piece of hardware, and the provisioning of computing resources includes accessing the piece of hardware in the remote computer server.
- 4. The computer-implemented method of claim 1, further comprising: identifying a pre-crash state of the application, wherein the pre-crash state includes a current state of executed results, and the executed results correspond to the operation of the application; and rolling back the application to the pre-crash state, in the remote computer server, prior to provisioning the computing resources.
- 5. The computer-implemented method of claim 4, further comprising executing the application in the remote computer server, from the current state of executed results, prior to migrating the generated execution results from the application back to the host computing device.
- 6. The computer-implemented method of claim 1, further comprising: determining a resource cost associated with repairing the application; and determining whether to provision the computing resources based on the resource cost.
- 7. The computer-implemented method of claim 1, further comprising: receiving, by a memory state handler module, a memory state image from the host computing device, wherein an operating system of the host computing device captures the memory state image of processes associated with the application; reading, by the resource limitation identifier module, the received memory state image; querying, by the resource limitation identifier module, a troubleshooting strategy handler database; and determining, based on the querying, by the remote module, the threshold computing resources to resume the execution of the remaining operation of the crashed application on the remote computer server.
- 8. A computer program product for repairing crashed applications, the computer program product comprising: one or more non-transitory computer readable storage media, and program instructions collectively stored on the one or more non-transitory computer readable storage media, wherein the program instructions when executed by a processor, cause the processor to perform operations comprising: upon detecting a crash in an application operating in a host computing device, identified by a trap handler module in an operating system kernel of the host computing device as an illegal instruction, determining computing resources of the host computing device; receiving, by a resource limitation identifier module, data of the crashed application from the host computing device, wherein the data includes a copy of the application and a current state of operation of the application prior to the crash; upon determining, by a remote module, that the crash in the application was caused by a resource constraint of the host computing device, determining, by the remote module, a threshold computing resources to resume an execution of a remaining operation of the crashed application on a remote computer server, wherein the determining of the threshold computing resources is based on the received data; determining whether to repair the crashed application by: comparing a time to migrate execution of the application from a point prior to the crash with a time to restart the application from start until a crashed point of the execution; and comparing each of an amount of the threshold computing resources and a processing time for repairing the crashed application with a threshold cutoff value; migrating the application to the remote computer server upon determining that the remote computer server has at least the threshold computing resources and upon determining to repair the crashed application; provisioning, by the remote computer server, the threshold computing resources to the application, during the application being resident in the remote computer server; executing the remaining operation of the application, using the provisioned threshold computing resources, in the remote computer server; generating execution results from the application, in the remote computer server; and migrating the generated execution results from the application back to the host computing device, wherein the host computing device is a local computing device that is separate and remote from the remote computer server, and the host computing device remains operational and transparent to a user of the host computing device of the provisioning to the remote computer server.
- 9. The computer program product of claim 8, wherein the operations further comprise identifying one or more computing resource types for execution of the application, wherein: a lack of the identified one or more computing resource types caused the crash, and the provisioning of computing resources includes supplying, by the remote computer server, the identified one or more computing resource types to the execution of the application.
- 10. The computer program product of claim 9, wherein: the identified one or more computing resource types includes access to a piece of hardware, and the provisioning of computing resources includes accessing the piece of hardware in the remote computer server.
- 11. The computer program product of claim 8, wherein the operations further comprise: identifying a pre-crash state of the application, wherein the pre-crash state includes a current state of executed results, and the executed results correspond to the operation of the application; and rolling back the application to the pre-crash state, in the remote computer server, prior to provisioning the computing resources.
- 12. The computer program product of claim 11, wherein the operations further comprise executing the application in the remote computer server, from the current state of executed results, prior to migrating the generated execution results from the application back to the host computing device.
- 13. The computer program product of claim 8, wherein the operations further comprise: determining a resource cost associated with repairing the application; and determining whether to provision the computing resources based on the resource cost.
- 14. A remote computer server for repairing crashed applications in a host computing device, the remote computer server comprising: a network connection; one or more computer readable storage media; a processor coupled to the network connection and coupled to the one or more computer readable storage media; and a computer program product comprising program instructions collectively stored on the one or more computer readable storage media, wherein the program instructions when executed by the processor, cause the processor to perform operations comprising: detecting, through the network connection, a crash in an application operating in the host computing device; upon detecting the crash in the application, identified by a trap handler module in an operating system kernel of the host computing device as an illegal instruction, determining computing resources of the host computing device; receiving, by a resource limitation identifier module, data of the crashed application from the host computing device, wherein the data includes a copy of the application and a current state of operation of the application prior to the crash; upon determining, by the remote computer server, that the crash in the application was caused by a resource constraint of the host computing device, determining, by the remote computer server, a threshold computing resources to resume an execution of a remaining operation of the crashed application on the remote computer server, wherein the determination of the threshold computing resources is based on the received data; determining whether to repair the crashed application by: comparing a time to migrate execution of the application from a point prior to the crash with a time to restart the application from start until a crashed point of the execution; and comparing each of an amount of the threshold computing resources and a processing time for repairing the crashed application with a threshold cutoff value; receiving the application by the remote computer server, through the network connection upon determining to repair the crashed application; provisioning, by the remote computer server, the threshold computing resources to the application, during the application being resident in the remote computer server; executing the remaining operation of the application, using the provisioned threshold computing resources, in the remote computer server; generating execution results from the application, in the remote computer server; and migrating the generated execution results from the application back to the host computing device, wherein the host computing device is a local computing device that is separate and remote from the remote computer server, and the host computing device remains operational and transparent to a user of the host computing device of the provisioning to the remote computer server.
- 15. The remote computer server of claim 14, wherein the operations further comprise identifying one or more computing resource types for execution of the application, wherein: a lack of the identified one or more computing resource types caused the crash, and the provisioning of computing resources includes supplying, by the remote computer server, the identified one or more computing resource types to the execution of the application.
- 16. The remote computer server of claim 15, wherein: the identified one or more computing resource types includes access to a piece of hardware, and the provisioning of computing resources includes accessing the piece of hardware in the remote computer server.
- 17. The remote computer server of claim 14, wherein the operations further comprise: identifying a pre-crash state of the application based on the received data, wherein the pre-crash state includes a current state of executed results; and rolling back the application to the pre-crash state, in the remote computer server, prior to provisioning the computing resources.
- 18. The remote computer server of claim 17, wherein the operations further comprise executing the application in the remote computer server, from the current state of executed results, prior to migrating the generated execution results from the application back to the host computing device.
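The repair decision recited in independent claims 1, 8, and 14 may be illustrated with a short sketch. This is a minimal example under stated assumptions, not the claimed implementation: the names `RepairEstimate`, `should_repair`, and `should_migrate` are hypothetical, the use of separate cutoff values for resources and for processing time is an assumption, and the direction of each comparison (repair only when migration is faster than a restart and both quantities stay at or below their cutoffs) is one plausible reading that the claims do not fix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepairEstimate:
    """Hypothetical inputs to the repair decision; all field names are assumptions."""
    migrate_seconds: float      # time to migrate execution from a point prior to the crash
    restart_seconds: float      # time to restart from the start until the crashed point
    threshold_resources: float  # amount of threshold resources (for example, GiB of memory)
    repair_seconds: float       # processing time for repairing the crashed application


def should_repair(est: RepairEstimate, resource_cutoff: float, time_cutoff: float) -> bool:
    """Return True when remote repair looks preferable to a local restart.

    Assumed policy: repair only if migration is faster than restarting, and
    both the required resources and the repair time are within their cutoffs.
    """
    migration_is_faster = est.migrate_seconds < est.restart_seconds
    within_cutoffs = (
        est.threshold_resources <= resource_cutoff
        and est.repair_seconds <= time_cutoff
    )
    return migration_is_faster and within_cutoffs


def should_migrate(
    est: RepairEstimate,
    resource_cutoff: float,
    time_cutoff: float,
    server_free_resources: float,
) -> bool:
    """Migrate only if repair is chosen and the server has at least the threshold resources."""
    return (
        should_repair(est, resource_cutoff, time_cutoff)
        and server_free_resources >= est.threshold_resources
    )


# Example: a job that crashed after two hours and needs 64 GiB to resume.
estimate = RepairEstimate(
    migrate_seconds=120.0,
    restart_seconds=7200.0,
    threshold_resources=64.0,
    repair_seconds=600.0,
)
print(should_migrate(estimate, 128.0, 1800.0, server_free_resources=256.0))  # True
print(should_migrate(estimate, 128.0, 1800.0, server_free_resources=32.0))   # False
```

In this sketch the availability check on the remote computer server is kept separate from the repair decision, mirroring the claim language in which migration is conditioned both on the server having at least the threshold computing resources and on the determination to repair.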
Description
BACKGROUND
Technical Field
The present disclosure generally relates to networks, and more particularly, to systems and methods of remotely healing crashed processes.
Description of the Related Art
Users of computationally intensive applications, such as computer simulations or training of artificial intelligence (A.I.) models, may at first try to run those applications on their own computing devices, because it is not readily apparent to the users what resources the application actually requires before running it. When attempting to run simulation or A.I. models, such applications may not even start running because of a lack of memory or because some accelerator device is missing or otherwise unavailable in the host computing device. More complex scenarios are those in which the user application starts execution and is able to run for a while, for example several hours. After running for some time, the application crashes due to limitations in the user computing device.
Some approaches to pre-empting application crashes use application checkpointing. Checkpointing is usually explicitly programmed or configured before the system execution, which requires extra effort and time from the user. The user must explicitly program in checkpoint criteria. Checkpointing typically includes stopping the application, copying all the required data from the memory to reliable storage, and then continuing with the execution. If the application supports a checkpoint mechanism, the user may resume the application from the last saved checkpoint. Otherwise, the application is restarted from scratch, thus wasting several hours of processing previously performed.
SUMMARY
According to an embodiment of the present disclosure, a computer-implemented method of repairing crashed applications includes detecting a crash in an application operating in a host computing device. The application is migrated to a remote computer server. The remote computer server provisions computing resources to the application, while the application is resident in the remote computer server. Operation of the application is executed, using the provisioned computing resources, in the remote computer server. Execution results are generated from the application, in the remote computer server. The generated execution results are migrated from the application back to the host computing device.
In one embodiment, it is determined whether the crash was caused by a resource constraint. The provisioning of computing resources in the remote computer server allocates computing resources to meet the resource constraint. As will be appreciated, the remote computer server acts as an auxiliary source for execution of the application when the original host computer is unable to meet the demands for operating the application. This may be a temporary situation for the host computer, for example, when the host computer is running more operations than it can currently handle. Alternatively, it may be an on-demand service so that a user may have access to higher-end applications that require more resources than the current computer can handle.
According to another embodiment of the present disclosure, a computer program product for repairing crashed applications includes one or more computer readable storage media, and program instructions collectively stored on the one or more computer readable storage media. The program instructions include detecting a crash in an application operating in a host computing device. The application is migrated to a remote computer server. The remote computer server provisions computing resources to the application, while the application is resident in the remote computer server. Operation of the application is executed, using the provisioned computing resources, in the remote computer server. Execution results are generated from the application, in the remote computer server. The generated execution results are migrated from the application back to the host computing device.
According to one embodiment, the instructions identify one or more computing resource types required for the execution of the application. A lack of the identified computing resource types caused the crash. The provisioning of computing resources includes supplying, by the remote computer server, the identified computing resource types to the execution of the application. This feature shows flexibility in the subject technology, where the embodiments may distinguish, for example, between memory requirements, hardware, and software instructions. The subject technology may identify what kind of resource was lacking in the host computing device and may then locate such a resource type within the remote server environment.
According to another embodiment of the present disclosure, a remote computer server for repairing crashed applications in a host computing device includes: a network connection; one or more computer readable storage media; a processor coupled to the network conne