US-20260127076-A1 - NETWORK HEALTH SERVICES AND LIFECYCLE CORRECTNESS


Abstract

Monitoring health metrics of computing devices in a data system can be implemented at different levels. At a first level, local background services can be run locally on the computing devices to monitor a set of health metrics on the respective computing devices. At a second level, a central health marker service can monitor a set of health metrics for the computing devices.

Inventors

Libo Chen
Eddie Hao
Daniel Geoffrey Karp
Themistoklis Melissaris
Sai Bhargav Varanasi
Yuanfeng Wen

Assignees

SNOWFLAKE INC.

Dates

Publication Date: 20260507
Application Date: 20241107

Claims (20)

1. A system comprising: at least one hardware processor; and at least one memory storing instructions that cause the at least one hardware processor to perform operations comprising: receiving one or more timestamps associated with outputs of a computing device in a network-based data system in a defined time interval; comparing the one or more timestamps to a reference clock based on a first threshold to determine whether the computing device has a future clock drift; based on at least one of the one or more timestamps exceeding the first threshold, triggering a recycling operation for the computing device; based on the one or more timestamps not exceeding the first threshold, comparing the one or more timestamps to the reference clock based on a second threshold to determine whether the computing device has a past clock drift; and based on at least one of the one or more timestamps exceeding the second threshold, triggering the recycling operation for the computing device.
2. (canceled)
3. The system of claim 1, wherein the second threshold is greater than the first threshold.
4. The system of claim 1, wherein the one or more timestamps are received from a metadata database in the network-based data system.
5. The system of claim 1, wherein the recycling operation comprises: changing a state of the computing device to a kill state; transmitting a kill command to the computing device; and terminating pending operations at the computing device in response to receiving the kill command.
6. The system of claim 5, wherein the recycling operation further comprises: changing the state of the computing device to a fail state; triggering a recovery operation for the computing device, the recovery operation comprising cleaning metadata associated with the computing device stored in a metadata database.
7. The system of claim 5, wherein the kill command is transmitted from a central health service to the computing device using a remote call.
8. A method comprising: receiving one or more timestamps associated with outputs of a computing device in a network-based data system in a defined time interval; comparing the one or more timestamps to a reference clock based on a first threshold to determine whether the computing device has a future clock drift; based on at least one of the one or more timestamps exceeding the first threshold, triggering a recycling operation for the computing device; based on the one or more timestamps not exceeding the first threshold, comparing the one or more timestamps to the reference clock based on a second threshold to determine whether the computing device has a past clock drift; and based on at least one of the one or more timestamps exceeding the second threshold, triggering the recycling operation for the computing device.
9. (canceled)
10. The method of claim 8, wherein the second threshold is greater than the first threshold.
11. The method of claim 8, wherein the one or more timestamps are received from a metadata database in the network-based data system.
12. The method of claim 8, wherein the recycling operation comprises: changing a state of the computing device to a kill state; transmitting a kill command to the computing device; and terminating pending operations at the computing device in response to receiving the kill command.
13. The method of claim 12, wherein the recycling operation further comprises: changing the state of the computing device to a fail state; triggering a recovery operation for the computing device, the recovery operation comprising cleaning metadata associated with the computing device stored in a metadata database.
14. The method of claim 12, wherein the kill command is transmitted from a central health service to the computing device using a remote call.
15. Computer-storage media comprising instructions that, when executed by one or more processors of a machine, configure the machine to perform operations comprising: receiving one or more timestamps associated with outputs of a computing device in a network-based data system in a defined time interval; comparing the one or more timestamps to a reference clock based on a first threshold to determine whether the computing device has a future clock drift; based on at least one of the one or more timestamps exceeding the first threshold, triggering a recycling operation for the computing device; based on the one or more timestamps not exceeding the first threshold, comparing the one or more timestamps to the reference clock based on a second threshold to determine whether the computing device has a past clock drift; and based on at least one of the one or more timestamps exceeding the second threshold, triggering the recycling operation for the computing device.
16. (canceled)
17. The computer-storage media of claim 15, wherein the second threshold is greater than the first threshold.
18. The computer-storage media of claim 15, wherein the one or more timestamps are received from a metadata database in the network-based data system.
19. The computer-storage media of claim 15, wherein the recycling operation comprises: changing a state of the computing device to a kill state; transmitting a kill command to the computing device; and terminating pending operations at the computing device in response to receiving the kill command.
20. The computer-storage media of claim 19, wherein the recycling operation further comprises: changing the state of the computing device to a fail state; triggering a recovery operation for the computing device, the recovery operation comprising cleaning metadata associated with the computing device stored in a metadata database.
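
The two-threshold check recited in claims 1, 8, and 15 can be illustrated with a short Python sketch. It is not the claimed implementation: the names classify_drift, should_recycle, and Drift, and the 5-second and 60-second threshold values, are assumptions introduced here for illustration. The claims state only that, in some embodiments, the second threshold is greater than the first.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Iterable


class Drift(Enum):
    NONE = "none"
    FUTURE = "future"
    PAST = "past"


# Hypothetical values, not taken from the source. They are chosen only so
# that the second (past) threshold is greater than the first (future) one.
FUTURE_THRESHOLD = timedelta(seconds=5)
PAST_THRESHOLD = timedelta(seconds=60)


def classify_drift(
    timestamps: Iterable[datetime],
    reference: datetime,
    future_threshold: timedelta = FUTURE_THRESHOLD,
    past_threshold: timedelta = PAST_THRESHOLD,
) -> Drift:
    """Classify a device's output timestamps against a reference clock.

    All datetimes must be timezone-aware (or all naive); Python raises
    TypeError when the two kinds are mixed in a subtraction.
    """
    stamps = list(timestamps)
    # First comparison: is any timestamp ahead of the reference clock by
    # more than the first threshold?
    if any(ts - reference > future_threshold for ts in stamps):
        return Drift.FUTURE
    # Second comparison, reached only when no timestamp exceeded the first
    # threshold: is any timestamp behind the reference by more than the
    # second threshold?
    if any(reference - ts > past_threshold for ts in stamps):
        return Drift.PAST
    return Drift.NONE


def should_recycle(timestamps: Iterable[datetime], reference: datetime) -> bool:
    """Either kind of drift triggers the recycling operation."""
    return classify_drift(timestamps, reference) is not Drift.NONE
```

In this sketch the past-drift comparison runs only after the future-drift comparison finds nothing, mirroring the order of the claimed operations. A caller would collect the timestamps for the defined time interval (for example from a metadata database, as in claims 4, 11, and 18) and pass them in together with the current reference time.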

Description

TECHNICAL FIELD

Embodiments of the disclosure relate generally to cloud data platforms and, more specifically, to managing health services and lifecycles of computing instances in a network-based data system.

BACKGROUND

Data platforms are widely used for data storage and data access in computing and communication contexts. With respect to architecture, a data platform could be an on-premises data platform, a network-based data platform (e.g., a cloud-based data platform), a combination of the two, and/or include another type of architecture. With respect to type of data processing, a data platform could implement online transactional processing (OLTP), online analytical processing (OLAP), a combination of the two, and/or another type of data processing. Moreover, a data platform could be or include a relational database management system (RDBMS) and/or one or more other types of database management systems.

The data platforms may include a plurality of computing instances, such as virtual machines. The computing instances can suffer from different health concerns, such as high central processing unit (CPU) utilization and clock drift.

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure will be understood more fully from the detailed description given below and from the accompanying drawings of various embodiments of the disclosure.

FIG. 1 illustrates an example computing environment that includes a cloud data platform, in accordance with some embodiments of the present disclosure.

FIG. 2 is a block diagram illustrating components of a compute service manager of the cloud data platform, in accordance with some embodiments of the present disclosure.

FIG. 3 is a block diagram illustrating components of a framework for health monitoring and recovery services, in accordance with some embodiments of the present disclosure.

FIG. 4 is a flow diagram for a method for detecting clock drift, according to some example embodiments of the present disclosure.

FIG. 5 is a flow diagram for a method for recovery operations that mitigate metadata corruption, according to some example embodiments of the present disclosure.

FIG. 6 illustrates a diagrammatic representation of a machine in the form of a computer system within which a set of instructions may be executed for causing the machine to perform any one or more of the methodologies discussed herein, in accordance with some embodiments of the present disclosure.

DETAILED DESCRIPTION

Reference will now be made in detail to specific example embodiments for carrying out the inventive subject matter. Examples of these specific embodiments are illustrated in the accompanying drawings, and specific details are set forth in the following description to provide a thorough understanding of the subject matter. It will be understood that these examples are not intended to limit the scope of the claims to the illustrated embodiments. On the contrary, they are intended to cover such alternatives, modifications, and equivalents as may be included within the scope of the disclosure.

A network-based data system, as described in detail below, may include a plurality of computing devices. Monitoring and managing the health of the computing devices can be difficult. Computing devices can fail due to health issues, such as high CPU usage, memory usage, and clock drift. Techniques for monitoring different health metrics of computing devices in a data system are described herein. The monitoring can be implemented at different levels. At a first level, local background services can be run locally on the computing devices to monitor a set of health metrics on the respective computing devices. At a second level, a central health marker service can monitor a set of health metrics for the computing devices. Also, techniques for recovering failed computing devices are described below. The recovery techniques can include remote recovery that can mitigate data corruption, such as metadata corruption.
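
The recovery sequence recited in claims 5 to 7 (kill state, kill command, termination of pending operations, fail state, metadata cleanup) can be sketched as a small state progression. The sketch below is a minimal illustration under stated assumptions, not the disclosed implementation: the Device class, the ACTIVE starting state, the recycle function, the in-memory dictionary standing in for the metadata database, and the modeling of "cleaning metadata" as deleting the device's entry are all hypothetical. A direct method call stands in for the remote call from the central health service.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class State(Enum):
    ACTIVE = "active"  # assumed starting state; not named in the source
    KILL = "kill"
    FAIL = "fail"


@dataclass
class Device:
    device_id: str
    state: State = State.ACTIVE
    pending: List[str] = field(default_factory=list)

    def handle_kill(self) -> None:
        """Device-side handler: terminate all pending operations."""
        self.pending.clear()


def recycle(device: Device, metadata: Dict[str, dict]) -> List[State]:
    """Run the recycling sequence and return the states visited, in order."""
    visited = []

    # Mark the device as being killed before the command is sent.
    device.state = State.KILL
    visited.append(device.state)

    # Stand-in for the remote kill command sent by a central health service.
    device.handle_kill()

    # Mark the device as failed, then run the recovery operation.
    device.state = State.FAIL
    visited.append(device.state)

    # Recovery: clean the device's metadata. Modeled here as removing its
    # entry from a dictionary, which is an assumption of this sketch.
    metadata.pop(device.device_id, None)
    return visited
```

For example, calling recycle on a device with two pending operations leaves the device in the fail state with no pending work and removes only that device's entry from the metadata dictionary; entries for other devices are untouched.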
Actively monitoring different health metrics and managing computing devices in a distributed data system improves the technical performance and efficiency of the data system. Unhealthy devices can also lead to data corruption. Therefore, active management of the computing device based on health metrics can mitigate data corruption and ensure data accuracy.

FIG. 1 illustrates an example computing environment 100 that includes a cloud data platform 102, in accordance with some embodiments of the present disclosure. To avoid obscuring the inventive subject matter with unnecessary detail, various functional components that are not germane to conveying an understanding of the inventive subject matter have been omitted from FIG. 1. However, a skilled artisan will readily recognize that various additional functional components may be included as part of the computing environment 100 to facilitate additional functionality that is not specifically described herein. As shown, the cloud data platform 102 compris