CN-112711492-B - Firmware-based solid state drive block failure prediction and avoidance scheme

Abstract

A Solid State Drive (SSD) is disclosed. The SSD may include flash memory for storing data, the flash memory organized into a plurality of blocks. A controller may manage reading data from and writing data to the flash memory. A metadata storage may store device-based log data for errors in the SSD. Identification firmware may identify a suspect block in response to the device-based log data. In some embodiments of the inventive concept, verification firmware may determine whether the suspect block is predicted to fail in response to both precise block-based data and the device-based log data.
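
The two-stage flow described in the abstract can be sketched as follows. This is a minimal illustration in Python rather than firmware code. Every name in it (`device_log`, `block_counters`, `identify_suspects`, `verify`), the log layout, the event threshold, and the logistic weights are assumptions made for the example, not details taken from the patent.

```python
import math
from collections import Counter

# Hypothetical device-based log: one entry per error event, newest last.
# The (block_id, error_type) layout is an assumption for illustration.
device_log = [(7, "read"), (7, "read"), (3, "program"), (7, "erase"), (12, "read")]

# Hypothetical precise block-based counters, kept per block.
block_counters = {
    3: {"read": 1, "program": 1, "erase": 0},
    7: {"read": 9, "program": 2, "erase": 4},
    12: {"read": 1, "program": 0, "erase": 0},
}

def identify_suspects(log, min_events=2):
    """Stage 1: flag blocks that appear at least min_events times in the log."""
    hits = Counter(block for block, _ in log)
    return sorted(block for block, n in hits.items() if n >= min_events)

def verify(block, log, counters, threshold=0.5):
    """Stage 2: score one suspect using both precise counters and the log."""
    c = counters[block]
    recent = sum(1 for b, _ in log if b == block)
    # Illustrative weights only; a real model would be trained offline.
    z = -4.0 + 0.3 * c["read"] + 0.4 * c["program"] + 0.5 * c["erase"] + 0.5 * recent
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability >= threshold

# Only blocks flagged in stage 1 are passed to the costlier stage 2.
suspects = identify_suspects(device_log)
predicted_to_fail = [b for b in suspects if verify(b, device_log, block_counters)]
```

The point of the split is that the cheap first pass touches only the device-wide log, while the per-block model runs on the few blocks that pass it.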

Inventors

  • N. Eyasi
  • CUI CHANGHAO

Assignees

  • Samsung Electronics Co., Ltd. (三星电子株式会社)

Dates

Publication Date: 2026-04-21
Application Date: 2020-10-23
Priority Date: 2019-12-02

Claims (20)

  1. A solid state drive (SSD) comprising: a flash memory for storing data, the flash memory comprising a plurality of blocks; a controller for managing reading data from and writing data to the flash memory, the controller including a translation layer to map logical addresses used by a processor to physical addresses on the flash memory; and a metadata storage storing first data including device-based log data for error events in the SSD and second data including block-based data for a counter of a number of errors for each of the plurality of blocks; wherein the controller is configured to execute a first instruction based at least in part on the first data to identify a suspect block of the plurality of blocks and a second instruction based at least in part on both the second data and the first data to verify whether the suspect block is predicted to fail, and wherein the error event is associated with an abnormal activity performed by the controller on the flash memory.
  2. The SSD of claim 1, wherein the metadata storage stores device-based log data for only a most recent set of errors in the SSD, and the SSD also includes identification firmware executing on the processor, the identification firmware operable to identify a suspect block of the plurality of blocks in response to the device-based log data.
  3. The SSD of claim 2, wherein: the metadata storage is further operable to store precise block-based data regarding errors in the SSD, and the SSD also includes verification firmware executing on the processor, the verification firmware operable to determine whether the suspect block is predicted to fail in response to the precise block-based data and the device-based log data.
  4. The SSD of claim 3, wherein the verification firmware is executed only on the suspect block.
  5. The SSD of claim 3, wherein the verification firmware is operable to retire the suspect block in response to the precise block-based data and the device-based log data.
  6. The SSD of claim 3, wherein the verification firmware performs one of a random forest, logistic regression, outlier detection analysis, and anomaly detection analysis on the precise block-based data and the device-based log data.
  7. The SSD of claim 2, wherein the identification firmware is operable to derive approximate block-based data from the device-based log data.
  8. The SSD of claim 2, wherein the SSD is operable to periodically execute the identification firmware.
  9. A method for fault prediction, comprising: tracking errors in a solid state drive (SSD), the SSD comprising a plurality of blocks; storing, in the SSD, first data comprising device-based log data for error events in the SSD and second data comprising block-based data for a counter of a number of errors for each of the plurality of blocks; and executing a first instruction based at least in part on the first data to identify a suspect block of the plurality of blocks and executing a second instruction based at least in part on both the second data and the first data to verify whether the suspect block is predicted to fail; wherein the error event is associated with an abnormal activity in the SSD.
  10. The method of claim 9, wherein storing device-based log data regarding errors in the SSD comprises storing device-based log data for only a most recent set of errors in the SSD.
  11. The method of claim 10, further comprising: storing precise block-based data about errors in the SSD, and once the suspect block is identified, determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data.
  12. The method of claim 11, wherein determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data comprises determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data only for the suspect block.
  13. The method of claim 11, further comprising retiring the suspect block based at least in part on the precise block-based data and the device-based log data.
  14. The method of claim 11, wherein determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data comprises performing one of a random forest, logistic regression, outlier detection analysis, and anomaly detection analysis on the precise block-based data and the device-based log data.
  15. The method of claim 10, wherein identifying a suspect block of the plurality of blocks in response to the device-based log data comprises deriving approximate block-based data from the device-based log data.
  16. The method of claim 10, further comprising periodically identifying a new suspect block of the plurality of blocks in response to the device-based log data.
  17. A program product comprising a non-transitory storage medium having instructions stored thereon that, when executed by a machine, result in: tracking errors in a solid state drive (SSD), the SSD comprising a plurality of blocks; storing, in the SSD, first data including device-based log data for error events in the SSD and second data including block-based data for a counter of a number of errors for each of the plurality of blocks; and executing a first instruction based at least in part on the first data to identify a suspect block of the plurality of blocks and executing a second instruction based at least in part on both the second data and the first data to verify whether the suspect block is predicted to fail; wherein the error event is associated with an abnormal activity in the SSD.
  18. The program product of claim 17, wherein storing device-based log data regarding errors in the SSD comprises storing device-based log data for only a most recent set of errors in the SSD.
  19. The program product of claim 18, wherein the non-transitory storage medium has stored thereon further instructions that, when executed by the machine, result in: storing precise block-based data about errors in the SSD, and once the suspect block is identified, determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data.
  20. The program product of claim 19, wherein determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data comprises determining whether the suspect block is predicted to fail in response to both the precise block-based data and the device-based log data only for the suspect block.
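
Claims 2 and 7 (and their counterparts in claims 10, 15, and 18) describe keeping device-based log data for only the most recent errors and deriving approximate block-based data from it. The sketch below shows one way this could look, assuming a fixed-capacity log. The `DeviceLog` class, its capacity, and the `(block_id, error_type)` entry layout are hypothetical and are not taken from the patent.

```python
from collections import Counter, deque

class DeviceLog:
    """Fixed-capacity error log; the oldest entries are dropped when it is full."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently evicts the oldest entry on overflow.
        self.entries = deque(maxlen=capacity)

    def record(self, block_id, error_type):
        self.entries.append((block_id, error_type))

    def approximate_block_counts(self):
        """Per-block error counts derived from the retained entries only."""
        return Counter(block_id for block_id, _ in self.entries)

log = DeviceLog(capacity=4)
events = [(5, "read"), (5, "read"), (9, "program"),
          (5, "erase"), (9, "read"), (2, "read")]
for block_id, error_type in events:
    log.record(block_id, error_type)

# Block 5 had three errors in total, but two were evicted from the log,
# so the derived count is only an approximation of the true history.
counts = log.approximate_block_counts()
```

Because evicted entries no longer contribute, the derived counts can undercount older errors, which is one reason a second stage that uses precise per-block counters is useful.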

Description

Firmware-based solid state drive block failure prediction and avoidance scheme

Cross Reference to Related Applications

The present application claims the benefit of U.S. Provisional Patent Application Ser. No. 62/926,420, filed on October 25, 2019, which is incorporated herein by reference for all purposes.

Technical Field

The present inventive concept relates generally to storage devices and, more particularly, to providing fine-grained block failure prediction.

Background

A NAND flash Solid State Drive (SSD) failure in the field may cause a server to shut down, compromising the performance and availability of data-center-level applications. To prevent such unexpected failures, systems employing SSDs typically use a simple threshold-based model and replace the drive before it fails. Such protection mechanisms may produce a high rate of false alarms, or may fail to predict or avoid all SSD failures. In addition, in the event of a physical error, the SSD cannot recover from the error and thereby avoid device failure.

There remains a need to provide fine-grained block failure prediction.

Disclosure of Invention

Drawings

Fig. 1 illustrates a system including a Solid State Drive (SSD) that may perform fine-grained block failure prediction, according to an embodiment of the inventive concept.
Fig. 2 shows details of the machine of Fig. 1.
Fig. 3 shows details of the SSD of Fig. 1.
Fig. 4 illustrates example block-based data that may be used by the SSD of Fig. 1.
Fig. 5 shows device-based log data that may be used by the SSD of Fig. 1.
Fig. 6 illustrates the identification firmware and verification firmware of Fig. 3 operating to determine whether a particular block is expected to fail.
Figs. 7A-7B illustrate a flowchart of an example process of determining whether a block is expected to fail, according to an embodiment of the inventive concept.

Detailed Description

Reference will now be made in detail to embodiments of the present inventive concept, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present inventive concepts. It will be appreciated, however, by one of ordinary skill in the art that the inventive concepts may be practiced without these specific details. In other instances, well-known methods, procedures, components, circuits, and networks have not been described in detail so as not to unnecessarily obscure aspects of the embodiments.

It will be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first module may be referred to as a second module, and similarly, a second module may be referred to as a first module, without departing from the scope of the inventive concept.

The terminology used in the description of the inventive concepts herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the inventive concepts. As used in the description of the inventive concepts and the claims below, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It is also to be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The components and features of the drawings are not necessarily to scale.

A firmware-based Solid State Drive (SSD) failure protection mechanism is presented for early detection and isolation of errors. This mechanism may prevent the drive from failing, or at least prevent premature replacement of the drive.

An SSD includes a plurality of flash memory chips, each flash memory chip including a number of blocks. A block may contain any number of pages. The size of a page is typically a few kilobytes, and a page is typically the smallest unit for reading data from and writing data to an SSD. The SSD controller (firmware) may include all logic needed to service read and write requests, run wear leveling algorithms, and run error recovery procedures.

Each SSD page may include Error Correction Code (ECC) metadata that may be used by the SSD controller to recover and repair a limited number of bit errors (typically 1-2 bit errors). However, if the number of bit errors due to hardware failure exceeds a certain number, the SSD controller may not be able to correct the errors, and thus