US-20260127067-A1 - Selective Panic Mitigation
Abstract
Not all panic situations in a data storage device necessitate a host device initiated reset. When it is possible for the data storage device to handle a panic event and simply inform the host device that the panic event was avoided, efficiencies are achieved. For multi-tenant situations, the data storage device can track the types of traces and determine whether a host device initiated reset is necessary or whether the data storage device can handle the reset internally. The data storage device can delay a host device initiated reset needed by one tenant until other tenants are ready for the host device initiated reset.
Inventors
- Shay Benisty
- Ariel Navon
- Judah Gamliel Hahn
- Alexander Bazarsky
Assignees
- SanDisk Technologies, Inc.
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-11-06
Claims (20)
- 1. A data storage device, comprising: a memory device; and a controller coupled to the memory device, wherein the controller is configured to: track an indication for reset for one or more physical functions (PFs), one or more virtual functions (VFs), or a combination of PFs and VFs; determine whether a reset is to occur; determine whether a workload is a read workload; and determine whether to handle a reset internally or turn to a host device to initiate a reset.
- 2. The data storage device of claim 1, wherein the one or more PFs, the one or more VFs, or the combination of PFs and VFs comprises a first PF and a second PF, wherein the first PF comprises a first VF and a second VF.
- 3. The data storage device of claim 1, wherein the controller is configured to determine that the workload is a read workload and wherein the controller is configured to handle the reset internally.
- 4. The data storage device of claim 1, wherein the controller is configured to determine that the workload is other than a read workload and wherein the controller is configured to turn to the host device to initiate the reset.
- 5. The data storage device of claim 1, wherein the controller is configured to collect reset feedbacks and internal reset preparations, wherein the controller is configured to determine whether all resets are ready, and wherein the controller is configured to initiate reset for all relevant VFs and PFs.
- 6. The data storage device of claim 5, wherein the initiated reset is both internal and external.
- 7. The data storage device of claim 1, wherein the controller is configured to track types of traces.
- 8. The data storage device of claim 7, wherein the controller is configured to store an indication of whether a workload is a read workload for each of the one or more PFs, the one or more VFs, or the combination of PFs and VFs.
- 9. The data storage device of claim 1, wherein the controller comprises a failure detector.
- 10. The data storage device of claim 1, wherein the controller comprises a host interface module (HIM) that includes a panic reset module, a transparent reset and logs module, and a reset synchronization module.
- 11. A data storage device, comprising: a memory device; and a controller coupled to the memory device, wherein the controller is configured to: operate as a multitenant device coupled to one or more physical functions (PFs), one or more virtual functions (VFs), or a combination of PFs and VFs; track traces for each function of the one or more PFs, the one or more VFs, or the combination of PFs and VFs; determine whether the traces are for a read workload; and store an indication of whether the workload is a read workload.
- 12. The data storage device of claim 11, wherein the controller is configured to determine that a reset should occur and handle the reset internally for functions having read workloads.
- 13. The data storage device of claim 11, wherein the controller is configured to determine that a reset should occur and turn to a host device to initiate the reset for functions having other than read workloads.
- 14. The data storage device of claim 11, wherein the tracking is performed continuously.
- 15. The data storage device of claim 11, wherein the tracking is performed by determining a current workload for each PF of the one or more PFs, each VF of the one or more VFs, or each VF and PF of the combination of PFs and VFs once reset is indicated.
- 16. The data storage device of claim 11, wherein at least one PF comprises a plurality of VFs.
- 17. The data storage device of claim 11, wherein the storing comprises storing values in a bitmap indicating whether the workload is a read workload or an other than read workload.
- 18. A data storage device, comprising: means to store data; and a controller coupled to the means to store data, wherein the controller is configured to: determine that a near failure event has occurred; determine that the near failure event can be handled without host device reset; initiate host device isolation; handle state reset; restore system state from a state recovery database; and remove host device isolation.
- 19. The data storage device of claim 18, wherein the controller comprises a failure indication module and a failure recovery module, and wherein the controller maintains the state recovery database.
- 20. The data storage device of claim 19, wherein the controller operates as a multitenant device coupled to one or more physical functions (PFs), one or more virtual functions (VFs), or a combination of PFs and VFs.
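As an illustration of claims 11 to 13 and 17, the per-function read-workload indication can be sketched as a bitmap with one bit per PF or VF. This is a minimal sketch and not the claimed implementation: the class and method names, the flat function indexing, the representation of a trace as a list of opcode names, and the rule that an empty trace counts as "other than read" are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class ResetPath(Enum):
    INTERNAL = "internal"    # device handles the reset itself
    HOST_INITIATED = "host"  # device turns to the host to initiate the reset


@dataclass
class WorkloadTracker:
    """Hypothetical tracker: bit i of read_bitmap is 1 when function i
    (a PF or a VF, indexed flat for illustration) has a read workload."""

    num_functions: int
    read_bitmap: int = 0

    def _check(self, fn: int) -> None:
        if not 0 <= fn < self.num_functions:
            raise IndexError(f"function index {fn} out of range")

    def record_trace(self, fn: int, opcodes: Sequence[str]) -> None:
        # Assumption: a trace is a sequence of opcode names, and a workload
        # is a read workload only if every traced opcode is "read".
        self._check(fn)
        if opcodes and all(op == "read" for op in opcodes):
            self.read_bitmap |= 1 << fn
        else:
            self.read_bitmap &= ~(1 << fn)

    def is_read_workload(self, fn: int) -> bool:
        self._check(fn)
        return bool((self.read_bitmap >> fn) & 1)

    def reset_path(self, fn: int) -> ResetPath:
        # Read workloads are reset internally; anything else goes to the host.
        if self.is_read_workload(fn):
            return ResetPath.INTERNAL
        return ResetPath.HOST_INITIATED


tracker = WorkloadTracker(num_functions=4)
tracker.record_trace(0, ["read", "read"])
tracker.record_trace(1, ["read", "write"])
print(tracker.reset_path(0), tracker.reset_path(1))
```

Functions that have never been traced keep a cleared bit, so in this sketch they fall back to the host-initiated path, which is the more conservative choice.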
Description
BACKGROUND OF THE DISCLOSURE
Field of the Disclosure
Embodiments of the present disclosure generally relate to panic mitigation.
Description of the Related Art
The peripheral component interconnect (PCI) express (PCIe) standard introduces single root input/output (I/O) virtualization (SR-IOV), which includes physical functions (PFs) and virtual functions (VFs). PFs are full featured PCIe functions. VFs are lightweight functions that lack some configuration resources. A multi-tenant environment typically means that there is some kind of virtualization implemented in the device controller such as one or more VFs, one or more PFs, or combinations thereof. More specifically, a multi-tenant environment involves multiple functions.
When a data storage device encounters an internal failure, the data storage device has several recovery paths. Some failures can be handled within the data storage device, and some involve resetting the host interface or otherwise disrupting host-device communication. Events that involve host device interactions are called panic events. There are mechanisms in the nonvolatile memory (NVM) express (NVMe) and open compute project (OCP) standards to address panic events while minimizing impact to end-users.
Regardless of whether operating in a client or an enterprise solid state drive (SSD) environment, reducing the frequency of panic events would be valuable to avoid disrupting the host interface wherever possible.
Therefore, there is a need in the art for mitigating panic events.
SUMMARY OF THE DISCLOSURE
Not all panic situations in a data storage device necessitate a host device initiated reset. When it is possible for the data storage device to handle a panic event and simply inform the host device that the panic event was avoided, efficiencies are achieved.
For multi-tenant situations, the data storage device can track the types of traces and determine whether a host device initiated reset is necessary or whether the data storage device can handle the reset internally. The data storage device can delay a host device initiated reset needed by one tenant until other tenants are ready for the host device initiated reset.
In one embodiment, a data storage device comprises: a memory device; and a controller coupled to the memory device, wherein the controller is configured to: track an indication for reset for one or more physical functions (PFs), one or more virtual functions (VFs), or a combination of PFs and VFs; determine whether a reset is to occur; determine whether a workload is a read workload; and determine whether to handle a reset internally or turn to a host device to initiate a reset.
In another embodiment, a data storage device comprises: a memory device; and a controller coupled to the memory device, wherein the controller is configured to: operate as a multitenant device coupled to one or more physical functions (PFs), one or more virtual functions (VFs), or a combination of PFs and VFs; track traces for each function of the one or more PFs, the one or more VFs, or the combination of PFs and VFs; determine whether the traces are for a read workload; and store an indication of whether the workload is a read workload.
In another embodiment, a data storage device comprises: means to store data; and a controller coupled to the means to store data, wherein the controller is configured to: determine that a near failure event has occurred; determine that the near failure event can be handled without host device reset; initiate host device isolation; handle state reset; restore system state from a state recovery database; and remove host device isolation.
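The ordering of steps in the last embodiment above (isolate the host, reset state, restore from the state recovery database, remove isolation) can be sketched as follows. This is only an illustrative model under stated assumptions: the class name, the dictionary used as device state, the `checkpoint` helper, and the string log entries are hypothetical and are not taken from the disclosure.

```python
import copy


class RecoveryController:
    """Hypothetical model of the near-failure handling sequence."""

    def __init__(self):
        self.host_isolated = False
        self.state = {}              # stand-in for live controller state
        self.state_recovery_db = {}  # stand-in for the state recovery database
        self.log = []                # records the order of the steps taken

    def checkpoint(self):
        # Assumption: the database holds a full snapshot of the live state.
        self.state_recovery_db = copy.deepcopy(self.state)

    def handle_near_failure(self, recoverable_internally):
        if not recoverable_internally:
            # The event cannot be handled without a host device reset.
            self.log.append("request_host_reset")
            return "host_reset"

        self.host_isolated = True
        self.log.append("isolate_host")
        try:
            self.state.clear()
            self.log.append("state_reset")
            self.state.update(copy.deepcopy(self.state_recovery_db))
            self.log.append("restore_state")
        finally:
            # Isolation is always lifted, even if the restore step raises.
            self.host_isolated = False
            self.log.append("remove_isolation")
        return "internal"


ctrl = RecoveryController()
ctrl.state = {"queue_depth": 8}
ctrl.checkpoint()
ctrl.state["queue_depth"] = -1  # simulate corrupted state
print(ctrl.handle_near_failure(recoverable_internally=True), ctrl.state)
```

Keeping the host isolated for the whole reset-and-restore window is what lets the recovery stay invisible to the host in this model; the `try`/`finally` is a design choice of the sketch so that isolation cannot be left on by a failed restore.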
BRIEF DESCRIPTION OF THE DRAWINGS
So that the manner in which the above recited features of the present disclosure can be understood in detail, a more particular description of the disclosure, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this disclosure and are therefore not to be considered limiting of its scope, for the disclosure may admit to other equally effective embodiments.
FIG. 1 is a schematic block diagram illustrating a storage system in which a data storage device may function as a storage device for a host device, according to certain embodiments.
FIG. 2 is a block diagram depicting a single root input/output (I/O) virtualization (SR-IOV) system.
FIG. 3 is a schematic illustration of an internal reset for a single port memory device.
FIG. 4 is a flowchart illustrating handling of failure events according to one embodiment.
FIG. 5 is a flowchart illustrating the state logging module and database operation according to one embodiment.
FIG. 6 is a flowchart illustrating panic event mitigation in a multi-tenant system.
FIG. 7 is a flowchart illustrating selective reset preparation for a multi-tenant system.
FIG. 8 is a schematic il