CN-121979705-A - Storage equipment management method based on fault prediction and automatic recovery
Abstract
The invention discloses a storage equipment management method based on fault prediction and automatic recovery. The method dynamically calculates a health score for each storage module and automatically switches each module among a normal mode, a degraded mode and an isolated mode according to its score. When a module enters the degraded mode, the controller divides its data into a plurality of independently movable data segments, migrates the segments to healthy modules in order of risk, and updates the local logical mapping table in real time after each segment is migrated. During migration, the controller re-evaluates the health score in real time. If the score recovers to a preset threshold, the controller adjusts the migration of the remaining data segments and the rate of new data writes until the module returns to the normal mode. If the health score stays below the threshold, the module is permanently switched to the isolated mode, writes to it are prohibited, and its data is logically remapped to other healthy modules. The method thereby provides active protection and automatic recovery for SSD data and improves the reliability and service life of the storage system.
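The score-driven mode switching summarized in the abstract can be illustrated with a minimal sketch. The abstract gives no numeric boundaries, so the names `classify_mode`, `degraded_below` and `isolated_below`, the 0 to 100 score scale and the default values are all assumptions made for illustration only.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"
    ISOLATED = "isolated"


def classify_mode(score: float,
                  degraded_below: float = 70.0,
                  isolated_below: float = 40.0) -> Mode:
    """Map a health score to a mode (both boundaries are hypothetical values)."""
    if isolated_below >= degraded_below:
        raise ValueError("isolated_below must be lower than degraded_below")
    if score < isolated_below:
        return Mode.ISOLATED   # writes prohibited, data remapped elsewhere
    if score < degraded_below:
        return Mode.DEGRADED   # triggers segmented migration
    return Mode.NORMAL
```

With these assumed boundaries, a module scoring 85 stays in the normal mode, one scoring 55 enters the degraded mode, and one scoring 10 is isolated.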
Inventors
- WANG CHONGYU
Assignees
- 晋达半导体有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20251208
Claims (1)
- 1. A storage equipment management method based on fault prediction and automatic recovery, applicable to a solid state disk, characterized by comprising the following steps:
  (A) A health evaluation and state switching step, in which a controller of a computer system continuously monitors trend changes of a plurality of storage behavior indexes, calculates a health score for each of a plurality of storage modules according to the trend changes, and automatically switches the storage modules among three states, namely a normal mode, a degraded mode and an isolated mode, according to the health score; wherein the storage behavior indexes comprise a write error event frequency, a write delay time offset and a retention error amplification, and each storage module is a storage unit formed by at least one flash memory die or a logical partition thereof;
  (B) A segmented migration step, in which actions b1 to b4 are executed when any one of the storage modules enters the degraded mode:
    b1. The controller divides the existing valid data in the storage module that entered the degraded mode into a plurality of independently movable data segments according to logical address continuity, physical page aggregation degree or block correspondence, wherein each data segment corresponds to at least one logical address range and at least one physical page group;
    b2. For each data segment, the controller determines a risk index according to the local aging degree of the physical page group to which the data segment belongs, and forms a segmented migration sequence by ordering the data segments from highest to lowest risk;
    b3. Following the segmented migration sequence, the controller migrates the data segments in order, starting with the highest-risk data segment, to any one of the other storage modules in the normal mode;
    b4. After the migration of a data segment is completed, the controller immediately updates the local logical mapping table corresponding to that data segment, so that the data segment takes effect in real time at its logical position in the new module without waiting for the data segments not yet migrated;
  (C) A real-time migration evaluation step, in which new data writing is suspended when the abnormal storage module starts migrating the first data segment, and the controller recalculates the health score of the abnormal storage module in real time after each data segment is migrated; if the health score rises above a first preset threshold, migration of the remaining data segments continues and new data writing is executed concurrently, the migration timing of the remaining data segments being maintained at the original speed and the timing of new data writing being adjusted to twice the period of the original timing to reduce the instantaneous write speed; if the health score rises above a second preset threshold, which is higher than the first preset threshold, the abnormal storage module is switched to the normal mode, migration of the remaining data segments is stopped, and the timing of new data writing is restored to the normal write speed; wherein the data segments remaining after migration is stopped are kept in the original storage module, and at a subsequent state switch a new segmented migration process or isolation process is determined again according to the health score;
  (D) A permanent isolation step, in which, if the health score of any storage module is initially judged to belong to the isolated mode, writing is permanently prohibited and the related logical mapping is redirected to other storage modules in the normal mode; or, if the health score of the storage module is initially judged to belong to the degraded mode and, before the first preset threshold has been reached, the health score after N data segments have been moved is still lower than or equal to the health score after (N-5) data segments were moved, the controller switches the storage module to the isolated mode, writing is permanently prohibited, and the related logical mapping is redirected to the storage modules in the normal mode, wherein N is an integer greater than or equal to 5.
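Steps (B) to (D) of the claim can be read as a single control loop. The following Python sketch is one possible interpretation, not the patented implementation. The names `Segment`, `MigrationResult`, `migrate_degraded_module` and `score_after` are hypothetical. The local aging value is used directly as the risk index, and the health score is supplied by a caller-provided function because the claim does not give a scoring formula. The write period is modelled as a multiplier (infinite = suspended, 2.0 = doubled period, 1.0 = normal).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

LbaRange = Tuple[int, int]


@dataclass
class Segment:
    lba_range: LbaRange  # logical address range covered by this segment
    aging: float         # local aging of its physical page group (assumed risk index)


@dataclass
class MigrationResult:
    outcome: str = "degraded"            # "normal", "degraded" or "isolated"
    moved: List[Segment] = field(default_factory=list)
    remaining: List[Segment] = field(default_factory=list)
    write_period: float = float("inf")   # new writes start out suspended (step C)


def migrate_degraded_module(segments: List[Segment],
                            mapping: Dict[LbaRange, str],
                            target: str,
                            score_after: Callable[[int], float],
                            t1: float,
                            t2: float,
                            lookback: int = 5) -> MigrationResult:
    """score_after(k) returns the module's health score after k segments moved."""
    if t2 <= t1:
        raise ValueError("second threshold must be higher than the first")
    queue = sorted(segments, key=lambda s: s.aging, reverse=True)  # b2: riskiest first
    history = [score_after(0)]        # history[k] = score after k moves
    result = MigrationResult()
    reached_t1 = False
    while queue:
        seg = queue.pop(0)                # b3: move the riskiest remaining segment
        mapping[seg.lba_range] = target   # b4: remap immediately, per segment
        result.moved.append(seg)
        n = len(result.moved)
        score = score_after(n)            # C: re-evaluate after every segment
        history.append(score)
        if score > t2:                    # recovered: stop migrating, normal writes
            result.outcome, result.write_period = "normal", 1.0
            break
        if score > t1:                    # partial recovery: writes at doubled period
            reached_t1 = True
            result.write_period = 2.0
        elif not reached_t1 and n >= lookback and score <= history[n - lookback]:
            for rest in queue:            # D: no improvement over the last 5 moves
                mapping[rest.lba_range] = target
            result.moved.extend(queue)
            queue = []
            result.outcome, result.write_period = "isolated", float("inf")
            break
    result.remaining = queue              # segments left in the original module
    return result
```

In this reading, segments left in `remaining` after a recovery stay mapped to the original module, matching the claim's statement that they are kept there until a later state switch. On isolation, the sketch redirects all remaining mappings to a single target for brevity, whereas the claim allows any modules in the normal mode.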
Description
Storage equipment management method based on fault prediction and automatic recovery
Technical Field
The invention relates to data management technology for solid state disks, and in particular to a storage equipment management method based on fault prediction and automatic recovery. Through dynamic monitoring of health scores, segmented data migration, real-time logical mapping updates and automatic isolation control, the method improves the reliability and service life of each storage module in a solid state disk, reduces the risk of data loss, improves write efficiency and keeps the storage system operating stably.
Background
In data storage management for solid state disks, the system must continuously monitor the health state of each storage unit and take corresponding measures against possible faults or performance degradation. However, the prior art gives insufficient consideration to dynamic changes in data distribution, local aging and write load within a storage unit, and offers limited overall coordination between the data migration strategy and write operations, which can make it difficult to achieve both performance and reliability. In addition, existing methods mostly use static or preset conditions as the basis for health judgment and lack a strategy that adjusts automatically to the real-time health state of the storage unit. In particular, they cannot flexibly adjust the pace of data migration and writing when handling locally aged or high-risk data, which may result in inefficient migration or excessive interference with normal operation. The prior art therefore still struggles to reconcile reliability and efficiency when dealing with dynamic changes in health state, local data aging and the segmented migration requirements of the storage unit.
Disclosure of Invention
An objective of the present invention is to provide a storage device management method based on failure prediction and automatic recovery that evaluates the health status of each storage module in real time and dynamically adjusts data migration and write operations according to the health score of the module, thereby improving the reliability and service life of the module while, through segmented data migration and operation timing control, reducing the impact on normal data writing and improving the performance and stability of the overall storage system.
The invention provides a storage device management method based on failure prediction and automatic recovery, which is applicable to a solid state disk and comprises:
(A) a health evaluation and state switching step, in which a controller of a computer system continuously monitors trend changes of a plurality of storage behavior indexes, calculates a health score for each of a plurality of storage modules, and automatically switches the storage modules among three states, namely a normal mode, a degraded mode and an isolated mode, according to the health score; wherein the storage behavior indexes comprise a write error event frequency, an erasure delay time offset and a retention error amplification, and each storage module is a storage unit formed by at least one flash memory die or a logical partition thereof; and wherein the health score is dynamically calculated from the distribution of valid data in the storage modules, the local aging degree and changes in write load, so that data migration affects the health score in real time;
(B) a segmented migration step, in which actions b1 to b4 are executed when one of the storage modules enters the degraded mode:
b1. the controller divides the existing valid data in the storage module that entered the degraded mode into a plurality of independently movable data segments according to logical address continuity, physical page aggregation degree or block correspondence, wherein each data segment corresponds to at least one logical address range and at least one physical page group;
b2. the controller determines the risk index of each data segment according to the local aging degree of the physical page group of that data segment, and forms a segmented migration sequence by ordering the data segments from highest to lowest risk;
b3. following the segmented migration sequence, the controller migrates the data segments in order, starting with the highest-risk data segment, to any one of the other storage modules in the normal mode;
b4. after completing the migration of a data segment, the controller immediately updates the local logical mapping table corresponding to that data segment, so that the data segment takes effect in real time at its logical position in the new module without waiting for the data segments that have not yet been migrated;
(C) a real-time migration evaluation step, in which new data writing is suspended when the abnormal sto