EP-4439564-B1 - METHOD AND SYSTEM FOR REPAIRING A DYNAMIC RANDOM ACCESS MEMORY (DRAM) OF MEMORY DEVICE


Inventors

SHUKLA, NIDHI
JOSEPH, Preeti
CHO, Hyunbum
SHIN, YONGJAE

Dates

Publication Date: 2026-05-06
Application Date: 2024-03-28

Claims (14)

A method of repairing a Dynamic Random Access Memory, DRAM, memory device (101) the method comprising: reserving a memory space within the DRAM memory device, the reserved memory space (223) including a plurality of spare rows; identifying one or more faulty rows within the DRAM memory device using at least one memory testing method; classifying the identified one or more faulty rows into one or more correctable faulty rows and/or one or more uncorrectable faulty rows; repairing any classified correctable faulty rows by, updating an error information table (205) based on information of a respective classified correctable faulty row, the error information table including a row identifier corresponding to the respective classified correctable faulty row and an error count for the respective classified correctable faulty row, in response to the error count for the respective classified correctable faulty row exceeding a desired threshold value, mapping the respective classified correctable faulty row to an available spare row of the plurality of spare rows, storing the mapping of the respective correctable faulty row and the mapped spare row in a row repair translation table, and copying data stored in the respective correctable faulty row into the mapped spare row; enabling the reserved memory space via a Basic Input Output System, BIOS, boot menu associated with the DRAM memory device; and increasing or decreasing a number of spare rows included in the reserved memory space of the DRAM memory device from the BIOS boot menu.
The method as claimed in claim 1, further comprising: repairing any classified uncorrectable faulty rows by, mapping a respective classified uncorrectable faulty row to an available spare row of the plurality of spare rows, storing the mapping of the respective classified uncorrectable faulty row and the mapped spare row in the row repair translation table, and copying data stored in the respective classified uncorrectable faulty row into the mapped spare row.
The method as claimed in claim 1 or 2, wherein classifying the identified one or more faulty rows into the one or more correctable faulty rows and/or the one or more uncorrectable faulty rows further comprises: classifying an identified faulty row as a correctable faulty row in response to all errors within the identified faulty row being identified as correctable errors; and classifying an identified faulty row as an uncorrectable faulty row in response to at least one error within the identified faulty row being identified as an uncorrectable error.
The method as claimed in any preceding claim, wherein the at least one memory testing method comprises at least one of: an on-die Error Correction Code, ECC, memory testing method, an Error Check and Scrub, ECS, memory testing method, a side band ECC memory testing method, a memtest, a patrol scrub, or any combination thereof.
The method as claimed in any preceding claim, wherein the row repair translation table includes a plurality of rows; each row of the plurality of rows includes at least a first field, a second field, a third field, and a fourth field; the first field includes an identifier of a spare row of the plurality of spare rows; the second field includes a status identifier corresponding to the spare row, the status identifier indicating whether the spare row is available for mapping; the third field includes a fault identifier corresponding to the spare row, the fault identifier indicating whether the spare row is faulty; and the fourth field includes an identifier of a faulty row which is mapped to the spare row.
The method as claimed in any preceding claim, further comprising: receiving a request to access data stored in a row of the DRAM memory device; determining whether the requested row corresponds to one or more identified faulty rows; and in response to the requested row corresponding to one or more of the identified faulty rows, redirecting the received request to at least one mapped spare row corresponding to the requested faulty row based on mapping information of the requested faulty row stored in the row repair translation table.
The method as claimed in any preceding claim, further comprising: determining whether any row of the plurality of spare rows of the reserved memory space is faulty; and upon determining that at least one row of the plurality of spare rows is faulty, excluding the at least one faulty spare row from the plurality of spare rows.
The method as claimed in any preceding claim, further comprising: storing the error information table and the row repair translation table in non-volatile memory during a power cycle of the DRAM memory device; and restoring the error information table and the row repair translation table to the DRAM memory device in response to the DRAM memory device being powered on during a next power cycle.
A system for repairing a Dynamic Random Access Memory, DRAM, memory device (101), the system comprising: DRAM memory (203); and at least one processor (103) communicatively coupled to the DRAM memory, the at least one processor configured to: reserve a memory space within the DRAM memory, the reserved memory space (223) including a plurality of spare rows; identify one or more faulty rows within the DRAM memory using at least one memory testing method; classify the identified one or more faulty rows into one or more correctable faulty rows and/or one or more uncorrectable faulty rows; repair any classified correctable faulty rows by: updating an error information table (205) based on information of a respective classified correctable faulty row, the error information table including an identifier corresponding to the respective classified correctable faulty row and an error count for the respective classified correctable faulty row, in response to the error count for the respective classified correctable faulty row exceeding a desired threshold value, mapping the respective classified correctable faulty row to an available spare row of the plurality of spare rows, storing the mapping of the respective classified correctable faulty row and the mapped spare row in a row repair translation table, and copying data stored in the respective classified correctable faulty row into the mapped spare row; enable the reserved memory space via a Basic Input Output System, BIOS, boot menu associated with the DRAM memory device; and increase or decrease a number of spare rows included in the reserved memory space of the DRAM memory device from the BIOS boot menu.
The system as claimed in claim 9, wherein the at least one processor is further configured to repair any classified uncorrectable faulty rows by: mapping a respective classified uncorrectable faulty row with an available spare row of the plurality of spare rows; store the mapping of the respective classified uncorrectable faulty row and the mapped spare row in the row repair translation table; and copying data stored in the respective classified uncorrectable faulty row into the mapped spare row.
The system as claimed in claim 9 or 10, wherein the at least one processor is further configured to classify the identified one or more faulty rows into one or more correctable faulty rows and/or one or more uncorrectable faulty rows by: classifying an identified faulty row as a correctable faulty row in response to all errors within the identified faulty row being identified as correctable errors; and classifying an identified faulty row as an uncorrectable faulty row in response to at least one error within the identified faulty row being identified as an uncorrectable error.
The system as claimed in any one of claims 9 to 11, wherein the at least one memory testing method comprises at least one of: an on-die Error Correction Code, ECC, memory testing method, an Error Check and Scrub, ECS, memory testing method, a side band ECC memory testing method, a memtest, a patrol scrub, or any combination thereof.
The system as claimed in any one of claims 9 to 12, wherein the row repair translation table includes a plurality of rows; and each row of the plurality of rows includes a first field, a second field, a third field, and a fourth field; the first field includes an identifier of a spare row of the plurality of spare rows; the second field includes a status identifier corresponding to the spare row, the status identifier indicating whether the spare row is available for mapping; the third field includes a fault identifier corresponding to the spare row, the fault identifier indicating whether the spare row is faulty; and the fourth field includes an identifier of a faulty row which is mapped to the spare row.
The system as claimed in any one of claims 9 to 13, wherein the at least one processor is further configured to: receive a request to access data stored in a row of the DRAM memory; determine whether the requested row corresponds to any of the one or more identified faulty rows; and upon determining that the requested row corresponds to a faulty row of the one or more identified faulty rows, redirect the received request to at least one mapped spare row corresponding to the requested faulty row based on mapping information of the requested faulty row stored in the row repair translation table.
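The correctable-row repair of claim 1, the immediate repair of uncorrectable rows in claim 2, the four-field row repair translation table of claim 5, and the request redirection of claim 6 can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not the patented implementation: the class names, the dictionary standing in for DRAM rows, and the threshold value of 3 are all hypothetical, since the claims specify only "a desired threshold value".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumed value; the claims only say "a desired threshold value".
ERROR_THRESHOLD = 3


@dataclass
class SpareEntry:
    """One row of the row repair translation table (four fields, as in claim 5)."""
    spare_row: int                    # first field: spare row identifier
    available: bool = True            # second field: free for mapping?
    faulty: bool = False              # third field: is the spare itself faulty?
    mapped_row: Optional[int] = None  # fourth field: faulty row mapped here


class RowRepair:
    """Hypothetical model of the claimed repair flow."""

    def __init__(self, memory: Dict[int, bytes], spare_rows: List[int]):
        self.memory = memory                     # row id -> row contents
        self.error_counts: Dict[int, int] = {}   # error information table
        self.table = [SpareEntry(r) for r in spare_rows]

    def translate(self, row: int) -> int:
        """Redirect an access to the mapped spare row, if any (claim 6)."""
        for entry in self.table:
            if entry.mapped_row == row:
                return entry.spare_row
        return row

    def _map(self, row: int) -> Optional[int]:
        """Map a faulty row to a free, healthy spare and copy its data."""
        existing = self.translate(row)
        if existing != row:
            return existing  # already repaired
        for entry in self.table:
            if entry.available and not entry.faulty:
                entry.available = False
                entry.mapped_row = row
                self.memory[entry.spare_row] = self.memory[row]
                return entry.spare_row
        return None  # no spare row left

    def report_error(self, row: int, correctable: bool) -> Optional[int]:
        """Record an error; return the spare row if the row gets mapped."""
        if not correctable:
            # Claim 2 names no threshold, so this sketch maps right away.
            return self._map(row)
        self.error_counts[row] = self.error_counts.get(row, 0) + 1
        if self.error_counts[row] > ERROR_THRESHOLD:
            return self._map(row)
        return None


memory = {0: b"row0", 1: b"row1", 100: b"", 101: b""}
repair = RowRepair(memory, spare_rows=[100, 101])
for _ in range(4):
    repair.report_error(0, correctable=True)
print(repair.translate(0))
```

In this sketch the fourth correctable error on row 0 pushes its count past the assumed threshold, so later accesses to row 0 are redirected to the first available spare row. Marking an entry `faulty` keeps that spare out of the pool, which loosely mirrors the exclusion of faulty spare rows in claim 7.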

Description

FIELD

Various example embodiments of the inventive concepts generally relate to memory devices. Particularly, one or more example embodiments of the inventive concepts relate to memory failure management of a memory device, methods, and/or systems for repairing a Dynamic Random Access Memory (DRAM) of a memory device.

BACKGROUND

Generally, memory failures are a common cause of server failures. In other words, the memory failures can be a source of system crashes and customer dissatisfaction, unless they are reduced, prevented and/or managed properly. A modern Dynamic Random Access Memory (DRAM), such as a Double Data Rate 5 (DDR5) RAM, is equipped with error-identifying techniques to identify and/or correct memory errors (e.g., correctable errors and uncorrectable errors). The error identification techniques may include an On-die Error Correction Code (ECC) and a sideband ECC.
The On-die ECC is a feature and/or technique designed to correct bit errors within the DRAM and protect the integrity of data stored in memory cells of DRAM arrays. Although the On-die ECC technique is used to correct bit errors, it does not provide end-to-end protection. Further, the On-die ECC technique does not detect, reduce, and/or prevent errors that occur during data transmission between a memory controller and a memory module.
To provide full end-to-end protection, the On-die ECC may be used in conjunction with sideband ECC. The sideband ECC is an error identifying technique implemented in all devices using standard DDR memories (for example, DDR4, DDR5, etc.). In sideband ECC, an error correction code is sent as sideband data along with actual data to the DRAM. During a write/read operation, a memory controller of the DRAM may write/read the error correction code along with the actual data. No additional write or read overhead commands are desired and/or required for the sideband ECC technique.
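The idea of storing check bits alongside the data, and the resulting split into correctable and uncorrectable errors, can be illustrated with a toy single-error-correcting, double-error-detecting (SECDED) code. The sketch below uses an extended Hamming (8,4) code purely for illustration; the text does not state which code a given memory system uses, and the function names `encode` and `decode` are hypothetical.

```python
from typing import List, Tuple


def encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into an 8-bit codeword (7 Hamming bits + overall parity)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers codeword positions 4, 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]
    overall = 0
    for bit in word:
        overall ^= bit
    return word + [overall]


def decode(word: List[int]) -> Tuple[str, List[int]]:
    """Return (status, data bits); status is 'ok', 'correctable' or 'uncorrectable'."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4   # 1-based position of a single flipped bit
    overall = 0
    for bit in c:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: treat as a single-bit error and fix it.
        # A zero syndrome here means the overall parity bit itself flipped.
        if syndrome:
            c[syndrome - 1] ^= 1
        status = "correctable"
    else:
        # Non-zero syndrome with even parity: two bits flipped, data not trusted.
        status = "uncorrectable"
    return status, [c[2], c[4], c[5], c[6]]


codeword = encode([1, 0, 1, 1])
single = list(codeword)
single[4] ^= 1                      # one bit flipped
double = list(codeword)
double[0] ^= 1
double[1] ^= 1                      # two bits flipped
print(decode(single)[0], decode(double)[0])
```

A row whose errors all decode as "correctable" would fall into the correctable class used in the claims, while a single "uncorrectable" result would place the row in the uncorrectable class.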
Further, an error correction technique, such as Post Package Repair (PPR), etc., may be used to correct the detected errors. The PPR is a memory self-healing process of substituting access to a bad cell and/or faulty row with a spare row within the DRAM. The PPR may comprise using a Reliability Availability and Serviceability (RAS) feature wherein a Dual In-line Memory Module (DIMM) with errors (for example, a faulty row) may be repaired after packaging. The PPR may map faulty rows encountered dynamically to at least one spare row of a plurality of available spare rows. The PPR fuses a faulty row with the spare row permanently or temporarily based on the type of PPR. If the PPR fuses the faulty row with the spare row permanently, then the PPR is known as hard PPR, and if the PPR fuses the faulty row with the spare row temporarily, then the PPR is known as soft PPR. It may be noted that the number of spare rows available in a system for hard PPR is limited, and for example, the PPR may have one spare row per bank group. Once hard PPR is performed, the same faulty row will always be mapped to the same spare row. If more errors occur, the PPR may not be able to correct them, and hence the errors may become potential hard errors (e.g., permanent errors).
A sparing technique may be used to detect and correct such errors. In the context of memory devices, sparing may refer to substituting a faulty memory element with a spare or redundant memory element. The sparing is performed on an entire Rank/DIMM/Channel of the DRAM. The sparing enables an entire faulty Rank/DIMM/Channel to be replaced by a redundant Rank/DIMM/Channel. In these scenarios a complete Rank/DIMM/Channel becomes unused, and if another row becomes faulty there will be no spare memory for repairing the faulty row. Thus, there exists a need for further improvements in DRAM repairing techniques.
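The hard PPR limitation noted above can be made concrete with a small toy model. It assumes exactly one spare row per bank group, which the text gives only as an example, and the class name `HardPPR` and its method are hypothetical.

```python
from typing import Dict, Optional


class HardPPR:
    """Toy model: one spare row per bank group, and mappings are permanent."""

    def __init__(self, bank_groups: int):
        # bank group -> faulty row fused to its single spare (None if unused)
        self.fused: Dict[int, Optional[int]] = {bg: None for bg in range(bank_groups)}

    def repair(self, bank_group: int, row: int) -> bool:
        """Return True if the row is (or already was) repaired."""
        mapped = self.fused[bank_group]
        if mapped is None:
            self.fused[bank_group] = row   # permanent fuse
            return True
        # Same row stays mapped to the same spare; any other row has no spare left.
        return mapped == row


ppr = HardPPR(bank_groups=4)
print(ppr.repair(0, 17))   # first faulty row in bank group 0 uses the spare
print(ppr.repair(0, 42))   # second faulty row in the same bank group cannot be repaired
```

Once the single spare in a bank group is consumed, any further faulty row in that group stays unrepaired in this model.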
US 2015/199234 A1 discloses a method of operating a memory device which includes: checking for errors in data read from a first address of a memory cell array of the memory device; counting the number of errors that occurred in the data read from the first address; receiving a first command for data read from the first address; determining whether the number of errors that occurred in the data read from the first address is greater than or equal to a first value; and mapping the first address to a second address, if the number of errors that occurred in the data read from the first address is greater than or equal to the first value.

SUMMARY

One or more shortcomings of the prior art discussed above may be overcome and/or additional advantages may be provided by at least one example embodiment of the inventive concepts. At least one object of at least one example embodiment of the inventive concepts is to increase the number of spare rows by reserving memory in DRAM of a memory device. According to the present invention there is provided a method of repairing a Dynamic Random Access Memory (DRAM) memory device according to claim 1. Optional features of the method are defined accordi