EP-4736007-A1 - METHOD AND SYSTEM FOR ERROR CHECK AND SCRUB ERROR DATA COLLECTION AND REPORTING FOR A MEMORY DEVICE


Abstract

A method and system for error check and scrub (ECS) error data collection and reporting for a memory device. A controller includes circuitry and a buffer. The circuitry may be configured to read ECS error data from a register of a memory device and calculate an ECS error increase rate based on the ECS error data. The circuitry may be configured to inform basic input output system (BIOS) by interrupt if a total number of ECS errors reaches or exceeds an ECS error number threshold or if the ECS error increase rate reaches or exceeds an ECS error rate threshold. The controller may be an out-of-band device, e.g., a baseboard management controller or a memory micro controller.

Inventors

Zhao, Yanxin
XU, TAO
LI, YUFU
LIU, SHIJIE
ZHU, LEI

Assignees

Intel IP Corporation

Dates

Publication Date: 20260506
Application Date: 20230629

Claims (19)

A controller for error check and scrub (ECS) error data collection and reporting for a memory device, comprising: circuitry configured to read ECS error data from a register of a memory device and calculate an ECS error increase rate based on the ECS error data, wherein the circuitry is further configured to inform basic input output system (BIOS) by interrupt if a total number of ECS errors reaches or exceeds an ECS error number threshold or if the ECS error increase rate reaches or exceeds an ECS error rate threshold; and a buffer configured to store the ECS error data.
The controller of claim 1, wherein the circuitry is configured to read the ECS error data from the memory device periodically.
The controller of claim 2, wherein the ECS error data includes the total number of ECS errors on the memory device during a predetermined period for ECS operation.
The controller as in any one of claims 2-3, wherein the ECS error data includes a highest number of ECS errors per memory row and a corresponding memory address during a predetermined period for ECS operation.
The controller as in any one of claims 1-3, wherein the controller is a baseboard management controller.
The controller as in any one of claims 1-3, wherein the controller is a memory micro controller.
The controller as in any one of claims 1-3, wherein the memory device is a Double Data Rate (DDR) 5 memory device.
A system, comprising: a memory device including an array of memory cells, circuitry configured to perform error check and scrub (ECS) on the array of memory cells, and registers for storing ECS error data; the controller as in any one of claims 1-3; and a processor configured to obtain the ECS error data and perform an action to recover a failed memory in the array of memory cells in response to an interrupt by the controller.
The system of claim 8, wherein the processor is configured to read the ECS error data from the controller upon reception of the interrupt.
The system of claim 8, wherein the processor is configured to perform post-package repair (PPR) or adaptive double DRAM device correction (ADDDC) after obtaining the ECS error data to recover a failed memory.
The system of claim 8, wherein the controller is a baseboard management controller (BMC) or a memory micro-controller (MMC).
The system of claim 8, wherein the processor is configured to report an error log to an operating system.
A method for error check and scrub (ECS) error data collection and reporting for a memory device, comprising: reading ECS error data from a register of a memory device; calculating an ECS error increase rate based on the ECS error data; and informing basic input/output system (BIOS) if an ECS error number reaches or exceeds an ECS error number threshold or if the ECS error increase rate reaches or exceeds an ECS error rate threshold.
The method of claim 13, further comprising saving the ECS error data in a buffer.
The method as in any one of claims 13-14, wherein the ECS error data is read from the memory device periodically.
The method as in any one of claims 13-14, wherein the ECS error data includes a total number of ECS errors on the memory device during a predetermined period for ECS operation.
The method as in any one of claims 13-14, wherein the ECS error data includes a highest number of ECS errors per memory row and a corresponding memory address during a predetermined period for ECS operation.
The method as in any one of claims 13-14, wherein the method is implemented by a baseboard management controller or a memory micro controller.
The method as in any one of claims 13-14, wherein the memory device is a Double Data Rate (DDR) 5 memory device.
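
The monitoring steps recited in the method claims (read the ECS error data, store it in a buffer, calculate an increase rate, and inform BIOS when a threshold is reached or exceeded) can be illustrated with a minimal sketch. It is not the claimed implementation. The class and field names, the threshold values, and the rate formula (change in the reported total between two consecutive reads, divided by the elapsed time) are all assumptions made for illustration; the claims do not define how the rate is calculated.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EcsSample:
    """One periodic read of the ECS error data (field names are hypothetical)."""
    timestamp_s: float   # time of the read, in seconds
    total_errors: int    # total number of ECS errors reported by the device


@dataclass
class EcsMonitor:
    """Hypothetical controller-side monitor; thresholds are illustrative."""
    error_number_threshold: int
    error_rate_threshold: float              # errors per second (assumed unit)
    buffer: List[EcsSample] = field(default_factory=list)

    def record(self, sample: EcsSample) -> Optional[str]:
        """Buffer a sample and return the reason to inform BIOS, or None."""
        rate = 0.0
        if self.buffer:
            prev = self.buffer[-1]
            dt = sample.timestamp_s - prev.timestamp_s
            if dt > 0:
                # Assumed definition: increase in total errors per second.
                rate = (sample.total_errors - prev.total_errors) / dt
        self.buffer.append(sample)
        # "Reaches or exceeds" is modelled with >= on both thresholds.
        if sample.total_errors >= self.error_number_threshold:
            return "error_number"
        if rate >= self.error_rate_threshold:
            return "error_rate"
        return None
```

In this sketch, a caller would invoke `record` once per periodic read and raise the interrupt toward BIOS whenever the return value is not `None`; the interrupt mechanism itself is outside the sketch.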

Description

Method and system for error check and scrub error data collection and reporting for a memory device

Background

A system management mode (SMM) is a special-purpose operating mode provided for handling system-wide functions. SMM is intended for use by system firmware. A system enters SMM when a system management interrupt (SMI) is triggered. An SMI has a higher priority than other external interrupts. When SMM is invoked through an SMI, all processor cores enter SMM for a specific task such as error collection or correction and return to the operating system (OS) when the task is finished. A patrol scrub complete SMI is an example of an SMI.

Memory scrubbing includes reading data from each memory location, correcting bit errors in the data (if any) with an error-correcting code (ECC), and writing the corrected data back to the same memory location. Patrol scrubbing runs in an automated manner when the system is idle, while demand scrubbing performs the error correction when the data is actually requested from a memory. Patrol scrubbing is performed using an integrated memory controller (IMC) patrol engine that generates read requests to memory addresses in a stride. The patrol engine guarantees that every address in a memory is scrubbed at least once in a pre-determined duration (normally 24 hours). Once the patrol scrub is complete, an SMI is triggered to collect error data.

In order to not disturb regular memory requests from processors/processor cores and thus prevent a performance decrease, scrubbing is usually done during idle periods. As the scrubbing includes normal read and write operations, it may increase power consumption for the memory compared to non-scrubbing operation. Therefore, scrubbing is not performed continuously but periodically. For many servers, the scrub period can be configured in the basic input/output system (BIOS) setup program.
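
The read, correct, and write-back loop that defines memory scrubbing can be sketched in a few lines. The sketch below is illustrative only: it uses a Hamming(7,4) single-error-correcting code as a stand-in, because the text does not specify which ECC the memory uses, and it walks a Python list linearly rather than modelling the IMC patrol engine's strided read requests. All function names are hypothetical.

```python
from typing import List

DATA_POSITIONS = (3, 5, 6, 7)   # 1-indexed codeword positions holding data bits


def syndrome(word: int) -> int:
    """XOR of the 1-indexed positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            s ^= pos
    return s


def encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword."""
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (nibble >> i) & 1:
            word |= 1 << (pos - 1)
    s = syndrome(word)
    # Parity bits sit at positions 1, 2 and 4; set them so the syndrome is 0.
    for k in range(3):
        if (s >> k) & 1:
            word |= 1 << ((1 << k) - 1)
    return word


def patrol_scrub(memory: List[int]) -> int:
    """Read every location, fix single-bit errors, write back; return fix count."""
    corrected = 0
    for addr, word in enumerate(memory):
        s = syndrome(word)
        if s:
            # A non-zero syndrome is the position of the flipped bit.
            memory[addr] = word ^ (1 << (s - 1))
            corrected += 1
    return corrected
```

The returned count plays the role of the error data that is collected once a scrub pass completes; a code of this kind corrects at most one flipped bit per word.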
Brief description of the Figures

Some examples of apparatuses and/or methods will be described in the following by way of example only, and with reference to the accompanying figures, in which
FIG. 1 shows a block diagram of an example system for implementing an ECC error check and scrub (ECS) function;
FIG. 2 illustrates a scenario for conventional patrol scrubbing;
FIG. 3 shows a block diagram of a system in accordance with one example;
FIG. 4 shows an example flow for collecting ECS error data using a baseboard management controller (BMC) or a memory microcontroller (MMC) in accordance with one example;
FIG. 5 illustrates an example case in which a controller monitors the ECS error number, calculates an ECS error increase rate, and reports the ECS error status based on the ECS error number or the ECS error increase rate;
FIG. 6 is a block diagram of an electronic apparatus incorporating at least one electronic assembly and/or method described herein;
FIG. 7 illustrates a computing device in accordance with one implementation of the invention; and
FIG. 8 shows an example of a higher-level device application for the disclosed embodiments.

Detailed Description

Various examples will now be described more fully with reference to the accompanying drawings in which some examples are illustrated. In the figures, the thicknesses of lines, layers and/or regions may be exaggerated for clarity. Accordingly, while further examples are capable of various modifications and alternative forms, some particular examples thereof are shown in the figures and will subsequently be described in detail. However, this detailed description does not limit further examples to the particular forms described. Further examples may cover all modifications, equivalents, and alternatives falling within the scope of the disclosure.
Like numbers refer to like or similar elements throughout the description of the figures, which may be implemented identically or in modified form when compared to one another while providing for the same or a similar functionality. It will be understood that when an element is referred to as being “connected” or “coupled” to another element, the elements may be directly connected or coupled or via one or more intervening elements. If two elements A and B are combined using an “or”, this is to be understood to disclose all possible combinations, i.e., only A, only B as well as A and B. An alternative wording for the same combinations is “at least one of A and B”. The same applies for combinations of more than 2 elements. The terminology used herein for the purpose of describing particular examples is not intended to be limiting for further examples. Whenever a singular form such as “a,” “an” and “the” is used and using only a single element is neither explicitly nor implicitly defined as being mandatory, further examples may also use plural elements to implement the same functionality. Likewise, when a functionality is subsequently described as being implemented using multiple elements, further examples may implement the same functionality using