US-12619492-B1 - Burst error correction for Reed-Solomon codes
Abstract
A system and method for processing errors in a self-managed DRAM module. A method includes receiving a block of code symbols comprising s code segments; loading the code symbols into a bank of list-decoding modules, wherein each module is configured to correct errors in a specific combination of code-segments from the block; for each list-decoder module, erase code symbols and calculate replacement code symbols to form a valid codeword; for each valid codeword, determine a codeword metric based on a number of corrections between the replacement code symbols and the erased received symbols; identify a first and a second valid codeword from the set of unique codewords generated by a bank of list-decoder modules having a minimum codeword metric and a second minimum codeword metric, respectively; and correct the first valid codeword if, based on the minimum and second minimum codeword metric, no decoding error is detected.
Inventors
- Kelly Fitzpatrick
Assignees
- ScaleFlux, Inc.
Dates
- Publication Date
- 20260505
- Application Date
- 20240710
Claims (17)
- 1. A method for correcting errors in a self-managed dynamic random access memory (DRAM) device, comprising: receive a block of code symbols comprising s code segments each with T symbols from a memory channel into a controller chip within the self-managed DRAM device; in response to a detected error in the block of code symbols, load the block of code symbols into a bank of $\binom{s}{b}$ decoders within the controller chip, wherein each decoder includes a separate memory location for storing the block of code symbols, and wherein b is a number of correctable code segments, and wherein each decoder is configured to correct 2t symbols associated with a combination of b code-segments from the block of received code symbols, where t is a number of symbols in each device in a channel of DRAM devices; for each decoder, erase code symbols associated with b code segments and calculate replacement code symbols to form a valid codeword; for each valid codeword, determine a codeword metric based on a number of corrections between the replacement code symbols and the erased received symbols; identify a first and a second valid codeword on the list having a minimum codeword metric and a second minimum codeword metric, respectively; and correct the first valid codeword based on the minimum codeword metric and the second minimum codeword metric.
- 2. The method of claim 1, wherein the first valid codeword is only corrected if: there are less than t corrections or t corrections in b/2 code-segments required in the first valid codeword, wherein 2t=bT is a number of symbols in a combination of b code segments and T is a number of symbols in each code segment; or the minimum codeword metric is correctable and no second codeword metric is identified.
- 3. The method of claim 1, wherein the codeword metric is further determined based on whether error locations within the combination of b code-segments are classified as correctable or uncorrectable.
- 4. The method of claim 1, wherein each decoder includes a hardware decoder to process an associated correctable code-segment combination.
- 5. The method of claim 1, wherein the block of code symbols comprises 64 data symbols, the number of segments s is 10 and b is two.
- 6. The method of claim 5, wherein the list of valid codewords is generated according to a process that includes: generating a first valid codeword by calculating replacement code symbols for a first and a second segment of the block of code symbols; generating s−1 valid codewords by calculating replacement code symbols for the first segment and each of a set of remaining segments of the block of code symbols utilizing a set of error symbols from the first valid codeword calculation; generating a remaining $\binom{s}{2} - s$ valid codewords utilizing the error symbols from the first s valid codewords.
- 7. The method of claim 6, wherein after the first valid codeword is generated, decoding and writing out the block of code symbols if no errors are detected.
- 8. The method of claim 7, wherein after the s valid codewords are generated, correcting the block of code symbols with error correction codes (ECC) if t errors or less than t errors are detected in any one code segment.
- 9. A self-managed dynamic random-access memory (DRAM) device, comprising: a plurality of DRAM modules arranged in channels; and a controller chip, the controller chip configured to correct errors in the plurality of DRAM modules according to a process that includes: receiving a block of code symbols comprising s code segments each with T symbols; loading the block of code symbols into a bank of $\binom{s}{b}$ decoders within the controller chip, wherein each decoder includes a separate memory location for storing the block of code symbols, and wherein b is a number of correctable code segments, and wherein each decoder is configured to correct 2t symbols associated with a combination of b code-segments from the block of received code symbols, where t is a number of symbols in each DRAM device in a channel; for each decoder, erasing code symbols associated with b code segments and calculating replacement code symbols to form a valid codeword; for each valid codeword, determining a codeword metric based on a number of corrections between the replacement code symbols and the erased received symbols; identifying a first and a second valid codeword on the list having a minimum codeword metric and a second minimum codeword metric, respectively; and correcting the first valid codeword based on the minimum codeword metric and the second minimum codeword metric.
- 10. The device of claim 9, wherein the first valid codeword is only corrected if: there are less than t corrections or t corrections in b/2 code-segments, wherein 2t=bT is a number of symbols in a combination of b code segments and T is a number of symbols in each code segment; or the minimum codeword metric is correctable and no second codeword metric is identified.
- 11. The device of claim 9, wherein the codeword metric is further determined based on whether error locations within the combination of b code-segments are classified as correctable or uncorrectable.
- 12. The device of claim 9, wherein each decoder includes a hardware decoder to process an associated correctable code-segment combination.
- 13. The device of claim 9, wherein the block of code symbols comprises 64 data symbols, the number of segments s is 10 and b is two.
- 14. The device of claim 13, wherein the list of valid codewords is generated according to a process that includes: generating a first valid codeword by calculating replacement code symbols for a first and a second segment of the block of code symbols; generating s−1 valid codewords by calculating replacement code symbols for the first segment and each of a set of remaining segments of the block of code symbols utilizing a set of error symbols from the first valid codeword calculation; and generating a remaining $\binom{s}{2} - s$ valid codewords utilizing the error symbols from the first s valid codewords.
- 15. The device of claim 14, wherein after the first valid codeword is generated, decoding and writing out the block of code symbols if no errors are detected.
- 16. The device of claim 15, wherein after the s valid codewords are generated, correcting the block of code symbols with error correction codes (ECC) if t errors or less than t are detected in any one code segment.
- 17. The device of claim 9, wherein the controller chip comprises an application specific integrated circuit (ASIC) device.
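The parameters recited in claims 5, 6, 13 and 14 can be checked with a little arithmetic. The sketch below is illustrative only and is not part of the claims. It assumes that the block consists of the 64 data symbols plus 2t = bT parity symbols, which is an inference from claims 1, 2 and 5 rather than something the claims state outright. The function name `claim_parameters` and its dictionary keys are hypothetical.

```python
from math import comb


def claim_parameters(data_symbols=64, s=10, b=2):
    """Derive block sizes and decoder counts from the claimed parameters.

    Assumption (not stated in the claims): block length s*T equals the
    data symbols plus 2t parity symbols, with 2t = b*T, so that
    data_symbols = (s - b) * T.
    """
    T, remainder = divmod(data_symbols, s - b)
    if remainder:
        raise ValueError("data symbols do not fill s - b whole segments")

    # One decoder per combination of b segments out of s.
    decoders = comb(s, b)

    # Staged list generation as recited in claims 6 and 14 (b = 2 only):
    # the first codeword, then s - 1 more, then the remaining C(s, 2) - s.
    stages = [1, s - 1, comb(s, 2) - s]

    return {
        "T": T,
        "parity_symbols": b * T,
        "block_symbols": s * T,
        "decoders": decoders,
        "stages": stages,
    }


params = claim_parameters()
print(params)
```

Under that assumption the three stages of claims 6 and 14 add up to exactly one candidate per decoder, which is a useful consistency check on the claimed counts.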
Description
PRIORITY CLAIM
This application claims priority to copending provisional application 63/512,967, filed on Jul. 11, 2023, entitled BURST ERROR CORRECTION FOR REED-SOLOMON CODES, which is hereby incorporated by reference.
TECHNICAL FIELD
The present invention relates to the field of burst-error correction in latency-sensitive multi-device memory and storage systems, specifically methods of list decoding of Reed-Solomon codes in DDR (double data rate) DRAM (dynamic random-access memory) channels.
BACKGROUND OF THE INVENTION
In conventional DRAM practice, each CPU (central processing unit) connects to its exclusively owned and controlled DRAM modules, typically in the form of DIMMs (dual in-line memory modules), through dedicated DDR channels. More recently, the computing industry has developed open standards, in particular CXL (Compute Express Link), that allow CPU-memory connections over high-speed PCIe links. In this context, much of the DRAM control and management functionality is migrated from CPUs into a CXL/DRAM controller, leading to self-managed DRAM modules in contrast to the conventional CPU-managed DRAM modules.
DRAM memory typically consists of a set of DDR devices for data and parity that are jointly accessed by a high-speed memory channel interface. Reed-Solomon (RS) error-correction codes (ECC) are typically used to generate the parity, since RS codes are well known for their burst error-correction capabilities. Correcting one ECC symbol corrects up to m bit errors in an RS code defined over the Galois field GF(2^m) with m-bit-wide symbols. Because of very low latency requirements, error correction for direct-attached DRAM memory has been limited to decoding algorithms that complete in one clock cycle.
SUMMARY
The present invention relates to low-latency ECC decoding algorithms that can be completed in a few clock cycles using highly parallel code-specific hardware within a CXL/DRAM controller.
Present embodiments harness lesser-known burst error-correction capabilities through list decoding to correct a small number of burst-errors with a large number of symbol errors. In one aspect, a method is provided for correcting errors in a self-managed DRAM device, comprising: receive a block of code symbols comprising s code segments each with T symbols; load the block of code symbols into a bank of $\binom{s}{b}$ list-decoder modules, wherein b is a number of correctable code segments, and wherein each list-decoder module is configured to correct 2t symbols associated with a combination of b code-segments from the block of received code symbols; for each list-decoder module, erase code symbols associated with b code segments and calculate replacement code symbols to form a valid codeword; for each valid codeword, determine a codeword metric based on a number of corrections between the replacement code symbols and the erased received symbols; identify a first and a second valid codeword on the list having a minimum codeword metric and a second minimum codeword metric, respectively; and correct the first valid codeword if, based on the minimum and second minimum codeword metric, no decoding error is detected.
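The erase-and-recalculate procedure in the method above can be sketched in software. This is a minimal illustration under stated assumptions, not the disclosed hardware: it swaps in an evaluation-style RS code over the prime field GF(257) with Lagrange interpolation in place of a GF(2^m) decoder, it uses toy sizes (s = 5, T = 2, b = 2), and the names `encode`, `interpolate_at` and `list_decode` are hypothetical. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
from itertools import combinations

P = 257  # toy prime field; an assumption, the disclosure uses GF(2^m)


def interpolate_at(xs, ys, x):
    """Value at x of the unique polynomial of degree < len(xs) through (xs, ys), mod P."""
    total = 0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for j, xj in enumerate(xs):
            if j != i:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(message, n):
    """Evaluate the message polynomial at the points 0..n-1, mod P."""
    return [sum(c * pow(x, i, P) for i, c in enumerate(message)) % P
            for x in range(n)]


def list_decode(received, s, T, b):
    """Try every combination of b segments: erase them, recompute them from
    the kept symbols, and score the result by the number of changed symbols.
    Returns unique (codeword, metric) pairs sorted by ascending metric."""
    n = s * T
    candidates = {}
    for combo in combinations(range(s), b):
        erased = {seg * T + k for seg in combo for k in range(T)}
        kept = [x for x in range(n) if x not in erased]
        ys = [received[x] for x in kept]
        word = list(received)
        for x in erased:
            word[x] = interpolate_at(kept, ys, x)
        # Codeword metric: corrections between replacement and erased symbols.
        metric = sum(word[x] != received[x] for x in erased)
        candidates[tuple(word)] = metric
    return sorted(candidates.items(), key=lambda item: item[1])


# Toy example: 6 message symbols, 5 segments of 2 symbols, burst in segment 3.
codeword = encode([1, 2, 3, 4, 5, 6], 10)
received = list(codeword)
received[6] = (received[6] + 1) % P
received[7] = (received[7] + 100) % P

ranked = list_decode(received, s=5, T=2, b=2)
best, runner_up = ranked[0], ranked[1]
print("minimum metric:", best[1], "second minimum metric:", runner_up[1])
print("recovered original:", list(best[0]) == codeword)
```

In this toy setting a burst confined to one segment leaves the transmitted codeword with a metric of at most T, while every other candidate needs more corrections, so the gap between the minimum and second minimum metric is what separates a trustworthy correction from a likely decoding error.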
A further aspect comprises a self-managed DRAM device, comprising: a plurality of DRAM modules arranged in channels; and a controller chip, the controller chip configured to correct errors in the plurality of DRAM modules according to a process that includes: receiving a block of code symbols comprising s code segments each with T symbols; loading the block of code symbols into a bank of $\binom{s}{b}$ list-decoder modules, wherein b is a number of correctable code segments, and wherein each list-decoder module is configured to correct 2t symbols associated with a combination of b code-segments from the block of received code symbols, where t is a number of symbols in each DRAM device in a channel; for each list-decoder module, erasing code symbols associated with b code segments and calculating replacement code symbols to form a valid codeword; for each valid codeword, determining a codeword metric based on a number of corrections between the replacement code symbols and the erased received symbols; identifying a first and a second valid codeword on the list having a minimum codeword metric and a second minimum codeword metric, respectively; and correcting the first valid codeword if, based on the minimum and second minimum codeword metric, no decoding error is detected.
BRIEF DESCRIPTION OF THE DRAWINGS
The numerous advantages of the present invention may be better understood by those skilled in the art by reference to the accompanying figures in which:
FIG. 1 illustrates the architecture of a self-managed DRAM module using a CXL/PCIe channel.
FIG. 2 illustrates one ECC codeword stored across all the s DRAM devices on one DDR channel.
FIG. 3 illustrates a flow diagram of an RS list-decoding ECC scheme according to aspects of the disclosure.
FIG. 4 depicts a set of received symbols and associated list of code