CN-122019235-A - Error counter, error counting method, computing device, and chip

Abstract

The application discloses an error counter, an error counting method, a computing device, and a chip, and belongs to the field of memory management. The error counter comprises a control unit and n entries. The control unit is configured to receive a first error address. When the first memory row corresponding to the first error address has no corresponding entry among the n entries, the control unit allocates the entry that has gone unaccessed for the longest time among the n entries as the first entry corresponding to the first memory row and resets the error count stored in that entry. When the first memory row corresponds to a first entry among the n entries, the control unit increases the error count stored in that entry. Because the error counter can store the error counts of multiple memory rows simultaneously, the error counts of other memory rows continue to be counted while software reads the error count of one memory row in order to repair it, so that row errors are not missed.
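
As an illustration of the allocate-or-increment behavior described in the abstract, the following is a minimal Python sketch, not the hardware design itself. The class name `ErrorCounter`, the method `record_error`, and the use of a plain integer as the row identifier are assumptions made for the example. It also assumes that a newly allocated entry counts the error that triggered the allocation, which the abstract does not specify.

```python
from collections import OrderedDict


class ErrorCounter:
    """Toy model of an n-entry per-row error counter with LRU replacement."""

    def __init__(self, n):
        if n < 2:
            raise ValueError("n must be greater than one")
        self.n = n
        # Maps row id -> error count; iteration order runs from the
        # longest-unaccessed entry to the most recently accessed one.
        self.entries = OrderedDict()

    def record_error(self, row):
        """Process one error whose address has already been reduced to a row id."""
        if row in self.entries:
            # Hit: increase the stored count and mark the entry as just accessed.
            self.entries[row] += 1
            self.entries.move_to_end(row)
        else:
            if len(self.entries) == self.n:
                # Miss with a full table: drop the longest-unaccessed entry.
                self.entries.popitem(last=False)
            # Assumption: the reset count is 0 and this error is then counted.
            self.entries[row] = 1
        return self.entries[row]
```

With n = 2, errors on rows 10, 10, 20, 10, 30 leave row 10 with a count of 3 and replace row 20, the longest-unaccessed entry, with row 30. Reading `entries` at any point does not stop later calls from counting errors on other rows.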

Inventors

  • WANG YUXUAN
  • SONG MINGHUI
  • ZHONG LIHUI

Assignees

  • 海光信息技术股份有限公司

Dates

Publication Date
20260512
Application Date
20260130

Claims (20)

  1. An error counter, characterized in that the error counter comprises a control unit and n entries, each entry being used to store the error count of a single memory row in a memory, n being a positive integer greater than one; the control unit is configured to receive a first error address, where the first error address includes at least a row-level address of the storage unit in the memory where an error occurred; the control unit is configured to, when the first memory row corresponding to the first error address has no corresponding entry among the n entries, allocate the entry that has not been accessed for the longest time among the n entries as a first entry corresponding to the first memory row, and reset the error count stored in the first entry; and the control unit is configured to increase the error count stored in the first entry when the first memory row corresponding to the first error address corresponds to the first entry among the n entries.
  2. The error counter of claim 1, wherein each entry is configured to store a row tag and the error count of the memory row corresponding to the row tag; the control unit is configured to extract the row-level address from the first error address, obtain a first row tag based on the row-level address, and determine that the first memory row has no corresponding entry among the n entries if the first row tag is not equal to the row tag of any of the n entries; and, if the first row tag is equal to the row tag of one of the n entries, to determine that the first memory row corresponds to the first entry among the n entries.
  3. The error counter of claim 2, wherein the first error address comprises the actual physical address of the storage unit in the memory where the error occurred; the control unit is configured to extract the row-level address from the actual physical address.
  4. The error counter of claim 3, wherein the memory is a double data rate synchronous dynamic random access memory; and the control unit is configured to extract, from the actual physical address, the bank group address, the memory grain sequence number, the bank address, and the row address of the storage unit where the error occurred in the double data rate synchronous dynamic random access memory, to form the row-level address.
  5. The error counter of claim 3, wherein the memory is a high bandwidth memory; and the control unit is configured to extract, from the actual physical address, the stack serial number, the bank address, and the row address of the storage unit where the error occurred in the high bandwidth memory, to form the row-level address.
  6. The error counter of any one of claims 2 to 5, characterized in that the control unit is configured to use the row-level address as the first row tag when the error counter is located in a memory controller, and to use the row-level address together with the memory controller number corresponding to the first error address as the first row tag when the error counter is located outside the memory controller.
  7. The error counter of any one of claims 1 to 5, characterized in that the control unit is further configured to send an interrupt request to a central processing unit when the error count recorded in an ith entry among the n entries reaches a threshold, where the interrupt request is used to trigger the central processing unit to perform memory row repair on the memory row corresponding to the ith entry, and i is a positive integer.
  8. The error counter of any one of claims 1 to 5, characterized in that the control unit is configured to receive the first error address sent by an error checking and correcting module located in a memory controller; or the control unit is configured to receive the first error address sent by an error checking and correcting module located in the memory.
  9. The error counter of any one of claims 1 to 5, characterized in that the control unit is configured to, when the first memory row corresponding to the first error address has no corresponding entry among the n entries and a free entry exists among the n entries, allocate the free entry as the first entry corresponding to the first memory row and initialize the error count stored in the first entry; and the control unit is configured to, when the first memory row corresponding to the first error address has no corresponding entry among the n entries and no free entry exists among the n entries, allocate the entry that has not been accessed for the longest time among the n entries as the first entry corresponding to the first memory row, and reset the error count stored in the first entry.
  10. The error counter of any one of claims 1 to 5, characterized in that the control unit is configured to keep the error count stored in a jth entry unchanged when that error count has reached a maximum value and an error address corresponding to the memory row of the jth entry is received, where the jth entry is any one of the n entries, and j is a positive integer.
  11. The error counter of any one of claims 2 to 5, characterized in that the n entries are further configured to receive a read operation from a central processing unit, where the central processing unit is configured to determine that the error mode of the memory is a column error mode when the magnitude of change between the row tag distributions obtained from multiple reads satisfies a change-magnitude condition.
  12. The error counter of any one of claims 1 to 5, wherein the n entries form m levels of entries, each level comprising all of the n entries and comprising at least two entry combinations, a kth-level entry combination comprising at least one (k+1)th-level entry combination, k and m each being a positive integer and k being not greater than m; the control unit is further configured to determine, among the first-level entries, a target first-level entry combination that contains the longest-unaccessed entry, and to repeatedly determine, within the target kth-level entry combination, the target entry combination of the next level, until the determined entry combination contains only the longest-unaccessed entry.
  13. The error counter of claim 12, wherein the m levels of entries correspond to m levels of age bits, each level of age bits corresponding to one level of entries; the control unit is further configured to determine, by reading the first-level age bits, the target first-level entry combination that contains the longest-unaccessed entry among the first-level entries, and to keep reading the age bits of the next level so as to determine, within the target kth-level entry combination, the target entry combination of the next level, until the determined entry combination contains only the longest-unaccessed entry.
  14. The error counter of claim 13, wherein each kth-level entry combination among the kth-level entries comprises two (k+1)th-level entry combinations, and the kth-level age bits comprise 2^(k-1) bits; the control unit is further configured to determine, by reading the bit in the first-level age bits, the target first-level entry combination that contains the longest-unaccessed entry from the two first-level entry combinations included in the first-level entries, and to keep reading, in the age bits of the next level, the bit corresponding to the target kth-level entry combination so as to determine, within the target kth-level entry combination, the target entry combination of the next level, until the determined entry combination contains only the longest-unaccessed entry.
  15. An error counting method, characterized in that the method is applied to an error counter, the error counter comprising a control unit and n entries, each entry being used to store the error count of a single memory row in a memory, n being a positive integer greater than one; the control unit receives a first error address, where the first error address includes at least a row-level address of the storage unit in the memory where an error occurred; the control unit, when the first memory row corresponding to the first error address has no corresponding entry among the n entries, allocates the entry that has not been accessed for the longest time among the n entries as a first entry corresponding to the first memory row, and resets the error count stored in the first entry; and the control unit increases the error count stored in the first entry when the first memory row corresponding to the first error address corresponds to the first entry among the n entries.
  16. The method of claim 15, wherein each entry is configured to store a row tag and the error count of the memory row corresponding to the row tag, the method further comprising: the control unit extracts the row-level address from the first error address and obtains a first row tag based on the row-level address; if the first row tag is not equal to the row tag of any of the n entries, the control unit determines that the first memory row has no corresponding entry among the n entries; and if the first row tag is equal to the row tag of one of the n entries, the control unit determines that the first memory row corresponds to the first entry among the n entries.
  17. The method of claim 15, wherein the method further comprises: the control unit sends an interrupt request to a central processing unit when the error count recorded in an ith entry among the n entries reaches a threshold, where the interrupt request is used to trigger the central processing unit to perform memory row repair on the memory row corresponding to the ith entry, and i is a positive integer.
  18. The method of claim 15, wherein the n entries form m levels of entries, each level comprising all of the n entries and comprising at least two entry combinations, a kth-level entry combination comprising at least one (k+1)th-level entry combination, k and m each being a positive integer and k being not greater than m, the method further comprising: the control unit determines, among the first-level entries, a target first-level entry combination that contains the longest-unaccessed entry, and repeatedly determines, within the target kth-level entry combination, the target entry combination of the next level, until the determined entry combination contains only the longest-unaccessed entry.
  19. A computing device, characterized in that the computing device comprises a memory controller and a memory, the memory controller comprises an error counter, the error counter comprises a control unit and n entries, each entry being used to store the error count of a single memory row in the memory, n being a positive integer greater than one; the control unit is configured to receive a first error address, where the first error address includes at least a row-level address of the storage unit in the memory where an error occurred; the control unit is further configured to, when the first memory row corresponding to the first error address has no corresponding entry among the n entries, allocate the entry that has not been accessed for the longest time among the n entries as a first entry corresponding to the first memory row, and reset the error count stored in the first entry; and the control unit is further configured to increase the error count stored in the first entry when the first memory row corresponding to the first error address corresponds to the first entry among the n entries.
  20. The computing device of claim 19, wherein the memory controller further comprises an error checking and correcting module, the error checking and correcting module being configured to send the first error address to the control unit when an error is detected in the memory; and/or the memory comprises an error checking and correcting module, the error checking and correcting module being configured to send the first error address to the control unit when an error is detected in the memory.
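
The multi-level age-bit lookup of claims 12 to 14 resembles a binary-tree pseudo-LRU scheme, and can be sketched as follows under that reading. The class name `TreeAgeBits`, the methods `victim` and `touch`, and the restriction to n = 2^m entries are assumptions for the example. The claims describe only how the age bits are read; the update rule in `touch` (point every bit on the path away from the entry just accessed) is the conventional tree pseudo-LRU rule and is likewise an assumption.

```python
class TreeAgeBits:
    """Binary-tree age bits over n = 2**m entries (level k holds 2**(k-1) bits)."""

    def __init__(self, m):
        if m < 1:
            raise ValueError("m must be a positive integer")
        self.m = m
        self.n = 1 << m
        # levels[k - 1] stores the age bits of level k; each bit selects which
        # of two next-level entry combinations holds the replacement candidate.
        self.levels = [[0] * (1 << k) for k in range(m)]

    def victim(self):
        """Walk from level 1 downward, reading one age bit per level."""
        idx = 0
        for bits in self.levels:
            idx = (idx << 1) | bits[idx]
        return idx  # index of the entry chosen for replacement

    def touch(self, entry):
        """Assumed update rule: point each bit on the path away from `entry`."""
        for level, bits in enumerate(self.levels):
            shift = self.m - level             # entry bits not yet consumed
            node = entry >> shift              # which bit of this level is on the path
            took = (entry >> (shift - 1)) & 1  # branch taken toward `entry`
            bits[node] = took ^ 1
```

Only m bits are read to pick a replacement candidate, instead of comparing n timestamps. In general a tree of this kind approximates least-recently-used order rather than tracking it exactly, so the candidate is not guaranteed to be the true longest-unaccessed entry for every access pattern.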

Description

Error counter, error counting method, computing device, and chip

Technical Field

The present application relates to the field of memory management, and in particular to an error counter, an error counting method, a computing device, and a chip.

Background

With the continuous development of memory technology, the row (Row) error mode has become a core mode of memory failure. The row error mode refers to a physical failure or stability problem in the storage units of one row of a memory, which causes data read and write abnormalities at multiple addresses in that row.

The related art provides a row error detection scheme based on the MCA (Machine Check Architecture) error log, in which firmware locates an abnormal memory row by reading the detailed error log recorded by the MCA and then decides whether to perform a row repair operation. In this scheme, however, the MCA stores the error information and sends an interrupt each time it detects an error, and the interrupt handler in the firmware then reads the error information from the MCA register. While that read is in progress, newly detected error information cannot be stored in the MCA register. In a high-frequency error scenario, the error log provided by the MCA therefore inevitably lacks a large amount of error information, and an abnormal memory row may be missed.

Disclosure of Invention

The application provides an error counter, an error counting method, a computing device, and a chip. The error counter can store the error counts of multiple memory rows simultaneously, so that while software reads the error count of one memory row in order to repair it, the error counts of other memory rows continue to be counted and row errors are not missed.
According to one aspect of the present application, there is provided an error counter, the error counter including a control unit and n entries, each entry being used to store the error count of a single memory row in a memory, n being a positive integer greater than one; the control unit is configured to receive a first error address, where the first error address includes at least a row-level address of the storage unit in the memory where an error occurred; the control unit is configured to, when the first memory row corresponding to the first error address has no corresponding entry among the n entries, allocate the entry that has not been accessed for the longest time among the n entries as a first entry corresponding to the first memory row, and reset the error count stored in the first entry; and the control unit is configured to increase the error count stored in the first entry when the first memory row corresponding to the first error address corresponds to the first entry among the n entries.
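
How a row-level address and row tag might be derived from a physical address, in the manner of claims 3, 4, and 6, can be sketched as follows. Every bit position in `LAYOUT` is hypothetical, as are the function names and the field name `chip` (standing in for the memory grain sequence number); real address maps depend on the memory type and the memory controller configuration.

```python
from typing import NamedTuple, Optional


class Field(NamedTuple):
    shift: int  # position of the field's lowest bit in the physical address
    width: int  # number of bits in the field


# Hypothetical DDR-style address map; bits 0..12 are assumed to be column bits.
LAYOUT = {
    "chip": Field(13, 1),
    "bank_group": Field(14, 2),
    "bank": Field(16, 2),
    "row": Field(18, 16),
}
ORDER = ("bank_group", "chip", "bank", "row")
ROW_LEVEL_BITS = sum(LAYOUT[name].width for name in ORDER)


def extract(addr: int, field: Field) -> int:
    """Return the value of one bit field of the address."""
    return (addr >> field.shift) & ((1 << field.width) - 1)


def row_level_address(phys_addr: int) -> int:
    """Concatenate the row-identifying fields, discarding the column bits."""
    value = 0
    for name in ORDER:
        field = LAYOUT[name]
        value = (value << field.width) | extract(phys_addr, field)
    return value


def row_tag(phys_addr: int, controller_id: Optional[int] = None) -> int:
    """Row tag: the row-level address, optionally prefixed by a controller number."""
    tag = row_level_address(phys_addr)
    if controller_id is not None:
        # Counter assumed to sit outside the memory controller, so rows that
        # belong to different controllers must not share a tag.
        tag |= controller_id << ROW_LEVEL_BITS
    return tag
```

Two addresses that differ only in their column bits yield the same tag, so their errors accumulate in one entry, while the same row behind different controllers yields different tags.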
According to one aspect of the present application, there is provided an error counting method applied to an error counter, the error counter including a control unit and n entries, each of the n entries being used to store the error count of a single memory row in a memory, n being a positive integer greater than one; the control unit receives a first error address, where the first error address includes at least a row-level address of the storage unit in the memory where an error occurred; the control unit, when the first memory row corresponding to the first error address has no corresponding entry among the n entries, allocates the entry that has not been accessed for the longest time among the n entries as a first entry corresponding to the first memory row, and resets the error count stored in the first entry; and the control unit increases the error count stored in the first entry when the first memory row corresponding to the first error address corresponds to the first entry among the n entries.
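
The method above, combined with the refinements stated in claims 7, 9, and 10 (threshold interrupt, preference for free entries, and a saturating count), can be sketched as one step function. The names `Entry`, `RowErrorCounter`, `record`, and `on_threshold` are assumptions for the example, as is the use of a software callback in place of an interrupt request. The sketch further assumes that the interrupt fires once, when the count first equals the threshold, and that a freshly allocated entry counts the error that triggered it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Entry:
    tag: Optional[int] = None  # None marks a free entry
    count: int = 0
    last_access: int = 0       # logical time of the most recent access


class RowErrorCounter:
    def __init__(self, n: int, threshold: int, max_count: int,
                 on_threshold: Callable[[int, int], None]):
        if n < 2:
            raise ValueError("n must be greater than one")
        self.entries = [Entry() for _ in range(n)]
        self.threshold = threshold
        self.max_count = max_count
        self.on_threshold = on_threshold  # stands in for the interrupt request
        self.clock = 0

    def _allocate(self, tag: int) -> Entry:
        free = [e for e in self.entries if e.tag is None]
        # Prefer a free entry; otherwise take the longest-unaccessed one.
        target = free[0] if free else min(self.entries, key=lambda e: e.last_access)
        target.tag, target.count = tag, 0  # initialize or reset the count
        return target

    def record(self, tag: int) -> int:
        """Process one error address that has been reduced to a row tag."""
        self.clock += 1
        entry = next((e for e in self.entries if e.tag == tag), None)
        if entry is None:
            entry = self._allocate(tag)
        entry.last_access = self.clock
        if entry.count < self.max_count:  # a saturated count stays unchanged
            entry.count += 1
            if entry.count == self.threshold:
                self.on_threshold(tag, entry.count)
        return entry.count
```

A caller would pass a handler that schedules row repair, for example `RowErrorCounter(8, threshold=16, max_count=255, on_threshold=handler)`, where the numbers are arbitrary illustration values.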
According to one aspect of the application, a computing device is provided, the computing device comprising a memory controller and a memory, the memory controller comprising an error counter and an error checking and correcting module, the error counter comprising a control unit and n entries, each entry being used to store the error count of a single memory row in the memory, n being a positive integer greater than one; the error checking and correcting module is configured to send a first error address to the control unit when an error is detected in the memory, where the first error address includes at least a row-level address of the storage unit in the memory where the error occurred; the control unit is configured to receive the first error address; the control unit is further configured to, when the first memory row corresponding to the first error address has no corresponding entry among the n entries, allocate the entry that has not been accessed for the longest time among the n entries as a first entry corresponding to the first memory row, and reset the error count stored in the first entry; the control unit is further configured to increase the error count stored in