CN-121979730-A - Chip fault repair evaluation method and electronic equipment

Abstract

The invention provides a chip fault repair evaluation method and electronic equipment, and relates to the technical field of computers. The method comprises: obtaining multi-dimensional hardware state values before and after an operation, and calculating the change in the hardware state value of each dimension across the operation; determining a comprehensive progress score from the per-dimension changes; determining a comprehensive reward from the comprehensive progress score; and adjusting the comprehensive reward according to the device repair difficulty to obtain a target reward that guides the direction of device repair. By quantifying the effect of each repair operation through the change in the multi-dimensional hardware state, and by dynamically adjusting the reward according to the device repair difficulty, the method lets the device distinguish which operations actually advance the repair, so that the repair process is guided effectively and repair speed and efficiency improve.

Inventors

  • Zheng Hanxun
  • Liu Youqun

Assignees

  • 中昊芯英(杭州)科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-04-07

Claims (10)

  1. A chip fault repair evaluation method, comprising: acquiring multi-dimensional hardware state values before and after an operation, and calculating, for each dimension, the change in the hardware state value before and after the operation; determining a comprehensive progress score according to the hardware state value change of each dimension; determining a comprehensive reward according to the comprehensive progress score; and adjusting the comprehensive reward according to the device repair difficulty to obtain a target reward for guiding the device repair direction.
  2. The chip fault repair evaluation method according to claim 1, wherein determining the comprehensive progress score according to the hardware state value change of each dimension comprises: determining a target weight for each dimension; and computing the weighted sum of the hardware state value changes of the dimensions, using the target weight of each dimension, to obtain the comprehensive progress score.
  3. The chip fault repair evaluation method according to claim 2, wherein determining the target weight for each dimension comprises: acquiring a device history repair record; determining a conditional success rate for each dimension according to the device history repair record, and normalizing the conditional success rates to obtain a first weight for each dimension, wherein the conditional success rate represents the probability of improving the hardware state value of a single dimension; determining, by regression analysis on the device history repair record, a correlation coefficient between the hardware state value change of each dimension and repair success, and normalizing the correlation coefficients to obtain a second weight for each dimension; fusing the first weight and the second weight of each dimension to obtain a third weight for each dimension; performing perturbation verification on the third weight of each dimension; and, if the verification passes, taking the third weight of each dimension as the target weight of that dimension.
  4. The chip fault repair evaluation method according to any one of claims 1 to 3, wherein determining the comprehensive reward according to the comprehensive progress score comprises: acquiring the type of the repair operation, and determining an operation cost according to the type of the repair operation; determining a base reward score according to the comprehensive progress score; and subtracting the operation cost from the base reward score to obtain the comprehensive reward.
  5. The chip fault repair evaluation method according to claim 4, wherein determining the base reward score according to the comprehensive progress score comprises: determining the base reward score from the comprehensive progress score using a first formula; the first formula is: […], where […] is the base reward score and […] is the comprehensive progress score.
  6. The chip fault repair evaluation method according to any one of claims 1 to 3, wherein adjusting the comprehensive reward according to the device repair difficulty to obtain the target reward comprises: looking up an expected progress score in a table according to the device repair difficulty; calculating the ratio of the comprehensive progress score to the expected progress score to obtain a relative repair performance; looking up a reward adjustment value in a table according to the relative repair performance; and summing the reward adjustment value and the comprehensive reward to obtain the target reward.
  7. The chip fault repair evaluation method according to claim 6, wherein the method further comprises: acquiring a device history repair record; and determining the device repair difficulty from the device history repair record by a statistical analysis method.
  8. The chip fault repair evaluation method according to claim 7, wherein the method further comprises: upon detecting three consecutive repair failures, a continuous decline in the target rewards of multiple repairs, or a comprehensive progress score not greater than 0, updating the device history repair record, executing the statistical analysis method, and determining the device repair difficulty from the updated device history repair record.
  9. The chip fault repair evaluation method according to any one of claims 1 to 3, wherein the multi-dimensional hardware state values include device visibility, PCIe link state, inter-ring initialization progress, inter-chip interconnect link synchronization, and error density.
  10. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the chip fault repair evaluation method according to any one of claims 1 to 9 when executing the computer program.
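
The weight determination of claim 3 can be illustrated with a minimal Python sketch. Several details are assumptions rather than part of the claims: the conditional success rate is read as the fraction of successful repairs among history records in which a dimension improved; a Pearson correlation (negative values clipped to zero) stands in for the unspecified regression analysis; the fusion is a convex combination with a hypothetical coefficient `alpha`; and the perturbation verification step is omitted because the claims do not describe it. The function names are illustrative only.

```python
from math import sqrt

def normalize(values):
    """Scale non-negative values so they sum to 1 (uniform if all are zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def fused_weights(deltas, success, alpha=0.5):
    """Fuse two normalized weight vectors, one entry per dimension.

    deltas:  one row per history record, one state value change per dimension.
    success: 1 if the repair in that record succeeded, else 0.
    alpha:   assumed fusion coefficient (not specified in the claims).
    """
    columns = list(zip(*deltas))

    # First weight: assumed reading of the conditional success rate,
    # P(repair succeeded | this dimension improved).
    rates = []
    for col in columns:
        outcomes = [s for d, s in zip(col, success) if d > 0]
        rates.append(sum(outcomes) / len(outcomes) if outcomes else 0.0)
    first = normalize(rates)

    # Second weight: correlation between change and success, a stand-in
    # for the regression analysis; negative correlations are clipped.
    second = normalize([max(pearson(col, success), 0.0) for col in columns])

    # Third weight: convex combination of the first and second weights.
    return [alpha * a + (1 - alpha) * b for a, b in zip(first, second)]

# Hypothetical history: four records, two dimensions.
history_deltas = [[1, 0], [1, 1], [0, 1], [0, 0]]
history_success = [1, 1, 0, 0]
weights = fused_weights(history_deltas, history_success)
```

Because both inputs are normalized and the fusion is a convex combination, the fused weights also sum to 1, which keeps the weighted sum of claim 2 on the same scale as the individual state value changes.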

Description

Chip fault repair evaluation method and electronic equipment

Technical Field

The present invention relates to the field of computer technologies, and in particular to a chip fault repair evaluation method and an electronic device.

Background

With the explosive growth of the artificial intelligence industry, large-scale AI (Artificial Intelligence) computing chip clusters have become the infrastructure supporting core scenarios such as large-model training and high-concurrency inference. In ultra-large-scale cluster deployments at the thousand-card and ten-thousand-card level, the inter-chip interconnect links carry the communication and scheduling tasks of the massive number of computing units in the cluster, and their stable operation is a key prerequisite for the overall performance of the cluster.

As cluster scale expands exponentially, the failure modes of inter-chip interconnect links show high complexity and strong interdependence. An abnormality on a single link may trigger cascading multi-node faults through topological association, and faults arise from multiple factors such as hardware aging, signal interference, and thermal stress fluctuation, which places very high demands on automatic fault repair technology.

In the prior art, inter-chip interconnect link faults are generally repaired with a result-oriented reinforcement learning scheme, represented by the PPO (Proximal Policy Optimization) algorithm. However, because the algorithm cannot perceive the intermediate state of the hardware, the reinforcement learning agent needs a large number of blind exploration steps to locate the fault, which results in an excessively long average repair time.
Moreover, without accurate perception of the hardware state, the agent frequently executes ineffective global restart operations; the many redundant operations further prolong the repair time and seriously reduce repair efficiency.

Disclosure of Invention

Embodiments of the present invention provide a chip fault repair evaluation method and an electronic device, to address the long repair time and low repair efficiency described above.

In a first aspect, an embodiment of the present invention provides a chip fault repair evaluation method, including: acquiring multi-dimensional hardware state values before and after an operation, and calculating, for each dimension, the change in the hardware state value before and after the operation; determining a comprehensive progress score according to the hardware state value change of each dimension; determining a comprehensive reward according to the comprehensive progress score; and adjusting the comprehensive reward according to the device repair difficulty to obtain a target reward for guiding the device repair direction.

In a second aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor, when executing the computer program, implements the chip fault repair evaluation method according to the first aspect or any one of the possible implementations of the first aspect.

The embodiments of the present application provide a chip fault repair evaluation method and an electronic device.
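
The four steps of the first aspect can be sketched in Python as follows, using the refinements of claims 2, 4 and 6. This is only an illustration under stated assumptions: state values are taken to be normalized so that a higher value means a healthier device; the base reward defaults to the progress score itself because the first formula is not reproduced in this text; and every weight, operation cost, difficulty level and table entry below is a hypothetical value, not one from the patent.

```python
from bisect import bisect_right

def progress_score(before, after, weights):
    """Weighted sum of per-dimension state value changes (claim 2)."""
    return sum(w * (after[dim] - before[dim]) for dim, w in weights.items())

def comprehensive_reward(score, op_type, op_costs, base_reward=lambda s: s):
    """Base reward minus the cost of the repair operation (claim 4).

    base_reward is an identity placeholder: the patent's first formula
    is not available here, so its real shape is unknown.
    """
    return base_reward(score) - op_costs[op_type]

def target_reward(score, reward, difficulty, expected_by_difficulty, adjust_table):
    """Adjust the reward by difficulty through two table lookups (claim 6).

    adjust_table is a list of (lower_bound, adjustment) pairs sorted by
    lower_bound; the row with the largest bound <= relative performance wins.
    """
    expected = expected_by_difficulty[difficulty]
    relative = score / expected  # relative repair performance
    bounds = [bound for bound, _ in adjust_table]
    row = max(bisect_right(bounds, relative) - 1, 0)
    return reward + adjust_table[row][1]

# Hypothetical example values; dimension names follow claim 9.
weights = {"device_visibility": 0.3, "pcie_link_state": 0.3, "link_sync": 0.4}
before = {"device_visibility": 0.0, "pcie_link_state": 0.5, "link_sync": 0.0}
after = {"device_visibility": 1.0, "pcie_link_state": 0.5, "link_sync": 0.5}
op_costs = {"link_retrain": 0.1, "global_restart": 0.5}
expected_by_difficulty = {"easy": 0.8, "hard": 0.25}
adjust_table = [(0.0, -0.2), (0.5, 0.0), (1.5, 0.2)]

score = progress_score(before, after, weights)
reward = comprehensive_reward(score, "link_retrain", op_costs)
final = target_reward(score, reward, "hard", expected_by_difficulty, adjust_table)
```

In this sketch a cheap operation that makes more progress than expected on a hard device earns a bonus, while an expensive global restart with the same progress would start from a lower comprehensive reward, which is how the reward discourages redundant restarts.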
The chip fault repair evaluation method comprises: obtaining multi-dimensional hardware state values before and after an operation, and calculating the change in the hardware state value of each dimension before and after the operation; determining a comprehensive progress score according to the hardware state value change of each dimension; determining a comprehensive reward according to the comprehensive progress score; and adjusting the comprehensive reward according to the device repair difficulty to obtain a target reward for guiding the device repair direction. By quantifying the effect of each repair operation through the change in the multi-dimensional hardware state, and by dynamically adjusting the reward according to the device repair difficulty to generate the target reward, the method lets the device distinguish which operations actually advance the repair, so that the repair process is guided effectively and repair speed and efficiency improve.

Drawings

FIG. 1 is a flowchart of an implementation of a chip fault repair evaluation method according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a chip fault repair evaluation device according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Referring to FIG. 1, a flow