CN-120045274-B - Fault processing method and device, electronic equipment and computer readable storage medium

Abstract

The disclosure provides a fault processing method and apparatus, an electronic device, and a computer-readable storage medium, and relates to the technical field of data processing, in particular to cloud computing, virtual machines, and computer hardware faults. The method comprises: obtaining memory fault information reported by faulty memory hardware, wherein the memory fault information comprises a faulty host physical address of the host machine; determining, through reverse mapping, the user-mode process corresponding to the faulty host physical address; when the user-mode process is a virtual machine emulator process, obtaining the faulty host virtual address corresponding to the faulty host physical address according to address information of a virtual memory area; and sending the faulty host virtual address to the user-mode process, so that the user-mode process, according to the faulty host virtual address, instructs the faulty virtual machine running on the host machine to isolate the memory hardware.
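
The address translation step summarized above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Vma` and `PageInfo` records, the `rmap` dictionary, the `emulator_names` set, and the 4 KiB page size are all hypothetical stand-ins, and the offset arithmetic is only one plausible reading of "according to address information of a virtual memory area".

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

PAGE_SHIFT = 12               # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT


@dataclass(frozen=True)
class Vma:
    """Hypothetical, simplified model of a virtual memory area."""
    pid: int       # owning user-mode process
    comm: str      # process name, used here to recognize an emulator
    vm_start: int  # first host virtual address covered by the area
    vm_end: int    # one past the last covered host virtual address
    pgoff: int     # page offset of vm_start within the backing object


@dataclass(frozen=True)
class PageInfo:
    """Hypothetical reverse-mapping record for one physical page frame."""
    vma: Vma
    index: int     # page offset of this frame within the backing object


def hpa_to_hva(hpa: int, rmap: dict[int, PageInfo],
               emulator_names: set[str]) -> Optional[tuple[int, int]]:
    """Return (pid, faulty host virtual address), or None when the page
    is not mapped by a virtual machine emulator process."""
    info = rmap.get(hpa >> PAGE_SHIFT)          # reverse mapping: frame -> owner
    if info is None or info.vma.comm not in emulator_names:
        return None
    vma = info.vma
    page_offset = (info.index - vma.pgoff) << PAGE_SHIFT
    hva = vma.vm_start + page_offset + (hpa & (PAGE_SIZE - 1))
    if not (vma.vm_start <= hva < vma.vm_end):  # stale or inconsistent record
        return None
    return vma.pid, hva
```

In this sketch the caller would forward the returned address to the process with the returned PID; how that delivery happens is outside what the sketch models.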

Inventors

  • CHU KAIPING
  • WANG LIANG
  • YING RU
  • ZHENG RAN
  • MENG XIANJUN

Assignees

  • Beijing Baidu Netcom Science and Technology Co., Ltd. (北京百度网讯科技有限公司)

Dates

Publication Date
2026-05-12
Application Date
2024-12-20

Claims (12)

  1. A fault handling method, comprising: obtaining memory fault information that faulty memory hardware reports to a machine check exception driver of a host, wherein the memory fault information comprises a faulty host physical address of the faulty memory hardware and a severity level of the memory fault; invoking a reverse mapping function of the host operating system kernel to determine, from the faulty host physical address, the user-mode process corresponding to the faulty host physical address; when the user-mode process is a virtual machine emulator process, obtaining a faulty host virtual address corresponding to the faulty host physical address according to address information of a virtual memory area; and sending the faulty host virtual address to the user-mode process, so that the user-mode process, according to the faulty host virtual address and the severity level of the memory fault, instructs the faulty virtual machine running on the host to act; wherein the faulty virtual machine is configured to evaluate the received severity level using a fault-judging component or daemon corresponding to the type of its operating system, and to isolate the memory hardware differentially based on the result; and wherein isolating the memory hardware differentially comprises: when the memory fault is a fault for which action is optional or a fault for which action is required, calling a hard-offline function of the faulty virtual machine to isolate the memory hardware and killing the virtual machine processes that use the faulty memory hardware; and when the memory fault is a correctable error, accumulating the number of occurrences of the memory fault and, when that number reaches a preset threshold, calling a soft-offline function of the faulty virtual machine to isolate the memory hardware.
  2. The method of claim 1, further comprising: storing the correspondence between the faulty host physical address and the faulty process identifier as historical fault information; wherein the faulty process identifier is the process ID (PID) of the user-mode process.
  3. The method of claim 2, further comprising: obtaining an operation identifier corresponding to a virtual machine operation on one or more virtual machines running on the host; searching the historical fault information based on the virtual machine operation and the operation identifier; and, based on the search result, modifying the historical fault information and/or instructing, via the operation identifier, the virtual machine targeted by the operation to isolate the memory hardware; wherein the operation identifier is the process ID (PID) of the virtual machine emulator process corresponding to the virtual machine on which the virtual machine operation is performed.
  4. The method of claim 3, wherein searching the historical fault information based on the virtual machine operation and the operation identifier comprises: when the virtual machine operation is a hot restart, searching the historical fault information based on the operation identifier; and wherein modifying the historical fault information and/or instructing, via the operation identifier, the virtual machine targeted by the operation to isolate the memory hardware based on the search result comprises: when a faulty process identifier matching the operation identifier is found in the historical fault information, obtaining the faulty host virtual address corresponding to the faulty host physical address, and sending the faulty host virtual address to the virtual machine emulator process corresponding to the operation identifier.
  5. The method of claim 3, wherein searching the historical fault information based on the virtual machine operation and the operation identifier comprises: when the virtual machine operation is deleting a virtual machine, searching the historical fault information based on the operation identifier; and wherein modifying the historical fault information and/or instructing, via the operation identifier, the virtual machine targeted by the operation to isolate the memory hardware based on the search result comprises: when a faulty process identifier matching the operation identifier is found in the historical fault information, setting that faulty process identifier in the historical fault information to an invalid value.
  6. The method of claim 5, wherein searching the historical fault information based on the virtual machine operation and the operation identifier comprises: under the condition that the virtual machine operates as a virtual machine, obtaining the faulty host physical address corresponding to a faulty process identifier that has the invalid value in the historical fault information; and wherein modifying the historical fault information and/or instructing, via the operation identifier, the virtual machine targeted by the operation to isolate the memory hardware based on the search result comprises: determining, through reverse mapping from the faulty host physical address so obtained, the faulty process ID of the user-mode process corresponding to that address; and when that faulty process ID matches the operation identifier, changing the faulty process identifier corresponding to that faulty host physical address in the historical fault information to the operation identifier.
  7. The method of claim 2, further comprising: upon detecting that the host has restarted, and before the virtual machines running on the host are restarted, setting the faulty process identifiers in the historical fault information to an invalid value; after the virtual machines running on the host have been restarted, determining, through reverse mapping from each faulty host physical address in the historical fault information, the user-mode process corresponding to that address; and when the user-mode process is a virtual machine emulator process, changing the faulty process identifier corresponding to that faulty host physical address in the historical fault information to the process ID (PID) of the user-mode process.
  8. The method of claim 2, further comprising: when the memory hardware fault has been cleared, deleting the fault information corresponding to the faulty host physical address of the memory hardware in the host.
  9. A fault handling apparatus, comprising: an information collection module configured to obtain memory fault information that faulty memory hardware reports to a machine check exception driver of a host, wherein the memory fault information comprises the faulty host physical address of the faulty memory hardware and a severity level of the memory fault; a reverse mapping module configured to invoke a reverse mapping function of the host operating system kernel and to determine, from the faulty host physical address, the user-mode process corresponding to the faulty host physical address; and a fault injection module configured to send the faulty host virtual address to the user-mode process, so that the user-mode process, according to the faulty host virtual address and the severity level of the memory fault, instructs the faulty virtual machine running on the host to act; wherein the faulty virtual machine is configured to evaluate the received severity level using a fault-judging component or daemon corresponding to the type of its operating system, and to isolate the memory hardware differentially based on the result, and is specifically configured to: when the severity level indicates a fault for which action is optional or a fault for which action is required, call a hard-offline function of the faulty virtual machine to isolate the memory hardware and kill the virtual machine processes that use the faulty memory hardware; and when the severity level indicates a correctable error, accumulate the number of occurrences of the memory fault and, when that number reaches a preset threshold, call a soft-offline function of the faulty virtual machine to isolate the memory hardware.
  10. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
  11. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-8.
  12. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-8.
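
The differential isolation recited in claim 1 can be pictured with a minimal Python sketch of the decision logic inside the guest. Everything named here is hypothetical: the `Severity` values, the `GuestIsolationPolicy` class, the action strings, and the default threshold of 3 are illustrative only. Counting per guest page frame and resetting the counter after a soft offline are assumptions; the claims specify only that occurrences are accumulated until a preset threshold is reached.

```python
from collections import defaultdict
from enum import Enum


class Severity(Enum):
    """Hypothetical labels for the fault levels described in claim 1."""
    CORRECTABLE = "correctable"          # correctable error
    ACTION_OPTIONAL = "action_optional"  # fault for which action is optional
    ACTION_REQUIRED = "action_required"  # fault for which action is required


class GuestIsolationPolicy:
    """Decides which isolation actions a guest takes for a reported fault."""

    def __init__(self, ce_threshold: int = 3):
        self.ce_threshold = ce_threshold    # assumed value, not from the source
        self.ce_counts = defaultdict(int)   # guest page frame -> error count

    def handle(self, guest_pfn: int, severity: Severity) -> list:
        """Return the list of actions to perform, as plain strings."""
        if severity in (Severity.ACTION_OPTIONAL, Severity.ACTION_REQUIRED):
            # Uncorrected fault: isolate immediately and stop the users.
            return ["hard_offline", "kill_users"]
        # Correctable error: accumulate, isolate only at the threshold.
        self.ce_counts[guest_pfn] += 1
        if self.ce_counts[guest_pfn] >= self.ce_threshold:
            del self.ce_counts[guest_pfn]   # assumption: reset after isolating
            return ["soft_offline"]
        return []
```

With a threshold of 2, two correctable errors on the same page frame yield no action and then a soft offline, while a single action-required fault yields a hard offline plus process termination at once.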

Description

Fault processing method and device, electronic equipment and computer readable storage medium

Technical Field

The disclosure relates to the technical field of data processing, in particular to cloud computing, virtual machines, and computer hardware faults. More specifically, embodiments of the present disclosure relate to a fault handling method and apparatus, an electronic device, and a computer-readable storage medium.

Background

In cloud computing, virtualization technology splits a large server into multiple small virtual machines for use by multiple clients. These virtual machines may contain VFIO devices (VFIO is a device pass-through technology of the Linux operating system). To support DMA (Direct Memory Access) by VFIO devices in a virtual machine, the HPA (Host Physical Address) memory corresponding to each GPA (Guest Physical Address, the virtual machine's physical address) that participates in DMA transfers must be pinned (locked), so that the host does not swap that memory out to a swap partition or move it during memory compaction.

Disclosure of Invention

The disclosure provides a fault processing method and device, electronic equipment and a computer readable storage medium.
According to a first aspect of the present disclosure, there is provided a fault handling method, the method comprising: obtaining memory fault information reported by faulty memory hardware, wherein the memory fault information comprises a faulty host physical address of the faulty memory hardware; determining, through reverse mapping, the user-mode process corresponding to the faulty host physical address; when the user-mode process is a virtual machine emulator process, obtaining the faulty host virtual address corresponding to the faulty host physical address according to address information of a virtual memory area; and sending the faulty host virtual address to the user-mode process, so that the user-mode process, according to the faulty host virtual address, instructs the faulty virtual machine running on the host to isolate the memory hardware.

According to a second aspect of the present disclosure, there is provided a fault handling apparatus, the apparatus comprising: an information collection module configured to obtain memory fault information reported by faulty memory hardware, wherein the memory fault information comprises a faulty host physical address of the faulty memory hardware; a reverse mapping module configured to determine, through reverse mapping, the user-mode process corresponding to the faulty host physical address and, when the user-mode process is a virtual machine emulator process, to obtain the faulty host virtual address corresponding to the faulty host physical address according to address information of a virtual memory area; and a fault injection module configured to send the faulty host virtual address to the user-mode process, so that the user-mode process, according to the faulty host virtual address, instructs the faulty virtual machine running on the host to isolate the memory hardware.

According to a third aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above fault handling method.

According to a fourth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the above fault handling method.

According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the above fault handling method.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

The drawings are provided for a better understanding of the present solution and are not to be construed as limiting the present disclosure. In the drawings:

FIG. 1 is a schematic flow chart of a fault handling method according to an embodiment of the disclosure;
FIG. 2 illustrates an overall architecture diagram of a host and virtual machine in an embodiment of the present disclosure;
FIG. 3 is a flow chart illustrating partial steps of another fault handling method provided by an