CN-121166522-B - Software fault diagnosis method for electronic control unit
Abstract
The application relates to a software fault diagnosis method for an electronic control unit. The method comprises: monitoring the running state of software in the electronic control unit with a watchdog timer; when the software fails to reset the watchdog timer within a preset timeout period, generating a timeout event and triggering a non-maskable interrupt associated with that event; executing the corresponding interrupt service routine to capture the software context information at the moment of the fault; storing that information in a preset memory area whose content is retained when the electronic control unit is reset; and, after the reset, reading the stored software context information through a communication interface. The application automatically and reliably captures high-value, low-level software context information at the moment a fault occurs, without physically connecting a debugger and without having to reproduce the fault.
Inventors
- WEI QING
- DENG MAO
- SUN ZHIYONG
Assignees
- 上海拓殷电子科技技术有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20250911
Claims (8)
- 1. A software fault diagnosis method for an electronic control unit, comprising the steps of: S1, monitoring the running state of software in the electronic control unit based on a watchdog timer, the watchdog timer generating a timeout event when the software fails to reset it within a preset timeout period; S2, triggering, based on the timeout event, a non-maskable interrupt associated with the timeout event; S3, in response to the non-maskable interrupt, executing a corresponding interrupt service routine to capture software context information at the moment of the fault; S4, storing the software context information into a preset memory area whose content is retained after the electronic control unit is reset; S5, after the electronic control unit is reset, reading the software context information stored in the preset memory area through a communication interface and performing diagnostic analysis of the software fault; wherein step S3 comprises the following sub-steps: S31, traversing a CSA linked list generated automatically by the processor hardware so as to capture a function call history, wherein CSA denotes a context save area and PCXI denotes the previous context information, the linked list is composed of a plurality of CSA nodes, each CSA node is linked to the previous CSA node through a PCXI pointer, and the traversal starts from the current context information pointer and backtracks through the PCXI pointer stored in each CSA node until the end of the linked list is reached; S32, copying the content of the corresponding stack memory area, based on the stack pointer register value at the moment of the fault and predefined stack boundary information, so as to capture the process stack memory; and wherein step S4 comprises the following sub-steps: S41, constructing a data structure comprising a predefined data header, wherein the data header encapsulates the software context information and comprises at least a status flag bit indicating the validity of the data and a check code for verifying the integrity of the data; S42, filling the captured software context information into a designated area of the data structure, calculating a check code over the content of the whole data structure, and writing the check code into the check code field of the data header; S43, writing the whole content of the data structure into the preset memory area in a single operation; S44, after the data writing is completed, setting the status flag bit of the data header to valid so as to mark that the fault information has been stored successfully.
- 2. The software fault diagnosis method for an electronic control unit according to claim 1, wherein the software context information includes a function call history, which records the software execution path before the fault occurs, and a process stack memory, which holds the local variables, parameters and return addresses of the function calls at the time of the fault.
- 3. The software fault diagnosis method for an electronic control unit according to claim 2, wherein each CSA node is a data structure defined by the processor hardware, the data structure comprising at least: a PCXI pointer for linking to the previous CSA node; a register field storing the return address of the function, which determines where in the code the current function returns once it completes; a stack pointer register field storing the top of the current stack frame, which locates the memory region holding the local variables and parameters of the function call; and a plurality of general-purpose register fields storing the operating state of the processor core, which hold the intermediate values of data computation and processing at the time of the fault.
- 4. The software fault diagnosis method for an electronic control unit according to claim 3, characterized in that S31 comprises the sub-steps of: S311, acquiring the entry pointer of the current context that the processor hardware saves automatically when the non-maskable interrupt is triggered, the entry pointer pointing to the latest CSA node; S312, based on the entry pointer, reading and recording the software context contained in the latest CSA node as the first record of the function call history; S313, extracting the PCXI pointer stored in the CSA node currently being read; S314, judging whether the PCXI pointer points to a valid preceding CSA node address and, if so, jumping to the preceding CSA node pointed to by the PCXI pointer and repeating steps S312 and S313, otherwise ending the traversal; S315, combining all recorded software contexts in the reverse order of the backtracking traversal, so as to reconstruct the complete function call history from the initial function call to the point at which the fault occurred.
- 5. The software fault diagnosis method for an electronic control unit according to claim 1, wherein S2 comprises the sub-steps of: S21, in the initialization stage of the electronic control unit, configuring the interrupt system to set the timeout event of the watchdog timer as a trap request source; S22, routing the service request of the trap request source to the non-maskable interrupt vector of the electronic control unit; S23, enabling the trap corresponding to the non-maskable interrupt, so that whatever task is currently executing can be forcibly interrupted when the timeout event occurs and execution jumps to a preset non-maskable interrupt service routine entry address.
- 6. The software fault diagnosis method for an electronic control unit according to claim 1, wherein S5 comprises: in response to a specific data read request sent by an external diagnostic device over a diagnostic communication protocol, returning the software context information stored in the preset memory area as response data.
- 7. The software fault diagnosis method for an electronic control unit according to claim 1, wherein the preset memory area is configured as a non-initialized RAM area that is skipped by the initialization performed during system start-up.
- 8. The software fault diagnosis method for an electronic control unit according to claim 1, characterized in that the non-maskable interrupt can still be triggered while global interrupts are disabled or while another interrupt service routine is executing.
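Claims 5 and 8 depend on one rule: a request routed to the non-maskable interrupt is serviced even while global interrupts are disabled or another service routine is running, whereas an ordinary request must wait. The host-side C model below is a minimal sketch of that rule only. All names (`IrqSource`, `CpuState`, `route_to_nmi`, `raise_irq`, `capture_isr_stub`) are hypothetical, priority nesting is not modelled, and on real hardware the routing of S21 to S23 is done through device-specific watchdog, trap and interrupt-router registers that are not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*Isr)(void);

/* Hypothetical model of one interrupt/trap request source. */
typedef struct {
    bool nmi;      /* S22: routed to the non-maskable vector */
    bool pending;  /* request latched but not yet serviced */
    Isr handler;   /* service routine entry address */
} IrqSource;

/* Hypothetical CPU masking state (no priority nesting is modelled). */
typedef struct {
    bool global_enable;  /* false inside critical sections */
    bool in_isr;         /* true while a maskable ISR is running */
} CpuState;

/* Stand-in for the fault-capture ISR; it only counts invocations. */
static unsigned g_capture_count;
static void capture_isr_stub(void) { ++g_capture_count; }

/* S21-S23: at initialization, route a source (e.g. the watchdog timeout) to the NMI. */
void route_to_nmi(IrqSource *s, Isr handler) {
    assert(s != NULL && handler != NULL);
    s->nmi = true;
    s->handler = handler;
}

/* Raise a request. Returns true if its handler ran immediately. */
bool raise_irq(const CpuState *cpu, IrqSource *s) {
    assert(cpu != NULL && s != NULL);
    s->pending = true;
    bool masked = !s->nmi && (!cpu->global_enable || cpu->in_isr);
    if (masked || s->handler == NULL) return false;  /* maskable request waits */
    s->pending = false;
    s->handler();  /* claim 8: an NMI is taken even while masking is active */
    return true;
}
```

This is why the method relies on an NMI: a task stuck in a critical section with interrupts disabled would otherwise prevent the capture routine from ever running.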
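The CSA traversal of S31, with the node fields of claim 3 and the sub-steps of claim 4, can be sketched in host-side C as below. CSA and PCXI are terms from Infineon's TriCore architecture, but the `CsaNode` layout here is a simplified, hypothetical stand-in for the hardware-defined context save area, and links are modelled as indices into an array, with 0 marking the end of the list. On a real core, each PCXI value would first have to be decoded into a CSA address as specified in the architecture manual. The `MAX_CALL_DEPTH` limit is an added assumption so that a corrupted or cyclic list cannot trap the service routine in an endless loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CALL_DEPTH 32u  /* hypothetical bound against corrupted or cyclic lists */

/* Simplified, hypothetical CSA node; real CSAs are hardware-defined. */
typedef struct {
    uint32_t pcxi;  /* link to the previous CSA node; 0 marks the end of the list */
    uint32_t ra;    /* return address of the function */
    uint32_t sp;    /* stack pointer value for this call frame */
    uint32_t d[4];  /* sample of general-purpose registers */
} CsaNode;

/* Resolve a link to a node. Here a link is an index into pool (index 0 unused);
 * on real hardware it would be decoded from the PCXI segment/offset fields. */
static const CsaNode *csa_from_link(const CsaNode *pool, size_t pool_len, uint32_t link) {
    if (link == 0 || link >= pool_len) return NULL;  /* S314: no valid preceding node */
    return &pool[link];
}

/* S311-S315: walk from the newest CSA back to the oldest, then reverse the
 * records so history[0] is the initial call and the last entry is the fault point.
 * Returns the number of records written. */
size_t capture_call_history(const CsaNode *pool, size_t pool_len, uint32_t entry_link,
                            CsaNode *history, size_t capacity) {
    assert(pool != NULL && history != NULL);
    size_t n = 0;
    const CsaNode *node = csa_from_link(pool, pool_len, entry_link);  /* S311 */
    while (node != NULL && n < capacity && n < MAX_CALL_DEPTH) {
        history[n++] = *node;                                         /* S312 */
        node = csa_from_link(pool, pool_len, node->pcxi);             /* S313, S314 */
    }
    for (size_t i = 0; i < n / 2; ++i) {                               /* S315 */
        CsaNode tmp = history[i];
        history[i] = history[n - 1 - i];
        history[n - 1 - i] = tmp;
    }
    return n;
}
```

Reversing the buffer in place at the end produces the order S315 describes, from the initial function call to the fault point.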
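Sub-step S32 can be sketched as a bounded copy. This minimal example assumes a descending stack, models target addresses as offsets into a host byte array, and represents the predefined boundary as a hypothetical `StackBounds` pair. Copying starts at the captured stack pointer, so if the buffer is too small the most recent frames, those closest to the fault, are kept. Rejecting a stack pointer outside the bounds is an added safeguard so that a corrupted register cannot make the handler read arbitrary memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical predefined stack boundaries (addresses are offsets into mem). */
typedef struct {
    uintptr_t base;  /* lowest valid stack address */
    uintptr_t top;   /* one past the highest stack address (initial SP of a descending stack) */
} StackBounds;

/* S32: copy the live part of a descending stack, [sp, top), into out.
 * At most cap bytes are copied, starting at sp (the most recent frames).
 * Returns the number of bytes copied; 0 if the stack is empty or sp is out of bounds. */
size_t capture_stack(const uint8_t *mem, StackBounds b, uintptr_t sp,
                     uint8_t *out, size_t cap) {
    assert(mem != NULL && out != NULL && b.base <= b.top);
    if (sp < b.base || sp >= b.top) return 0;  /* corrupted SP or empty stack */
    size_t live = (size_t)(b.top - sp);
    size_t n = live < cap ? live : cap;
    memcpy(out, mem + sp, n);
    return n;
}
```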
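Sub-steps S41 to S44, together with the non-initialized RAM area of claim 7, can be sketched as below. The record layout, the `FAULT_VALID` pattern, the payload size and the choice of CRC-32 (reflected polynomial 0xEDB88320) as the check code are all illustrative assumptions; the check code here covers the payload, since the status flag and the check-code field cannot cover themselves. The static `g_noinit_record` only stands in for a no-init RAM region: on a real target it would be placed, with toolchain-specific attributes or linker directives, in a section that start-up code neither zeroes nor initializes. The point of the sketch is the ordering: the old record is invalidated, the new one is written in one pass, and the valid flag is set last, so a write interrupted midway is never read back as valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FAULT_VALID   0xA5A5C3C3u  /* hypothetical "record valid" pattern */
#define FAULT_PAYLOAD 256u         /* hypothetical payload capacity in bytes */

typedef struct {
    uint32_t status;  /* S44: FAULT_VALID only after the record is complete */
    uint32_t crc;     /* S42: check code over the payload */
    uint32_t length;  /* number of payload bytes in use */
} FaultHeader;

typedef struct {
    FaultHeader hdr;
    uint8_t payload[FAULT_PAYLOAD];  /* call history, stack copy, ... */
} FaultRecord;

/* Stand-in for a no-init RAM section that start-up code does not clear (claim 7). */
static FaultRecord g_noinit_record;

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320). */
static uint32_t crc32(const uint8_t *p, size_t n) {
    uint32_t c = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; ++i) {
        c ^= p[i];
        for (int k = 0; k < 8; ++k)
            c = (c & 1u) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
    }
    return ~c;
}

/* S41-S44: build the record locally, commit it in one pass, then mark it valid. */
void store_fault_record(const uint8_t *ctx, size_t len) {
    assert(ctx != NULL && len <= FAULT_PAYLOAD);
    FaultRecord rec;
    memset(&rec, 0, sizeof rec);               /* S41: header + payload, status invalid */
    rec.hdr.length = (uint32_t)len;
    memcpy(rec.payload, ctx, len);             /* S42: fill the payload ... */
    rec.hdr.crc = crc32(rec.payload, len);     /* ... and record its check code */
    g_noinit_record.hdr.status = 0;            /* invalidate any previous record */
    memcpy(&g_noinit_record, &rec, sizeof rec);/* S43: single write of the whole record */
    g_noinit_record.hdr.status = FAULT_VALID;  /* S44: flag set last */
}

/* After reset: accept the record only if the flag and the check code agree. */
bool load_fault_record(uint8_t *out, size_t cap, size_t *len) {
    const FaultRecord *r = &g_noinit_record;
    if (r->hdr.status != FAULT_VALID || r->hdr.length > FAULT_PAYLOAD) return false;
    if (crc32(r->payload, r->hdr.length) != r->hdr.crc) return false;
    if (r->hdr.length > cap) return false;
    memcpy(out, r->payload, r->hdr.length);
    *len = r->hdr.length;
    return true;
}
```

On real hardware the final flag write would also need `volatile` access or a memory barrier so that neither the compiler nor the bus reorders it ahead of the payload.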
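Claim 6 does not name the diagnostic protocol. One plausible mapping, assumed here and not stated in the source, is the UDS ReadDataByIdentifier service of ISO 14229-1: request service ID 0x22 followed by a two-byte data identifier, positive response 0x62 echoing the identifier followed by the data, and negative response 0x7F with a response code. The identifier `DID_FAULT_CONTEXT` (0xFD01) is hypothetical, only single-identifier requests are handled, and transport segmentation, diagnostic sessions and security access are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SID_READ_DATA_BY_ID   0x22u
#define SID_NEGATIVE_RESPONSE 0x7Fu
#define NRC_INCORRECT_LENGTH  0x13u   /* incorrectMessageLengthOrInvalidFormat */
#define NRC_RESPONSE_TOO_LONG 0x14u   /* responseTooLong */
#define NRC_CONDITIONS_NOT_OK 0x22u   /* conditionsNotCorrect */
#define NRC_OUT_OF_RANGE      0x31u   /* requestOutOfRange */
#define DID_FAULT_CONTEXT     0xFD01u /* hypothetical identifier for the fault record */

static size_t negative(uint8_t *rsp, uint8_t nrc) {
    rsp[0] = SID_NEGATIVE_RESPONSE;
    rsp[1] = SID_READ_DATA_BY_ID;
    rsp[2] = nrc;
    return 3;
}

/* Handle a single-DID ReadDataByIdentifier request for the fault record.
 * record/rec_len: validated record bytes, or record == NULL if none is stored.
 * Returns the number of response bytes written to rsp (rsp_cap must be >= 3). */
size_t uds_read_fault_record(const uint8_t *req, size_t req_len,
                             const uint8_t *record, size_t rec_len,
                             uint8_t *rsp, size_t rsp_cap) {
    assert(req != NULL && rsp != NULL && rsp_cap >= 3);
    if (req_len != 3) return negative(rsp, NRC_INCORRECT_LENGTH);
    uint16_t did = (uint16_t)((req[1] << 8) | req[2]);
    if (did != DID_FAULT_CONTEXT) return negative(rsp, NRC_OUT_OF_RANGE);
    if (record == NULL) return negative(rsp, NRC_CONDITIONS_NOT_OK);  /* design choice */
    if (3 + rec_len > rsp_cap) return negative(rsp, NRC_RESPONSE_TOO_LONG);
    rsp[0] = SID_READ_DATA_BY_ID + 0x40u;  /* positive response SID 0x62 */
    rsp[1] = req[1];                       /* echo the data identifier */
    rsp[2] = req[2];
    memcpy(rsp + 3, record, rec_len);
    return 3 + rec_len;
}
```

Replying with conditionsNotCorrect when no valid record exists is one option; an implementation could equally return a positive response with an empty or "no fault stored" payload.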
Description
Software fault diagnosis method for electronic control unit
Technical Field
The application relates to the field of automotive control, in particular to a software fault diagnosis method for an electronic control unit.
Background
In the modern automotive industry, the electronic control unit is a core component of the vehicle system, and the reliability, stability and safety of its software are cornerstones of the safety of the whole vehicle. So that the electronic control unit can recover when an unexpected software exception, such as an infinite loop, a logic error or a memory access conflict, halts the program execution flow, the industry generally uses a watchdog timer as a standard system safety mechanism. A watchdog is a hardware timer module that the software in the electronic control unit resets periodically during normal operation to indicate that it is running properly. Once the software becomes unresponsive, the timer is not reset in time, the watchdog times out, and a hardware reset of the system is forced. This mechanism effectively brings the electronic control unit out of a hung state and ensures the basic functional safety of the vehicle.
However, while a watchdog reset removes the risk of the software remaining stuck, the forced restart itself can significantly harm the user experience. For example, while the vehicle is being driven, the entire dashboard may suddenly flash, some warning lights may briefly come on by mistake, or the in-vehicle infotainment system may momentarily freeze or go black. Although transient, these phenomena are enough to confuse and worry the driver and to reduce the perceived quality of the product and confidence in the brand. Automotive manufacturers therefore typically require suppliers to locate and eliminate the root cause of such software hangs rather than rely solely on the watchdog for passive recovery.
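The watchdog mechanism described in the Background can be modelled in a few lines of C. This is a minimal host-side sketch rather than firmware: the `Watchdog` struct, the tick-based timing and the `on_timeout` callback are hypothetical stand-ins for a hardware module whose registers and clocking are device-specific. In a conventional design `on_timeout` would simply force a reset; in the method of this application the timeout event additionally triggers a non-maskable interrupt so that context can be captured before the reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a hardware watchdog timer. */
typedef struct {
    uint32_t timeout_ticks;    /* preset timeout period, in timer ticks */
    uint32_t counter;          /* ticks elapsed since the last service */
    bool expired;              /* latched once the timeout is reached */
    void (*on_timeout)(void);  /* action on expiry (a reset, in a conventional design) */
} Watchdog;

void wdt_init(Watchdog *w, uint32_t timeout_ticks, void (*on_timeout)(void)) {
    assert(w != NULL && timeout_ticks > 0);
    w->timeout_ticks = timeout_ticks;
    w->counter = 0;
    w->expired = false;
    w->on_timeout = on_timeout;
}

/* Called periodically by healthy software to show that it is still running. */
void wdt_service(Watchdog *w) {
    w->counter = 0;
}

/* Called on every timer tick; fires on_timeout once when the period elapses. */
void wdt_tick(Watchdog *w) {
    if (w->expired) return;
    if (++w->counter >= w->timeout_ticks) {
        w->expired = true;
        if (w->on_timeout != NULL) w->on_timeout();
    }
}
```

As long as `wdt_service` is called more often than every `timeout_ticks` ticks, the timer never expires; a hung task that stops servicing it causes expiry after exactly `timeout_ticks` ticks.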
Locating the root cause of software failures that trigger the watchdog is extremely difficult. In real road use, these failures tend to be highly sporadic: one may occur once every two weeks or even less often, with no clearly reproducible trigger. This low frequency and randomness make traditional laboratory debugging almost completely ineffective. Because the problem cannot be reproduced reliably on a test bench with a hardware debugger connected, the development engineer cannot halt program execution at the moment of the fault to inspect the register state and function call stack. Another approach is to instrument the software in advance so that it outputs large volumes of run-time logs or trace information over a communication bus; by analysing the last log entries printed before the fault, the approximate execution path of the program can be inferred. However, the instrumentation itself can change the software's timing, which may make the original sporadic problem disappear or shift. Moreover, as the system is about to crash, the logging function itself may malfunction, leaving the recorded information incomplete or inaccurate. In addition, modern vehicles commonly support remote diagnostics based on protocols such as Unified Diagnostic Services (UDS). Such diagnostic protocols, however, are designed mainly to diagnose known, predefined fault scenarios, not to debug unknown, catastrophic software crashes. In summary, for sporadic software hangs of electronic control units, the prior art lacks a mechanism that automatically and reliably captures high-value, low-level software context information at the moment a fault occurs, without physically connecting a debugger and without having to reproduce the fault.
Disclosure of Invention
To automatically and reliably capture high-value, low-level software context information at the moment a fault occurs, without physically connecting a debugger and without having to reproduce the fault, the application provides a software fault diagnosis method for an electronic control unit, which adopts the following technical scheme: A software fault diagnosis method for an electronic control unit, comprising the steps of: S1, monitoring the running state of software in the electronic control unit based on a watchdog timer, the watchdog timer generating a timeout event when the software fails to reset it within a preset timeout period; S2, triggering, based on the timeout event, a non-maskable interrupt associated with the timeout event; S3, in response to the non-maskable interrupt, executing a corresponding interrupt service routine to capture software context information at the moment of the fault; S4, storing the software context informati