DE-102024133090-A1 - State and fault management mechanism for a PCIe communication stack in motor vehicles
Abstract
The invention relates to a centralized computer architecture (S) for a vehicle, comprising: multiple high-performance computing (HPC) platforms for providing automotive applications (008), e.g., ADAS, infotainment, infrastructure, etc., and multiple hardware and/or software components configured to provide Peripheral Component Interconnect Express (PCIe) communication to connect the high-performance computing (HPC) platforms. The centralized computer architecture (S) further comprises: - a preferably centralized state management system (SMS) and - a preferably centralized fault management system (EMS).
Inventors
- Licong Zhang
- Stefan Herr
- Krishnaswamy Durai
Assignees
- CARIAD SE
Dates
- Publication Date
- 20260513
- Application Date
- 20241112
Claims (15)
- A centralized computer architecture (S) for a vehicle, comprising: multiple high-performance computing (HPC) platforms for providing automotive applications (008), e.g., ADAS, infotainment, infrastructure, etc., multiple hardware and/or software components configured to provide Peripheral Component Interconnect Express (PCIe) communication to connect the high-performance computing (HPC) platforms, the centralized computer architecture (S) further comprising: - a preferably centralized state management system (SMS) and - a preferably centralized fault management system (EMS).
- The centralized computer architecture (S) according to Claim 1, wherein the state management system (SMS) comprises: - a state acquisition and mapping (103), which is implemented in particular in a PCIe communication driver (004), - a main state management (110), which is implemented in particular in an abstraction manager (005), and/or wherein the fault management system (EMS) comprises: - a fault mapping mechanism (205), which is implemented in particular in a PCIe communication driver (004), - a main fault management (211), which is implemented in particular in an abstraction manager (005), - a software monitoring mechanism (220), which is implemented in particular in an abstraction manager (005).
- The centralized computer architecture (S) according to Claim 1 or 2, wherein the state management system (SMS) comprises: - PCIe switch states (101), in particular provided on a PCIe switch (001), - PCIe controller states (102), in particular provided on a PCIe controller (002), - PCIe communication driver states (106) and an end-to-end heartbeat (107), in particular implemented on a PCIe communication driver (004), - a state acquisition and mapping component (104) and a link state machine (105), in particular implemented on a PCIe communication driver (004), - a driver state interface (108), in particular implemented on a PCIe communication driver (004), - a system state machine (111), in particular implemented on an abstraction manager (005), - a stack-level system state interface (112), in particular implemented on an abstraction manager (005) for a client library for SEM (014), integrated into an application state and error management (012), - a memory segment state machine (113), in particular implemented in a client library for COM (010) for the automotive applications (008), and/or - a memory segment state interface (114), in particular implemented on a client library for COM (010) for the automotive applications (008).
- The centralized computer architecture (S) according to one of the preceding claims, wherein the fault management system (EMS) comprises: - PCIe transmission errors (201) and PCIe switch errors (202), in particular detected on a PCIe switch (001), - PCIe protocol errors (203), in particular detected on a PCIe controller (002), - an extended error reporting system (204), in particular implemented on an operating system (003), - PCIe communication driver software errors (208), in particular provided by the PCIe communication driver (004), - a data transmission error mapping system (206) and a system error mapping system (207), in particular implemented on a PCIe communication driver (004), - a driver-level transmission error interface (209) and a driver-level system error interface (210), in particular implemented on a PCIe communication driver (004), - an allocation power counter (212), an error mapping system (213) and errors from the system state machine (214), which are implemented in particular on a PCIe communication manager (006) as part of an abstraction manager (005), - a stack-level power counter interface (215) and a stack-level system error interface (216), which are implemented in particular on an abstraction manager (005) for application state and error management (012), - a stack-level transmission error interface (217), which is implemented in particular on a PCIe communication driver (004) for the automotive applications (008), - end-to-end errors (218), such as sequence counter, cyclic redundancy check, and data ID errors, which are detected in particular by a PCIe end-to-end library (011) as part of the automotive applications (008), - timeout errors (219), in particular detected by the automotive applications (008), - software timing and execution monitoring (222) and software errors (223), in particular implemented and/or detected by a software monitor (007), and/or - software monitoring interfaces (224), (225), (226), (227), (228) for the software timing and execution monitoring (222).
- The centralized computer architecture (S) according to one of the preceding claims, further comprising a multi-layered software architecture (SWA) comprising: - a PCIe switch (001), wherein the PCIe switch (001) is a hardware module running switch firmware, and/or wherein the PCIe switch (001) is configured to detect PCIe transmission errors (201), PCIe switch errors (202), and/or the PCIe switch states (101) and report them in-band via a PCIe connection to PCIe software stacks running on the high-performance computing (HPC) platforms, and/or wherein the PCIe switch (001) is configured to report PCIe switch errors (202) to a remote management host (RMH), preferably a microcontroller configured for controlling and flashing the PCIe switch (001), - a PCIe controller (002), wherein the PCIe controller (002) is a hardware module on the high-performance computing (HPC) platforms that preferably uses low-level drivers of a board support package (BSP), and/or wherein the PCIe controller (002) and preferably the low-level drivers of the board support package (BSP) are configured to detect PCIe protocol errors (203) and PCIe controller states (102) specified in the PCIe standards and report them to higher layers of the software architecture (SWA), - an operating system (003), e.g., Linux or QNX, wherein the operating system (003) includes standard modules to support PCIe capabilities, and/or wherein the operating system (003) is configured for error management, and/or wherein the operating system (003) is configured to collect and report errors specified in the PCIe standards for extended error reporting (204), and/or wherein the operating system (003) is configured to collect errors for extended error reporting (204) that are reported directly by a board support package (BSP), - a PCIe communication driver (004), wherein the PCIe communication driver (004) is a layer that provides drivers for the PCIe switch (001) and services for basic communication over shared memory, e.g., DMA, PIO and/or NTB, and/or wherein the PCIe communication driver (004) is configured to provide mainly two functions: a fault mapping mechanism (205) and state acquisition and mapping (103), and/or wherein the PCIe communication driver (004) is configured to provide a standardized interface (IF) for driver-level transmission faults (209), system faults (210) and a state (108), - an abstraction layer comprising: an abstraction manager (005), comprising: - a PCIe communication manager (006), wherein the PCIe communication manager (006) is configured to provide a main fault management (211) and main state management (110) functionality to consolidate and abstract low-level faults and states and compile them into faults and states that are understandable and usable for the automotive applications (008), - a software monitor (007), wherein the software monitor (007) is configured to implement the functionality of a software monitoring mechanism (220) used to monitor software components, and a fallback state and fault management (221) used to safeguard state and fault management in the event that the PCIe communication manager (006) crashes and the main state and fault management functionality is lost, - a client library for COM (010) for the automotive applications (008), wherein the client library for COM (010) is a client library within a process of a generic application that uses PCIe communication, and/or wherein the client library for COM (010) is configured to provide application programming interfaces (APIs) for communication and an interface to the automotive applications (008) for a stack-level transmission error interface (217), and/or wherein the client library for COM (010) is configured to implement a memory segment state machine (113) and provide an interface for a stack-level segment state interface (114) to the automotive applications (008), - a client library for SEM (014), wherein the client library for SEM (014) is a client library within a process of an application state and fault management (012), e.g., of an ADAS system, which provides the interfaces for a stack-level power counter interface (215), for a stack-level system fault interface (216), for a stack-level system state interface (112), and for a stack-level system fault fallback interface (231), - a PCIe end-to-end library (011), wherein the PCIe end-to-end library (011) is a client library configured to be used by the automotive applications (008) to compute and verify various variants of end-to-end protection and its elements, such as a sequence counter (SQC), a cyclic redundancy check (CRC), and a data ID, and/or - the automotive applications (008), comprising: - an automotive application operating logic (009), wherein the automotive application operating logic (009) is a generic automotive application that uses PCIe communication, and/or wherein the automotive application operating logic (009) is configured to implement corresponding automotive application business logic (401), - an application state and fault management system (012), wherein the application state and fault management system (012) is configured to perform state and fault management implemented at the application level.
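The end-to-end protection elements named for the PCIe end-to-end library (011), i.e. sequence counter, CRC and data ID, can be illustrated with a minimal Python sketch. The frame layout (16-bit data ID, 8-bit sequence counter, CRC-32 trailer), the use of zlib's CRC-32, and the names `protect` and `check` are assumptions made for illustration only; the claims do not specify a frame format, a CRC polynomial, or an API.

```python
import struct
import zlib

# Hypothetical layout (not from the patent): 16-bit data ID, 8-bit sequence counter.
HEADER = struct.Struct("<HB")
# Hypothetical trailer: CRC-32 computed over header and payload.
TRAILER = struct.Struct("<I")


def protect(data_id: int, sqc: int, payload: bytes) -> bytes:
    """Sender side: prepend data ID and sequence counter, append the CRC."""
    body = HEADER.pack(data_id, sqc & 0xFF) + payload
    return body + TRAILER.pack(zlib.crc32(body))


def check(frame: bytes, expected_data_id: int, expected_sqc: int):
    """Receiver side: return (payload, None) on success or (None, error_name)."""
    if len(frame) < HEADER.size + TRAILER.size:
        return None, "length_error"
    body, trailer = frame[:-TRAILER.size], frame[-TRAILER.size:]
    (crc,) = TRAILER.unpack(trailer)
    if crc != zlib.crc32(body):
        return None, "crc_error"
    data_id, sqc = HEADER.unpack_from(body)
    if data_id != expected_data_id:
        return None, "data_id_error"
    if sqc != (expected_sqc & 0xFF):
        return None, "sequence_counter_error"
    return body[HEADER.size:], None
```

In this sketch the CRC is checked first, so a corrupted header is reported as a CRC error rather than as a misleading data ID or sequence counter error.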
- The centralized computer architecture (S) according to one of the preceding claims, wherein a system state machine (111) is configured to provide the following states: - Startup state, covering a startup phase of the PCIe software stack, - Running state, covering the main operating states of the PCIe software stack, - Disconnect state, in particular as a substate of the Running state, - Termination and Exited states, covering a proper shutdown phase of the PCIe software stack, - PreparingToSuspend state, in particular as a substate of the Running state, - ReadyToSuspend, SuspendToRam and RecoverFromSuspend states, covering states that implement suspension to RAM and recovery from it, - FaultAndRestart state, covering states that indicate a system fault and recovery from it, - Silent state to provide a failsafe in the event of a malfunction to ensure operation and to initiate an interruption of communication.
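The states of the system state machine (111) can be sketched in Python as follows. The state names follow the claim; the transition table `ALLOWED` and the class `SystemStateMachine` are hypothetical, since the claim lists states but does not define the edges between them.

```python
from enum import Enum, auto


class SystemState(Enum):
    STARTUP = auto()
    RUNNING = auto()
    DISCONNECT = auto()            # substate of Running
    PREPARING_TO_SUSPEND = auto()  # substate of Running
    READY_TO_SUSPEND = auto()
    SUSPEND_TO_RAM = auto()
    RECOVER_FROM_SUSPEND = auto()
    TERMINATION = auto()
    EXITED = auto()
    FAULT_AND_RESTART = auto()
    SILENT = auto()


S = SystemState
RUNNING_SUBSTATES = frozenset({S.DISCONNECT, S.PREPARING_TO_SUSPEND})

# Assumed transitions for illustration; the patent text does not specify them.
ALLOWED = {
    S.STARTUP: {S.RUNNING, S.FAULT_AND_RESTART},
    S.RUNNING: {S.DISCONNECT, S.PREPARING_TO_SUSPEND, S.TERMINATION,
                S.FAULT_AND_RESTART, S.SILENT},
    S.DISCONNECT: {S.RUNNING, S.TERMINATION, S.FAULT_AND_RESTART},
    S.PREPARING_TO_SUSPEND: {S.READY_TO_SUSPEND, S.FAULT_AND_RESTART},
    S.READY_TO_SUSPEND: {S.SUSPEND_TO_RAM},
    S.SUSPEND_TO_RAM: {S.RECOVER_FROM_SUSPEND},
    S.RECOVER_FROM_SUSPEND: {S.RUNNING, S.FAULT_AND_RESTART},
    S.TERMINATION: {S.EXITED},
    S.EXITED: set(),
    S.FAULT_AND_RESTART: {S.STARTUP, S.SILENT},
    S.SILENT: {S.TERMINATION},
}


class SystemStateMachine:
    def __init__(self) -> None:
        self.state = S.STARTUP

    def is_running(self) -> bool:
        """True for Running itself and for its substates."""
        return self.state is S.RUNNING or self.state in RUNNING_SUBSTATES

    def transition(self, target: SystemState) -> None:
        """Move to the target state or raise if the edge is not allowed."""
        if target not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.name} -> {target.name} not allowed")
        self.state = target
```

Treating Disconnect and PreparingToSuspend as members of a substate set keeps the "is the stack running" query in one place, which is one possible reading of the substate wording in the claim.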
- The centralized computer architecture (S) according to one of the preceding claims, wherein a link state machine (105) is configured to provide the following substate machines: - an ego-node connection state machine to provide a state of the PCIe connection of an ego-node, in particular the node on which the state machine is located, - remote-node connection state machines to provide a state of the PCIe connection of remote nodes, in particular the nodes that are communication partners of the ego-node, whereby a remote-node connection state machine is preferably implemented for each individual remote node, whereby both the ego-node connection state machine and the remote-node connection state machines advantageously implement the following states: - "nodeConnected" state and - "nodeUnconnected" state, whereby, in particular, the connection states of the ego-node and the remote nodes can be determined from states that are defined by the PCIe controller (002) for the ego-node and by the PCIe switch (001) for the ego-node and the remote nodes, and/or wherein an end-to-end heartbeat mechanism (107) is implemented in which each node periodically sends an adaptable payload to all communication partner nodes, whereby, in particular, the end-to-end heartbeat (107) can be used to derive and refine certain states from the PCIe controller (002) and the PCIe switch (001), also covering the detection of transmission errors in software layers above the PCIe controller (002) and the PCIe switch (001) in the PCIe software stack.
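One way to combine the hardware-reported link state with the end-to-end heartbeat (107) for a single remote node is sketched below in Python. The class `RemoteNodeLink`, the timeout value, and the rule "connected only if the hardware link is up and a heartbeat arrived recently" are assumptions for illustration; the claim only states that the heartbeat can be used to derive and refine the states.

```python
from dataclasses import dataclass
from typing import Optional

NODE_CONNECTED = "nodeConnected"
NODE_UNCONNECTED = "nodeUnconnected"


@dataclass
class RemoteNodeLink:
    """Connection state of one remote node as seen from the ego-node."""
    heartbeat_timeout_s: float = 0.5          # hypothetical value
    hw_link_up: bool = False                   # as reported by controller/switch
    last_heartbeat_s: Optional[float] = None   # time of last received heartbeat

    def on_hw_state(self, link_up: bool) -> None:
        """Record the link state reported by the PCIe controller or switch."""
        self.hw_link_up = link_up

    def on_heartbeat(self, now_s: float) -> None:
        """Record reception of a heartbeat payload from the remote node."""
        self.last_heartbeat_s = now_s

    def state(self, now_s: float) -> str:
        """Derive the node state from hardware state plus heartbeat freshness."""
        if not self.hw_link_up or self.last_heartbeat_s is None:
            return NODE_UNCONNECTED
        if now_s - self.last_heartbeat_s > self.heartbeat_timeout_s:
            return NODE_UNCONNECTED
        return NODE_CONNECTED
```

Passing the current time in explicitly keeps the sketch deterministic; a real stack would presumably read a monotonic clock and keep one such object per remote node.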
- The centralized computer architecture (S) according to one of the preceding claims, wherein a memory segment state machine (113) is configured to provide the following states for both DMA and PIO data transmission: - Unconnected state, which denotes a state in which a segment is not connected, - Ready state, which denotes a state in which a segment is ready for any operation, - BlockedForWrite state, which denotes a state in which a segment is blocked for writing on the sender side, - BlockedForWrite: WriteInProgress state, which denotes a state in which writing is currently taking place on the sender side, - BlockedForRead state, which denotes a state in which a segment is blocked for reading on the receiver side, - BlockedForRead: ReadInProgress state, which denotes a state in which reading is currently taking place on the receiver side, - BlockedForSync state, in particular for DMA data transmission only, which denotes a state in which a segment is blocked for DMA transmission, - SyncInProgress state, in particular for DMA data transmission only, which denotes a state in which a DMA transmission is taking place.
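The memory segment states can be sketched in Python as an event-driven state machine. The state names follow the claim; the event names, the transition table, and the `Segment` class are hypothetical, because the claim does not specify which events move a segment between states.

```python
from enum import Enum


class Seg(Enum):
    UNCONNECTED = "Unconnected"
    READY = "Ready"
    BLOCKED_FOR_WRITE = "BlockedForWrite"
    WRITE_IN_PROGRESS = "BlockedForWrite: WriteInProgress"
    BLOCKED_FOR_READ = "BlockedForRead"
    READ_IN_PROGRESS = "BlockedForRead: ReadInProgress"
    BLOCKED_FOR_SYNC = "BlockedForSync"   # DMA only
    SYNC_IN_PROGRESS = "SyncInProgress"   # DMA only


DMA_ONLY = frozenset({Seg.BLOCKED_FOR_SYNC, Seg.SYNC_IN_PROGRESS})

# Assumed (state, event) -> next state table; event names are illustrative.
TRANSITIONS = {
    (Seg.UNCONNECTED, "connect"): Seg.READY,
    (Seg.READY, "lock_write"): Seg.BLOCKED_FOR_WRITE,
    (Seg.BLOCKED_FOR_WRITE, "start_write"): Seg.WRITE_IN_PROGRESS,
    (Seg.WRITE_IN_PROGRESS, "write_done"): Seg.READY,
    (Seg.READY, "lock_read"): Seg.BLOCKED_FOR_READ,
    (Seg.BLOCKED_FOR_READ, "start_read"): Seg.READ_IN_PROGRESS,
    (Seg.READ_IN_PROGRESS, "read_done"): Seg.READY,
    (Seg.READY, "lock_sync"): Seg.BLOCKED_FOR_SYNC,
    (Seg.BLOCKED_FOR_SYNC, "start_sync"): Seg.SYNC_IN_PROGRESS,
    (Seg.SYNC_IN_PROGRESS, "sync_done"): Seg.READY,
}


class Segment:
    def __init__(self, uses_dma: bool) -> None:
        self.uses_dma = uses_dma
        self.state = Seg.UNCONNECTED

    def fire(self, event: str) -> Seg:
        """Apply an event; reject unknown edges and DMA-only states for PIO."""
        target = TRANSITIONS.get((self.state, event))
        if target is None:
            raise ValueError(f"event {event!r} invalid in {self.state.value}")
        if target in DMA_ONLY and not self.uses_dma:
            raise ValueError(f"{target.value} is only available for DMA segments")
        self.state = target
        return target
```

A PIO segment built with `Segment(uses_dma=False)` can cycle through the write and read states but is refused the two sync states, which mirrors the "DMA only" qualifier in the claim.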
- The centralized computer architecture (S) according to any one of the preceding claims, wherein a PCIe controller state machine (102) is configured to provide the following states: - Normal operating state, - Unavailable state and/or - Fault state, and/or wherein a PCIe communication driver state machine (106) is configured to provide the following states: - Initialization state, - Normal operating state, - Silent state, - Fault state, - Ready to shut down state and/or - Ready to suspend state, suspend-to-RAM state, and restore-from-suspension state.
- The centralized computer architecture (S) according to one of the preceding claims, wherein a software monitoring mechanism (220) is configured to provide the following functionality: - liveness monitoring and timing monitoring of software components and/or processes in the PCIe communication stack, in particular, wherein corresponding software components and/or processes: - a PCIe communication driver (004), - a PCIe communication manager (006), - a client library for COM (010) for automotive applications (008), - a PCIe end-to-end library (011) and/or - a client library for SEM (014) implement a software monitoring interface (224, 225, 226, 227, 228) to report liveness and/or timing checks, in particular, wherein the results of the software monitoring are preferably reported as software errors (223), which are integrated into the fault management system (EMS).
- The centralized computer architecture (S) according to one of the preceding claims, further comprising an error indication (EI) comprising input errors: - PCIe protocol errors (203) and/or errors detected by the extended error reporting (204) of the PCIe communication stack, - PCIe switch errors (201, 202), including PCIe transmission errors and PCIe switch errors detected by the PCIe switch, and/or - software errors (208, 223, 214), including errors detected by software monitoring and/or by software components, and/or errors from the system state machine, whereby, in particular, the input errors are inputs to the entire error indication system, whereby the input errors are preferably not reported directly to the automotive applications (008), but are mapped and transformed by the error indication system of the PCIe communication stack.
- The centralized computer architecture (S) according to Claim 11, wherein the error indication (EI) further includes output errors: - transmission errors and/or errors related to the transmission of individual data, - system errors and/or errors indicating errors of the PCIe communication stack, and/or - power counters indicating optionally correctable errors of the PCIe switch (001) that can be corrected by hardware, wherein in particular transmission errors are reported to individual automotive applications (008) that perform data transmission via a stack-level transmission error interface (217), wherein preferably system errors are reported to an application state and error management system (012) of individual automotive applications (008) via a stack-level system error interface (216), wherein power counters are advantageously reported to an application state and error management system (012) via a stack-level power counter interface (215).
- The centralized computer architecture (S) according to one of the preceding Claims 11 or 12, wherein the error indication (EI) further includes error mapping and reporting to provide the following functions: - a data transmission error mapping (206) located in an error mapping mechanism (205) of the PCIe communication driver (004) and specifically configured to map input errors from various sources to transmission errors, - a system error mapping (207) located in an error mapping mechanism (205) of a PCIe communication driver (004) and specifically configured to map input errors from various sources as input to a main error management (211) of an abstraction manager (005), - an error mapping system (213) located in a main error management (211) of an abstraction manager (005) and specifically configured to map system errors reported by an error mapping mechanism (205) of a PCIe communication driver (004) to a stack-level system error interface (216) provided to an application state and fault management system (012), and/or - an allocation power counter (212) located in a main error management (211) of an abstraction manager (005) and specifically configured to map system errors reported by an error mapping mechanism (205) of a PCIe communication driver (004) to a stack-level power counter interface (215) provided to an application state and fault management system (012).
- The centralized computer architecture (S) according to one of the preceding claims, wherein the fault management system (EMS) further comprises components for fallback state and fault management (221) that provide the following functionalities: - state and fault system fallback (229), - fault handling fallback (230), and/or - a stack-level system fault fallback interface (231), whereby, in particular, the state and fault system fallback (229) is used as a simplified state machine and fault indication mechanism for an application state and fault management system (012) when a main state management system (110) and a main fault management system (211) are unavailable due to a software fault in a PCIe communication manager (006), whereby, preferably, the state and fault system fallback (229) reports faults to an application state and fault management system (012) so that the latter can take action, whereby the state and fault system fallback (229) further establishes a direct connection to a state acquisition and mapping (103) of the PCIe communication driver (004) to provide basic functions such as the silent state, whereby the fault handling fallback (230) is used in particular to provide interfaces to the automotive applications (008) to execute fault handling measures, whereby the fault handling fallback (230) preferably does not make any handling decisions, but receives decisions from an application state and fault management (012) and triggers appropriate measures in platform modules, such as the execution management module for resetting software components.
- A method for operating a centralized computer architecture (S) for a vehicle, comprising: multiple high-performance computing (HPC) platforms for providing automotive applications (008), e.g., ADAS, infotainment, infrastructure, etc., multiple hardware and/or software components configured to provide Peripheral Component Interconnect Express (PCIe) communication to connect the high-performance computing (HPC) platforms, the method comprising: provision of communication for a preferably centralized state management system (SMS), and provision of communication for a preferably centralized fault management system (EMS).
Description
The invention relates to a centralized computer architecture for a vehicle and a corresponding method for operating such an architecture.
PCIe (Peripheral Component Interconnect Express) communication for distributed E/E (electrical/electronic) systems in motor vehicles was recently proposed by the inventors in European patent applications EP 24 177 734.1 and EP 24 177 747.3. The entire content of these patent applications is incorporated into this disclosure by reference. PCIe is a new communication technology in automotive E/E systems that offers advantages over Ethernet communication for connecting multiple high-performance computing platforms (e.g., ADAS and infotainment) in a centralized computer architecture.
Several hardware and software components contribute to the overall system architecture (see Fig. 6). High-performance computing platforms (HPCs, whose computing devices include one or more microprocessors) are the primary participants in PCIe communication. Data is transferred between HPCs via either the Direct Memory Access (DMA) or the Programmed Input/Output (PIO) protocol. A uniform PCIe software stack is deployed on each HPC to enable PCIe communication services for automotive applications running on the HPCs. A PCIe switch is provided for data transmission between the HPCs via a non-transparent bridge (NTB) (and, if required, for a DMA engine for DMA transfers). To enable this, the switch runs a software- and system-dependent configuration. A remote management host (RMH), typically a microcontroller, is provided to control and flash the PCIe switch. Software is deployed on the RMH to perform these tasks.
The software stack mainly comprises three layers:
- The operating system and base layer provide the basic functions of PCIe communication as defined in the PCIe standards.
- The PCIe stack driver provides the drivers for cooperation with the PCIe switch (DMA or PIO and NTB drivers) and delivers the basic drivers for memory management.
- The abstraction layer abstracts the underlying layers and provides a shared-memory API for high-level automotive applications.
Further details regarding the system architecture and the software architecture for the PCIe software stack on the HPCs can be found in patent application EP 24 177 734.1.
It is therefore an object of the present invention to improve a centralized computer architecture for a vehicle and a corresponding method for operating such an architecture. In particular, it is an object of the invention to provide a state and fault management concept that addresses the following topics:
1) A state management system that:
- not only encompasses and integrates the low-level states defined by the PCIe standards, but also the states of the PCIe switch and the software components from a system perspective,
- can be seamlessly integrated into the state and lifecycle management of commonly used automotive operating systems, such as QNX, Linux or Android, and into automotive software frameworks, including but not limited to the AUTOSAR framework,
- introduces additional state machines, such as link and segment states, which are useful at the system and application levels,
- provides interfaces to automotive applications that consolidate and abstract states into states that are useful for generic automotive applications that use PCIe communication, and enables the interaction of these generic applications with the state management systems in certain use cases,
- addresses the safety requirements of critical automotive systems (e.g., advanced driver assistance systems, ADAS for short).
2) An error management system that:
- not only detects and integrates low-level errors defined by the PCIe standards, but also errors from the PCIe switch and software components from a system perspective,
- provides an error indication scheme that consolidates and abstracts low-level errors into high-level errors, reflecting the general performance state of the PCIe communication capability, which a generic automotive application or a state and fault management system of an application domain (e.g., ADAS) can understand, interpret, and handle,
- provides tailored functions for the application to handle errors from PCIe communication and clearly defines the responsibilities for error handling between the PCIe software stack and the application logic,
- seamlessly integrates a software monitoring scheme that identifies software errors,
- provides interfaces to the applications at the corresponding abstraction level,
- meets the requirements for safety-critical automotive systems (e.g., ADAS).
The aforementioned object is achieved by a centralized computer architecture for a vehicle and a corresponding method for operating such an architecture, comprising the features of the independent system claim and the independent method claim, respectively.
The features and details described in connection with the various embodiments and/or aspects of the invention