Generation of fault-tolerant state-based communication schedules for real-time systems

EURASIP Journal on Embedded Systems, Dec 2017

State-based schedules use a time division multiple access (TDMA) mechanism that supports executing conditional semantics and making on-the-fly decisions at run time in each communication cycle. Until now, state-based schedules have been unable to tolerate transient faults because they assume that stations make the on-the-fly decision on which message to execute next. In an unreliable communication environment such as a wireless medium, transient faults may cause a station to make a faulty decision at run time. Such a faulty decision causes state inconsistency among the stations in the system. In this work, we extend state-based schedules to tolerate faulty decisions in environments where transient faults can occur at the communication layer. Our proposed approach first reduces the possibility of state inconsistency by using a clock and a sampling rate synchronization mechanism, and then generates fault-tolerant state-based schedules using an integer linear programming optimization model. The optimization model maximizes the use of time slots to place checkpoints for fault tolerance and for resolving state inconsistency.

Akramul Azim
Department of Electrical, Computer and Software Engineering, University of Ontario Institute of Technology (UOIT), Oshawa, Canada

1 Introduction

The popularity of wireless networks is increasing every day because they are easy and affordable to deploy. Wired networks such as Ethernet-based networks often impede rapid deployment because of management issues. However, wired networks are in general more reliable than wireless networks because of transmission characteristics such as low channel interference and high bandwidth. Communication barriers such as channel interference and environmental challenges cause faults in wireless networks.
Moreover, faults can occur due to hardware and software glitches. For example, device memory can flip bits and routers may drop packets. In our context, a fault is a defect or flaw that occurs in a hardware or software component of the system. An error is a consequence of such a fault. As described in [1], a fault remains inactive until it produces an error. A failure occurs when an error results in the cancelation of the requested service of a system. Failures can have catastrophic effects on the system. For example, Therac-25 had catastrophic consequences due to software failures. Fault recovery can be effectively carried out by either restoring a previously correct state [2] or using redundancy [3]. Faults such as those in floating-point arithmetic may occur without becoming apparent at the same time [4]. Fault-tolerant systems attempt to detect and correct errors before they become effective.

Safety-critical real-time applications must function correctly and meet their timing constraints even in the presence of faults. Such faults can be permanent, such as broken communication links and damaged stations, or transient, such as temporary faults caused by interference. Transient faults are temporary but occur about 100 times more frequently than permanent faults [5, 6]. This paper discusses transient fault tolerance and leaves the extension to permanent faults to future work.

State-based schedules [7, 8] save system resources in hard real-time systems because they schedule messages for the average case rather than the worst case. Several case studies across different application areas already demonstrate the advantages of this approach, including control theory [9], hybrid systems [10], video-on-demand, hierarchical scheduling frameworks [11], and bursty demand models [12, 13]. The worst case can be avoided because stations are able to make on-the-fly decisions at run time.
On the other hand, the traditional static scheme, which is TDMA-based and does not allow decisions at run time, always schedules messages for the worst case.

In safety-critical systems, the triple modular redundancy (TMR) technique [14] is widely used for fault tolerance. Although TMR is not a robust mechanism for fault tolerance, it can mask faults quickly and runs efficiently. A state-based schedule can become fault-tolerant through the use of TMR, but it might not remain effective in unreliable environments because faulty decisions can occur. A faulty decision is an incorrect or inconsistent decision taken by any of the participating stations in the network. It results in state inconsistency and a potentially missed deadline, which is unacceptable in real-time systems. To ensure making the correct decision in a timely manner for safety-critical applications, architectures using a state-based schedule require state inconsistency detection and resoluti (...truncated)
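
The fault masking that the introduction attributes to TMR can be illustrated with a minimal sketch. It assumes three replicas that each report a decision (here, the label of the next message to transmit) and a simple majority voter. The function name `tmr_vote` and the message labels are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def tmr_vote(replicas: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by at least two of three replicas, else None.

    Illustrative majority voter only; not the paper's mechanism.
    """
    if len(replicas) != 3:
        raise ValueError("TMR expects exactly three replica outputs")
    value, count = Counter(replicas).most_common(1)[0]
    return value if count >= 2 else None


# One replica takes a faulty decision; the two correct replicas outvote it.
masked = tmr_vote(["m1", "m1", "m2"])

# All three replicas disagree; there is no majority, so nothing is masked.
unresolved = tmr_vote(["m1", "m2", "m3"])

print(masked, unresolved)
```

A single faulty decision is outvoted, but when all three replicas disagree the voter has nothing to return. This is one way to read the remark that TMR alone might not remain effective in unreliable environments.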


This is a preview of a remote PDF: https://jes-eurasipjournals.springeropen.com/track/pdf/10.1186/s13639-017-0082-x

Akramul Azim. Generation of fault-tolerant state-based communication schedules for real-time systems, EURASIP Journal on Embedded Systems, pp. 34,