Generation of fault-tolerant state-based communication schedules for real-time systems
Azim EURASIP Journal on Embedded Systems
Akramul Azim
Department of Electrical, Computer and Software Engineering, University of Ontario Institute of Technology (UOIT), Oshawa, Canada
State-based schedules use a time division multiple access (TDMA) mechanism that supports executing conditional semantics and making on-the-fly decisions at run time in each communication cycle. Until now, state-based schedules have been unable to tolerate transient faults because they assume that stations correctly make the on-the-fly decision on which message to execute next. In an unreliable communication environment such as a wireless medium, transient faults may cause a station to make a faulty decision at run time. Such a faulty decision causes state inconsistency among the stations in the system. In this work, we extend state-based schedules to tolerate faulty decisions in environments where transient faults can occur at the communication layer. Our proposed approach first reduces the possibility of state inconsistency by using a clock and sampling rate synchronization mechanism, and then generates fault-tolerant state-based schedules using an integer linear programming optimization model. The optimization model maximizes the use of time slots to place checkpoints for fault tolerance and for resolving state inconsistency.
1 Introduction
The popularity of wireless networks is increasing every
day because of their easy and affordable deployment
characteristics. Because of management issues, wired
networks such as Ethernet-based networks often impede
rapid deployment. However, wired networks are in general
more reliable than wireless networks owing to
transmission characteristics such as low channel interference
and high bandwidth.
Several communication barriers such as channel
interference and environmental challenges cause
faults in wireless networks. Moreover, faults can
occur due to hardware and software glitches. For example,
device memory can flip bits and routers may drop
packets. In our context, a fault is a defect or flaw that occurs
in a hardware or software component of the system. An
error is a consequence of such a fault. As described in [1],
a fault remains inactive until it produces an error. A
failure occurs when an error results in the cancellation of the
requested service of a system. Failures can have
catastrophic effects on the system. For example, Therac-25 had
catastrophic consequences due to software failures.
Fault recovery can be effectively carried out by either
restoring a previously correct state [2] or using
redundancy [3]. Faults such as those in floating-point
arithmetic may occur without becoming apparent
immediately [4]. Fault-tolerant
systems attempt to detect and correct errors before they
take effect.
Safety-critical real-time applications must function
correctly and meet their timing constraints even in the
presence of faults. Such faults can be permanent, such
as broken communication links and damaged stations,
or transient, such as temporary faults caused by
interference. Transient faults are temporary but
occur about 100 times more frequently than permanent
faults [5, 6]. This paper discusses
transient fault tolerance, leaving the extension to tolerate
permanent faults to future work.
State-based schedules [7, 8] are effective in saving
system resources for hard real-time systems because they
schedule messages for the average case rather than the
worst case. Several case studies across different
application areas already demonstrate the advantages of this
approach, including control theory [9], hybrid systems
[10], video-on-demand, hierarchical scheduling
frameworks [11], and bursty demand models [12, 13].
Executing the worst case can be avoided because
stations can make on-the-fly decisions at run time. In
contrast, the traditional static scheme, which is TDMA-based
and does not allow decisions to be made at run time,
always schedules messages for the worst case.
In safety-critical systems, the triple modular
redundancy (TMR) technique [14] is widely used for fault
tolerance. Although TMR is not a robust mechanism for
fault tolerance, the scheme can mask faults quickly and
runs efficiently. A state-based schedule can be made
fault-tolerant by using TMR, but it might not remain
effective in unreliable environments because of the
possibility of faulty decisions. A faulty decision is an
incorrect or inconsistent decision taken by any of the
participating stations in the network. This results in state
inconsistency and a potential deadline loss, which is
unacceptable in real-time systems.
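The masking behavior of TMR described above can be illustrated with a minimal sketch of majority voting over three replica outputs. This is a generic illustration, not the paper's mechanism: the function name `tmr_vote` and the use of message names as replica outputs are assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def tmr_vote(replicas: Sequence[Hashable]) -> Optional[Hashable]:
    """Majority vote over three replica outputs (hypothetical helper).

    Returns the value reported by at least two replicas, or None when
    all three replicas disagree and no fault can be masked.
    """
    if len(replicas) != 3:
        raise ValueError("TMR expects exactly three replica outputs")
    # most_common(1) yields the value with the highest count.
    value, count = Counter(replicas).most_common(1)[0]
    return value if count >= 2 else None


# One replica decides on a different next message; the vote masks it.
print(tmr_vote(["m1", "m1", "m2"]))
# All three replicas disagree; the vote cannot mask the fault.
print(tmr_vote(["m1", "m2", "m3"]))
```

Voting masks a single faulty replica, but it gives no answer when all three outputs differ and gives a wrong answer when two replicas share the same fault, which is consistent with the observation that TMR alone may not remain effective in unreliable environments.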
To ensure that the correct decision is made in a timely
manner for safety-critical applications, architectures using a
state-based schedule require state inconsistency
detection and resoluti (...truncated)