A Probabilistic Spatial Distribution Model for Wire Faults in Parallel Network-on-Chip Links
Hindawi Publishing Corporation
Mathematical Problems in Engineering
Volume 2015, Article ID 410172, 13 pages
http://dx.doi.org/10.1155/2015/410172
Research Article
Arseniy Vitkovskiy, Paul Christodoulides, and Vassos Soteriou
Faculty of Engineering and Technology, Cyprus University of Technology, 3603 Limassol, Cyprus
Correspondence should be addressed to Paul Christodoulides;
Received 4 October 2014; Accepted 11 January 2015
Academic Editor: Jinhu Lü
Copyright © 2015 Arseniy Vitkovskiy et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
High-performance chip multiprocessors contain numerous parallel-processing cores, whose escalating intertile communication demands are handled efficiently by a fabric devised as a network-on-chip (NoC). Unfortunately, prolonged operational stresses
cause accelerated physically induced wearout leading to permanent metal wire faults in links. Where only a subset of wires
malfunctions, the remaining healthy wires can be leveraged to sustain connectivity by means of a partially faulty link recovery mechanism,
whose data recovery latency overhead is proportional to the number of consecutive faulty wires. Because NoC link failure
models are important yet absent from the existing literature, constructing a mathematical model of
the distribution of wire faults in parallel on-chip links is critical. This paper takes a step in that direction,
where the objective is to find the probability of having a “fault segment” consisting of a certain number of consecutive “faulty” wires
in a parallel NoC link. First, it is shown how the given problem can be reduced to an equivalent combinatorial problem through
partitions and necklaces. Then the proposed algorithm counts certain classes of necklaces by making a separation between periodic
and aperiodic cases. Finally, the resulting analytical model is tested successfully against a far more costly brute-force algorithm.
1. Introduction
Continuous complementary metal-oxide-semiconductor
(CMOS) transistor miniaturization, following Moore’s law,
has sparked the multicore era [1, 2] in which the architectural
paradigm dictates that software application execution is
handled by numerous processing cores that operate in
parallel. This modular design of chips, including general-purpose chip multiprocessors (CMPs), not only ensures
ultrahigh performance attainment but also provides a
number of advantageous attributes such as those of power
and thermal management, reconfigurability, and fault-tolerance, among others [3–5]. Networks-on-chips (NoCs) [6, 7],
microscale equivalents of large-scale interconnection networks [8, 9], which also draw similarities to complex networks [10–12], as they are homogeneous and exhibit clustering
behaviour and short-distance communication between
node-pairs, have become the de facto communication
backbone in these multicore chips, including CMPs such as
the Tilera TILE64 CMP [2] and Intel’s 48-core Single-chip
Cloud Computer (SCC) [1], hence becoming inherent
components in these parallel on-chip systems.
Unfortunately, deep submicron CMOS process technology is marred by increasing susceptibility to wearout,
which the ITRS expects to increase tenfold over the next 10 years
[13], dramatically shortening the useful lifespan of multicore
systems. Point-to-point links, comprising a set of parallel
metallic wires [14], interconnect neighbouring routers, allowing message transfers on-chip. Prolonged operational stress
onto these parallel wires gives rise to accelerated wearout, due
to physical failure mechanisms primarily including electromigration (EM) and negative bias temperature instability [15]
that cause permanent device faults that can, in turn, quickly
lead to architectural-level failures and possible catastrophic
NoC operational failure.
Faults induced by these anomalies are widely predicted
to become increasingly common in the near future [16].
Research indicates that about 20% of all link errors are caused
by permanent failures, occurring both at manufacture-time
and at run-time [17, 18]. Moreover, the wire repeaters
(buffers), that is, the link drivers found in each router, the
output latches, and the flip-flops of pipelined links are also
susceptible and potentially vulnerable [19].
Even an isolated intrarouter or communication link
failure in the NoC fabric can turn a static regular topology
into an irregular one with subconnected geometry; hence,
either physical connectivity among routers may not exist at
all, and/or the associated routing protocol may not be able
to advance packets to their destinations due to protocol-level
violation(s) [20]. In-transit messages cannot traverse faulty
links, with back-pressure causing the effects of the fault(s)
to spread backwards, quickly causing congestion, and even
leading the entire system to stall indefinitely. Further, vital
components such as input/output (I/O) and various off-chip memory modules may be partitioned away from the
CMP as well, making them inaccessible. Indeed, a number of
surveys [4, 5, 21, 22], which outline the design challenges and
lay the roadmap in future multicore design, have emphasized
the need to conduct research and identify the primary
challenges in NoC reliability maintenance techniques, including link-level fault diagnosis and tolerance, as a means to
safeguard the scalability and performance sustainability of
general-purpose CMPs and application-driven systems-on-chips (SoCs).
The facts that high data rate on-chip links are susceptible to increasing failure rates that decelerate the NoC’s
performance, that the NoC is critical to a CMP’s overall
functionality, and that no real link failure data are readily
available from manufacturers (for obvious reasons) point to
the crucial need to construct a mathematical model to
aid in the understanding and exploration of the distribution
of wire faults in parallel on-chip links. This model can
potentially be coupled to fault-tolerant mechanisms at the
chip’s architectural-level to realize improvements in intercore
communication resiliency [1, 2]. This work takes decisive
steps in such a direction.
In this paper, we derive and demonstrate combinatorics-based models that can be used to calculate the spatial
probability distribution of individual wire faults in a parallel
network-on-chip (NoC) [6] interconnect link given its bit-width (the sum of the numbers of healthy
and faulty single-bit wires in the link) and the number
of faulty single-bit wires that reside in it.
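To make the quantity of interest concrete, the following is a minimal brute-force sketch and not the analytical model of this paper. It rests on assumptions that the text above does not state: the link is treated as a linear array of wires, every placement of the faulty wires is taken to be equally likely, and the statistic reported is the length of the longest run of adjacent faulty wires. The function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def longest_fault_run(n_wires, faulty):
    """Length of the longest block of adjacent faulty wires (linear link assumed)."""
    faulty = set(faulty)
    best = run = 0
    for wire in range(n_wires):
        # Extend the current run on a faulty wire, reset it on a healthy one.
        run = run + 1 if wire in faulty else 0
        best = max(best, run)
    return best


def longest_run_distribution(n_wires, n_faulty):
    """Exact distribution of the longest fault run by exhaustive enumeration.

    Assumes all C(n_wires, n_faulty) fault placements are equally likely.
    """
    counts = Counter(
        longest_fault_run(n_wires, placement)
        for placement in combinations(range(n_wires), n_faulty)
    )
    total = sum(counts.values())
    return {length: Fraction(c, total) for length, c in sorted(counts.items())}
```

For a 5-wire link with 3 faulty wires, `longest_run_distribution(5, 3)` enumerates the 10 possible placements. The number of placements grows combinatorially with the bit-width, which illustrates why exhaustive enumeration quickly becomes costly for wide links.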
Modern NoCs employ interrouter links comprising several
unidirectional parallel wires [14] that can transfer an entire
data flit in one clock cycle. Since (...truncated)