A Probabilistic Spatial Distribution Model for Wire Faults in Parallel Network-on-Chip Links (pdf)

Article PDF cannot be displayed. You can download it here:

http://downloads.hindawi.com/journals/mpe/2015/410172.pdf

A Probabilistic Spatial Distribution Model for Wire Faults in Parallel Network-on-Chip Links

Hindawi Publishing Corporation Mathematical Problems in Engineering Volume 2015, Article ID 410172, 13 pages http://dx.doi.org/10.1155/2015/410172 Research Article A Probabilistic Spatial Distribution Model for Wire Faults in Parallel Network-on-Chip Links Arseniy Vitkovskiy, Paul Christodoulides, and Vassos Soteriou Faculty of Engineering and Technology, Cyprus University of Technology, 3603 Limassol, Cyprus Correspondence should be addressed to Paul Christodoulides; Received 4 October 2014; Accepted 11 January 2015 Academic Editor: Jinhu Lü Copyright © 2015 Arseniy Vitkovskiy et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. High-performance chip multiprocessors contain numerous parallel-processing cores where a fabric devised as a network-onchip (NoC) efficiently handles their escalating intertile communication demands. Unfortunately, prolonged operational stresses cause accelerated physically induced wearout leading to permanent metal wire faults in links. Where only a subset of wires may malfunction, enduring healthy wires are leveraged to sustain connectivity when a partially faulty link recovery mechanism is utilized, where its data recovery latency overhead is proportional to the number of consecutive faulty wires. With NoC link failure models being ultimately important, albeit being absent from existing literature, the construction of a mathematical model towards the understanding of the distribution of wire faults in parallel on-chip links is very critical. This paper steps in such a direction, where the objective is to find the probability of having a “fault segment” consisting of a certain number of consecutive “faulty” wires in a parallel NoC link. First, it is shown how the given problem can be reduced to an equivalent combinatorial problem through partitions and necklaces. Then the proposed algorithm counts certain classes of necklaces by making a separation between periodic and aperiodic cases. Finally, the resulting analytical model is tested successfully against a far more costly brute-force algorithm. 1. Introduction Continuous complementary metal-oxide-semiconductor (CMOS) transistor miniaturization, following Moore’s law, has sparked the multicore era [1, 2] in which the architectural paradigm dictates that software application execution is handled by numerous processing cores that operate in parallel. This modular design of chips, including generalpurpose chip multiprocessors (CMPs), not only ensures ultrahigh performance attainment but also provides a number of advantageous attributes such as those of power and thermal management, reconfigurability, and fault-tolerance, among others [3–5]. Networks-on-chips (NoCs) [6, 7], microscale equivalents of large-scale interconnection networks [8, 9], which also draw similarities to complex networks [10–12], as they are homogenous and exhibit clustering behaviour and short-distance communication between node-pairs, have become the de facto communication backbone in these multicore chips, including CMPs such as the Tilera TILE64 CMP [2] and Intel’s 48-core Single-chip Cloud Computer (SCC) [1], hence becoming inherent components in these parallel on-chip systems. Unfortunately, deep submicron CMOS process technology is marred by increasing susceptibility to wearout, expected to increase by 10x in the next 10 years by ITRS [13], dramatically shortening the useful lifespan of multicore systems. Point-to-point links, comprising a set of parallel metallic wires [14], interconnect neighbouring routers, allowing message transfers on-chip. Prolonged operational stress onto these parallel wires gives rise to accelerated wearout, due to physical failure mechanisms primarily including electromigration (EM) and negative bias temperature instability [15] that cause permanent device faults that can, in turn, quickly lead to architectural-level failures and possible catastrophic NoC operational failure. Faults induced by these anomalies are widely predicted to become increasingly common in the near future [16]. Research indicates that about 20% of all link errors are caused by permanent failures, occurring both at manufacture-time and at run-time [17, 18]. Moreover, the wire repeaters 2 (buffers), that is, the link drivers found in each router, the output latches, and the flip-flops of pipelined links are also susceptible and potentially vulnerable [19]. Even an isolated intrarouter or communication link failure in the NoC fabric can turn a static regular topology into an irregular one with subconnected geometry; hence, either physical connectivity among routers may not exist at all, and/or the associated routing protocol may not be able to advance packets to their destinations due to protocol-level violation(s) [20]. In-transit messages cannot traverse faulty links, with back-pressure causing the effects of the fault(s) to spread backwards, quickly causing congestion, and even leading the entire system to stall indefinitely. Further, vital components such as vital input/output (I/O) and various offchip memory modules may be partitioned away from the CMP as well, making them inaccessible. Indeed, a number of surveys [4, 5, 21, 22], which outline the design challenges and lay the roadmap in future multicore design, have emphasized the need to conduct research and identify the primary challenges in NoC reliability maintenance techniques, including link-level fault diagnosis and tolerance, as a means to safeguard the scalability and performance sustainability of general-purpose CMPs and application-driven systems-onchips (SoCs). The facts that high data rate on-chip links are susceptible to increasing failure rates that decelerate the NoC’s performance, that the NoC is critical to a CMP’s overall functionality, and that no real link failure data are readily available from manufacturers (for obvious reasons) point to the crucial need in constructing a mathematical model to aid in the understanding and exploration of the distribution of wire faults in parallel on-chip links. This model can potentially be coupled to fault-tolerant mechanisms at the chip’s architectural-level to realize improvements in intercore communication resiliency [1, 2]. This work takes decisive steps in such a direction. In this paper, we derive and demonstrate combinatoricsbased models that can be used to calculate the spatial probability distribution of individual wire faults in a parallel network-on-chip (NoC) [6] interconnect link given its bitwidth (summation of the numbers of single-bit width healthy and unhealthy wires in this parallel link) and a given number of faulty single-bit width wires that reside in this link. Modern NoCs employ interrouter links comprising several unidirectional parallel wires [14] that can transfer an entire data flit in one clock cycle. Since (...truncated)