Authors: R. Peñaranda, M. E. Gómez, P. Lopez, E. G. Gran and T. Skeie
Title: A Fault-Tolerant Routing Strategy for KNS Topologies Based on Intermediate Nodes
Affiliation: Communication Systems
Status: Accepted
Publication Type: Journal Article
Year of Publication: 2017
Journal: Concurrency and Computation: Practice and Experience
Issue: SI HiPINEB 2016
Publisher: John Wiley & Sons, Ltd.
Keywords: exascale computing, fault-tolerant routing, hybrid topology, KNS topology
Abstract

Exascale computing systems are being built with thousands of nodes. The high number of components in these systems significantly increases the probability of failure. A key component of these systems is the interconnection network. If failures occur in the interconnection network, they may isolate a large fraction of the machine. For this reason, an efficient fault-tolerant mechanism is needed to keep the system interconnected, even in the presence of faults. A recently proposed topology for these large systems is the hybrid k-ary n-direct s-indirect (KNS) family, which provides optimal performance and connectivity at a reduced hardware cost. This paper presents a fault-tolerant routing methodology for the KNS topology that degrades performance gracefully in the presence of faults and tolerates a large number of faults without disabling any healthy computing node. To tolerate network failures, the methodology uses a simple mechanism: for any source-destination pair, packets are forwarded to the destination node, if necessary, through a set of intermediate nodes (without being ejected from the network) in order to circumvent faults. The evaluation results show that the proposed methodology tolerates a large number of faults. For instance, it is able to tolerate more than 99.5% of fault combinations when there are ten faults in a 3-D network with 1,000 nodes using only one intermediate node, and more than 99.98% if two intermediate nodes are used. Furthermore, the methodology offers graceful performance degradation. As an example, performance degrades by only 1% for a 2-D network with 1,024 nodes and 1% faulty links.

DOI: 10.1002/cpe.4065
