Efficient deadlock-free routing

Project Details

Project Lead
Thomas William 
Project Manager
Jens Domke 
Project Members
Jens Domke  
Institution
GWT-TUD GmbH, IT  
Discipline
Computer Science (401) 

Abstract

Efficient deadlock-free routing strategies are crucial to the performance of large-scale computing systems. Many methods exist, but achieving the lowest latency and highest bandwidth on irregular or unstructured high-performance networks remains a challenge. The recently published deadlock-free single-source-shortest-path routing algorithm minimizes congestion in the network and avoids deadlocks in the network traffic. We would like to install a patched version of the InfiniBand Open Subnet Manager on India to test the potential performance gain for users and their applications. FutureGrid (FG) users would benefit from this algorithm on the HPC systems, both through a more stable network and through a speed-up of their MPI programs. The shorter runtimes would reduce the cost per project and free up capacity for more projects.

Intellectual Merit

We investigated a novel routing strategy based on the single-source-shortest-path routing algorithm and extended it to use virtual channels to guarantee deadlock freedom. We showed that this algorithm achieves minimal latency and high bandwidth with only a small number of virtual channels and that it can be implemented in practice. We now intend to confirm these results on FutureGrid hardware.
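
The following Python sketch illustrates the general idea of using virtual channels to obtain deadlock freedom. It is not the code of the patched Open Subnet Manager. It relies on the standard criterion that a set of routes cannot deadlock if its channel dependency graph is acyclic, and it places each route in the first virtual layer that stays acyclic. The function names (channel_dependencies, has_cycle, assign_virtual_layers) and the greedy first-fit placement are assumptions made for illustration only.

```python
from collections import defaultdict


def channel_dependencies(path):
    """Return the (channel, next_channel) pairs induced by one route.

    A route is a list of node ids; a channel is a directed link (u, v).
    """
    channels = list(zip(path, path[1:]))
    return list(zip(channels, channels[1:]))


def has_cycle(edges):
    """Return True if the directed graph given as (a, b) edges has a cycle."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)

    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)  # every node starts WHITE (0)
    for start in list(graph):
        if color[start] != WHITE:
            continue
        # Iterative depth-first search; a GREY node is on the current path.
        color[start] = GREY
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if color[nxt] == GREY:
                    return True  # back edge closes a cycle
                if color[nxt] == WHITE:
                    color[nxt] = GREY
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                color[node] = BLACK
                stack.pop()
    return False


def assign_virtual_layers(paths, max_layers):
    """Greedily place each route in the first layer that remains acyclic.

    Returns one layer index per route. Raises ValueError if max_layers
    layers are not enough for this simple first-fit strategy.
    """
    layers = [set() for _ in range(max_layers)]
    assignment = []
    for path in paths:
        deps = set(channel_dependencies(path))
        for index, layer in enumerate(layers):
            if not has_cycle(layer | deps):
                layer |= deps  # in-place update of the layer's dependency set
                assignment.append(index)
                break
        else:
            raise ValueError("not enough virtual layers for these routes")
    return assignment


# Four two-hop routes around a unidirectional ring 0 -> 1 -> 2 -> 3 -> 0.
ring_routes = [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]
print(assign_virtual_layers(ring_routes, max_layers=2))
```

In the ring example, the four routes together form a dependency cycle, so a single layer is not sufficient; moving one route to a second layer breaks the cycle.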

Broader Impacts

The results of these tests and performance measurements should encourage the InfiniBand consortium to include our patches to the InfiniBand Open Subnet Manager in the official release. FG users would benefit from this algorithm on the HPC systems, both through a more stable network and through a speed-up of their MPI programs.

Scale of Use

As these tests involve changes to the core of the InfiniBand installation, India needs to be exclusively available to us during a maintenance window lasting from a few minutes up to a maximum of 2 hours, so that we can install, test, and if necessary roll back our changes to the InfiniBand Open Subnet Manager.