Optimizing Partitioned Address-space Programs for Shared Memory and Hybrid Clusters

Project ID
Project Categories
Computer Science
NSF Grant Number

Parallel programming with partitioned address spaces has several advantages, including explicit data dependencies, programmer awareness of data locality, and the absence of obscure bugs caused by forgotten synchronization around shared data. However, partitioned address spaces create problems of their own. Our past and ongoing research has shown that productivity-related issues can be addressed by devising abstractions for specifying communication. An important remaining shortcoming of programming with partitioned address spaces, however, is that it is difficult to leverage hardware shared memory for such programs, due to program semantics and the limitations of the underlying communication libraries.

MPI is the leading communication library for partitioned address-space programming. In this project, we will develop techniques that combine MPI-aware compiler analysis with run-time systems to optimize MPI programs for shared memory, achieving zero-copy communication in a large number of cases. We avoid the difficult problem of statically matching sends and receives in MPI programs by developing a smart run-time system that serves as a drop-in replacement for the standard MPI primitives. A source-level compiler, based on LLNL's ROSE framework, will optimize the program by rewriting calls to MPI primitives to target this run-time system and by selectively globalizing communication buffers into shared space to enable zero-copy communication. We will extend our analysis and implementation to also work with Kanor, a declarative language for specifying communication that we have been developing.

The ability to optimize partitioned address-space programs, such as those using MPI, on increasingly common many-core machines is a powerful motivation for this project. However, to achieve improved scaling, we will also explore optimization challenges on hybrid platforms consisting of clusters of shared-memory machines.

Use of FutureSystems
We will run experiments on Delta to assess the performance of our techniques on multiple cores of a node. Later, we will use multiple nodes of a cluster to evaluate the performance in hybrid scenarios.
Scale of Use
One node at a time, initially. Multiple nodes in the later part of the project.