End-to-end Optimization of Data Transfer Throughput over Wide-area High-speed Networks

Project ID
FG-113
Project Categories
Computer Science
NSF Grant Number
0926701
NSF Grant URL
http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=0926701
Completed
Abstract
Data produced by large-scale scientific applications has reached multiple petabytes, even exabytes, while transfer speeds of multiple gigabits per second are now achievable thanks to advances in high-performance optical networking, which today supports links of up to 100 Gbps. The widely used transport-layer protocols (e.g., TCP) were not originally designed to cope with the capacity and speed of the networks now available to the scientific community. Many alternative transport-layer protocols have been designed for high-speed networks; however, they have failed to replace the existing ones. Moreover, to reach transfer speeds of 100 Gbps, end-system capacities must also be taken into account in addition to protocol improvements. End-systems have evolved from single desktop computers into complex, massively parallel multi-node clusters, supercomputers, and multi-disk parallel storage systems. Additional levels of parallelism, using multiple CPUs and parallel disk access, are needed in combination with network protocol optimization to achieve high data transfer throughput. In this project, we develop application-level models and algorithms that perform network and end-system optimization to utilize the multi-Gbps bandwidth of high-speed networks with existing transport protocols, without any changes to the OS kernel. We claim that users should not have to change their existing protocols and tools to achieve high data transfer speeds. We achieve this at the application level via 'end-to-end data-flow parallelism', in which we use parallel streams and stripes that exploit multiple CPUs and disks. It is very important to utilize the network capacity without overly affecting existing traffic; our prediction models detect the appropriate level of parallelism and use only the minimal number of end-system resources needed to achieve it.
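The core idea of striping data across parallel streams can be illustrated with a minimal sketch. This is not the project's actual tool; the host, port, and stripe-count values are invented for a loopback demonstration. Each stripe travels over its own TCP connection, tagged with a 4-byte index so the receiver can reassemble the payload in order:

```python
# Illustrative sketch (not the project's implementation): sending a buffer
# over multiple parallel TCP streams and reassembling it on the receiver.
import socket
import threading

def send_stripe(host, port, index, stripe):
    """Send one stripe, prefixed with its 4-byte index, over its own TCP stream."""
    with socket.create_connection((host, port)) as s:
        s.sendall(index.to_bytes(4, "big") + stripe)

def parallel_transfer(data, num_streams, host="127.0.0.1", port=9000):
    """Split `data` into `num_streams` stripes and send them concurrently."""
    chunk = -(-len(data) // num_streams)  # ceiling division
    stripes = [data[i * chunk:(i + 1) * chunk] for i in range(num_streams)]
    threads = [threading.Thread(target=send_stripe, args=(host, port, i, st))
               for i, st in enumerate(stripes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def receive(num_streams, port=9000, ready=None):
    """Accept one connection per stripe and reassemble the payload by index."""
    stripes = {}
    with socket.socket() as srv:
        srv.bind(("127.0.0.1", port))
        srv.listen(num_streams)
        if ready is not None:
            ready.set()  # signal that the receiver is listening
        for _ in range(num_streams):
            conn, _ = srv.accept()
            with conn:
                buf = b""
                while True:
                    part = conn.recv(65536)
                    if not part:
                        break
                    buf += part
                stripes[int.from_bytes(buf[:4], "big")] = buf[4:]
    return b"".join(stripes[i] for i in sorted(stripes))

if __name__ == "__main__":
    data = bytes(range(256)) * 1024  # 256 KiB test payload
    result = {}
    ready = threading.Event()
    rx = threading.Thread(
        target=lambda: result.update(out=receive(4, ready=ready)))
    rx.start()
    ready.wait()  # wait until the receiver is listening
    parallel_transfer(data, num_streams=4)
    rx.join()
    assert result["out"] == data  # stripes reassembled correctly
```

In a real wide-area setting each stream would additionally be pinned to its own CPU core and disk stripe; the sketch only shows the stream-level parallelism.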
We require very little information about the network and end-systems, relying on previous transfer logs or immediate samplings. We keep the sampling overhead minimal through special techniques, so that the overall gain in throughput far outweighs the cost of sampling. We want to verify the feasibility and accuracy of the proposed prediction models by comparing their predictions with actual TCP data transfers over wide-area networks, and we would like to use several geographically distributed FutureGrid resources to validate our models.
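To give a flavor of how a prediction model can select a minimal level of parallelism, here is a simplified sketch built on the classic Mathis approximation for single-stream TCP throughput, Th ≈ (MSS/RTT)·(1/√p). This is not the project's own model, and all parameter values (link capacity, MSS, RTT, loss rate) are invented for illustration:

```python
# Hedged sketch: estimate how many parallel TCP streams are needed to reach
# a target fraction of link capacity, assuming each stream independently
# follows the Mathis throughput approximation. Not the project's model.
import math

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Approximate steady-state throughput of one TCP stream, in bits/s."""
    return (mss_bytes * 8 / rtt_s) * (1 / math.sqrt(loss_rate))

def streams_needed(capacity_bps, target_fraction, mss_bytes, rtt_s,
                   loss_rate, max_streams=64):
    """Smallest stream count whose predicted aggregate throughput reaches
    target_fraction * capacity; the aggregate is capped at link capacity."""
    per_stream = mathis_throughput_bps(mss_bytes, rtt_s, loss_rate)
    target = target_fraction * capacity_bps
    for n in range(1, max_streams + 1):
        if min(n * per_stream, capacity_bps) >= target:
            return n
    return max_streams

# Invented wide-area parameters: 10 Gbps link, 1460-byte MSS, 50 ms RTT,
# one-in-a-million packet loss; aim for 90% of capacity.
n = streams_needed(10e9, 0.9, 1460, 0.05, 1e-6)
```

A model like this needs only a few cheap measurements (RTT, loss rate) rather than exhaustive probing, which is the spirit of keeping the sampling overhead small relative to the throughput gain.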
Use of FutureSystems
We will run TCP- and UDP-based transfer protocol services on several FutureGrid nodes and test our protocol optimization models on these services. We are especially interested in wide-area, high-bandwidth experiments.
Scale of Use
We will need 1-2 nodes on each cluster for 1-2 days per experiment. We expect to perform 3-4 experiments per month.