PIRE: Training and Workshops in Data Intensive Computing Using The Open Science Data Cloud

Project ID
Project Categories
Project Keywords
NSF Grant Number

Many scientists today face the unprecedented challenge of managing and analyzing a rapidly growing set of complex data. This international PIRE project aims to narrow the growing gap between the capability of modern scientific instruments to produce data and the ability of researchers to manage, analyze, and share those data in a reliable and timely manner. The emerging technology of cloud computing is a step forward from the current cyberinfrastructure. Cloud computing involves clusters (the "clouds") of distributed computers that provide potentially less expensive, more flexible, and more powerful on-demand resources and services over a network, usually the Internet, while providing the scale and the reliability of a data center. This PIRE team intends to help develop large-scale distributed computing capabilities - the Open Science Data Cloud (OSDC) - to provide long-term persistent storage for scientific data and state-of-the-art services for integrating, analyzing, sharing and archiving scientific data. The group proposes to study and strengthen storage systems that integrate specialized network protocols and support data transport over wide-area, high-performance networks. As data grows in size, the only practical means to analyze it is to use parallel programming, but until recently it has been time-consuming for a domain scientist to take advantage of parallel programming. Another research focus will be to develop new classes of cloud-based parallel programming frameworks and to integrate them into the cloud infrastructure so that this technology is more broadly available to scientists. In addition to the research dimensions of this project, another key aspect is the involvement, in workshops and in subsequent use of the cloud cyberinfrastructure, of many domain scientists and their students.
These groups will be trained in the basics of cloud computing and then will work to ensure that the cloud computing research advances maximize the manageability and analytical power of the complex datasets unique to their disciplines. This PIRE project embraces cloud computing as a global issue and so taps the cloud computing, high performance networking, domain science, e-Science, education and outreach expertise of its many collaborators in Europe, Asia and South America. Foreign partners also provide a natural mechanism to engage international scientific datasets and distributed networks, and accommodate different international standards to guarantee interoperability. The international collaborators can provide an entry into international collaborations for U.S. graduate students and early career scientists and can also serve as global ambassadors for this new cyberinfrastructure, helping to garner widespread support that will be critical to its future adoption. The project will build a strong cadre of students with a global perspective on scientific data management in many research areas vital to U.S. and international scientific collaborations. The project will provide U.S. graduate students and early career scientists with international research and education experiences with leading scientists via research and training at foreign institutions and participation in annual workshops. As a group, the PIRE students will share an interest in data intensive computing but will be drawn from fields as diverse as computer science, physics, astronomy, geosciences, chemistry, engineering, and biology, lending an interdisciplinary vigor to their training. The PIRE team members will also develop 1-2 day and 1-2 week courses on data intensive computing, with hands-on exercises developed by U.S. and international faculty in computer science and the domain sciences. This PIRE project is likely to have numerous impacts above the level of the individual collaborators. 
For the U.S. PIRE institutions, it will strengthen current linkages and collaborations in the global cloud computing community and engage more U.S. students in international interdisciplinary research teams for the service, support and analysis of large scientific datasets. The project will enhance internationalizing efforts both at the University of Illinois at Chicago and at Florida International University by providing opportunities for short-term research abroad and other academic experiences to a diverse group of students. This project will increase the virtual international engagement of the U.S. institutions via distributed research collaborations, courses with transcontinental participation, global web discussions, and focused social networking forums. Increasing the number of scientists with expertise in managing and analyzing very large datasets is also vital to the future of our nation. Finally, since this transformative technology is broadly applicable to any scientific project struggling to manage and analyze the volume of data produced, the OSDC and its facilitative impacts are likely to persist long after the PIRE project has ended. Participating U.S. institutions include the National Center for Data Mining (NCDM) at the University of Illinois at Chicago and Florida International University. OSDC U.S. partner institutions include the University of Chicago and Johns Hopkins University. Partnering foreign institutions include the University of Edinburgh (UK); Universidade Federal Fluminense (Brazil); University of Amsterdam (The Netherlands); National Institute of Advanced Industrial Science and Technology (AIST) (Japan); Korea Institute of Science and Technology Information (KISTI) Supercomputing Center; Beijing Institute of Genomics (BIG) - Chinese Academy of Sciences; and the State University of Sao Paulo (Brazil).
This project is cofunded by the NSF's Office of International Science and Engineering, the Office of Cyberinfrastructure, the Division of Computing and Communication Foundations, the Division of Astronomical Sciences, and the Division of Physics.

Use of FutureSystems
Test and benchmark Sector (a distributed file system and parallel data processing framework), and help other FutureGrid users learn and use Sector to store and process their large data sets.
Scale of Use
100 VMs, each with 100 GB of local disk. The resources are needed only while the system is running.