FutureGrid Project Challenge (Project FG-172)

Project Details

Project Lead
Sebastiano Peluso 
Project Manager
Sebastiano Peluso 
Institution
IST / INESC-ID, INESC-ID / Distributed Systems Group  
Discipline
Computer Science (401) 
Subdiscipline
14.09 Computer Engineering 

Abstract

Cloud Computing has emerged as a new paradigm for deploying, managing and offering services through a shared infrastructure. The foreseen benefits of Cloud Computing are very compelling both from a cloud consumer and from a cloud services provider perspective: freeing corporations from large IT capital investments via usage-based pricing schemes; leveraging the economies of scale for both services providers and users of the cloud; ease of deployment of services. One of the main challenges to materialize these perceived benefits is to identify innovative distributed programming models that simplify the development of Cloud-based services, allowing for ordinary programmers to take full advantage of the seemingly unbounded amount of computational power and storage available on demand in large scale Cloud infrastructures. This project aims at designing, building, and evaluating a novel platform for service implementation of Cloud-based services: Cloud-TM (Cloud-Transactional Memory). Cloud-TM offers a simple and intuitive programming model for large scale distributed applications that integrates the familiar notion of atomic transaction as a first-class programming language construct, sparing programmers from the burden of implementing low level, error-prone mechanisms (e.g. locking, persistence and fault-tolerance) and permitting major reductions in the time and cost of the development process. Cloud-TM will embed a set of autonomic mechanisms to simplify service monitoring and administration, a major source of costs in dynamic and elastic environments such as the cloud. These mechanisms aim at ensuring the achievement of user defined Quality of Service levels at minimum operational costs by automating the provisioning of resources from the cloud and self-tuning the middleware platform to achieve optimal efficiency in the utilization of resources.

Intellectual Merit

The research and development activities that will be carried out within this project have relations with the following main areas: - Programming Paradigms Cloud Computing - (Distributed) Transactional Memories - Persistence support - Replicated and Distributed databases In the following we highlight the progresses beyond the state of the art that will be brought about by this project. -- Programming paradigms for the Cloud --- This project aims at introducing a novel programming model explicitly aimed at simplifying the development of service-oriented applications deployed in Cloud Computing infrastructures. To this end, the Cloud-TM paradigm tackles one of the main issues in the development of applications/services deployed in large scale distributed platforms: the design of highly scalable, fault-tolerant schemes for ensuring the consistency of the application’s shared state. This is achieved by transparently extending the mainstream object-oriented programming model, exposing to the programmers a simple, innovative Distributed Transactional Memory interface which will hide both the issues related to concurrency and to distribution. Rather than forcing to the adoption of unfamiliar programming approaches, as it’s for instance the case of MapReduce, the Cloud-TM paradigm aims for a truly general purpose approach to the development of Cloud applications by integrating the transaction abstraction as a first-class construct of a widely adopted object-oriented programming language, such as Java. Rather than relying on a conventional relational database to ensure ACID transactional properties, Cloud-TM will integrate novel, highly scalable and adaptive transactional mechanisms optimized for elastic, large scale computing infrastructures. The development of these innovative mechanisms will actually contribute to advance also the state of the art in the recent areas of (Distributed) Transactional Memories, which are discussed next. -- Transactional Memories -- The Cloud-TM project aims at overcome the limitations of current TMs in several, complementary ways. First, to ensure optimal performances in the face of varying workload conditions, the Cloud-TM platform will embed autonomic mechanisms to autonomously adapt its contention management strategies. Second, in order to meet the scalability and reliability requirements of real-world service-oriented applications, we aim at enriching the traditional STM model to transparently leverage the resources of large scale Cloud Computing platforms, hiding the complexities related to implementing consistent, scalable and fault-tolerant replication schemes. Finally, to unleash the full power of the TM paradigm, Cloud-TM integrate novel algorithms for supporting parallel nesting capable of retaining the competitive performance of serial nesting STMs, no matter how deep nesting may be. -- Distributed Transactional Memories -- The project will design, develop and evaluate a DSTM platform that: I. Will integrate innovative data distribution, replication and caching aimed at minimizing the costs associated with remote network accesses. This will be done in a transparent manner for the programmers, based on the run-time monitoring and prediction of the applications’ data access patterns. The result will be a significant simplification and costs’ reduction of the development process, as programmers will be able to focus on implementing the real differentiating values of services, rather than crafting complex, low-level caching schemes. II. Will embed autonomic mechanisms for automatically scaling up and down the number of resources acquired from the cloud, so to match predefined QoS criteria with minimal operational costs. To maximize the efficiency while the number of nodes varies elastically to accommodate for traffic oscillations, the DSTM platform will transparently adapt the policies it adopts for local and global contention management, fault tolerance, data partitioning and caching. The achievement of this result will advance not only the state of the art in the area of performance evaluation by developing novel models able to capture the complex dynamics of DSTM systems. It will also lead to advances in the area of adaptive and self-stabilizing systems through the definition of novel mechanisms allowing to safely and efficiently transition between alternative implementations of fundamental services (such as caching and fault-tolerance) largely employed also in different contexts. III. Will include optimized interfaces towards highly-scalable object-oriented persistent storages, specifically designed to operate in Cloud Computing environments, such as Google’s BigTable and Amazon’s SimpleDB. Since, unlike in database environments, in a DSTM-based approach durability is only viewed as an optional requirement, the Cloud-TM platform will provide simple declarative mechanisms (e.g. at the programming language level or via external configuration files) to identify the portions of the application state that will have to be transparently persisted. -- Persistence Support -- Current support for persistence in cloud computing environments focuses on high-availability of the resources, on-demand scalability, flexibility of the persistent representation, and in simplifying the distribution of the resources through the cloud. On the other side, they do not focus yet on being programmer-friendly, they force application design to meet certain API limitations, and in some aspects it is currently a step back in some of the best object-oriented programming practices. They also impose an impedance mismatch and the programmer must explicitly code the persistency of the non-transient data implying a cost in development time. Last but not least, existing persistency mechanisms either cannot be composed with transactions, or just support the persistency of data local to a node in the context of a transaction. To address some of the limitations identified before, we propose a solution where the characteristics of the non-transient data of the application domain model is specified in a declarative way. This is achieved through the use of a Domain Modeling Language (DML) [Cachopo and Silva, 2006]. The DML is used to specify just the structural part of the persistent entities of the application domain model in a succinct and declarative Java-like syntax. It should be easy to use and allow the programmer to add specific behavior to the persistent domain entities. A compiler then receives the DML specification and automatically generates the code responsible for supporting the persistency of the non-transient entities. This way, the programmer does not have to code a single line concerning the persistency of data. Moreover, we will be able to automatically produce code for the persistency of the non-transient data that is best suited for specific requirements of the application. The proposed persistency component will take into account the local disk storage of each node for storing the persistent entities so that we do not waste resources. It will support the persistency of non-transient data accessed in the context of a transaction that is spread over the cloud. Finally, we aim at implementing a persistency mechanism that is self-tuning and auto-scaling in order to automatically chose the best algorithms and optimize the number of resources necessary to support the current workload of the application. -- Replicated and Distributed Databases -- The literature on replicated databases certainly represents a source of inspiration for the design of replication schemes for TMs. However, as also highlighted by our work in [Romano et al., 2008], typical workloads of TMs and databases show at least two key differences which make it highly desirable to develop replication schemes specifically tailored to meet the requirements of TMs. On one hand, the transaction execution time in STM-based systems is often several orders of magnitude smaller than for database applications. This leads to a corresponding amplification of the AB-based synchronization overhead with respect to the scenario of conventional database replication, raising the urge for novel replication schemes explicitly optimized for STM environments. Another important feature characterizing the workload of standard STM benchmarks is the much higher heterogeneity of their workload with respect to the case of traditional databases [Romano et al., 2008]. This motivates the need for novel, highly efficient replication schemes specifically designed to match the unique requirements of DSTMs. This is indeed an important goal of this project, which will actually lead to advances not only of the state of the art on DSTMs. In fact, we expect that the innovative replication mechanisms developed within the project will likely reveal attractive also in the more traditional context of database replication.

Broader Impacts

The innovative technologies that will be developed by this project aim at overcoming some of the major roadblocks to the adoption of the Cloud Computing model, namely: • the lack of programming paradigms that simplify the development of service-oriented applications (such as e-Commerce, online games, or social networks) to be deployed in the Cloud. The Cloud-TM platform aims at achieving very high scalability levels, and at strongly reducing the development costs and the time-to-market by freeing the programmers from the burden of implementing low level, error-prone mechanisms for, e.g., caching, replication and fault-tolerance. • the lack of autonomic mechanisms simplifying service monitoring and administration, a major source of costs in dynamic and elastic environments such as the Cloud. By self-optimizing the system depending on the current workload characteristics, Cloud-TM will strongly reduce not only the administration costs, but also the operational costs, by ensuring that the resources acquired from the Cloud infrastructure are never in excess and always optimally utilized. At the light of the above observations, in this project we aim at: • Deep technological advances in software/service engineering. New software technologies for improving scalability and predictability of distributed systems, improving responsiveness and throughput. A more competitive environment including infrastructure operators moved up the value chain with innovative service offerings on scalable infrastructure. • Lowered barriers for service providers, in particular SMEs, to develop services through standardised open (source) platforms and interfaces • A strengthened industry for software, software services and Web services, offering a greater number of more reliable and affordable services, enabled by flexible and resilient platforms for software/service engineering, design, development, management and interoperability. Technologies tailored to meet key societal and economical needs • Lower management costs of large networked systems through the ability to adapt to changing environments and patterns of use, and through a greater degree of, flexibility and reliability. • More efficient use of resources such as processing power, energy and bandwidth through autonomic decisions based on awareness

Scale of Use

We plan two main types of experiments: - large scale performance/scalability evaluation studies involving from 10 to 100-200 nodes in a single data center - inter-data center performance/scalability evaluation studies, involving 2-3 data centers. each of which using 10-20 nodes Each experimental session would run typically for one or two hours.

Results

Publications:
- Sebastiano Peluso, Pedro Ruivo, Paolo Romano, Francesco Quaglia and Luís Rodrigues, "When Scalability Meets Consistency: Genuine Multiversion Update-Serializable Partial Data Replication". Proc. of 32nd IEEE International Conference on Distributed Computing Systems (ICDCS), June 2012