ABSTRACT

In addition to the design considerations above, constructing cloud programs requires special attention to several challenges: scalability, communication, heterogeneity, synchronization, fault tolerance, and scheduling. First, scalability is hard to achieve in large-scale systems (e.g., clouds) for several reasons, such as the inability to parallelize all parts of an algorithm, the high probability of load imbalance, and the inevitability of synchronization and communication overheads. Second, exploiting locality and minimizing network traffic are not easy to accomplish on (public) clouds, since network topologies are usually unexposed. Third, heterogeneity, caused by two common realities on clouds (virtualization environments and variety in datacenter components), imposes difficulties in scheduling tasks and in masking hardware and software differences across cloud nodes. Fourth, synchronization mechanisms must guarantee mutually exclusive access and must also avoid hazards such as deadlocks and transitive closures, which are highly likely in distributed settings. Fifth, fault-tolerance mechanisms, including task resiliency, distributed checkpointing, and message logging, should be incorporated, since the likelihood of failures increases on large-scale (public) clouds. Finally, task and job schedulers need to address task locality, high parallelism, task elasticity, and service level objectives (SLOs) for programs to execute effectively.
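
The scalability point above, that not all parts of an algorithm can be parallelized, can be illustrated with a small calculation. The sketch below uses Amdahl's law, which the text does not name; it is offered here as one common way to quantify the effect. The function name `amdahl_speedup` and the 95% parallel fraction are hypothetical choices made for illustration only, and the model deliberately ignores load imbalance and the synchronization and communication overheads mentioned above.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Idealized speedup under Amdahl's law (an assumption, not from the text).

    parallel_fraction: share of the running time that can be parallelized (0..1).
    workers: number of identical workers executing the parallel part.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unaffected; only the parallel part shrinks with more workers.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical program in which 95% of the work is parallelizable.
    for n in (1, 10, 100, 1000):
        print(f"{n:>5} workers -> speedup {amdahl_speedup(0.95, n):.2f}")
```

Under these assumptions the speedup stays below 1 / (1 - 0.95) = 20 however many workers are added, which is why the serial portion of a program tends to dominate at cloud scale.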