Abstract
As public cloud systems become more efficient and cost-effective, there is a growing trend to complement in-house cloud environments with them. We propose an open solution to the problem of distributing jobs in a hybrid environment, as an alternative to cloud-specific Hadoop systems. The proposed system, MultiStack, is a big data orchestration platform for deploying big data jobs across multiple cloud providers. The specific architecture elaborated in this paper uses Amazon Web Services as the public cloud provider and OpenStack as the private cloud framework. Our solution supports the complete set of Hadoop ecosystem tools (Hive, Pig, HBase, Oozie, etc.) and on-demand scaling of Hadoop clusters. The proposed framework aims to reduce job completion time on workloads while lowering cost by using Spot Instance provisioning instead of on-demand provisioning. This is achieved through two modes of operation, proactive scheduling and reactive scheduling, which take into account user-provided job characteristics (e.g., memory and CPU), quota limitations, and business objectives. From our experiments, we conclude that MultiStack is able to reduce average job completion time by 30-37% with minimal increase in cost.