Abstract
In utility computing, users access services that are delivered in a manner similar to metered traditional utilities such as water, gas and electricity. This model is advantageous because it avoids the initial cost of acquiring computing resources: computation, storage, network and other services are available as metered services that can be provisioned according to customers' demands. These services can be broadly classified as Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS). We are thus approaching a model where everything is offered as a service (XaaS). These modern datacenters and clouds are distinguished by a utility pricing model in which customers are charged based on their utilization of computational resources, storage and transfer of data.
With the recent emergence of cloud computing services on the Internet, MapReduce has become the paradigm of choice for developing large-scale, distributed, data-intensive applications. MapReduce works best for embarrassingly parallel workloads, in which a larger problem can be divided, with little or no effort, into a number of smaller tasks that run independently on a cluster of machines, with no dependencies among those parallel tasks. MapReduce is used by more than 100 organizations worldwide to perform tasks such as web crawling and indexing, social media monitoring, scientific data processing, data mining and machine learning.
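To make the programming model concrete, the canonical word-count job is sketched below using Hadoop's standard org.apache.hadoop.mapreduce API (this is the stock textbook example, not a workload evaluated in this thesis): each map task tokenizes its input split independently of every other split, and the reduce tasks sum the per-word counts.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Each input split is tokenized independently of all other splits.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // All counts for a given word arrive at the same reducer.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because no map task depends on another, the framework is free to schedule tasks on whatever subset of nodes is currently active, which is the property the reconfiguration algorithm described later exploits.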
The Apache Hadoop framework is the leading open-source implementation of the MapReduce model. It provides a distributed file system, the Hadoop Distributed File System (HDFS), that facilitates high-throughput access to application data. Data stored on HDFS is divided into smaller chunks of configurable size and distributed across the cluster. HDFS creates multiple replicas of each chunk to ensure that data remains available at all times. Hadoop follows a rack-aware data placement policy and preserves data availability under situations ranging from a single node failure to the disconnection of a complete rack.
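For a replication factor of three, the default rack-aware policy places the first replica on the writer's node, the second on a node in a different rack, and the third on a different node in the second replica's rack. A simplified sketch of that choice follows; the Node and Topology types and their helper methods are hypothetical stand-ins, and the real HDFS implementation additionally weighs node load, free space and excluded nodes.

import java.util.ArrayList;
import java.util.List;

public class RackAwarePlacementSketch {
  interface Node {
    String rack();
  }

  interface Topology {
    Node randomNodeOffRack(String rack);              // hypothetical helper
    Node randomNodeOnRack(String rack, Node exclude); // hypothetical helper
  }

  static List<Node> chooseTargets(Node writer, Topology topology) {
    List<Node> replicas = new ArrayList<>();
    // 1st replica: the writer's own node, keeping the initial write local.
    replicas.add(writer);
    // 2nd replica: a node in a different rack, to survive a full rack failure.
    Node offRack = topology.randomNodeOffRack(writer.rack());
    replicas.add(offRack);
    // 3rd replica: another node in the 2nd replica's rack, limiting
    // cross-rack traffic while still spanning two racks.
    replicas.add(topology.randomNodeOnRack(offRack.rack(), offRack));
    return replicas;
  }
}

Spreading replicas across racks in this way is also what makes it possible, in principle, to deactivate nodes without losing data availability: as long as one replica of every block resides on an active node, the data can still be served.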
Hadoop is designed to work on cheap commodity hardware, is scalable and resilient against machine failures, and runs on clusters ranging from a single node to thousands of nodes. It has also been adopted by a number of educational institutions for research, where budgets are even tighter. The cost of supporting such an infrastructure is an important consideration when setting up a cluster. Power consumption of datacenters has become a key factor in the costs incurred by a service provider; this power-related cost includes investment, operating expenses, cooling costs and environmental impact. Given the scale at which these applications are deployed, minimizing the power consumption of these clusters can significantly cut operational costs and reduce their carbon footprint, thereby increasing the utility from a provider's point of view. For High Performance Computing systems, where the main focus is on improving performance at any cost, these energy-related costs have risen to the point where they can surpass the actual hardware acquisition costs.
The first problem that we address in this thesis is energy conservation for clusters of nodes that run MapReduce jobs. This problem is all the more important because MapReduce frameworks such as Hadoop have no separate power controller. We attempt to reduce the energy consumption of datacenters that run MapReduce jobs by reconfiguring the cluster. We propose an algorithm that dynamically reconfigures the cluster based on the current workload, turning cluster nodes on or off when the average cluster utilization rises above or falls below administrator-specified thresholds, respectively. Our implementation creates a channel between the cluster's power controller module and the underlying distributed file system to dynamically scale the number of nodes and adapt to the current service demands on the cluster.
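A high-level sketch of this control loop appears below. The interface names, the single-node step granularity and the sampling scheme are our illustrative assumptions; as described above, the actual implementation couples the power controller to HDFS so that scaling decisions respect data availability.

public class ReconfigurationLoopSketch {
  interface ClusterMonitor {
    double averageUtilization();  // hypothetical: fraction in [0.0, 1.0]
  }

  interface PowerController {
    void activateOneNode();       // hypothetical: bring a standby node online
    void deactivateOneNode();     // hypothetical: drain a node, then power it off
  }

  private final double lowerThreshold;  // administrator-specified
  private final double upperThreshold;  // administrator-specified

  ReconfigurationLoopSketch(double lowerThreshold, double upperThreshold) {
    this.lowerThreshold = lowerThreshold;
    this.upperThreshold = upperThreshold;
  }

  // Invoked periodically: scale up when the cluster is overloaded,
  // scale down when it is underutilized, and do nothing in between.
  void step(ClusterMonitor monitor, PowerController power) {
    double utilization = monitor.averageUtilization();
    if (utilization > upperThreshold) {
      power.activateOneNode();
    } else if (utilization < lowerThreshold) {
      power.deactivateOneNode();
    }
  }
}

Keeping a dead band between the two thresholds prevents the cluster from oscillating between scale-up and scale-down decisions under a fluctuating workload.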
We evaluate our algorithm on a variety of workloads, and our results show that the proposed algorithm achieves substantial energy conservation compared to the default HDFS implementation. In our model, the amount of energy saved is proportional to the number of deactivated nodes. We implement the default rack-aware replica placement policy followed by HDFS and incorporate the cluster reconfiguration decisions suggested by our algorithm to dynamically scale the number of nodes in the cluster. We demonstrate the scale-up and scale-down operations
of our algorithm and their corresponding energy savings, and observe that the cluster intelligently reconfigures itself based on the imposed workload, confirming the effectiveness of our algorithm. As expected, the energy conserved under light workloads is greater, since more nodes can be deactivated.
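Under this model, a first-order estimate of the savings can be written as

$E_{\text{saved}} \approx N_{\text{off}} \cdot P_{\text{node}} \cdot T$

where $N_{\text{off}}$ is the number of deactivated nodes, $P_{\text{node}}$ the average power draw of a single node, and $T$ the time for which those nodes remain off; these symbols are introduced here for illustration and are not drawn from the evaluation itself.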