Abstract
Traditional approaches to computing relied on dedicated hardware: each company or individual had to purchase hardware and set it up for each task. This approach made sense when the workload and its requirements were fixed, but in most cases both keep varying. Sometimes large amounts of resources are needed for short periods of time (as in scientific computation), and sometimes the number of systems must scale up and down with demand. Dedicated systems are not economical in such cases, since they involve high upfront costs and become a burden once the work is done. Resources come in different types, such as computation (CPU), storage (distributed file system storage), network, key-value stores, and databases. Enter the realm of cloud computing: it provides an economical way of renting resources when they are needed and releasing them afterwards for others to use, so that we are charged only for what we use. With data centers spread across the globe, cloud computing also lets us place resources closer to the end user, improving performance and user experience.
Cloud computing is built on commodity hardware, which is one of the reasons it is profitable. Insisting on truly homogeneous resources would make it less so; hence, more often than not, the systems are heterogeneous in nature. They have different CPU clock rates, different memory performance, different storage capabilities (tape, solid-state drives, hard disks, etc.), different network capabilities, and different kinds of switches connecting them. Even if one somehow managed to acquire truly homogeneous systems, the dynamic nature of today's systems and software would make them heterogeneous anyway: differences in the load and utilization of the various resources introduce heterogeneity of their own. Some network links may be over-utilized while others are under-utilized, some disks may store more data than others, and some nodes may become hotspots due to an uneven distribution of data or computation. This kind of heterogeneity is dynamic. We need to build systems that are aware of this inherent heterogeneity and can adapt to a constantly changing environment.

Storage in a distributed environment has multiple attributes to optimize. There are different kinds of storage, each with a different capacity and different performance characteristics: solid-state drives and memory must be used more judiciously than tapes or hard disks, and each device has its own power consumption, cooling requirements, and network requirements. How, then, do we decide where to place data so that we optimize a combination of these factors? We studied this problem in two different kinds of storage systems (a toy illustration of the placement question follows the list):
Distributed block-based file systems such as the Hadoop Distributed File System (HDFS)
Distributed Key-Value stores
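As a toy illustration of this placement question, and not the prototype itself, the following sketch scores candidate storage devices by a weighted combination of heterogeneity attributes and places a block on the lowest-cost device; the device names, attribute values, and weights are all hypothetical.

```python
# Toy illustration (not the thesis prototype): score candidate storage
# devices by a weighted combination of heterogeneity attributes and
# place a block on the lowest-cost device. Attribute values and weights
# below are hypothetical.

DEVICES = {
    "ssd-node-1":  {"latency_ms": 0.2,  "bandwidth_mbps": 500, "power_w": 6, "free_gb": 200},
    "hdd-node-2":  {"latency_ms": 8.0,  "bandwidth_mbps": 150, "power_w": 9, "free_gb": 2000},
    "tape-node-3": {"latency_ms": 9000, "bandwidth_mbps": 80,  "power_w": 3, "free_gb": 20000},
}

# Relative importance of each attribute for a latency-sensitive workload.
WEIGHTS = {"latency_ms": 0.5, "bandwidth_mbps": 0.3, "power_w": 0.1, "free_gb": 0.1}


def normalize(devices):
    """Scale every attribute to [0, 1] so different units are comparable."""
    attrs = next(iter(devices.values())).keys()
    norm = {name: {} for name in devices}
    for attr in attrs:
        values = [d[attr] for d in devices.values()]
        lo, hi = min(values), max(values)
        for name, d in devices.items():
            norm[name][attr] = 0.0 if hi == lo else (d[attr] - lo) / (hi - lo)
    return norm


def placement_cost(norm_attrs):
    """Lower is better: latency and power add cost, bandwidth and free space reduce it."""
    return (WEIGHTS["latency_ms"] * norm_attrs["latency_ms"]
            + WEIGHTS["power_w"] * norm_attrs["power_w"]
            + WEIGHTS["bandwidth_mbps"] * (1 - norm_attrs["bandwidth_mbps"])
            + WEIGHTS["free_gb"] * (1 - norm_attrs["free_gb"]))


if __name__ == "__main__":
    norm = normalize(DEVICES)
    best = min(DEVICES, key=lambda name: placement_cost(norm[name]))
    print("place block on:", best)
```

In practice the weights would reflect the workload: a latency-sensitive workload would weight latency heavily, while an archival workload would weight power and capacity instead.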
We studied two different approaches for these two areas. For block-based distributed file systems, we proposed modelling the different parameters of heterogeneity, such as latency, bandwidth, and power usage, as a distance measure in a multi-dimensional space and then applying dimensionality reduction, finally reducing placement to a simple optimization problem. For the key-value store, we concentrated on modifying how partitions are created and assigned to servers in consistent hashing, optimizing over the different parameters by treating assignment as a bidding problem in which the systems on the consistent hash ring are the bid items and the partitions are the bidders. In both cases we developed a prototype and benchmarked it thoroughly, using both standard benchmarks and different load scenarios to test its adaptive behaviour. The prototypes showed better performance as well as interesting load-balancing and placement characteristics.
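To make the bidding formulation for the key-value store concrete, the following is a minimal sketch under assumed inputs rather than the prototype's actual mechanism: the server names, attribute values, partition profiles, weights, and the greedy settlement of bids are all hypothetical, and only illustrate partitions bidding for servers on the consistent hash ring subject to per-server capacity.

```python
# Minimal sketch of a bidding-style partition assignment (hypothetical
# attributes, profiles, and capacities; the prototype's actual mechanism
# may differ). Partitions are the bidders, servers on the hash ring are
# the bid items.

SERVERS = {
    "node-a": {"cpu": 1.0, "mem_gb": 64,  "net_gbps": 10, "capacity": 3},
    "node-b": {"cpu": 0.6, "mem_gb": 32,  "net_gbps": 1,  "capacity": 2},
    "node-c": {"cpu": 0.8, "mem_gb": 128, "net_gbps": 10, "capacity": 3},
}

# Each partition weighs server attributes according to its workload
# (made-up profiles: some favour memory, others favour the network).
PARTITIONS = {
    "p0": {"cpu": 0.6, "mem_gb": 0.2, "net_gbps": 0.2},
    "p1": {"cpu": 0.1, "mem_gb": 0.8, "net_gbps": 0.1},
    "p2": {"cpu": 0.2, "mem_gb": 0.2, "net_gbps": 0.6},
    "p3": {"cpu": 0.4, "mem_gb": 0.3, "net_gbps": 0.3},
    "p4": {"cpu": 0.3, "mem_gb": 0.5, "net_gbps": 0.2},
    "p5": {"cpu": 0.2, "mem_gb": 0.1, "net_gbps": 0.7},
    "p6": {"cpu": 0.5, "mem_gb": 0.4, "net_gbps": 0.1},
    "p7": {"cpu": 0.3, "mem_gb": 0.3, "net_gbps": 0.4},
}


def bid(weights, server):
    """How much a partition values a server: weighted sum of attributes,
    with memory and network scaled by the largest values in SERVERS."""
    return (weights["cpu"] * server["cpu"]
            + weights["mem_gb"] * server["mem_gb"] / 128
            + weights["net_gbps"] * server["net_gbps"] / 10)


def greedy_auction(partitions, servers):
    """Settle the highest bids first; each server wins at most `capacity` partitions."""
    remaining = {name: s["capacity"] for name, s in servers.items()}
    assignment = {}
    bids = sorted(
        ((bid(w, s), p, name) for p, w in partitions.items() for name, s in servers.items()),
        reverse=True,
    )
    for _, partition, server in bids:
        if partition not in assignment and remaining[server] > 0:
            assignment[partition] = server
            remaining[server] -= 1
    return assignment


if __name__ == "__main__":
    for partition, server in sorted(greedy_auction(PARTITIONS, SERVERS).items()):
        print(partition, "->", server)
```

The per-server capacity keeps a single well-provisioned node from winning every partition and becoming a hotspot, mirroring the load-balancing concern discussed above.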
The next area we dealt with is networking. In a datacenter, the network carries data flows from different kinds of applications, and each kind of flow has its own requirements: YouTube traffic, for example, is sensitive to bandwidth, while VoIP traffic is sensitive to latency and network jitter. In traditional networks, reacting to these flows was very hard, as the protocols were designed to be generic. With the advent of software-defined networking, however, we are no longer restricted to these legacy protocols and can define protocols tailored to a specific network. We created an SDN controller based on OpenFlow that uses a pre-defined profile for each of these data flows and intelligently routes the flows from different applications based on this information, thereby optimizing the combination of parameters defined in the profile. We tested it under different load scenarios and showed how this heterogeneity awareness helps.
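As a rough illustration of profile-driven routing, with a hypothetical topology, profiles, and weights, and without the OpenFlow rule installation the actual controller performs, the sketch below scores candidate paths with each application's profile and picks the lowest-cost one, so that a bandwidth-sensitive flow and a latency-sensitive flow end up on different routes.

```python
# Rough illustration of profile-driven path selection (hypothetical
# topology, profiles, and weights; the actual controller installs routes
# via OpenFlow and is not shown here).

# Links annotated with latency (ms), available bandwidth (Mbps), and jitter (ms).
LINKS = {
    ("s1", "s2"): {"latency": 2,  "bandwidth": 1000,  "jitter": 0.1},
    ("s2", "s4"): {"latency": 2,  "bandwidth": 1000,  "jitter": 0.1},
    ("s1", "s3"): {"latency": 10, "bandwidth": 10000, "jitter": 1.0},
    ("s3", "s4"): {"latency": 10, "bandwidth": 10000, "jitter": 1.0},
}

CANDIDATE_PATHS = [["s1", "s2", "s4"], ["s1", "s3", "s4"]]

# Per-application profiles: how much each flow class cares about each metric.
PROFILES = {
    "video": {"latency": 0.1, "bandwidth": 0.8, "jitter": 0.1},  # bandwidth-sensitive
    "voip":  {"latency": 0.5, "bandwidth": 0.1, "jitter": 0.4},  # latency/jitter-sensitive
}


def path_cost(path, profile):
    """Sum link latency and jitter, take the bottleneck bandwidth, then
    combine them with the profile's weights (lower cost is better)."""
    hops = list(zip(path, path[1:]))
    latency = sum(LINKS[h]["latency"] for h in hops)
    jitter = sum(LINKS[h]["jitter"] for h in hops)
    bottleneck = min(LINKS[h]["bandwidth"] for h in hops)
    # 10000 Mbps is the fastest link in this toy topology, so the fastest
    # possible path contributes a bandwidth penalty of exactly 1.
    return (profile["latency"] * latency
            + profile["jitter"] * jitter
            + profile["bandwidth"] * (10000.0 / bottleneck))


def choose_path(app):
    profile = PROFILES[app]
    return min(CANDIDATE_PATHS, key=lambda p: path_cost(p, profile))


if __name__ == "__main__":
    print("video flow routed via:", choose_path("video"))  # high-bandwidth path
    print("voip flow routed via:", choose_path("voip"))    # low-latency path
```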