To improve data availability and resilience MapReduce frame-
works use le systems that replicate data uniformly. However,
analysis of job logs from a large production cluster shows
wide disparity in data popularity. Machines and racks storing
popular content become bottlenecks; thereby increasing the
completion times of jobs accessing this data even when there
are machines with spare cycles in the cluster. To address this
problem, we present Scarlett, a system that replicates blocks
based on their popularity. By accurately predicting le popu-
larity and working within hard bounds on additional storage,
Scarlett causes minimal interference to running jobs. Trace
driven simulations and experiments in two popular MapRe-
duce frameworks (Hadoop and Dryad) show that Scarlett ef-
fectively alleviates hotspots and can speed up jobs by .
works use le systems that replicate data uniformly. However,
analysis of job logs from a large production cluster shows
wide disparity in data popularity. Machines and racks storing
popular content become bottlenecks; thereby increasing the
completion times of jobs accessing this data even when there
are machines with spare cycles in the cluster. To address this
problem, we present Scarlett, a system that replicates blocks
based on their popularity. By accurately predicting le popu-
larity and working within hard bounds on additional storage,
Scarlett causes minimal interference to running jobs. Trace
driven simulations and experiments in two popular MapRe-
duce frameworks (Hadoop and Dryad) show that Scarlett ef-
fectively alleviates hotspots and can speed up jobs by .