Apache Hadoop has become a popular parallel processing tool in the era of big data.
While practitioners have rewritten many conventional analysis algorithms to adapt them to Hadoop,
the I/O inefficiency of Hadoop-based programs has been repeatedly reported in the literature.
In this article, we address the problem of I/O inefficiency in Hadoop-based massive data analysis
by introducing our efficient enhancements to Hadoop.
We first incorporate a columnar data layout into the conventional Hadoop framework without any modification
of the Hadoop internals. We also add an indexing capability to Hadoop that saves a substantial amount of I/O
when processing not only selection predicates but also star-join queries, which are frequently used in
many analysis tasks.
Keywords
Parallel processing; MapReduce; Data layout; Bitmap index