What is MapReduce?
MapReduce is a software framework introduced by Google to support distributed computing on large data sets across clusters of computers. The framework is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the MapReduce framework is not the same as in their original forms. MapReduce libraries have been written in C++, Java, Python and other programming languages. Wikipedia rules!
This means you can buy a bunch of cheap Dell servers and gain access to enormous computing power at relatively low cost. Petabytes of data can be sorted through in a matter of hours with this technique on commodity servers. The framework has built-in mechanisms for redundancy and reliability that allow you to deploy cheaper hardware (i.e. you can skimp on the RAID arrays).
In layman’s terms, it splits a very large job into smaller chunks so that hundreds of machines can work on different parts simultaneously, and then combines their partial results. By having many machines work on a problem together you get faster results.
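To make the split-and-combine idea concrete, here is a minimal single-machine sketch in Python that counts words. It is only an illustration of the pattern, not how Google's or Hadoop's implementation actually works: the function names (split_into_chunks, map_chunk, shuffle, reduce_counts, word_count) are made up for this example, and a real framework would ship each chunk to a different machine rather than loop over them in one process.

```python
from collections import defaultdict

def split_into_chunks(lines, num_chunks):
    """Deal the input lines out into num_chunks roughly equal chunks."""
    return [lines[i::num_chunks] for i in range(num_chunks)]

def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]

def shuffle(mapped_chunks):
    """Group every emitted value by its key, across all chunks."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a total."""
    return key, sum(values)

def word_count(lines, num_chunks=3):
    chunks = split_into_chunks(lines, num_chunks)
    # In a real cluster, each map_chunk call could run on a different machine.
    mapped = [map_chunk(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())

lines = ["the quick brown fox", "the lazy dog", "the fox jumps"]
counts = word_count(lines)
print(counts)
```

The point to notice is that map_chunk never needs to see any chunk but its own, which is what lets the work spread across many machines. Only the shuffle step has to bring the pieces back together.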
Impact on Data Warehousing
Many large data warehousing systems like Netezza and Teradata have similar capabilities. However, they rely on higher-cost hardware appliances, where equivalent computing power may end up costing you millions more. Additionally, with MapReduce you can program in languages other than SQL, giving you greater flexibility.
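As a rough illustration of what programming outside SQL looks like, the sketch below expresses the equivalent of a simple SELECT region, SUM(amount) ... GROUP BY region as a map function and a reduce function in Python. The sales records and the names map_sale, reduce_region and run_job are hypothetical, invented for this example, and run_job is a toy single-process stand-in for a real MapReduce framework.

```python
from collections import defaultdict

# Hypothetical sales records: (region, amount)
sales = [
    ("east", 100.0),
    ("west", 250.0),
    ("east", 50.0),
    ("north", 75.0),
    ("west", 25.0),
]

def map_sale(record):
    """Map: pick the grouping key (region) and the value to aggregate."""
    region, amount = record
    return region, amount

def reduce_region(region, amounts):
    """Reduce: aggregate all values that share a key, like SUM(amount)."""
    return region, sum(amounts)

def run_job(records, mapper, reducer):
    """Toy driver: map every record, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for record in records:
        key, value = mapper(record)
        grouped[key].append(value)
    return dict(reducer(key, values) for key, values in grouped.items())

totals = run_job(sales, map_sale, reduce_region)
print(totals)
```

Because the mapper and reducer are ordinary functions, they can contain logic that is awkward to write in SQL, such as parsing raw log lines or calling a custom scoring routine.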
A lot of pioneering work is going on at Yahoo, Google, and Facebook. In the past few months we’ve seen announcements from commercial players like Greenplum and Aster Data. As you can see, momentum is building behind MapReduce for very large database systems.
At the three internet firms, the efforts center on building a query interface for the new data warehousing system. Most of these query languages are some variant of SQL. However, the most exciting implementation to date comes from Aster Data, whose nCluster relational data warehouse is built on top of MapReduce with full SQL support. I think full SQL support puts them at the head of the pack, as it will make integration with existing enterprise systems easy.
Facebook’s Hive is an open source project that implements a data warehousing system on top of Hadoop, the open source MapReduce implementation developed largely at Yahoo. The following is the best explanation/presentation of the inner workings of MapReduce as it relates to data warehousing that I’ve seen:
These are exciting times in BI. In most organizations with large data volumes, performance continues to be one of the top challenges. MapReduce won’t solve all our problems, but it is an exciting step forward and a chance to try something different.
Posted by ajo. Categories: BI Industry, Open Source. Tags: Hadoop, Hive, MapReduce.