MapReduce

MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster.

The core concepts are described in Dean and Ghemawat's paper "MapReduce: Simplified Data Processing on Large Clusters" (OSDI 2004).

The Map

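A map transform is provided to transform an input data row of key and value into output key/value pairs.

That is, for each input it returns a list containing zero or more (key,value) pairs. The output keys may differ from the input key, and several output pairs may share the same key.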
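As an illustration only, a map transform for the classic word-count job might be sketched as follows. The function name `map_words` and the choice of a byte offset as the input key are assumptions made for this example, not part of any Hadoop API.

```python
from typing import Iterator, Tuple


def map_words(offset: int, line: str) -> Iterator[Tuple[str, int]]:
    """Hypothetical map transform: emit (word, 1) for every word in a line.

    The input key (offset) is ignored here; the output key is the word,
    so the output key type differs from the input key type.
    """
    for word in line.lower().split():
        yield (word, 1)


# One input row can produce several pairs, including repeated keys.
pairs = list(map_words(0, "the quick the lazy"))

# An empty row produces zero pairs.
no_pairs = list(map_words(19, ""))
```

Note that "the" appears twice in the output list: the map does not combine values, it only emits them.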

The Reduce

A reduce transform is provided to take all values for a specific key, and generate a new list of the reduced output.
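Continuing the word-count illustration, a reduce transform could look roughly like this. The name `reduce_counts` is hypothetical, and returning a one-element list is just one way to satisfy the "list of reduced output" shape described above.

```python
from typing import Iterable, List


def reduce_counts(word: str, counts: Iterable[int]) -> List[int]:
    """Hypothetical reduce transform: collapse all counts for one word.

    Receives every value emitted for a single key and returns a new list
    holding the reduced output (here, a single total).
    """
    return [sum(counts)]


# All values for the key "the" arrive together in one call.
totals = reduce_counts("the", [1, 1, 1])
```

Because the function only sees the values for its own key, calls for different keys do not depend on each other.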

The MapReduce Engine

The key aspect of the MapReduce algorithm is that if every Map and Reduce is independent of all other ongoing Maps and Reduces, then the operation can be run in parallel on different keys and lists of data. On a large cluster of machines, you can go one step further and run the Map operations on the servers where the data lives. Rather than copying the data over the network to the program, you push the program out to the machines. The output list can then be saved to the distributed filesystem, and the reducers run to merge the results. Again, it may be possible to run these in parallel, each reducing different keys.
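The flow described above can be imitated in a single process. The sketch below is not how Hadoop is implemented: a thread pool stands in for the cluster, an in-memory dictionary stands in for the distributed filesystem, and data locality is not modeled at all. The names `map_fn`, `reduce_fn`, and `run_job` are invented for this example, and `reduce_fn` returns a single (key, total) pair rather than a list to keep the code short.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List, Tuple


def map_fn(line: str) -> List[Tuple[str, int]]:
    # Map: one input record in, zero or more (key, value) pairs out.
    return [(word, 1) for word in line.lower().split()]


def reduce_fn(key: str, values: Iterable[int]) -> Tuple[str, int]:
    # Reduce: all values for one key in, one reduced result out.
    return key, sum(values)


def run_job(lines: List[str], workers: int = 4) -> Dict[str, int]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map phase: records are independent, so they can be mapped in parallel.
        mapped = list(pool.map(map_fn, lines))

        # Grouping step: collect every emitted value under its key.
        grouped: Dict[str, List[int]] = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                grouped[key].append(value)

        # Reduce phase: keys are independent, so they can be reduced in parallel.
        reduced = pool.map(lambda item: reduce_fn(*item), grouped.items())
        return dict(reduced)


result = run_job(["the quick brown fox", "the lazy dog", "The end"])
```

The parallelism is safe only because `map_fn` and `reduce_fn` share no state; if either depended on the result of another call, the two `pool.map` steps could not be run concurrently.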

Apache Hadoop is such a MapReduce engine. It provides its own distributed filesystem and runs Hadoop MapReduce jobs on servers near the data stored on that filesystem, or on any other supported filesystem, of which there are several.

Limitations

Will MapReduce/Hadoop solve my problems?

If you can rewrite your algorithms as Maps and Reduces, then yes. If not, then no.

It is not a silver bullet for every problem of scale, just a good technique for working on large sets of data when you can work on small pieces of that dataset in parallel.

MapReduce (last edited 2011-06-05 22:48:08 by ToddLipcon)