Differences between revisions 3 and 4
Revision 3 as of 2006-08-07 18:16:45
Size: 1352
Editor: OwenOMalley
Comment:
Revision 4 as of 2009-09-20 23:55:07
Size: 1356
Editor: localhost
Comment: converted to 1.6 markup

WordCount Example in Python

This is the WordCount example rewritten entirely in Python and compiled with Jython into a Java jar file.

The program reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. To create some input, take a directory of text files and put it into DFS.

bin/hadoop dfs -put my-dir in-dir

Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value with the word and sum.
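The map and reduce steps described above can be sketched in plain Python. This is only an illustration of the logic, not the code in src/examples/python: the function names (map_line, reduce_counts, word_count) are hypothetical, words are assumed to be split on whitespace, and the grouping that Hadoop performs between the two phases is simulated with a dictionary.

```python
from collections import defaultdict


def map_line(line):
    """Break a line into words and emit a (word, 1) pair for each word."""
    return [(word, 1) for word in line.split()]


def reduce_counts(word, counts):
    """Sum the counts for one word and emit a single (word, sum) pair."""
    return (word, sum(counts))


def word_count(lines):
    """Run the map and reduce steps in-process over an iterable of lines."""
    # Group the mapper output by key, standing in for the framework's
    # shuffle between the map and reduce phases (an assumption of this sketch).
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_line(line):
            grouped[word].append(one)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())
```

Calling word_count on a list of lines returns a dictionary that maps each word to its total count.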

As an optimization, the reducer is also used as a combiner on the map outputs. This reduces the amount of data sent across the network by combining the records for each word in a map's output into a single record.
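The effect of the combiner can be seen in a small plain-Python sketch. The names map_output and combine are hypothetical and nothing here uses the Hadoop API. Because the operation is a sum, the same function can be applied once to each map's output and again to the combined results without changing the totals, which is why the reducer can double as the combiner.

```python
from collections import Counter


def map_output(lines):
    """Emit one (word, 1) record per word, as a single mapper would."""
    return [(word, 1) for line in lines for word in line.split()]


def combine(pairs):
    """Sum the counts per word so that each word yields a single record."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


# Two hypothetical map tasks, each combined locally before anything is sent.
first = combine(map_output(["the cat the dog the"]))
second = combine(map_output(["the bird"]))

# The reduce step applies the same summing logic to the combined records.
final = combine(first + second)
```

In this sketch the first map task produces five records before combining and three after it, so fewer records would have to cross the network.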

To compile the example, build the Hadoop code and the Python word count example:

ant
cd src/examples/python
./compile
cd ../../..

Note that you need to have jythonc and javac on your path for the compilation to work.

To run the example, the command syntax is:

bin/hadoop jar src/examples/python/wc.jar in-dir out-dir

The results of the word count will be in out-dir/part-*.
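Since each output line holds a word and its count separated by a tab, the results are easy to load for further processing. The following is a minimal sketch with hypothetical helper names (parse_output, read_output_dir); it assumes the out-dir directory has already been copied from DFS to the local filesystem.

```python
import glob
import os


def parse_output(lines):
    """Parse 'word<TAB>count' lines into a dictionary of word to count."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, count = line.split("\t")
        counts[word] = int(count)
    return counts


def read_output_dir(out_dir):
    """Read every part-* file in a local copy of the output directory."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(out_dir, "part-*"))):
        with open(path) as part_file:
            counts.update(parse_output(part_file))
    return counts
```

For example, read_output_dir("out-dir") returns one dictionary covering all part files in that local directory.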

PythonWordCount (last edited 2009-09-20 23:55:07 by localhost)