Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

In any case you should now find a "HamaStreaming" folder in your Hama home directory which contains several scripts.

Now we have to upload these scripts to HDFS:

No Format

hadoop/bin/hadoop fs -mkdir /tmp/PyStreaming/
hadoop/bin/hadoop fs -copyFromLocal HamaStreaming/* /tmp/PyStreaming/

Let's start by executing the usual Hello World application that already ships with streaming:

No Format
bin/hama pipes -streaming true -bspTasks 2 -interpreter python3.2 -cachefiles HamaStreaming/tmp/PyStreaming/*.py -output /tmp/pystream-out/ -program HamaStreaming/tmp/PyStreaming/BSPRunner.py -programArgs HamaStreaming/HelloWorldBSP.pyHelloWorldBSP

This will start 2 bsp tasks in streaming mode. In streaming a child process will be forked from the usual BSP Java task. In this case, this would yield to a new task that starts with python3.2, with the py files from HDFS. The noteworthy thing is actually, that you pass a runner class that takes care of all the protocol communication. Your user program is passed as the first program argument. This works because python will start the runner py in a work directory from the cache files. So they are implicitly included and the whole computation can work, this is why you don't have to provide a path with the HelloWorldBSP (note the py is not needed, because of the reflective import).

Hopefully you should see something along these lines:

No Format

12/09/17 19:06:31 INFO pipes.Submitter: Streaming enabled!
12/09/17 19:06:33 INFO bsp.BSPJobClient: Running job: job_201209171906_0001
12/09/17 19:06:40 INFO bsp.BSPJobClient: Job complete: job_201209171906_0001
12/09/17 19:06:40 INFO bsp.BSPJobClient: The total number of supersteps: 15
12/09/17 19:06:40 INFO bsp.BSPJobClient: Counters: 8
12/09/17 19:06:40 INFO bsp.BSPJobClient:   org.apache.hama.bsp.JobInProgress$JobCounter
12/09/17 19:06:40 INFO bsp.BSPJobClient:     LAUNCHED_TASKS=2
12/09/17 19:06:40 INFO bsp.BSPJobClient:   org.apache.hama.bsp.BSPPeerImpl$PeerCounter
12/09/17 19:06:40 INFO bsp.BSPJobClient:     SUPERSTEP_SUM=15
12/09/17 19:06:40 INFO bsp.BSPJobClient:     COMPRESSED_BYTES_SENT=3310
12/09/17 19:06:40 INFO bsp.BSPJobClient:     TIME_IN_SYNC_MS=2805
12/09/17 19:06:40 INFO bsp.BSPJobClient:     COMPRESSED_BYTES_RECEIVED=3310
12/09/17 19:06:40 INFO bsp.BSPJobClient:     TOTAL_MESSAGES_SENT=60
12/09/17 19:06:40 INFO bsp.BSPJobClient:     TOTAL_MESSAGES_RECEIVED=30
12/09/17 19:06:40 INFO bsp.BSPJobClient:     TASK_OUTPUT_RECORDS=28

And now you can view the output of your job with:

No Format

hadoop/bin/hadoop fs -cat /tmp/pystream-out/part-00001

in my case this looks like this:

No Format

Hello from localhost:61002 in superstep 0	
Hello from localhost:61001 in superstep 0	
Hello from localhost:61001 in superstep 1
Hello from localhost:61002 in superstep 1
[...]
Hello from localhost:61001 in superstep 14	
Hello from localhost:61002 in superstep 14