...
In any case, you should now find a "HamaStreaming" folder in your Hama home directory that contains several scripts.
Now we have to upload these scripts to HDFS:
No Format |
---|
hadoop/bin/hadoop fs -mkdir /tmp/PyStreaming/
hadoop/bin/hadoop fs -copyFromLocal HamaStreaming/* /tmp/PyStreaming/
|
Let's start by executing the usual Hello World application that already ships with streaming:
No Format |
---|
bin/hama pipes -streaming true -bspTasks 2 -interpreter python3.2 -cachefiles /tmp/PyStreaming/*.py -output /tmp/pystream-out/ -program /tmp/PyStreaming/BSPRunner.py -programArgs HelloWorldBSP
|
This starts 2 BSP tasks in streaming mode. In streaming mode, each regular Java BSP task forks a child process. Here, that child process is launched with python3.2 and runs the .py files fetched from HDFS. The noteworthy part is that the program you pass is a runner script (BSPRunner.py) that takes care of all the protocol communication, while your own program is passed as the first program argument. This works because Python starts the runner in a working directory that contains the cache files, so they are implicitly importable. That is why you don't have to provide a path for HelloWorldBSP, and why the .py extension is not needed: the runner imports the module reflectively by name.
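To illustrate what a reflective import by name means, here is a minimal sketch. It is not the code of BSPRunner.py. The helper `load_user_class`, the module `DemoBSP`, and the assumption that the class carries the same name as its module are all hypothetical and chosen only for illustration. A temporary directory stands in for the task's working directory that holds the cache files.

```python
import importlib
import os
import sys
import tempfile


def load_user_class(module_name):
    """Import a module by name and return the attribute of the same name.

    Assumption for this sketch: the user class is named like its module.
    """
    module = importlib.import_module(module_name)
    return getattr(module, module_name)


def demo():
    # The temporary directory plays the role of the working directory
    # into which the cache files would be placed.
    with tempfile.TemporaryDirectory() as work_dir:
        path = os.path.join(work_dir, "DemoBSP.py")
        with open(path, "w") as f:
            f.write("class DemoBSP:\n"
                    "    def greet(self):\n"
                    "        return 'hello'\n")
        sys.path.insert(0, work_dir)
        importlib.invalidate_caches()  # the module file was just created
        try:
            # No path and no ".py" suffix: only the bare module name.
            cls = load_user_class("DemoBSP")
            return cls().greet()
        finally:
            sys.path.remove(work_dir)
            sys.modules.pop("DemoBSP", None)


if __name__ == "__main__":
    print(demo())
```

Because the lookup goes through the import system, only the bare module name is needed, which matches passing just HelloWorldBSP as the program argument. None of this is required to run the shipped example; the command above is sufficient to submit the job.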
When the job runs, you should see console output along these lines:
No Format |
---|
12/09/17 19:06:31 INFO pipes.Submitter: Streaming enabled!
12/09/17 19:06:33 INFO bsp.BSPJobClient: Running job: job_201209171906_0001
12/09/17 19:06:40 INFO bsp.BSPJobClient: Job complete: job_201209171906_0001
12/09/17 19:06:40 INFO bsp.BSPJobClient: The total number of supersteps: 15
12/09/17 19:06:40 INFO bsp.BSPJobClient: Counters: 8
12/09/17 19:06:40 INFO bsp.BSPJobClient: org.apache.hama.bsp.JobInProgress$JobCounter
12/09/17 19:06:40 INFO bsp.BSPJobClient: LAUNCHED_TASKS=2
12/09/17 19:06:40 INFO bsp.BSPJobClient: org.apache.hama.bsp.BSPPeerImpl$PeerCounter
12/09/17 19:06:40 INFO bsp.BSPJobClient: SUPERSTEP_SUM=15
12/09/17 19:06:40 INFO bsp.BSPJobClient: COMPRESSED_BYTES_SENT=3310
12/09/17 19:06:40 INFO bsp.BSPJobClient: TIME_IN_SYNC_MS=2805
12/09/17 19:06:40 INFO bsp.BSPJobClient: COMPRESSED_BYTES_RECEIVED=3310
12/09/17 19:06:40 INFO bsp.BSPJobClient: TOTAL_MESSAGES_SENT=60
12/09/17 19:06:40 INFO bsp.BSPJobClient: TOTAL_MESSAGES_RECEIVED=30
12/09/17 19:06:40 INFO bsp.BSPJobClient: TASK_OUTPUT_RECORDS=28
|
And now you can view the output of your job with:
No Format |
---|
hadoop/bin/hadoop fs -cat /tmp/pystream-out/part-00001
|
In my case the output looks like this:
No Format |
---|
Hello from localhost:61002 in superstep 0
Hello from localhost:61001 in superstep 0
Hello from localhost:61001 in superstep 1
Hello from localhost:61002 in superstep 1
[...]
Hello from localhost:61001 in superstep 14
Hello from localhost:61002 in superstep 14
|
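The shape of this output suggests a loop that writes one greeting per superstep and then synchronizes. The following is a rough, self-contained sketch of that pattern. `FakePeer` and `hello_world` are hypothetical stand-ins invented for this illustration; they are not the Hama streaming API and not the contents of HelloWorldBSP.py.

```python
class FakePeer:
    """Hypothetical stand-in for a BSP peer (not the Hama API)."""

    def __init__(self, name):
        self.name = name
        self.superstep = 0
        self.records = []

    def write(self, key, value):
        # Collect output records in memory instead of writing to HDFS.
        self.records.append((key, value))

    def sync(self):
        # A real barrier would wait for all peers; here we only count.
        self.superstep += 1


def hello_world(peer, supersteps=15):
    """Write one greeting per superstep, then enter the barrier."""
    for _ in range(supersteps):
        line = "Hello from %s in superstep %d" % (peer.name, peer.superstep)
        peer.write(line, "")
        peer.sync()


peer = FakePeer("localhost:61001")
hello_world(peer)

if __name__ == "__main__":
    for key, _ in peer.records[:2]:
        print(key)
```

The default of 15 supersteps mirrors the superstep count reported in the job log above. In a real job every task runs the same loop, which is why lines from both peers appear for each superstep.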