Local Execution Mode using LocalJobRunner from Hadoop

Currently we have a separate local execution engine that handles local execution mode. Its physical plan has to be built separately and executed differently, which also means maintaining different operators for local and MapReduce execution.

Instead of this, we can reuse Hadoop's LocalJobRunner to execute the same MapReduce physical plans locally. That is, we compile the logical plan into a MapReduce physical plan and create the JobControl object corresponding to that plan, exactly as in MapReduce mode. We then only need a separate launcher that submits the job to the LocalJobRunner instead of to an external JobTracker.
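
A minimal sketch of what such a launcher could look like is below. The class and method names are hypothetical; the only Hadoop-specific detail it relies on is that setting mapred.job.tracker to "local" in the JobConf makes JobClient run the job in-process through the LocalJobRunner.

import java.io.IOException;

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class LocalLauncher {

    /**
     * Submits a job (whose JobConf has already been built from the
     * MapReduce physical plan elsewhere) to the LocalJobRunner.
     */
    public RunningJob launchLocally(JobConf conf) throws IOException {
        // "local" tells JobClient to run the job in-process via
        // LocalJobRunner instead of contacting an external JobTracker.
        conf.set("mapred.job.tracker", "local");

        // Read and write through the local file system as well.
        conf.set("fs.default.name", "file:///");

        // runJob() blocks until the job finishes and reports progress,
        // which is where the "free" progress reporting comes from.
        return JobClient.runJob(conf);
    }
}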

Pros

  • Code Reuse
  • No need to write and maintain
    • Different operators
    • Different logical to physical translators
    • Different launchers
  • The current framework does not have any progress reporting. With this approach we will have it at no extra cost.

Cons

  • Not sure how stable LocalJobRunner is.
    • We found some bugs in it in hadoop-15 which make it practically useless for us right now.
    • These have, however, been fixed in hadoop-16.
  • Not sure how this will affect the Example generator.

Pi Song had some interesting observations:

  • 1) Will the LocalJobRunner be invoked when processing the nested plan inside foreach? Currently, no. We have local versions of the operators allowed inside the nested plan, which are used to run tuples through it. However, if we later intend to support a full-blown foreach with arbitrary nesting and all operators supported, we can take one of two approaches:

    i. Have local versions of all operators and just use the current model to run tuples through. This also entails that nothing needs to change in the MRCompiler.

    ii. Change the MRCompiler to process the nested foreach as a blocking operator and recursively process it, creating a list of dependent jobs. In this case, it would probably make more sense to run the nested plan in MapReduce itself rather than locally. However, this can be a choice, and the MapReduce launcher can decide to execute these plans either locally by invoking the LocalJobRunner or on the Hadoop JobTracker, based on the input size for the plans.

  • 2) Will the invocation of LocalJobRunner have some latency?

    • Definitely it does. As measured on hadoop-15, it has about 5 seconds of startup latency. Whether this matters depends on how and where we use the LocalJobRunner. If we use it strictly when the user asks for local execution mode, it should not matter. Also, if the size of the data is at least in the tens of MBs, the LocalJobRunner performs better than streaming tuples through a plan of local operators.
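
As a rough illustration of the size-based decision mentioned in (ii) and in the latency point above: a launcher could sum up the input sizes and only fall back to the LocalJobRunner when the data is large enough to amortize its startup cost. The class name, method name, and threshold below are made up for illustration, not anything in the code today.

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class ExecutionModeChooser {

    // Illustrative threshold only, based on the "tens of MBs" observation above.
    private static final long LOCAL_JOB_RUNNER_THRESHOLD = 10L * 1024 * 1024;

    /** Returns true if the inputs are large enough to justify the LocalJobRunner. */
    public static boolean useLocalJobRunner(JobConf conf, Path... inputs)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);
        long totalBytes = 0;
        for (Path input : inputs) {
            // Sum the sizes of all files under each input path.
            for (FileStatus status : fs.listStatus(input)) {
                totalBytes += status.getLen();
            }
        }
        return totalBytes >= LOCAL_JOB_RUNNER_THRESHOLD;
    }
}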

I guess the choice is harder now :) It now depends on what we want to do for the full-blown foreach. Since I would like to implement choice (ii), I would vote for using LocalJobRunner.

[pi] I think whether to do dynamic execution engine selection might not be a factor in this decision-making process.

The main point is “Does LocalJobRunner perform as well as LocalEngine in most cases?”. My concern would be the case where we have a lot of small inner bags in our processing.

I vote (i) to neutralize your vote.
