There are many reasons why one wants to limit the number of running tasks.
The most common reason is that a given job is consuming all of the available task slots, preventing other jobs from running. The easiest and best solution is to switch from the default FIFO scheduler to another scheduler, such as the FairShareScheduler or the CapacityScheduler. Both solve this problem, in slightly different ways, and depending upon need one may be a better fit than the other.
There is a job tunable called mapred.reduce.slowstart.completed.maps that sets the fraction of a job's maps that must be completed before its reduce tasks are started. By default, this is set to 5% (0.05), which for most shared clusters is likely too low. Recommended values are 80% (0.80) or higher. Note that for jobs that have a significant amount of intermediate data, setting this value higher means less of the copying overlaps with the map phase, so reduce tasks will spend more time fetching that data before performing work.
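To make the effect of the slowstart value concrete, here is a minimal sketch of the arithmetic. It assumes the threshold is computed as the ceiling of the fraction times the number of maps; the class and method names are hypothetical and are not part of the Hadoop API.

```java
public class SlowstartSketch {
    // Maps that must finish before reduces may start, ASSUMING the
    // threshold is ceil(fraction * totalMaps). Hypothetical helper.
    static int mapsBeforeReduces(double slowstartFraction, int totalMaps) {
        if (slowstartFraction < 0.0 || slowstartFraction > 1.0) {
            throw new IllegalArgumentException("fraction must be in [0, 1]");
        }
        if (totalMaps < 0) {
            throw new IllegalArgumentException("totalMaps must be >= 0");
        }
        return (int) Math.ceil(slowstartFraction * totalMaps);
    }

    public static void main(String[] args) {
        int totalMaps = 1000;
        // Compare the default (0.05) with the recommended value (0.80).
        System.out.println("default:     " + mapsBeforeReduces(0.05, totalMaps));
        System.out.println("recommended: " + mapsBeforeReduces(0.80, totalMaps));
    }
}
```

With the default, reduces may occupy slots while nearly all of the maps are still outstanding; with a higher value those slots stay free for other jobs for longer.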
In Hadoop terms, we call this a 'side-effect'.
One of the general assumptions of the framework is that tasks have no side-effects. All tasks are expected to be restartable, and a side-effect typically goes against the grain of this rule.
If a task absolutely must break the rules, there are a few things one can do:
The CapacityScheduler in 0.21 has a feature whereby one may use RAM-per-task to limit how many slots a given task takes. By careful use of this feature, one may limit how many concurrent tasks on a given node a job may take.
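As a rough illustration of how RAM-per-task translates into slots, the sketch below assumes that a task requesting more memory than one slot provides is charged the request divided by the per-slot size, rounded up. The class, method, and parameter names are hypothetical; check the CapacityScheduler documentation for the actual property names and rules.

```java
public class MemorySlotsSketch {
    // Slots charged to one task, ASSUMING ceil(taskMemoryMb / slotMemoryMb).
    static int slotsPerTask(long taskMemoryMb, long slotMemoryMb) {
        if (taskMemoryMb <= 0 || slotMemoryMb <= 0) {
            throw new IllegalArgumentException("memory sizes must be positive");
        }
        return (int) ((taskMemoryMb + slotMemoryMb - 1) / slotMemoryMb);
    }

    // How many such tasks fit on a node with the given number of slots.
    static int concurrentTasksPerNode(int slotsOnNode, int slotsPerTask) {
        if (slotsOnNode < 0 || slotsPerTask <= 0) {
            throw new IllegalArgumentException("invalid slot counts");
        }
        return slotsOnNode / slotsPerTask;
    }

    public static void main(String[] args) {
        // Example: 8 slots of 1024 MB each; the job asks for 4096 MB per task.
        int perTask = slotsPerTask(4096, 1024);
        System.out.println("slots per task: " + perTask);
        System.out.println("tasks per node: " + concurrentTasksPerNode(8, perTask));
    }
}
```

Under these assumptions, raising a job's per-task memory request is what lowers the number of its tasks that can run on one node at the same time.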
There are both job and server-level tunables that impact how many tasks are run concurrently.
There are two server tunables that determine how many tasks a given TaskTracker will run on a node: mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum.
These must be set in the mapred-site.xml file on the TaskTracker. After making the change, the TaskTracker must be restarted to pick it up. One should then see the values increase (or decrease) on the JobTracker main page. Note that these values are not set by your job.
Typically, the number of maps per job is determined by Hadoop based upon the InputFormat and the block size in place. Using the mapred.min.split.size and mapred.max.split.size settings, one can hint to the system that it should use a split size different from the block size, bounded by the given minimum and maximum input sizes.
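The interaction between the block size and the two split-size hints can be sketched as follows. This assumes the max(min, min(max, blockSize)) rule used by the newer FileInputFormat; the class name and the splits estimate are hypothetical simplifications, and the real split count can differ slightly because of how the last partial split is handled.

```java
public class SplitSizeSketch {
    // Effective split size, ASSUMING max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Rough number of splits (and so maps) for one file: ceil(length / splitSize).
    static long estimateSplits(long fileLength, long splitSize) {
        if (fileLength < 0 || splitSize <= 0) {
            throw new IllegalArgumentException("invalid sizes");
        }
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 64 * mb;
        long fileLength = 1024 * mb;

        // No hints: the split size follows the block size.
        long plain = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        // Raising the minimum above the block size yields fewer, larger maps.
        long larger = computeSplitSize(blockSize, 128 * mb, Long.MAX_VALUE);

        System.out.println("maps without hints: " + estimateSplits(fileLength, plain));
        System.out.println("maps with 128 MB minimum: " + estimateSplits(fileLength, larger));
    }
}
```

Raising the minimum above the block size reduces the number of maps, while lowering the maximum below the block size increases it.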
Currently, the number of reduces is determined by the job. The job should set mapred.reduce.tasks to the appropriate number of reduces. When using Pig, use the PARALLEL keyword.