User-defined partitioning
The partitioner determines how input data is distributed among the computing workers of a Bulk Synchronous Parallel (BSP) job. Note that, unlike the partition function in Map/Reduce, it is not related to output collection.
Input data partitioning proceeds in the following sequence:
- If the user specifies a partition function, a "partitioning job" is run internally as a pre-processing step.
- Each task of the "partitioning job" reads its assigned data block and rewrites the records to the appropriate partition files.
- After pre-partitioning is done, the BSP job is launched.
BSPJob job = new BSPJob(conf);
...
job.setPartitioner(HashPartitioner.class);
...
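To make the pre-partitioning step more concrete, the following is a minimal, self-contained sketch of what a hash-based partition function might do. It does not use the Hama classes: the SimplePartitioner interface, the SimpleHashPartitioner class, and the prePartition helper are hypothetical names introduced only for illustration, and in-memory lists stand in for the partition files.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionSketch {

    // Hypothetical stand-in for a partitioner contract (an assumption, not the Hama API):
    // given a record and the number of tasks, return the index of the target partition.
    interface SimplePartitioner<K, V> {
        int getPartition(K key, V value, int numTasks);
    }

    // Hash-based partitioning: records with equal keys always land in the same partition.
    static class SimpleHashPartitioner<K, V> implements SimplePartitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numTasks) {
            // floorMod keeps the result in [0, numTasks) even for negative hash codes.
            return Math.floorMod(key.hashCode(), numTasks);
        }
    }

    // Simulates the pre-processing step: each key is written to the list
    // (standing in for a partition file) chosen by the partitioner.
    static List<List<String>> prePartition(List<String> keys, int numTasks) {
        SimplePartitioner<String, Integer> partitioner = new SimpleHashPartitioner<>();
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : keys) {
            // The value is not used by the hash partitioner; 0 is a placeholder.
            int target = partitioner.getPartition(key, 0, numTasks);
            partitions.get(target).add(key);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<List<String>> parts = prePartition(Arrays.asList("a", "b", "c", "d", "e"), 3);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("partition " + i + ": " + parts.get(i));
        }
    }
}
```

In this sketch each worker would then read only its own partition, so all records sharing a key are processed by the same task. A user-defined partitioner would replace the hash computation with whatever rule suits the data, for example a range check on the key.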