Partition Function

  • NOTE: What should happen when the number of splits exceeds the maximum number of tasks?

In the Hama BSP computing framework, the Partition function determines how slices of input data are distributed among BSP processors, which is what allows a Bulk Synchronous Parallel job to scale. Unlike the Map/Reduce data processing model, many scientific algorithms based on the message-passing Bulk Synchronous Parallel model require that a processor obtain “nearby or related” data from other processors in order to complete its computation. In such cases, you can create your own Partition function to control both inter-processor communication and how the data is distributed.
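As a minimal sketch of the idea, the following Java code contrasts a hash-based partition function with a block-based one that keeps neighbouring rows on the same processor. The `SketchPartitioner` interface and both classes are hypothetical stand-ins written for this illustration; they mimic the general shape of a partitioner contract (key, value, number of tasks) and are not Hama's own classes.

```java
/** Hypothetical partitioner contract for illustration; not the Hama API itself. */
interface SketchPartitioner<K, V> {
    /** Returns the index of the task (0 .. numTasks-1) that should receive this record. */
    int getPartition(K key, V value, int numTasks);
}

/** Spreads records evenly by hash code, ignoring any relationship between keys. */
class HashSketchPartitioner<K, V> implements SketchPartitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numTasks) {
        // floorMod keeps the result in [0, numTasks) even for negative hash codes.
        return Math.floorMod(key.hashCode(), numTasks);
    }
}

/** Assigns contiguous blocks of row indices to each task, so nearby rows stay together. */
class BlockSketchPartitioner<V> implements SketchPartitioner<Integer, V> {
    private final int totalRows; // assumed to be known before partitioning starts

    BlockSketchPartitioner(int totalRows) {
        this.totalRows = totalRows;
    }

    @Override
    public int getPartition(Integer row, V value, int numTasks) {
        // Ceiling division: every task gets at most blockSize consecutive rows.
        int blockSize = (totalRows + numTasks - 1) / numTasks;
        return Math.min(row / blockSize, numTasks - 1);
    }
}
```

With 10 rows and 3 tasks, the block variant sends rows 0 to 3 to task 0, rows 4 to 7 to task 1, and rows 8 and 9 to task 2. An algorithm that needs adjacent rows would therefore exchange messages mostly at block boundaries, whereas the hash variant scatters neighbours across tasks.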

Internals

Internally, partitioning of file-format input data proceeds in the following sequence:

...

For NoSQL table input (which supports range or random access by sorted key), partitions do not need to be rewritten. In addition, a Scanner can be used instead of the basic 'region' or 'tablet' splits to force the number of processors.
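To illustrate how a sorted key space can be cut into a chosen number of scan ranges, here is a hedged sketch. It assumes, purely for simplicity, that row keys are `long` values in a known half-open interval; real stores typically use byte-array keys, and the class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: divides a sorted key interval into one scan range per task. */
final class ScanRangeSketch {
    private ScanRangeSketch() {}

    /**
     * Splits [first, last) into numTasks contiguous half-open ranges.
     * Each element of the result is a two-element array {start, end}.
     */
    static List<long[]> split(long first, long last, int numTasks) {
        List<long[]> ranges = new ArrayList<>();
        long total = last - first;
        long base = total / numTasks;   // minimum number of keys per range
        long extra = total % numTasks;  // the first 'extra' ranges get one more key
        long start = first;
        for (int i = 0; i < numTasks; i++) {
            long size = base + (i < extra ? 1 : 0);
            ranges.add(new long[] {start, start + size});
            start += size;
        }
        return ranges;
    }
}
```

For example, `ScanRangeSketch.split(0, 10, 3)` yields the ranges [0, 4), [4, 7) and [7, 10). Each processor would then scan only its own range, so the number of processors is fixed by the caller and not by the store's physical splits.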

Partitioning in Graph module

The internals of the Graph module, which is implemented on top of the BSP framework, are pretty simple. Each GraphJobRunner processor reads its assigned splits and converts them into Vertex objects in memory (currently, the disk-based vertex store and sequential vertex processing are not perfect).

Create your own Partitioner

...