This effort is still a "work in progress". Please feel free to add comments. BRBut please make the content less visible by using smaller fonts. – Edward J. Yoon
Overview
This is intended to explain and illustrate the concept of Hama. There are two main parts:
- How to store the matrices?
- How to perform matrix operations using MapReduce?
Building Block
[http://wiki.apache.org/hama-data/attachments/Architecture/attachments/block.png]
Store Dense/Sparse Matrices
To store the matrices, Hama use a [http://hadoop.apache.org/hbase/ Hbase] – Matrices are basically tables. They are ways of storing numbers and other things. Typical matrix has rows and columns. Actually called a 2-way matrix because it has two dimensions. For example, you might have respondents-by-attitudes. Of course, you might collect the same data on the same people at 5 points in time. In that case, you either have 5 different 2-way matrices, or you could think of it as a 3-way matrix, that is respondent-by-attitude-by-time.
– Just a thought, considering the depleted activity in HBase should we not explore ways to avoid HBase ? --Prasen
Represent a graph using adjacency matrix
Perform matrix operations
The Hadoop/Hbase is designed to efficiently process large data set by connecting many commodity computers together to work in parallel but, If there's a inter-node communication, the elapsed run time will be slower with more nodes. Consequently, an "effective" algorithm should avoid large amounts of communication.
Algorithms
Dense Matrix-Matrix multiplication
Blocking jobs:
- Collect the blocks to 'collectionTable' from A and B.
- A map task receives a row n as a key, and vector as its value
- emit (blockID, sub-vector)
- Reduce task combines block
- A map task receives a row n as a key, and vector as its value
Multiplication job:
- A map task receives a blockID n as a key, and two submatrices as its value
- Reduce task computes sum of blocks
Find the maximum absolute row sum of matrix
- https://issues.apache.org/jira/browse/HAMA-171
- A map task receives a row n as a key, and vector as its value
- emit (row, the sum of the absolute value of each entries)
- Reduce task select the maximum one