This effort is still a "work in progress". Please feel free to add comments.
But please make the content less visible by using smaller fonts. – Edward J. Yoon

Overview

This is intended to explain and to illustrate the concept of Hama. There are two main parts:

How to store the matrices?
How to perform matrix operations using MapReduce?

Building Block

[http://wiki.apache.org/hama-data/attachments/Architecture/attachments/block.png]

Store Matrices & Graphs

To store matrices, Hama uses [http://hadoop.apache.org/hbase/ Hbase]. Matrices are basically tables: they are ways of storing numbers and other things. A typical matrix has rows and columns and is often called a 2-way matrix because it has two dimensions. For example, you might have respondents-by-attitudes. Of course, you might collect the same data on the same people at 5 points in time. In that case, you either have 5 different 2-way matrices, or you could think of the data as a 3-way matrix, that is, respondent-by-attitude-by-time.
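The "matrices are tables" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the nested-dictionary layout (row key, then column key, then value) and the names `to_table`, `get`, and `survey` are hypothetical and are not the Hbase or Hama API.

```python
from collections import defaultdict


def to_table(dense):
    """Store a dense 2-way matrix as {row: {column: value}}, skipping zeros."""
    table = defaultdict(dict)
    for i, row in enumerate(dense):
        for j, value in enumerate(row):
            if value != 0:
                table[i][j] = value
    return dict(table)


def get(table, i, j):
    """Read one entry; cells that were never stored count as zero."""
    return table.get(i, {}).get(j, 0)


# Hypothetical respondents-by-attitudes data (2 respondents, 3 attitudes).
survey = [[5, 0, 3],
          [0, 2, 4]]
table = to_table(survey)

# A 3-way matrix (respondent-by-attitude-by-time) kept as one table per time point.
three_way = {t: to_table(survey) for t in range(5)}
```

Keeping only the non-zero cells means the same layout serves dense and sparse matrices.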

– Just a thought: considering the reduced activity in HBase, should we not explore ways to avoid HBase? --Prasen
– HBase seems active at this time. However, yes, we should think about it. --Edward

Why Hbase?

Represent a graph using adjacency matrix

https://issues.apache.org/jira/browse/HAMA-162
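As a small illustration of the adjacency-matrix representation, here is a sketch assuming a directed, unweighted graph whose vertices are numbered from 0. The function name `adjacency_matrix` and the edge-list input are assumptions made for this example and are not taken from HAMA-162.

```python
def adjacency_matrix(num_vertices, edges):
    """Build a dense adjacency matrix for a directed graph.

    edges is a list of (source, target) vertex indices.
    """
    matrix = [[0] * num_vertices for _ in range(num_vertices)]
    for source, target in edges:
        matrix[source][target] = 1
    return matrix


# A hypothetical 3-vertex cycle: 0 -> 1 -> 2 -> 0.
edges = [(0, 1), (1, 2), (2, 0)]
adj = adjacency_matrix(3, edges)
```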

Perform matrix operations

Hadoop/Hbase is designed to process large data sets efficiently by connecting many commodity computers together to work in parallel. However, if a job requires inter-node communication, the elapsed run time can become slower as more nodes are added. Consequently, an "effective" algorithm should avoid large amounts of communication.

Algorithms

Dense Matrix-Matrix addition

The addition of multiple matrices

https://issues.apache.org/jira/browse/HAMA-154

Matrix-Matrix multiplication

Dense Matrices multiplication

Iterative Approach

For i = 0 step 1 until N - 1
  Job: computes the i-th row of C (a matrix-vector multiplication)

Iterative job:

- A map task receives a row n of B as a key, and the vector of that row as its value
 - It multiplies the vector by all columns of the i-th row of A
- The reduce task finds and adds the i-th products

1st
+             +   +             +
| a11 a12 a13 |   | a11 a21 a31 |
| ... ... ... | X | a12 a22 a32 |
| ... ... ... |   | a13 a23 a33 |
+             +   +             +

2nd
+             +   +             +
| ... ... ... |   | a11 a21 a31 |
| a21 a22 a23 | X | a12 a22 a32 |
| ... ... ... |   | a13 a23 a33 |
+             +   +             +
....
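The iterative approach can be sketched in plain Python as follows. This is one reading of the steps above, not Hama's implementation: it assumes that the map step scales row n of B by the entry A[i][n] and that the reduce step adds the scaled rows to obtain row i of C. The function names are hypothetical, and the loop stands in for the N separate jobs.

```python
def map_row(n, b_row, a_row_i):
    """Map: scale row n of B by the entry A[i][n]."""
    return [a_row_i[n] * value for value in b_row]


def reduce_rows(partials):
    """Reduce: add the scaled rows element-wise to obtain row i of C."""
    return [sum(column) for column in zip(*partials)]


def multiply_iterative(a, b):
    """Compute C = A * B one row at a time."""
    c = []
    for i in range(len(a)):  # one "job" per row of C
        partials = [map_row(n, b_row, a[i]) for n, b_row in enumerate(b)]
        c.append(reduce_rows(partials))
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result = multiply_iterative(a, b)  # [[19, 22], [43, 50]]
```

Each pass needs the whole of B but only one row of A, which is why the number of jobs grows with N.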

Blocking Algorithm Approach

To multiply two dense matrices A and B, we first collect the blocks into 'collectionTable' using map/reduce. Rows are named c(i, j) with a sequential number (N^2 * i) + (j * N) + k to avoid duplicated records.

CollectionTable:

                            matrix A         matrix B
------------------------+-------------------------------             
block(0, 0)-0               block(0, 0)      block(0, 0)
block(0, 0)-1               block(0, 1)      block(1, 0)
block(0, 0)-2               block(0, 2)      block(2, 0)
...         N               ...
block(N-1, N-1)-(N^3-1)     block(N-1, N-1)  block(N-1, N-1)

Each row holds two sub-matrices, a(i, k) and b(k, j), so that data movement and network cost are minimized.
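The row naming can be checked with a short sketch. It assumes N blocks per dimension and that the row for c(i, j) with sequence index k pairs a(i, k) with b(k, j), as in the table above. The helper names `row_key` and `collection_keys` are hypothetical.

```python
def row_key(i, j, k, n):
    """Name of the collection-table row that pairs a(i, k) with b(k, j)."""
    sequence = (n ** 2) * i + (n * j) + k
    return "block({}, {})-{}".format(i, j, sequence)


def collection_keys(n):
    """All N^3 row names, in the order the table lists them."""
    return [row_key(i, j, k, n)
            for i in range(n) for j in range(n) for k in range(n)]


keys = collection_keys(3)
# keys[0] is "block(0, 0)-0" and keys[-1] is "block(2, 2)-26", i.e. N^3 - 1.
```

Because the sequence number is unique for every (i, j, k) triple, no two rows of the table share a name.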

Blocking jobs:

Collect the blocks from A and B into 'collectionTable'.

- A map task receives a row n as a key, and the vector of that row as its value
 - emit (blockID, sub-vector) pairs
- The reduce task merges the sub-vectors into block structures based on their blockID

Multiplication job:

- A map task receives a blockID n as a key, and two sub-matrices of A and B as its value
 - Multiply the two sub-matrices: a[i][j] * b[j][k]
- The reduce task computes the sum of the blocks
 - c[i][k] += multiplied blocks
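The blocking and multiplication jobs can be sketched together in plain Python. This is a minimal single-process illustration, not Hama's MapReduce code: it assumes square matrices whose dimension is divisible by the block size, uses the a(i, k) and b(k, j) pairing from the collection table, and all function names are hypothetical.

```python
from collections import defaultdict


def split_blocks(matrix, size):
    """Blocking job: cut a matrix into size x size blocks keyed by (block row, block column)."""
    n = len(matrix) // size
    return {(bi, bj): [row[bj * size:(bj + 1) * size]
                       for row in matrix[bi * size:(bi + 1) * size]]
            for bi in range(n) for bj in range(n)}


def multiply_block(x, y):
    """Ordinary product of two small blocks."""
    return [[sum(x[r][t] * y[t][c] for t in range(len(y)))
             for c in range(len(y[0]))] for r in range(len(x))]


def multiply_blocked(a, b, size):
    n = len(a) // size
    blocks_a, blocks_b = split_blocks(a, size), split_blocks(b, size)
    # Map: one block product per collection-table row (i, j, k).
    products = defaultdict(list)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                products[(i, j)].append(
                    multiply_block(blocks_a[(i, k)], blocks_b[(k, j)]))
    # Reduce: add the products that belong to the same block of C.
    c = [[0] * len(a) for _ in range(len(a))]
    for (i, j), parts in products.items():
        for part in parts:
            for r in range(size):
                for col in range(size):
                    c[i * size + r][j * size + col] += part[r][col]
    return c


a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
product = multiply_blocked(a, identity, 2)  # equals a
```

Each map input already contains both blocks it needs, so the multiplication step itself requires no further lookups.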

– If A is stored in Hbase and we collect B into A, can we reduce the run time of the blocking job? --Edward
– Hmm, it's possible to do the blocking at once. --Edward

Sparse Matrices multiplication

http://issues.apache.org/jira/browse/HAMA-127

Find the maximum absolute row sum of dense/sparse matrix

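Matrix.Norm.One finds the maximum absolute row sum of a matrix. It is a comparatively good fit for the MapReduce model because it needs neither iterative jobs nor table/file JOIN operations. Note that most textbooks call the maximum absolute row sum the infinity norm and reserve "1-norm" for the maximum absolute column sum.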

                                         j=n
The maximum absolute row sum =   max   ( sum | a_{i,j} | ) 
                               1<=i<=n   j=1


- A map task receives a row n as a key, and the vector of that row as its value
 - emit (row, the sum of the absolute values of its entries)
- The reduce task selects the maximum one
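The map and reduce steps above can be sketched as plain Python functions. This is an illustration only: the function names are hypothetical, and a generator stands in for the MapReduce framework.

```python
def map_row_sum(row_index, row):
    """Map: emit (row, sum of the absolute values of its entries)."""
    return row_index, sum(abs(value) for value in row)


def reduce_max(pairs):
    """Reduce: select the maximum of the emitted sums."""
    return max(value for _, value in pairs)


def max_abs_row_sum(matrix):
    return reduce_max(map_row_sum(i, row) for i, row in enumerate(matrix))


m = [[1, -2, 3],
     [-4, 5, -6],
     [7, 8, -9]]
row_result = max_abs_row_sum(m)  # row sums are 6, 15, 24, so the result is 24
```

Each row is handled independently, so a single pass over the matrix is enough.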

Compute the infinity norm

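It is the maximum absolute column sum of a dense/sparse matrix. Note that most textbooks call the maximum absolute column sum the 1-norm and reserve "infinity norm" for the maximum absolute row sum.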

                                            i=n
The maximum absolute column sum =   max   ( sum | a_{i,j} | )
                                  1<=j<=n   i=1

- A map task receives a row n as a key, and the vector of that row as its value
 - emit (column, the absolute value of each entry)
- The reduce task sums the absolute values of each column and selects the maximum one
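A sketch of the column-sum steps above in plain Python. The function names are hypothetical, and a dictionary plays the role of the shuffle that groups values by column.

```python
from collections import defaultdict


def map_row_entries(row):
    """Map: emit (column, absolute value) for every entry of one row."""
    return [(j, abs(value)) for j, value in enumerate(row)]


def reduce_columns(pairs):
    """Reduce: sum the absolute values per column, then select the maximum."""
    sums = defaultdict(int)
    for column, value in pairs:
        sums[column] += value
    return max(sums.values())


def max_abs_column_sum(matrix):
    pairs = []
    for row in matrix:
        pairs.extend(map_row_entries(row))
    return reduce_columns(pairs)


m = [[1, -2, 3],
     [-4, 5, -6],
     [7, 8, -9]]
column_result = max_abs_column_sum(m)  # column sums are 12, 15, 18, so the result is 18
```

Unlike the row-sum case, the entries of one column arrive from different map inputs, so the summing has to happen on the reduce side.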

Compute the transpose of dense/sparse matrix

The transpose of a matrix is another matrix in which the rows and columns have been reversed. The matrix must be square for this to work, although the transpose itself is defined for any m-by-n matrix.

http://issues.apache.org/jira/browse/HAMA-174

+             +    +             +
| a11 a12 a13 |    | a11 a21 a31 |
| a21 a22 a23 | => | a12 a22 a32 |
| a31 a32 a33 |    | a13 a23 a33 |
+             +    +             +

- A map task receives a row n as a key, and the vector of that row as its value
 - emit (reversed index, the entry at the given index)

ex) row = 1, { 1: a11, 2: a12, 3: a13 }

reversed index (row, column), entry
(1, 1), a11
(2, 1), a12
(3, 1), a13
...

- The reduce task stores each entry at its reversed index
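The transpose steps can be sketched with the same {row: {column: value}} layout used in the example above. This is an assumption-laden illustration rather than the code from HAMA-174; the function names are hypothetical.

```python
from collections import defaultdict


def map_transpose(i, row):
    """Map: for row i given as {column: value}, emit ((column, row), value)."""
    return [((j, i), value) for j, value in row.items()]


def reduce_transpose(pairs):
    """Reduce: store each value at its reversed index."""
    result = defaultdict(dict)
    for (r, c), value in pairs:
        result[r][c] = value
    return dict(result)


def transpose(table):
    pairs = []
    for i, row in table.items():
        pairs.extend(map_transpose(i, row))
    return reduce_transpose(pairs)


table = {1: {1: "a11", 2: "a12", 3: "a13"},
         2: {1: "a21", 2: "a22", 3: "a23"}}
transposed = transpose(table)
# transposed[2] == {1: "a12", 2: "a22"}
```

In this sketch the index swap does not depend on the shape of the input, so the 2 x 3 example becomes a 3 x 2 table.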
