This effort is still a "work in progress". Please feel free to add comments. BRBut please make the content less visible by using smaller fonts. – Edward J. Yoon
Overview
A parallel matrix computation package.
Package Structure
- org.apache.hama : Dense and structured sparse matrices
- org.apache.hama.algebra : Algebraic operations on map/reduce
- org.apache.hama.io : I/O operations with matrices and vectors
- org.apache.hama.mapred : Map/Reduce Input/Output Formats
- org.apache.hama.sparse : Unstructured sparse matrices
Sparse Matrix
NOTE:
- Sparse matrix operations cannot be optimized
- Sparse structures which are growable can exceed the initial bandwidth allocation, while those which are not growable are fixed, and over-allocation will cause an error
- Matrices which are column major typically perform better with column-oriented operations, and likewise for row major matrices. Matrix/vector multiplication is row-major, while transpose multiplication is column-major
Why sparse matrices?
- Many classes of problems result in matrices with a large number of zeros
- A sparse matrix is a special class of matrix that allows only the non-zero terms to be stored
- Reduction in the storage requirements for sparse matrices
- Significant speed improvement as many calculations involving zero elements are neglected
Storage of sparse matrices
We choosed HBase which column-oriented sparse table storage to reduce storage and complexity.
- Hama use column-oriented storage of matrices (HBase) , and so compressed column format is a natural choice of sparse storage
- Hama forces the elements of each column to be stored in increasing order of their row index
1 0 0 (1,1) = 1 0 3 1 (2,2) = 3 0 0 0 (2,3) = 1
See also: [http://labs.google.com/papers/bigtable-osdi06.pdf Bigtable], A Distributed Storage System for Structured Data
Pseudo code for sparse matrix addition
NOTE:
- There are no duplicates in the input.