This effort is still a "work in progress". Please feel free to add comments. But please make the content less visible by using smaller fonts. – Edward J. Yoon

Overview

A parallel matrix computation package.

Package Structure

org.apache.hama : Dense and structured sparse matrices
org.apache.hama.algebra : Algebraic operations on map/reduce
org.apache.hama.io : I/O operations with matrices and vectors
org.apache.hama.mapred : Map/Reduce Input/Output Formats
org.apache.hama.sparse : Unstructured sparse matrices

Sparse Matrix

NOTE:

Sparse matrix operations cannot be optimized
Sparse structures which are growable can exceed the initial bandwidth allocation, while those which are not growable are fixed, and over-allocation will cause an error
Matrices which are column major typically perform better with column-oriented operations, and likewise for row major matrices. Matrix/vector multiplication is row-major, while transpose multiplication is column-major
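
The last note can be illustrated with a small sketch. It uses plain Python lists rather than the Hama API, and the function names are hypothetical. It only shows why a row-wise layout suits matrix/vector multiplication while a column-wise layout suits transpose multiplication: in both cases each output entry is computed from one contiguous stored unit.

```python
# Illustrative only: plain Python lists, not the Hama API.
# Function names are hypothetical and chosen for this sketch.

def matvec_row_major(rows, x):
    """Compute y = A x, where `rows` holds A one row at a time."""
    # Each output entry is the dot product of one stored row with x.
    return [sum(a * xj for a, xj in zip(row, x)) for row in rows]


def transpose_matvec_col_major(cols, x):
    """Compute y = A^T x, where `cols` holds A one column at a time."""
    # Entry j of A^T x is the dot product of column j of A with x,
    # so each output entry reads one stored column.
    return [sum(a * xi for a, xi in zip(col, x)) for col in cols]


A_rows = [[1, 0, 0],
          [0, 3, 1],
          [0, 0, 0]]
# The same matrix, stored column by column.
A_cols = [list(col) for col in zip(*A_rows)]

x = [1, 2, 3]
y = matvec_row_major(A_rows, x)              # [1, 9, 0]
yt = transpose_matvec_col_major(A_cols, x)   # [1, 6, 2]
```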

Why sparse matrices?

Many classes of problems result in matrices with a large number of zeros
A sparse matrix is a special class of matrix that allows only the non-zero terms to be stored
Storage requirements are reduced, because zero elements are not stored
Speed improves significantly, because calculations involving zero elements are skipped
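
As a rough illustration of the points above, the following sketch keeps only the non-zero terms of a small matrix in a dictionary and multiplies it by a vector without touching any zero element. The dictionary layout and the function names are assumptions made for this example; they are not Hama's storage format.

```python
# Illustrative only: a dictionary keyed by (row, column), not Hama's format.

def to_sparse(dense):
    """Keep only the non-zero entries, keyed by 1-based (row, column)."""
    return {
        (i, j): value
        for i, row in enumerate(dense, start=1)
        for j, value in enumerate(row, start=1)
        if value != 0
    }


def sparse_matvec(entries, x, n_rows):
    """Compute y = A x by visiting stored (non-zero) entries only."""
    y = [0] * n_rows
    for (i, j), value in entries.items():
        y[i - 1] += value * x[j - 1]
    return y


dense = [[1, 0, 0],
         [0, 3, 1],
         [0, 0, 0]]

entries = to_sparse(dense)
stored = len(entries)   # 3 stored values instead of 9
product = sparse_matvec(entries, [1, 2, 3], n_rows=3)   # [1, 9, 0]
```

Here the multiplication performs three multiply-add steps, one per stored entry, where a dense loop would perform nine.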

Storage of sparse matrices

We chose HBase, a column-oriented sparse table store, to reduce storage and complexity.

Hama uses column-oriented storage of matrices (HBase), so compressed column format is a natural choice for sparse storage
Hama forces the elements of each column to be stored in increasing order of their row index

  1  0  0       (1,1) = 1           
  0  3  1       (2,2) = 3
  0  0  0       (2,3) = 1
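
The example above can be reproduced with a minimal in-memory sketch of the usual compressed column layout (values, row indices, and column pointers). This is an assumption-laden illustration in plain Python: it does not show how Hama maps a matrix onto HBase cells, and the function names are hypothetical.

```python
# Illustrative only: the usual compressed column layout in plain Python.
# It does not describe how Hama lays out data in HBase.

def to_compressed_column(dense):
    """Return (values, row_indices, col_pointers) for a dense matrix.

    Row indices are 1-based to match the example in the text.
    Column j (0-based) occupies values[col_pointers[j]:col_pointers[j + 1]].
    """
    n_rows = len(dense)
    n_cols = len(dense[0]) if dense else 0
    values, row_indices, col_pointers = [], [], [0]
    for j in range(n_cols):
        # Scanning rows top to bottom keeps each column's entries
        # in increasing order of row index.
        for i in range(n_rows):
            if dense[i][j] != 0:
                values.append(dense[i][j])
                row_indices.append(i + 1)
        col_pointers.append(len(values))
    return values, row_indices, col_pointers


def list_entries(values, row_indices, col_pointers):
    """Expand the compressed form back into ((row, column), value) pairs."""
    out = []
    for j in range(len(col_pointers) - 1):
        for k in range(col_pointers[j], col_pointers[j + 1]):
            out.append(((row_indices[k], j + 1), values[k]))
    return out


dense = [[1, 0, 0],
         [0, 3, 1],
         [0, 0, 0]]

values, row_indices, col_pointers = to_compressed_column(dense)
# values       == [1, 3, 1]
# row_indices  == [1, 2, 2]
# col_pointers == [0, 1, 2, 3]
pairs = list_entries(values, row_indices, col_pointers)
# pairs == [((1, 1), 1), ((2, 2), 3), ((2, 3), 1)]
```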

See also: [http://labs.google.com/papers/bigtable-osdi06.pdf Bigtable: A Distributed Storage System for Structured Data]

Pseudo code for sparse matrix addition