Distributed Sparse Matrix-Vector Multiplication on Hama

Introduction

In this article we will explore the problem of sparse matrix-vector multiplication (SpMV), which can be written in the form u = Av. Most computational algorithms spend a large percentage of their time solving large systems of linear equations. In general, a system of linear equations can be represented in matrix form as Ax = b, where A is a matrix with n rows and n columns, b is a vector of size n, and x is the unknown solution vector we are searching for. Some approaches for solving linear systems are iterative in nature. Assume we know an initial approximation x = x0. We then represent the system in the form xn = Bxn-1 + c, where c is a vector of size n, compute the next approximation of x, and repeat this until convergence. In the real world, most matrices contain a relatively small number of non-zero items in comparison to the total number of matrix items. Such matrices are called sparse; matrices which are mostly filled with non-zero items are called dense. Sparse matrices arise when each variable from a set is connected with only a small subset of the other variables (for example, in the differential equation of heat conduction). So, this page will describe the problem of sparse matrix-vector multiplication (SpMV) using the Bulk Synchronous Parallel (BSP) model implemented in the Apache Hama project. As shown above, SpMV can be used in different iterative solvers for systems of linear equations. The Bulk Synchronous model proposes its own way of parallelizing programs. We specify the path to the input problem and the number of peers; the framework reads the input and divides it between peers. Peers can be processors, threads, separate machines, or different nodes of a cloud. A BSP algorithm is divided into a sequence of supersteps, with barrier synchronization of all peers after each superstep.
The BSP implementation (Apache Hama) contains primitives for determining the peer number, different primitives for communication between peers, and optimizations of that communication; it also inherits most of the features and approaches of the Hadoop project.
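The iterative scheme mentioned above can be sketched in a few lines of plain Python. This is an illustration only, not Hama code; the names mat_vec and iterate are hypothetical, and the matrix is dense here for brevity:

```python
# Plain-Python illustration of the iterative scheme x_n = B*x_{n-1} + c.
# mat_vec is the matrix-vector product that SpMV parallelizes.
def mat_vec(B, x):
    return [sum(b * xj for b, xj in zip(row, x)) for row in B]

def iterate(B, c, x0, steps=50):
    x = x0
    for _ in range(steps):
        # One iteration: multiply by B, then add c componentwise.
        x = [bx + cj for bx, cj in zip(mat_vec(B, x), c)]
    return x

# Example: x = 0.5*x + 1 converges componentwise to 2.
x = iterate([[0.5, 0.0], [0.0, 0.5]], [1.0, 1.0], [0.0, 0.0])
```

In a real solver the matrix-vector product dominates the running time, which is why parallelizing SpMV pays off.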

Problem description

As a sequential problem, SpMV is almost trivial. But for a parallel version we should think about some additional aspects:

  1. Partitioning of matrix and vector components. We should split the input matrix and vectors between peers if we want to benefit from a parallel algorithm. Partitioning must be chosen wisely, otherwise data locality won't be achieved and communication time will rise sharply, or we will get a great load imbalance and the algorithm will be inefficient.
  2. Load balancing. Each peer must perform nearly the same amount of work, and none of them should idle.
  3. We must consider the Hadoop and Hama approach to parallelization.

Implementation tips

  1. The framework splits the input file between peers automatically, so we don't need to map matrix parts to peers manually. We only have to define how the matrix is written to a file and how it is read from it. If we create a matrix which consists of separate cells, the framework will give some subset of cells to each peer; if we create a matrix consisting of rows, the framework will give a subset of rows to each peer. The ways to influence partitioning are creating different writables for matrices and overriding the default partitioner class behavior.
  2. We don't need to care about communication in the case of row-wise matrix access. First of all, the rows of the matrix are split between peers automatically by the framework. After that we can compute the inner product of the input vector and a concrete matrix row, and the result can be written directly to the output, because it is one of the cells of the result vector. In this case we assume that a peer's memory can fit two vectors. Even for a million x million matrix and a vector of size one million, a few megabytes are enough to store them, so even if we split the input vector the gain in memory would be insignificant.
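As an illustration of the row-wise local computation from item 2, here is a minimal plain-Python sketch (row_dot is a hypothetical name; the real code operates on Hama writables, but the arithmetic is the same):

```python
# Per-row local computation: each peer receives whole matrix rows and
# computes the inner product of a sparse row with the dense input vector.
def row_dot(sparse_row, dense_vec):
    # sparse_row: list of (column index, value) pairs for one matrix row.
    # Only non-zero entries contribute, so zeros are never stored or touched.
    return sum(v * dense_vec[j] for j, v in sparse_row)

# Row [1 0 2] against vector [3 4 5]: 1*3 + 2*5 = 13.
cell = row_dot([(0, 1.0), (2, 2.0)], [3.0, 4.0, 5.0])
```

Each such inner product yields exactly one cell of the result vector, which is why the result can be written straight to the output without any peer-to-peer messages.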

Algorithm description

The generic algorithm will contain one superstep, because no communication is needed:

  1. Matrix and vector distribution.
  2. Custom partitioning.
  3. Local computation.
  4. Output of result vector.
  5. Constructing of dense vector.

To support efficient local computation, the underlying data structure should provide the indices of the rows and columns which contain non-zero items, and local computation must be performed with local indices.

Dealing with large matrices

The current code contains classes which work with the matrix in memory; that's why the algorithm will fail for large matrices. So I propose some steps to modify the SpMV algorithm to work with large matrices. First of all, a simple matrix format based on the file system will be created; let's call this class FileMatrix. This format will give the following possibilities:

  1. We can set a matrix cell, and it will be appended to the file without any checks.
  2. We can iterate through the entire file to get all matrix cells.

Such constraints are chosen because it is hard to imagine how we could efficiently implement some matrix operations, for example, getting a cell with a specified index. At the same time these constraints match the constraints of HDFS (large block size, data written once and read many times, fast sequential reading of an entire file). Creating such a class won't take much time, and it will make it possible to store and read large matrices. The memory-consumption bottleneck in the current algorithm is the matrix distribution phase: the array of local matrices is stored in memory. We can correct the situation as follows: the input matrix is stored in a file; we then iterate through the matrix and map its components to local matrices, also represented as FileMatrix. From then on we won't store the array of local matrices in memory; at this step we assume that a matrix taken from a local file can fit into the memory of the local processor. But even this can be improved: when the local matrix can't fit into local processor memory, we will invoke the local SpMV algorithm on parts of the matrix. I propose to create a class which implements the Mapper interface from the linearalgebra package and splits the local matrix recursively into chunks, represented as FileMatrix, until each chunk can fit into local memory. I will call this process two-phase mapping. After these steps we avoid storing the entire matrix in memory; we now only assume that the matrix can fit into the total space of the hard drives and that the components of the local vector fit in memory. Two-phase mapping can also be used in RandomMatrixGenerator for large matrices.
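The two FileMatrix operations described above (append a cell, iterate over all cells) can be sketched as follows. This is a local-filesystem stand-in written in plain Python, not the proposed Hama class; the file name and the plain-text cell encoding are assumptions, since the real class would write a Hadoop sequence file on HDFS:

```python
import os
import tempfile

class FileMatrix:
    """Append-only, scan-only matrix storage, mirroring HDFS constraints."""

    def __init__(self, path):
        self.path = path

    def append_cell(self, row, col, value):
        # Append-only: no duplicate or index checks, matching the
        # write-once style of HDFS.
        with open(self.path, "a") as f:
            f.write("{} {} {}\n".format(row, col, value))

    def cells(self):
        # Sequential scan of the entire file: the only supported read pattern.
        with open(self.path) as f:
            for line in f:
                r, c, v = line.split()
                yield int(r), int(c), float(v)

path = os.path.join(tempfile.mkdtemp(), "matrix.cells")
m = FileMatrix(path)
m.append_cell(0, 2, 5.0)
m.append_cell(1, 1, 3.0)
cells = list(m.cells())
```

Random access by index is deliberately absent: supporting only append and full scan is what makes the format cheap on HDFS.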

In the setup stage every peer reads the input dense vector from a file. After that, the framework partitions the matrix rows automatically by the algorithm provided in the custom partitioner. Then local computation is performed: we obtain some cells of the result vector in the bsp procedure, and they are written to the output file. The output file is reread to construct an instance of a dense vector for further computation.

Implementation

How to get

Implementation can be found in my GitHub repository, and a patch can be found in [https://issues.apache.org/jira/browse/HAMA-524 Apache JIRA] as soon as JIRA becomes available. The GitHub repository contains only the classes related to SpMV. Before you start with SpMV, make sure that you have followed this guide and set up your environment variables and so on.

Optional additional setup

I considered two possible use cases of SpMV:

  1. Usage in pair with RandomMatrixGenerator.
  2. Usage with arbitrary text files.

In this section you will see how to use SpMV in these two cases. I propose the following directory structure for the examples:

No Format

/user/hduser/spmv/matrix-seq
/user/hduser/spmv/matrix-txt
/user/hduser/spmv/result-seq
/user/hduser/spmv/result-txt
/user/hduser/spmv/vector-seq
/user/hduser/spmv/vector-txt

The suffix seq denotes that a directory contains sequence files; the suffix txt denotes that it contains human-readable text files.

Also, I defined some shell variables in the .bashrc file of the hadoop user to simplify the following code snippets.

No Format

export HAMA_EXAMPLES=$HAMA_HOME/hama-examples*.jar
export SPMV=/user/hduser/spmv

The first variable gives fast access to the jar with Hama examples, which is placed in the Hama home directory; the second variable is the HDFS prefix for the tests in this tutorial. If you haven't defined these variables, just substitute the appropriate values into the following scripts.

Representation of matrices in text format

It was decided to allow users to work with SpMV through text files, so in this section I will describe the text format for matrices. I decided to represent all matrices and vectors as follows: each row of the matrix is represented by the row index, the length of the row, the number of non-zero items, and pairs of index and value. All values inside a row are separated by whitespace; rows are separated by newlines. Vectors are represented as matrix rows with an arbitrary row index (not used). For example:

No Format

[1 0 2]    0 3 2 0 1 2 2
[0 0 0]  = 1 3 0
[0 5 1]    2 3 2 1 5 2 1
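Reading one line of this format back can be sketched as follows (a plain-Python illustration; parse_row is a hypothetical name, the real reader is a Hama writable):

```python
# Parse one row of the text format: "row-index row-length nnz (index value)*",
# with all tokens separated by whitespace.
def parse_row(line):
    tokens = line.split()
    row_index = int(tokens[0])
    row_len = int(tokens[1])
    nnz = int(tokens[2])
    # The remaining 2*nnz tokens are (column index, value) pairs.
    pairs = [(int(tokens[3 + 2 * i]), float(tokens[4 + 2 * i]))
             for i in range(nnz)]
    return row_index, row_len, pairs

# Row [1 0 2] of the example above: index 0, length 3, two non-zeros.
parsed = parse_row("0 3 2 0 1 2 2")
```

Note that the row length is stored explicitly because trailing zeros leave no trace in the index-value pairs.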

Usage with RandomMatrixGenerator

RandomMatrixGenerator, like SpMV, works with the sequence file format. So, to multiply a random matrix with a random vector we will do the following: generate the matrix and vector; run SpMV; convert the matrix, vector and result to text files; and view them. This sequence is described by the following code snippet:

No Format

1:    hadoop dfs -rmr $SPMV/*/*
2:    hama jar $HAMA_EXAMPLES rmgenerator $SPMV/matrix-seq 6 6 0.4 4
3:    hama jar $HAMA_EXAMPLES rmgenerator $SPMV/vector-seq 1 6 0.9 4
4:    hama jar $HAMA_EXAMPLES spmv $SPMV/matrix-seq $SPMV/vector-seq $SPMV/result-seq 4
5:    hadoop dfs -rmr $SPMV/result-seq/part
6:    hama jar $HAMA_EXAMPLES matrixtotext $SPMV/matrix-seq $SPMV/matrix-txt
7:    hama jar $HAMA_EXAMPLES matrixtotext $SPMV/vector-seq $SPMV/vector-txt
8:    hama jar $HAMA_EXAMPLES matrixtotext $SPMV/result-seq $SPMV/result-txt
9:    hadoop dfs -cat /user/hduser/spmv/matrix-txt/*
   0	 6 3 5 0.24316243288531214 2 0.638622414091597 3 0.5480468710898891
   3	 6 2 5 0.5054043538570098 2 0.03911646523753309
   1	 6 3 4 0.5077528966368161 5 0.5780340816354201 3 0.4626752204959449
   4	 6 2 1 0.6512355661856207 4 0.08804976645891671
   2	 6 2 4 0.7200271909735554 1 0.3510851368183805
   5	 6 2 2 0.5848717104309032 3 0.0889791409798859

10:   hadoop dfs -cat /user/hduser/spmv/vector-txt/*
   0	 6 6 0 0.3365077672167889 1 0.17498609722570935 2 0.32806410950648845 3 0.6016567879100464 4 0.786158850847722 5 0.6856872945972037
11:   hadoop dfs -cat /user/hduser/spmv/result-txt/*
   0	 6 6 0 0.7059786044267415 1 1.0738967463653346 2 0.6274907669206862 3 0.35938205240905363 4 0.18317827331814918 5 0.24541032101100438

We got the expected result. Now let's explain the meaning of each line in the code snippet above.

Line 1: Clean up of directories related to SpMV tests.

Line 2-3: Generation of the input matrix and vector. In this example we test 6x6 matrix and 1x6 vector multiplication.

Line 4: SpMV algorithm.

Line 5: Deletion of part files from the output directory of line 4. NOTE: matrixtotext will fail if this step is not performed, because result-seq will contain a part folder and matrixtotext doesn't know how to deal with it yet.

Line 6-8: Conversion of the input matrix, input vector and result to text format.

Line 9-11: Showing the result.

Usage with arbitrary text files

SpMV works with SequenceFile, so we need tools to convert the input and output of SpMV between sequence file format and text format. These tools are matrixtoseq and matrixtotext. Both programs are included in the example driver, so they can be launched like any other example. matrixtoseq converts a matrix represented as a text file into sequence file format; it also lets you choose the target writable: DenseVectorWritable or SparseVectorWritable.

No Format

Usage: matrixtoseq <input matrix dir> <output matrix dir> <dense|sparse> [number of tasks (default max)]

matrixtotext converts a matrix from sequence file format to a text file.

No Format

Usage: matrixtotext <input matrix dir> <output matrix dir> [number of tasks (default max)]

Now let's show an example. To use SpMV in this mode you should provide text files in the appropriate format, as described above. Imagine that you need to compute

No Format

[1 0 6 0]   [2]   [38] 
[0 4 0 0] * [3] = [12] 
[0 2 3 0]   [6]   [24] 
[3 0 0 5]   [0]   [6]

First of all, you should create appropriate text files for the input matrix and input vector. The input matrix file should look like

No Format

0 4 2 0 1 2 6
1 4 1 1 4
2 4 2 1 2 2 3
3 4 2 0 3 3 5

The vector file should look like

No Format

0 4 3 0 2 1 3 2 6

After that you should copy these files to HDFS. If you don't feel comfortable with HDFS, please see this tutorial. After you have copied the input matrix into matrix-txt and the input vector into vector-txt, we are ready to start. The following code snippet shows how you can multiply matrices in this mode. Explanations are given below.

No Format

1: hama jar $HAMA_EXAMPLES matrixtoseq $SPMV/matrix-txt $SPMV/matrix-seq sparse 4
2: hama jar $HAMA_EXAMPLES matrixtoseq $SPMV/vector-txt $SPMV/vector-seq dense 4
3: hama jar $HAMA_EXAMPLES spmv $SPMV/matrix-seq $SPMV/vector-seq $SPMV/result-seq 4
4: hadoop dfs -rmr $SPMV/result-seq/part
5: hama jar $HAMA_EXAMPLES matrixtotext $SPMV/result-seq $SPMV/result-txt
6: hadoop dfs -cat $SPMV/result-txt/*
    0	 4 4 0 38.0 1 12.0 2 24.0 3 6.0

Line 1: Converting input matrix to sequence file format, internally consisting of SparseVectorWritable.

Line 2: Converting input vector to sequence file format, internally consisting of DenseVectorWritable.

Line 3: SpMV algorithm.

Line 4: We delete part files from the output directory. NOTE: matrixtotext will fail if this step is not performed, because result-seq will contain a part folder and matrixtotext doesn't know how to deal with it yet.

Line 5: Conversion of the result vector to text format.

Line 6: Output of the result vector. You can see that we obtained the expected vector.

Possible improvements

  1. A significant improvement in the total time of the algorithm can be achieved by creating a custom partitioner class. It will give us load balancing and therefore better efficiency. This is the main opportunity for optimization, because we decided that row-wise matrix access is acceptable. It can be achieved by reordering the input or by customizing the partitioning algorithm of the framework.
  2. Implement an algorithm for detecting matrix sparsity patterns. This will give us the possibility to determine, for example, whether a matrix is a random sparse matrix or a Laplacian matrix. For random sparse matrices we can use distribution patterns which are independent of the matrix sparsity pattern; for Laplacian matrices a diamond distribution can give better results.

Literature

  1. Rob H. Bisseling. Parallel Scientific Computation. Chapter 4.
  2. Steve Rennich. Block SpMV on GPU.