
Distributed Sparse Matrix-Vector Multiplication on Hama

Introduction

In the following description we will study the problem in the form u = Av. Most computational algorithms spend a large percentage of their time solving large systems of linear equations. In general, a system of linear equations can be represented in matrix form as Ax = b, where A is a matrix with n rows and n columns, b is a vector of size n, and x is the unknown solution vector we are looking for. Some approaches to solving linear systems are iterative in nature: assuming we know an initial approximation x = x_0, we represent the system in the form x_n = B x_{n-1} + c, where c is a vector of size n, and then compute successive approximations of x until convergence. Each iteration is essentially a matrix-vector product, which is why an efficient u = Av matters.

In the real world most matrices contain a relatively small number of non-zero items in comparison to the total number of items. Such matrices are called sparse; matrices that are mostly filled with non-zero items are called dense. Sparse matrices arise when each variable of a set is connected to only a small subset of the other variables (for example, in the differential equation of heat conduction).

This page describes the problem of sparse matrix-vector multiplication (SpMV) using the Bulk Synchronous Parallel (BSP) model implemented in the Apache Hama project. As shown above, SpMV can be used in different iterative solvers for systems of linear equations. The BSP model proposes its own smart way of parallelizing programs: we specify the input path for the problem and the number of peers, and the framework reads the input and divides it between the peers. Peers can be processors, threads, separate machines, or different nodes of a cloud. A BSP algorithm is divided into a sequence of supersteps, and a barrier synchronization of all peers is performed after each superstep. Apache Hama, an implementation of BSP, contains primitives for determining the peer number and for communicating with other peers, provides optimizations of the communication between peers, and also inherits most features and approaches of the Hadoop project.
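To make u = Av concrete before moving to the parallel setting, here is a minimal sequential sketch in Java. The SparseRow class and its layout (parallel arrays of column indices and values) are illustrative assumptions rather than Hama code; the point is that the inner product of a sparse row with a dense vector touches only the non-zero entries.

    import java.util.Arrays;

    public class SequentialSpMV {

      /** A sparse matrix row: parallel arrays of column indices and values. */
      static class SparseRow {
        final int[] cols;     // column indices of non-zero entries
        final double[] vals;  // values of non-zero entries

        SparseRow(int[] cols, double[] vals) {
          this.cols = cols;
          this.vals = vals;
        }

        /** Inner product of this sparse row with a dense vector v. */
        double dot(double[] v) {
          double sum = 0.0;
          for (int i = 0; i < cols.length; i++) {
            sum += vals[i] * v[cols[i]];
          }
          return sum;
        }
      }

      /** u = A * v, where A is given row by row. */
      static double[] multiply(SparseRow[] a, double[] v) {
        double[] u = new double[a.length];
        for (int row = 0; row < a.length; row++) {
          u[row] = a[row].dot(v);
        }
        return u;
      }

      public static void main(String[] args) {
        // A = [[2, 0, 1], [0, 3, 0], [4, 0, 5]], v = [1, 1, 1]
        SparseRow[] a = {
            new SparseRow(new int[] {0, 2}, new double[] {2, 1}),
            new SparseRow(new int[] {1},    new double[] {3}),
            new SparseRow(new int[] {0, 2}, new double[] {4, 5})
        };
        System.out.println(Arrays.toString(multiply(a, new double[] {1, 1, 1}))); // [3.0, 3.0, 9.0]
      }
    }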

Problem description

As a sequential problem, SpMV is almost trivial. In the parallel version, however, we have to think about some additional aspects:

  1. Partitioning of matrix and vector components. We have to split the input matrix and vectors between peers if we want to benefit from a parallel algorithm. The partitioning must be chosen wisely: otherwise either communication time will grow dramatically, or we will get severe load imbalance and the algorithm will be inefficient. A simple row-block scheme is sketched after this list.
  2. Load balancing. Each peer must perform nearly the same amount of work, and none of them should sit idle.
  3. We must take into account the Hadoop and Hama approach to parallelization.
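As a minimal illustration of point 1, the sketch below assigns contiguous blocks of rows to peers. It is plain Java with hypothetical helper names, not Hama code, and is only one possible scheme among many.

    /**
     * Block-row partitioning sketch (hypothetical helper, not Hama API):
     * rows 0..n-1 are split into numPeers nearly equal contiguous blocks.
     */
    public class BlockRowPartition {

      /** Index of the peer that owns the given row. */
      static int ownerOfRow(int row, int n, int numPeers) {
        int blockSize = (n + numPeers - 1) / numPeers; // ceil(n / numPeers)
        return row / blockSize;
      }

      public static void main(String[] args) {
        int n = 10, numPeers = 3; // blocks of 4, 4 and 2 rows
        for (int row = 0; row < n; row++) {
          System.out.println("row " + row + " -> peer " + ownerOfRow(row, n, numPeers));
        }
      }
    }

Note that an equal number of rows does not imply an equal amount of work for a sparse matrix: a block whose rows carry many non-zeros costs more, which is exactly the load-balancing concern of point 2.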

Implementation tips

  1. The framework splits the input file between peers automatically, so we do not need to map the matrix to peers manually. We only have to define how the matrix is written to a file and how it is read back from it. If we create a matrix that consists of separate cells, the framework will give a subset of cells to each peer. If we create a matrix consisting of rows, the framework will give a subset of rows to each peer. There are two ways to influence the partitioning: creating different writables for matrices, as described above (a possible row writable is sketched after this list), or overriding the behavior of the default partitioner class.
  2. In the case of row-wise matrix access we do not need to care about communication at all. First of all, the rows of the matrix are split between peers automatically by the framework. Each peer can then compute the inner product of the input vector with a concrete matrix row, and the result can be written directly to the output, because it is one of the cells of the result vector. Here we assume that a peer's memory can fit two dense vectors: even for a million x million matrix and a vector of size one million, a few megabytes are enough to store them, and splitting the input vector would give an insignificant gain in memory.
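As an illustration of the first tip, a matrix consisting of rows can be represented by a custom Writable, so that the framework can serialize whole rows and hand a subset of them to each peer. The class below is a hypothetical sketch built on Hadoop's Writable interface (which Hama reuses); the field layout mirrors the SparseRow example from the introduction, and the dot() method is the local computation from the second tip.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;

    /**
     * Hypothetical Writable for one sparse matrix row: the number of
     * non-zero entries followed by (column, value) pairs.
     */
    public class SparseRowWritable implements Writable {

      private int[] cols = new int[0];
      private double[] vals = new double[0];

      @Override
      public void write(DataOutput out) throws IOException {
        out.writeInt(cols.length);
        for (int i = 0; i < cols.length; i++) {
          out.writeInt(cols[i]);
          out.writeDouble(vals[i]);
        }
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        int nnz = in.readInt();
        cols = new int[nnz];
        vals = new double[nnz];
        for (int i = 0; i < nnz; i++) {
          cols[i] = in.readInt();
          vals[i] = in.readDouble();
        }
      }

      /** Inner product with a dense vector, touching only non-zero entries. */
      public double dot(double[] v) {
        double sum = 0.0;
        for (int i = 0; i < cols.length; i++) {
          sum += vals[i] * v[cols[i]];
        }
        return sum;
      }
    }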

Algorithm description

The generic algorithm consists of a single superstep, because no communication is needed:

  0. Matrix and vector distribution.
  1. Custom partitioning.
  2. Local computation.
  3. Output of the result vector.

In the setup stage every peer reads the input dense vector from a file. After that, the framework automatically partitions the matrix rows using the algorithm provided in the custom partitioner. Then the local computation is performed: each peer obtains some cells of the result vector, and they are written to the output file.
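A skeleton of such a single-superstep Hama job might look as follows. This is a hedged sketch rather than a finished implementation: it assumes the SparseRowWritable from the previous section, the key/value types (an IntWritable row index as input key, DoubleWritable result cells as output) are chosen purely for illustration, loadDenseVector() is a hypothetical stub, and the BSP/BSPPeer signatures follow Hama's API as the author understands it. No messages are sent and no peer.sync() is needed, because each peer only processes its own rows.

    import java.io.IOException;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hama.bsp.BSP;
    import org.apache.hama.bsp.BSPPeer;
    import org.apache.hama.bsp.sync.SyncException;

    /**
     * Single-superstep SpMV sketch: each peer multiplies the rows it was
     * given by the dense input vector and writes the resulting cells.
     */
    public class SpMVBSP
        extends BSP<IntWritable, SparseRowWritable, IntWritable, DoubleWritable, NullWritable> {

      private double[] v; // the dense input vector, assumed to fit in memory

      @Override
      public void setup(
          BSPPeer<IntWritable, SparseRowWritable, IntWritable, DoubleWritable, NullWritable> peer)
          throws IOException, SyncException, InterruptedException {
        // Hypothetical configuration key and helper; real code would read the vector from HDFS.
        v = loadDenseVector(peer.getConfiguration().get("spmv.vector.path"));
      }

      @Override
      public void bsp(
          BSPPeer<IntWritable, SparseRowWritable, IntWritable, DoubleWritable, NullWritable> peer)
          throws IOException, SyncException, InterruptedException {
        IntWritable row = new IntWritable();
        SparseRowWritable rowData = new SparseRowWritable();
        // Local computation only: no messages are exchanged, no barrier is required.
        while (peer.readNext(row, rowData)) {
          peer.write(new IntWritable(row.get()), new DoubleWritable(rowData.dot(v)));
        }
      }

      private double[] loadDenseVector(String path) throws IOException {
        // Reading the vector is omitted; this stub only keeps the sketch compilable.
        throw new UnsupportedOperationException("vector loading omitted in this sketch");
      }
    }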

Possible improvements

  1. A significant improvement in the total running time of the algorithm can be achieved by creating a custom partitioner class, which gives us load balancing and therefore better efficiency. This is the main opportunity for optimization, since we have decided that row-wise matrix access is acceptable. It can be achieved by reordering the input or by customizing the partitioning algorithm of the framework; a possible sketch is shown below.
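Under these assumptions, a custom partitioner could be as small as the sketch below. It is written against Hama's Partitioner interface with its getPartition(key, value, numTasks) method (an assumption to verify against the Hama version in use) and distributes rows cyclically: row i goes to peer i mod numTasks, so dense and sparse regions of the matrix are spread across all peers instead of landing in one contiguous block.

    import org.apache.hadoop.io.IntWritable;

    import org.apache.hama.bsp.Partitioner;

    /**
     * Cyclic row distribution sketch: row i is assigned to peer i mod numTasks.
     * SparseRowWritable is the hypothetical row type from the implementation tips.
     */
    public class CyclicRowPartitioner implements Partitioner<IntWritable, SparseRowWritable> {

      @Override
      public int getPartition(IntWritable row, SparseRowWritable value, int numTasks) {
        return row.get() % numTasks;
      }
    }

For an input whose non-zeros cluster in a band of consecutive rows, a contiguous block scheme would give almost all of the work to a single peer, while the cyclic scheme spreads those rows evenly.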
