
Distributed Sparse Matrix-Vector Multiplication on Hama

Introduction

Throughout this page we consider the problem u = Av. Many computational algorithms spend a large fraction of their time solving systems of linear equations. In general, a linear system can be written in matrix form as Ax = b, where A is a matrix with n rows and n columns, b is a vector of size n, and x is the unknown solution vector we are searching for. Some approaches to solving linear systems are iterative in nature: assume we know an initial approximation x = x0; we then rewrite the system in the form xn = Bxn-1 + c, where c is a vector of size n, and compute successive approximations of x until convergence.

In the real world, most matrices contain a relatively small number of non-zero items compared to the total number of items. Such matrices are called sparse; matrices filled mostly with non-zero items are called dense. This page describes the problem of sparse matrix-vector multiplication (SpMV) using the Bulk Synchronous Parallel (BSP) model implemented in the Apache Hama project. As noted above, SpMV can be used in different iterative solvers for systems of linear equations.

The BSP model proposes its own way of parallelizing programs. The input problem is partitioned among peers; a peer can be a processor, a thread, a separate machine, or a node in a cloud. A BSP algorithm is divided into a sequence of supersteps, with a barrier synchronization of all peers after each superstep. Apache Hama, an implementation of BSP, provides primitives for determining the peer number, communicating with other peers through various communication primitives, and optimizing communication between peers.
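The iterative scheme xn = Bxn-1 + c from the paragraph above can be sketched in plain sequential Java. This is an illustrative example only, not Hama code: the class and method names are hypothetical, and the concrete numbers come from the Jacobi splitting of a small diagonally dominant system chosen for the example.

```java
import java.util.Arrays;

// Sketch of the iteration x_n = B*x_{n-1} + c, shown as the Jacobi method
// for Ax = b. All names here are illustrative, not part of the Hama API.
public class JacobiSketch {

    // One dense matrix-vector product y = B*x (the kernel SpMV accelerates).
    static double[] multiply(double[][] B, double[] x) {
        double[] y = new double[B.length];
        for (int i = 0; i < B.length; i++)
            for (int j = 0; j < x.length; j++)
                y[i] += B[i][j] * x[j];
        return y;
    }

    // Iterate x_n = B*x_{n-1} + c until successive iterates are close.
    static double[] iterate(double[][] B, double[] c, double[] x0, double tol) {
        double[] x = x0.clone();
        for (int k = 0; k < 1000; k++) {
            double[] next = multiply(B, x);
            double diff = 0;
            for (int i = 0; i < x.length; i++) {
                next[i] += c[i];
                diff = Math.max(diff, Math.abs(next[i] - x[i]));
            }
            x = next;
            if (diff < tol) break;
        }
        return x;
    }

    public static void main(String[] args) {
        // For A = [[4,1],[1,3]], b = [1,2]: Jacobi splits A = D + R,
        // B = -D^{-1}R, c = D^{-1}b; exact solution is x = [1/11, 7/11].
        double[][] B = {{0, -0.25}, {-1.0 / 3, 0}};
        double[] c = {0.25, 2.0 / 3};
        double[] x = iterate(B, c, new double[]{0, 0}, 1e-10);
        System.out.println(Arrays.toString(x));
    }
}
```

In a distributed setting the `multiply` call is exactly the step that SpMV replaces; everything else stays sequential per iteration.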

Problem description

As a sequential problem, SpMV is almost trivial. For the parallel version, however, we should think about some additional aspects:

  1. Partitioning of matrix and vector components. We should split the input matrix and vectors among peers if we want to benefit from a parallel algorithm. Partitioning must be chosen wisely, or communication time will grow sharply and the algorithm will be inefficient.
  2. Load balancing. Each peer must perform nearly the same amount of work, and none of them should idle.
  3. Bounded communication. In parallel SpMV, partitioning should keep the communication volume within appropriate bounds independently of the sparsity patterns of the input matrix and vector.
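One simple partitioning that addresses points 1-3 is a Cartesian cyclic mapping, where entry (i, j) is owned by processor (i mod pr, j mod pc) on a pr x pc grid: row and column mappings are independent, and non-zeros spread evenly regardless of sparsity pattern. The sketch below is a hypothetical illustration; `owner` is not a Hama API method.

```java
// Hypothetical illustration of a Cartesian cyclic partitioning:
// entry (i, j) is owned by processor (i mod pr, j mod pc).
public class CyclicPartition {

    // Linear peer id of the owner of matrix entry (i, j) on a pr x pc grid.
    static int owner(int i, int j, int pr, int pc) {
        return (i % pr) * pc + (j % pc);
    }

    public static void main(String[] args) {
        // On a 2 x 2 grid, the entries of a 4 x 4 matrix interleave
        // over 4 peers in a checkerboard-like pattern:
        // 0 1 0 1
        // 2 3 2 3
        // 0 1 0 1
        // 2 3 2 3
        for (int i = 0; i < 4; i++) {
            StringBuilder row = new StringBuilder();
            for (int j = 0; j < 4; j++)
                row.append(owner(i, j, 2, 2)).append(' ');
            System.out.println(row.toString().trim());
        }
    }
}
```

Because the mapping depends only on i and j, every peer can compute any entry's owner locally, with no lookup table to communicate.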

Implementation tips

  1. Order of distribution and representation. We have two choices in this aspect: represent the matrix first and distribute later, or distribute the matrix first and represent later. In the first case (represent first, distribute later), all simple operations become non-local and bring unnecessary overhead. In the other case (distribute first, represent later), all local operations on a processor remain local: the algorithm first determines the responsible processor, which then performs the operation locally. That is why I prefer the distribute-first, represent-later approach.
  2. Data transmission direction. Here we also have two choices: deliver vector components to the processor that possesses the non-zero matrix component, or vice versa. In most cases the number of non-zero items in the matrix is much larger than the vector length, which is why we prefer transmitting the vector.

Algorithm description

The generic algorithm consists of an initial distribution step followed by three supersteps:

  0. Matrix and vector distribution.
  1. Fanout.
  2. Local computation.
  3. Fanin.

In the fanout phase, every processor gets the v components it needs. In the local computation phase, the local contribution to the result vector is calculated. In the fanin phase, all local contributions are sent to the owner of the corresponding u component. Most of the effort should go into choosing a matrix and vector distribution that reduces the communication volume of the fanout and fanin phases.

As a base implementation of the distribution I propose a Cartesian (column mappings are independent of row mappings and vice versa) cyclic-block distribution with cyclic distribution of the matrix diagonal. I also assume that distr(u) != distr(v), which gives us more freedom in optimizing the vector distribution. This type of distribution has several advantages: it is simple; in fanin, communication is needed only within a processor column; in fanout, communication is needed only within a processor row; and we can easily predict the performance of the algorithm.

After the matrix distribution we perform the vector distribution greedily for each processor: a processor grabs components until it reaches its optimal state. After that stage some vector components (nearly 10%) can remain unassigned; we distribute them in a greedy fashion too. To support efficient local computation, the data structure used should provide the indices of the rows and columns that contain non-zero items, and local computation must be performed with local indices.
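The local computation phase can be sketched with Compressed Sparse Row (CSR) storage, one common data structure that exposes exactly the row and column indices of non-zero items mentioned above. This is a hedged, sequential sketch of a single peer's work: in a real Hama job the needed v components would arrive as messages during fanout, and the resulting u contributions would be sent onward during fanin.

```java
import java.util.Arrays;

// Sketch of one peer's local contribution u_local = A_local * v_local,
// using CSR storage with local indices. Array contents are hypothetical.
public class LocalSpmv {

    // CSR arrays: rowPtr delimits each local row's slice of colIdx/values.
    static double[] spmv(int[] rowPtr, int[] colIdx, double[] values, double[] v) {
        double[] u = new double[rowPtr.length - 1];
        for (int i = 0; i < u.length; i++)
            for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++)
                u[i] += values[k] * v[colIdx[k]];
        return u;
    }

    public static void main(String[] args) {
        // Local part of a sparse matrix, in CSR form:
        // [ 2 0 1 ]
        // [ 0 3 0 ]
        int[] rowPtr = {0, 2, 3};
        int[] colIdx = {0, 2, 1};
        double[] values = {2, 1, 3};
        double[] u = spmv(rowPtr, colIdx, values, new double[]{1, 1, 1});
        System.out.println(Arrays.toString(u)); // → [3.0, 3.0]
    }
}
```

Note that `colIdx` holds local column indices, so the inner loop touches only components of v that this peer actually received; no global index translation happens inside the kernel.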

Possible improvements

  1. Implementation of the Mondrian distribution. In most cases it gives better results than the cyclic-block Cartesian scheme.
  2. Implementation of an algorithm for detecting matrix sparsity patterns. This would let us determine, for example, whether a matrix is a random sparse matrix or a Laplacian matrix. For random sparse matrices we can use distribution patterns that are independent of the sparsity pattern; for Laplacian matrices, a diamond distribution can give better results.
  3. In the vector distribution phase, when some vector components remain unassigned, we can use graph algorithms to determine the owner of each component.

Literature

  1. Rob H. Bisseling. Parallel Scientific Computation. Chapter 4.
  2. Steve Rennich. Block SpMV on GPU.