In the following description we consider the problem in the form u = Av. Most computational algorithms spend a large fraction of their time solving systems of linear equations. In general, a linear system can be represented in matrix form as Ax = b, where A is an n × n matrix, b is a vector of size n, and x is the unknown solution vector we are searching for. Some approaches to solving linear systems are iterative: assuming we know an initial approximation x = x0, we rewrite the system in the form xn = Bxn-1 + c, where c is a vector of size n, and then compute successive approximations of x until convergence.

In the real world most matrices contain a relatively small number of non-zero items compared to the total number of items. Such matrices are called sparse; matrices that are mostly filled with non-zero items are called dense. This page describes the problem of sparse matrix-vector multiplication (SpMV) using the Bulk Synchronous Parallel (BSP) model implemented in the Apache Hama project. As shown above, SpMV can be used in different iterative solvers for systems of linear equations.

The BSP model proposes its own way of parallelizing programs. The input problem is split among peers; a peer can be a processor, a thread, a separate machine, or a node in a cloud. A BSP algorithm is divided into a sequence of supersteps, and a barrier synchronization of all peers is performed after each superstep. Apache Hama, an implementation of BSP, provides primitives for determining the peer number, communicating with other peers through different communication primitives, and optimizing the communication between peers.
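To make the iterative scheme xn = Bxn-1 + c concrete, here is a small self-contained sketch (my own illustration, not code from the project) that runs Jacobi iteration on a diagonally dominant 3 × 3 system; each sweep is exactly one application of x → Bx + c with B = D⁻¹(D − A) and c = D⁻¹b:

```java
import java.util.Arrays;

public class JacobiDemo {
    // One Jacobi sweep: x_new[i] = (b[i] - sum_{j != i} a[i][j] * x[j]) / a[i][i],
    // which is the fixed-point form x_new = B x + c from the text.
    static double[] sweep(double[][] a, double[] b, double[] x) {
        int n = b.length;
        double[] next = new double[n];
        for (int i = 0; i < n; i++) {
            double s = b[i];
            for (int j = 0; j < n; j++) {
                if (j != i) s -= a[i][j] * x[j];
            }
            next[i] = s / a[i][i];
        }
        return next;
    }

    public static void main(String[] args) {
        // Strictly diagonally dominant matrix, so the iteration converges.
        double[][] a = {{4, 1, 0}, {1, 4, 1}, {0, 1, 4}};
        double[] b = {5, 6, 5};      // exact solution is (1, 1, 1)
        double[] x = new double[3];  // initial approximation x0 = 0
        for (int k = 0; k < 50; k++) {
            x = sweep(a, b, x);      // repeat until convergence
        }
        System.out.println(Arrays.toString(x));
    }
}
```

The expensive part of each sweep is the multiplication of the iteration matrix by the current vector, which is why a fast SpMV kernel matters for such solvers.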
As a sequential problem, SpMV is almost trivial. The parallel version, however, forces us to think about additional aspects, such as how to distribute the matrix and the vectors among peers and how to minimize communication volume.
The generic algorithm will be divided into three supersteps, preceded by an initial distribution phase:

0. Matrix and vector distribution.
1. Fanout.
2. Local computation.
3. Fanin.
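The supersteps above can be simulated in plain Java before worrying about the Hama runtime. The sketch below (my own single-process illustration; peers are loop indices rather than Hama BSPPeer instances, and the cyclic row distribution is just one possible choice) shows what each phase contributes:

```java
import java.util.*;

/** Single-process simulation of the SpMV supersteps: distribution,
 *  fanout, local computation, fanin. Peers are simulated by loop
 *  indices instead of real BSP peers. */
public class SpmvSupersteps {
    record Cell(int row, int col, double value) {}  // one non-zero item

    static double[] spmv(List<Cell> cells, double[] v, int numPeers) {
        // Superstep 0: distribute the non-zeros cyclically by row over the peers.
        List<List<Cell>> local = new ArrayList<>();
        for (int p = 0; p < numPeers; p++) local.add(new ArrayList<>());
        for (Cell c : cells) local.get(c.row() % numPeers).add(c);

        double[] u = new double[v.length];
        for (int p = 0; p < numPeers; p++) {
            // Superstep 1 (fanout): each peer fetches the v components it needs.
            Map<Integer, Double> vLocal = new HashMap<>();
            for (Cell c : local.get(p)) vLocal.put(c.col(), v[c.col()]);

            // Superstep 2: local contribution to the result vector.
            Map<Integer, Double> contrib = new HashMap<>();
            for (Cell c : local.get(p))
                contrib.merge(c.row(), c.value() * vLocal.get(c.col()), Double::sum);

            // Superstep 3 (fanin): contributions are sent to the owner of u.
            for (Map.Entry<Integer, Double> e : contrib.entrySet())
                u[e.getKey()] += e.getValue();
        }
        return u;
    }

    public static void main(String[] args) {
        List<Cell> a = List.of(new Cell(0, 0, 2), new Cell(0, 2, 1),
                               new Cell(1, 1, 3), new Cell(2, 0, 4));
        double[] v = {1, 2, 3};
        System.out.println(Arrays.toString(spmv(a, v, 2)));  // prints [5.0, 6.0, 4.0]
    }
}
```

In the real implementation the per-peer loop body runs concurrently on each peer, with a barrier synchronization between the phases.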
In the fanout phase all processors get the components of v they need. In the local computation phase the local contribution to the result vector is calculated. In the fanin phase all local contributions are sent to the owner of u. Most of the effort should go into choosing the right matrix and vector distribution, which will reduce the communication volume of the fanout and fanin phases.

As the base implementation of the distribution I propose a Cartesian (column mappings are independent of row mappings and vice versa) block-cyclic distribution with cyclic distribution of the matrix diagonal. I also assume that distr(u) != distr(v), which gives us more freedom in optimizing the vector distribution. This type of distribution has several advantages: it is simple; in fanin, communication is needed only within a processor column; in fanout, communication is needed only within a processor row; and we can easily predict the performance of the algorithm.

After the matrix distribution we perform the vector distribution greedily for each processor: a processor grabs components until it reaches its optimal state. After that stage some vector components may remain unassigned (nearly 10%); we distribute them in a greedy fashion too. To support efficient local computation, the data structure used should provide the indices of the rows and columns that contain non-zero items, and local computation must be performed with local indices.
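A minimal sketch of the Cartesian cyclic mapping may help; the class and method names below are my own, not from the Hama code base. The row mapping depends only on i and the column mapping only on j, which is what confines fanout to a processor row and fanin to a processor column:

```java
/** Sketch of a Cartesian cyclic distribution over an M x N processor grid.
 *  Row and column mappings are independent, so fanout stays within a
 *  processor row and fanin within a processor column. */
public class CyclicDistribution {
    final int gridRows, gridCols;

    CyclicDistribution(int gridRows, int gridCols) {
        this.gridRows = gridRows;
        this.gridCols = gridCols;
    }

    int ownerRow(int i) { return i % gridRows; }  // cyclic over matrix rows
    int ownerCol(int j) { return j % gridCols; }  // cyclic over matrix columns

    /** Linear id of the peer owning cell (i, j). */
    int owner(int i, int j) { return ownerRow(i) * gridCols + ownerCol(j); }

    public static void main(String[] args) {
        CyclicDistribution d = new CyclicDistribution(2, 3);
        // The diagonal cells cycle over all 6 peers, spreading the
        // diagonal evenly, as the proposed distribution requires.
        for (int k = 0; k < 6; k++) System.out.print(d.owner(k, k) + " ");
    }
}
```

With coprime-free grid dimensions the diagonal may not hit every peer, so a production version would need a more careful diagonal mapping; the sketch only shows the independence of the two mappings.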
The current code contains classes that work with matrices in memory. That is why the algorithm will fail for large matrices, so I propose several steps to modify the SpMV algorithm to work with large matrices. First of all, a simple matrix format based on the file system will be created; let's call this class FileMatrix. This class will be limited to sequential writing and sequential reading of matrix cells, without random access to individual cells.
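Since FileMatrix is only proposed here, the following is a hypothetical minimal sketch of the format, not the actual class: (row, column, value) triples are appended to a file and later consumed in a single sequential pass, matching the write-once, sequential-read constraints discussed below.

```java
import java.io.*;

/** Hypothetical minimal sketch of the proposed FileMatrix format:
 *  (row, col, value) triples written once and read back strictly
 *  sequentially; no random access to individual cells. */
public class FileMatrixSketch {
    static void writeCells(File f, int[][] rc, double[] vals) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            for (int k = 0; k < vals.length; k++) {
                out.writeInt(rc[k][0]);    // row index
                out.writeInt(rc[k][1]);    // column index
                out.writeDouble(vals[k]);  // cell value
            }
        }
    }

    /** One sequential pass over all cells, e.g. to fold them somewhere. */
    static double sumValues(File f) throws IOException {
        double sum = 0;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(f)))) {
            while (true) {
                try {
                    in.readInt();            // row (unused in this pass)
                    in.readInt();            // column (unused in this pass)
                    sum += in.readDouble();
                } catch (EOFException eof) {
                    break;                   // end of the cell stream
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("filematrix", ".bin");
        f.deleteOnExit();
        writeCells(f, new int[][]{{0, 0}, {1, 2}}, new double[]{2.5, 4.0});
        System.out.println(sumValues(f));  // prints 6.5
    }
}
```

The real class would write to HDFS rather than the local file system, but the access pattern would be the same.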
Such constraints are chosen because it is hard to imagine how some matrix operations, for example getting a cell with a specified index, could be implemented efficiently. At the same time these constraints match the characteristics of HDFS (large block size, data written once and read many times, fast sequential reading of an entire file). Creating such a class won't take much time, and it will make it possible to store and read large matrices.

The memory-consumption bottleneck of the current algorithm is the matrix distribution phase: the array of local matrices is stored in memory. We can correct this as follows: the input matrix is stored in a file; we then iterate through the matrix and map its components to local matrices, also represented as FileMatrix instances. From now on we don't store the array of local matrices in memory; at this step we assume that a matrix taken from a local file can fit in the memory of a local processor.

Even this can be improved. When a local matrix can't fit in local processor memory, we will invoke the local SpMV algorithm on parts of the matrix. I propose to create a class that implements the Mapper interface from the linearalgebra package and splits a local matrix recursively into chunks, each represented as a FileMatrix, until every chunk fits in local memory; after that the local chunks will be verified. I will call this process two-phase mapping. After these steps we no longer store the entire matrix in memory: we assume only that the matrix fits the total space of the hard drives and that the components of the local vector fit in memory. Two-phase mapping can also be used in RandomMatrixGenerator for large matrices.
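The recursive splitting behind two-phase mapping can be sketched as follows; this is my own in-memory illustration under the assumption that a "chunk fits local memory" test can be reduced to a cell-count budget, and real chunks would of course live in FileMatrix files rather than lists:

```java
import java.util.*;

/** Sketch of the recursive splitting behind the proposed two-phase
 *  mapping: a set of non-zero cells is halved by row range until each
 *  chunk is below a memory budget (here: a maximum cell count). */
public class ChunkSplitter {
    record Cell(int row, int col, double value) {}

    static List<List<Cell>> split(List<Cell> cells, int lo, int hi, int maxCells) {
        if (cells.size() <= maxCells || hi - lo <= 1) {
            return new ArrayList<>(List.of(cells));  // chunk fits local memory
        }
        int mid = lo + (hi - lo) / 2;  // halve the row range [lo, hi)
        List<Cell> top = new ArrayList<>(), bottom = new ArrayList<>();
        for (Cell c : cells) (c.row() < mid ? top : bottom).add(c);
        List<List<Cell>> chunks = split(top, lo, mid, maxCells);
        chunks.addAll(split(bottom, mid, hi, maxCells));
        return chunks;
    }

    public static void main(String[] args) {
        // Identity-like matrix with 8 non-zeros on the diagonal.
        List<Cell> cells = new ArrayList<>();
        for (int i = 0; i < 8; i++) cells.add(new Cell(i, i, 1.0));
        // A budget of 3 cells per chunk forces two levels of recursion.
        List<List<Cell>> chunks = split(cells, 0, 8, 3);
        System.out.println(chunks.size());  // prints 4
    }
}
```

The verification step mentioned above would then run once per resulting chunk, after which the local SpMV can be invoked chunk by chunk.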