Sparse Matrix Dot Product Parallel Algorithm Design

This page is for documentation and discussion of parallel algorithms on GPU for dot product involving sparse matrices.
dot(dense, csr.T) = dense
- Lemma: csr.T is equivalent to interpreting the same csr input as a csc matrix
- Algorithm: Treat the input as a csc matrix, use one GPU thread per column, scan for every row in dense matrix, output to corresponding position in output dense matrix
dot(dense, csr) = dense
- Take One: Transpose the right-hand side csr matrix (in parallel fashion), then apply the above algorithm
- Take Two: Change csr to coo through a histogrammming and exclusive scan on the indices vector of input csr, then sort all elements by column, then apply the above algorithm
- Take Three(flaky): One GPU thread per row on original input, for each element in the row, multiply by the corresponding element in the dense matrix, then atomically add to corresponding destination in output (flaky due to float32 additions do not obey commutative law).

Page tree