# Lab 5: Parallel Matrix Multiplication

Submission deadline: 2359, Sunday, April 15, 2018.

(The deadline is pre-extended to Sunday :))

## Setup

The skeleton code from Lab 5 is available on cs2030-i under the directory ~cs2030/lab5. There are three files:

• MatrixMultiplication.java is the file that you need to edit to parallelize the matrix multiplication inside;
• LabFive.java is the main file;
• Matrix.java is the matrix class

To run LabFive, you should:

 1 java LabFive 

where $2^n$ is the dimension of the matrices. For instance, to run with matrices of size 32x32,

 1 java LabFive 5 

## Background: Matrix Multiplication

Matrix multiplication is a fundamental operation on matrices with many applications in physics, engineering, mathematics, and computer science.

Given a matrix $A$ of $n \times m$ ($n$ rows, $m$ columns), and a matrix $B$ of $m \times p$, the matrix produce $C = AB$ is an $n \times p$ matrix, where elements $c_{ij}$ in $C$ is given by: $c_{ij} = \sum_{k=1}^m a_{ik}b_{kj}$._

In this lab, we are interested in parallelizing the following divide-and-conquer algorithm for matrix multiplication:

Let $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$, $B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$, $C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}$

where $A_{11}, A_{12}$ etc. are block partitioned matrices of equal sizes.

If $C = AB$, then:

$C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22} \end{pmatrix}$

Please implement the above divide-and-conquer algorithm as a RecursiveTask and submit it to ForkJoinPool for execution. For simplicity, we only need to handle square matrices of size $2^n$ for $n$ up to 11.

A skeleton file MatrixMultiplication.java has been given to you. The class MatrixMultiplication is mostly empty, except that it inherits from RecursiveTask<Matrix>. You should fill in the class with necessary fields and method.

The file Matrix.java is also given to you. It implements a matrix with double values,
and stores the values of the matrix in a 2D double array m. It also stores the dimension of the matrix in a field dimension. This file includes two methods to multiply two matrices, one sequentially with triple for loops, and another (also sequentially) with the recursive divide-and-conquer algorithms. There is a method to compare if two matrices are equals.

A method multiplyParallely exists in the class Matrix but at the moment it calls the non-parallel version of multiplyRecursively. You should change this to appropriately invoke the parallel version of MatrixMultiplication.

You are still required to

### Running on sunfire

You need to get the code and submit your solution on cs2030-i per usual. The VM cs2030-i, however, has only one processor and therefore it is not much fun to run parallel programs on it.

Fortunately, the host sunfire, has 256 processors1. We will run your submitted solution on sunfire, and you can test your code on sunfire as well. You can ssh into sunfire just like cs2030-i.

On sunfire, you should be able to get a speed up of at least 70 (i.e., parallel version is 70x faster than the sequential version) when running with $n=11$.

Note that sunfire still uses Java 8. You won't need features from Java 9 or 10 in this lab in any case.

## Tips:

• Try with small matrices first. Make sure the code is correct before you go for the larger matrices.
• Try running locally on your own machine first to make sure the code is correct and does not throw exceptions. Optimize for speed up on your machine, before moving the code to sunfire to measure the final speed up.
• The method in LabFive that measures the time runs a given multiplication three times before taking the average. It can take around 100-120 seconds to multiply sequentially. When you debug for correctness and optimize for speed up for the parallel version, there is no need to multiply sequentially three times. Feel free to change this part of the code while you are debugging/optimizing your code.
• For matrices of dimension $2^{10}$ and $2^{11}$, you need to run java with the argument -Xmx<size> to increase the heap memory to size on sunfire. -Xmx1g increases the heap memory to up to 1GB, and should work well for both cases.
• If you grow impatient while waiting and want to stop the running process, type Control-C in your ssh window. You may have to wait up to a few seconds (depending on how slow is sunfire) for the process to stop.

## Challenges:

Challenges for this lab include:

• Finding the right fork threshold so that you gain the maximum speed up for a matrix of size $2^{11} \times 2^{11}$;
• Not spawning too many tasks that block, which will in turns lead to too many compensation threads being created in ForkJoinPool, and a RejectedExecutionException being thrown: Caused by: java.util.concurrent.RejectedExecutionException: Thread limit exceeded replacing blocked worker. The order of fork and join becomes important here.
• Not creating too many unnecessary copies of the matrices (you will ran out of heap space even with -Xmx1g
• Not letting multiple tasks update the same matrix in place. Such side effects may lead to a incorrect result.
• Waiting patiently for the multiplication to complete :)

We will still grade your code based on your implementation (not the resulting speed up), but as a guideline, using matrix of size $2^{11}$ as the benchmark on sunfire with 1GB of heap space, you get:

• 2 mark if you parallelize the code but achieve a slow down in performance.
• 3 marks if your speed up is aronud 1 - 30x
• 4 marks if your speed up is around 30x - 50x
• 5 marks if your speed up is around 50x - 60x
• 6 marks if your speed up is beyond 60x

You can get -0.5 mark deduction for a serious violation of style.

If you matrix multiplication output is incorrect or your code throws an exception, you will get deduction depending on how serious your bug is. Possible bugs, include:

• Incorrect order of fork and join (-3)
• Letting multiple tasks update the same location of the matrix at the same time (-2)
• Careless mistakes in referring to matrices or its indices (-1)

## Submission

When you are ready to submit your lab, on cs2030-i, run the script

 1 ~cs2030/submit5 

which will copy all files matching *.java (and nothing else) from your ~/lab5 directory on cs2030-i to an internal grading directory. We will test compile and test run with a tiny sample input to make sure that your submission is OK.

You can submit multiple times, but only the most recent submission will be graded.

May the fork()s be with you.

1. You can find out by calling Runtime.getRuntime().availableProcessors(); in Java.