Large-scale neural data analysis with Spark
by Richard Alex Hofer (rhofer) and Owen Kahn (okahn)
Our goal for this project is to implement a distributed matrix inverse on Spark.
Note: this was not our original project.
Our writeup is available here
Our progress report is available here
Matrices are important. Many frameworks built on top of Spark provide distributed matrices and a few operations on them, but none of them appear to offer a matrix inverse, even though one can be very useful. This is likely a sign that the operation is very difficult to implement, and we probably should have paid more attention to that warning.
Matrix inverse does not distribute easily: computing any part of the inverse generally requires access to many parts of the original matrix. This communication makes the operation difficult to implement on distributed matrices, and it does not lend itself well to Spark primitives.
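To illustrate the communication pattern, here is a minimal single-machine sketch (not our Spark implementation) of blockwise inversion via the Schur complement, a standard way to decompose an inverse into block operations. Note how every output block depends on several blocks of the input, which is exactly what makes the operation communication-heavy when the blocks live on different workers.

```python
import numpy as np

def block_inverse(M, k):
    """Invert M via the Schur complement of its leading k-by-k block.

    Illustrative sketch: in a distributed setting each of the block
    products below could run on a different group of workers, but every
    output block still needs data from several parts of M -- the
    communication that makes the operation hard to express in Spark.
    """
    A, B = M[:k, :k], M[:k, k:]
    C, D = M[k:, :k], M[k:, k:]
    Ainv = np.linalg.inv(A)           # requires the A block to be invertible
    S = D - C @ Ainv @ B              # Schur complement of A in M
    Sinv = np.linalg.inv(S)           # requires S to be invertible
    top_left = Ainv + Ainv @ B @ Sinv @ C @ Ainv
    top_right = -Ainv @ B @ Sinv
    bot_left = -Sinv @ C @ Ainv
    return np.block([[top_left, top_right], [bot_left, Sinv]])
```

The two sub-inverses (`A` and `S`) can themselves be computed the same way, which is how blockwise schemes recurse down to tiles small enough to invert locally.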
We will test on Amazon EC2 because Thunder is easy to deploy there and it lets us try many different cluster configurations.
A correct implementation of distributed matrix inverse that is faster than using a single machine (which would be forced to read and write from disk).
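The single-machine comparison point can be sketched as follows. This is a hypothetical in-memory baseline (the function name and sizes are our own, not from the project code); for a matrix too large for RAM, the same call would have to stream blocks from disk, which is the regime where a distributed inverse could win.

```python
import time
import numpy as np

def single_machine_baseline(n, seed=0):
    """Time a dense n-by-n inverse on one machine (hypothetical baseline).

    For matrices that fit in RAM this is hard to beat; a distributed
    inverse only pays off once the matrix must spill to disk.
    """
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((n, n)) + n * np.eye(n)  # well conditioned
    start = time.perf_counter()
    Minv = np.linalg.inv(M)
    elapsed = time.perf_counter() - start
    return M, Minv, elapsed
```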
Ideally we would like to be competitive with the implementation provided by the paper we are working from.
As discussed in Resources, we will use Amazon EC2.
April 25
May 2
May 3
May 6
May 8
May 10
We have a correct implementation of inverse, but it falls apart when run on a larger cluster due to unidentified overhead internal to Spark. We are attempting to figure out why this occurs, but are having difficulty since we cannot find anyone doing a similar type of computation.