MatrixMultiplication
The objective of this simple tutorial is to illustrate a user's basic interaction with the AllScale API based on a well-known example code. The code presented in this tutorial consists of snippets extracted from the (not much longer) full implementations of the covered development steps, which can be obtained from the associated Matrix Multiplication tutorial sources directory.
This tutorial covers the following steps:
- Sequential Implementation - an initial reference version
- Fixing the Data Structures - adapting AllScale data structures
- Parallelization - distributing computation to parallel resources
- Tuning - exemplary performance improvement step
A basic, sequential C-style implementation of a matrix multiplication operation in C++ could be as simple as the following code example:
double** mm(double** a, double** b) {
    double** c = createMatrix();
    for(int i=0; i<N; ++i) {
        for(int j=0; j<N; ++j) {
            c[i][j] = 0;
            for(int k=0; k<N; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    return c;
}
A full version of this example can be obtained from here.
Although functionally correct, this version does not utilize C++ capabilities for abstracting data structures and automatically managing object lifetimes. This leads to a large amount of boilerplate code, as can be seen in the full code version, which skilled C++ developers would avoid through the proper utilization of containers.
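For illustration, the kind of allocation and cleanup boilerplate implied by the C-style version could look like the following sketch (createMatrix is the helper used in the snippet above; the companion freeMatrix and the exact details are assumptions, and the actual full version may differ):

double** createMatrix() {
    // allocate the row-pointer table and each individual row on the heap
    double** m = new double*[N];
    for(int i=0; i<N; ++i) {
        m[i] = new double[N];
    }
    return m;
}

void freeMatrix(double** m) {
    // every allocation has to be released manually again
    for(int i=0; i<N; ++i) {
        delete[] m[i];
    }
    delete[] m;
}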
Thus, a more advanced -- and much more compact -- version of the algorithm could look like this:
#include <array>
const int N = 100;
using Matrix = std::array<std::array<double,N>,N>;
Matrix operator*(const Matrix& a, const Matrix& b) {
    Matrix c;
    for(int i=0; i<N; ++i) {
        for(int j=0; j<N; ++j) {
            c[i][j] = 0;
            for(int k=0; k<N; ++k) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    return c;
}
A full version of this example can be obtained from here.
In this version, matrices are encoded through nested instantiations of the std::array container. This way, memory management is done automatically by the implementation of this container, which significantly reduces the amount of boilerplate code required for handling matrices, as can be seen by comparing the full code examples.
We will use this example code as the baseline reference to illustrate the process of parallelizing the given algorithm using the AllScale API.
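For reference, a minimal usage of this operator could look like the following sketch (the main function and the initialization are assumptions, not part of the tutorial sources); note that all three matrices are stack-allocated local objects, which relates directly to the issues discussed next:

int main() {
    Matrix a;            // N x N doubles, living on the stack
    Matrix b;
    // ... fill a and b with values ...
    Matrix c = a * b;    // invokes the operator defined above
    return 0;
}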
Although already much more compact than a C-style implementation of a matrix multiplication, the second version above still exhibits some subtle issues:
- the memory of matrices is allocated on the stack; thus, for larger matrices the stack limit is quickly exceeded, causing the execution to crash
- due to their internal representation, the contents of arrays need to be copied whenever they are passed by value as a parameter or returned from a function; while such copies are sometimes optimized out by the compiler, in general this may cause unnecessary computational work
Both issues can be addressed by placing the elements of a matrix on the heap instead of the stack, retaining only a pointer to their location on the stack. Consequently, only pointers need to be passed to and from function calls.
Unfortunately, the C++ standard library does not provide such a container, yet users may quite easily define one on their own by composing standard library constructs. However, since the need for a statically sized multi-dimensional Grid is a common case, the AllScale API offers an implementation of such a heap-located, statically sized, multi-dimensional array -- the StaticGrid.
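To illustrate the former, a hand-rolled heap-backed matrix composed from standard library constructs could look like the following sketch (the name HeapMatrix and its interface are purely hypothetical):

#include <array>
#include <memory>

const int N = 100;

class HeapMatrix {
    // the element storage lives on the heap; only the owning pointer
    // resides on the stack, making moves cheap
    std::unique_ptr<std::array<std::array<double,N>,N>> data =
        std::make_unique<std::array<std::array<double,N>,N>>();
public:
    double& operator()(int i, int j) { return (*data)[i][j]; }
    const double& operator()(int i, int j) const { return (*data)[i][j]; }
};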
The following code example illustrates the utilization of the StaticGrid:
#include "allscale/api/user/data/static_grid.h"
using namespace allscale::api::user;
const int N = 100;
using Matrix = data::StaticGrid<double,N,N>;
// computes the product of two matrices
Matrix operator*(const Matrix& a, const Matrix& b) {
Matrix c;
for(int i=0; i<N; ++i) {
for(int j=0; j<N; ++j) {
c[{i,j}] = 0;
for(int k=0; k<N; ++k) {
c[{i,j}] += a[{i,k}] * b[{k,j}];
}
}
}
return c;
}
A full version of this example can be obtained from here.
The StaticGrid<double,N,N> container constitutes a two-dimensional array of doubles allocated on the heap, making move operations O(1). The number of dimensions may thereby be freely chosen. For instance, StaticGrid<double,N> could be utilized as a vector of N elements, while StaticGrid<double,N,N,N> provides a cube of NxNxN doubles.
Note the subtle difference in addressing elements of a StaticGrid compared to the syntax used for nested std::arrays. While the array example requires two subscript operator ([]) calls, the StaticGrid enables access to its elements through a single call accepting a pair of indices.
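The difference can be summarized in a few lines (a small sketch; the variable names arr and grid are assumptions):

#include <array>
#include "allscale/api/user/data/static_grid.h"

using namespace allscale::api::user;

const int N = 100;

int main() {
    std::array<std::array<double,N>,N> arr{};
    data::StaticGrid<double,N,N> grid;

    arr[1][2] = 1.0;      // nested std::array: two subscript operator calls
    grid[{1,2}] = 1.0;    // StaticGrid: a single call taking a pair of indices
    return 0;
}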
Besides compensating for the issues of the std::array approach pointed out above, the switch to StaticGrid has one additional side effect: StaticGrid implements the Data Item concept, and may thus be automatically distributed throughout a distributed memory system if accessed by parallel code facilitating distributed memory parallelism. However, so far, our matrix multiplication code is still sequential.
The most direct way to parallelize our operation is to parallelize the outermost for loop by replacing it with a pfor loop. Fortunately, the AllScale API provides such a construct, so we can do so by writing:
#include "allscale/api/user/algorithm/pfor.h"
#include "allscale/api/user/data/static_grid.h"
using namespace allscale::api::user;
using namespace allscale::api::user::algorithm;
const int N = 100;
using Matrix = data::StaticGrid<double,N,N>;
// computes the product of two matrices
Matrix operator*(const Matrix& a, const Matrix& b) {
Matrix c;
// in parallel, for each resulting element ...
pfor(0,N,[&](int i) {
for(int j=0; j<N; ++j) {
c[{i,j}] = 0;
for(int k=0; k<N; ++k) {
c[{i,j}] += a[{i,k}] * b[{k,j}];
}
}
});
return c;
}
A full version of this example can be obtained from here.
Note the additional include of the pfor.h header, which provides the implementation of the pfor operator replacing the outermost for loop. The original body of the loop is passed in the form of a lambda to the pfor, which invokes it for each iterator value i between 0 and N.
That's it! The code is now parallel. No additional work is required; in particular, no setup of thread groups or devices is needed. Furthermore, pfors or any other parallel constructs of the AllScale API may be nested, enabling their free utilization within libraries. pfor will automatically fall back to a sequential implementation if nested in parallel code that already exhibits enough parallelism to saturate a system's concurrent resources.
Naturally, there are various ways of improving the performance of computing the product of matrices. To complete this tutorial, we would like to show one of them: a variation of tiling.
Currently, the outermost loop of the matrix multiplication example is parallelized. However, the middle loop may also be parallelized, enabling an arbitrary ordering of computation steps across both iteration spaces. To express this, both loops might be parallelized using a pfor as illustrated above. However, this would introduce unnecessary barriers to be waited for at the end of each inner parallel loop.
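For comparison, such a nested variant could look like the following sketch (not part of the tutorial sources; it assumes the same includes and the Matrix alias of the previous example):

Matrix operator*(const Matrix& a, const Matrix& b) {
    Matrix c;
    // both outer loops parallelized by nesting pfor constructs; note that
    // each inner pfor implies a barrier for every iteration of the outer loop
    pfor(0,N,[&](int i) {
        pfor(0,N,[&](int j) {
            c[{i,j}] = 0;
            for(int k=0; k<N; ++k) {
                c[{i,j}] += a[{i,k}] * b[{k,j}];
            }
        });
    });
    return c;
}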
Fortunately, pfor supports the definition of iteration spaces of higher dimensions, enabling us to write:
#include "allscale/api/user/algorithm/pfor.h"
#include "allscale/api/user/data/static_grid.h"
using namespace allscale::api::user;
using namespace allscale::api::user::algorithm;
const int N = 100;
using Matrix = data::StaticGrid<double,N,N>;
using Point = allscale::utils::Vector<int,2>;
Matrix operator*(const Matrix& a, const Matrix& b) {
Matrix c;
// in parallel, for each resulting element ...
pfor(Point{N,N},[&](const Point& p) {
c[p] = 0;
for(int k=0; k<N; ++k) {
c[p] += a[{p.x,k}] * b[{k,p.y}];
}
});
return c;
}
A full version of this example can be obtained from here.
In this version, the two outer loops are collapsed into a single, 2-dimensional for loop covering the Cartesian product of the iteration spaces of the loops it replaces. Furthermore, the iteration space will not be processed by linearly scanning through it, as the original loops did, but by recursively sub-dividing it along alternating dimensions -- leading to a tiling-like localization of data accesses.
Part of the AllScale project - http://www.allscale.eu