Name: Heat Simulation
Contact Person: OmpSs@FPGA Team, [email protected]
License Agreement: GPL 3.0
Platform: OmpSs@FPGA+IMP+OMPIF
The Heat benchmark simulates the propagation of heat on a 2D surface following this formula on each position of the matrix:

$$A^{t+1}_{i,j} = \frac{1}{4}\left(A^{t+1}_{i-1,j} + A^{t+1}_{i,j-1} + A^{t}_{i+1,j} + A^{t}_{i,j+1}\right)$$

$A$ is the matrix that holds the temperature of each point of the surface, with dimensions equal to the surface size plus the padding rows and columns described below.
The left figure shows the initial matrix with the padding rows and columns. Blue means temperature 0, and the redder the color, the higher the temperature. Thus, initially the matrix is set to 0, and the padding rows and columns hold the initial boundary values, with temperatures different from 0. In the right image we see an animation of the heat propagation over time. Every frame shows the matrix at a different timestep. Both images use different resolutions to improve visualization: the left one is 32x32, while the animation is 256x256.
This benchmark uses a Gauss-Seidel method, which means the same matrix is used as input and output. Therefore, in order to parallelize this application we have to look at the dependencies the sequential version creates. As the formula shows, each position is calculated with the value of the previous step for the right and down neighbors, and the value of the current step for the up and left neighbors.
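For reference, this is a minimal sequential sketch of one Gauss-Seidel time step (the function and variable names are illustrative, not taken from the benchmark sources):

```cpp
// One Gauss-Seidel time step over an n x m surface stored row-major with one
// padding row/column on each side (row size m + 2). The same array is read and
// written, so the up/left neighbors already hold the new step while the
// down/right neighbors still hold the previous one.
void gauss_seidel_step(double *A, int n, int m) {
    const int row = m + 2;
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= m; ++j)
            A[i * row + j] = 0.25 * (A[(i - 1) * row + j]     // up    (current step)
                                   + A[i * row + (j - 1)]     // left  (current step)
                                   + A[(i + 1) * row + j]     // down  (previous step)
                                   + A[i * row + (j + 1)]);   // right (previous step)
}
```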
This creates a pattern of dependencies that acts as a wave. First, only the top-leftmost element can be updated, then, its right and down neighbors, and so on. The following video shows this effect with several steps overlapping:
The numbers show the anti-diagonal identifier. All elements of the same anti-diagonal can always be updated in parallel. The colors represent the elements that are updated at each step. Therefore, at any given time, only the even or the odd anti-diagonals can be updated in parallel, and each anti-diagonal updates the values of a different step.
There is only one kernel, which sequentially updates the positions of a fixed-size block of the matrix. Therefore, the same wave pattern applies to the tasks: a task depends on the four tasks that update the neighbor blocks.
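As a hedged sketch of how such dependencies can be expressed (the pragma syntax and the block addressing below are illustrative, not the benchmark's exact annotations), one step could create the block tasks like this:

```cpp
constexpr int BS = 256;                        // block size
void computeBlock(double *block, int rowSize); // FPGA task (see computeBlock.cpp)

// One time step: a task per block, using the first element of each block as the
// representative address for the whole block. Boundary checks omitted for brevity.
void solveStep(double *A, int rowSize, int nBlocksY, int nBlocksX) {
    for (int bi = 0; bi < nBlocksY; ++bi)
        for (int bj = 0; bj < nBlocksX; ++bj) {
            double *self  = &A[(bi * BS) * rowSize + bj * BS];
            double *up    = self - BS * rowSize;
            double *down  = self + BS * rowSize;
            double *left  = self - BS;
            double *right = self + BS;
            // The task waits for the up/left blocks of the current step and the
            // down/right blocks of the previous step, reproducing the wave.
            #pragma oss task inout(self[0]) in(up[0]) in(left[0]) in(down[0]) in(right[0])
            computeBlock(self, rowSize);
        }
}
```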
This one is very similar to classic MPI implementations of the Gauss-Seidel Heat. The idea is to distribute the matrix evenly across all ranks by consecutive rows. Then, we add a single extra row at the top and bottom to hold the values calculated by the other ranks. Therefore, before executing a task, the rank may need a row with the values of the current or previous iteration. If these are calculated by another rank, there is a data movement using OMPIF_Send/Recv. We can see this better in the following image:
Here we see how we split the matrix into three segments and distribute each segment to a different rank. Then, before starting the current step, every rank needs rows from its bottom neighbor. Each OMPIF_Send/Recv only sends the row of a block, therefore there are as many tasks as blocks in a single row. After the data exchange, the top rank can start, and as soon as its blocks are updated, it can send the updated rows to the bottom neighbor.
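A hedged sketch of what the per-block exchange with the rank below could look like (the OMPIF_Send/OMPIF_Recv argument lists, the tag scheme and the indexing are assumptions for illustration, not the benchmark's exact API usage):

```cpp
constexpr int BS = 256; // block size, as before

// Exchange with the rank below: one send and one receive task per block of a row.
// 'ownedRows' is the number of matrix rows assigned to this rank (halo excluded).
void exchangeWithBottomRank(double *A, int rowSize, int ownedRows,
                            int nBlocksX, int bottomRank) {
    for (int bj = 0; bj < nBlocksX; ++bj) {
        double *bottomHalo = &A[(ownedRows + 1) * rowSize + bj * BS];
        double *lastRow    = &A[ownedRows * rowSize + bj * BS];
        // Before the step: receive the first row of the bottom neighbor
        // (previous-step values needed by the down neighbors of the last row).
        #pragma oss task out(bottomHalo[0;BS])
        OMPIF_Recv(bottomHalo, BS * sizeof(double), bottomRank, /*tag*/ bj);
        // As soon as the block above it is updated: send the new last row down.
        #pragma oss task in(lastRow[0;BS])
        OMPIF_Send(lastRow, BS * sizeof(double), bottomRank, /*tag*/ nBlocksX + bj);
    }
}
```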
Although the formula to update the matrix is simple, you will see that the file `computeBlock.cpp` is far from understandable, at least on a first read. The code just implements the update of a single 256x256 block. However, in order to get some performance we have to make the code a bit messy. Everything important is in the function `computeBlock_moved`.
You will see this variable declaration:

```cpp
constexpr int ITER_LAT = 76;
constexpr int ROWPADDING = 1;
constexpr int LPADDING = 64 / sizeof(double);
constexpr int RPADDING = 1;

double local[BS + ROWPADDING*2][BS + LPADDING + RPADDING];
```
`local` stores the block in BRAM, which is much faster to access and lets you configure the number of ports for parallel reads/writes. `BS` is the block size, in this case 256. However, `local` has extra rows and columns, defined by the `ROWPADDING`, `LPADDING` and `RPADDING` constants. The padding is not symmetric due to how the code accesses memory.
A common technique to increase memory bandwidth is to use a high bit-width data bus to read and write memory, so we can read/write multiple data elements in parallel. For example, in the Heat case we use a 256-bit port (the width of the Alveo U55C HBM), which is equivalent to loading 4 doubles at a time. However, to do that we access memory through a pointer to an `ap_uint<256>` type, therefore the addresses must be aligned to 32 bytes. This is why we add a padding of 64 bytes on the left and right columns of the matrix, so we can support memory bus widths up to 512 bits (the usual case of FPGA DDR4 memories). For the `local` array, we only need to add 64 bytes of padding in the left column, because we read by rows and there is no alignment constraint for BRAMs.
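As a minimal sketch of the idea (assuming a 256-bit port and 64-bit doubles; the helper name is made up, not taken from computeBlock.cpp), this is the kind of unpacking done for every wide word read from memory:

```cpp
#include <ap_int.h>
#include <cstdint>
#include <cstring>

// Unpack the 4 doubles carried by one 256-bit memory word.
void unpackWord(ap_uint<256> word, double out[4]) {
    for (int k = 0; k < 4; ++k) {
        #pragma HLS unroll
        ap_uint<64> bits = word.range(64 * k + 63, 64 * k); // k-th 64-bit lane
        uint64_t raw = bits.to_uint64();
        double v;
        std::memcpy(&v, &raw, sizeof(v)); // reinterpret the raw bits as a double
        out[k] = v;
    }
}
```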
Both the matrix and `local` are stored by consecutive rows, but the row size of the matrix (set by the user at runtime) is bigger than that of `local` (256). This prevents us from doing a single consecutive copy when loading a block into `local`.
The next image shows the memory layout of both `local` and the input matrix. The left figure is the block stored in `local`, while the right figure is the whole matrix. The orange rows are the contents of the block that will be updated by the task. The green rows are part of the padding of the matrix, while the blue rows are part of the padding of `local` but not of the matrix, thus they are updated by another task. In this example, the code will do 9 loads, one for each row including padding, and after updating the block, it will store 7 rows without padding.
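To make that structure concrete, here is a simplified sketch of the load/update/store skeleton of a block task (plain `double` accesses and made-up indexing for readability; the real code goes through the wide `ap_uint<256>` port and deals with the alignment described above):

```cpp
constexpr int BS = 256;
constexpr int ROWPADDING = 1;
constexpr int LPADDING = 64 / sizeof(double);
constexpr int RPADDING = 1;

// firstRow/firstCol: position of the block inside the matrix (halo included);
// rowSize: row size of the matrix, set by the user at runtime.
void computeBlockSketch(double *matrix, int firstRow, int firstCol, int rowSize) {
    // Block buffer (BRAM in the real accelerator).
    double local[BS + ROWPADDING * 2][BS + LPADDING + RPADDING];

    // Load the block plus one halo row above and below and one halo column on
    // each side: BS + 2*ROWPADDING row loads in total.
    for (int r = 0; r < BS + 2 * ROWPADDING; ++r)
        for (int c = 0; c < BS + 2; ++c)
            local[r][LPADDING - 1 + c] = matrix[(firstRow + r) * rowSize + firstCol + c];

    // ... anti-diagonal update of local (explained below) ...

    // Store back only the BS rows owned by this task: the halo rows belong to
    // the matrix padding or to neighbor tasks.
    for (int r = 0; r < BS; ++r)
        for (int c = 0; c < BS; ++c)
            matrix[(firstRow + ROWPADDING + r) * rowSize + firstCol + 1 + c] =
                local[ROWPADDING + r][LPADDING + c];
}
```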
Now let's see what `ITER_LAT` is about. Up until now we have seen how a block is loaded from memory into the `local` BRAMs, and stored back to memory. The update part is divided into three loops: `upper_triangle_no_collapse`, `middle_collapse` and `lower_triangle_no_collapse`.
The main idea to get performance is to pipeline the loop that updates positions, and unroll it if possible so we can update several positions every cycle.
However, we have seen in the description section that there are many dependencies between neighbor elements, so a simple double loop won't work.
To simplify the code, the following examples do not use padding and use square matrices (which is true in our case because the blocks are always square), so the sequential update looks like this:

```cpp
for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
        A[i][j] = ...
```
If we try to pipeline the innermost loop, we will get a carried-dependency warning, because the results of one iteration are needed by the next one, preventing the loop from reaching II=1. To fix that, we need to traverse the block in anti-diagonal order, since all elements in the same anti-diagonal can be updated in parallel:
```cpp
upper_triangle:
for (int d = 1; d <= n; ++d)
    for (int i = d-1, j = 0; i >= 0 || j < d; --i, ++j)
        A[i][j] = ...

lower_triangle:
for (int d = 1; d < n; ++d)
    for (int i = n-1, j = d; i >= d || j < n; --i, ++j)
        A[i][j] = ...
```
This image shows how the upper (red) and lower (green) triangles are formed by the anti-diagonals. The numbers show the anti-diagonal index (`d` in the code). The first loop updates all the anti-diagonals of the upper triangle (including the main anti-diagonal, when `d == n`), and the second one updates the rest. Both traversals can be merged into a single loop nest over all the anti-diagonals:
```cpp
for (int d = 1; d < 2*n; ++d)
    for (int i = d <= n ? d-1 : n-1,
             j = d <= n ? 0 : d-n;
         i >= 0 && j < n; --i, ++j)
        A[i][j] = ...
```
Now we can pipeline the innermost `ij` loop with II=1; however, this loop can't be flattened automatically because it is not a trivial loop (the iteration count is not constant). Thus, between two iterations of the outermost `d` loop we have to pay the latency of the `ij` loop.
We can flatten both loops manually into one, however there is a dependency between iterations that belong to different anti-diagonals. The pipelined loop has a latency `L`: the result of an iteration is only available `L` cycles after it starts. On a certain anti-diagonal, every element needs the up and left values of the previous anti-diagonal, whose iterations started roughly one diagonal length of cycles earlier. This image shows the starting cycle of each iteration for diagonal 3 of the upper triangle and diagonal 2 of the lower triangle. We can see that, in the first case, at cycle 3 we need the value of an element of the previous diagonal that has not been written back yet, because fewer than `L` cycles have passed since its iteration started. The `ITER_LAT` condition in the code selects the anti-diagonals that are long enough to be processed safely by the manually flattened loop, while the shorter ones at the beginning and at the end are handled separately; this is why there are three loops.
In summary, the following image shows which parts of the block are updated by each loop.
However, there is one last plot twist: we can in fact manually unroll the loops to double the performance. This is a tradeoff, because if we calculate two elements per cycle, the number of anti-diagonals we can calculate in the collapsed loop gets reduced, since `ITER_LAT` becomes `2L` instead of `L`. Whether the unroll pays off depends on the actual latency `L` of the update and on the block size.
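A back-of-the-envelope way to see where the `2L` comes from (an informal argument under the assumptions above, not an analysis taken from the code): with one element per cycle, an element of anti-diagonal $d$ starts about $\mathrm{len}(d)$ cycles after the neighbor it reads on diagonal $d-1$, so the flattened loop is safe when

$$\mathrm{len}(d) \ge L.$$

If each pipelined iteration updates $U$ elements of the diagonal, that distance shrinks to about $\mathrm{len}(d)/U$ cycles, so the condition becomes

$$\mathrm{len}(d) \ge U \cdot L, \qquad \text{hence } \mathrm{ITER\_LAT} = 2L \text{ for } U = 2.$$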
Sadly, there is no official support in the clang compiler for OMPIF and IMP, so we have to split the FPGA and the host parts manually. The host code is under `src`. You will see a `solver.c` and a `solver_dummy.c`. The first is just the sequential version of the Heat benchmark, and the second is used to generate the executable, since the compiler needs the definition of a task even if it is not a CPU task. At execution time, the code will search for the FPGA accelerators, which are implemented in HLS under the `hls` directory.
So, first you need an OmpSs@FPGA installation with broadcaster support in both Nanos6 and clang. Also, you need the AIT version that implements the Ethernet subsystem and OMPIF support. For the moment these versions are neither official nor public; you can ask for access at [email protected].
In summary, to compile the host executable run `make`. To generate the bitstream run `make ait`. By default, it generates the bitstream starting from the HLS phase, but you can change the start and finish phases with the Makefile variables `TO_STEP` and `FROM_STEP`.
Besides that, there are other variables:
- FPGA_CLOCK: Frequency in MHz at which the accelerators will run.
- FPGA_MEMPORT_WIDTH: Data bit-width of the memory port for all the accelerators. More bit-width may provide more bandwidth (depending on the FPGA memory path), at the cost of more resource usage. IMPORTANT: This variable is intended to be used in the task pragma. Since there is no compiler support, if you want to change the port width you have to modify the HLS code in `computeBlock.cpp`. By default it is 256.
- BLOCK_SIZE: Size of one of the dimensions of the block matrices, which are square. IMPORTANT: The block size affects both the host and the HLS code, so if you modify the variable, you have to apply the changes manually in both `solve.cpp` and `computeBlock.cpp`.
- NUM_CBLOCKS: Number of accelerators that compute a single block. IMPORTANT: If you want to change this variable, change the `num_instances` field of the `ait_extracted.json` file. Usually this file is generated by clang, but since we do not have that support with OMPIF or IMP, we have to modify the file manually.