124 changes: 118 additions & 6 deletions README.md
@@ -3,12 +3,124 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Yu Sun
* [LinkedIn](https://www.linkedin.com/in/yusun3/)
* Tested on: Windows 10, i7-6700HQ CPU @ 2.60GHz × 8, GeForce GTX 960M/PCIe/SSE2, 7.7GB Memory (Personal Laptop)

### (TODO: Your README)
## Introduction

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
In this project, stream compaction is implemented using the traditional CPU approach, a naive CUDA-based scan algorithm, and a work-efficient algorithm that performs the scan in place. This project can serve as a building block for many applications, such as a path tracer.

Stream compaction usually involves two steps: a scan followed by the compaction itself.
For example, given an array and some condition, we build a boolean array marking where the condition is met and perform an exclusive scan on it; the scan result gives, for each element that satisfies the condition, its index in the compacted output array.
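As a concrete (made-up) example, compacting the non-zero elements of a small array looks like this:

```
input:            [ 3 0 1 0 0 2 4 0 ]
map to booleans:  [ 1 0 1 0 0 1 1 0 ]    <- 1 where the element is non-zero
exclusive scan:   [ 0 1 1 2 2 2 3 4 ]    <- output index for each kept element
scatter:          [ 3 1 2 4 ]            <- idata[i] written to odata[scan[i]] where bool[i] == 1
```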

Below is a visualization of how scan and compaction work:
![](img/stream_scan.png)


For the CPU implementation, scan is simply a for loop that iterates over the elements and accumulates the output.
![](img/cpu.png)
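The loop itself is just a running sum (this mirrors the `cpu.cu` change further down in this diff; the function name here is illustrative):

```
// Exclusive scan on the CPU: each output element is the sum of all *previous* inputs.
void cpuExclusiveScan(int n, int *odata, const int *idata) {
    if (n <= 0) return;
    odata[0] = 0;                                 // identity element for addition
    for (int i = 1; i < n; i++) {
        odata[i] = odata[i - 1] + idata[i - 1];   // running prefix sum
    }
}
```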

For the naive GPU implementation, scan uses two ping-pong buffers, since updating the array in place would create read/write races between threads.
![](img/naive.png)
![](img/naive_al.png)
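A minimal sketch of one naive-scan step and its host loop, assuming the kernel and helper names below (they are illustrative, not the exact contents of `naive.cu`, which is not shown in this diff):

```
#include <utility>   // std::swap; blockSize (128) comes from common.h in this diff

// One step of the naive scan at stride `offset`:
//   out[k] = in[k - offset] + in[k]  for k >= offset, otherwise out[k] = in[k].
// Reading from `in` and writing to `out` (ping-pong buffers) avoids the
// read/write race that an in-place update would cause.
__global__ void kernNaiveScanStep(int n, int offset, int *out, const int *in) {
    int k = threadIdx.x + blockIdx.x * blockDim.x;
    if (k >= n) return;
    out[k] = (k >= offset) ? in[k - offset] + in[k] : in[k];
}

// Host side: launch log2(n) steps, swapping the two device buffers each time.
// The result is an inclusive scan; shifting it right by one slot and writing a
// 0 into the first position turns it into the exclusive scan the tests expect.
int* naiveScanHost(int n, int *dev_A, int *dev_B) {
    dim3 blocks((n + blockSize - 1) / blockSize);
    for (int offset = 1; offset < n; offset <<= 1) {
        kernNaiveScanStep<<<blocks, blockSize>>>(n, offset, dev_B, dev_A);
        std::swap(dev_A, dev_B);             // ping-pong: latest result is now in dev_A
    }
    return dev_A;                            // buffer holding the inclusive scan
}
```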

For the work-efficient GPU implementation, scan is performed with an up-sweep followed by a down-sweep over a balanced binary tree, which reduces the amount of computation significantly.
![](img/up.png)
![](img/down.png)
I tried to optimize the efficient algorithm by launching each iteration with the smallest grid dimension that covers the active threads (see the sketch below), but it is still slower than the naive implementation when shared memory is not used.
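A minimal sketch of the two sweep kernels (illustrative names; the array is assumed padded to a power of two). Giving each thread an index scaled by 2^(d+1) means only the active tree nodes at each depth get a thread, which is the per-iteration grid sizing mentioned above:

```
// Up-sweep (reduce) at depth d: each node at stride 2^(d+1) accumulates its left child.
__global__ void kernUpSweep(int n, int twoDp1, int *data) {      // twoDp1 = 2^(d+1)
    int k = (threadIdx.x + blockIdx.x * blockDim.x) * twoDp1;    // one thread per active node
    if (k + twoDp1 - 1 < n) {
        data[k + twoDp1 - 1] += data[k + twoDp1 / 2 - 1];
    }
}

// Down-sweep at depth d: after data[n - 1] has been zeroed, walk back down the tree,
// passing each node's value to its left child and adding the old left value to the right.
__global__ void kernDownSweep(int n, int twoDp1, int *data) {
    int k = (threadIdx.x + blockIdx.x * blockDim.x) * twoDp1;
    if (k + twoDp1 - 1 < n) {
        int left = data[k + twoDp1 / 2 - 1];
        data[k + twoDp1 / 2 - 1] = data[k + twoDp1 - 1];
        data[k + twoDp1 - 1] += left;
    }
}
```

The host loops the up-sweep from depth 0 to log2(n) - 1, writes a 0 into the last element, and then loops the down-sweep back from the top, shrinking the grid at each depth.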

For performance comparison, the built-in scan algorithm from the Thrust library is also included.
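For reference, a Thrust-based scan boils down to a single library call. This is a sketch, not the project's actual `thrust.cu` wrapper (which is not shown in this diff):

```
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>

// Exclusive scan via Thrust: the library chooses the kernel configuration itself.
void thrustExclusiveScan(int n, int *odata, const int *idata) {
    thrust::device_vector<int> dv_in(idata, idata + n);   // host -> device copy
    thrust::device_vector<int> dv_out(n);
    thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
    thrust::copy(dv_out.begin(), dv_out.end(), odata);    // device -> host copy
}
```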

I also implemented both algorithms with shared memory, instead of looping over kernel launches from the host, to see whether this speeds up the computation.
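The shared-memory version keeps the data in on-chip memory and performs both sweeps inside one kernel. Below is a minimal single-block sketch, assuming the block dimension is a power of two and at least n (names are illustrative, not the exact contents of `efficient_sm.cu`):

```
__global__ void kernScanSharedSingleBlock(int n, int *odata, const int *idata) {
    extern __shared__ int temp[];              // one entry per thread, sized at launch
    int tid = threadIdx.x;
    temp[tid] = (tid < n) ? idata[tid] : 0;    // pad the tail with the identity (0)
    __syncthreads();

    // Up-sweep (reduce) entirely in shared memory.
    for (int offset = 1; offset < blockDim.x; offset <<= 1) {
        int idx = (tid + 1) * offset * 2 - 1;
        if (idx < blockDim.x) temp[idx] += temp[idx - offset];
        __syncthreads();
    }

    // Clear the root, then down-sweep.
    if (tid == 0) temp[blockDim.x - 1] = 0;
    __syncthreads();
    for (int offset = blockDim.x >> 1; offset > 0; offset >>= 1) {
        int idx = (tid + 1) * offset * 2 - 1;
        if (idx < blockDim.x) {
            int left = temp[idx - offset];
            temp[idx - offset] = temp[idx];
            temp[idx] += left;
        }
        __syncthreads();
    }

    if (tid < n) odata[tid] = temp[tid];       // exclusive scan of the block
}
```

A launch would look like `kernScanSharedSingleBlock<<<1, blockSize, blockSize * sizeof(int)>>>(n, dev_out, dev_in)`. Scaling past one block requires scanning the per-block sums and adding them back, which is the limitation mentioned in the performance section below.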


## Performance Analysis @ BlockSize = 128
The performance of the three algorithms, with and without shared memory, is shown in the diagrams below.
![](img/pw2.png)
![](img/npt.png)

As can be seen from the diagrams, when the array size is small there is not much performance gain from using the GPU. It is also interesting that the work-efficient scan does not actually outperform the naive scan when shared memory is not used.

My reasoning for the first observation is that when the array size is small, the algorithm exposes very little parallelism. Global-memory reads are expensive, and there is not enough parallel work to hide that latency. It is not obvious from my plot, but based on this reasoning and on the shared-memory results, I would expect that as the array size grows, the memory latency is hidden by parallel computation and the GPU implementations become faster.

The second observation shows that a smarter algorithm can actually perform worse if it is not carefully implemented. The kernel has many divergent branches and idle threads, so the compute capacity of each block is far from fully used. Even after I tried to eliminate the idle threads by shrinking the grid dimension, it is still slower than the naive algorithm. Re-indexing the work with a stride so that active threads are contiguous and access consecutive memory addresses should resolve this, but I did not have time to implement it.

Also, the shared-memory implementations generally perform better than those without shared memory. The algorithm reuses intermediate results frequently, and keeping them in shared memory avoids many global-memory reads. However, one needs to be careful not to exhaust the resources on the GPU; for example, my current implementation does not work for array sizes larger than the block size.


## Output from the Test Program

```
****************
** SCAN TESTS **
****************
[ 2 32 25 35 9 44 38 24 8 2 35 4 24 ... 30 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.000395ms (std::chrono Measured)
[ 0 2 34 59 94 103 147 185 209 217 219 254 258 ... 694 724 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0ms (std::chrono Measured)
[ 0 2 34 59 94 103 147 185 209 217 219 254 258 ... 609 636 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.02448ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.023744ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.125024ms (CUDA Measured)
[ 0 2 34 59 94 103 147 185 209 217 219 254 258 ... 694 724 ]
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.098976ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.0272ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.011232ms (CUDA Measured)
passed
==== naive scan with shared memory, power of two ====
elapsed time: 0.01696ms (CUDA Measured)
passed
==== naive scan with shared memory, non-power-of-two ====
elapsed time: 0.008ms (CUDA Measured)
passed
==== efficient scan with shared memory, power of two ====
elapsed time: 0ms (CUDA Measured)
passed
==== efficient scan with shared memory, non-power-of-two ====
elapsed time: 0ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 1 1 3 1 3 3 1 1 2 3 1 1 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0ms (std::chrono Measured)
[ 1 1 1 3 1 3 3 1 1 2 3 1 1 ... 1 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0ms (std::chrono Measured)
[ 1 1 1 3 1 3 3 1 1 2 3 1 1 ... 3 1 ]
passed
==== cpu compact with scan ====
elapsed time: 0.000791ms (std::chrono Measured)
[ 1 1 1 3 1 3 3 1 1 2 3 1 1 ... 1 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 2.36675ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 1.8215ms (CUDA Measured)
[ 1 1 1 3 1 3 3 1 1 2 3 1 1 ... 3 1 ]
passed
Press any key to continue . . .
```

Binary file added img/cpu.png
Binary file added img/down.png
Binary file added img/naive.png
Binary file added img/naive_al.png
Binary file added img/npt.png
Binary file added img/pw2.png
Binary file added img/stream_scan.png
Binary file added img/up.png
39 changes: 35 additions & 4 deletions src/main.cpp
@@ -11,9 +11,11 @@
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include <stream_compaction/naive_sm.h>
#include <stream_compaction/efficient_sm.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
// If you increase the array size, also increase blockSize in the shared-memory implementations
const int SIZE = 1 << 7;
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -71,7 +73,7 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
@@ -95,6 +97,35 @@ int main(int argc, char* argv[]) {
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("naive scan with shared memory, power of two");
StreamCompaction::NaiveSM::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::NaiveSM::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
// printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("naive scan with shared memory, non-power-of-two");
StreamCompaction::NaiveSM::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::NaiveSM::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
// printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("efficient scan with shared memory, power of two");
StreamCompaction::EfficientSM::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::EfficientSM::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
// printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("efficient scan with shared memory, non-power-of-two");
StreamCompaction::EfficientSM::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::EfficientSM::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
// printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);


printf("\n");
printf("*****************************\n");
printf("** STREAM COMPACTION TESTS **\n");
@@ -144,7 +175,7 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
4 changes: 4 additions & 0 deletions stream_compaction/CMakeLists.txt
@@ -9,6 +9,10 @@ set(SOURCE_FILES
"efficient.cu"
"thrust.h"
"thrust.cu"
"naive_sm.cu"
"naive_sm.h"
"efficient_sm.cu"
"efficient_sm.h"
)

cuda_add_library(stream_compaction
10 changes: 8 additions & 2 deletions stream_compaction/common.cu
@@ -23,7 +23,10 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index < n){
bools[index] = (idata[index] != 0);
}
}

/**
@@ -32,7 +35,10 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index < n && bools[index]){
odata[indices[index]] = idata[index];
}
}

}
2 changes: 2 additions & 0 deletions stream_compaction/common.h
@@ -10,6 +10,8 @@
#include <chrono>
#include <stdexcept>

#define blockSize 128

#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

50 changes: 37 additions & 13 deletions stream_compaction/cpu.cu
@@ -1,15 +1,15 @@
#include <cstdio>
#include "cpu.h"

#include "common.h"
#include "common.h"

namespace StreamCompaction {
namespace CPU {
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}

/**
@@ -19,7 +19,10 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
if (n == 0 || idata == NULL || odata == NULL) { timer().endCpuTimer(); return; }
odata[0] = 0;
for (int i = 1; i < n; i++) odata[i] = odata[i - 1] + idata[i - 1];

timer().endCpuTimer();
}

@@ -29,10 +32,17 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
if (n == 0 || idata == NULL || odata == NULL) return 0;
timer().startCpuTimer();
int counter = 0;
for (int i = 0; i < n; i++){
if (idata[i] != 0){
odata[counter] = idata[i];
counter ++;
}
}
timer().endCpuTimer();
return -1;
return counter;
}

/**
@@ -41,10 +51,24 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
if (n == 0 || idata == NULL || odata == NULL) return 0;
timer().startCpuTimer();
int* position = new int[n];
for (int i = 0; i < n; i++){
odata[i] = (idata[i] != 0);
}
position[0] = 0;
for (int i = 1; i < n; i++) position[i] = position[i - 1] + odata[i - 1];

int k = 0;
for (int i = 0; i < n; i++){
if (idata[i] != 0){
k++;
odata[position[i]] = idata[i];
}
}
delete[] position;
timer().endCpuTimer();
return -1;
return k;
}
}
}