124 changes: 118 additions & 6 deletions README.md
@@ -3,12 +3,124 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Yu Sun
* [LinkedIn](https://www.linkedin.com/in/yusun3/)
* Tested on: Windows 10, i7-6700HQ CPU @ 2.60GHz × 8, GeForce GTX 960M/PCIe/SSE2, 7.7GB Memory (Personal Laptop)

### (TODO: Your README)
## Introduction

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
In this project, stream compaction is implemented using the traditional CPU approach, a naive CUDA-based scan algorithm, and a work-efficient algorithm that performs the scan in place. This project can serve as a building block for many applications, such as a path tracer.

Stream compaction usually involves two steps: a scan followed by the compaction itself.
For example, given an array and some condition, we build a boolean array marking where the condition is met and perform an exclusive scan on it; the scan result gives, for each element that satisfies the condition, its index in the compacted output array.
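As a concrete (made-up) example, compacting the non-zero elements of a small array looks like this:

```
input:            [ 3 0 1 0 0 2 4 0 ]
map to booleans:  [ 1 0 1 0 0 1 1 0 ]    <- 1 where the element is non-zero
exclusive scan:   [ 0 1 1 2 2 2 3 4 ]    <- output index for each kept element
scatter:          [ 3 1 2 4 ]            <- idata[i] written to odata[scan[i]] where bool[i] == 1
```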

Below is a visualization of how scan and compaction work:
![](img/stream_scan.png)


For the CPU implementation, scan is simply a for loop that iterates over the elements and accumulates the output.
![](img/cpu.png)
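The loop itself is just a running sum (this mirrors the `cpu.cu` change further down in this diff; the function name here is illustrative):

```
// Exclusive scan on the CPU: each output element is the sum of all *previous* inputs.
void cpuExclusiveScan(int n, int *odata, const int *idata) {
    if (n <= 0) return;
    odata[0] = 0;                                 // identity element for addition
    for (int i = 1; i < n; i++) {
        odata[i] = odata[i - 1] + idata[i - 1];   // running prefix sum
    }
}
```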

For the naive GPU implementation, scan uses two ping-pong buffers, since updating the array in place would create read/write races between threads.
![](img/naive.png)
![](img/naive_al.png)
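A minimal sketch of one naive-scan step and its host loop, assuming the kernel and helper names below (they are illustrative, not the exact contents of `naive.cu`, which is not shown in this diff):

```
#include <utility>   // std::swap; blockSize (128) comes from common.h in this diff

// One step of the naive scan at stride `offset`:
//   out[k] = in[k - offset] + in[k]  for k >= offset, otherwise out[k] = in[k].
// Reading from `in` and writing to `out` (ping-pong buffers) avoids the
// read/write race that an in-place update would cause.
__global__ void kernNaiveScanStep(int n, int offset, int *out, const int *in) {
    int k = threadIdx.x + blockIdx.x * blockDim.x;
    if (k >= n) return;
    out[k] = (k >= offset) ? in[k - offset] + in[k] : in[k];
}

// Host side: launch log2(n) steps, swapping the two device buffers each time.
// The result is an inclusive scan; shifting it right by one slot and writing a
// 0 into the first position turns it into the exclusive scan the tests expect.
int* naiveScanHost(int n, int *dev_A, int *dev_B) {
    dim3 blocks((n + blockSize - 1) / blockSize);
    for (int offset = 1; offset < n; offset <<= 1) {
        kernNaiveScanStep<<<blocks, blockSize>>>(n, offset, dev_B, dev_A);
        std::swap(dev_A, dev_B);             // ping-pong: latest result is now in dev_A
    }
    return dev_A;                            // buffer holding the inclusive scan
}
```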

For the work-efficient GPU implementation, scan is performed with an up-sweep followed by a down-sweep over a balanced binary tree, which reduces the amount of computation significantly.
![](img/up.png)
![](img/down.png)
I tried to optimize the efficient algorithm by launching each iteration with the smallest grid dimension that covers the active threads (see the sketch below), but it is still slower than the naive implementation when shared memory is not used.
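A minimal sketch of the two sweep kernels (illustrative names; the array is assumed padded to a power of two). Giving each thread an index scaled by 2^(d+1) means only the active tree nodes at each depth get a thread, which is the per-iteration grid sizing mentioned above:

```
// Up-sweep (reduce) at depth d: each node at stride 2^(d+1) accumulates its left child.
__global__ void kernUpSweep(int n, int twoDp1, int *data) {      // twoDp1 = 2^(d+1)
    int k = (threadIdx.x + blockIdx.x * blockDim.x) * twoDp1;    // one thread per active node
    if (k + twoDp1 - 1 < n) {
        data[k + twoDp1 - 1] += data[k + twoDp1 / 2 - 1];
    }
}

// Down-sweep at depth d: after data[n - 1] has been zeroed, walk back down the tree,
// passing each node's value to its left child and adding the old left value to the right.
__global__ void kernDownSweep(int n, int twoDp1, int *data) {
    int k = (threadIdx.x + blockIdx.x * blockDim.x) * twoDp1;
    if (k + twoDp1 - 1 < n) {
        int left = data[k + twoDp1 / 2 - 1];
        data[k + twoDp1 / 2 - 1] = data[k + twoDp1 - 1];
        data[k + twoDp1 - 1] += left;
    }
}
```

The host loops the up-sweep from depth 0 to log2(n) - 1, writes a 0 into the last element, and then loops the down-sweep back from the top, shrinking the grid at each depth.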

For performance comparison, the built-in scan algorithm from the Thrust library is also included.
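For reference, a Thrust-based scan boils down to a single library call. This is a sketch, not the project's actual `thrust.cu` wrapper (which is not shown in this diff):

```
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>

// Exclusive scan via Thrust: the library chooses the kernel configuration itself.
void thrustExclusiveScan(int n, int *odata, const int *idata) {
    thrust::device_vector<int> dv_in(idata, idata + n);   // host -> device copy
    thrust::device_vector<int> dv_out(n);
    thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
    thrust::copy(dv_out.begin(), dv_out.end(), odata);    // device -> host copy
}
```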

I also implemented both algorithms with shared memory, instead of looping over kernel launches from the host, to see whether this speeds up the computation.
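The shared-memory version keeps the data in on-chip memory and performs both sweeps inside one kernel. Below is a minimal single-block sketch, assuming the block dimension is a power of two and at least n (names are illustrative, not the exact contents of `efficient_sm.cu`):

```
__global__ void kernScanSharedSingleBlock(int n, int *odata, const int *idata) {
    extern __shared__ int temp[];              // one entry per thread, sized at launch
    int tid = threadIdx.x;
    temp[tid] = (tid < n) ? idata[tid] : 0;    // pad the tail with the identity (0)
    __syncthreads();

    // Up-sweep (reduce) entirely in shared memory.
    for (int offset = 1; offset < blockDim.x; offset <<= 1) {
        int idx = (tid + 1) * offset * 2 - 1;
        if (idx < blockDim.x) temp[idx] += temp[idx - offset];
        __syncthreads();
    }

    // Clear the root, then down-sweep.
    if (tid == 0) temp[blockDim.x - 1] = 0;
    __syncthreads();
    for (int offset = blockDim.x >> 1; offset > 0; offset >>= 1) {
        int idx = (tid + 1) * offset * 2 - 1;
        if (idx < blockDim.x) {
            int left = temp[idx - offset];
            temp[idx - offset] = temp[idx];
            temp[idx] += left;
        }
        __syncthreads();
    }

    if (tid < n) odata[tid] = temp[tid];       // exclusive scan of the block
}
```

A launch would look like `kernScanSharedSingleBlock<<<1, blockSize, blockSize * sizeof(int)>>>(n, dev_out, dev_in)`. Scaling past one block requires scanning the per-block sums and adding them back, which is the limitation mentioned in the performance section below.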


## Performance Analysis @ BlockSize = 128
The performance of the three algorithms, with and without shared memory, is shown in the diagrams below.
![](img/pw2.png)
![](img/npt.png)

As can be seen from the diagrams, when the array size is small there is not much performance gain from using the GPU. It is also interesting that the work-efficient scan does not actually outperform the naive scan when shared memory is not used.

My reasoning for the first observation is that when the array size is small, the algorithm exposes very little parallelism. Global-memory reads are expensive, and there is not enough parallel work to hide that latency. It is not obvious from my plot, but based on this reasoning and on the shared-memory results, I would expect that as the array size grows, the memory latency is hidden by parallel computation and the GPU implementations become faster.

The second observation shows that a smarter algorithm can actually perform worse if it is not carefully implemented. The kernel has many divergent branches and idle threads, so the compute capacity of each block is far from fully used. Even after I tried to eliminate the idle threads by shrinking the grid dimension, it is still slower than the naive algorithm. Re-indexing the work with a stride so that active threads are contiguous and access consecutive memory addresses should resolve this, but I did not have time to implement it.

Also, the shared-memory implementations generally perform better than those without shared memory. The algorithm reuses intermediate results frequently, and keeping them in shared memory avoids many global-memory reads. However, one needs to be careful not to exhaust the resources on the GPU; for example, my current implementation does not work for array sizes larger than the block size.


## Output from the Test Program

```
****************
** SCAN TESTS **
****************
[ 2 32 25 35 9 44 38 24 8 2 35 4 24 ... 30 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.000395ms (std::chrono Measured)
[ 0 2 34 59 94 103 147 185 209 217 219 254 258 ... 694 724 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0ms (std::chrono Measured)
[ 0 2 34 59 94 103 147 185 209 217 219 254 258 ... 609 636 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.02448ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.023744ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.125024ms (CUDA Measured)
[ 0 2 34 59 94 103 147 185 209 217 219 254 258 ... 694 724 ]
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.098976ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.0272ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.011232ms (CUDA Measured)
passed
==== naive scan with shared memory, power of two ====
elapsed time: 0.01696ms (CUDA Measured)
passed
==== naive scan with shared memory, non-power-of-two ====
elapsed time: 0.008ms (CUDA Measured)
passed
==== efficient scan with shared memory, power of two ====
elapsed time: 0ms (CUDA Measured)
passed
==== efficient scan with shared memory, non-power-of-two ====
elapsed time: 0ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 1 1 3 1 3 3 1 1 2 3 1 1 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0ms (std::chrono Measured)
[ 1 1 1 3 1 3 3 1 1 2 3 1 1 ... 1 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0ms (std::chrono Measured)
[ 1 1 1 3 1 3 3 1 1 2 3 1 1 ... 3 1 ]
passed
==== cpu compact with scan ====
elapsed time: 0.000791ms (std::chrono Measured)
[ 1 1 1 3 1 3 3 1 1 2 3 1 1 ... 1 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 2.36675ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 1.8215ms (CUDA Measured)
[ 1 1 1 3 1 3 3 1 1 2 3 1 1 ... 3 1 ]
passed
Press any key to continue . . .
```

Binary file added img/cpu.png
Binary file added img/down.png
Binary file added img/naive.png
Binary file added img/naive_al.png
Binary file added img/npt.png
Binary file added img/pw2.png
Binary file added img/stream_scan.png
Binary file added img/up.png
39 changes: 35 additions & 4 deletions src/main.cpp
@@ -11,9 +11,11 @@
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include <stream_compaction/naive_sm.h>
#include <stream_compaction/efficient_sm.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
// If you increase the array size, also increase blockSize in the shared-memory implementations
const int SIZE = 1 << 7;
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -71,7 +73,7 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
@@ -95,6 +97,35 @@ int main(int argc, char* argv[]) {
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("naive scan with shared memory, power of two");
StreamCompaction::NaiveSM::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::NaiveSM::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
// printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("naive scan with shared memory, non-power-of-two");
StreamCompaction::NaiveSM::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::NaiveSM::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
// printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("efficient scan with shared memory, power of two");
StreamCompaction::EfficientSM::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::EfficientSM::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
// printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("efficient scan with shared memory, non-power-of-two");
StreamCompaction::EfficientSM::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::EfficientSM::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
// printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);


printf("\n");
printf("*****************************\n");
printf("** STREAM COMPACTION TESTS **\n");
@@ -144,7 +175,7 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
4 changes: 4 additions & 0 deletions stream_compaction/CMakeLists.txt
@@ -9,6 +9,10 @@ set(SOURCE_FILES
"efficient.cu"
"thrust.h"
"thrust.cu"
"naive_sm.cu"
"naive_sm.h"
"efficient_sm.cu"
"efficient_sm.h"
)

cuda_add_library(stream_compaction
10 changes: 8 additions & 2 deletions stream_compaction/common.cu
@@ -23,7 +23,10 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index < n){
bools[index] = (idata[index] != 0);
}
}

/**
@@ -32,7 +35,10 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index < n && bools[index]){
odata[indices[index]] = idata[index];
}
}

}
2 changes: 2 additions & 0 deletions stream_compaction/common.h
@@ -10,6 +10,8 @@
#include <chrono>
#include <stdexcept>

#define blockSize 128

#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

50 changes: 37 additions & 13 deletions stream_compaction/cpu.cu
@@ -1,15 +1,15 @@
#include <cstdio>
#include "cpu.h"

#include "common.h"
#include "common.h"

namespace StreamCompaction {
namespace CPU {
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}

/**
@@ -19,7 +19,10 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
if (n == 0 || idata == NULL || odata == NULL) { timer().endCpuTimer(); return; }
odata[0] = 0;
for (int i = 1; i < n; i++) odata[i] = odata[i - 1] + idata[i - 1];

timer().endCpuTimer();
}

@@ -29,10 +32,17 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
if (n == 0 || idata == NULL || odata == NULL) return 0;
timer().startCpuTimer();
int counter = 0;
for (int i = 0; i < n; i++){
if (idata[i] != 0){
odata[counter] = idata[i];
counter ++;
}
}
timer().endCpuTimer();
return -1;
return counter;
}

/**
@@ -41,10 +51,24 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
if (n == 0 || idata == NULL || odata == NULL) return 0;
timer().startCpuTimer();
int* position = new int[n];
for (int i = 0; i < n; i++){
odata[i] = (idata[i] != 0);
}
position[0] = 0;
for (int i = 1; i < n; i++) position[i] = position[i - 1] + odata[i - 1];

int k = 0;
for (int i = 0; i < n; i++){
if (idata[i] != 0){
k++;
odata[position[i]] = idata[i];
}
}
delete[] position;
timer().endCpuTimer();
return -1;
return k;
}
}
}