diff --git a/README.md b/README.md
index 41b91f0..e339ed3 100644
--- a/README.md
+++ b/README.md
@@ -1,19 +1,83 @@
 CUDA Rasterizer
 ===============
 
-[CLICK ME FOR INSTRUCTION OF THIS PROJECT](./INSTRUCTION.md)
+### Sample Rasterization
 
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
+#### ```resolution```: 900X900 ```GLTF model```: cesiummilktruck ```shader``` : blinn_phong perspective corrected bilinear textureed
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/truck.gif)
 
-* (TODO) YOUR NAME HERE
-	* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+### Introduction
 
-### (TODO: Your README)
+  Rasterization is an efficient rendering technique commonly used in computer graphics and especially in games,simmilar to path and ray tracing, it basically does one thing: transforming the 3d object into 2d screen. 
+  Different from raytracing or pathtracing, however, in rasterization, we will not track rays' further interaction with geometry anymore, instead, we will only cast the rays from each screen pixels into the scene, and get the color, depth, specular, etc results, and use these to simulate the scene, so as a consequence, rasterization is much more efficient, but is harder to get to an realistic result.
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+### Features:
 
+- Basic features:
+  - Vertex shading
+  - Primitive assembly 
+  - Rasterization
+  - Fragment shading
+  - A depth buffer for storing and depth testing fragments
+  - Fragment-to-depth-buffer writing (with atomics for race avoidance)
+  -  simple lighting scheme including togglable Lambert and Blinn-Phong
+- Extra:
+  - UV texture mapping with bilinear texture filtering and perspective correct texture coordinates
+  - Support for rasterizing additional primitives with toggle, including line, points
+  - correct color interpolation on a primitive
+  - * tried SSAO, but result is not accurate
+  
+## Debug view:
+
+albedo buffer|depth buffer|
+------------|--------
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/diffuse.gif) | ![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/depth.gif)  
+
+normal buffer|specular buffer|
+------------|--------
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/normal.gif) | ![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/spec.gif)
+
+#### combined:
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/res.gif)
+
+## Support for other primitives:
+
+point|line|triangle
+-----|----|-----
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/p.gif) | ![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/line.gif) | ![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/lamm.gif)
+
+
+
+
+## perspective correct texture coordinates
+
+in the following to comparisions, the right image both shows what we will get when we use simple linear interpolation to aquire stuffs like normal, albedo, and depth, the result is apparently wrong and looks wierd, the reason for this is that when we are doing interpolation, we are only using the barycentric values in triangle vertices and the triangle value we want to interpolate, we haven't taken depth(z) information into consideration, which is really important for correctly transforming 3d data into 2d screen (depth information can't be lost), so what we should do instead is to use both the baryalue and z value to compute our result.
+
+corrected duck|not-corrected duck
+-----|----
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/yes.gif) | ![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/non.gif)
+
+corrected checkboard|not-corrected checkboard
+-----|----
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/perspcorrect.JPG) | ![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/notcorrected.JPG)
+
+## correct color interpolation between points on a primitive
+
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/colorinterp.gif)
+
+# Performance analysis
+
+## break down of pipeline time consumption:
+
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/imgg.JPG)
+
+as can be seen from the above graph, for each model, vertex transfrom almost cost the same amount of time,as it is bascically a parallel data copying process, for the same reason, primitive assembly time is the same in spite of different models, apparently, most time is used to do the rasterization operation, because we will have a lot of iterative checks for each thread, moreover, I put the bilinear filtering inside the triangle rasterization kernal, so it might bring the time consumption even higher, finally, rendering is also the same for all the models, this is simply because the shader are just too simple.....
+
+## with and without perspective correction:
+
+![](https://github.com/LanLou123/Project4-CUDA-Rasterizer/raw/master/renders/pcomp.JPG)
+
+the above image is tested using duck with and without perspective correction, it shows that, with perspective correction, we have some decrease in rasterization efficiency, this might because we have to do extra computation with z values in order to interpolate stuff.
 
 ### Credits
 
diff --git a/renders/colorinterp.gif b/renders/colorinterp.gif
new file mode 100644
index 0000000..6af376b
Binary files /dev/null and b/renders/colorinterp.gif differ
diff --git a/renders/depth.gif b/renders/depth.gif
new file mode 100644
index 0000000..a6d640b
Binary files /dev/null and b/renders/depth.gif differ
diff --git a/renders/diffuse.gif b/renders/diffuse.gif
new file mode 100644
index 0000000..be2b554
Binary files /dev/null and b/renders/diffuse.gif differ
diff --git a/renders/imgg.JPG b/renders/imgg.JPG
new file mode 100644
index 0000000..455ec92
Binary files /dev/null and b/renders/imgg.JPG differ
diff --git a/renders/lambert.gif b/renders/lambert.gif
new file mode 100644
index 0000000..21becd4
Binary files /dev/null and b/renders/lambert.gif differ
diff --git a/renders/lamm.gif b/renders/lamm.gif
new file mode 100644
index 0000000..588353f
Binary files /dev/null and b/renders/lamm.gif differ
diff --git a/renders/line.gif b/renders/line.gif
new file mode 100644
index 0000000..8233751
Binary files /dev/null and b/renders/line.gif differ
diff --git a/renders/non.gif b/renders/non.gif
new file mode 100644
index 0000000..0a8c458
Binary files /dev/null and b/renders/non.gif differ
diff --git a/renders/normal.gif b/renders/normal.gif
new file mode 100644
index 0000000..8e29b2b
Binary files /dev/null and b/renders/normal.gif differ
diff --git a/renders/notcorrected.JPG b/renders/notcorrected.JPG
new file mode 100644
index 0000000..7effe31
Binary files /dev/null and b/renders/notcorrected.JPG differ
diff --git a/renders/p.gif b/renders/p.gif
new file mode 100644
index 0000000..76e3c6f
Binary files /dev/null and b/renders/p.gif differ
diff --git a/renders/pcomp.JPG b/renders/pcomp.JPG
new file mode 100644
index 0000000..dec96e2
Binary files /dev/null and b/renders/pcomp.JPG differ
diff --git a/renders/perspcorrect.JPG b/renders/perspcorrect.JPG
new file mode 100644
index 0000000..e222965
Binary files /dev/null and b/renders/perspcorrect.JPG differ
diff --git a/renders/res.gif b/renders/res.gif
new file mode 100644
index 0000000..715c069
Binary files /dev/null and b/renders/res.gif differ
diff --git a/renders/spec.gif b/renders/spec.gif
new file mode 100644
index 0000000..07c57ab
Binary files /dev/null and b/renders/spec.gif differ
diff --git a/renders/truck.gif b/renders/truck.gif
new file mode 100644
index 0000000..5099cd1
Binary files /dev/null and b/renders/truck.gif differ
diff --git a/renders/yes.gif b/renders/yes.gif
new file mode 100644
index 0000000..0bb17a2
Binary files /dev/null and b/renders/yes.gif differ
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index a57f69f..1ed0953 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -2,9 +2,11 @@ set(SOURCE_FILES
     "rasterize.cu"
     "rasterize.h"
     "rasterizeTools.h"
+    "common.h"
+    "common.cu"
     )
 
 cuda_add_library(src
     ${SOURCE_FILES}
-    OPTIONS -arch=sm_20
+    OPTIONS -arch=sm_60
     )
diff --git a/src/common.cu b/src/common.cu
new file mode 100644
index 0000000..73fbd3a
--- /dev/null
+++ b/src/common.cu
@@ -0,0 +1,51 @@
+#include "common.h"
+
+void checkCUDAErrorFn(const char *msg, const char *file, int line) {
+    cudaError_t err = cudaGetLastError();
+    if (cudaSuccess == err) {
+        return;
+    }
+
+    fprintf(stderr, "CUDA error");
+    if (file) {
+        fprintf(stderr, " (%s:%d)", file, line);
+    }
+    fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
+    exit(EXIT_FAILURE);
+}
+
+
+namespace StreamCompaction {
+    namespace Common {
+
+        /**
+         * Maps an array to an array of 0s and 1s for stream compaction. Elements
+         * which map to 0 will be removed, and elements which map to 1 will be kept.
+         */
+        __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
+            // TODO
+			int idx = blockIdx.x*blockDim.x + threadIdx.x;
+			if (idx < n)
+			{
+				if (idata[idx])
+					bools[idx] = 1;
+			}
+        }
+
+        /**
+         * Performs scatter on an array. That is, for each element in idata,
+         * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
+         */
+        __global__ void kernScatter(int n, int *odata,
+                const int *idata, const int *bools, const int *indices) {
+            // TODO
+			int idx = blockIdx.x*blockDim.x + threadIdx.x;
+			if (idx < n)
+			{
+				if (bools[idx])
+					odata[indices[idx]] = idata[idx];
+			}
+        }
+
+    }
+}
diff --git a/src/common.h b/src/common.h
new file mode 100644
index 0000000..996997e
--- /dev/null
+++ b/src/common.h
@@ -0,0 +1,132 @@
+#pragma once
+
+#include <cuda.h>
+#include <cuda_runtime.h>
+
+#include <cstdio>
+#include <cstring>
+#include <cmath>
+#include <algorithm>
+#include <chrono>
+#include <stdexcept>
+
+#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
+#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
+
+/**
+ * Check for CUDA errors; print and exit if there was a problem.
+ */
+void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1);
+
+inline int ilog2(int x) {
+    int lg = 0;
+    while (x >>= 1) {
+        ++lg;
+    }
+    return lg;
+}
+
+inline int ilog2ceil(int x) {
+    return x == 1 ? 0 : ilog2(x - 1) + 1;
+}
+
+namespace StreamCompaction {
+    namespace Common {
+        __global__ void kernMapToBoolean(int n, int *bools, const int *idata);
+
+        __global__ void kernScatter(int n, int *odata,
+                const int *idata, const int *bools, const int *indices);
+
+	    /**
+	    * This class is used for timing the performance
+	    * Uncopyable and unmovable
+        *
+        * Adapted from WindyDarian(https://github.com/WindyDarian)
+	    */
+	    class PerformanceTimer
+	    {
+	    public:
+		    PerformanceTimer()
+		    {
+			    cudaEventCreate(&event_start);
+			    cudaEventCreate(&event_end);
+		    }
+
+		    ~PerformanceTimer()
+		    {
+			    cudaEventDestroy(event_start);
+			    cudaEventDestroy(event_end);
+		    }
+
+		    void startCpuTimer()
+		    {
+			    if (cpu_timer_started) { throw std::runtime_error("CPU timer already started"); }
+			    cpu_timer_started = true;
+
+			    time_start_cpu = std::chrono::high_resolution_clock::now();
+		    }
+
+		    void endCpuTimer()
+		    {
+			    time_end_cpu = std::chrono::high_resolution_clock::now();
+
+			    if (!cpu_timer_started) { throw std::runtime_error("CPU timer not started"); }
+
+			    std::chrono::duration<double, std::milli> duro = time_end_cpu - time_start_cpu;
+			    prev_elapsed_time_cpu_milliseconds =
+				    static_cast<decltype(prev_elapsed_time_cpu_milliseconds)>(duro.count());
+
+			    cpu_timer_started = false;
+		    }
+
+		    void startGpuTimer()
+		    {
+			    if (gpu_timer_started) { throw std::runtime_error("GPU timer already started"); }
+			    gpu_timer_started = true;
+
+			    cudaEventRecord(event_start);
+		    }
+
+		    void endGpuTimer()
+		    {
+			    cudaEventRecord(event_end);
+			    cudaEventSynchronize(event_end);
+
+			    if (!gpu_timer_started) { throw std::runtime_error("GPU timer not started"); }
+
+			    cudaEventElapsedTime(&prev_elapsed_time_gpu_milliseconds, event_start, event_end);
+			    gpu_timer_started = false;
+		    }
+
+		    float getCpuElapsedTimeForPreviousOperation() //noexcept //(damn I need VS 2015
+		    {
+			    return prev_elapsed_time_cpu_milliseconds;
+		    }
+
+		    float getGpuElapsedTimeForPreviousOperation() //noexcept
+		    {
+			    return prev_elapsed_time_gpu_milliseconds;
+		    }
+
+		    // remove copy and move functions
+		    PerformanceTimer(const PerformanceTimer&) = delete;
+		    PerformanceTimer(PerformanceTimer&&) = delete;
+		    PerformanceTimer& operator=(const PerformanceTimer&) = delete;
+		    PerformanceTimer& operator=(PerformanceTimer&&) = delete;
+
+	    private:
+		    cudaEvent_t event_start = nullptr;
+		    cudaEvent_t event_end = nullptr;
+
+		    using time_point_t = std::chrono::high_resolution_clock::time_point;
+		    time_point_t time_start_cpu;
+		    time_point_t time_end_cpu;
+
+		    bool cpu_timer_started = false;
+		    bool gpu_timer_started = false;
+
+		    float prev_elapsed_time_cpu_milliseconds = 0.f;
+		    float prev_elapsed_time_gpu_milliseconds = 0.f;
+	    };
+    }
+}
diff --git a/src/main.cpp b/src/main.cpp
index 7986959..39df543 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -17,7 +17,7 @@
 //-------------------------------
 //-------------MAIN--------------
 //-------------------------------
-
+int starttime;
 int main(int argc, char **argv) {
     if (argc != 2) {
         cout << "Usage: [gltf file]. Press Enter to exit" << endl;
@@ -57,6 +57,7 @@ int main(int argc, char **argv) {
     // Launch CUDA/GL
     if (init(scene)) {
         // GLFW main loop
+		starttime = GetTickCount();
         mainLoop();
     }
 
@@ -97,16 +98,18 @@ void mainLoop() {
 //---------RUNTIME STUFF---------
 //-------------------------------
 float scale = 1.0f;
-float x_trans = 0.0f, y_trans = 0.0f, z_trans = -10.0f;
+float x_trans = 0.0f, y_trans = 0.0f, z_trans = -10.f;
 float x_angle = 0.0f, y_angle = 0.0f;
 void runCuda() {
     // Map OpenGL buffer object for writing from CUDA on a single GPU
     // No data is moved (Win & Linux). When mapped to CUDA, OpenGL should not use this buffer
     dptr = NULL;
+	int timert = GetTickCount() - starttime;
 
+	//y_angle = 0.001*timert;
 	glm::mat4 P = glm::frustum<float>(-scale * ((float)width) / ((float)height),
 		scale * ((float)width / (float)height),
-		-scale, scale, 1.0, 1000.0);
+		-scale, scale, 3, 1000.0);
 
 	glm::mat4 V = glm::mat4(1.0f);
 
@@ -382,13 +385,13 @@ void mouseMotionCallback(GLFWwindow* window, double xpos, double ypos)
 	if (mouseState == ROTATE)
 	{
 		//rotate
-		x_angle += (float)s_r * diffy;
+		x_angle += -(float)s_r * diffy;
 		y_angle += (float)s_r * diffx;
 	}
 	else if (mouseState == TRANSLATE)
 	{
 		//translate
-		x_trans += (float)(s_t * diffx);
+		x_trans += (float)(-s_t * diffx);
 		y_trans += (float)(-s_t * diffy);
 	}
 }
diff --git a/src/rasterize.cu b/src/rasterize.cu
index 1262a09..4ccb206 100644
--- a/src/rasterize.cu
+++ b/src/rasterize.cu
@@ -9,6 +9,7 @@
 #include <cmath>
 #include <cstdio>
 #include <cuda.h>
+#include <random>
 #include <cuda_runtime.h>
 #include <thrust/random.h>
 #include <util/checkCUDAError.h>
@@ -17,6 +18,9 @@
 #include "rasterize.h"
 #include <glm/gtc/quaternion.hpp>
 #include <glm/gtc/matrix_transform.hpp>
+#include<curand.h>
+#include<curand_kernel.h>
+#include "common.h"
 
 namespace {
 
@@ -34,19 +38,25 @@ namespace {
 		Triangle = 3
 	};
 
+	enum DrawMode {
+		pmode = 1,
+		lmode = 2,
+		tmode = 3
+	};
+
 	struct VertexOut {
 		glm::vec4 pos;
-
+		glm::vec3 actualnorm;
 		// TODO: add new attributes to your VertexOut
 		// The attributes listed below might be useful, 
 		// but always feel free to modify on your own
-
+		glm::vec4 actualpos;
 		 glm::vec3 eyePos;	// eye space position used for shading
 		 glm::vec3 eyeNor;	// eye space normal used for shading, cuz normal will go wrong after perspective transformation
 		// glm::vec3 col;
 		 glm::vec2 texcoord0;
 		 TextureData* dev_diffuseTex = NULL;
-		// int texWidth, texHeight;
+		 int texWidth, texHeight;
 		// ...
 	};
 
@@ -62,10 +72,12 @@ namespace {
 		// The attributes listed below might be useful, 
 		// but always feel free to modify on your own
 
-		// glm::vec3 eyePos;	// eye space position used for shading
-		// glm::vec3 eyeNor;
-		// VertexAttributeTexcoord texcoord0;
-		// TextureData* dev_diffuseTex;
+		 glm::vec3 eyePos;	// eye space position used for shading
+		 glm::vec3 eyeNor;
+		 VertexAttributeTexcoord texcoord0;
+		 TextureData* dev_diffuseTex;
+		 int texWidth, texHeight;
+		 glm::vec3 fragmentPos;
 		// ...
 	};
 
@@ -100,6 +112,7 @@ namespace {
 
 static std::map<std::string, std::vector<PrimitiveDevBufPointers>> mesh2PrimitivesMap;
 
+int N = 40;
 
 static int width = 0;
 static int height = 0;
@@ -109,8 +122,24 @@ static Primitive *dev_primitives = NULL;
 static Fragment *dev_fragmentBuffer = NULL;
 static glm::vec3 *dev_framebuffer = NULL;
 
+static glm::vec3 *dev_randpts = NULL;
+
+curandState* devStates;
+
+
+DrawMode mode = tmode;
+//#define SSAO
+#define shader 0
+#define color_interp  0
+#define normaldebug  0;
+#define depthdebug  0;
+#define timer 1
+#define perspectcorrect 0
+
 static int * dev_depth = NULL;	// you might need this buffer when doing depth test
 
+static int * dev_mutex = NULL;
+
 /**
  * Kernel that writes the image to the OpenGL PBO directly.
  */
@@ -133,20 +162,105 @@ void sendImageToPBO(uchar4 *pbo, int w, int h, glm::vec3 *image) {
     }
 }
 
+__device__ __host__ glm::vec3 getBilinearFilteredPixelColor(TextureData* tex,glm::vec2 UV, int texwidth, int texheight) {
+	float u = UV.x * texwidth - 0.5;
+	float v = UV.y * texheight - 0.5;
+	int x = floor(u);
+	int y = floor(v);
+	double u_ratio = u - x;
+	double v_ratio = v - y;
+	double u_opposite = 1 - u_ratio;
+	double v_opposite = 1 - v_ratio;
+	int uvidx1 = 3 * (x + y*texwidth);
+	int uvidx2 = 3 * (x + 1 + y*texwidth);
+	int uvidx3 = 3 * (x + (y + 1)*texwidth);
+	int uvidx4 = 3 * (x + 1 + (y + 1)*texwidth);
+	double r = (tex[uvidx1] * u_opposite + tex[uvidx2] * u_ratio) * v_opposite +
+		(tex[uvidx3] * u_opposite + tex[uvidx4] * u_ratio) * v_ratio;
+	double g = (tex[uvidx1+1] * u_opposite + tex[uvidx2+1] * u_ratio) * v_opposite +
+		(tex[uvidx3+1] * u_opposite + tex[uvidx4+1] * u_ratio) * v_ratio;
+	double b = (tex[uvidx1+2] * u_opposite + tex[uvidx2+2] * u_ratio) * v_opposite +
+		(tex[uvidx3+2] * u_opposite + tex[uvidx4+2] * u_ratio) * v_ratio;
+	glm::vec3 colval(r, g, b);
+	colval /= 255.f;
+	return colval;
+}
+
+__device__ __host__ glm::vec3 generateRand()
+{
+	
+	thrust::uniform_real_distribution<float> randomFLTs(0.0, 1.0);
+	thrust::default_random_engine generator;
+	glm::vec3 sample(randomFLTs(generator)*2.0 - 1.0,
+		randomFLTs(generator)*2.0 - 1.0,
+		randomFLTs(generator));
+	sample = glm::normalize(sample);
+	sample *= randomFLTs(generator);
+	return sample;
+}
+
+ __host__ glm::vec3 generateRand1()
+{
+	std::random_device rd;
+	std::uniform_real_distribution<float> randomFLTs(0.0, 1.0);
+	std::default_random_engine generator(rd());
+	glm::vec3 sample(randomFLTs(generator)*2.0 - 1.0,
+		randomFLTs(generator)*2.0 - 1.0,
+		randomFLTs(generator));
+	sample = glm::normalize(sample);
+	sample *= randomFLTs(generator);
+	return sample;
+}
+
 /** 
 * Writes fragment colors to the framebuffer
 */
+
 __global__
-void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer) {
+void render(int w, int h, Fragment *fragmentBuffer, glm::vec3 *framebuffer,DrawMode mode,curandState* globalstate,glm::vec3 *randpts, glm::mat4 MVP, glm::mat4 MV, glm::mat3 MV_normal ) {
     int x = (blockIdx.x * blockDim.x) + threadIdx.x;
     int y = (blockIdx.y * blockDim.y) + threadIdx.y;
     int index = x + (y * w);
 
-    if (x < w && y < h) {
-        framebuffer[index] = fragmentBuffer[index].color;
+	if (x < w && y < h) {
+		glm::vec3 outcol;
+		if (mode == lmode||mode == pmode)
+		{
 
-		// TODO: add your fragment shader code here
+			outcol = fragmentBuffer[index].color;
+		}
+		else if(mode == tmode)
+		{
+#if perspectcorrect == 1
+			glm::vec3 lightdir(0.5, 0.5, 1);
+#else
+			glm::vec3 lightdir(-0.5, -0.5, -1);
+#endif
+			float intensity = glm::dot(lightdir, fragmentBuffer[index].eyeNor);
+			intensity = glm::clamp(intensity, 0.f, 0.8f);
+#if shader==0
+			outcol = glm::vec3(0);
+			if (fragmentBuffer[index].color != glm::vec3(0)) outcol += glm::vec3(0.2f);
+			outcol += intensity*fragmentBuffer[index].color;
+#elif shader == 1
+			float spec = 0.f;
+			if (intensity > 0.1f)
+			{
+				glm::vec3 viewdir = -fragmentBuffer[index].eyePos;
+				glm::vec3 half = glm::normalize(lightdir + viewdir);
+				float theta = glm::clamp(glm::dot(half, fragmentBuffer[index].eyeNor), 0.f, 1.f);
+				spec = glm::pow(theta, 20);
+			}
+			outcol = intensity*fragmentBuffer[index].color + spec*glm::vec3(1.f);
+#elif shader == 2
+			outcol = fragmentBuffer[index].color;
+#endif
+		}
+		
 
+
+		// TODO: add your fragment shader code here
+		framebuffer[index] = outcol;
     }
 }
 
@@ -163,9 +277,23 @@ void rasterizeInit(int w, int h) {
     cudaMalloc(&dev_framebuffer,   width * height * sizeof(glm::vec3));
     cudaMemset(dev_framebuffer, 0, width * height * sizeof(glm::vec3));
     
+	cudaMalloc(&dev_randpts, 40 * sizeof(glm::vec3));
+	cudaMemset(dev_randpts, 0, 40 * sizeof(glm::vec3));
+
+	glm::vec3 arr[40];
+	for (int i = 0; i < 40; ++i)
+	{
+		arr[i] = generateRand1();
+		std::cout << arr[i].x<<","<< arr[i]. y<<","<< arr[i]. z<<std::endl;
+	}
+	cudaMemcpy(dev_randpts, arr, 40 * sizeof(glm::vec3), cudaMemcpyHostToDevice);
 	cudaFree(dev_depth);
 	cudaMalloc(&dev_depth, width * height * sizeof(int));
 
+	cudaFree(dev_mutex);
+	cudaMalloc(&dev_mutex, width*height * sizeof(int));
+
+	cudaMalloc(&devStates, N * sizeof(curandState));
 	checkCUDAError("rasterizeInit");
 }
 
@@ -182,6 +310,19 @@ void initDepth(int w, int h, int * depth)
 	}
 }
 
+__global__
+void initMutex(int w, int h, int * mutex)
+{
+	int x = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int y = (blockIdx.y * blockDim.y) + threadIdx.y;
+
+	if (x < w && y < h)
+	{
+		int index = x + (y * w);
+		mutex[index] = 0;
+	}
+}
+
 
 /**
 * kern function with support for stride to sometimes replace cudaMemcpy
@@ -639,9 +780,26 @@ void _vertexTransformAndAssembly(
 		// Then divide the pos by its w element to transform into NDC space
 		// Finally transform x and y to viewport space
 
+		glm::vec4 MVPpos = MVP*glm::vec4(primitive.dev_position[vid], 1.0f);
+		primitive.dev_verticesOut[vid].actualpos = glm::vec4(primitive.dev_position[vid], 1.0f);
+		primitive.dev_verticesOut[vid].actualnorm = primitive.dev_normal[vid];
+		MVPpos /= MVPpos.w;
+		MVPpos.x = 0.5f*(float)width*(MVPpos.x + 1.0f);
+		MVPpos.y = 0.5f*(float)height*(-MVPpos.y + 1.0f);
+		primitive.dev_verticesOut[vid].pos = MVPpos;
 		// TODO: Apply vertex assembly here
 		// Assemble all attribute arraies into the primitive array
-		
+		primitive.dev_verticesOut[vid].eyeNor = glm::normalize(MV_normal*primitive.dev_normal[vid]);
+		glm::vec4 tmpvec = MV*glm::vec4(primitive.dev_position[vid],1.0f);
+		//tmpvec /= tmpvec.w;
+		primitive.dev_verticesOut[vid].eyePos = glm::vec3(tmpvec);
+		if (primitive.dev_diffuseTex)
+		{
+			primitive.dev_verticesOut[vid].dev_diffuseTex = primitive.dev_diffuseTex;
+			primitive.dev_verticesOut[vid].texcoord0 = primitive.dev_texcoord0[vid];
+			primitive.dev_verticesOut[vid].texHeight = primitive.diffuseTexHeight;
+			primitive.dev_verticesOut[vid].texWidth = primitive.diffuseTexWidth;
+		}
 	}
 }
 
@@ -660,13 +818,12 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_
 		// TODO: uncomment the following code for a start
 		// This is primitive assembly for triangles
 
-		//int pid;	// id for cur primitives vector
-		//if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
-		//	pid = iid / (int)primitive.primitiveType;
-		//	dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
-		//		= primitive.dev_verticesOut[primitive.dev_indices[iid]];
-		//}
-
+		int pid;	// id for cur primitives vector
+		if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
+			pid = iid / (int)primitive.primitiveType;
+			dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
+				= primitive.dev_verticesOut[primitive.dev_indices[iid]];
+		}
 
 		// TODO: other primitive types (point, line)
 	}
@@ -674,16 +831,327 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_
 }
 
 
+//__device__ float fatomicMin(float *addr, float value)
+
+//{
+//
+//	float old = *addr, assumed;
+//
+//	if (old <= value) return old;
+//
+//	do
+//
+//	{
+//
+//		assumed = old;
+//
+//		old = atomicCAS((unsigned int*)addr, __float_as_int(assumed), __float_as_int(value));
+//
+//	} while (old != assumed)
+//
+//		return old;
+//
+//}
+
+//perspective correction reference : https://www.scratchapixel.com/lessons/3d-basic-rendering/rasterization-practical-implementation/perspective-correct-interpolation-vertex-attributes
+__host__ __device__ float getZatperspCorrected(glm::vec3 bary, glm::vec3 tri[3])
+{
+	return 1.f / (bary.x / tri[0].z + bary.y / tri[1].z + bary.z / tri[2].z);
+}
+
+
+__host__ __device__ glm::vec3 getperspCorrectedInterp(glm::vec3 bary, glm::vec3 a[3], glm::vec3 tri[3], float zcorrected)
+{
+	return zcorrected*(a[0] * bary.x / tri[0].z + a[1] * bary.y / tri[1].z + a[2] * bary.z / tri[2].z);
+}
+
+__host__ __device__ glm::vec3 getnonperspCorrectedInterp(glm::vec3 bary, glm::vec3 a[3], glm::vec3 tri[3])
+{
+	return (a[0] * bary.x  + a[1] * bary.y  + a[2] * bary.z );
+}
+
+
+__global__ void _TraingleRasterizer(int w, int h, Fragment* fragmentbuffer, Primitive* primitives, int *depth, int numPrimitives, int *mutex, glm::mat4 MVP, glm::mat4 MV, glm::mat3 MV_normal, glm::vec3 *randpts )
+{
+	int pid = (blockIdx.x*blockDim.x) + threadIdx.x;
+	if (pid < numPrimitives)
+	{
+		Primitive& curP = primitives[pid];
+		//caculate triangle bounding box:
+		glm::vec3 tri[3];
+		glm::vec3 triNor[3];
+		glm::vec3 trieyePos[3];
+		glm::vec3 triUV[3];
+		glm::vec3 col[3]{glm::vec3(1,0,0),glm::vec3(0,1,0),glm::vec3(0,0,1)};
+		triNor[0] = curP.v[0].eyeNor;
+		triNor[1] = curP.v[1].eyeNor;
+		triNor[2] = curP.v[2].eyeNor;
+
+		trieyePos[0] = curP.v[0].eyePos;
+		trieyePos[1] = curP.v[1].eyePos;
+		trieyePos[2] = curP.v[2].eyePos;
+
+		triUV[0] = glm::vec3(curP.v[0].texcoord0, 0);
+		triUV[1] = glm::vec3(curP.v[1].texcoord0, 0);
+		triUV[2] = glm::vec3(curP.v[2].texcoord0, 0);
+
+		tri[0] = glm::vec3(curP.v[0].pos);
+		tri[1] = glm::vec3(curP.v[1].pos);
+		tri[2] = glm::vec3(curP.v[2].pos);
+		
+
+		AABB curBB = getAABBForTriangle(tri);
+		for (int i = curBB.min.x; i < curBB.max.x; ++i)
+		{
+			for (int j = curBB.min.y; j < curBB.max.y; ++j)
+			{
+				if (i<0 || i>w || j<0 || j>h) continue;
+				glm::vec3 baryCoord = calculateBarycentricCoordinate(tri, glm::vec2(i, j));
+				if (isBarycentricCoordInBounds(baryCoord))
+				{	
+					int ScreenIdx = i + j*w;
+#if perspectcorrect == 1
+					int depthval = (int)(getZatperspCorrected(baryCoord, tri)*INT_MAX);
+					float correctedZ = getZatperspCorrected(baryCoord, tri);
+					glm::vec3 curnormal = getperspCorrectedInterp(baryCoord, triNor, tri, correctedZ);
+					curnormal = glm::normalize(curnormal);
+					glm::vec3 cureyePos = getperspCorrectedInterp(baryCoord, trieyePos, tri, correctedZ);
+					glm::vec3 fragpos = getperspCorrectedInterp(baryCoord, tri, tri, correctedZ);
+					glm::vec3 interpcol = getperspCorrectedInterp(baryCoord, col, tri, correctedZ);
+#ifdef SSAO
+					__shared__ glm::vec3 sharedfrags[16];
+					int occlusion = 0;
+					int samplecount = 40;
+					float sampleradius = 5;
+					glm::vec3 centernor;
+					centernor = curnormal;
+
+					glm::vec3 tangent;
+					glm::vec3 c1 = glm::cross(centernor, glm::vec3(0, 0, 1.0));
+					glm::vec3 c2 = glm::cross(centernor, glm::vec3(0, 1.0, 0));
+					if (glm::length(c1) > glm::length(c2))
+					{
+						tangent = c1;
+					}
+					else
+						tangent = c2;
+					tangent = glm::normalize(tangent);
+					glm::vec3 bitangent = glm::cross(tangent, centernor);
+					glm::mat3 TBN(tangent, bitangent, centernor);
+					for (int i = 0; i < samplecount; ++i)
+					{
+						glm::vec3 samp = randpts[i];
+						samp = TBN*samp;
+						samp = cureyePos + samp*sampleradius;
+						glm::vec4 sample = glm::vec4(samp, 1);
+						glm::mat4 p = glm::inverse(MV)*MVP;
+						sample *= p;
+						sample /= sample.w;
+						sample.x = 0.5f*(float)w*(sample.x + 1.0f);
+						sample.y = 0.5f*(float)h*(-sample.y + 1.0f);
+						int curidx = sample.x + sample.y*w;
+						glm::vec3 baryCoord1 = calculateBarycentricCoordinate(tri, glm::vec2(sample.x, sample.y));
+
+						occlusion += (samp.z > cureyePos.z ? 1.0 : 0);
+					}
+					float resocclusion = 1 - (occlusion / samplecount);
+					 
+#endif // SSAO
+#else
+					int depthval = (int)(getZAtCoordinate(baryCoord, tri)*INT_MAX);
+					glm::vec3 curnormal = glm::normalize(getnonperspCorrectedInterp(baryCoord, triNor, tri));
+					glm::vec3 cureyePos = getnonperspCorrectedInterp(baryCoord, trieyePos, tri);
+					glm::vec3 fragpos = getnonperspCorrectedInterp(baryCoord, tri, tri);
+					glm::vec3 interpcol = getnonperspCorrectedInterp(baryCoord, col, tri);
+#endif
+					//glm::vec3 interpcol = getnonperspCorrectedInterp(baryCoord, col, tri);
+					
+					bool isSet;
+					do {
+						isSet = (atomicCAS(&mutex[ScreenIdx], 0, 1) == 0);
+						if (isSet)
+						{
+							if (depth[ScreenIdx] >= depthval)
+							{
+								depth[ScreenIdx] = depthval;
+								fragmentbuffer[ScreenIdx].eyeNor = curnormal;
+								fragmentbuffer[ScreenIdx].eyePos = cureyePos;
+								fragmentbuffer[ScreenIdx].fragmentPos = fragpos;
+#if normaldebug == 1
+
+								fragmentbuffer[ScreenIdx].color = glm::normalize(fragmentbuffer[ScreenIdx].eyeNor);
+#elif depthdebug == 1
+								fragmentbuffer[ScreenIdx].color = (1/correctedZ)*glm::vec3(1.f) / 38.f;
+#elif color_interp == 1
+
+								fragmentbuffer[ScreenIdx].color = interpcol;
+#else
+								if (curP.v[0].dev_diffuseTex != NULL)
+								{
+#if perspectcorrect == 1
+									glm::vec2 UV = glm::vec2(getperspCorrectedInterp(baryCoord, triUV, tri, correctedZ));
+#else
+									glm::vec2 UV = glm::vec2(getnonperspCorrectedInterp(baryCoord, triUV, tri));
+#endif
+									fragmentbuffer[ScreenIdx].texcoord0 = UV;
+									fragmentbuffer[ScreenIdx].dev_diffuseTex = curP.v[0].dev_diffuseTex;
+									fragmentbuffer[ScreenIdx].texHeight = curP.v[0].texHeight;
+									fragmentbuffer[ScreenIdx].texWidth = curP.v[0].texWidth;
+									fragmentbuffer[ScreenIdx].color = getBilinearFilteredPixelColor(fragmentbuffer[ScreenIdx].dev_diffuseTex, fragmentbuffer[ScreenIdx].texcoord0
+										, fragmentbuffer[ScreenIdx].texWidth, fragmentbuffer[ScreenIdx].texHeight);
+									
+								}
+								else
+								{
+									fragmentbuffer[ScreenIdx].color = glm::vec3(1.0f);
+								}
+								//fragmentbuffer[ScreenIdx].color *= resocclusion;
+#endif // color_interp
+							}
+							mutex[ScreenIdx] = 0;
+						}
+					} while (!isSet);
+				}
+			}
+		}
+	}
+	
+}
+
+__host__ __device__ void kern_swap(float &a, float &b)
+{
+	float tmp = a;
+	a = b;
+	b = tmp;
+}
+
+__global__ void _PointRasterizer(int w, int h, Fragment* fragmentbuffer, Primitive* primitives, int *depth, int numPrimitives, int *mutex)
+{
+	int pid = (blockIdx.x*blockDim.x) + threadIdx.x;
+	if (pid < numPrimitives)
+	{
+		Primitive& curP = primitives[pid];
+		glm::vec3 tri[3];
+		tri[0] = glm::vec3(curP.v[0].pos);
+		tri[1] = glm::vec3(curP.v[1].pos);
+		tri[2] = glm::vec3(curP.v[2].pos);
+		for (int i = 0; i < 3; ++i)
+		{
+			int curpx = tri[i].x;
+			int curpy = tri[i].y;
+			int screenidx = curpx + curpy*w;
+			fragmentbuffer[screenidx].color = glm::vec3(1.0f);
+		}
+		
+	}
+}
+
+
+__global__ void _LineRasterizer(int w, int h, Fragment* fragmentbuffer, Primitive* primitives, int *depth, int numPrimitives, int *mutex)
+{
+	int pid = (blockIdx.x*blockDim.x) + threadIdx.x;
+	if (pid < numPrimitives)
+	{
+		Primitive& curP = primitives[pid];
+		glm::vec3 trin[3];
+		trin[0] = glm::vec3(curP.v[0].pos);
+		trin[1] = glm::vec3(curP.v[1].pos);
+		trin[2] = glm::vec3(curP.v[2].pos);
+		glm::vec3 tri[3][2];
+		tri[0][0] = glm::vec3(curP.v[0].pos);
+		tri[0][1] = glm::vec3(curP.v[1].pos);
+		tri[1][0] = glm::vec3(curP.v[1].pos);
+		tri[1][1] = glm::vec3(curP.v[2].pos);
+		tri[2][0] = glm::vec3(curP.v[2].pos);
+		tri[2][1] = glm::vec3(curP.v[0].pos);
+		for (int i = 0; i < 3; ++i)
+		{
+			glm::vec3 startp = tri[i][0];
+			glm::vec3 endp = tri[i][1];
+			float x1 = startp.x; float x2 = endp.x;
+			float y1 = startp.y; float y2 = endp.y;
+			if (x1<0 || x1>w || x2<0 || x2>w || y1<0 || y1>h || y2<0 || y2>h) continue;
+			const bool steep = (fabs(y2 - y1) > fabs(x2 - x1));
+			if (steep)
+			{
+				kern_swap(x1, y1);
+				kern_swap(x2, y2);
+			}
+
+			if (x1 > x2)
+			{
+				kern_swap(x1, x2);
+				kern_swap(y1, y2);
+			}
+
+			const float dx = x2 - x1;
+			const float dy = fabs(y2 - y1);
+
+			float error = dx / 2.0f;
+			const int ystep = (y1 < y2) ? 1 : -1;
+			int y = (int)y1;
+
+			const int maxX = (int)x2;
+
+			for (int x = (int)x1; x<maxX; x++)
+			{
+				int screenidx;
+				glm::vec2 rcoord;
+				if (steep)
+				{
+					rcoord = glm::vec2(y, x);
+					screenidx = y + x*w;
+				}
+				else
+				{
+					rcoord = glm::vec2(x, y);
+					screenidx = x + y*w;
+				}
+				glm::vec3 baryCoord = calculateBarycentricCoordinate(trin, rcoord);
+				//if(!isBarycentricCoordInBounds(baryCoord)) continue;
+				float testdepth = (getZatperspCorrected(baryCoord, trin));
+				int depthval = (int)(testdepth*INT_MAX);
+				bool isSet;
+				do {
+					isSet = (atomicCAS(&mutex[screenidx], 0, 1) == 0);
+					if (isSet)
+					{
+						if (depth[screenidx] > depthval)
+						{
+							depth[screenidx] = depthval;
+							fragmentbuffer[screenidx].color =/*(1-glm::clamp(testdepth,0.f,1.f))* */glm::vec3(1.f);
+						}
+						mutex[screenidx] = 0;
+					}
+				} while (!isSet);
+				error -= dy;
+				if (error < 0)
+				{
+					y += ystep;
+					error += dx;
+				}
+			}
+
+		}
+	}
+}
 
 /**
  * Perform rasterization.
  */
 void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const glm::mat3 MV_normal) {
+	StreamCompaction::Common::PerformanceTimer vertexTranstimer;
+	StreamCompaction::Common::PerformanceTimer primitiveassemblytimer;
+	StreamCompaction::Common::PerformanceTimer rasterizertimer;
+	StreamCompaction::Common::PerformanceTimer rendertimer;
+
     int sideLength2d = 8;
     dim3 blockSize2d(sideLength2d, sideLength2d);
     dim3 blockCount2d((width  - 1) / blockSize2d.x + 1,
 		(height - 1) / blockSize2d.y + 1);
 
+
+
 	// Execute your rasterization pipeline here
 	// (See README for rasterization pipeline outline.)
 
@@ -695,40 +1163,73 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 		auto it = mesh2PrimitivesMap.begin();
 		auto itEnd = mesh2PrimitivesMap.end();
 
+		float vertextranstimer = 0.f;
+		float primitiveassembletimer = 0.f;
 		for (; it != itEnd; ++it) {
 			auto p = (it->second).begin();	// each primitive
 			auto pEnd = (it->second).end();
 			for (; p != pEnd; ++p) {
 				dim3 numBlocksForVertices((p->numVertices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
 				dim3 numBlocksForIndices((p->numIndices + numThreadsPerBlock.x - 1) / numThreadsPerBlock.x);
-
+				vertexTranstimer.startGpuTimer();
 				_vertexTransformAndAssembly << < numBlocksForVertices, numThreadsPerBlock >> >(p->numVertices, *p, MVP, MV, MV_normal, width, height);
+				vertexTranstimer.endGpuTimer();
+				vertextranstimer += vertexTranstimer.getGpuElapsedTimeForPreviousOperation();
 				checkCUDAError("Vertex Processing");
+				
 				cudaDeviceSynchronize();
+				primitiveassemblytimer.startGpuTimer();
 				_primitiveAssembly << < numBlocksForIndices, numThreadsPerBlock >> >
 					(p->numIndices, 
 					curPrimitiveBeginId, 
 					dev_primitives, 
 					*p);
+				primitiveassemblytimer.endGpuTimer();
+				primitiveassembletimer += primitiveassemblytimer.getGpuElapsedTimeForPreviousOperation();
 				checkCUDAError("Primitive Assembly");
+				
 
 				curPrimitiveBeginId += p->numPrimitives;
 			}
 		}
+#if timer == 1
+		std::cout << "vertex transform time:" << vertextranstimer << std::endl;
+		std::cout << "primitive assembly time:" << primitiveassembletimer << std::endl;
+#endif
 
 		checkCUDAError("Vertex Processing and Primitive Assembly");
 	}
 	
 	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
 	initDepth << <blockCount2d, blockSize2d >> >(width, height, dev_depth);
-	
-	// TODO: rasterize
-
 
+	initMutex << <blockCount2d, blockSize2d >> > (width, height, dev_mutex);
+	
+	int ScanNumPrimitives = totalNumPrimitives;
+	dim3 numTreadsPerBlock(128);
+	dim3 numBlocksScan((ScanNumPrimitives + numTreadsPerBlock.x - 1)/numTreadsPerBlock.x);
 
+	// TODO: rasterize
+	rasterizertimer.startGpuTimer();
+	if(mode==tmode)
+	_TraingleRasterizer << <numBlocksScan, numTreadsPerBlock >> > (width, height, dev_fragmentBuffer, dev_primitives, dev_depth, totalNumPrimitives,dev_mutex, MVP, MV, MV_normal,dev_randpts);
+	if(mode==lmode)
+	_LineRasterizer << <numBlocksScan, numTreadsPerBlock >> > (width, height, dev_fragmentBuffer, dev_primitives, dev_depth, totalNumPrimitives, dev_mutex);
+	if(mode==pmode)
+	_PointRasterizer << <numBlocksScan, numTreadsPerBlock >> > (width, height, dev_fragmentBuffer, dev_primitives, dev_depth, totalNumPrimitives, dev_mutex);
+	rasterizertimer.endGpuTimer();
+#if timer ==1
+	std::cout << "rasterization time:" << rasterizertimer.getGpuElapsedTimeForPreviousOperation() << std::endl;
+#endif
     // Copy depthbuffer colors into framebuffer
-	render << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer);
+	rendertimer.startGpuTimer();
+	render << <blockCount2d, blockSize2d >> >(width, height, dev_fragmentBuffer, dev_framebuffer,mode,devStates,dev_randpts, MVP, MV, MV_normal);
+	rendertimer.endGpuTimer();
+#if timer == 1
+	std::cout << "render time" << rendertimer.getGpuElapsedTimeForPreviousOperation() << std::endl;
+#endif
 	checkCUDAError("fragment shader");
+	
     // Copy framebuffer into OpenGL buffer for OpenGL previewing
     sendImageToPBO<<<blockCount2d, blockSize2d>>>(pbo, width, height, dev_framebuffer);
     checkCUDAError("copy render result to pbo");
@@ -750,10 +1251,9 @@ void rasterizeFree() {
 			cudaFree(p->dev_normal);
 			cudaFree(p->dev_texcoord0);
 			cudaFree(p->dev_diffuseTex);
-
 			cudaFree(p->dev_verticesOut);
 
-			
+
 			//TODO: release other attributes and materials
 		}
 	}
@@ -772,5 +1272,11 @@ void rasterizeFree() {
 	cudaFree(dev_depth);
 	dev_depth = NULL;
 
+	cudaFree(dev_mutex);
+	dev_mutex = NULL;
+
+	cudaFree(devStates);
+
+	cudaFree(dev_randpts);
     checkCUDAError("rasterize Free");
 }
diff --git a/util/CMakeLists.txt b/util/CMakeLists.txt
index c995fae..9438a2c 100644
--- a/util/CMakeLists.txt
+++ b/util/CMakeLists.txt
@@ -8,5 +8,5 @@ set(SOURCE_FILES
 
 cuda_add_library(util
    ${SOURCE_FILES}
-   OPTIONS -arch=sm_52
+   OPTIONS -arch=sm_60
    )