
University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking

  • Keyu Lu
  • Tested on: Windows 10, Dell Oman, NVIDIA GeForce RTX 2060

Flocking Results

Flocking Scene: N_FOR_VIS = 5,000, scene_scale = 100.0f, DT = 0.2f
Flocking Scene: N_FOR_VIS = 500,000, scene_scale = 300.0f, DT = 2.0f
Flocking Scene: N_FOR_VIS = 5,000,000, scene_scale = 500.0f, DT = 0.5f
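These scenes differ only in their simulation constants. A representative configuration for the first scene is sketched below; the variable locations (N_FOR_VIS and DT in main.cpp, scene_scale in kernel.cu) follow the standard CIS 565 Project 1 boilerplate and should be treated as an assumption.

// Assumed locations, following the standard project boilerplate:

// main.cpp
const int N_FOR_VIS = 5000;   // number of boids (5,000 / 500,000 / 5,000,000 above)
const float DT = 0.2f;        // simulation timestep

// kernel.cu
#define scene_scale 100.0f    // controls the extent of the simulation space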

Performance Analysis

Boid Count Impact on Performance

Naive: Performance drops sharply as boid count increases, because every boid checks every other boid each step, giving O(N^2) work (see the brute-force sketch after this list).

Uniform Grid: The performance drop with more boids is far less severe, since each boid only compares against boids in nearby grid cells rather than the whole population.

Coherent Grid: Follows the same trend as the uniform grid but sustains better performance, because the cell-ordered data layout gives more coherent memory access.
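To make the O(N^2) cost of the naive approach concrete, here is a minimal sketch of a brute-force velocity kernel in which each thread scans all N boids. The signature mirrors the kernUpdateVelocityBruteForce launch shown later in this README, but the rule bodies are omitted and the actual kernel in this repo may differ.

#include <glm/glm.hpp>

// Illustrative sketch: one thread per boid, and each thread loops over all N
// boids, which is what makes the naive approach O(N^2) per simulation step.
__global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
                                             glm::vec3 *vel1, glm::vec3 *vel2) {
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  if (index >= N) {
    return;
  }
  glm::vec3 change(0.0f);
  for (int j = 0; j < N; j++) {   // every boid is compared against every other boid
    if (j == index) {
      continue;
    }
    // accumulate the cohesion, separation, and alignment contributions of
    // boid j (rule bodies omitted in this sketch)
  }
  // write to the second buffer so other threads still read consistent vel1 data
  vel2[index] = vel1[index] + change;
}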

Block Size and Block Count Effects

Increasing the block size generally improves performance up to a threshold, after which the returns diminish. This pattern is consistent across all three implementations and is likely due to the limits of GPU thread scheduling and the point at which occupancy stops improving. A sketch of where the block size enters the launch configuration follows.
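For reference, a typical launch configuration looks like the sketch below, assuming the common ceiling-division pattern; blockSize and fullBlocksPerGrid match the names in the snippet further down, but the host code here is illustrative.

// host-side launch (inside the simulation step function)
// blockSize is the knob being varied in this analysis; the block count is
// derived so that every one of the N boids gets its own thread.
int blockSize = 128;
dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);   // ceil(N / blockSize)
kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(N, dev_pos, dev_vel1, dev_vel2);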

Cell Width and Neighbor Checking

The 8-cell neighbor search outperforms the 27-cell approach because it aligns better with the boids' localized interaction range: with a cell width of twice the maximum rule distance, each boid only needs to examine the 2x2x2 block of cells nearest to it, so distant cells that cannot affect its immediate behavior are never examined. A sketch of the 8-cell enumeration follows.
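Concretely, the 2x2x2 block can be chosen by looking at which half of its own cell a boid sits in on each axis. The sketch below illustrates that selection; gridMin, inverseCellWidth, and gridResolution are assumed helper names, not necessarily the ones used in this repo.

#include <glm/glm.hpp>

// Illustration of the 8-cell enumeration (not necessarily the code in this repo).
// gridMin = grid origin, inverseCellWidth = 1 / cellWidth,
// gridResolution = number of cells per axis.
__device__ void scanEightNeighborCells(glm::vec3 boidPos, glm::vec3 gridMin,
                                       float inverseCellWidth, int gridResolution) {
  glm::vec3 gridPos = (boidPos - gridMin) * inverseCellWidth; // position in cell units
  glm::ivec3 base   = glm::ivec3(glm::floor(gridPos));        // cell containing the boid
  glm::vec3 frac    = gridPos - glm::vec3(base);
  // Step one cell toward the nearer face on each axis so the 2x2x2 block is
  // the one actually overlapped by the boid's search radius.
  glm::ivec3 start(base.x - (frac.x < 0.5f ? 1 : 0),
                   base.y - (frac.y < 0.5f ? 1 : 0),
                   base.z - (frac.z < 0.5f ? 1 : 0));
  for (int z = 0; z < 2; z++) {
    for (int y = 0; y < 2; y++) {
      for (int x = 0; x < 2; x++) {
        glm::ivec3 cell = start + glm::ivec3(x, y, z);
        // clamp cell to [0, gridResolution) per axis, then scan the boids whose
        // sorted indices fall in this cell and apply the three flocking rules
      }
    }
  }
}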

Extra Credit: Shared-Memory Optimization

Implementation

For the extra credit, I implemented a shared-memory optimization of the nearest-neighbor search in the naive version of the boid simulation: the computations involved in updating boid velocities now use shared memory, which improves the naive approach's performance.

The switch between the two code paths is a simple preprocessor conditional:

#if USE_SHARED_MEM
kernUpdateVelocityBruteForceShared <<< fullBlocksPerGrid, blockSize, sizeof(glm::vec3) * blockSize * 2 >>> (N, dev_pos, dev_vel1, dev_vel2);
#else
kernUpdateVelocityBruteForce <<< fullBlocksPerGrid, blockSize >>> (N, dev_pos, dev_vel1, dev_vel2);
#endif

This code uses the preprocessor directive USE_SHARED_MEM to switch between the shared-memory kernel (kernUpdateVelocityBruteForceShared) and the plain brute-force kernel (kernUpdateVelocityBruteForce). The third launch parameter, sizeof(glm::vec3) * blockSize * 2, reserves dynamic shared memory for two blockSize-element arrays of glm::vec3 per block.
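That allocation is enough for one tile of positions and one tile of velocities per block, which suggests a tiled loop over the boid array. The sketch below shows one way such a kernel could be structured under that assumption; it illustrates the technique, not necessarily the exact kernel in this repository, and the rule bodies are again omitted.

#include <glm/glm.hpp>

// Possible structure for the shared-memory variant: each block cooperatively
// loads a tile of positions and velocities, every thread consumes the tile,
// and the loop advances to the next tile until all N boids have been seen.
__global__ void kernUpdateVelocityBruteForceShared(int N, glm::vec3 *pos,
                                                   glm::vec3 *vel1, glm::vec3 *vel2) {
  extern __shared__ unsigned char sharedBytes[];                 // sized at launch
  glm::vec3 *tilePos = reinterpret_cast<glm::vec3 *>(sharedBytes);
  glm::vec3 *tileVel = tilePos + blockDim.x;

  int index = threadIdx.x + blockIdx.x * blockDim.x;
  glm::vec3 myPos = (index < N) ? pos[index] : glm::vec3(0.0f);
  glm::vec3 change(0.0f);

  // Walk the whole boid array one block-sized tile at a time.
  for (int tileStart = 0; tileStart < N; tileStart += blockDim.x) {
    int j = tileStart + threadIdx.x;
    if (j < N) {                              // cooperative load into shared memory
      tilePos[threadIdx.x] = pos[j];
      tileVel[threadIdx.x] = vel1[j];
    }
    __syncthreads();

    int tileCount = min((int)blockDim.x, N - tileStart);
    for (int k = 0; k < tileCount; k++) {
      if (tileStart + k == index) continue;
      // accumulate cohesion, separation, and alignment using myPos,
      // tilePos[k], and tileVel[k] (rule bodies omitted in this sketch)
    }
    __syncthreads();                          // don't overwrite the tile too early
  }

  if (index < N) {
    vel2[index] = vel1[index] + change;       // ping-pong write into the second buffer
  }
}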

Performance Analysis

The use of shared memory has a significant impact on the frames per second (FPS) achieved by the simulation under the naive setting.
