
University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking

  • Keyu Lu
  • Tested on: Windows 10, Dell Oman, NVIDIA GeForce RTX 2060

Flocking Results

Flocking Scene: N_FOR_VIS = 5,000, scene_scale = 100.0f, DT = 0.2f
Flocking Scene: N_FOR_VIS = 500,000, scene_scale = 300.0f, DT = 2.0f
Flocking Scene: N_FOR_VIS = 5,000,000, scene_scale = 500.0f, DT = 0.5f
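These scenes differ only in their simulation constants. A representative configuration for the first scene is sketched below; the variable locations (N_FOR_VIS and DT in main.cpp, scene_scale in kernel.cu) follow the standard CIS 565 Project 1 boilerplate and should be treated as an assumption.

// Assumed locations, following the standard project boilerplate:

// main.cpp
const int N_FOR_VIS = 5000;   // number of boids (5,000 / 500,000 / 5,000,000 above)
const float DT = 0.2f;        // simulation timestep

// kernel.cu
#define scene_scale 100.0f    // controls the extent of the simulation space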

Performance Analysis

Boid Count Impact on Performance

Naive: Performance drops sharply as boid count increases, because every boid checks every other boid each step, giving O(N^2) work (see the brute-force sketch after this list).

Uniform Grid: The performance drop with more boids is far less severe, since each boid only compares against boids in nearby grid cells rather than the whole population.

Coherent Grid: Follows the same trend as the uniform grid but sustains better performance, because the cell-ordered data layout gives more coherent memory access.
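To make the O(N^2) cost of the naive approach concrete, here is a minimal sketch of a brute-force velocity kernel in which each thread scans all N boids. The signature mirrors the kernUpdateVelocityBruteForce launch shown later in this README, but the rule bodies are omitted and the actual kernel in this repo may differ.

#include <glm/glm.hpp>

// Illustrative sketch: one thread per boid, and each thread loops over all N
// boids, which is what makes the naive approach O(N^2) per simulation step.
__global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
                                             glm::vec3 *vel1, glm::vec3 *vel2) {
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  if (index >= N) {
    return;
  }
  glm::vec3 change(0.0f);
  for (int j = 0; j < N; j++) {   // every boid is compared against every other boid
    if (j == index) {
      continue;
    }
    // accumulate the cohesion, separation, and alignment contributions of
    // boid j (rule bodies omitted in this sketch)
  }
  // write to the second buffer so other threads still read consistent vel1 data
  vel2[index] = vel1[index] + change;
}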

Block Size and Block Count Effects

Increasing the block size generally improves performance up to a threshold, after which the returns diminish. This pattern is consistent across all three implementations and is likely due to the limits of GPU thread scheduling and the point at which occupancy stops improving. A sketch of where the block size enters the launch configuration follows.
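For reference, a typical launch configuration looks like the sketch below, assuming the common ceiling-division pattern; blockSize and fullBlocksPerGrid match the names in the snippet further down, but the host code here is illustrative.

// host-side launch (inside the simulation step function)
// blockSize is the knob being varied in this analysis; the block count is
// derived so that every one of the N boids gets its own thread.
int blockSize = 128;
dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);   // ceil(N / blockSize)
kernUpdateVelocityBruteForce<<<fullBlocksPerGrid, blockSize>>>(N, dev_pos, dev_vel1, dev_vel2);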

Cell Width and Neighbor Checking

The 8-cell neighbor search outperforms the 27-cell approach because it aligns better with the boids' localized interaction range: with a cell width of twice the maximum rule distance, each boid only needs to examine the 2x2x2 block of cells nearest to it, so distant cells that cannot affect its immediate behavior are never examined. A sketch of the 8-cell enumeration follows.
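Concretely, the 2x2x2 block can be chosen by looking at which half of its own cell a boid sits in on each axis. The sketch below illustrates that selection; gridMin, inverseCellWidth, and gridResolution are assumed helper names, not necessarily the ones used in this repo.

#include <glm/glm.hpp>

// Illustration of the 8-cell enumeration (not necessarily the code in this repo).
// gridMin = grid origin, inverseCellWidth = 1 / cellWidth,
// gridResolution = number of cells per axis.
__device__ void scanEightNeighborCells(glm::vec3 boidPos, glm::vec3 gridMin,
                                       float inverseCellWidth, int gridResolution) {
  glm::vec3 gridPos = (boidPos - gridMin) * inverseCellWidth; // position in cell units
  glm::ivec3 base   = glm::ivec3(glm::floor(gridPos));        // cell containing the boid
  glm::vec3 frac    = gridPos - glm::vec3(base);
  // Step one cell toward the nearer face on each axis so the 2x2x2 block is
  // the one actually overlapped by the boid's search radius.
  glm::ivec3 start(base.x - (frac.x < 0.5f ? 1 : 0),
                   base.y - (frac.y < 0.5f ? 1 : 0),
                   base.z - (frac.z < 0.5f ? 1 : 0));
  for (int z = 0; z < 2; z++) {
    for (int y = 0; y < 2; y++) {
      for (int x = 0; x < 2; x++) {
        glm::ivec3 cell = start + glm::ivec3(x, y, z);
        // clamp cell to [0, gridResolution) per axis, then scan the boids whose
        // sorted indices fall in this cell and apply the three flocking rules
      }
    }
  }
}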

Extra Credit: Shared-Memory Optimization

Implementation

For the extra credit, I implemented a shared-memory optimization of the nearest-neighbor search in the naive version of the boid simulation: the computations involved in updating boid velocities now use shared memory, which improves the naive approach's performance.

The switch between the two code paths is a simple preprocessor conditional:

#if USE_SHARED_MEM
kernUpdateVelocityBruteForceShared <<< fullBlocksPerGrid, blockSize, sizeof(glm::vec3) * blockSize * 2 >>> (N, dev_pos, dev_vel1, dev_vel2);
#else
kernUpdateVelocityBruteForce <<< fullBlocksPerGrid, blockSize >>> (N, dev_pos, dev_vel1, dev_vel2);
#endif

This code uses the preprocessor directive USE_SHARED_MEM to switch between the shared-memory kernel (kernUpdateVelocityBruteForceShared) and the plain brute-force kernel (kernUpdateVelocityBruteForce). The third launch parameter, sizeof(glm::vec3) * blockSize * 2, reserves dynamic shared memory for two blockSize-element arrays of glm::vec3 per block.
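That allocation is enough for one tile of positions and one tile of velocities per block, which suggests a tiled loop over the boid array. The sketch below shows one way such a kernel could be structured under that assumption; it illustrates the technique, not necessarily the exact kernel in this repository, and the rule bodies are again omitted.

#include <glm/glm.hpp>

// Possible structure for the shared-memory variant: each block cooperatively
// loads a tile of positions and velocities, every thread consumes the tile,
// and the loop advances to the next tile until all N boids have been seen.
__global__ void kernUpdateVelocityBruteForceShared(int N, glm::vec3 *pos,
                                                   glm::vec3 *vel1, glm::vec3 *vel2) {
  extern __shared__ unsigned char sharedBytes[];                 // sized at launch
  glm::vec3 *tilePos = reinterpret_cast<glm::vec3 *>(sharedBytes);
  glm::vec3 *tileVel = tilePos + blockDim.x;

  int index = threadIdx.x + blockIdx.x * blockDim.x;
  glm::vec3 myPos = (index < N) ? pos[index] : glm::vec3(0.0f);
  glm::vec3 change(0.0f);

  // Walk the whole boid array one block-sized tile at a time.
  for (int tileStart = 0; tileStart < N; tileStart += blockDim.x) {
    int j = tileStart + threadIdx.x;
    if (j < N) {                              // cooperative load into shared memory
      tilePos[threadIdx.x] = pos[j];
      tileVel[threadIdx.x] = vel1[j];
    }
    __syncthreads();

    int tileCount = min((int)blockDim.x, N - tileStart);
    for (int k = 0; k < tileCount; k++) {
      if (tileStart + k == index) continue;
      // accumulate cohesion, separation, and alignment using myPos,
      // tilePos[k], and tileVel[k] (rule bodies omitted in this sketch)
    }
    __syncthreads();                          // don't overwrite the tile too early
  }

  if (index < N) {
    vel2[index] = vel1[index] + change;       // ping-pong write into the second buffer
  }
}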

Performance Analysis

The use of shared memory has a significant impact on the frames per second (FPS) achieved by the simulation under the naive setting.
