Master’s Thesis Project — ITMO University
A research prototype for Asynchronous Software Tessellation, developed as part of a Master’s thesis at ITMO University. It demonstrates a GPU-based procedural tessellation algorithm integrated with asynchronous compute to overlap tessellation workloads with other rendering tasks (shadow mapping, post-processing, etc.) in DirectX 12.
- Compute-Shader Tessellation
  - Procedural subdivision via binary-rule keys stored in GPU buffers.
  - Both uniform and view-dependent adaptive LOD.
  - Frustum culling.
- Asynchronous Compute
  - Overlaps tessellation compute with shadow-map and post-processing passes.
  - Camera-motion prediction to drive next-frame tessellation.
- Graphics Techniques
  - Deferred rendering.
  - Shadow mapping (Cascaded Shadow Maps).
  - Post-processing: Motion Blur, Bloom, Chromatic Aberration, Tone Mapping.
- Interactive UI
  - ImGui overlay to tweak LOD, toggle async/sync, enable/disable effects.
  - Frame-time display and live profiling markers.
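The binary-key subdivision listed under Compute-Shader Tessellation can be sketched on the CPU as follows. This is a minimal illustration that assumes the key scheme from the Khoury et al. paper listed in the references (root key `1`, one bit appended per split); the project's actual encoding may differ. The names `Key`, `keyLevel`, `childKey`, and `parentKey` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical key type: one 32-bit word per sub-triangle.
// Assumption: the root triangle is key 1, so the highest set bit acts as a
// sentinel and every bit below it selects one half of the parent triangle.
using Key = std::uint32_t;

constexpr Key kRootKey = 1u;

// Subdivision level = position of the highest set bit (root is level 0).
inline int keyLevel(Key key) {
    assert(key != 0u);
    int level = 0;
    while (key > 1u) {
        key >>= 1;
        ++level;
    }
    return level;
}

// Split: append one bit (0 or 1) to pick a child of the current triangle.
inline Key childKey(Key key, unsigned bit) {
    assert(bit <= 1u);
    assert(key < (1u << 31)); // keep room for one more bit
    return (key << 1) | bit;
}

// Merge: drop the lowest bit to get back to the parent triangle.
inline Key parentKey(Key key) {
    assert(key > kRootKey);
    return key >> 1;
}
```

With this encoding a split or merge is a single shift, which is why a compute shader can refine or coarsen the key buffer without storing explicit tree links.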
This section illustrates the overall application workflow and the detailed rendering sequence for a single frame utilizing asynchronous compute.
The implementation was evaluated across eight GPU configurations (GTX 1050 Ti, GTX 1080 Ti, RTX 2080 Ti, RTX 3060 Laptop, RTX 3080, RTX 3090, RTX 4070 Ti, RTX 4090) and five application scenarios:
- Config 1: High-detail procedural terrain.
- Config 2: High-detail 3D model.
- Config 3: Low-detail 3D model.
- Config 4: Low-detail procedural terrain.
- Config 5: High-detail procedural terrain with simplified pixel shader.
Each scenario included cascaded shadow maps, multiple colored point lights, full post‑processing effects, and a rotating camera.
Asynchronous tessellation modes yielded a 10–30% reduction in frame time, with AsyncAll often the top performer. Parallelizing tessellation alongside shadow‐map generation maximized GPU utilization and reduced idle periods.
The chart below, captured in NVIDIA Nsight Graphics, illustrates the distribution of GPU and memory resources for Config 1 in Direct mode (A) versus AsyncShadowMap mode (B).
AsyncAll and AsyncShadowMap modes were most stable, delivering 5–20% gains despite a heavy 2.5 million‑triangle load. AsyncPostProcess underperformed due to high VRAM pressure from simultaneous texture sampling.
With only ~300 k vertices (low tessellation), queue-synchronization overhead offset most of the compute gains, yet AsyncAll still delivered marginal improvements.
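The trade-off between overlap and synchronization overhead can be illustrated with a toy cost model. This is an idealized sketch, not the measurement method used in the thesis: it assumes the overlapped passes do not contend for GPU resources, and every function name and duration below is hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// All durations are in milliseconds and are hypothetical inputs.

// Synchronous mode: tessellation and all graphics work run back to back.
inline double serialFrameMs(double tessMs, double overlappableMs, double restMs) {
    return tessMs + overlappableMs + restMs;
}

// Asynchronous mode (idealized): tessellation runs concurrently with one
// graphics pass (e.g. shadow maps), so that region costs as much as its
// longer side. Cross-queue synchronization adds a fixed overhead.
inline double overlappedFrameMs(double tessMs, double overlappableMs,
                                double restMs, double syncMs) {
    return std::max(tessMs, overlappableMs) + restMs + syncMs;
}

// Relative frame-time reduction in percent; negative means a slowdown.
inline double gainPercent(double serialMs, double asyncMs) {
    assert(serialMs > 0.0);
    return 100.0 * (serialMs - asyncMs) / serialMs;
}
```

In this model a heavy tessellation pass hides most of the overlapped work, while a very light pass saves less than the synchronization overhead costs, which is consistent with the small gains reported for the low-detail scenario.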
While AsyncPostProcess saw minimal speedup, running tessellation alongside shadow rendering proved most effective. Overall, AsyncAll averaged a 2–10% frame time reduction due to fewer synchronization barriers.
Simplifying the pixel shader reduced G‑buffer costs, shifting the bottleneck to compute. As a result, asynchronous gains varied but AsyncAll remained the safest choice for consistent improvements (20–60% gains).
Q: Why were mesh shaders not considered in this work and what impact could they have on the proposed approach?
A: I did not consider mesh shaders because my main hypothesis focused on asynchronous compute: I aimed to optimize software tessellation algorithms executed in compute shaders. If mesh shaders were used, tessellation would have to run in the mesh stage (which executes on the graphics queue, not the compute queue), invalidating my hypothesis about asynchronous compute. Additionally, there is already an implementation of software tessellation in three variants (compute shaders, mesh shaders, and hardware tessellation), which allows a performance comparison of these three approaches.
Q: Why were GPU-Driven Pipeline techniques and the reasons for their adoption not examined and analyzed in this work? I would also like to evaluate these techniques in the context of a GPU-Driven Pipeline.
A: I did not detail GPU-Driven Pipeline techniques in the thesis (though I mentioned them indirectly in the document and presentation) because the tessellation algorithm itself inherently uses GPU-Driven Pipeline methods. For example, frustum culling runs entirely on the GPU without CPU interaction, and I use Indirect Draw, which also offloads rendering work from the CPU. Overall, the entire algorithm can run independently on the GPU; the only data transferred from CPU to GPU is the camera’s position. Note also that when testing the algorithm I measured GPU time with dedicated GPU counters (as shown in the charts), so time spent synchronizing with the CPU is not included and does not affect the results.
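The frustum-culling step mentioned in this answer can be sketched as a bounding-sphere test against six planes. The sketch is written in C++ for readability; in the project the equivalent test runs in a compute shader, and its exact form is not shown here. The types `Vec3` and `Plane`, the inward-facing plane convention, and the helper `makeBoxFrustum` are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>

struct Vec3 { float x, y, z; };

// Plane in the form dot(n, p) + d = 0.
// Assumption: n points towards the inside of the frustum.
struct Plane { Vec3 n; float d; };

inline float signedDistance(const Plane& plane, const Vec3& p) {
    return plane.n.x * p.x + plane.n.y * p.y + plane.n.z * p.z + plane.d;
}

// Conservative test: the sphere is culled only if it lies entirely on the
// outside of at least one plane. Spheres that straddle a plane are kept.
inline bool isCulled(const std::array<Plane, 6>& frustum,
                     const Vec3& center, float radius) {
    assert(radius >= 0.0f);
    for (const Plane& plane : frustum) {
        if (signedDistance(plane, center) < -radius) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: an axis-aligned cube [-h, h]^3 expressed as six
// inward-facing planes, standing in for a real camera frustum.
inline std::array<Plane, 6> makeBoxFrustum(float h) {
    return {{
        Plane{Vec3{ 1.0f,  0.0f,  0.0f}, h}, Plane{Vec3{-1.0f,  0.0f,  0.0f}, h},
        Plane{Vec3{ 0.0f,  1.0f,  0.0f}, h}, Plane{Vec3{ 0.0f, -1.0f,  0.0f}, h},
        Plane{Vec3{ 0.0f,  0.0f,  1.0f}, h}, Plane{Vec3{ 0.0f,  0.0f, -1.0f}, h},
    }};
}
```

Because the test needs only the plane equations and a per-triangle bounding sphere, it can run once per key on the GPU, and only the surviving keys feed the indirect draw arguments.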
Q: Why was DirectX 12 chosen instead of Vulkan?
A: I chose DirectX 12 because Windows is currently the most popular gaming OS, and DirectX 12 is the best graphics API for Windows.
Q: How are the GPU blocks loaded when running the tessellation algorithm in synchronous and asynchronous modes?
A: On NVIDIA Turing architecture, a pixel shader runs on an SM and requests SRV/UAV resources via the L1Tex (texture cache and texture pipeline). A miss in L1 sends the request to L2, and then to VRAM if needed. Finally, the pixel shader writes color using the CROP (Color Raster Operation) block. In my tessellation implementation, mathematical operations heavily utilize SM cores, so it makes sense to overlap tessellation with work that mainly uses rasterization modules (CROP, PROP, RASTER, etc.). In asynchronous mode, VRAM load also increases due to cache misses, which is a potential area for improvement. Thus, the main goal in asynchronous mode was to reduce the number of idle warps and balance load across GPU modules without conflicts. Here’s a great talk on optimizing GPU workloads: https://www.gdcvault.com/play/1026202/Optimizing-DX12-DXR-GPU-Workloads
- Windows 10/11
- Visual Studio 2022 (or later) with the "Desktop development with C++" workload
- DirectX 12 SDK (installed via Windows SDK)
All dependencies are included. Just:
- Open `AsyncComputeTessellation.sln` in Visual Studio 2022.
- Select x64 and the Debug or Release configuration.
- Build (`Build` > `Build Solution`).
- Run (`Debug` > `Start Without Debugging` or `Ctrl+F5`).
- Direct3D12 User-Mode Heap Synchronization (Microsoft)
- Advanced API Performance: Async Compute and Overlap (NVIDIA)
- Khoury J. et al. Adaptive GPU tessellation with compute shaders
- Khoury, J. GPU Tessellation with Compute Shaders: Master Thesis. EPFL, 2018. 57 pp.
- Optimizing DX12/DXR GPU Workloads (GDC)
- AMD Dives Deep on Asynchronous Shading (AnandTech)
- Life in the Triangle: NVIDIA’s Logical Pipeline
- Concurrent Execution & Asynchronous Queues (GPUOpen)
- Harness Powerful Shader Insights Using Shader Debug Info with NVIDIA Nsight Graphics











