Skip to content

RealTimeChris/benchmarksuite

Repository files navigation

Benchmark Suite

Hello and welcome to bnch_swt or "Benchmark Suite". This is a collection of classes/functions for the purpose of benchmarking CPU and GPU performance.

The following operating systems and compilers are officially supported:

Compiler Support


MSVC GCC CLANG NVCC

Operating System Support


Windows Linux Mac

Quickstart Guide for rtc-benchmarksuite

This guide will walk you through setting up and running benchmarks using rtc-benchmarksuite.

Table of Contents

Installation

Method 1: vcpkg + CMake (Recommended)

Step 1: Add to vcpkg.json

Create or update your vcpkg.json in your project root:

{
  "name": "your-project-name",
  "version": "1.0.0",
  "dependencies": [
    "rtc-benchmarksuite"
  ]
}

Step 2: Configure CMake

In your CMakeLists.txt:

cmake_minimum_required(VERSION 3.20)
project(YourProject LANGUAGES CXX CUDA)  # Add CUDA if using GPU benchmarks

# Set C++ standard
set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# For CUDA support (optional)
set(CMAKE_CUDA_STANDARD 20)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)

# Find the package
find_package(rtc-benchmarksuite CONFIG REQUIRED)

# Create your executable
add_executable(your_benchmark main.cpp)

# Link against rtc-benchmarksuite (header-only, just sets up includes)
target_link_libraries(your_benchmark PRIVATE rtc-benchmarksuite::rtc-benchmarksuite)

# If using CUDA
set_target_properties(your_benchmark PROPERTIES CUDA_SEPARABLE_COMPILATION ON)

Step 3: Configure with vcpkg toolchain

# Configure
cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake

# Build
cmake --build build --config Release

Step 4: Include in your code

#include <bnch_swt/index.hpp>

int main() {
    // Your benchmarks here
    return 0;
}

Method 2: Manual Installation

If not using vcpkg, you can include rtc-benchmarksuite as a header-only library:

Step 1: Clone the repository

git clone https://github.com/RealTimeChris/benchmarksuite.git

Step 2: Add to CMake

# Add as subdirectory
add_subdirectory(path/to/benchmarksuite)

# Or set include directory
target_include_directories(your_target PRIVATE path/to/benchmarksuite/include)

Step 3: Include headers

#include <bnch_swt/index.hpp>

Requirements

To use rtc-benchmarksuite, ensure you have a C++23 (or later) compliant compiler.

For CPU Benchmarking:

  • MSVC 2022 or later
  • GCC 13 or later
  • Clang 16 or later

For GPU/CUDA Benchmarking:

  • NVIDIA CUDA Toolkit 11.0 or later
  • NVCC compiler
  • CUDA-capable GPU

Platform-Specific Notes

Windows:

  • Use Visual Studio 2022 or later
  • For CUDA: Install CUDA Toolkit from NVIDIA

Linux:

  • Install build essentials: sudo apt-get install build-essential
  • For CUDA: Install CUDA Toolkit via package manager or NVIDIA installer

macOS:

  • Install Xcode Command Line Tools
  • CUDA support not available on Apple Silicon (M1/M2/M3)

Verification

Verify your installation with a simple test:

#include <bnch_swt/index.hpp>
#include <iostream>

int main() {
    std::cout << "rtc-benchmarksuite successfully installed!" << std::endl;
    return 0;
}

Basic Example

The following example demonstrates how to set up and run a benchmark comparing two integer-to-string conversion functions:

// Define benchmark functions as structs with static impl() methods
struct glz_to_chars_benchmark {
    BNCH_SWT_HOST static uint64_t impl(std::vector<int64_t>& test_values, 
                                        std::vector<std::string>& test_values_00,
                                        std::vector<std::string>& test_values_01) {
        uint64_t bytes_processed = 0;
        char newer_string[30]{};
        for (uint64_t x = 0; x < test_values.size(); ++x) {
            std::memset(newer_string, '\0', sizeof(newer_string));
            auto new_ptr = glz::to_chars(newer_string, test_values[x]);
            bytes_processed += test_values_00[x].size();
            test_values_01[x] = std::string{newer_string, static_cast<uint64_t>(new_ptr - newer_string)};
        }
        return bytes_processed;
    }
};

struct jsonifier_to_chars_benchmark {
    BNCH_SWT_HOST static uint64_t impl(std::vector<int64_t>& test_values,
                                        std::vector<std::string>& test_values_00,
                                        std::vector<std::string>& test_values_01) {
        uint64_t bytes_processed = 0;
        char newer_string[30]{};
        for (uint64_t x = 0; x < test_values.size(); ++x) {
            std::memset(newer_string, '\0', sizeof(newer_string));
            auto new_ptr = jsonifier_internal::to_chars(newer_string, test_values[x]);
            bytes_processed += test_values_00[x].size();
            test_values_01[x] = std::string{newer_string, static_cast<uint64_t>(new_ptr - newer_string)};
        }
        return bytes_processed;
    }
};

int main() {
    constexpr uint64_t count = 512;
    
    // Setup test data
    std::vector<int64_t> test_values = generate_random_integers<int64_t>(count, 20);
    std::vector<std::string> test_values_00;
    std::vector<std::string> test_values_01(count);
    
    for (uint64_t x = 0; x < count; ++x) {
        test_values_00.emplace_back(std::to_string(test_values[x]));
    }
    
    // Define benchmark stage with 200 total iterations, 25 measured, CPU benchmarking
    using benchmark = bnch_swt::benchmark_stage<"int-to-string-comparison", 200, 25, 
                                                 bnch_swt::benchmark_types::cpu>;
    
    // Run benchmarks
    benchmark::run_benchmark<"glz::to_chars", glz_to_chars_benchmark>(test_values, test_values_00, test_values_01);
    benchmark::run_benchmark<"jsonifier::to_chars", jsonifier_to_chars_benchmark>(test_values, test_values_00, test_values_01);
    
    // Print results with comparison
    benchmark::print_results(true, true);
    
    return 0;
}

Creating Benchmarks

To create a benchmark:

  1. Define your benchmark functions as structs with a static impl() method that returns uint64_t (bytes processed)
  2. Use bnch_swt::benchmark_stage with appropriate template parameters
  3. Call run_benchmark with your benchmark struct and any required arguments

Benchmark Stage

The benchmark_stage structure orchestrates each test and supports both CPU and GPU benchmarking:

// Full template signature
template<bnch_swt::string_literal stage_name,              // Required: benchmark stage name
         uint64_t max_execution_count = 200,               // Total iterations (warmup + measured)
         uint64_t measured_iteration_count = 25,           // Iterations to measure
         bnch_swt::benchmark_types benchmark_type = bnch_swt::benchmark_types::cpu,  // CPU or CUDA
         bool clear_cpu_cache_between_each_iteration = false,  // Cache clearing flag
         bnch_swt::string_literal metric_name = bnch_swt::string_literal<1>{}  // Custom metric name
>
struct benchmark_stage;

// Common usage examples
using cpu_benchmark = bnch_swt::benchmark_stage<"my-benchmark">;  // Uses defaults: 200 total, 25 measured, CPU
using gpu_benchmark = bnch_swt::benchmark_stage<"gpu-test", 100, 10, bnch_swt::benchmark_types::cuda>;
using custom_metric = bnch_swt::benchmark_stage<"compression", 200, 25, bnch_swt::benchmark_types::cpu, false, "compression-ratio">;

Template Parameters

  • stage_name (required): String literal identifying the benchmark stage
  • max_execution_count (default 200): Total number of iterations including warmup
  • measured_iteration_count (default 25): Number of iterations to measure for final metrics
  • benchmark_type (default cpu): bnch_swt::benchmark_types::cpu or bnch_swt::benchmark_types::cuda
  • clear_cpu_cache_between_each_iteration (default false): Whether to clear CPU caches between iterations
  • metric_name (default empty): Custom metric name for specialized benchmarks (e.g., compression ratios)

Methods

  • run_benchmark<name, function_type>(args...): Executes the benchmark function's impl() method with the provided arguments

    • name: String literal identifying this specific benchmark within the stage
    • function_type: Struct type with a static impl() method
    • For CPU: run_benchmark<name, function_type>(args...) where args are forwarded to impl()
    • For CUDA: run_benchmark<name, function_type>(grid, block, shared_mem, bytes_processed, args...) where:
      • grid: dim3 specifying grid dimensions
      • block: dim3 specifying block dimensions
      • shared_mem: uint64_t bytes of shared memory
      • bytes_processed: uint64_t bytes processed for throughput calculation
      • args...: Additional arguments forwarded to kernel impl()
    • Returns: performance_metrics<benchmark_type> object
  • print_results(show_comparison = true, show_metrics = true): Displays performance metrics and comparisons

    • show_comparison: Whether to show head-to-head comparisons between benchmarks
    • show_metrics: Whether to show detailed hardware counter metrics
  • get_results(): Returns a sorted vector of all performance_metrics for programmatic access

Benchmark Function Requirements

Benchmark functions must be defined as structs with a static impl() method:

For CPU benchmarks:

struct my_cpu_benchmark {
    BNCH_SWT_HOST static uint64_t impl(/* your parameters */) {
        // Your CPU code to benchmark
        uint64_t bytes_processed = /* calculate bytes */;
        return bytes_processed;  // Must return bytes processed
    }
};

For CUDA benchmarks:

struct my_cuda_benchmark {
    BNCH_SWT_DEVICE static void impl(/* your parameters */) {
        // Your CUDA kernel code (runs on device)
        // This code will be wrapped in a kernel launch by the framework
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        // ... your kernel logic
    }
};

Key differences:

  • CPU: impl() returns uint64_t (bytes processed) and uses BNCH_SWT_HOST
  • CUDA: impl() returns void, uses BNCH_SWT_DEVICE, and contains kernel code (not a kernel launch)
  • CUDA: Bytes processed is passed as a parameter to run_benchmark(), not returned from impl()
  • CUDA: The framework automatically wraps your impl() in a kernel launch with the specified grid/block dimensions

CPU vs GPU Benchmarking

As of v1.0.0, rtc-benchmarksuite supports both CPU and GPU (CUDA) benchmarking through the benchmark_types enum.

CPU Benchmarks

// Define CPU benchmark function
struct cpu_computation_benchmark {
    BNCH_SWT_HOST static uint64_t impl(const std::vector<float>& input, std::vector<float>& output) {
        // Your CPU computation here
        for (size_t i = 0; i < input.size(); ++i) {
            output[i] = std::sqrt(input[i] * input[i] + 1.0f);
        }
        // Return bytes processed for throughput calculation
        return input.size() * sizeof(float);
    }
};

// Create CPU benchmark stage (200 total iterations, 25 measured, CPU type)
using cpu_stage = bnch_swt::benchmark_stage<"cpu-test", 200, 25, bnch_swt::benchmark_types::cpu>;

// Setup data
constexpr size_t data_size = 1024 * 1024;
std::vector<float> input(data_size, 1.0f);
std::vector<float> output(data_size);

// Run the benchmark
cpu_stage::run_benchmark<"my-cpu-function", cpu_computation_benchmark>(input, output);

// Print results
cpu_stage::print_results();

GPU/CUDA Benchmarks

// Define CUDA kernel benchmark
struct cuda_kernel_benchmark {
    BNCH_SWT_DEVICE static void impl(float* data, uint64_t size) {
        // Your CUDA kernel code here
        // This runs inside the kernel, NOT as a kernel launch
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < size) {
            data[idx] = data[idx] * 2.0f;  // Example operation
        }
    }
};

// Create CUDA benchmark stage
using cuda_stage = bnch_swt::benchmark_stage<"gpu-test", 100, 10, bnch_swt::benchmark_types::cuda>;

// Setup
constexpr uint64_t data_size = 1024 * 1024;
float* gpu_data;
cudaMalloc(&gpu_data, data_size * sizeof(float));

// Configure kernel launch parameters
dim3 grid{256, 1, 1};
dim3 block{256, 1, 1};
uint64_t shared_memory = 0;
uint64_t bytes_processed = data_size * sizeof(float);

// Run CUDA benchmark
// Parameters: grid, block, shared_mem, bytes_processed, then your kernel args
cuda_stage::run_benchmark<"my-cuda-kernel", cuda_kernel_benchmark>(
    grid, block, shared_memory, bytes_processed, 
    gpu_data, data_size
);

cuda_stage::print_results();
cudaFree(gpu_data);

Important: For CUDA benchmarks, the impl() method contains the kernel code itself (not a kernel launch). The benchmarking framework wraps it in a kernel launch using the provided grid/block dimensions.

Mixed CPU/GPU Benchmarking

You can benchmark CPU and GPU implementations side-by-side:

constexpr uint64_t data_size = 1024 * 1024;

// CPU benchmark function
struct cpu_process_benchmark {
    BNCH_SWT_HOST static uint64_t impl(std::vector<float>& cpu_data) {
        // Process data on CPU
        for (size_t i = 0; i < cpu_data.size(); ++i) {
            cpu_data[i] = cpu_data[i] * 2.0f;
        }
        return cpu_data.size() * sizeof(float);
    }
};

// GPU benchmark function (kernel code, NOT kernel launch)
struct gpu_process_benchmark {
    BNCH_SWT_DEVICE static void impl(float* gpu_data, uint64_t size) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < size) {
            gpu_data[idx] = gpu_data[idx] * 2.0f;
        }
    }
};

// Setup test data
std::vector<float> cpu_data(data_size);
float* gpu_data;
cudaMalloc(&gpu_data, data_size * sizeof(float));

// CPU version
using cpu_test = bnch_swt::benchmark_stage<"cpu-vs-gpu", 100, 10, bnch_swt::benchmark_types::cpu>;
cpu_test::run_benchmark<"cpu-version", cpu_process_benchmark>(cpu_data);

// GPU version  
using gpu_test = bnch_swt::benchmark_stage<"cpu-vs-gpu", 100, 10, bnch_swt::benchmark_types::cuda>;
dim3 grid{(data_size + 255) / 256, 1, 1};
dim3 block{256, 1, 1};
gpu_test::run_benchmark<"gpu-version", gpu_process_benchmark>(
    grid, block, 0, data_size * sizeof(float),
    gpu_data, data_size
);

// Print both results for comparison
cpu_test::print_results();
gpu_test::print_results();

cudaFree(gpu_data);

This allows direct performance comparison between CPU and GPU implementations of the same algorithm.

Cache Clearing Option

For more accurate CPU benchmarks, you can enable cache clearing between iterations:

// Enable cache clearing (5th template parameter)
using cache_cleared = bnch_swt::benchmark_stage<"cache-test", 200, 25, bnch_swt::benchmark_types::cpu, true>;

This is useful when benchmarking memory-bound operations where you want to measure cold cache performance.

Custom Metrics

You can specify custom metric names for specialized benchmarks that don't measure traditional throughput:

// Compression benchmark with custom metric name
using compression_bench = bnch_swt::benchmark_stage<"compression-test", 200, 25, 
                                                     bnch_swt::benchmark_types::cpu, 
                                                     false, 
                                                     "compression-ratio">;

struct compress_benchmark {
    BNCH_SWT_HOST static uint64_t impl(const std::vector<uint8_t>& input) {
        auto compressed = compress_data(input);
        // Return custom metric value (e.g., compression ratio * 1000)
        return (input.size() * 1000) / compressed.size();
    }
};

compression_bench::run_benchmark<"my-compressor", compress_benchmark>(input_data);
compression_bench::print_results();

When a custom metric name is provided, the results will display your custom metric instead of standard MB/s throughput.

Running Benchmarks

With vcpkg + CMake (recommended):

# Configure with vcpkg toolchain
cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake -DCMAKE_BUILD_TYPE=Release

# Build
cmake --build build --config Release

# Run
./build/your_benchmark  # Linux/macOS
.\build\Release\your_benchmark.exe  # Windows

Manual CMake build:

cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
./build/your_benchmark

For CUDA benchmarks, ensure CUDA is enabled:

cmake -B build -S . \
  -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_ARCHITECTURES=86  # Adjust for your GPU architecture
  
cmake --build build --config Release

Common CMake Options

  • -DCMAKE_BUILD_TYPE=Release - Build optimized release version
  • -DCMAKE_CUDA_ARCHITECTURES=86 - Target specific CUDA compute capability (e.g., 86 for RTX 30xx/40xx)
  • -DCMAKE_CXX_COMPILER=clang++ - Specify C++ compiler
  • -DCMAKE_CUDA_COMPILER=nvcc - Specify CUDA compiler

Complete Project Example

Project structure:

my-benchmark/
├── CMakeLists.txt
├── vcpkg.json
├── main.cpp
└── benchmarks/
    ├── cpu_benchmark.hpp
    └── gpu_benchmark.cuh

CMakeLists.txt:

cmake_minimum_required(VERSION 3.20)
project(MyBenchmark LANGUAGES CXX CUDA)

# C++23 required
set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

# CUDA 20 for GPU benchmarks
set(CMAKE_CUDA_STANDARD 20)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)

# Find rtc-benchmarksuite
find_package(rtc-benchmarksuite CONFIG REQUIRED)

# Create executable
add_executable(my_benchmark 
    main.cpp
    benchmarks/cpu_benchmark.hpp
    benchmarks/gpu_benchmark.cuh
)

# Link rtc-benchmarksuite
target_link_libraries(my_benchmark PRIVATE 
    rtc-benchmarksuite::rtc-benchmarksuite
)

# CUDA properties
set_target_properties(my_benchmark PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON
    CUDA_RESOLVE_DEVICE_SYMBOLS ON
)

# Optimization flags
if(MSVC)
    target_compile_options(my_benchmark PRIVATE /O2 /arch:AVX2)
else()
    target_compile_options(my_benchmark PRIVATE -O3 -march=native)
endif()

vcpkg.json:

{
  "name": "my-benchmark",
  "version": "1.0.0",
  "dependencies": [
    "rtc-benchmarksuite"
  ]
}

Output and Results

Performance Metrics for: int-to-string-comparisons-1
Metrics for: benchmarksuite::internal::to_chars
Total Iterations to Stabilize                               : 394
Measured Iterations                                         : 20
Bytes Processed                                             : 512.00
Nanoseconds per Execution                                   : 5785.25
Frequency (GHz)                                             : 4.83
Throughput (MB/s)                                           : 84.58
Throughput Percentage Deviation (+/-%)                      : 8.36
Cycles per Execution                                        : 27921.20
Cycles per Byte                                             : 54.53
Instructions per Execution                                  : 52026.00
Instructions per Cycle                                      : 1.86
Instructions per Byte                                       : 101.61
Branches per Execution                                      : 361.45
Branch Misses per Execution                                 : 0.73
Cache References per Execution                              : 97.03
Cache Misses per Execution                                  : 74.68
----------------------------------------
Metrics for: glz::to_chars
Total Iterations to Stabilize                               : 421
Measured Iterations                                         : 20
Bytes Processed                                             : 512.00
Nanoseconds per Execution                                   : 6480.30
Frequency (GHz)                                             : 4.68
Throughput (MB/s)                                           : 75.95
Throughput Percentage Deviation (+/-%)                      : 17.58
Cycles per Execution                                        : 30314.40
Cycles per Byte                                             : 59.21
Instructions per Execution                                  : 51513.00
Instructions per Cycle                                      : 1.70
Instructions per Byte                                       : 100.61
Branches per Execution                                      : 438.25
Branch Misses per Execution                                 : 0.73
Cache References per Execution                              : 95.93
Cache Misses per Execution                                  : 73.59
----------------------------------------
Library benchmarksuite::internal::to_chars, is faster than library: glz::to_chars, by roughly: 11.36%.

This structured output helps you quickly identify which implementation is faster or more efficient.

Features

Dual Benchmarking Support

  • CPU Benchmarking: Traditional CPU performance measurement with hardware counters
  • GPU/CUDA Benchmarking: Native CUDA kernel benchmarking with grid/block configuration
  • Mixed Workloads: Compare CPU vs GPU implementations side-by-side
  • Automatic Device Selection: Choose benchmark type via bnch_swt::benchmark_types::cpu or bnch_swt::benchmark_types::cuda

Advanced Options

  • Cache Clearing: Optional cache eviction between iterations for cold-cache benchmarks
  • Custom Metrics: Define custom metric names for specialized benchmarks (e.g., compression ratios, custom throughput units)
  • Configurable Iterations: Separate control over warmup iterations and measured iterations
  • Programmatic Access: Retrieve raw performance metrics via get_results() for custom analysis

Hardware Introspection

  • CPU Properties: Comprehensive CPU detection and properties via benchmarksuite_cpu_properties.hpp
  • GPU Properties: CUDA device detection and properties via benchmarksuite_gpu_properties.hpp

Performance Counters

  • Cross-platform CPU counters: Windows, Linux, macOS, Android, Apple ARM
  • CUDA performance events: GPU-specific performance monitoring via counters/cuda_perf_events.hpp

Utilities

  • Cache management: Cross-platform cache clearing utilities
  • Aligned constants: Compile-time aligned data structures
  • Random generators: High-quality random data generation for benchmarks

API Conventions

As of v1.0.0, all APIs follow snake_case naming convention:

  • Functions: do_not_optimize_away(), generate_random_integers(), print_results()
  • Types: size_type, string_literal
  • Variables: bytes_processed, test_values

Migrating from Pre-1.0.0

If you're upgrading from an earlier version:

  1. Update package name: benchmarksuitertc-benchmarksuite

  2. Update include paths: All includes are lowercase (already standard)

  3. Update API calls: Convert camelCase/PascalCase to snake_case

    • doNotOptimizeAway()do_not_optimize_away()
    • printResults()print_results()
    • generateRandomIntegers()generate_random_integers()
  4. Change benchmark interface: Lambdas are replaced with structs

    // Old (lambda-based)
    benchmark_stage<"test">::run_benchmark<"name">([&] {
        // code here
        return bytes_processed;
    });
    
    // New (struct-based)
    struct my_benchmark {
        BNCH_SWT_HOST static uint64_t impl(/* params */) {
            // code here
            return bytes_processed;
        }
    };
    benchmark_stage<"test">::run_benchmark<"name", my_benchmark>(/* args */);
  5. Update template parameters: benchmark_stage now has more options

    // Old (positional parameters)
    benchmark_stage<"test", iterations, measured>
    
    // New (with defaults and additional options)
    benchmark_stage<"test", 200, 25, benchmark_types::cpu, false, "">
    //                      ^^^  ^^  ^^^^^^^^^^^^^^^^^^  ^^^^^  ^^
    //                      max  measured  type         cache  metric
  6. New feature - Device types: You can now specify CPU or CUDA benchmarking:

    // CPU (default)
    benchmark_stage<"test", 200, 25, bnch_swt::benchmark_types::cpu>
    
    // CUDA/GPU
    benchmark_stage<"test", 100, 10, bnch_swt::benchmark_types::cuda>
  7. New feature - Cache clearing: Enable cache clearing between iterations for CPU benchmarks:

    // Clear cache between each iteration (5th parameter)
    benchmark_stage<"test", 200, 25, benchmark_types::cpu, true>
  8. New feature - Custom metrics: Specify custom metric names for specialized benchmarks:

    // Use custom metric instead of default throughput (6th parameter)
    benchmark_stage<"compression-test", 200, 25, benchmark_types::cpu, false, "compression-ratio">

Now you're ready to start benchmarking with rtc-benchmarksuite!