Hello and welcome to bnch_swt, or "Benchmark Suite": a collection of classes and functions for benchmarking CPU and GPU performance.
This guide will walk you through setting up and running benchmarks using rtc-benchmarksuite. The officially supported operating systems and compilers are listed in the requirements notes later in this guide.
- Installation
- Basic Example
- Creating Benchmarks
- CPU vs GPU Benchmarking
- Running Benchmarks
- Output and Results
- Features
- API Conventions
- Migrating from Pre-1.0.0
Step 1: Add to vcpkg.json
Create or update your vcpkg.json in your project root:
{
"name": "your-project-name",
"version": "1.0.0",
"dependencies": [
"rtc-benchmarksuite"
]
}
Step 2: Configure CMake
In your CMakeLists.txt:
cmake_minimum_required(VERSION 3.20)
project(YourProject LANGUAGES CXX CUDA) # Add CUDA if using GPU benchmarks
# Set C++ standard
set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
# For CUDA support (optional)
set(CMAKE_CUDA_STANDARD 20)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)
# Find the package
find_package(rtc-benchmarksuite CONFIG REQUIRED)
# Create your executable
add_executable(your_benchmark main.cpp)
# Link against rtc-benchmarksuite (header-only, just sets up includes)
target_link_libraries(your_benchmark PRIVATE rtc-benchmarksuite::rtc-benchmarksuite)
# If using CUDA
set_target_properties(your_benchmark PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
Step 3: Configure with vcpkg toolchain
# Configure
cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake
# Build
cmake --build build --config Release
Step 4: Include in your code
#include <bnch_swt/index.hpp>
int main() {
// Your benchmarks here
return 0;
}
If not using vcpkg, you can include rtc-benchmarksuite as a header-only library:
Step 1: Clone the repository
git clone https://github.com/RealTimeChris/benchmarksuite.git
Step 2: Add to CMake
# Add as subdirectory
add_subdirectory(path/to/benchmarksuite)
# Or set include directory
target_include_directories(your_target PRIVATE path/to/benchmarksuite/include)
Step 3: Include headers
#include <bnch_swt/index.hpp>
To use rtc-benchmarksuite, ensure you have a C++23 (or later) compliant compiler.
For CPU Benchmarking:
- MSVC 2022 or later
- GCC 13 or later
- Clang 16 or later
For GPU/CUDA Benchmarking:
- NVIDIA CUDA Toolkit 11.0 or later
- NVCC compiler
- CUDA-capable GPU
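Before building GPU benchmarks, you can confirm that the CUDA toolchain and driver are visible. These are standard NVIDIA tools that ship with the CUDA Toolkit and the NVIDIA driver, not part of rtc-benchmarksuite:
nvcc --version    # prints the installed CUDA compiler version
nvidia-smi        # lists detected NVIDIA GPUs and the driver version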
Windows:
- Use Visual Studio 2022 or later
- For CUDA: Install CUDA Toolkit from NVIDIA
Linux:
- Install build essentials:
sudo apt-get install build-essential
- For CUDA: Install the CUDA Toolkit via your package manager or the NVIDIA installer (see the example command below)
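For example, on Ubuntu or Debian-based systems one option is the distribution package; package names and the CUDA version they provide vary by distribution, so treat this as a starting point rather than the canonical method:
sudo apt-get install nvidia-cuda-toolkit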
macOS:
- Install Xcode Command Line Tools
- CUDA benchmarking is not available on macOS (including Apple Silicon M1/M2/M3); only CPU benchmarking is supported there
Verify your installation with a simple test:
#include <bnch_swt/index.hpp>
#include <iostream>
int main() {
std::cout << "rtc-benchmarksuite successfully installed!" << std::endl;
return 0;
}
The following example demonstrates how to set up and run a benchmark comparing two integer-to-string conversion functions:
// Define benchmark functions as structs with static impl() methods
struct glz_to_chars_benchmark {
BNCH_SWT_HOST static uint64_t impl(std::vector<int64_t>& test_values,
std::vector<std::string>& test_values_00,
std::vector<std::string>& test_values_01) {
uint64_t bytes_processed = 0;
char newer_string[30]{};
for (uint64_t x = 0; x < test_values.size(); ++x) {
std::memset(newer_string, '\0', sizeof(newer_string));
auto new_ptr = glz::to_chars(newer_string, test_values[x]);
bytes_processed += test_values_00[x].size();
test_values_01[x] = std::string{newer_string, static_cast<uint64_t>(new_ptr - newer_string)};
}
return bytes_processed;
}
};
struct jsonifier_to_chars_benchmark {
BNCH_SWT_HOST static uint64_t impl(std::vector<int64_t>& test_values,
std::vector<std::string>& test_values_00,
std::vector<std::string>& test_values_01) {
uint64_t bytes_processed = 0;
char newer_string[30]{};
for (uint64_t x = 0; x < test_values.size(); ++x) {
std::memset(newer_string, '\0', sizeof(newer_string));
auto new_ptr = jsonifier_internal::to_chars(newer_string, test_values[x]);
bytes_processed += test_values_00[x].size();
test_values_01[x] = std::string{newer_string, static_cast<uint64_t>(new_ptr - newer_string)};
}
return bytes_processed;
}
};
int main() {
constexpr uint64_t count = 512;
// Setup test data
std::vector<int64_t> test_values = generate_random_integers<int64_t>(count, 20);
std::vector<std::string> test_values_00;
std::vector<std::string> test_values_01(count);
for (uint64_t x = 0; x < count; ++x) {
test_values_00.emplace_back(std::to_string(test_values[x]));
}
// Define benchmark stage with 200 total iterations, 25 measured, CPU benchmarking
using benchmark = bnch_swt::benchmark_stage<"int-to-string-comparison", 200, 25,
bnch_swt::benchmark_types::cpu>;
// Run benchmarks
benchmark::run_benchmark<"glz::to_chars", glz_to_chars_benchmark>(test_values, test_values_00, test_values_01);
benchmark::run_benchmark<"jsonifier::to_chars", jsonifier_to_chars_benchmark>(test_values, test_values_00, test_values_01);
// Print results with comparison
benchmark::print_results(true, true);
return 0;
}
To create a benchmark:
- Define your benchmark functions as structs with a static impl() method that returns uint64_t (bytes processed)
- Use bnch_swt::benchmark_stage with appropriate template parameters
- Call run_benchmark with your benchmark struct and any required arguments (a minimal sketch follows below)
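Putting those three steps together, a minimal sketch; the struct name, stage name, and data here are placeholders for illustration:
#include <bnch_swt/index.hpp>
#include <cstdint>
#include <numeric>
#include <vector>

// Placeholder benchmark: sums a vector and reports the bytes it touched.
struct sum_benchmark {
    BNCH_SWT_HOST static uint64_t impl(const std::vector<uint64_t>& values, uint64_t& sink) {
        // Write the result to an out-parameter so the work is not optimized away
        sink = std::accumulate(values.begin(), values.end(), uint64_t{0});
        return values.size() * sizeof(uint64_t); // bytes processed
    }
};

int main() {
    std::vector<uint64_t> values(1024, 1);
    uint64_t sink = 0;
    // Defaults: 200 total iterations, 25 measured, CPU benchmarking
    using stage = bnch_swt::benchmark_stage<"minimal-example">;
    stage::run_benchmark<"vector-sum", sum_benchmark>(values, sink);
    stage::print_results();
    return 0;
}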
The benchmark_stage structure orchestrates each test and supports both CPU and GPU benchmarking:
// Full template signature
template<bnch_swt::string_literal stage_name, // Required: benchmark stage name
uint64_t max_execution_count = 200, // Total iterations (warmup + measured)
uint64_t measured_iteration_count = 25, // Iterations to measure
bnch_swt::benchmark_types benchmark_type = bnch_swt::benchmark_types::cpu, // CPU or CUDA
bool clear_cpu_cache_between_each_iteration = false, // Cache clearing flag
bnch_swt::string_literal metric_name = bnch_swt::string_literal<1>{} // Custom metric name
>
struct benchmark_stage;
// Common usage examples
using cpu_benchmark = bnch_swt::benchmark_stage<"my-benchmark">; // Uses defaults: 200 total, 25 measured, CPU
using gpu_benchmark = bnch_swt::benchmark_stage<"gpu-test", 100, 10, bnch_swt::benchmark_types::cuda>;
using custom_metric = bnch_swt::benchmark_stage<"compression", 200, 25, bnch_swt::benchmark_types::cpu, false, "compression-ratio">;
Template parameters:
- stage_name (required): String literal identifying the benchmark stage
- max_execution_count (default 200): Total number of iterations including warmup
- measured_iteration_count (default 25): Number of iterations to measure for final metrics
- benchmark_type (default cpu): bnch_swt::benchmark_types::cpu or bnch_swt::benchmark_types::cuda
- clear_cpu_cache_between_each_iteration (default false): Whether to clear CPU caches between iterations
- metric_name (default empty): Custom metric name for specialized benchmarks (e.g., compression ratios)
Key methods:
- run_benchmark<name, function_type>(args...): Executes the benchmark function's impl() method with the provided arguments
  - name: String literal identifying this specific benchmark within the stage
  - function_type: Struct type with a static impl() method
  - For CPU: run_benchmark<name, function_type>(args...), where args are forwarded to impl()
  - For CUDA: run_benchmark<name, function_type>(grid, block, shared_mem, bytes_processed, args...), where:
    - grid: dim3 specifying grid dimensions
    - block: dim3 specifying block dimensions
    - shared_mem: uint64_t bytes of shared memory
    - bytes_processed: uint64_t bytes processed, used for throughput calculation
    - args...: Additional arguments forwarded to the kernel impl()
  - Returns: a performance_metrics<benchmark_type> object
- print_results(show_comparison = true, show_metrics = true): Displays performance metrics and comparisons
  - show_comparison: Whether to show head-to-head comparisons between benchmarks
  - show_metrics: Whether to show detailed hardware counter metrics
- get_results(): Returns a sorted vector of all performance_metrics for programmatic access (see the sketch after this list)
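A rough sketch of programmatic access follows; the members of performance_metrics are not documented in this guide, so any field names are assumptions to verify against the headers:
// Retrieve the sorted metrics for the stage defined in the basic example above
auto results = benchmark::get_results();
for (const auto& metrics: results) {
    // Feed each entry into your own reporting, CSV export, or regression checks.
    // Field names (e.g. a benchmark name or throughput member) are assumptions --
    // consult performance_metrics in the headers for the actual interface.
}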
Benchmark functions must be defined as structs with a static impl() method:
For CPU benchmarks:
struct my_cpu_benchmark {
BNCH_SWT_HOST static uint64_t impl(/* your parameters */) {
// Your CPU code to benchmark
uint64_t bytes_processed = /* calculate bytes */;
return bytes_processed; // Must return bytes processed
}
};
For CUDA benchmarks:
struct my_cuda_benchmark {
BNCH_SWT_DEVICE static void impl(/* your parameters */) {
// Your CUDA kernel code (runs on device)
// This code will be wrapped in a kernel launch by the framework
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// ... your kernel logic
}
};
Key differences:
- CPU: impl() returns uint64_t (bytes processed) and uses BNCH_SWT_HOST
- CUDA: impl() returns void, uses BNCH_SWT_DEVICE, and contains kernel code (not a kernel launch)
- CUDA: Bytes processed is passed as a parameter to run_benchmark(), not returned from impl()
- CUDA: The framework automatically wraps your impl() in a kernel launch with the specified grid/block dimensions
As of v1.0.0, rtc-benchmarksuite supports both CPU and GPU (CUDA) benchmarking through the benchmark_types enum.
// Define CPU benchmark function
struct cpu_computation_benchmark {
BNCH_SWT_HOST static uint64_t impl(const std::vector<float>& input, std::vector<float>& output) {
// Your CPU computation here
for (size_t i = 0; i < input.size(); ++i) {
output[i] = std::sqrt(input[i] * input[i] + 1.0f);
}
// Return bytes processed for throughput calculation
return input.size() * sizeof(float);
}
};
// Create CPU benchmark stage (200 total iterations, 25 measured, CPU type)
using cpu_stage = bnch_swt::benchmark_stage<"cpu-test", 200, 25, bnch_swt::benchmark_types::cpu>;
// Setup data
constexpr size_t data_size = 1024 * 1024;
std::vector<float> input(data_size, 1.0f);
std::vector<float> output(data_size);
// Run the benchmark
cpu_stage::run_benchmark<"my-cpu-function", cpu_computation_benchmark>(input, output);
// Print results
cpu_stage::print_results();
// Define CUDA kernel benchmark
struct cuda_kernel_benchmark {
BNCH_SWT_DEVICE static void impl(float* data, uint64_t size) {
// Your CUDA kernel code here
// This runs inside the kernel, NOT as a kernel launch
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < size) {
data[idx] = data[idx] * 2.0f; // Example operation
}
}
};
// Create CUDA benchmark stage
using cuda_stage = bnch_swt::benchmark_stage<"gpu-test", 100, 10, bnch_swt::benchmark_types::cuda>;
// Setup
constexpr uint64_t data_size = 1024 * 1024;
float* gpu_data;
cudaMalloc(&gpu_data, data_size * sizeof(float));
// Configure kernel launch parameters (enough blocks to cover every element)
dim3 grid{(data_size + 255) / 256, 1, 1};
dim3 block{256, 1, 1};
uint64_t shared_memory = 0;
uint64_t bytes_processed = data_size * sizeof(float);
// Run CUDA benchmark
// Parameters: grid, block, shared_mem, bytes_processed, then your kernel args
cuda_stage::run_benchmark<"my-cuda-kernel", cuda_kernel_benchmark>(
grid, block, shared_memory, bytes_processed,
gpu_data, data_size
);
cuda_stage::print_results();
cudaFree(gpu_data);
Important: For CUDA benchmarks, the impl() method contains the kernel code itself (not a kernel launch). The benchmarking framework wraps it in a kernel launch using the provided grid/block dimensions.
You can benchmark CPU and GPU implementations side-by-side:
constexpr uint64_t data_size = 1024 * 1024;
// CPU benchmark function
struct cpu_process_benchmark {
BNCH_SWT_HOST static uint64_t impl(std::vector<float>& cpu_data) {
// Process data on CPU
for (size_t i = 0; i < cpu_data.size(); ++i) {
cpu_data[i] = cpu_data[i] * 2.0f;
}
return cpu_data.size() * sizeof(float);
}
};
// GPU benchmark function (kernel code, NOT kernel launch)
struct gpu_process_benchmark {
BNCH_SWT_DEVICE static void impl(float* gpu_data, uint64_t size) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < size) {
gpu_data[idx] = gpu_data[idx] * 2.0f;
}
}
};
// Setup test data
std::vector<float> cpu_data(data_size);
float* gpu_data;
cudaMalloc(&gpu_data, data_size * sizeof(float));
// CPU version
using cpu_test = bnch_swt::benchmark_stage<"cpu-vs-gpu", 100, 10, bnch_swt::benchmark_types::cpu>;
cpu_test::run_benchmark<"cpu-version", cpu_process_benchmark>(cpu_data);
// GPU version
using gpu_test = bnch_swt::benchmark_stage<"cpu-vs-gpu", 100, 10, bnch_swt::benchmark_types::cuda>;
dim3 grid{(data_size + 255) / 256, 1, 1};
dim3 block{256, 1, 1};
gpu_test::run_benchmark<"gpu-version", gpu_process_benchmark>(
grid, block, 0, data_size * sizeof(float),
gpu_data, data_size
);
// Print both results for comparison
cpu_test::print_results();
gpu_test::print_results();
cudaFree(gpu_data);
This allows direct performance comparison between CPU and GPU implementations of the same algorithm.
For CPU benchmarks that are sensitive to cache state, you can enable cache clearing between iterations:
// Enable cache clearing (5th template parameter)
using cache_cleared = bnch_swt::benchmark_stage<"cache-test", 200, 25, bnch_swt::benchmark_types::cpu, true>;
This is useful when benchmarking memory-bound operations where you want to measure cold-cache performance.
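For instance, a minimal sketch of a memory-bound benchmark run through the cache_cleared stage above; the struct name and buffer size are illustrative:
struct memory_sum_benchmark {
    BNCH_SWT_HOST static uint64_t impl(const std::vector<uint64_t>& data, uint64_t& sink) {
        uint64_t sum = 0;
        for (uint64_t value: data) {
            sum += value; // touches every cache line in the buffer
        }
        sink = sum; // write the result out so the loop is not optimized away
        return data.size() * sizeof(uint64_t);
    }
};

std::vector<uint64_t> data(1ull << 20, 1); // 8 MB buffer, larger than most L2 caches
uint64_t sink = 0;
cache_cleared::run_benchmark<"cold-cache-sum", memory_sum_benchmark>(data, sink);
cache_cleared::print_results();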
You can specify custom metric names for specialized benchmarks that don't measure traditional throughput:
// Compression benchmark with custom metric name
using compression_bench = bnch_swt::benchmark_stage<"compression-test", 200, 25,
bnch_swt::benchmark_types::cpu,
false,
"compression-ratio">;
struct compress_benchmark {
BNCH_SWT_HOST static uint64_t impl(const std::vector<uint8_t>& input) {
auto compressed = compress_data(input);
// Return custom metric value (e.g., compression ratio * 1000)
return (input.size() * 1000) / compressed.size();
}
};
compression_bench::run_benchmark<"my-compressor", compress_benchmark>(input_data);
compression_bench::print_results();
When a custom metric name is provided, the results will display your custom metric instead of standard MB/s throughput.
With vcpkg + CMake (recommended):
# Configure with vcpkg toolchain
cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake -DCMAKE_BUILD_TYPE=Release
# Build
cmake --build build --config Release
# Run
./build/your_benchmark # Linux/macOS
.\build\Release\your_benchmark.exe # Windows
Manual CMake build:
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
./build/your_benchmark
For CUDA benchmarks, ensure CUDA is enabled:
cmake -B build -S . \
-DCMAKE_TOOLCHAIN_FILE=[path-to-vcpkg]/scripts/buildsystems/vcpkg.cmake \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES=86 # Adjust for your GPU architecture
cmake --build build --config Release
Useful CMake options:
- -DCMAKE_BUILD_TYPE=Release - Build an optimized release version
- -DCMAKE_CUDA_ARCHITECTURES=86 - Target a specific CUDA compute capability (e.g., 86 for RTX 30xx-series, 89 for RTX 40xx-series)
- -DCMAKE_CXX_COMPILER=clang++ - Specify the C++ compiler
- -DCMAKE_CUDA_COMPILER=nvcc - Specify the CUDA compiler
Project structure:
my-benchmark/
├── CMakeLists.txt
├── vcpkg.json
├── main.cpp
└── benchmarks/
├── cpu_benchmark.hpp
└── gpu_benchmark.cuh
CMakeLists.txt:
cmake_minimum_required(VERSION 3.20)
project(MyBenchmark LANGUAGES CXX CUDA)
# C++23 required
set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
# C++20 standard for CUDA sources (GPU benchmarks)
set(CMAKE_CUDA_STANDARD 20)
set(CMAKE_CUDA_STANDARD_REQUIRED ON)
# Find rtc-benchmarksuite
find_package(rtc-benchmarksuite CONFIG REQUIRED)
# Create executable
add_executable(my_benchmark
main.cpp
benchmarks/cpu_benchmark.hpp
benchmarks/gpu_benchmark.cuh
)
# Link rtc-benchmarksuite
target_link_libraries(my_benchmark PRIVATE
rtc-benchmarksuite::rtc-benchmarksuite
)
# CUDA properties
set_target_properties(my_benchmark PROPERTIES
CUDA_SEPARABLE_COMPILATION ON
CUDA_RESOLVE_DEVICE_SYMBOLS ON
)
# Optimization flags
if(MSVC)
target_compile_options(my_benchmark PRIVATE /O2 /arch:AVX2)
else()
target_compile_options(my_benchmark PRIVATE -O3 -march=native)
endif()
vcpkg.json:
{
"name": "my-benchmark",
"version": "1.0.0",
"dependencies": [
"rtc-benchmarksuite"
]
}
Example output from a benchmark run comparing two implementations:
Performance Metrics for: int-to-string-comparisons-1
Metrics for: benchmarksuite::internal::to_chars
Total Iterations to Stabilize : 394
Measured Iterations : 20
Bytes Processed : 512.00
Nanoseconds per Execution : 5785.25
Frequency (GHz) : 4.83
Throughput (MB/s) : 84.58
Throughput Percentage Deviation (+/-%) : 8.36
Cycles per Execution : 27921.20
Cycles per Byte : 54.53
Instructions per Execution : 52026.00
Instructions per Cycle : 1.86
Instructions per Byte : 101.61
Branches per Execution : 361.45
Branch Misses per Execution : 0.73
Cache References per Execution : 97.03
Cache Misses per Execution : 74.68
----------------------------------------
Metrics for: glz::to_chars
Total Iterations to Stabilize : 421
Measured Iterations : 20
Bytes Processed : 512.00
Nanoseconds per Execution : 6480.30
Frequency (GHz) : 4.68
Throughput (MB/s) : 75.95
Throughput Percentage Deviation (+/-%) : 17.58
Cycles per Execution : 30314.40
Cycles per Byte : 59.21
Instructions per Execution : 51513.00
Instructions per Cycle : 1.70
Instructions per Byte : 100.61
Branches per Execution : 438.25
Branch Misses per Execution : 0.73
Cache References per Execution : 95.93
Cache Misses per Execution : 73.59
----------------------------------------
Library benchmarksuite::internal::to_chars, is faster than library: glz::to_chars, by roughly: 11.36%.
This structured output helps you quickly identify which implementation is faster or more efficient.
- CPU Benchmarking: Traditional CPU performance measurement with hardware counters
- GPU/CUDA Benchmarking: Native CUDA kernel benchmarking with grid/block configuration
- Mixed Workloads: Compare CPU vs GPU implementations side-by-side
- Automatic Device Selection: Choose benchmark type via bnch_swt::benchmark_types::cpu or bnch_swt::benchmark_types::cuda
- Cache Clearing: Optional cache eviction between iterations for cold-cache benchmarks
- Custom Metrics: Define custom metric names for specialized benchmarks (e.g., compression ratios, custom throughput units)
- Configurable Iterations: Separate control over warmup iterations and measured iterations
- Programmatic Access: Retrieve raw performance metrics via get_results() for custom analysis
- CPU Properties: Comprehensive CPU detection and properties via benchmarksuite_cpu_properties.hpp
- GPU Properties: CUDA device detection and properties via benchmarksuite_gpu_properties.hpp
- Cross-platform CPU counters: Windows, Linux, macOS, Android, Apple ARM
- CUDA performance events: GPU-specific performance monitoring via counters/cuda_perf_events.hpp
- Cache management: Cross-platform cache clearing utilities
- Aligned constants: Compile-time aligned data structures
- Random generators: High-quality random data generation for benchmarks
As of v1.0.0, all APIs follow the snake_case naming convention:
- Functions: do_not_optimize_away(), generate_random_integers(), print_results() (see the example below)
- Types: size_type, string_literal
- Variables: bytes_processed, test_values
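For example, a short sketch using these snake_case helpers; generate_random_integers follows the count/digit-length call shown in the basic example, while the exact signature of do_not_optimize_away is not spelled out in this guide, so its use here is an assumption to verify against the headers:
// 256 random integers with up to 10 digits each
auto values = generate_random_integers<int64_t>(256, 10);
int64_t checksum = 0;
for (int64_t value: values) {
    checksum += value;
}
// Assumed usage: keeps the compiler from discarding the computed result
do_not_optimize_away(checksum);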
If you're upgrading from an earlier version:
- Update package name: benchmarksuite → rtc-benchmarksuite
- Update include paths: All includes are lowercase (already standard)
- Update API calls: Convert camelCase/PascalCase to snake_case
  - doNotOptimizeAway() → do_not_optimize_away()
  - printResults() → print_results()
  - generateRandomIntegers() → generate_random_integers()
- Change benchmark interface: Lambdas are replaced with structs
  // Old (lambda-based)
  benchmark_stage<"test">::run_benchmark<"name">([&] {
      // code here
      return bytes_processed;
  });
  // New (struct-based)
  struct my_benchmark {
      BNCH_SWT_HOST static uint64_t impl(/* params */) {
          // code here
          return bytes_processed;
      }
  };
  benchmark_stage<"test">::run_benchmark<"name", my_benchmark>(/* args */);
- Update template parameters: benchmark_stage now has more options
  // Old (positional parameters)
  benchmark_stage<"test", iterations, measured>
  // New (with defaults and additional options)
  // <name, max_execution_count, measured_iteration_count, benchmark_type, clear_cache, metric_name>
  benchmark_stage<"test", 200, 25, benchmark_types::cpu, false, "">
- New feature - Device types: You can now specify CPU or CUDA benchmarking:
  // CPU (default)
  benchmark_stage<"test", 200, 25, bnch_swt::benchmark_types::cpu>
  // CUDA/GPU
  benchmark_stage<"test", 100, 10, bnch_swt::benchmark_types::cuda>
- New feature - Cache clearing: Enable cache clearing between iterations for CPU benchmarks:
  // Clear cache between each iteration (5th parameter)
  benchmark_stage<"test", 200, 25, benchmark_types::cpu, true>
- New feature - Custom metrics: Specify custom metric names for specialized benchmarks:
  // Use custom metric instead of default throughput (6th parameter)
  benchmark_stage<"compression-test", 200, 25, benchmark_types::cpu, false, "compression-ratio">
Now you're ready to start benchmarking with rtc-benchmarksuite!