BatchLAS is a high-performance library for batched linear algebra operations with support for multiple backends. It provides a unified abstraction layer over vendor-specific libraries while preserving their native performance.
- Unified API for different hardware backends
- Batched matrix operations
- Support for dense and sparse matrices
- SYCL interoperability for cross-platform performance
- **Basic BLAS operations**
  - Matrix-matrix multiplication (gemm)
  - Matrix-vector multiplication (gemv)
  - Triangular solve (trsm)
- **LAPACK operations** (see the call-shape sketch after this feature list)
  - Cholesky factorization (potrf)
  - LU factorization with partial pivoting (getrf)
  - Solution of linear systems using LU factorization (getrs)
  - Matrix inversion (getri, inv)
  - QR factorization (geqrf)
  - Generation of the orthogonal matrix from a QR factorization (orgqr)
  - Multiplication by the orthogonal matrix from a QR factorization (ormqr)
  - Symmetric eigenvalue decomposition (syev)
- **Matrix orthogonalization with multiple algorithms**
  - Cholesky-based methods (Chol2, Cholesky, ShiftChol3)
  - Classical Gram-Schmidt with reorthogonalization (CGS2)
  - Householder QR-based orthogonalization
  - SVQB (SVD-based orthogonalization)
- **Utility operations**
  - Matrix norms (Frobenius, 1-norm, infinity-norm)
  - Condition number computation (cond)
  - Matrix transpose
  - Matrix creation utilities (Identity, Random, Zeros, Ones, Diagonal, Triangular, TriDiagToeplitz)
- **Basic sparse operations**
  - Sparse matrix-dense matrix multiplication (spmm)
  - Support for the CSR (Compressed Sparse Row) format
  - Format conversion between dense and sparse
- **Sparse eigensolvers**
  - Batched LOBPCG for partial eigendecomposition (syevx)
    - Finds the largest or smallest eigenpairs
    - Supports both sparse and dense matrices
    - Configurable orthogonalization algorithms and tolerances
  - Batched Lanczos algorithm for full eigendecomposition (lanczos)
    - Supports sparse and dense matrices
    - Configurable orthogonalization and sorting options
    - Specialized tridiagonal eigensolvers for Lanczos
- **Batched operations**: All operations support processing multiple matrices simultaneously
- **Multiple data formats**: Dense and CSR sparse matrix support
- **Memory management**: Unified memory vectors and spans for cross-platform compatibility
- **Backend abstraction**: Automatic backend selection or manual specification
- **SYCL integration**: Full SYCL interoperability for cross-platform GPU computing
- **Python bindings**: Complete Python interface with NumPy integration
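
Most routines above share a single call shape: a queue/context first, then the matrices, then operation-specific arguments. As a rough illustration, here is a sketch of a batched Cholesky factorization; the `potrf` signature and the `Uplo` enum are assumptions extrapolated from the `gemm` example in the Quick Start below, not confirmed API:

```cpp
#include <batchlas.hh>

using namespace batchlas;

int main() {
    auto ctx = Queue(Device::default_device());
    const int n = 512;
    const int batch_size = 10;

    // A batch of symmetric positive definite tridiagonal Toeplitz matrices
    // (2 on the diagonal, -1 on the off-diagonals).
    auto A = Matrix<float>::TriDiagToeplitz(n, 2.0f, -1.0f, -1.0f, batch_size);

    // Hypothetical call shape: factor each batch item in place.
    potrf<Backend::AUTO>(ctx, A, Uplo::Lower);

    ctx.wait();
    return 0;
}
```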
Supported backends:

- NVIDIA CUDA (cuBLAS, cuSOLVER, cuSPARSE)
- AMD ROCm (rocBLAS, rocSOLVER, rocSPARSE)
- CPU (CBLAS, LAPACKE)
Requirements:

- C++17 compatible compiler
- CMake 3.14 or higher
- SYCL implementation (Intel oneAPI DPC++)
- Optional: CUDA toolkit for NVIDIA GPUs
- Optional: ROCm for AMD GPUs
- Optional: Netlib BLAS/LAPACK for CPU
- Optional: Intel oneMKL for optimized CPU backend (currently experimental)
  - For oneMKL support, set the `MKLROOT` environment variable to your oneAPI installation
- Optional: Python 3.x (for Python bindings)
Build and install with CMake:

```bash
git clone https://github.com/yourusername/BatchLAS.git
cd BatchLAS
mkdir build && cd build
cmake ..
make -j$(nproc)
make install
```
BatchLAS can be configured with various options:
```bash
cmake .. \
  -DBATCHLAS_BUILD_TESTS=ON \
  -DBATCHLAS_BUILD_EXAMPLES=ON \
  -DBATCHLAS_ENABLE_CUDA=ON \
  -DBATCHLAS_ENABLE_ROCM=ON \
  -DBATCHLAS_ENABLE_OPENMP=ON \
  -DBATCHLAS_BUILD_PYTHON=ON
```
Available options:
- `BATCHLAS_BUILD_TESTS`: Build test suite (default: ON)
- `BATCHLAS_BUILD_EXAMPLES`: Build examples (default: OFF)
- `BATCHLAS_BUILD_DOCS`: Build documentation (default: OFF)
- `BATCHLAS_ENABLE_CUDA`: Enable CUDA support (default: OFF)
- `BATCHLAS_ENABLE_ROCM`: Enable ROCm support (default: OFF)
- `BATCHLAS_ENABLE_OPENMP`: Enable OpenMP support (default: OFF)
- `BATCHLAS_BUILD_PYTHON`: Build Python bindings (default: ON)
- `BATCHLAS_ENABLE_MKL`: Enable Intel oneMKL backend (default: OFF, experimental)
- `BATCHLAS_AMD_ARCH`: AMD GPU architecture when building the ROCm backend (default: gfx942)
- `BATCHLAS_NVIDIA_ARCH`: NVIDIA GPU architecture when building the CUDA backend (default: sm_50)
Here's a simple example of using BatchLAS for matrix multiplication. For each matrix in the batch, `gemm` computes `C = alpha * op(A) * op(B) + beta * C`; with `alpha = 1.0f` and `beta = 0.0f` below, `C` is simply overwritten with `A * B`:
```cpp
#include <batchlas.hh>

using namespace batchlas;

int main() {
    // Create a context
    auto ctx = Queue(Device::default_device());

    // Define matrix dimensions
    const int rows = 1000;
    const int cols = 1000;
    const int k = 1000;
    const int batch_size = 10;

    // Create matrices using factory methods
    auto A = Matrix<float>::Random(rows, k, batch_size);
    auto B = Matrix<float>::Random(k, cols, batch_size);
    auto C = Matrix<float>::Zeros(rows, cols, batch_size);

    // Initialize data (if needed, Random already initializes)
    // A.fill(1.0f); // Example: fill A with 1.0f
    // B.fill(2.0f); // Example: fill B with 2.0f

    // Perform batched matrix multiplication using views of the matrices
    gemm<Backend::AUTO>(ctx, A, B, C, 1.0f, 0.0f, Transpose::NoTrans, Transpose::NoTrans);

    // Wait for completion
    ctx.wait();

    return 0;
}
```
BatchLAS provides comprehensive matrix creation utilities:
```cpp
// Create various matrix types
auto identity = Matrix<float>::Identity(100, batch_size);          // Identity matrices
auto random   = Matrix<float>::Random(100, 100, false, batch_size); // Random matrices
auto zeros    = Matrix<float>::Zeros(100, 100, batch_size);        // Zero matrices
auto ones     = Matrix<float>::Ones(100, 100, batch_size);         // Matrices filled with ones
auto tridiag  = Matrix<float>::TriDiagToeplitz(100, 2.0f, -1.0f, -1.0f, batch_size); // Tridiagonal matrices

// Create sparse matrices in CSR format
auto sparse_A = Matrix<float, MatrixFormat::CSR>(rows, cols, nnz, batch_size);

// Matrix utilities
auto norms        = norm<float, MatrixFormat::Dense>(ctx, A, NormType::Frobenius);
auto conditions   = cond<Backend::AUTO>(ctx, A, NormType::Frobenius);
auto A_transposed = transpose(ctx, A);
auto A_inverse    = inv<Backend::AUTO>(ctx, A);
```
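
The CSR matrix created above can be fed to `spmm` for sparse-times-dense products. Its exact signature is not shown here, so the following is only a sketch assuming `spmm` mirrors the `gemm` call shape with a sparse left-hand operand:

```cpp
// Hypothetical sketch: spmm assumed to follow the gemm pattern
// (ctx, A, B, C, alpha, beta, transA, transB).
auto B_dense = Matrix<float>::Random(cols, 64, batch_size);
auto C_dense = Matrix<float>::Zeros(rows, 64, batch_size);
spmm<Backend::AUTO>(ctx, sparse_A, B_dense, C_dense, 1.0f, 0.0f,
                    Transpose::NoTrans, Transpose::NoTrans);
```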
BatchLAS provides various orthogonalization algorithms with configurable parameters:
```cpp
// Allocate workspace memory
UnifiedVector<std::byte> workspace(ortho_buffer_size<Backend::AUTO>(
    ctx, matrices, Transpose::NoTrans, OrthoAlgorithm::ShiftChol3));

// Orthogonalize matrices using different algorithms
ortho<Backend::AUTO>(ctx, matrices, Transpose::NoTrans, workspace, OrthoAlgorithm::CGS2);        // Classical Gram-Schmidt
ortho<Backend::AUTO>(ctx, matrices, Transpose::NoTrans, workspace, OrthoAlgorithm::Chol2);       // Cholesky-based
ortho<Backend::AUTO>(ctx, matrices, Transpose::NoTrans, workspace, OrthoAlgorithm::Householder); // QR-based
ortho<Backend::AUTO>(ctx, matrices, Transpose::NoTrans, workspace, OrthoAlgorithm::SVQB);        // SVD-based

// Orthogonalize with respect to an external metric
ortho<Backend::AUTO>(ctx, A, M, Transpose::NoTrans, Transpose::NoTrans, workspace, OrthoAlgorithm::Chol2, 2);
```
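
The calls above assume `matrices`, `A`, and `M` already exist; in the metric variant, `M` acts as the inner-product (Gram) matrix and would normally be symmetric positive definite. A hypothetical setup using the factory methods shown earlier:

```cpp
// Hypothetical setup for the ortho calls above (names and shapes assumed):
const int m = 1024, n = 64, batch_size = 8;
auto matrices = Matrix<float>::Random(m, n, batch_size); // columns to orthonormalize
auto A        = Matrix<float>::Random(m, n, batch_size);
auto M        = Matrix<float>::TriDiagToeplitz(m, 2.0f, -1.0f, -1.0f, batch_size); // SPD metric
```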
For large-scale eigenvalue problems:
```cpp
// LOBPCG for finding specific eigenvalues
SyevxParams<float> lobpcg_params;
lobpcg_params.find_largest     = true;                 // Find largest eigenvalues
lobpcg_params.iterations       = 100;                  // Maximum iterations
lobpcg_params.extra_directions = 10;                   // Extra search directions
lobpcg_params.algorithm        = OrthoAlgorithm::CGS2; // Orthogonalization method

UnifiedVector<std::byte> syevx_workspace(syevx_buffer_size<Backend::AUTO>(
    ctx, sparse_A, eigenvalues, neigs, JobType::EigenVectors, eigenvectors, lobpcg_params));
syevx<Backend::AUTO>(ctx, sparse_A, eigenvalues, neigs, syevx_workspace,
                     JobType::EigenVectors, eigenvectors, lobpcg_params);

// Lanczos for full eigendecomposition
LanczosParams<float> lanczos_params;
lanczos_params.ortho_algorithm = OrthoAlgorithm::CGS2;
lanczos_params.sort_enabled    = true;
lanczos_params.sort_order      = SortOrder::Ascending;

UnifiedVector<std::byte> lanczos_workspace(lanczos_buffer_size<Backend::AUTO>(
    ctx, sparse_A, all_eigenvalues, JobType::EigenVectors, all_eigenvectors, lanczos_params));
lanczos<Backend::AUTO>(ctx, sparse_A, all_eigenvalues, lanczos_workspace,
                       JobType::EigenVectors, all_eigenvectors, lanczos_params);
```
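
The snippets above assume the output buffers were allocated up front; their exact shapes are not documented here, so the following is only a hypothetical sketch:

```cpp
// Hypothetical allocations for the eigensolver calls above (shapes assumed):
const int neigs = 50;                                 // number of eigenpairs requested
UnifiedVector<float> eigenvalues(neigs * batch_size); // eigenvalue block per batch item
auto eigenvectors = Matrix<float>::Zeros(rows, neigs, batch_size);

UnifiedVector<float> all_eigenvalues(rows * batch_size); // full spectrum for Lanczos
auto all_eigenvectors = Matrix<float>::Zeros(rows, rows, batch_size);
```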
Work efficiently with matrix and vector subsets:
```cpp
// Create views into existing data
auto A_view     = A.view(50, 50); // View first 50x50 submatrix
auto batch_item = A[0];           // View single matrix from batch
auto col_vector = VectorView<float>(matrix_data + col_offset, rows, 1, 0, batch_size);

// Access and manipulate individual elements
float value = A_view.at(10, 20, 0); // Element at row 10, col 20, batch 0
A_view.at(5, 5, 0) = 2.5f;          // Set element value
```
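
As the Quick Start comment suggests, views are meant to be passed to the batched routines directly. A sketch, assuming a view is accepted wherever a full matrix is:

```cpp
// Hypothetical: run the batched multiply on 50x50 subviews only.
auto B_view  = B.view(50, 50);
auto C_small = Matrix<float>::Zeros(50, 50, batch_size);
gemm<Backend::AUTO>(ctx, A_view, B_view, C_small, 1.0f, 0.0f,
                    Transpose::NoTrans, Transpose::NoTrans);
```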
To run the test suite:
```bash
cd build
ctest
```
BatchLAS automatically selects the most suitable backend for your hardware, but you can manually specify a backend for optimal performance in specific use cases:
```cpp
// Use the CUDA backend explicitly on NVIDIA hardware
gemm<Backend::CUDA>(ctx, A, B, C, alpha, beta, Transpose::NoTrans, Transpose::NoTrans);

// Use the ROCm backend explicitly on AMD hardware
gemm<Backend::ROCM>(ctx, A, B, C, alpha, beta, Transpose::NoTrans, Transpose::NoTrans);
```
Benchmark executables are built in the `benchmarks` directory. Each benchmark registers a default set of input sizes, but you can override these at runtime by providing custom sizes on the command line. Arguments may be integers, comma-separated lists, or `start:end:num` ranges. When custom sizes are supplied, they replace the registered ones for all benchmarks. You can further limit execution to specific backends or floating-point types using the `--backend` and `--type` options.
Examples:

```bash
./gemm_benchmark 512 512 128 10
./gemm_benchmark 64:256:4 64:256:4 64:256:4 1,2,4
./gemm_benchmark --backend=CUDA --type=float 256 256 64 8
./ortho_benchmark --backend=ROCM --type=double 1024 512 4
./syevx_benchmark --backend=AUTO 2048 2048 50 2
```
Available benchmarks include:
- `gemm_benchmark`: Dense matrix multiplication
- `gemv_benchmark`: Matrix-vector multiplication
- `ortho_benchmark`: Orthogonalization algorithms
- `syevx_benchmark`: Sparse eigenvalue solvers
- `lanczos_benchmark`: Lanczos eigenvalue algorithm
- `spmm_benchmark`: Sparse matrix-dense matrix multiplication
- `trsm_benchmark`: Triangular solve operations
TBD