Global-QSGD: Allreduce-Compatible Quantization for Distributed Learning

License: MIT · Paper 2025 · Docker

Global-QSGD is an easy-to-use Python library that accelerates distributed deep learning training through gradient quantization with global information. Our approach significantly reduces communication overhead while preserving training convergence and model accuracy, enabling efficient scaling across multiple nodes. We evaluate it on a range of models, including CNNs, Transformers, and recommendation models. The code is tested on an ASUS ESC N4A-E11 server with 4 NVIDIA A100 GPUs, running Ubuntu 22.04, CUDA 11.6, and PyTorch 1.13.0.

🎯 Key Contributions

  • Global Normalization: Gradient quantization scaled by a norm shared globally across workers
  • Exponential Dithering: Stochastic quantization scheme that preserves convergence (see the sketch below)
  • Hardware-Optimized: Efficient CUDA kernels for exponential encoding/decoding
  • Easy-to-use: Seamless PyTorch DDP integration
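The sketch below is only meant to make these two core ideas concrete. It is a plain-PyTorch simplification, not the library's implementation: Global-QSGD performs these steps in fused CUDA kernels and packs the quantized values into a compact bit format, and the function names here are invented for illustration.

import torch
import torch.distributed as dist

def global_scale(local_grad: torch.Tensor) -> torch.Tensor:
    # Global normalization: all workers agree on a single scale (here, the max of
    # the per-worker infinity norms) via one extra allreduce.
    scale = local_grad.abs().max()
    dist.all_reduce(scale, op=dist.ReduceOp.MAX)
    return scale

def exponential_dithering(grad: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Exponential dithering: stochastically round each normalized magnitude onto
    # exponentially spaced levels (1, 1/2, 1/4, ...) so the rounding is unbiased.
    # A real implementation also clips to a finite number of exponent levels.
    x = grad.abs() / scale                                              # magnitudes in [0, 1]
    nonzero = x > 0
    exp = torch.zeros_like(x)
    exp[nonzero] = torch.floor(torch.log2(x[nonzero]))
    low = torch.where(nonzero, 2.0 ** exp, torch.zeros_like(x))        # level just below x
    prob_up = torch.where(nonzero, (x - low) / low, torch.zeros_like(x))
    levels = torch.where(torch.rand_like(x) < prob_up, 2.0 * low, low)
    return torch.sign(grad) * levels * scale

Because the scale is identical on every worker, the quantized gradients from different workers lie on the same grid, which is what makes aggregation with a plain Allreduce possible.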

📋 Requirements

  • Python: 3.8+
  • PyTorch: 1.13.0+
  • CUDA: 11.6+
  • Hardware: Tested on NVIDIA A100 GPUs
  • OS: Ubuntu 22.04 (recommended)
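To confirm that an existing environment matches the requirements above, a quick check from a Python shell:

# Quick environment check
import torch
print("PyTorch:", torch.__version__)          # expect 1.13.0 or newer
print("CUDA runtime:", torch.version.cuda)    # expect 11.6 or newer
print("GPU visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))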

🐳 Quick Start with Docker (Recommended)

The fastest way to get started is using our pre-configured Docker environment:

# Pull the official Docker image
docker pull messagebuffer/global-qsgd:latest

# Run with GPU support
docker run --ipc=host --net=host --gpus=all \
           --ulimit memlock=-1:-1 \
           --name GlobalQSGD \
           -it messagebuffer/global-qsgd:latest bash

🔧 Installation from Source

Option 1: Quick Installation

cd ~
git clone [email protected]:sands-lab/global-qsgd.git
cd global-qsgd
python3 setup.py install

Option 2: Development Installation

cd ~
rm -rf global-qsgd   # remove any previous clone
git clone [email protected]:sands-lab/global-qsgd.git
cd global-qsgd
pip3 install -e .

Verify Installation

# Installation Check
python3
>>> import torch
>>> import gqsgd
>>> from gqsgd.ddphook import *
>>> from gqsgd import lgreco_hook, powerSGD_hook

# Run simple test for distributed communication
python3 test/testddp.py
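test/testddp.py is the repository's own smoke test. If you prefer a self-contained check that collective communication works on your machine, the following hedged sketch (not the repository's script) runs a plain allreduce across ranks:

# allreduce_check.py -- hypothetical standalone smoke test, not test/testddp.py
# Launch with: torchrun --nproc_per_node=2 allreduce_check.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    # Every rank contributes a tensor of ones; after the allreduce each rank
    # should hold the world size in every entry.
    x = torch.ones(4, device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    assert torch.allclose(x, torch.full_like(x, float(dist.get_world_size())))
    if dist.get_rank() == 0:
        print("allreduce OK, world size =", dist.get_world_size())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()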

💡 Usage

Basic Integration

Global-QSGD seamlessly integrates with PyTorch's DistributedDataParallel (DDP):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from gqsgd.ddphook import standard_dithering_hook, exponential_dithering_hook

# Initialize your model (assumes the process group is already initialized, e.g. via torchrun)
model = YourModel().to(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])

# Register Global-QSGD communication hook
ddp_model.register_comm_hook(None, exponential_dithering_hook)

# Training proceeds normally
for batch in dataloader:
    optimizer.zero_grad()
    output = ddp_model(batch)
    loss = criterion(output, target)
    loss.backward()  # Gradients are automatically quantized
    optimizer.step()
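Each hook above follows PyTorch's standard DDP communication-hook interface: it receives a gradient bucket, communicates it (in Global-QSGD's case, in quantized form), and returns a future that resolves to the averaged gradient, which DDP copies back into the bucket. For orientation, here is a minimal, hedged illustration of that interface using a plain, uncompressed allreduce; it is not one of the gqsgd hooks:

import torch
import torch.distributed as dist

def plain_allreduce_hook(state, bucket: dist.GradBucket) -> torch.futures.Future[torch.Tensor]:
    # The bucket's flat buffer holds the gradients of several parameters.
    tensor = bucket.buffer()
    world_size = dist.get_world_size()
    fut = dist.all_reduce(tensor, op=dist.ReduceOp.SUM, async_op=True).get_future()
    # Return a Future resolving to the averaged gradient.
    return fut.then(lambda f: f.value()[0] / world_size)

# Registered exactly like the gqsgd hooks:
# ddp_model.register_comm_hook(None, plain_allreduce_hook)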

🚀 Supported Quantization Methods

| Method | Description | Hook Name | Note |
|--------|-------------|-----------|------|
| Global-QSGD Standard Dithering | Linear quantization with global norm | standard_dithering_hook | Best speed-up |
| Global-QSGD Exponential Dithering | Exponential quantization with global norm | exponential_dithering_hook | Best convergence |
| THC | Quantization with global norm | thc_hook | Baseline for Allreduce-compatible quantization |
| PowerSGD | Low-rank matrix approximation | powerSGD_hook | Baseline for Allreduce-compatible decomposition |
| QSGD | Quantized SGD with stochastic rounding | qsgd_hook | Baseline for Allgather-based quantization |
| Default (No Quantization) | Standard DDP communication | default_hook | Baseline, no quantization |
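Switching between the methods in the table only requires registering a different hook. A hedged sketch of config-driven selection (the import path is confirmed only for the two dithering hooks; check gqsgd.ddphook and the gqsgd package for where the baseline hooks are exposed):

from gqsgd.ddphook import standard_dithering_hook, exponential_dithering_hook

# Hypothetical name-to-hook mapping; extend with thc_hook / qsgd_hook / default_hook
# once their import locations are confirmed.
HOOKS = {
    "standard": standard_dithering_hook,        # best speed-up
    "exponential": exponential_dithering_hook,  # best convergence
}

# ddp_model is the DistributedDataParallel model from the Usage section above.
ddp_model.register_comm_hook(None, HOOKS["exponential"])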

🧪 Experimental Validation

Our framework has been extensively validated across three diverse domains; each experiment compares all supported quantization methods and reports detailed performance metrics.

Recommendation Systems: DeepLight on Criteo

cd /root/global-qsgd/models/DeepLight  
bash ./launch.sh


Natural Language Processing: TransformerXL on WikiText-103

# Execute from host: Copy data inside docker
# The dataset can be obtained from https://www.kaggle.com/datasets/dekomposition/wikitext103
# Rename the files under wikitext-103 to test.txt, train.txt, and valid.txt
docker cp <path to wikitext> GlobalQSGD:/root/global-qsgd/models/TransformerXL/pytorch
# Execute inside docker
cd /root/global-qsgd/models/TransformerXL/pytorch
bash ./launch.sh

Computer Vision: ResNet101 on ImageNet

# Execute from host: Copy data inside docker
# The dataset can be obtained from https://www.kaggle.com/datasets/zcyzhchyu/mini-imagenet
# The miniimagenet directory should contain the train and val folders
# Inside the train and val folders are many subfolders containing JPEG images
docker cp <path to miniimagenet> GlobalQSGD:/root/miniimagenet
# Execute inside docker
cd /root/global-qsgd/models/ResNet101
bash ./launch.sh

πŸ—οΈ Architecture Overview

Global-QSGD/
├── gqsgd/                    # Core quantization library
│   ├── ddphook.py           # PyTorch DDP integration hooks
│   ├── allreduce.py         # Distributed communication primitives
│   ├── powerSGD_hook.py     # PowerSGD implementation
│   └── lgreco_hook.py       # LGreco adaptive compression
├── models/                   # Experimental validation
│   ├── ResNet101/           # Computer vision experiments
│   ├── TransformerXL/       # NLP experiments
│   └── DeepLight/           # Recommendation system experiments
├── gqsgd_cuda.cu            # CUDA kernels for quantization
└── setup.py                 # Package installation

📄 Citation

If you use Global-QSGD in your research, please cite our ECAI 2025 paper:

@inproceedings{global-qsgd-ecai2025,
  title={Global-QSGD: Allreduce-Compatible Quantization for Distributed Learning with Theoretical Guarantees},
  author={Jihao Xin and Marco Canini and Peter Richtárik and Samuel Horváth},
  booktitle={Proceedings of the European Conference on Artificial Intelligence (ECAI)},
  year={2025},
  publisher={IOS Press}
}

Made with ❤️ by the Global-QSGD Team from KAUST & MBZUAI
