This repository contains tools for setting up SGLang with CUDA support and evaluating various language models using a standardized benchmarking approach.
## Repository Structure

```
.
├── README.md
├── Sglang_instruct_eval.ipynb   # Main benchmarking notebook
├── Sample Generation/           # Contains model evaluation scores
└── instruct Benchmark.txt       # Detailed benchmark outputs
```
## Prerequisites

- NVIDIA GPU with CUDA support
- Python 3.10 or later
- Linux operating system
- Sufficient disk space (~5 GB for CUDA, ~2 GB for PyTorch, ~1 GB for SGLang)
## Installation

```bash
# Download CUDA 12.1
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda_12.1.0_530.30.02_linux.run

# Install CUDA (without the driver)
sudo sh cuda_12.1.0_530.30.02_linux.run

# Add CUDA to PATH and LD_LIBRARY_PATH
export PATH=/usr/local/cuda-12.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

# Make the PATH changes permanent
echo 'export PATH=/usr/local/cuda-12.1/bin${PATH:+:${PATH}}' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' >> ~/.bashrc
source ~/.bashrc

# Install PyTorch with CUDA 12.1 support
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install SGLang with all dependencies
pip install "sglang[all]"

# Install FlashInfer (optional, for better performance)
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4/
```
## Benchmarking

The repository includes a comprehensive benchmarking system that evaluates language models across 20 diverse questions. The evaluation process includes:

- Automated model response generation
- Response quality assessment
- Score aggregation and analysis
### Running the Benchmark

1. Open `Sglang_instruct_eval.ipynb` in Jupyter Notebook/Lab.
2. Follow the notebook cells to:
   - Load and initialize models
   - Generate responses to benchmark questions
   - Save results in JSON format
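As a rough sketch of that flow, a minimal generation loop might look like the following (the question list, sampling parameters, and output file name are illustrative placeholders, not the notebook's exact code):

```python
import json

import sglang as sgl

# Load the model into an offline SGLang engine
llm = sgl.Engine(model_path="NousResearch/Hermes-3-Llama-3.2-3B")

# Hypothetical stand-ins for the repository's 20 benchmark questions
questions = [
    "Explain the greenhouse effect in simple terms.",
    "Write a short poem about the sea.",
]

sampling_params = {"temperature": 0.7, "max_new_tokens": 512}
outputs = llm.generate(questions, sampling_params)

# Save question/response pairs as JSON for later scoring
results = [{"question": q, "response": out["text"]} for q, out in zip(questions, outputs)]
with open("results.json", "w") as f:
    json.dump(results, f, indent=2)

# Release GPU resources
llm.shutdown()
```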
### Evaluation Methods

The framework supports two evaluation approaches:
1. **Automated Evaluation**
   - Uses larger models (ChatGPT/Claude) as evaluators
   - Systematic scoring based on predefined criteria (see the sketch after this list)
   - Evaluation prompt:

     > Task: You are an expert evaluator responsible for scoring responses from a smaller LLM across various knowledge domains. Given a question and the LLM's response, assess the response based on accuracy, coherence, depth, and clarity. Provide a score (out of 100) for each category listed below.
     >
     > Evaluation Criteria:
     > - Creative Writing: Assess narrative coherence, expressiveness, and poetic strength.
     > - Science & Technology: Judge accuracy, structure, and ease of understanding.
     > - Social Sciences & Humanities: Evaluate insightfulness and depth of analysis.
     > - Business & Economics: Check for relevance, practical insights, and clarity.
     > - Health & Well-being: Assess informativeness, clarity, and explanation depth.
     > - Environment & Climate: Evaluate comprehensiveness, structure, and impact assessment.
2. **Manual Evaluation**
   - Direct **human assessment** of responses
   - Qualitative and quantitative scoring
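As a minimal sketch of the automated path, the evaluator can be any chat API; the snippet below uses the OpenAI Python client with a placeholder model name, and `EVAL_PROMPT` stands in for the full prompt above (both are assumptions, not the repository's exact wiring):

```python
from openai import OpenAI

# The evaluation prompt shown above, abbreviated here for space
EVAL_PROMPT = """Task: You are an expert evaluator responsible for scoring
responses from a smaller LLM across various knowledge domains. ..."""

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def score_response(question: str, response: str) -> str:
    """Ask a larger model to score one benchmark response."""
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable evaluator model works
        messages=[
            {"role": "system", "content": EVAL_PROMPT},
            {"role": "user", "content": f"Question: {question}\n\nResponse: {response}"},
        ],
    )
    return completion.choices[0].message.content
```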
### Output Format
Results are stored in **two locations**:

- `Sample Generation/` – contains individual model scores
- `instruct Benchmark.txt` – detailed benchmark outputs
## Verification Steps
```bash
# Verify CUDA installation
nvcc --version
# Verify PyTorch CUDA support
python3 -c "import torch; print('CUDA available:', torch.cuda.is_available())"
# Test SGLang
python3 -c "import sglang as sgl"
import sglang as sgl
import asyncio
# Initialize the engine with a model
llm = sgl.Engine(model_path="NousResearch/Hermes-3-Llama-3.2-3B")
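Continuing from the snippet above, a quick smoke test is to generate a single completion; the prompt and sampling parameters here are illustrative:

```python
# Generate one completion and print it
out = llm.generate(["What is SGLang?"], {"temperature": 0.7, "max_new_tokens": 64})[0]
print(out["text"])

# Release GPU resources when done
llm.shutdown()
```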
## Notes

- Make sure you have the appropriate NVIDIA drivers installed before installing CUDA.
- The CUDA installer will warn about driver installation; you can ignore this if you already have compatible drivers.
- If you encounter permission issues, you may need to use `sudo` for some commands.
- For production use, consider using a virtual environment.
- Benchmark results may vary based on model versions and evaluation criteria.
## Troubleshooting

If you encounter any issues:

- Verify your NVIDIA drivers are properly installed
- Check that CUDA paths are correctly set in your environment
- Ensure your Python version is compatible
- Make sure you have sufficient disk space
- Check the JSON output format if evaluation results aren't being saved properly (a quick check is shown below)
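For that last point, one quick way to confirm a results file parses as valid JSON (the file name is hypothetical):

```bash
# Exits silently on valid JSON; prints a traceback if the file is malformed
python3 -c "import json; json.load(open('results.json'))"
```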
## Contributing

Feel free to contribute by:

- Adding new benchmark questions
- Implementing additional evaluation metrics
- Testing with different models
- Improving documentation
## Resources

- NVIDIA CUDA Toolkit
- PyTorch Installation
- SGLang Documentation
- FlashInfer
## License

This setup guide and benchmarking framework is provided under the Apache 2.0 License.