Spotlight Attention: Extreme KV Cache Pruning for LLM Generation



Table of Contents

  1. Installation
  2. Model Weights
  3. Evaluation
  4. Training

🚀 Installation

Clone the necessary repositories and install dependencies:

# Clone repositories
git clone https://anonymous.4open.science/r/spotlight       # Training and evaluation
git clone https://anonymous.4open.science/r/lm-corpus-FAB7     # Training corpus
git clone https://anonymous.4open.science/r/lm-profiler-A550   # Latency testing tool

# Install dependencies
cd spotlight
pip install -r requirements.txt
pip install -e .

cd ../lm-corpus-FAB7
pip install -e .

cd ../lm-profiler-A550
pip install -e .
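
As a quick sanity check after installation, you can try importing the package; the import name spotlight is an assumption based on the repository layout:

python -c "import spotlight; print(spotlight.__file__)"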

Optional: CUDA Kernel Installation

For enhanced performance, install the CUDA kernel:

cd spotlight/spotlight/kernel
bash install.sh

Upon successful compilation, two .so files will be added to the kernel directory.
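
To verify, you can list the directory once the script finishes; this only assumes the build drops the shared objects next to install.sh, as stated above:

ls *.so   # run from spotlight/spotlight/kernel; should show the two compiled extensions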


💾 Model Weights

Pre-trained model checkpoints are available for download:

| Model | Checkpoint |
| --- | --- |
| LLaMA3-8B | llama3-8b-spotlight.pth |
| LLaMA3-8B (C4) | llama3-8b-spotlight-c4.pth |
| LLaMA3-8B (Code) | llama3-8b-spotlight-code.pth |
| Qwen2.5-1.5B | qwen2.5-1.5b-spotlight.pth |
| Qwen2.5-3B | qwen2.5-3b-spotlight.pth |
| Qwen2.5-7B | qwen2.5-7b-spotlight.pth |
| Qwen2.5-14B | qwen2.5-14b-spotlight.pth |

Note: If you are unable to download these files from the anonymous GitHub page, clone the repository locally and use Git LFS to fetch them. Each file is small; the largest is only a few dozen megabytes.
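
A possible way to fetch the weights locally, assuming the checkpoints are tracked with Git LFS in the spotlight repository cloned during installation:

cd spotlight
git lfs install
git lfs pull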


📊 Evaluation

IoU

Evaluate the Intersection over Union (IoU) metric:

  1. Run the test script:

    bash scripts/test_iou.sh
  2. By default, the script evaluates the training-free linear hashing version. To evaluate a trained model, update the load_ckp key in the relevant JSON configuration file (e.g., test_iou/llama2-7b-linearhashing.json) to point to the desired checkpoint from the Model Weights section.
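
For example, a hypothetical one-line edit with jq, assuming the configuration stores the path under a top-level load_ckp key (editing the file by hand works just as well; replace the placeholder path with your downloaded checkpoint):

jq '.load_ckp = "/path/to/checkpoint.pth"' test_iou/llama2-7b-linearhashing.json > tmp.json \
  && mv tmp.json test_iou/llama2-7b-linearhashing.json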

Perplexity

  1. Prepare Datasets:

    Download the required datasets (proof-pile and CodeParrot).

    Set environment variables:

    export SPOTLIGHT_PROOFPILE_PATH=/path/to/proof-pile.json
    export SPOTLIGHT_CODEPARROT_PATH=/path/to/codeparrot.json

    Replace /path/to/ with the actual file paths.

  2. Run Evaluation:

    bash scripts/test_ppl.sh

Few-Shot Learning

All required datasets are automatically downloaded during evaluation. Ensure lm-eval-harness version 0.3.0 is installed, then run:

bash scripts/test_lmeval.sh
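
One way to pin the harness version, assuming you install from PyPI, where the harness is published as lm-eval (installing the v0.3.0 tag from the EleutherAI GitHub repository also works):

pip install lm-eval==0.3.0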

LongBench

  1. Prepare Datasets:

    Download data.zip and place it in the LongBench/ directory as LongBench/data.zip.

  2. Run Evaluation:

    bash scripts/test_longbench.sh
  3. Evaluation Results:

    Below are the evaluation logs for various models and configurations:

    LLaMA2-7B

    | Method | Config | Eval Log |
    | --- | --- | --- |
    | Original | N/A | llama2-7b |
    | +Quest | 1024 Budget | llama2-7b-quest-1024 |
    | +Quest | 128 Budget | llama2-7b-quest-128 |
    | +MagicPIG | Default | llama2-7b-magicpig |
    | +Spotlight | 90% Pruned | llama2-7b-spotlight-90 |
    | +Spotlight | 98% Pruned | llama2-7b-spotlight-98 |

    LLaMA2-7B-Chat

    | Method | Config | Eval Log |
    | --- | --- | --- |
    | Original | N/A | llama2-7b-chat |
    | +Quest | 1024 Budget | llama2-7b-chat-quest-1024 |
    | +Quest | 128 Budget | llama2-7b-chat-quest-128 |
    | +MagicPIG | Default | llama2-7b-chat-magicpig |
    | +Spotlight | 90% Pruned | llama2-7b-chat-spotlight-90 |
    | +Spotlight | 98% Pruned | llama2-7b-chat-spotlight-98 |

    LLaMA3-8B

    | Method | Config | Eval Log |
    | --- | --- | --- |
    | Original | N/A | llama3-8b |
    | +Quest | 1024 Budget | llama3-8b-quest-1024 |
    | +Quest | 256 Budget | llama3-8b-quest-256 |
    | +MagicPIG | Default | llama3-8b-magicpig |
    | +Spotlight | 90% Pruned | llama3-8b-spotlight-90 |
    | +Spotlight | 98% Pruned | llama3-8b-spotlight-98 |

    Note: The links above point to summaries of the results (if a link does not open properly, click the "view raw" button in the top-right corner). The detailed responses generated by each model are stored under test_longbench/log, which contains hundreds of JSON files recording each model's output on each sub-dataset.

QA Response Fidelity

To evaluate response fidelity, obtain LongBench output files by either:

  • Running the LongBench test scripts to generate output files.
  • Using the output files provided in the LongBench section.

For example, to compare output similarity between LLaMA3-8B with and without Spotlight Attention:

python test_longbench/test_sim.py test_longbench/log/llama3-8b.json test_longbench/log/llama3-8b-spotlight.json
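
Similarly, to compare the original LLaMA3-8B outputs against the 90%-pruned Spotlight run listed in the LongBench tables, something like the following should work (the exact log filenames are an assumption; check test_longbench/log for the actual names):

python test_longbench/test_sim.py test_longbench/log/llama3-8b.json test_longbench/log/llama3-8b-spotlight-90.json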

🔥 Training

1. Create Directories

cd spotlight
mkdir -p data/slimpajama ckp

2. Download Datasets

Download the SlimPajama training data and place it in the data/slimpajama directory.

3. Training Execution

Choose a training method based on available disk space:

  • Sufficient Disk Space:

    1. Edit train.sh to include the --prepare_data argument and run the script.
    2. After completion, replace --prepare_data with --use_prepared_data and rerun (a sketch of this two-pass workflow appears below).
  • Limited Disk Space (Default): Generate activations on-the-fly by running the train.sh script without modifications.

The trained checkpoint will be saved in the ckp directory and can be referenced in test scripts via the load_ckp key.
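
For the sufficient-disk-space path, the two-pass workflow might look like this (the scripts/ location of train.sh is an assumption based on the other scripts; adjust if it lives elsewhere):

# pass 1: add --prepare_data to the training command inside train.sh, then run it
bash scripts/train.sh
# pass 2: replace --prepare_data with --use_prepared_data in train.sh, then rerun
bash scripts/train.sh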

4. Memory Optimization

Training involves computing a ranking loss, which can be memory-intensive due to the large intermediate tensor $Z$ of shape

$$n_{\text{heads}} \times n_{\text{query}} \times n_{\text{top}} \times (n_{\text{query}} - n_{\text{top}}).$$

To mitigate memory issues, adjust the following parameters (default: 1024):

  • --max_que: Maximum query tokens.
  • --max_top: Maximum top-ranked tokens.
  • --max_oth: Maximum other tokens.

If out-of-memory errors occur, reduce --max_que and --max_oth first.
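
A hypothetical invocation with smaller caps is shown below; whether scripts/train.sh forwards these flags to the trainer is an assumption, so you may need to edit the flags inside the script instead:

# illustrative values; shrink --max_que and --max_oth before reducing --max_top
bash scripts/train.sh --max_que 512 --max_top 1024 --max_oth 512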


This README covers installing, evaluating, and training models with Spotlight Attention. For additional details or support, refer to the linked repositories and datasets.
