Spotlight Attention: Extreme KV Cache Pruning for LLM Generation



Table of Contents

  1. Installation
  2. Model Weights
  3. Evaluation
  4. Training

🚀 Installation

Clone the necessary repositories and install dependencies:

# Clone repositories
git clone https://anonymous.4open.science/r/spotlight       # Training and evaluation
git clone https://anonymous.4open.science/r/lm-corpus-FAB7     # Training corpus
git clone https://anonymous.4open.science/r/lm-profiler-A550   # Latency testing tool

# Install dependencies
cd spotlight
pip install -r requirements.txt
pip install -e .

cd ../lm-corpus-FAB7
pip install -e .

cd ../lm-profiler-A550
pip install -e .
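
As a quick sanity check after installation, you can try importing the package; the import name spotlight is an assumption based on the repository layout:

python -c "import spotlight; print(spotlight.__file__)"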

Optional: CUDA Kernel Installation

For enhanced performance, install the CUDA kernel:

cd spotlight/spotlight/kernel
bash install.sh

Upon successful compilation, two .so files will be added to the kernel directory.
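
To verify, you can list the directory once the script finishes; this only assumes the build drops the shared objects next to install.sh, as stated above:

ls *.so   # run from spotlight/spotlight/kernel; should show the two compiled extensions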


💾 Model Weights

Pre-trained model checkpoints are available for download:

| Model | Checkpoint |
| --- | --- |
| LLaMA3-8B | llama3-8b-spotlight.pth |
| LLaMA3-8B (C4) | llama3-8b-spotlight-c4.pth |
| LLaMA3-8B (Code) | llama3-8b-spotlight-code.pth |
| Qwen2.5-1.5B | qwen2.5-1.5b-spotlight.pth |
| Qwen2.5-3B | qwen2.5-3b-spotlight.pth |
| Qwen2.5-7B | qwen2.5-7b-spotlight.pth |
| Qwen2.5-14B | qwen2.5-14b-spotlight.pth |

Note: If you are unable to download these files from the anonymous GitHub page, clone the repository locally and use Git LFS to fetch them. Each file is small; the largest is only a few dozen megabytes.
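
A possible way to fetch the weights locally, assuming the checkpoints are tracked with Git LFS in the spotlight repository cloned during installation:

cd spotlight
git lfs install
git lfs pull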


📊 Evaluation

IoU

Evaluate the Intersection over Union (IoU) metric:

  1. Run the test script:

    bash scripts/test_iou.sh
  2. By default, the script evaluates the training-free linear hashing version. To evaluate a trained model, update the load_ckp key in the relevant JSON configuration file (e.g., test_iou/llama2-7b-linearhashing.json) to point to the desired checkpoint from the Model Weights section.
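
For example, a hypothetical one-line edit with jq, assuming the configuration stores the path under a top-level load_ckp key (editing the file by hand works just as well; replace the placeholder path with your downloaded checkpoint):

jq '.load_ckp = "/path/to/checkpoint.pth"' test_iou/llama2-7b-linearhashing.json > tmp.json \
  && mv tmp.json test_iou/llama2-7b-linearhashing.json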

Perplexity

  1. Prepare Datasets:

    Download the required datasets (proof-pile and CodeParrot).

    Set environment variables:

    export SPOTLIGHT_PROOFPILE_PATH=/path/to/proof-pile.json
    export SPOTLIGHT_CODEPARROT_PATH=/path/to/codeparrot.json

    Replace /path/to/ with the actual file paths.

  2. Run Evaluation:

    bash scripts/test_ppl.sh

Few-Shot Learning

All required datasets are automatically downloaded during evaluation. Ensure lm-eval-harness version 0.3.0 is installed, then run:

bash scripts/test_lmeval.sh
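
One way to pin the harness version, assuming you install from PyPI, where the harness is published as lm-eval (installing the v0.3.0 tag from the EleutherAI GitHub repository also works):

pip install lm-eval==0.3.0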

LongBench

  1. Prepare Datasets:

    Download data.zip and place it in the LongBench/ directory as LongBench/data.zip.

  2. Run Evaluation:

    bash scripts/test_longbench.sh
  3. Evaluation Results:

    Below are the evaluation logs for various models and configurations:

    LLaMA2-7B

    | Method | Config | Eval Log |
    | --- | --- | --- |
    | Original | N/A | llama2-7b |
    | +Quest | 1024 Budget | llama2-7b-quest-1024 |
    | +Quest | 128 Budget | llama2-7b-quest-128 |
    | +MagicPIG | Default | llama2-7b-magicpig |
    | +Spotlight | 90% Pruned | llama2-7b-spotlight-90 |
    | +Spotlight | 98% Pruned | llama2-7b-spotlight-98 |

    LLaMA2-7B-Chat

    | Method | Config | Eval Log |
    | --- | --- | --- |
    | Original | N/A | llama2-7b-chat |
    | +Quest | 1024 Budget | llama2-7b-chat-quest-1024 |
    | +Quest | 128 Budget | llama2-7b-chat-quest-128 |
    | +MagicPIG | Default | llama2-7b-chat-magicpig |
    | +Spotlight | 90% Pruned | llama2-7b-chat-spotlight-90 |
    | +Spotlight | 98% Pruned | llama2-7b-chat-spotlight-98 |

    LLaMA3-8B

    | Method | Config | Eval Log |
    | --- | --- | --- |
    | Original | N/A | llama3-8b |
    | +Quest | 1024 Budget | llama3-8b-quest-1024 |
    | +Quest | 256 Budget | llama3-8b-quest-256 |
    | +MagicPIG | Default | llama3-8b-magicpig |
    | +Spotlight | 90% Pruned | llama3-8b-spotlight-90 |
    | +Spotlight | 98% Pruned | llama3-8b-spotlight-98 |

    Note: The links above point to summaries of the results (if a link does not open properly, click the "view raw" button in the top-right corner). The detailed responses generated by each model are stored under test_longbench/log, which contains hundreds of JSON files recording each model's output on each sub-dataset.

QA Response Fidelity

To evaluate response fidelity, obtain LongBench output files by either:

  • Running the LongBench test scripts to generate output files.
  • Using the output files provided in the LongBench section.

For example, to compare output similarity between LLaMA3-8B with and without Spotlight Attention:

python test_longbench/test_sim.py test_longbench/log/llama3-8b.json test_longbench/log/llama3-8b-spotlight.json
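
Similarly, to compare the original LLaMA3-8B outputs against the 90%-pruned Spotlight run listed in the LongBench tables, something like the following should work (the exact log filenames are an assumption; check test_longbench/log for the actual names):

python test_longbench/test_sim.py test_longbench/log/llama3-8b.json test_longbench/log/llama3-8b-spotlight-90.json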

🔥 Training

1. Create Directories

cd spotlight
mkdir -p data/slimpajama ckp

2. Download Datasets

Download the SlimPajama training data and place it in the data/slimpajama directory.

3. Training Execution

Choose a training method based on available disk space:

  • Sufficient Disk Space:

    1. Edit train.sh to include the --prepare_data argument and run the script.
    2. After completion, replace --prepare_data with --use_prepared_data and rerun (a sketch of this two-pass workflow appears below).
  • Limited Disk Space (Default): Generate activations on-the-fly by running the train.sh script without modifications.

The trained checkpoint will be saved in the ckp directory and can be referenced in test scripts via the load_ckp key.
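
For the sufficient-disk-space path, the two-pass workflow might look like this (the scripts/ location of train.sh is an assumption based on the other scripts; adjust if it lives elsewhere):

# pass 1: add --prepare_data to the training command inside train.sh, then run it
bash scripts/train.sh
# pass 2: replace --prepare_data with --use_prepared_data in train.sh, then rerun
bash scripts/train.sh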

4. Memory Optimization

Training involves computing a ranking loss, which can be memory-intensive due to the large intermediate tensor $Z$ of shape

$$n_{\text{heads}} \times n_{\text{query}} \times n_{\text{top}} \times (n_{\text{query}} - n_{\text{top}}).$$

To mitigate memory issues, adjust the following parameters (default: 1024):

  • --max_que: Maximum query tokens.
  • --max_top: Maximum top-ranked tokens.
  • --max_oth: Maximum other tokens.

If out-of-memory errors occur, reduce --max_que and --max_oth first.
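
A hypothetical invocation with smaller caps is shown below; whether scripts/train.sh forwards these flags to the trainer is an assumption, so you may need to edit the flags inside the script instead:

# illustrative values; shrink --max_que and --max_oth before reducing --max_top
bash scripts/train.sh --max_que 512 --max_top 1024 --max_oth 512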


This README covers installing, evaluating, and training models with Spotlight Attention. For additional details or support, refer to the linked repositories and datasets.
