Authors: Yixiao Chen, Yanyue Xie, Ruining Yang, Pu Zhao et al.
The Mixture of Experts (MoE) architecture enables efficient scaling of Large Language Models (LLMs) by increasing model capacity without a proportional increase in computation cost. However, ultra-large MoE models such as DeepSeek-V3 remain challenging to deploy on memory-constrained edge devices.
We introduce a collaborative compression framework that integrates expert pruning, activation adjustment, and mixed-precision quantization, reducing DeepSeek-V3's storage footprint from 1.3 TB to 103 GB while largely preserving accuracy.
Figure: Overall framework architecture of CC-MoE. The middle part shows a block-level schematic of DeepSeek MoE. The left part highlights our Performance-Aware Expert Reduction and Pruning-Aware Activation Adjustment for FFN layers, while the right part illustrates the mixed-precision quantization process applied to the remaining model.
For each component of this project, we provide detailed usage instructions and examples in the corresponding subfolder README files.
Please refer to those for step-by-step tutorials and implementation details.
- `DeepSeek-V3-Pruning/`: Framework for expert pruning and activation adjustment.
- `moe-quant/`: Mixed-precision quantization with beginner-friendly llama.cpp guidance.
- `benchmark/`: GGUF benchmarking and evaluation methods.
All of our released models are publicly available on 🤗Hugging Face.
You are welcome to visit our page for more details, or download and test the models directly using our provided scripts.
pip install huggingface_hub hf_transfer # hf_transfer optional: speeds up downloads
python snapshot_download.py
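For reference, the sketch below shows roughly what a download script like `snapshot_download.py` can look like, built on `huggingface_hub.snapshot_download`. The repo ID and local directory shown are hypothetical placeholders, not the actual model names; substitute the model ID listed on our Hugging Face page.

```python
# Minimal sketch of a Hugging Face download script.
# The repo ID and local directory are hypothetical placeholders;
# replace them with the actual model ID from our Hugging Face page.
import os

# Enable hf_transfer acceleration (requires the optional hf_transfer package
# installed above); remove this line if hf_transfer is not installed.
os.environ.setdefault("HF_HUB_ENABLE_HF_TRANSFER", "1")

from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="CC-MoE/DeepSeek-V3-CC",        # placeholder repo ID
    local_dir="./models/deepseek-v3-cc",    # where to store the model files
)
print(f"Model downloaded to: {local_path}")
```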
If you find this work helpful, please cite it as:
@article{chen2025collaborative,
title={Collaborative Compression for Large-Scale MoE Deployment on Edge},
author={Chen, Yixiao and Xie, Yanyue and Yang, Ruining and Jiang, Wei and Wang, Wei and He, Yong and Chen, Yue and Zhao, Pu and Wang, Yanzhi},
journal={arXiv preprint arXiv:2509.25689},
year={2025}
}

This repository builds upon the outstanding work of the following open-source authors and projects:
- DeepSeek-V3.
- tflsxyy.
- ggml-org/llama.cpp, unsloth.ai, bartowski.
- ikawrakow/ik_llama.cpp, ikawrakow, ubergarm.
- EleutherAI/lm-evaluation-harness.
We sincerely thank them for their excellent contributions to the open-source community.
