
Collaborative Compression for Large-Scale MoE Deployment on Edge

Authors: Yixiao Chen, Yanyue Xie, Ruining Yang, Pu Zhao et al.

arXiv · Hugging Face · DeepSeek · llama.cpp

Introduction

The Mixture-of-Experts (MoE) architecture scales Large Language Models (LLMs) efficiently, increasing model capacity without a proportional increase in computation cost. However, ultra-large MoEs like DeepSeek-V3 still pose challenges for deployment on memory-constrained edge devices.
We introduce a collaborative compression framework that integrates expert pruning, activation adjustment, and mixed-precision quantization, reducing DeepSeek-V3's storage footprint from 1.3 TB to 103 GB while largely preserving accuracy.

Framework Overview


The overall framework architecture of CC-MoE. The middle part shows a block-level schematic of DeepSeek MoE. The left part highlights our Performance-Aware Expert Reduction and Pruning-Aware Activation Adjustment for FFN layers, while the right part illustrates the mixed-precision quantization process applied to the remaining model.

Usage

For each component of this project, we provide detailed usage instructions and examples in the corresponding subfolder README files.
Please refer to those for step-by-step tutorials and implementation details.

  • DeepSeek-V3-Pruning/ — Framework for expert pruning and activation adjustment.
  • moe-quant/ — Mixed-precision quantization with beginner-friendly llama.cpp guidance.
  • benchmark/ — GGUF benchmarking and evaluation methods (see the quick-test sketch below).
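
As a quick sanity check after quantization, a compressed GGUF model can be loaded and prompted with the llama-cpp-python bindings. The sketch below is illustrative only and is not part of this repository's benchmark scripts; the model path and generation settings are assumptions, and it requires pip install llama-cpp-python.

# Minimal GGUF smoke test (illustrative sketch, not the repository's benchmark code).
# Assumes: pip install llama-cpp-python, plus a quantized GGUF file produced by moe-quant.
from llama_cpp import Llama

MODEL_PATH = "models/cc-moe-q4_k_m.gguf"  # hypothetical path; point this at your downloaded file

# n_gpu_layers=0 keeps everything on CPU; raise it if a GPU build of llama.cpp is available.
llm = Llama(model_path=MODEL_PATH, n_ctx=2048, n_gpu_layers=0)

# Run a short completion and print the generated text.
out = llm("Explain mixture-of-experts models in one sentence.", max_tokens=64)
print(out["choices"][0]["text"].strip())

For rigorous accuracy and perplexity numbers, follow the evaluation methods documented in benchmark/ rather than this quick test.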

Download

All of our released models are publicly available on 🤗Hugging Face.
You are welcome to visit our page for more details, or download and test the models directly using our provided scripts.

pip install huggingface_hub hf_transfer  # hf_transfer optional: speeds up downloads
python snapshot_download.py
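
For reference, the script above boils down to a call to huggingface_hub.snapshot_download. The sketch below is a minimal stand-in, not the exact contents of snapshot_download.py; the repo_id, local_dir, and file patterns are placeholders, so check our Hugging Face page for the released model names.

# Minimal stand-in for snapshot_download.py (illustrative; repo_id/local_dir are placeholders).
import os

# Enable hf_transfer (if installed) before huggingface_hub is imported.
os.environ.setdefault("HF_HUB_ENABLE_HF_TRANSFER", "1")

from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="moxin-org/CC-MoE",           # placeholder: use a model repo listed on our Hugging Face page
    local_dir="./models/cc-moe",          # where the files are stored locally
    allow_patterns=["*.gguf", "*.json"],  # optionally restrict the download to GGUF weights and configs
)
print(f"Model files downloaded to: {local_path}")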

Citation

If you find this work helpful, please cite it as:

@article{chen2025collaborative,
  title={Collaborative Compression for Large-Scale MoE Deployment on Edge},
  author={Chen, Yixiao and Xie, Yanyue and Yang, Ruining and Jiang, Wei and Wang, Wei and He, Yong and Chen, Yue and Zhao, Pu and Wang, Yanzhi},
  journal={arXiv preprint arXiv:2509.25689},
  year={2025}
}

Acknowledgements

This repository builds upon the outstanding work of open-source authors and projects, including DeepSeek and llama.cpp. We sincerely thank them for their excellent contributions to the open-source community.