
Collaborative Compression for Large-Scale MoE Deployment on Edge

Authors: Yixiao Chen, Yanyue Xie, Ruining Yang, Pu Zhao et al.

arXiv · Hugging Face · DeepSeek · llama.cpp

Introduction

The Mixture-of-Experts (MoE) architecture scales Large Language Models (LLMs) efficiently, increasing model capacity without a proportional increase in computation cost. However, ultra-large MoEs like DeepSeek-V3 still pose challenges for deployment on memory-constrained edge devices.
We introduce a collaborative compression framework that integrates expert pruning, activation adjustment, and mixed-precision quantization, reducing DeepSeek-V3's storage footprint from 1.3 TB to 103 GB while largely preserving accuracy.

Framework Overview


The overall framework architecture of CC-MoE. The middle part shows a block-level schematic of DeepSeek MoE. The left part highlights our Performance-Aware Expert Reduction and Pruning-Aware Activation Adjustment for FFN layers, while the right part illustrates the mixed-precision quantization process applied to the remaining model.

Usage

For each component of this project, we provide detailed usage instructions and examples in the corresponding subfolder README files.
Please refer to those for step-by-step tutorials and implementation details.

  • DeepSeek-V3-Pruning/ — Framework for expert pruning and activation adjustment.
  • moe-quant/ — Mixed-precision quantization with beginner-friendly llama.cpp guidance.
  • benchmark/ — GGUF benchmarking and evaluation methods (see the quick-test sketch below).
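
As a quick sanity check after quantization, a compressed GGUF model can be loaded and prompted with the llama-cpp-python bindings. The sketch below is illustrative only and is not part of this repository's benchmark scripts; the model path and generation settings are assumptions, and it requires pip install llama-cpp-python.

# Minimal GGUF smoke test (illustrative sketch, not the repository's benchmark code).
# Assumes: pip install llama-cpp-python, plus a quantized GGUF file produced by moe-quant.
from llama_cpp import Llama

MODEL_PATH = "models/cc-moe-q4_k_m.gguf"  # hypothetical path; point this at your downloaded file

# n_gpu_layers=0 keeps everything on CPU; raise it if a GPU build of llama.cpp is available.
llm = Llama(model_path=MODEL_PATH, n_ctx=2048, n_gpu_layers=0)

# Run a short completion and print the generated text.
out = llm("Explain mixture-of-experts models in one sentence.", max_tokens=64)
print(out["choices"][0]["text"].strip())

For rigorous accuracy and perplexity numbers, follow the evaluation methods documented in benchmark/ rather than this quick test.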

Download

All of our released models are publicly available on 🤗Hugging Face.
You are welcome to visit our page for more details, or download and test the models directly using our provided scripts.

pip install huggingface_hub hf_transfer  # hf_transfer optional: speeds up downloads
python snapshot_download.py
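
For reference, the script above boils down to a call to huggingface_hub.snapshot_download. The sketch below is a minimal stand-in, not the exact contents of snapshot_download.py; the repo_id, local_dir, and file patterns are placeholders, so check our Hugging Face page for the released model names.

# Minimal stand-in for snapshot_download.py (illustrative; repo_id/local_dir are placeholders).
import os

# Enable hf_transfer (if installed) before huggingface_hub is imported.
os.environ.setdefault("HF_HUB_ENABLE_HF_TRANSFER", "1")

from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="moxin-org/CC-MoE",           # placeholder: use a model repo listed on our Hugging Face page
    local_dir="./models/cc-moe",          # where the files are stored locally
    allow_patterns=["*.gguf", "*.json"],  # optionally restrict the download to GGUF weights and configs
)
print(f"Model files downloaded to: {local_path}")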

Citation

If you find this work helpful, please cite it as:

@article{chen2025collaborative,
  title={Collaborative Compression for Large-Scale MoE Deployment on Edge},
  author={Chen, Yixiao and Xie, Yanyue and Yang, Ruining and Jiang, Wei and Wang, Wei and He, Yong and Chen, Yue and Zhao, Pu and Wang, Yanzhi},
  journal={arXiv preprint arXiv:2509.25689},
  year={2025}
}

Acknowledgements

This repository builds upon the outstanding work of open-source authors and projects, including DeepSeek and llama.cpp. We sincerely thank them for their excellent contributions to the open-source community.