A curated list of resources dedicated to enhancing efficiency in AI systems. This repository covers a wide range of topics essential for optimizing AI models and processes, aiming to make AI faster, cheaper, smaller, and greener!
If you find this list helpful, give it a ⭐ on GitHub, share it, and contribute by submitting a pull request or issue!
- Facts
- Tools
- Articles
- Reports
- Research Articles
- Blogs
- Books
- Lectures
- People
- Organizations
- Contributing
- License

## Facts

- 3-40Wh: Amount of energy consumed by one short to long ChatGPT query (Source, 2025)
- 1L: Estimated amount of water required for 20-100 ChatGPT queries (Source, 2025)
- 2 nuclear plants: Number of nuclear plants that would need to run constantly to generate enough energy for 80M people generating 5 pages per day (Source, 2025)
- 1 smartphone charge: Amount of energy required to generate a couple of images with AI or to run a few thousand LLM inferences (Source, 2024)
- >10s: Time required to generate one HD image with Flux on an H100, or 100 tokens with Llama 3 on a T4 (Source and Source, 2024)
- 7-10 smartphone charges: Amount of energy required to generate one video with Wan 2.1 (Source)
- 61,848x: Ratio between the highest and lowest energy use among AI models in the energy leaderboard (Source, 2025)
- 1,300 MWh: Estimated electricity used to train GPT-3, about as much as 130 US homes consume annually (Source, 2024)
- 800M users/week: Number of weekly ChatGPT users in 2025 (Source)
- 1B messages/day: Number of ChatGPT queries per day in 2025 (Source); see the back-of-envelope sketch after this list
- +160%: Expected increase in data center power consumption by 2030 (Source)
- x3.8: Factor by which hardware acceleration (GPU/TPU) reduces energy consumption compared with CPUs for the same task, while also reducing response time by up to 39% (Source)
- x18: Factor by which the carbon footprint of a task can vary depending on the model, framework, and backend used (Source)
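
For scale, the figures above can be combined into a rough back-of-envelope estimate, as in the sketch below. This is illustrative only: it assumes the 3-40 Wh per-query range and the 1B queries/day figure cited in this list, both rough estimates, plus the ~10 MWh/year average US household consumption implied by the GPT-3 fact.

```python
# Back-of-envelope: daily ChatGPT energy from the per-query estimates above.
# Assumptions (all rough, taken from the facts in this list):
#   3-40 Wh per query, 1e9 queries/day, ~10 MWh/year per average US home.
queries_per_day = 1_000_000_000

for wh_per_query in (3, 40):
    mwh_per_day = queries_per_day * wh_per_query / 1e6  # Wh -> MWh
    homes_equivalent = mwh_per_day / 10  # ~10 MWh/year per US home
    print(f"{wh_per_query} Wh/query -> {mwh_per_day:,.0f} MWh/day "
          f"(one day's use equals ~{homes_equivalent:,.0f} US homes' annual consumption)")
```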

## Tools

- ❤️ Pruna ❤️: A package to make AI models faster, smaller, cheaper, and greener by combining compression methods (incl. quantization, pruning, caching, compilation, distillation...) on various hardware.
- TensorRT: High-performance deep learning inference library for NVIDIA GPUs.
- ONNX: Open Neural Network Exchange format for interoperability among deep learning frameworks.
- Code Carbon: A library to track the energy use and carbon footprint of code on various hardware (see the usage sketch after this list).
- LLM Perf: A framework for benchmarking the performance of Transformers models across different hardware, backends, and optimizations.
- ML.ENERGY Leaderboard: An initiative to benchmark energy efficiency of AI models.
- AI Energy Score: An initiative to establish comparable energy efficiency ratings for AI models, helping the industry make informed decisions about sustainability in AI development.
- Model Optimization Toolkit: TensorFlow toolkit for optimizing machine learning models for deployment and execution.
- Green Coding: An LLM service for prompting most open-source models and seeing their resource usage.
- EcoLogits: A Python library that tracks the energy consumption and environmental footprint of using generative AI models through APIs.
- Perplexity Kernels: GPU kernels by Perplexity.
- Fast Tokenizer: An efficient and optimized tokenizer engine for LLM inference serving.
- WeightWatcher: An open-source diagnostic tool for analyzing deep neural networks (DNNs) without needing access to training or even test data.
- Cockpit: A Practical Debugging Tool for Training Deep Neural Networks.
- Electricity Maps: A live map showing the origin of electricity in world regions and its CO2 intensity.
- MLCA: A tool for machine learning life cycle assessment.
- TritonParse: A visualization and analysis tool for Triton IR files, designed to help developers analyze, debug, and understand Triton kernel compilation processes.
- Routing on Random Forests: A framework for training and serving random-forest-based LLM routers, enabling cost optimization.
- ExLlamaV3: An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs.
- FlashDeBERTa: Flash implementation of the DeBERTa disentangled attention mechanism.
- QuACK: An assortment of kernels for GPUs.
- Pi-Quant: An assortment of kernels for CPUs.
- pplx-kernels: An assortment of kernels for GPUs.
- LMCache: An LLM serving engine extension to reduce time-to-first-token (TTFT) and increase throughput, especially under long-context scenarios, by optimizing the KV caches.
- FastWan: A family of video generation models trained via "sparse distillation".
- GEAK Agent: An LLM-based multi-agent framework that automatically generates functional and efficient GPU kernels.
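
As a concrete example of the tracking tools above, here is a minimal Code Carbon usage sketch (referenced in the Code Carbon entry). The workload and `project_name` are illustrative placeholders; `EmissionsTracker` with `start()`/`stop()` is CodeCarbon's documented API, and `stop()` returns the estimated emissions in kg CO2-eq.

```python
# Minimal sketch: estimating the carbon footprint of a code block with CodeCarbon.
# Install with: pip install codecarbon
from codecarbon import EmissionsTracker

def run_workload():
    # Placeholder for the code you want to measure (training loop, inference, ...).
    return sum(i * i for i in range(10_000_000))

tracker = EmissionsTracker(project_name="ai-efficiency-demo")  # name is illustrative
tracker.start()
try:
    run_workload()
finally:
    emissions_kg = tracker.stop()  # estimated kg CO2-eq for the measured block

print(f"Estimated emissions: {emissions_kg:.6f} kg CO2-eq")
```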
- "Energy and AI Observatory" (2025) - IEA
- "AIβs Impacts, how to limit them, and why" (2025) - Better Tech
- "How much energy does ChatGPT use?" (2025) - Epoch AI
- "Data centers and artificial intelligence: the race to gigantism" (2025) - Le Monde
- "What's the environmental cost of AI?" (2024) - CO2 AI
- "Shrinking the giants: Paving the way for TinyAI" (2024) - Cell Press
- "DeepSeek might not be such good news for energy after all" (2024) - MIT Technology Review
- "AI already uses as much energy as a small country. Itβs only the beginning." (2024) - Vox
- "Quelle contribution du numΓ©rique Γ la dΓ©carbonation ?" (2024) - France StratΓ©gie
- "Les promesses de lβIA grevΓ©es par un lourd bilan carbone" (2024) - Le Monde
- "How much electricity does AI consume?" (2024) - The Verge
- "How do I track the direct environmental impact of my own inference and training when working with AI?" (2024) - Blog
- "Data center emissions probably 662% higher than big tech claims. Can it keep up the ruse?" (2024) - The Guardian
- "Light bulbs have energy ratings β so why canβt AI chatbots?" (2024) - Nature
- "The Environmental Impacts of AI -- Primer" (2024) - Hugging Face
- "The Climate and Sustainability Implications of Generative AI" (2024) - MIT
- "AI's "eye-watering" use of resources could be a hurdle to achieving climate goals, argue experts" (2023) - dezeen
- "How coders can help save the planet?" (2023) - Blog
- "Reducing the Carbon Footprint of Generative AI" (2023) - Blog
- "The MPG of LLMs: Exploring the Energy Efficiency of Generative AI" (2023) - Blog
- "Ecologie numΓ©rique: LβIA durable, entre vΕu pieux et opportunitΓ© de marchΓ©" (2025) - LibΓ©ration
- "The environmental impact of local text AI" (2025) - Green Spector
- "Misinformation by Omission: The Need for More Environmental Transparency in AI" (2025) - None
- "A General Framework for Frugal AI" (2025) - AFNOR
- "The 2025 AI Index Report" (2025) - Stanford Human-centered Artificial Intelligence
- "Energy and AI" (2025) - International Energy Agency
- "Key challenges for the environmental performance of AI" (2025) - French Ministry
- "Artificial Intelligence and electricity: A system dynamics approach" (2024) - Schneider
- "Notable AI Models" (2025) - Epoch AI
- "Powering Artificial Intelligence" (2024) - Deloitte
- "Google Sustainability Reports" (2024) - Google
- "How much water does AI consume? The public deserves to know" (2023) - OECD
- "Measuring the environmental impacts of artificial intelligence compute and applications" (2022) - OECD
- "Our contribution to a global environmental standard for AI (2025)" - Mistral AI
- "AI: It's All About Inference Now (2025)" - ACM Queue
- "ScalarLM vLLM Optimization with Virtual Channels" (2025) - ScalarLM
- "Review of Inference Optimization" (2025) - Aussie AI
- "The Limits of Large Fused Kernels on Nvidia GPUs: Why Real-Time AI Inference Needs More" (2025) - Smallest AI
- "How Much Power does a SOTA Open Video Model Use?" (2025) - Hugging Face
- "Improving Quantized FP4 Weight Quality via Logit Distillation" (2025) - Mobius Labs
- "Introducing NVFP4 for Efficient and Accurate Low-Precision Inference" (2025) - Nvidia
- "The LLM Engineer Almanac" (2025) - Modal
- "Enhance Your Models in 5 Minutes with the Hugging Face Kernel Hub" (2025) - Hugging Face
- "Reduce, Reuse, Recycle: Why Open Source is a Win for Sustainability" (2025) - Hugging Face
- "Mixture of Experts: When Does It Really Deliver Energy Efficiency?" (2025) - Neural Watt
- "Efficient and Portable Mixture-of-Experts Communication" (2025) - Perplexity
- "Optimizing Tokenization for Faster and Efficient LLM Processing" (2025) - Medium
- "Tensor Parallelism with CUDA - Multi-GPU Matrix Multiplication" (2025) - Substack
- "Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling" (2025) - Nvidia Developer
- "AI CUDA Engineer" (2025) - Sakana AI
- "The ML/AI Engineer's starter guide to GPU Programming" (2025) - Neural Bits
- "Understanding Quantization for LLMs" (2024) - Medium
- "Don't Merge Your LoRA Adapter Into a 4-bit LLM" (2023) - Substack
- "Matrix Multiplication Background User's Guide" (2023) - Nvidia Developer
- "GPU Performance Background User's Guide" (2023) - Nvidia Developer

## Books

- Programming Massively Parallel Processors: A Hands-on Approach (2022), Wen-mei W. Hwu, David B. Kirk, Izzat El Hajj
- Efficient Deep Learning (2022), Gaurav Menghani, Naresh Singh

## Lectures

- AI Efficiency Courses: Slides, Exercises (2025) - Lecture by Bertrand Charpentier
- Data Compression, Theory and Applications: YouTube, Slides (2024) - Stanford
- MIT HAN Lab (2024) - MIT lectures by Song Han's lab
- GPU Mode (2020) - Tutorials by the GPU Mode community

## Organizations

| Organization | Description | Website |
|---|---|---|
| Data4Good | A platform that connects data scientists with social impact projects to address global challenges using data. | data4good.org |
| Gen AI Impact | A platform dedicated to understanding the environmental footprint of generative AI. | genai-impact.org |
| Make.org | A global platform that empowers citizens to propose and take action on social and environmental issues through collective projects. | make.org |
| CodeCarbon | A tool that helps track the carbon emissions of machine learning models and optimizes them for sustainability. | codecarbon.io |
| Sustainable AI Coalition | An organization dedicated to advancing sustainability in AI technologies and promoting best practices for green AI. | sustainableaicoalition.org |
| FruitPunch AI | A community that builds AI solutions for impact organizations contributing to the SDGs. | fruitpunch.ai |

## Contributing

Contributions are welcome! Please follow our contribution guidelines to add new resources or suggest improvements that promote AI efficiency. You can contact @sharpenb if you have any questions.

## License

This project is licensed under the MIT License. Feel free to share and use the resources as needed.