Running Llama4 quantized on 2xH100 80GB #17628
Closed · ilyabcodin announced in Q&A · 0 comments
Hey everyone,
Has anyone had success running Llama4 with something like fp8 or experts_int8 quantization? Currently 4 GPUs is a little much, and 3 is impossible due to Llama's architecture, so I've been experimenting with different quantization types to get it running on 2xH100s (160GB total), but with no success - I keep running into CUDA out of memory.
I thought int8 would be sufficient, since it should roughly halve the weight footprint (i.e. ~110GB of VRAM required), but it didn't work.
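For concreteness, this is the kind of setup I mean by "fp8 quantization" in vLLM - a minimal sketch, not a verified working config; the model name, context length, and memory settings are assumptions I'd adjust per setup:

```python
# Minimal sketch: Llama4 on 2 GPUs with on-the-fly fp8 weight quantization in vLLM.
# Model name, max_model_len, and gpu_memory_utilization are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed checkpoint
    tensor_parallel_size=2,       # shard weights across the 2 H100s
    quantization="fp8",           # dynamic fp8 weight quantization
    max_model_len=8192,           # cap the context so the KV cache stays small
    gpu_memory_utilization=0.95,  # leave a small margin for activations/graphs
)

out = llm.generate(["Hello, Llama4!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

Swapping `quantization="fp8"` for `"experts_int8"` (int8 on the MoE expert weights only) is the other variant I've tried; the equivalent server invocation would be something like `vllm serve <model> --tensor-parallel-size 2 --quantization fp8 --max-model-len 8192`.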