Running Llama4 quantized on 2xH100 80GB #17628
Closed · ilyabcodin announced in Q&A · 0 comments
Hey everyone,
Has anyone had success running Llama4 with something like fp8 or experts_int8 quantization? Currently 4 GPUs is a little much, and 3 is impossible due to Llama's architecture, so I've been experimenting with different quantization types to get it running on 2xH100s (160GB total), but with no success - I keep running into CUDA out of memory.
I thought int8 would be sufficient, since it should roughly halve the weight footprint (i.e. ~110GB of VRAM required), but it didn't work.
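For concreteness, this is the kind of setup I mean by "fp8 quantization" in vLLM - a minimal sketch, not a verified working config; the model name, context length, and memory settings are assumptions I'd adjust per setup:

```python
# Minimal sketch: Llama4 on 2 GPUs with on-the-fly fp8 weight quantization in vLLM.
# Model name, max_model_len, and gpu_memory_utilization are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed checkpoint
    tensor_parallel_size=2,       # shard weights across the 2 H100s
    quantization="fp8",           # dynamic fp8 weight quantization
    max_model_len=8192,           # cap the context so the KV cache stays small
    gpu_memory_utilization=0.95,  # leave a small margin for activations/graphs
)

out = llm.generate(["Hello, Llama4!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

Swapping `quantization="fp8"` for `"experts_int8"` (int8 on the MoE expert weights only) is the other variant I've tried; the equivalent server invocation would be something like `vllm serve <model> --tensor-parallel-size 2 --quantization fp8 --max-model-len 8192`.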