gpt-oss Model - Crash loading + completion isn't quite right. #15135
Replies: 2 comments
-
Some more info... My cross-platform Cosmo build loads the model on a Raspberry Pi (ARM) without crashing. So I'm guessing there's a wrongly sized integer or some such in the new code that supports loading this model. I don't know enough to dig in and figure it out with purpose, but enough to flail around and maybe find something, so that is what I will do. The ARM build still has the problem of not being able to generate 10 tokens at a time or resume from an arbitrary spot in the completion. It seems as if prompt processing, completion, and the kv cache aren't in sync, or something along those lines. But again, I only have flail-around knowledge of how this code works. Hope this helps someone figure out the problem.
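To make the "wrongly sized integer" guess concrete, here is a purely illustrative Python sketch (not actual llama.cpp/ggml code; the dimensions are made up): a size that needs 64 bits turns into a negative value if it is ever squeezed through a 32-bit signed int, which is the kind of thing that would crash one build but not another.

```python
# Purely illustrative -- not llama.cpp/ggml code. Shows how a size that needs
# 64 bits becomes garbage if it passes through a 32-bit signed integer.

n_rows = 201_088     # made-up large dimension
row_bytes = 11_008   # made-up bytes per row

exact = n_rows * row_bytes          # Python ints never overflow: 2_213_576_704
wrapped = exact & 0xFFFFFFFF        # keep only the low 32 bits...
if wrapped >= 2**31:
    wrapped -= 2**32                # ...and reinterpret them as a signed int32

print(f"correct 64-bit size: {exact}")
print(f"wrapped 32-bit size: {wrapped}")  # negative -> bogus size/offset -> crash
```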
-
I've given up on this model. It appears to be unusable without structuring everything with chat tokens. It looks like some people are running into crashes when loading, so I look forward to seeing whether they've run into the same issue I did.
-
First, I build llama.cpp with Cosmopolitan for cross-platform binaries, kinda like llamafile, except I keep my fork up to date with changes here:
https://github.com/BradHutchings/Mmojo-Server
For these builds, I use CPU inference and have it use the `_generic` functions, rather than CPU-specific functions, for the two sub-builds (x86 and ARM). I've made my own subfolder of `ggml/src/ggml-cpu/arch` for my builds. This has worked great since the ggml stuff was refactored.

My build of Mmojo-Server, with the latest changes from here, crashes around "warming up the model" when using the gpt-oss-20B model. It works fine with other popular models like Google Gemma, Meta Llama, Mistral 7B, etc.
If I build llama.cpp with the same `_generic` CPU inference, there are no crashes. So it looks like something new is brushing up against an edge in Cosmo. I'm sure I'll figure it out.

The `/completion` endpoint in the server doesn't seem quite right with this model. It seems to trigger thinking/safety initially whenever it is invoked.

A demo I like to show is completing 10 tokens at a time. If I give it a cue like "Write me a story about my dogs." and generate 10 tokens at a time, calling the `/completion` endpoint on the old story plus the 10 tokens just generated, I expect to eventually see the whole story. What I see, though, is the model doubling back to evaluate the user request each time. It's as if we're not just using the raw weights to predict the next 10 good-enough tokens, but are instead pushing everything through chat formatting.

Does this make sense? Is it fixable?
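For reference, here is roughly the loop I mean, as a minimal Python sketch. It assumes a llama.cpp `llama-server` listening on `http://localhost:8080` and uses the `/completion` endpoint's `prompt`/`n_predict` request fields and `content` response field; the port, step count, and timeout are arbitrary.

```python
# Minimal sketch of the "10 tokens at a time" demo against /completion.
# Assumes llama-server is running on localhost:8080 and `requests` is installed.
import requests

URL = "http://localhost:8080/completion"
story = "Write me a story about my dogs."

for _ in range(20):  # arbitrary number of 10-token steps
    resp = requests.post(
        URL,
        json={
            "prompt": story,   # the old story plus everything generated so far
            "n_predict": 10,   # ask for just 10 more tokens
            # "cache_prompt": True,  # optionally reuse the kv cache between calls
        },
        timeout=120,
    )
    resp.raise_for_status()
    story += resp.json()["content"]  # append the newly generated text

print(story)
```

With raw weights I'd expect each iteration to simply continue the text; what I actually see is the model re-evaluating the request as if it were a fresh chat turn each time.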