gpt-oss Model - Crash loading + completion isn't quite right. #15135
Replies: 2 comments
-
Some more info... My cross-platform Cosmo build loads the model on a Raspberry Pi (ARM) without crashing. So I'm guessing there's a wrongly sized integer or some such in the new code that supports loading this model. I don't know enough to dig in and figure it out with purpose, but enough to flail around and maybe find something, so that is what I will do. The ARM build still has the problem of not being able to generate 10 tokens at a time or resume from an arbitrary spot in the completion. It seems as if prompt processing, completion, and the kv cache aren't in sync, or something along those lines. But again, I only have flail-around knowledge of how this code works. Hope this helps someone figure out the problem.
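To make the "wrongly sized integer" guess concrete, here is a purely illustrative Python sketch (not actual llama.cpp/ggml code; the dimensions are made up): a size that needs 64 bits turns into a negative value if it is ever squeezed through a 32-bit signed int, which is the kind of thing that would crash one build but not another.

```python
# Purely illustrative -- not llama.cpp/ggml code. Shows how a size that needs
# 64 bits becomes garbage if it passes through a 32-bit signed integer.

n_rows = 201_088     # made-up large dimension
row_bytes = 11_008   # made-up bytes per row

exact = n_rows * row_bytes          # Python ints never overflow: 2_213_576_704
wrapped = exact & 0xFFFFFFFF        # keep only the low 32 bits...
if wrapped >= 2**31:
    wrapped -= 2**32                # ...and reinterpret them as a signed int32

print(f"correct 64-bit size: {exact}")
print(f"wrapped 32-bit size: {wrapped}")  # negative -> bogus size/offset -> crash
```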
-
I've given up on this model. It appears to be unusable without structuring everything with chat tokens. It looks like some people are running into crashes when loading, so I look forward to seeing whether they've run into the same issue I did.
-
First, I build llama.cpp with Cosmopolitan for cross-platform binaries, kinda like llamafile, except I keep my fork up to date with changes here:
https://github.com/BradHutchings/Mmojo-Server
For these builds, I use CPU inference and have it use the `_generic` functions, rather than CPU-specific functions, for the two sub-builds (x86 and ARM). I've made my own subfolder of `ggml/src/ggml-cpu/arch` for my builds. This has worked great since the ggml stuff was refactored.

My build of Mmojo-Server, with the latest changes from here, crashes around "warming up the model" when using the gpt-oss-20B model. It works fine with other popular models like Google Gemma, Meta Llama, Mistral 7B, etc.
If I build llama.cpp with the same `_generic` CPU inference, there are no crashes. So it looks like something new is brushing up against an edge in Cosmo. I'm sure I'll figure it out.

The `/completion` endpoint in the server doesn't seem quite right with this model. It seems to trigger thinking/safety initially whenever it is invoked.

A demo I like to show is completing 10 tokens at a time. If I give it a cue like "Write me a story about my dogs." and generate 10 tokens at a time, calling the `/completion` endpoint on the old story plus the 10 tokens just generated, I expect to eventually see the whole story. What I see, though, is the model doubling back to evaluate the user request each time. It's as if we're not just using the raw weights to predict the next 10 good-enough tokens, but are instead pushing everything through chat formatting.

Does this make sense? Is it fixable?
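For reference, here is roughly the loop I mean, as a minimal Python sketch. It assumes a llama.cpp `llama-server` listening on `http://localhost:8080` and uses the `/completion` endpoint's `prompt`/`n_predict` request fields and `content` response field; the port, step count, and timeout are arbitrary.

```python
# Minimal sketch of the "10 tokens at a time" demo against /completion.
# Assumes llama-server is running on localhost:8080 and `requests` is installed.
import requests

URL = "http://localhost:8080/completion"
story = "Write me a story about my dogs."

for _ in range(20):  # arbitrary number of 10-token steps
    resp = requests.post(
        URL,
        json={
            "prompt": story,   # the old story plus everything generated so far
            "n_predict": 10,   # ask for just 10 more tokens
            # "cache_prompt": True,  # optionally reuse the kv cache between calls
        },
        timeout=120,
    )
    resp.raise_for_status()
    story += resp.json()["content"]  # append the newly generated text

print(story)
```

With raw weights I'd expect each iteration to simply continue the text; what I actually see is the model re-evaluating the request as if it were a fresh chat turn each time.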