Skip to content

runtime/debug.SetCrashOutput + GOGC=off + GOMEMLIMIT can lead to OOM kills #73490

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
felixge opened this issue Apr 24, 2025 · 8 comments
Closed
Labels
BugReport Issues describing a possible bug in the Go implementation.

Comments

@felixge
Copy link
Contributor

felixge commented Apr 24, 2025

We recently noticed some OOM kills, and after a fun debugging adventure involving core dumps, we concluded that the issue was caused by our usage of the SetCrashOutput API which mirrors the example from the documentation.

In our case the parent process is subject to a GOMEMLIMIT env variable, and the child process inherited it. This in combination with some init functions that were spawning background goroutine activity caused the child process to grow its heap without the GC kicking in due to the high memory limit inherited from the parent process. Eventually this lead to the child+parent exceeding the memory limit of the container, resulting in an OOM kill.

Suggestion: Let's update the docs to either strip the memory limit env var or avoid passing any environment variables from the parent process (remove the os.Environ()). I can send a patch if there is consensus behind either of those two approaches.

cc @mknyszek

Edit: The parent process was also setting GOGC=off. That's what actually caused the OOM.

@felixge felixge changed the title runtime/debug.SetCrashOutput example interacts badly with GOMEMLIMIT runtime/debug.SetCrashOutput + GOMEMLIMIT can lead to OOM kills Apr 24, 2025
@gabyhelp gabyhelp added the BugReport Issues describing a possible bug in the Go implementation. label Apr 24, 2025
@adonovan
Copy link
Member

adonovan commented Apr 24, 2025

In our case the parent process is subject to a GOMEMLIMIT env variable, and the child process inherited it. This in combination with some init functions that were spawning background goroutine activity caused the child process to grow its heap without the GC kicking in due to the high memory limit inherited from the parent process. Eventually this lead to the child+parent exceeding the memory limit of the container, resulting in an OOM kill.

Funny, I was just thinking this morning about how the crashmonitor causes all init functions to run twice, once in the parent and once in the child, causing both the startup latency and memory allocation costs of the application to be doubled. I was wondering whether we already have any means of profiling the initialization: what I had in mind was an environment variable that would cause all init functions to report their wall time to stderr; such a mechanism could also report whether any go statements were executed. This would make it easy to track down bloated initializers and make them lazy (using sync.Once) instead.

Also, I was wondering whether the crashmonitor child process's use of the current directory after the parent process has terminated might be the cause of #73481 on Windows, and that if so, the remedy is for the child to os.Chdir("/") like a UNIX daemon process.

I agree that a change to the documentation and example is the appropriate fix.

@felixge
Copy link
Contributor Author

felixge commented Apr 24, 2025

Funny, I was just thinking this morning about how the crashmonitor causes all init functions to run twice, once in the parent and once in the child, causing both the startup latency and memory allocation costs of the application to be doubled. I was wondering whether we already have any means of profiling the initialization: what I had in mind was an environment variable that would cause all init functions to report their wall time to stderr; such a mechanism could also report whether any go statements were executed. This would make it easy to track down bloated initializers and make them lazy (using sync.Once) instead.

GODEBUG=inittrace=1 does this AFAIK.

I agree that a change to the documentation and example is the appropriate fix.

Which fix do you prefer? Not inherit the env of the parent at all, or stripping the GOMEMLIMIT (and maybe GOGC) env vars?

Anyway, I think even with this improvement, the nature of init is kind of at odds with the way the crash monitor child process is supposed to work. In theory init functions could do all kinds of undesirable things, and it's very hard to reason about the risks in large programs with big import graphs.

One thing we're considering on our end is embedding a fat cosmopolitan binary to use as the crash monitor. This would be written to a /tmp location before spawning a child process for it.

@mknyszek
Copy link
Contributor

mknyszek commented Apr 24, 2025

Just to be as precise as possible, it's not just passing GOMEMLIMIT that causes this. You also have to set GOGC=off (or to some high number), right? If it was just GOMEMLIMIT set and GOGC was the default, large memory growth shouldn't be a problem.

FWIW, inheriting environment variables causes problems in a variety of ways, not just with GOMEMLIMIT specifically. IIRC we've tripped over unintentionally propagating other environment variables in our infrastructure more than once. From that perspective, this seems more like inherent problems with environment variables, not so much GOMEMLIMIT or SetCrashOutput specifically.

Perhaps we should recommend GOMEMLIMIT and GOGC be set via flags and the runtime/debug package, or warn about using them as environment variables in the docs as a reminder.

I am also in support of updating the example for SetCrashOutput. To that end, perhaps we also only update the docs to strip GOGC, not GOMEMLIMIT.

@felixge
Copy link
Contributor Author

felixge commented Apr 24, 2025

Just to be as precise as possible, it's not just passing GOMEMLIMIT that causes this. You also have to set GOGC=off (or to some high number), right? If it was just GOMEMLIMIT set and GOGC was the default, large memory growth shouldn't be a problem.

Yes! Sorry for omitting this important (!) detail. This program was indeed operating with GOGC=off:

(gdb) p 'syscall.envs'
$2 = []string = {"GOMEMLIMIT=18GiB", "GOGC=off", ...}

FWIW, inheriting environment variables causes problems in a variety of ways, not just with GOMEMLIMIT specifically. IIRC we've tripped over unintentionally propagating other environment variables in our infrastructure more than once. From that perspective, this seems more like inherent problems with environment variables, not so much GOMEMLIMIT or SetCrashOutput specifically.

Perhaps we should recommend GOMEMLIMIT and GOGC be set via flags and the runtime/debug package, or warn about using them as environment variables in the docs as a reminder.

I am also in support of updating the example for SetCrashOutput. To that end, perhaps we also only update the docs to strip GOGC, not GOMEMLIMIT.

Sounds good. But considering what you wrote above, wouldn't it be better to not propagate any environment variables in the example?

@felixge felixge changed the title runtime/debug.SetCrashOutput + GOMEMLIMIT can lead to OOM kills runtime/debug.SetCrashOutput + GOGC=off + GOMEMLIMIT can lead to OOM kills Apr 24, 2025
@mknyszek
Copy link
Contributor

Sounds good. But considering what you wrote above, wouldn't it be better to not propagate any environment variables in the example?

Yup, sorry, I support that, too. :)

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/667775 mentions this issue: runtime/debug: update SetCrashOutput example to not pass parent env vars

@adonovan
Copy link
Member

GODEBUG=inittrace=1 does this AFAIK.

Thanks; I had a vague feeling it existed already, just without the goroutine part.

Anyway, I think even with this improvement, the nature of init is kind of at odds with the way the crash monitor child process is supposed to work. In theory init functions could do all kinds of undesirable things, and it's very hard to reason about the risks in large programs with big import graphs.

I agree, it does require an unsatisfactory degree of control over the whole executable. However, depending on what you want to do, the crashmonitor (child) process currently must use the same executable as the target (parent) process because it needs access to the correct symbol table and pclntab (runtime.CallersFrames and runtime.Func.FileLine). At least, that's what the gopls crashmonitor does. If the runtime were responsible for computing all this information and emitting it as JSON, the child process could be almost anything. We've talked about doing this (#65761), but it's a fiddly project.

@gabyhelp
Copy link

Related Code Changes

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BugReport Issues describing a possible bug in the Go implementation.
Projects
None yet
Development

No branches or pull requests

5 participants