
perf(app): reduce peak memory usage #8090


Open
wants to merge 3 commits into main

Conversation

psychedelicious
Collaborator

Summary

Reduce peak memory usage of the Invoke process. See the commits for details.

TL;DR: We thought we had a memory leak, but I'm not sure about that. I think we may just have slowly climbing memory usage that eventually peaks and plateaus. Before the plateau, it looks like a memory leak.

Once the process allocates memory from the OS, it can never give it back, but it can reuse it. So the apparent memory leak is actually poor internal management of allocations, stemming from our usage patterns.

Changes

  • `del` the patched tokenizer and related objects in the compel nodes so Python can free up the arenas they use
  • Skip the TI patching logic when there are no TIs to patch (see the sketch after this list)
  • GC before each queue item
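For the second change, the guard is essentially an early return: with no textual inversions, we never clone the tokenizer or touch the text encoder. A minimal sketch, assuming a context-manager style `apply_ti` similar to `ModelPatcher.apply_ti` (the signature and names here are illustrative, not the exact Invoke code):

```python
from contextlib import contextmanager

@contextmanager
def apply_ti(tokenizer, text_encoder, ti_list):
    if not ti_list:
        # No TIs: skip the tokenizer clone and text-encoder patching entirely,
        # avoiding the per-session allocations described below.
        yield tokenizer, None
        return

    # ...existing patching logic: clone the tokenizer, add the TI tokens,
    # patch the text encoder embeddings, yield the clone, then restore...
    yield tokenizer, None  # placeholder so the sketch stays self-contained
```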

Results

Sampled process memory usage after each queue item (SDXL t2i, nothing extra). The first column is memory at that time, the second is the delta from the previous sample. The invocation cache is disabled for this testing because it would really confound the results.

Top graph is usage, bottom is delta.


Notes

  • Before the improvements, it takes ~100 queue items for memory to plateau. So for this test, if you didn't know any better, you'd think we had a memory leak unless you waited for the ~100 generations
  • After the improvements, it takes ~15 queue items for mem to plateau.
  • Plateau is ~900 MB greater without the fixes
  • Before the fixes, we were seeing a ~20 to 30 MB increase per queue item
  • We have some caching internally: images, latents, and conditioning tensors are all LRU-cached. This contributes to the memory usage. I tried with these caches disabled and it reduced the time to plateau a little bit.
  • Even with the fixes, and the same graphs being executed, we sometimes need to allocate more memory. That's OK. It's the initial rapid climb that we want to reduce as much as possible.

Related Issues / Discussions

QA Instructions

Merge Plan

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

We've long suspected there is a memory leak in Invoke, but that may not be true. What looks like a memory leak may in fact be the expected behaviour for our allocation patterns.

We observe a ~20 to ~30 MB increase in memory usage per session executed. I did some prolonged tests, where I measured the process's RSS in bytes while doing 200 SDXL generations. I found that it eventually leveled off at around 100 generations, at which point memory usage had climbed by ~900 MB from its starting point.
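The measurement itself is straightforward; a sketch using `psutil` (an assumption here, any RSS source works) that prints the current value and the delta from the previous sample:

```python
import psutil

_process = psutil.Process()
_previous_rss = 0

def sample_rss(label: str) -> None:
    """Print the process RSS and its delta from the previous sample, in MB."""
    global _previous_rss
    rss = _process.memory_info().rss
    print(f"{label}: {rss / 1e6:,.1f} MB ({(rss - _previous_rss) / 1e6:+,.1f} MB)")
    _previous_rss = rss
```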

I used tracemalloc to diff the allocations of single session executions and found that we are allocating ~20 MB or so per session in `ModelPatcher.apply_ti()`.
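The diffing looked roughly like this (a sketch; `run_one_session` stands in for however you execute a single graph):

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

run_one_session()  # placeholder: execute a single session/graph here

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)  # the top entries pointed at ModelPatcher.apply_ti
```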

In `ModelPatcher.apply_ti()` we add tokens to the tokenizer when handling TIs. The added tokens should be scoped to only the current invocation, but there is no simple way to remove the tokens afterwards.

As a workaround for this, we clone the tokenizer, add the TI tokens to the clone, and use the clone when running compel. Afterwards, the cloned tokenizer is discarded.
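In rough terms the workaround looks like this (illustrative names; the real code lives in `ModelPatcher.apply_ti`):

```python
from copy import deepcopy

def make_session_tokenizer(tokenizer, ti_token_strings):
    # Clone the tokenizer so the added TI tokens are scoped to this session
    # and the shared/original tokenizer is never mutated.
    ti_tokenizer = deepcopy(tokenizer)
    ti_tokenizer.add_tokens(ti_token_strings)
    return ti_tokenizer  # discarded once the conditioning tensors are built
```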

The tokenizer uses ~20 MB of memory, and it has referrers/referents to other compel objects. This is what is causing the observed increase in memory per session!

We'd expect these objects to be GC'd, but Python doesn't do it immediately. After creating the conditioning tensors, we quickly move on to denoising, so there isn't any time for the GC to run and free up existing memory arenas/blocks for reuse. Instead, Python needs to request more memory from the OS.

We can improve the situation by immediately calling `del` on the tokenizer clone and related objects. In fact, we already had some code in the compel nodes to `del` some of these objects, but not all.
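A minimal sketch of the pattern in the compel invocations (helper names like `load_patched_tokenizer`, `make_compel`, and `save_output` are stand-ins, not Invoke's actual API):

```python
def _build_conditioning(prompt: str, save_output) -> None:
    # Hypothetical helpers/names; the real logic lives in the compel invocations.
    patched_tokenizer, text_encoder = load_patched_tokenizer()
    compel = make_compel(patched_tokenizer, text_encoder)
    conditioning = compel.build_conditioning_tensor(prompt)

    # Drop the only strong references to the cloned tokenizer and its compel
    # companions as soon as the tensors exist, so their arenas can be reused
    # instead of lingering while later work runs.
    del compel, patched_tokenizer, text_encoder

    save_output(conditioning)  # whatever persists the conditioning tensors
```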

Adding the `del`s vastly improves things. We hit peak RSS in half as many sessions (~50 or fewer), and the peak is now ~100 MB more than the starting value. There is still a gradual increase in memory usage until we level off.

Forcing a GC before each queue item reduces peak memory usage at a negligible cost. Queue items typically take on the order of seconds, making the time cost of a GC essentially free.

Not a great idea on a hotter code path, though.
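A sketch of what this looks like in a session-processor style loop (the loop and `run_session` are illustrative, not Invoke's actual processor):

```python
import gc

def process_queue(queue) -> None:
    for queue_item in queue:
        # A full collection costs milliseconds, which is noise next to queue
        # items that take seconds, and it frees arenas for reuse up front.
        gc.collect()
        run_session(queue_item)  # placeholder for executing the session graph
```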
@github-actions github-actions bot added the python, invocations, backend, and services labels on Jun 7, 2025