
perf(app): reduce peak memory usage #8090


Open
wants to merge 3 commits into main

Conversation

psychedelicious
Collaborator

Summary

Reduce peak memory usage of the Invoke process. See the commits for details.

TL;DR: We thought we had a memory leak, but I'm not sure about that. I think we may just have slowly climbing memory usage that eventually peaks and plateaus. Before the plateau, it looks like a memory leak.

Once the process allocates memory from the OS, it can never give it back, but it can reuse it. So the apparent memory leak is actually poor internal management of allocations, stemming from our usage patterns.

Changes

  • `del` the patched tokenizer and related objects in the compel nodes so Python can free up the arenas they use
  • Skip the TI patching logic when there are no TIs to patch (see the sketch after this list)
  • GC before each queue item
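For the second change, the guard is essentially an early return: with no textual inversions, we never clone the tokenizer or touch the text encoder. A minimal sketch, assuming a context-manager style `apply_ti` similar to `ModelPatcher.apply_ti` (the signature and names here are illustrative, not the exact Invoke code):

```python
from contextlib import contextmanager

@contextmanager
def apply_ti(tokenizer, text_encoder, ti_list):
    if not ti_list:
        # No TIs: skip the tokenizer clone and text-encoder patching entirely,
        # avoiding the per-session allocations described below.
        yield tokenizer, None
        return

    # ...existing patching logic: clone the tokenizer, add the TI tokens,
    # patch the text encoder embeddings, yield the clone, then restore...
    yield tokenizer, None  # placeholder so the sketch stays self-contained
```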

Results

Sampled process memory usage after each queue item (SDXL t2i, nothing extra). The first column is memory at that time, the second is the delta from the previous sample. The invocation cache is disabled for this testing because it would really confound the results.

Top graph is usage, bottom is delta.


Notes

  • Before the improvements, it takes ~100 queue items for memory to plateau. So for this test, if you didn't know any better, you'd think we had a memory leak unless you waited for the ~100 generations
  • After the improvements, it takes ~15 queue items for mem to plateau.
  • Plateau is ~900 MB greater without the fixes
  • Before the fixes, we were seeing a ~20 to 30 MB increase per queue item
  • We have some caching internally: images, latents, and conditioning tensors are all LRU-cached. This contributes to the memory usage. I tried with these caches disabled and it reduced the time to plateau a little bit.
  • Even with the fixes, and the same graphs being executed, we sometimes need to allocate more memory. That's OK. It's the initial rapid climb that we want to reduce as much as possible.

Related Issues / Discussions

QA Instructions

Merge Plan

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

We've long suspected there is a memory leak in Invoke, but that may not be true. What looks like a memory leak may in fact be the expected behaviour for our allocation patterns.

We observe a ~20 to ~30 MB increase in memory usage per session executed. I did some prolonged tests, where I measured the process's RSS in bytes while doing 200 SDXL generations. I found that it eventually leveled off at around 100 generations, at which point memory usage had climbed by ~900 MB from its starting point.
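The measurement itself is straightforward; a sketch using `psutil` (an assumption here, any RSS source works) that prints the current value and the delta from the previous sample:

```python
import psutil

_process = psutil.Process()
_previous_rss = 0

def sample_rss(label: str) -> None:
    """Print the process RSS and its delta from the previous sample, in MB."""
    global _previous_rss
    rss = _process.memory_info().rss
    print(f"{label}: {rss / 1e6:,.1f} MB ({(rss - _previous_rss) / 1e6:+,.1f} MB)")
    _previous_rss = rss
```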

I used tracemalloc to diff the allocations of single session executions and found that we are allocating ~20 MB or so per session in `ModelPatcher.apply_ti()`.
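The diffing looked roughly like this (a sketch; `run_one_session` stands in for however you execute a single graph):

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

run_one_session()  # placeholder: execute a single session/graph here

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)  # the top entries pointed at ModelPatcher.apply_ti
```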

In `ModelPatcher.apply_ti()` we add tokens to the tokenizer when handling TIs. The added tokens should be scoped to only the current invocation, but there is no simple way to remove the tokens afterwards.

As a workaround for this, we clone the tokenizer, add the TI tokens to the clone, and use the clone when running compel. Afterwards, the cloned tokenizer is discarded.
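In rough terms the workaround looks like this (illustrative names; the real code lives in `ModelPatcher.apply_ti`):

```python
from copy import deepcopy

def make_session_tokenizer(tokenizer, ti_token_strings):
    # Clone the tokenizer so the added TI tokens are scoped to this session
    # and the shared/original tokenizer is never mutated.
    ti_tokenizer = deepcopy(tokenizer)
    ti_tokenizer.add_tokens(ti_token_strings)
    return ti_tokenizer  # discarded once the conditioning tensors are built
```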

The tokenizer uses ~20 MB of memory, and it has referrers/referents to other compel objects. This is what is causing the observed increase in memory per session!

We'd expect these objects to be GC'd, but Python doesn't do it immediately. After creating the conditioning tensors, we quickly move on to denoising, so there isn't any time for the GC to run and free up existing memory arenas/blocks for reuse. Instead, Python needs to request more memory from the OS.

We can improve the situation by immediately calling `del` on the tokenizer clone and related objects. In fact, we already had some code in the compel nodes to `del` some of these objects, but not all.
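A minimal sketch of the pattern in the compel invocations (helper names like `load_patched_tokenizer`, `make_compel`, and `save_output` are stand-ins, not Invoke's actual API):

```python
def _build_conditioning(prompt: str, save_output) -> None:
    # Hypothetical helpers/names; the real logic lives in the compel invocations.
    patched_tokenizer, text_encoder = load_patched_tokenizer()
    compel = make_compel(patched_tokenizer, text_encoder)
    conditioning = compel.build_conditioning_tensor(prompt)

    # Drop the only strong references to the cloned tokenizer and its compel
    # companions as soon as the tensors exist, so their arenas can be reused
    # instead of lingering while later work runs.
    del compel, patched_tokenizer, text_encoder

    save_output(conditioning)  # whatever persists the conditioning tensors
```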

Adding the `del`s vastly improves things. We hit peak RSS in half as many sessions (~50 or fewer), and the peak is now ~100 MB more than the starting value. There is still a gradual increase in memory usage until we level off.

Forcing a GC before each queue item reduces peak memory usage at a negligible cost. Queue items typically take on the order of seconds, making the time cost of a GC essentially free.

Not a great idea on a hotter code path, though.
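A sketch of what this looks like in a session-processor style loop (the loop and `run_session` are illustrative, not Invoke's actual processor):

```python
import gc

def process_queue(queue) -> None:
    for queue_item in queue:
        # A full collection costs milliseconds, which is noise next to queue
        # items that take seconds, and it frees arenas for reuse up front.
        gc.collect()
        run_session(queue_item)  # placeholder for executing the session graph
```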
@github-actions github-actions bot added the python, invocations, backend, and services labels on Jun 7, 2025