
Conversation

@dwsutherland (Member) commented Nov 18, 2025

closes #7078

From the example in the associated issue:
Before (using queue size 5, sleep 5, n=1):
[memory profile screenshots]
(Drops indicate a new cycle point. This memory problem presents with a huge number of tasks between prunings.)

After (using queue size 10, sleep 5, n=1):
[memory profile screenshots]

This fix essentially extends #6727 to some short-lived objects that receive a barrage of deltas over their lifetime.
Workflows that balloon out to GBs for this reason should now be back in the ~500 MB realm.

This should also reduce the corresponding UIS memory footprint (as the UIS replicates the delta application there).

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests functionally covered by any use of the data-store with jobs, no specific memory tests in place.
  • Changelog entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@dwsutherland dwsutherland added this to the 8.6.1 milestone Nov 18, 2025
@dwsutherland dwsutherland self-assigned this Nov 18, 2025
@dwsutherland dwsutherland added bug Something is wrong :( small labels Nov 18, 2025
@hjoliver (Member) left a comment:


LGTM

@dwsutherland (Member, Author) commented:

Note: the coverage warning is just because I wanted to be more explicit about which types are being reset (so the other branch of an if statement is never hit). There's also a complaint about a line that isn't changed by this PR.

@oliver-sanders (Member) commented Nov 18, 2025

@dwsutherland, do we think that this issue would also affect the cylc-uiserver (which presumably has the same long-lived message problem)?

If so, does this fix also cover the UIS side of things?

@oliver-sanders (Member) commented:

This fix is essentially serialising / deserialising the entire data store periodically.

The main concern here is that this ends up being a high CPU hit.

From a quick profiling run with the following config, this change increased the CPU hit of the reset_protobuf_object method by ~3x up to a whopping 0.01s, so I think we're good here.

@hjoliver (Member) commented:

From a quick profiling run with the following config, this change increased the CPU hit of the reset_protobuf_object method by ~3x up to a whopping 0.01s, so I think we're good here.

I just went with fixing a massive memory leak as the priority, but good that you checked that.

@dwsutherland (Member, Author) commented Nov 19, 2025

@dwsutherland, do we think that this issue would also affect the cylc-uiserver (which presumably has the same long lived message problem)?

If so, does this fix also cover the UIS side of things?

Yes, as mentioned:

This should also reduce respective UIS memory footprint (as it will replicate the delta application there).

@dwsutherland (Member, Author) commented Nov 19, 2025

This fix is essentially serialising / deserialising the entire data store periodically.

No, I moved away from serialising/deserialising (I initially did it that way out of paranoia, but changed to something less intensive in that same #6727 ticket).
reset_protobuf_object just creates a new store object (so a fresh memory allocation), then uses new_obj.CopyFrom(orig_obj) with the original delta-updated object.
This may still leave a one-delta accumulation with each (new + copy), but it avoids the accumulation over all deltas (orig + delta1 + delta2 + ...).

Also, this isn't periodic, and it applies only to the selected types (i.e. workflow, T/F, and T/F-proxies; jobs didn't appear to be too bad), on every delta.
This means it happens very frequently, in small doses,
as opposed to batching the reset and doing the whole store periodically (or a flagged part of it, which either way adds flagging/processing overhead), which could produce a more noticeable jerk in performance (even if it were more efficient overall).
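The new-allocation-plus-copy pattern described above can be illustrated with a hypothetical plain-Python analogue (this is not the Cylc code; the real implementation copies a protobuf message via new_obj.CopyFrom(orig_obj), and the names here are made up for illustration). A CPython dict behaves similarly to the delta-updated store in one respect: deleting entries never shrinks its internal table, so after many inserts and prunes it stays sized for its historical peak, while copying into a fresh dict allocates only what the current contents need.

```python
import sys

def reset_by_copy(store):
    """Hypothetical analogue of the reset: allocate a fresh container
    and copy the live contents into it, discarding capacity that
    accumulated from earlier in-place updates."""
    return dict(store)  # fresh allocation, sized for current contents only

# Simulate a store object that received a barrage of deltas:
# many inserts followed by pruning.
store = {i: f"task-{i}" for i in range(10_000)}
for i in range(9_990):
    del store[i]  # prune; CPython dicts do not shrink on deletion

compact = reset_by_copy(store)
assert compact == store            # same logical contents
# ...but the fresh copy is sized for 10 entries, not 10,000:
print(sys.getsizeof(store), ">", sys.getsizeof(compact))
```

The trade-off is the same one discussed in this thread: each reset costs a full copy of the live contents, which is worthwhile when in-place updates would otherwise accumulate memory indefinitely.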

From a quick profiling run with the following config, this change increased the CPU hit of the reset_protobuf_object method by ~3x up to a whopping 0.01s, so I think we're good here.

Good to see it's not a massive hit; I would trade 0.01 s to avoid a 500% increase in memory usage (for some workflows).

