
Conversation

@dwsutherland (Member) commented Nov 18, 2025

closes #7078

From the example in the associated issue:
Before (using queue size 5, sleep 5, n=1):
[memory profile screenshots]
(Drops indicate a new cycle point. This memory problem presents with a huge number of tasks between prunings.)

After (using queue size 10, sleep 5, n=1):
[memory profile screenshots]

This fix essentially extends #6727 to some short-lived objects that receive a barrage of deltas over their lifetime.
Workflows that balloon out to GBs for this reason should now be back in the ~500 MB realm.

This should also reduce the corresponding UIS memory footprint (as the UIS replicates the delta application there).

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests functionally covered by any use of the data-store with jobs, no specific memory tests in place.
  • Changelog entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@dwsutherland dwsutherland added this to the 8.6.1 milestone Nov 18, 2025
@dwsutherland dwsutherland self-assigned this Nov 18, 2025
@dwsutherland dwsutherland added bug Something is wrong :( small labels Nov 18, 2025
@hjoliver (Member) left a comment:


LGTM

@dwsutherland (Member, Author) commented:

Note: the coverage warning is just because I wanted to be more explicit about which types are being reset (so the other branch of an if statement is never hit). There's also a complaint about a line that isn't changed by this PR.

@oliver-sanders (Member) commented Nov 18, 2025

@dwsutherland, do we think that this issue would also affect the cylc-uiserver (which presumably has the same long-lived message problem)?

If so, does this fix also cover the UIS side of things?

@oliver-sanders (Member) commented:

This fix is essentially serialising / deserialising the entire data store periodically.

The main concern here is that this ends up being a high CPU hit.

From a quick profiling run with the following config, this change increased the CPU hit of the reset_protobuf_object method by ~3x up to a whopping 0.01s, so I think we're good here.

@hjoliver (Member) commented:

From a quick profiling run with the following config, this change increased the CPU hit of the reset_protobuf_object method by ~3x up to a whopping 0.01s, so I think we're good here.

I just went with fixing a massive memory leak as the priority, but good that you checked that.

@dwsutherland (Member, Author) commented Nov 19, 2025

@dwsutherland, do we think that this issue would also affect the cylc-uiserver (which presumably has the same long lived message problem)?

If so, does this fix also cover the UIS side of things?

Yes, as mentioned:

This should also reduce respective UIS memory footprint (as it will replicate the delta application there).

@dwsutherland (Member, Author) commented Nov 19, 2025

This fix is essentially serialising / deserialising the entire data store periodically.

No, I moved away from serialising/deserialising (I initially did it that way out of paranoia, but changed to something less intensive in that same #6727 ticket).
reset_protobuf_object just creates a new store object (so a fresh memory allocation), then uses new_obj.CopyFrom(orig_obj) with the original delta-updated object.
This may still leave a one-delta accumulation with each (new + copy), but it avoids the accumulation over all deltas (orig + delta1 + delta2 + ...).

Also, this isn't periodic, and it applies only to the selected types (i.e. workflow, T/F, and T/F-proxies; jobs didn't appear to be too bad), on every delta.
This means it happens very frequently, in small doses,
as opposed to batching the reset and doing the whole store periodically (or a flagged part of it, which either way adds flagging/processing overhead), which could produce a more noticeable jerk in performance (even if it were more efficient overall).
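The new-allocation-plus-copy pattern described above can be illustrated with a hypothetical plain-Python analogue (this is not the Cylc code; the real implementation copies a protobuf message via new_obj.CopyFrom(orig_obj), and the names here are made up for illustration). A CPython dict behaves similarly to the delta-updated store in one respect: deleting entries never shrinks its internal table, so after many inserts and prunes it stays sized for its historical peak, while copying into a fresh dict allocates only what the current contents need.

```python
import sys

def reset_by_copy(store):
    """Hypothetical analogue of the reset: allocate a fresh container
    and copy the live contents into it, discarding capacity that
    accumulated from earlier in-place updates."""
    return dict(store)  # fresh allocation, sized for current contents only

# Simulate a store object that received a barrage of deltas:
# many inserts followed by pruning.
store = {i: f"task-{i}" for i in range(10_000)}
for i in range(9_990):
    del store[i]  # prune; CPython dicts do not shrink on deletion

compact = reset_by_copy(store)
assert compact == store            # same logical contents
# ...but the fresh copy is sized for 10 entries, not 10,000:
print(sys.getsizeof(store), ">", sys.getsizeof(compact))
```

The trade-off is the same one discussed in this thread: each reset costs a full copy of the live contents, which is worthwhile when in-place updates would otherwise accumulate memory indefinitely.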

From a quick profiling run with the following config, this change increased the CPU hit of the reset_protobuf_object method by ~3x up to a whopping 0.01s, so I think we're good here.

Good to see it's not a massive hit; I would trade 0.01 s to avoid a 500% increase in memory usage (for some workflows).

