
Conversation

@OrangeDog
Contributor

Backport #68394

@dwoz
Contributor

dwoz commented Oct 15, 2025

We do not need #68394. These changes will get merged forward.

@Sxderp
Contributor

Sxderp commented Oct 15, 2025

How does this prevent a leak? A method-scoped variable is now being assigned to the object? Shouldn't the variable get cleaned up once it leaves scope? Even if it doesn't, surely just assigning "unpacker = None" inside the catch block would do the same? What am I missing?

@OrangeDog
Contributor Author

@tcarroll25 can you explain your change?

@tcarroll25

tcarroll25 commented Oct 16, 2025

How does this prevent a leak? A method-scoped variable is now being assigned to the object? Shouldn't the variable get cleaned up once it leaves scope? Even if it doesn't, surely just assigning "unpacker = None" inside the catch block would do the same? What am I missing?

To give a quick summary of the situation:

In Python, memory leaks related to exception handling occur when a persistent reference to an exception object is held, preventing the garbage collector from cleaning it up. The danger is that an exception's traceback object references the stack frames the exception passed through, and each of those frames can hold references to all local variables in its scope, including very large data structures.

When an exception is raised, the Python interpreter bundles the exception object with a traceback object that captures the state of the call stack. The traceback keeps references to the stack frames, and each stack frame holds references to its local variables. A memory leak occurs if your code holds onto the exception or traceback object (or creates a reference cycle involving them) in a way that prevents the garbage collector from deleting them, and with them every local in those frames.
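
A minimal sketch of that mechanism, with purely illustrative names (this is not code from Salt):

```python
def handler():
    big_buffer = bytearray(100 * 1024 * 1024)  # large local, analogous to an Unpacker's internal buffer
    try:
        raise ValueError("boom")
    except ValueError as exc:
        saved = exc  # a reference that outlives the except block
    return saved


kept = handler()
# The traceback still references handler()'s frame, and that frame still
# references big_buffer, so the 100 MiB bytearray cannot be collected as
# long as `kept` (and therefore its __traceback__) stays reachable.
assert "big_buffer" in kept.__traceback__.tb_frame.f_locals
```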

In this case the local unpacker object was being preserved in the exception stack; I am 100% certain of this. I used tracemalloc to track memory usage and found over 1 GB of msgpack.Unpacker objects in memory after only an hour of continuously running saltutil.sync_all requests:

2025-10-03 17:58:51,756 [salt.master      :2717][INFO    ][877739] Top 10 lines
2025-10-03 17:58:51,757 [salt.master      :2722][INFO    ][877739] #1: utils/msgpack.py:84: 1016832.0 KiB
2025-10-03 17:58:51,758 [salt.master      :2726][INFO    ][877739]     msgpack.Unpacker.__init__(

Here is the end of the trace showing the exact location of the leak in tcp.py line 652:

2025-10-04 20:43:18,686 [salt.master      :2743][INFO    ][2181310]   File "/usr/lib/128technology/unzip/runfiles/x86_el9_pypi__39__salt_128tech_3007_8/salt/transport/tcp.py", line 652
2025-10-04 20:43:18,686 [salt.master      :2743][INFO    ][2181310]     unpacker = salt.utils.msgpack.Unpacker()
2025-10-04 20:43:18,686 [salt.master      :2743][INFO    ][2181310]   File "/usr/lib/128technology/unzip/runfiles/x86_el9_pypi__39__salt_128tech_3007_8/salt/utils/msgpack.py", line 84
2025-10-04 20:43:18,686 [salt.master      :2743][INFO    ][2181310]     msgpack.Unpacker.__init__(
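
For context, these numbers came from tracemalloc snapshots. A rough sketch of that kind of instrumentation (not the exact code running in our master):

```python
import tracemalloc

tracemalloc.start(25)  # record up to 25 frames per allocation

# ... run the workload, e.g. repeated saltutil.sync_all requests ...

snapshot = tracemalloc.take_snapshot()

# "Top 10 lines" style summary, grouped by allocating source line
for index, stat in enumerate(snapshot.statistics("lineno")[:10], 1):
    frame = stat.traceback[0]
    print(f"#{index}: {frame.filename}:{frame.lineno}: {stat.size / 1024:.1f} KiB")

# Full allocation traceback for the largest consumer, which is what
# pointed at tcp.py and utils/msgpack.py above
top = snapshot.statistics("traceback")[0]
print("\n".join(top.traceback.format()))
```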

Here is the memory usage with the fix, after running saltutil.sync_all requests for 16 hours:

2025-10-05 18:17:33,508 [salt.master      :2717][INFO    ][96930] Top 10 lines
2025-10-05 18:17:33,509 [salt.master      :2722][INFO    ][96930] #1: utils/msgpack.py:84: 5120.0 KiB
2025-10-05 18:17:33,510 [salt.master      :2726][INFO    ][96930]     msgpack.Unpacker.__init__(

I tried various fixes, but the only one that worked was making the unpacker a member variable and re-instantiating it when an exception occurs. This prevents the old buffers from being kept alive by the exception's preserved frames: those frames now only reference the member variable, which we have just explicitly replaced, so the previous Unpacker (and its memory) can be freed.
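
A condensed sketch of the pattern (illustrative class and method names, not the exact diff in this PR):

```python
import salt.utils.msgpack


class MessageHandler:
    def __init__(self):
        # Keep one Unpacker on the instance instead of a fresh local per call.
        self.unpacker = salt.utils.msgpack.Unpacker()

    def handle(self, framed_msg):
        try:
            self.unpacker.feed(framed_msg)
            for msg in self.unpacker:
                self._process(msg)
        except Exception:
            # Replace the member on error: any preserved exception frames now
            # only reach the new, empty Unpacker through self, so the old one
            # (and its large internal buffer) becomes unreachable and can be
            # garbage collected.
            self.unpacker = salt.utils.msgpack.Unpacker()
            raise

    def _process(self, msg):
        ...
```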

I noticed that other places in Salt's tcp.py already use this pattern: the msgpack.Unpacker is kept as a member variable and re-instantiated when exceptions occur.

When triaging this bug I noticed that someone else from my organization had actually patched this memory leak years ago by setting unpacker = None in both exception blocks in the same spot. I apologize that this fix was never ported over to the saltstack repo. We noticed the leak again recently when upgrading to 3006.x and 3007.x because that original fix of setting unpacker = None stopped working. Clearly there was another reference holding onto the unpacker object, which prevented it from being garbage collected even with unpacker set to None. I'm wondering if the refactor to asynchronous methods is leaving that stack trace around in a coroutine or something.

@tcarroll25

@tcarroll25 can you explain your change?

@OrangeDog thanks for opening this PR! I've been slammed with work the last few days and had not circled back to it yet.

@Sxderp
Contributor

Sxderp commented Oct 16, 2025

@tcarroll25 interesting... I'm no expert in Python cleanup, but it sounds like the exception object itself is being held onto and therefore Python can't clean it up. The only place e is used is the log.trace call. I'm curious, does the leak go away if that line is commented out (maybe the logger is holding onto the reference)?

Overall.. Seems weird.

Edit: I just saw you also mentioned coroutines. Maybe, I've seen a number of dangling coroutines in Salt.

@tcarroll25

@tcarroll25 interesting... I'm no expert in Python cleanup, but it sounds like the exception object itself is being held onto and therefore Python can't clean it up. The only place e is used is the log.trace call. I'm curious, does the leak go away if that line is commented out (maybe the logger is holding onto the reference)?

Overall.. Seems weird.

Edit: I just saw you also mentioned coroutines. Maybe, I've seen a number of dangling coroutines in Salt.

I agree, it is a strange one. In one of my fix attempts I removed the log messages in each exception block, because I thought the exc_info=True in log.trace("other master-side exception: %s", e, exc_info=True) was what was holding onto the memory, but the leak still occurred even with both log.trace lines deleted.

@bdrx312
Contributor

bdrx312 commented Oct 16, 2025

My assumption is that the leak is related to the use of the @salt.ext.tornado.gen.coroutine decorator, so it is something to look out for any time that decorator is used. @tcarroll25, based on your analysis it sounds like there might also be a leak where a reference to the whole exception gets held onto; that could use further investigation and a fix in a future PR, but this change addresses the majority of the impact since the Unpacker buffer is so large?

@dwoz
Contributor

dwoz commented Oct 16, 2025

I assume this change was tested in a production environment where a leak went away, is this correct?

@dwoz dwoz mentioned this pull request Oct 16, 2025
@tcarroll25

I assume this change was tested in a production environment where a leak went away, is this correct?

Yes, that is correct. I can usually reproduce the leak in a few minutes, and with the fix applied I ran it for over a day and saw no memory leak.
