
Conversation

@OrangeDog
Contributor

Backport #68394

@dwoz
Contributor

dwoz commented Oct 15, 2025

We do not need #68394. These changes will get merged forward.

@Sxderp
Contributor

Sxderp commented Oct 15, 2025

How does this prevent a leak? A method-scoped variable is now being assigned to the object? Shouldn't the variable get cleaned up once it leaves scope? Even if it doesn't, surely just assigning "unpacker = None" inside the catch block would do the same? What am I missing?

@OrangeDog
Contributor Author

@tcarroll25 can you explain your change?

@tcarroll25

tcarroll25 commented Oct 16, 2025

How does this prevent a leak? A method-scoped variable is now being assigned to the object? Shouldn't the variable get cleaned up once it leaves scope? Even if it doesn't, surely just assigning "unpacker = None" inside the catch block would do the same? What am I missing?

To give a quick summary of the situation:

In Python, memory leaks related to exception handling occur when a persistent reference to an exception object is held, preventing the garbage collector from cleaning it up. The danger is that an exception's traceback object references the stack frames the exception passed through, and each of those frames can hold references to all local variables in its scope, including very large data structures.

When an exception is raised, the Python interpreter bundles the exception object with a traceback object that captures the state of the call stack. The traceback keeps references to the stack frames, and each stack frame holds references to its local variables. A memory leak occurs if your code holds onto the exception or traceback object (or creates a reference cycle involving them) in a way that prevents the garbage collector from deleting them, and with them every local in those frames.
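
A minimal sketch of that mechanism, with purely illustrative names (this is not code from Salt):

```python
def handler():
    big_buffer = bytearray(100 * 1024 * 1024)  # large local, analogous to an Unpacker's internal buffer
    try:
        raise ValueError("boom")
    except ValueError as exc:
        saved = exc  # a reference that outlives the except block
    return saved


kept = handler()
# The traceback still references handler()'s frame, and that frame still
# references big_buffer, so the 100 MiB bytearray cannot be collected as
# long as `kept` (and therefore its __traceback__) stays reachable.
assert "big_buffer" in kept.__traceback__.tb_frame.f_locals
```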

In this case the local unpacker object was being preserved in the exception stack; I am 100% certain of this. I used tracemalloc to track memory usage and found over 1 GB of msgpack.Unpacker objects in memory after only an hour of continuously running saltutil.sync_all requests:

2025-10-03 17:58:51,756 [salt.master      :2717][INFO    ][877739] Top 10 lines
2025-10-03 17:58:51,757 [salt.master      :2722][INFO    ][877739] #1: utils/msgpack.py:84: 1016832.0 KiB
2025-10-03 17:58:51,758 [salt.master      :2726][INFO    ][877739]     msgpack.Unpacker.__init__(

Here is the end of the trace showing the exact location of the leak in tcp.py line 652:

2025-10-04 20:43:18,686 [salt.master      :2743][INFO    ][2181310]   File "/usr/lib/128technology/unzip/runfiles/x86_el9_pypi__39__salt_128tech_3007_8/salt/transport/tcp.py", line 652
2025-10-04 20:43:18,686 [salt.master      :2743][INFO    ][2181310]     unpacker = salt.utils.msgpack.Unpacker()
2025-10-04 20:43:18,686 [salt.master      :2743][INFO    ][2181310]   File "/usr/lib/128technology/unzip/runfiles/x86_el9_pypi__39__salt_128tech_3007_8/salt/utils/msgpack.py", line 84
2025-10-04 20:43:18,686 [salt.master      :2743][INFO    ][2181310]     msgpack.Unpacker.__init__(
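
For context, these numbers came from tracemalloc snapshots. A rough sketch of that kind of instrumentation (not the exact code running in our master):

```python
import tracemalloc

tracemalloc.start(25)  # record up to 25 frames per allocation

# ... run the workload, e.g. repeated saltutil.sync_all requests ...

snapshot = tracemalloc.take_snapshot()

# "Top 10 lines" style summary, grouped by allocating source line
for index, stat in enumerate(snapshot.statistics("lineno")[:10], 1):
    frame = stat.traceback[0]
    print(f"#{index}: {frame.filename}:{frame.lineno}: {stat.size / 1024:.1f} KiB")

# Full allocation traceback for the largest consumer, which is what
# pointed at tcp.py and utils/msgpack.py above
top = snapshot.statistics("traceback")[0]
print("\n".join(top.traceback.format()))
```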

Here is the memory usage with the fix, after running saltutil.sync_all requests for 16 hours:

2025-10-05 18:17:33,508 [salt.master      :2717][INFO    ][96930] Top 10 lines
2025-10-05 18:17:33,509 [salt.master      :2722][INFO    ][96930] #1: utils/msgpack.py:84: 5120.0 KiB
2025-10-05 18:17:33,510 [salt.master      :2726][INFO    ][96930]     msgpack.Unpacker.__init__(

I tried various fixes, but the only one that worked was making the unpacker a member variable and re-instantiating it when an exception occurs. This prevents the old buffers from being kept alive by the exception's preserved frames: those frames now only reference the member variable, which we have just explicitly replaced, so the previous Unpacker (and its memory) can be freed.
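
A condensed sketch of the pattern (illustrative class and method names, not the exact diff in this PR):

```python
import salt.utils.msgpack


class MessageHandler:
    def __init__(self):
        # Keep one Unpacker on the instance instead of a fresh local per call.
        self.unpacker = salt.utils.msgpack.Unpacker()

    def handle(self, framed_msg):
        try:
            self.unpacker.feed(framed_msg)
            for msg in self.unpacker:
                self._process(msg)
        except Exception:
            # Replace the member on error: any preserved exception frames now
            # only reach the new, empty Unpacker through self, so the old one
            # (and its large internal buffer) becomes unreachable and can be
            # garbage collected.
            self.unpacker = salt.utils.msgpack.Unpacker()
            raise

    def _process(self, msg):
        ...
```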

I noticed that other places in Salt's tcp.py already use this pattern: the msgpack.Unpacker is kept as a member variable and re-instantiated when exceptions occur.

When triaging this bug I noticed that someone else from my organization had actually patched this memory leak years ago by setting unpacker = None in both exception blocks in the same spot. I apologize that this fix was never ported over to the saltstack repo. We noticed the leak again recently when upgrading to 3006.x and 3007.x because that original fix of setting unpacker = None stopped working. Clearly there was another reference holding onto the unpacker object, which prevented it from being garbage collected even with unpacker set to None. I'm wondering if the refactor to asynchronous methods is leaving that stack trace around in a coroutine or something.

@tcarroll25

@tcarroll25 can you explain your change?

@OrangeDog thanks for opening this PR! I've been slammed with work the last few days and had not circled back to it yet.

@Sxderp
Contributor

Sxderp commented Oct 16, 2025

@tcarroll25 interesting... I'm no expert in Python cleanup, but it sounds like the exception object itself is being held onto and therefore Python can't clean it up. The only place e is used is the log.trace call. I'm curious, does the leak go away if that line is commented out (maybe the logger is holding onto the reference)?

Overall.. Seems weird.

Edit: I just saw you also mentioned coroutines. Maybe, I've seen a number of dangling coroutines in Salt.

@tcarroll25

@tcarroll25 interesting... I'm no expert in Python cleanup, but it sounds like the exception object itself is being held onto and therefore Python can't clean it up. The only place e is used is the log.trace call. I'm curious, does the leak go away if that line is commented out (maybe the logger is holding onto the reference)?

Overall.. Seems weird.

Edit: I just saw you also mentioned coroutines. Maybe, I've seen a number of dangling coroutines in Salt.

I agree, it is a strange one. In one of my fix attempts I removed the log messages in each exception block, because I thought the exc_info=True in log.trace("other master-side exception: %s", e, exc_info=True) was what was holding onto the memory, but the leak still occurred even with both log.trace lines deleted.

@bdrx312
Contributor

bdrx312 commented Oct 16, 2025

My assumption is that the leak is related to the use of the @salt.ext.tornado.gen.coroutine decorator, so it is something to look out for any time that decorator is used. @tcarroll25, based on your analysis it sounds like there might also be a leak where a reference to the whole exception gets held onto; that could use further investigation and a fix in a future PR, but this change addresses the majority of the impact since the Unpacker buffer is so large?

@dwoz
Contributor

dwoz commented Oct 16, 2025

I assume this change was tested in a production environment where a leak went away, is this correct?

@dwoz dwoz mentioned this pull request Oct 16, 2025
@tcarroll25

I assume this change was tested in a production environment where a leak went away, is this correct?

Yes, that is correct. I can usually reproduce the leak in a few minutes, and with the fix applied I ran it for over a day and saw no memory leak.
