Conversation

@gabsuren
Collaborator

@gabsuren gabsuren commented Oct 28, 2025

TODO: I will remove the comments after the review (left in for easier review).

Description

This PR fixes critical memory leaks and crashes in the ESP WebSocket client that occur during reconnection scenarios (with CONFIG_ESP_WS_CLIENT_SEPARATE_TX_LOCK=y). The issues addressed are:

  • Double-free crashes: Heap corruption during abort/reconnect scenarios
  • Data loss: First packet after reconnection not received
  • Error buffer accumulation: 2KB memory leak on disconnect

Changes Made:

  • Add state check in abort_connection to prevent double-close
  • Fix memory leak: free errormsg_buffer on disconnect
  • Reset connection state on reconnect to prevent stale data
  • Implement lock ordering for separate TX lock mode
  • Add sdkconfig.ci.tx_lock config
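
The first two changes can be illustrated with a minimal, portable sketch. It is not the code from this PR: the `abt_` types and the `transport_closes` counter are hypothetical stand-ins for the real client structure and for the transport close call, and only the guard-then-free pattern is the point.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins; not the real esp_websocket_client types. */
typedef enum {
    ABT_STATE_CONNECTED,
    ABT_STATE_CLOSING,
    ABT_STATE_CLOSED
} abt_state_t;

typedef struct {
    abt_state_t state;
    char *errormsg_buffer;   /* error text buffer that could outlive the connection */
    size_t errormsg_size;
    int transport_closes;    /* counts how often the transport was closed */
} abt_client_t;

/* Returns true if this call performed the abort, false if it was skipped. */
bool abt_abort_connection(abt_client_t *client)
{
    /* State check: a second abort must not close the transport again. */
    if (client->state == ABT_STATE_CLOSING || client->state == ABT_STATE_CLOSED) {
        return false;
    }
    client->state = ABT_STATE_CLOSING;
    client->transport_closes++;          /* stands in for closing the transport */

    /* Free the error buffer on disconnect and clear the pointer so a
     * later free (or a reconnect) never sees a dangling address. */
    free(client->errormsg_buffer);
    client->errormsg_buffer = NULL;
    client->errormsg_size = 0;

    client->state = ABT_STATE_CLOSED;
    return true;
}
```

Calling `abt_abort_connection()` twice on the same client closes the transport once and frees the buffer once; the second call returns `false` without touching anything.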

Related

#898

Checklist

Before submitting a Pull Request, please ensure the following:

  • 🚨 This PR does not introduce breaking changes.
  • [✓] All CI checks (GH Actions) pass.
  • [✓] Documentation is updated as needed.
  • Tests are updated or added as necessary.
  • [✓] Code is well-commented, especially in complex areas.
  • [✓] Git history is clean — commits are squashed to the minimum necessary.

Note

Fixes ws-client races and a memory leak; corrects lock ordering for the separate TX lock; initializes/resets state on connect and handles the initial recv; adds a CI config.

  • Core fixes
    • esp_websocket_client_abort_connection(...): add safe-state checks (skip if closing/closed), dispatch disconnect, and free errormsg_buffer.
    • Initialize frame state (payload_len/offset, last_fin, last_opcode) on connect; process initial data via esp_websocket_client_recv(...) and abort on failure.
    • Destroy path: nullify client->transport_list and client->transport after destroy.
    • Send path: on write error, abort connection with correct lock handling.
  • Locking (CONFIG_ESP_WS_CLIENT_SEPARATE_TX_LOCK)
    • Define WEBSOCKET_TX_LOCK_TIMEOUT_MS; enforce lock ordering: release client->lock before taking tx_lock, then re-acquire and state-check.
    • Apply to PING/PONG/CLOSE sends to avoid deadlocks; handle timeout gracefully; ensure buffers freed on early return.
    • Poll/read path: ensure lock is held exactly once around recv() and abort logic.
  • Examples/Config
    • Add examples/target/sdkconfig.ci.tx_lock enabling separate TX lock with timeout.
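
The lock ordering described under "Locking" can be modeled in plain C. This is a single-threaded sketch under stated assumptions: the `ord_` counters stand in for FreeRTOS recursive mutexes, `fail_next_take` simulates a timeout, and the result codes are hypothetical. It shows one reading of the summary (drop `client->lock`, take `tx_lock`, re-acquire `client->lock`, re-check state), not the PR's actual implementation.

```c
#include <stdbool.h>

/* Counter-based stand-in for a recursive mutex (hypothetical, for illustration). */
typedef struct {
    int depth;             /* how many times the lock is currently held */
    bool fail_next_take;   /* simulate one timeout on the next take */
} ord_lock_t;

static bool ord_take(ord_lock_t *l)
{
    if (l->fail_next_take) {
        l->fail_next_take = false;
        return false;
    }
    l->depth++;
    return true;
}

static void ord_give(ord_lock_t *l)
{
    l->depth--;
}

typedef enum { ORD_STATE_CONNECTED, ORD_STATE_CLOSING } ord_state_t;
typedef enum { ORD_SENT, ORD_SKIPPED, ORD_LOCK_LOST } ord_result_t;

typedef struct {
    ord_lock_t lock;      /* models client->lock */
    ord_lock_t tx_lock;   /* models the separate TX lock */
    ord_state_t state;
    int frames_sent;
} ord_client_t;

/* Precondition: the caller holds client->lock exactly once.
 * ORD_SENT / ORD_SKIPPED: client->lock is held again on return.
 * ORD_LOCK_LOST: client->lock is NOT held; the caller must not give it. */
ord_result_t ord_send_control_frame(ord_client_t *c)
{
    ord_give(&c->lock);                      /* 1. release client->lock first */

    if (!ord_take(&c->tx_lock)) {            /* 2. take tx_lock (may time out) */
        return ord_take(&c->lock) ? ORD_SKIPPED : ORD_LOCK_LOST;
    }
    if (!ord_take(&c->lock)) {               /* 3. re-acquire client->lock */
        ord_give(&c->tx_lock);
        return ORD_LOCK_LOST;
    }

    ord_result_t result = ORD_SKIPPED;
    if (c->state == ORD_STATE_CONNECTED) {   /* 4. state may have changed meanwhile */
        c->frames_sent++;                    /* stands in for sending PING/PONG/CLOSE */
        result = ORD_SENT;
    }
    ord_give(&c->tx_lock);
    return result;
}
```

Returning a distinct `ORD_LOCK_LOST` value is a design choice of this sketch: it tells the caller whether it still owns `client->lock`, so that no path gives a mutex it does not hold.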

Written by Cursor Bugbot for commit f92da56.

@gabsuren gabsuren changed the title from "Fix/ws race on abort" to "fix(websocket): Fix websocket client race on abort and memory leak(IDFGH-16555)" Oct 28, 2025
@gabsuren gabsuren force-pushed the fix/ws_race_on_abort branch 3 times, most recently from 67bd7e3 to 46871bf Compare October 28, 2025 13:09
#else
// When separate TX lock is not configured, we already hold client->lock
// which protects the transport, so we can send PONG directly
esp_transport_ws_send_raw(client->transport, WS_TRANSPORT_OPCODES_PONG | WS_TRANSPORT_OPCODES_FIN, data, client->payload_len,

Check warning

Code scanning / clang-tidy

The value '138' provided to the cast expression is not in the valid range of values for 'ws_transport_opcodes' [clang-analyzer-optin.core.EnumCastOutOfRange] Warning
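
A small sketch of where the value 138 comes from. The enumerator values below mirror my understanding of `ws_transport_opcodes` (PONG = 0x0a, FIN = 0x80) and should be treated as assumptions; the `opc_` names are hypothetical. OR-ing the two gives 0x8a = 138, which matches no named enumerator, and that is what the analyzer flags when the result is cast back to the enum type.

```c
/* Assumed values, mirroring the WebSocket opcodes plus the FIN bit. */
typedef enum {
    OPC_CONT   = 0x00,
    OPC_TEXT   = 0x01,
    OPC_BINARY = 0x02,
    OPC_CLOSE  = 0x08,
    OPC_PING   = 0x09,
    OPC_PONG   = 0x0a,
    OPC_FIN    = 0x80
} opc_ws_opcode_t;

/* Returns 1 if value equals one of the named enumerators, else 0. */
int opc_is_named_enumerator(int value)
{
    switch (value) {
    case OPC_CONT:
    case OPC_TEXT:
    case OPC_BINARY:
    case OPC_CLOSE:
    case OPC_PING:
    case OPC_PONG:
    case OPC_FIN:
        return 1;
    default:
        return 0;
    }
}

/* Combines an opcode with the FIN flag as a plain int, so that no
 * out-of-range value is ever stored in the enum type. */
int opc_first_header_byte(opc_ws_opcode_t opcode, int fin)
{
    return (int)opcode | (fin ? (int)OPC_FIN : 0);
}
```

Here `opc_first_header_byte(OPC_PONG, 1)` evaluates to 138, and `opc_is_named_enumerator(138)` returns 0. Whether the warning is worth silencing in the real code depends on how the transport API declares its opcode parameter.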

@gabsuren gabsuren requested a review from david-cermak October 29, 2025 09:09
@gabsuren gabsuren force-pushed the fix/ws_race_on_abort branch from 46871bf to 5577e03 Compare October 29, 2025 10:54
@gabsuren gabsuren force-pushed the fix/ws_race_on_abort branch 3 times, most recently from ca2956e to 0e58789 Compare October 30, 2025 10:53
@gabsuren gabsuren force-pushed the fix/ws_race_on_abort branch from 0e58789 to 62925a5 Compare November 10, 2025 10:13
@gabsuren gabsuren force-pushed the fix/ws_race_on_abort branch 2 times, most recently from 15dcb35 to f474654 Compare November 10, 2025 10:27
@gabsuren
Collaborator Author

#898 (comment)

@gabsuren gabsuren force-pushed the fix/ws_race_on_abort branch from 50e3068 to 22eb17e Compare November 19, 2025 11:44
@gabsuren gabsuren force-pushed the fix/ws_race_on_abort branch 2 times, most recently from 52abfc0 to 30778c0 Compare November 25, 2025 11:02
@gabsuren gabsuren force-pushed the fix/ws_race_on_abort branch from 30778c0 to d202ae4 Compare November 25, 2025 11:12
Collaborator

@david-cermak david-cermak left a comment

LGTM in general.

However, I would like to double-check the locking order; it doesn't feel right to take one lock while holding another.

@gabsuren gabsuren force-pushed the fix/ws_race_on_abort branch from d202ae4 to 0f28a4f Compare December 5, 2025 11:27
@gabsuren gabsuren requested a review from david-cermak December 5, 2025 11:28
esp_event_loop_run(client->event_handle, 0);
if (xSemaphoreTakeRecursive(client->lock, lock_timeout) != pdPASS) {
ESP_LOGE(TAG, "Failed to re-acquire lock after event loop");
break;
Bug: Lock released without being held after failed reacquisition

When xSemaphoreTakeRecursive fails at line 1211 (after releasing the lock at line 1209), the break statement only exits the switch, not the while loop. Execution continues to line 1311 where xSemaphoreGiveRecursive(client->lock) is called on a mutex that isn't held by the task. In FreeRTOS, calling Give on a mutex not owned by the calling task is undefined behavior and can corrupt the mutex state or cause assertion failures.
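
The control-flow problem can be reproduced in a portable model. In the sketch below, the `loop_` counters are hypothetical stand-ins for the FreeRTOS recursive mutex, and `fail_on_take` simulates a timeout on a chosen take call. One possible fix, shown here as an illustration rather than as the change made in this PR, is to track ownership in a flag and give the mutex only when it is actually held.

```c
#include <stdbool.h>

/* Counter-based stand-in for a recursive mutex (hypothetical, for illustration). */
typedef struct {
    int depth;          /* how many times the task currently holds the mutex */
    int take_calls;     /* number of take attempts so far */
    int fail_on_take;   /* 1-based index of the take that times out; 0 = never */
    int bad_gives;      /* gives issued while the mutex was not held */
} loop_mutex_t;

static bool loop_take(loop_mutex_t *m)
{
    m->take_calls++;
    if (m->take_calls == m->fail_on_take) {
        return false;
    }
    m->depth++;
    return true;
}

static void loop_give(loop_mutex_t *m)
{
    if (m->depth == 0) {
        m->bad_gives++;   /* this is the situation the review comment describes */
        return;
    }
    m->depth--;
}

typedef enum { LOOP_STATE_INIT, LOOP_STATE_CONNECTED } loop_state_t;

/* Runs up to `iterations` passes and returns how many completed. */
int loop_task(loop_mutex_t *m, int iterations)
{
    int completed = 0;
    bool run = true;
    while (run && completed < iterations) {
        if (!loop_take(m)) {
            break;                     /* not inside the switch: exits the while */
        }
        bool lock_held = true;
        loop_state_t state = LOOP_STATE_CONNECTED;
        switch (state) {
        case LOOP_STATE_CONNECTED:
            loop_give(m);              /* release around event dispatch */
            lock_held = false;
            /* the event loop would run here */
            if (!loop_take(m)) {
                run = false;           /* ask the while loop to stop */
                break;                 /* leaves only the switch */
            }
            lock_held = true;
            break;
        default:
            break;
        }
        if (lock_held) {               /* give only what we actually hold */
            loop_give(m);
        }
        if (run) {
            completed++;
        }
    }
    return completed;
}
```

With `fail_on_take` set to 4, the second pass fails to re-acquire the mutex, the trailing give is skipped, and `bad_gives` stays at zero.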

Collaborator

@david-cermak david-cermak left a comment

LGTM!

@gabsuren
Collaborator Author

gabsuren commented Dec 5, 2025

@david-cermak @euripedesrocha

By the way, I ran the Autobahn test suite locally and saw that the framing section improved to 100%, and the share of passed tests increased from 28.2% to 55.2%.
However, I also saw some regression in the reserved bits section and one more failure in the ping/pong section. Looking into it.

Before the fix

Screenshot 2025-12-05 at 15 58 33

After the fix

Screenshot 2025-12-05 at 15 57 41

- Add state check in abort_connection to prevent double-close
- Fix memory leak: free errormsg_buffer on disconnect
- Reset connection state on reconnect to prevent stale data
- Implement lock ordering for separate TX lock mode
- Read buffered data immediately after connection to prevent data loss
- Added sdkconfig.ci.tx_lock config
@gabsuren gabsuren force-pushed the fix/ws_race_on_abort branch from 0f28a4f to f92da56 Compare December 5, 2025 13:21
5 participants