Fix race condition on downlink attempt event registration #7681

vlasebian · 2025-07-25T14:28:12Z

Summary

References:

Changes

Remove the panic recovery around event data marshalling, which sent the recovered panics to Sentry.

Logging would still be useful to have in place, but I realised it is not really doing what it is supposed to. Sometimes a SIGBUS or even a SIGSEGV is triggered, which cannot be caught by the recover mechanism in Go.
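To illustrate the limitation, here is a minimal sketch of a recover wrapper. It is not the code removed by this PR; `safeMarshal` is a hypothetical helper. A deferred `recover` intercepts ordinary panics, but fatal runtime errors (for example "fatal error: unexpected signal" or "concurrent map writes") terminate the process without unwinding, so a wrapper like this never sees them.

```go
package main

import "fmt"

// safeMarshal is a hypothetical wrapper: it runs f and converts a panic
// raised inside f into an error. Fatal runtime errors are not panics and
// bypass the deferred function entirely.
func safeMarshal(f func() []byte) (out []byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("marshal panicked: %v", r)
		}
	}()
	return f(), nil
}

func main() {
	// An ordinary panic is recovered and reported as an error.
	_, err := safeMarshal(func() []byte { panic("boom") })
	fmt.Println(err)

	// A call that does not panic returns its result unchanged.
	out, err := safeMarshal(func() []byte { return []byte("ok") })
	fmt.Println(string(out), err)
}
```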

To get more insight into the problem, the only thing that would help is logging the name of every GS event that marshals data. That might increase the log volume significantly, because the events that marshal data include those for uplinks and downlinks.

Clone the downlink message before registering a schedule downlink attempt event.

There is already a clone created:

//pkg/gatewayserver/grpc_nsgs.go:82
connDown := ttnpb.Clone(down)             // Let the connection own the DownlinkMessage.
connDown.GetRequest().DownlinkPaths = nil // And do not leak the downlink paths to the gateway.
connDown.CorrelationIds = events.CorrelationIDsFromContext(ctx)

However, this clone is published as an event using registerScheduleDownlinkAttempt and is marshalled later in the events subscriber. At the same time, it is modified in the conn.ScheduleDown method:

//pkg/gatewayserver/io/io.go:690
msg.Settings = &ttnpb.DownlinkMessage_Scheduled{
	Scheduled: settings,
}
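The pattern of the fix can be sketched in isolation. This is a simplified illustration, not the Gateway Server code: `downlink`, `clone` and `publishAndSchedule` are hypothetical stand-ins for `ttnpb.DownlinkMessage`, `ttnpb.Clone` and the publish/schedule sequence. The subscriber goroutine receives its own copy, so the later write to `Settings` by the publisher cannot race with the read.

```go
package main

import (
	"fmt"
	"sync"
)

// downlink is a hypothetical stand-in for ttnpb.DownlinkMessage.
type downlink struct {
	CorrelationIDs []string
	Settings       string
}

// clone returns a deep copy, playing the role of ttnpb.Clone.
func (d *downlink) clone() *downlink {
	return &downlink{
		CorrelationIDs: append([]string(nil), d.CorrelationIDs...),
		Settings:       d.Settings,
	}
}

// publishAndSchedule hands a clone of msg to a subscriber goroutine and then
// mutates msg, as the scheduler does. It returns what the subscriber observed.
func publishAndSchedule(msg *downlink) string {
	events := make(chan *downlink, 1)
	var (
		wg   sync.WaitGroup
		seen string
	)
	wg.Add(1)
	go func() {
		defer wg.Done()
		evt := <-events
		seen = evt.Settings // stands in for marshalling the event data
	}()

	events <- msg.clone()      // the event owns its own copy
	msg.Settings = "scheduled" // safe: the subscriber never touches msg
	wg.Wait()
	return seen
}

func main() {
	msg := &downlink{CorrelationIDs: []string{"example"}, Settings: "request"}
	fmt.Println(publishAndSchedule(msg)) // prints "request"
	fmt.Println(msg.Settings)            // prints "scheduled"
}
```

Sending `msg` itself instead of `msg.clone()` would reintroduce the unsynchronized read and write, which `go run -race` reports as a data race.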

Testing

I don't have a way to trigger this race condition.

Results

N/A.

Regressions

None.

Notes for Reviewers

This is caused by a race condition triggered by publishing events. The issue was initially discussed here:

https://github.com/TheThingsIndustries/lorawan-stack-support/issues/1163

There are more issues related to this one (some closed, some still ongoing):

I believe the root cause of the problem is that the event system does not marshal the data immediately when the event is created or published. Instead, the data is stored as a reference in the event struct (event.data) and is marshalled later by the events subscriber. The subscriber runs in a different goroutine and is not synchronized in any way with the publisher, which might modify the referenced data in the meantime.

I don't know whether marshalling in the subscribers was a conscious decision. The only reason I can think of is keeping the marshalling workload of the events off the hot path of message processing.

I believe the proper fix would be to move the event marshalling into the publisher and send the already marshalled data to the subscriber. This change might take some work (I haven’t yet gone through the code to see what it implies) and will affect the whole codebase, because the event system is shared by other components too.
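A rough sketch of what marshalling at publish time could look like, under heavy assumptions: `event`, `payload`, `newEvent` and the event name are hypothetical, and `encoding/json` stands in for whatever encoding the event system actually uses. Because the event holds bytes rather than a reference, later changes by the publisher are invisible to subscribers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// event is a hypothetical event that carries already marshalled data
// instead of a reference to the publisher's message.
type event struct {
	Name string
	Data []byte
}

// payload is a hypothetical stand-in for the message attached to an event.
type payload struct {
	Settings string `json:"settings"`
}

// newEvent marshals data when the event is created, so the cost is paid on
// the publisher's path and subscribers only ever see an immutable snapshot.
func newEvent(name string, data any) (event, error) {
	b, err := json.Marshal(data)
	if err != nil {
		return event{}, err
	}
	return event{Name: name, Data: b}, nil
}

func main() {
	p := &payload{Settings: "request"}
	evt, err := newEvent("example.downlink.attempt", p)
	if err != nil {
		panic(err)
	}
	p.Settings = "scheduled" // does not affect the bytes stored in evt
	fmt.Println(string(evt.Data)) // prints {"settings":"request"}
}
```

The trade-off is the one mentioned above: marshalling moves onto the hot path of message processing, even for events that no subscriber ends up consuming.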

The quick fix is to just clone the data of all events that marshal data, but that might increase resource usage.

Checklist

Scope: The referenced issue is addressed, there are no unrelated changes.
Compatibility: The changes are backwards compatible with existing API, storage, configuration and CLI, according to the compatibility commitments in README.md for the chosen target branch.
Documentation: Relevant documentation is added or updated.
Testing: The steps/process to test this feature are clearly explained including testing for regressions.
Infrastructure: If infrastructural changes (e.g., new RPC, configuration) are needed, a separate issue is created in the infrastructural repositories.
Changelog: Significant features, behavior changes, deprecations and fixes are added to CHANGELOG.md.
Commits: Commit messages follow guidelines in CONTRIBUTING.md, there are no fixup commits left.

vlasebian added 2 commits July 25, 2025 15:21

all: Remove code used to catch panics triggered by data marshalling

4a2519a

gs: Clone downlink message before registering attempt event

d674025

vlasebian self-assigned this Jul 25, 2025

vlasebian requested review from a team as code owners July 25, 2025 14:28

vlasebian requested a review from johanstokking July 25, 2025 14:28

github-actions bot added the c/gateway server This is related to the Gateway Server label Jul 25, 2025

dev: Update messages

c7981fb

johanstokking approved these changes Jul 28, 2025
