Skip to content

Fix race condition on downlink attempt event registration #7681

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: v3.35
Choose a base branch
from

Conversation

vlasebian
Copy link
Contributor

@vlasebian vlasebian commented Jul 25, 2025

Summary

References:

Changes

  • Remove the panic recover for marshalling event data that sends events to sentry.

Logging would be useful to still have in place but I realised it's not really doing what is supposed to. Sometimes SIGBUS or even SIGSEGV is triggered which cannot be caught by the recover mechanism in Go.

To get more insight into the problem, only logging every event name from GS will help (only events that marshal data). That might increase the logs significantly because the events that marshal data are for uplinks and downlinks too.

  • Clone the downlink message before registering an schedule downlink attempt event.

There is already a clone created:

//pkg/gatewayserver/grpc_nsgs.go:82
connDown := ttnpb.Clone(down)             // Let the connection own the DownlinkMessage.
connDown.GetRequest().DownlinkPaths = nil // And do not leak the downlink paths to the gateway.
connDown.CorrelationIds = events.CorrelationIDsFromContext(ctx)

However, this clone is published as an event using registerScheduleDownlinkAttempt and marshalled later in the subscriber of the events and at the same time it is modified in the conn.ScheduleDown method:

//pkg/gatewayserver/io/io.go:690
msg.Settings = &ttnpb.DownlinkMessage_Scheduled{
	Scheduled: settings,
}

Testing

I don't have a way to trigger this race condition.

Results

N/A.

Regressions

None.

Notes for Reviewers

This is caused by the race condition triggered by publishing events. The issue was initially discussed in here:


There are more issues related to this one (some closed some still on going):

I believe the root cause of the problem is that the event system does not marshal the data immediately when the event is created or published. Instead, the data is stored as a reference in the event struct (event.data) and is later marshalled by the events subscriber. The subscriber runs in a different goroutine and it is not synced in any way with the publisher who might modify the referenced event.

I don't know if marshalling in the subscribers was a conscious decision or not. The only reason I can think of is pulling out the marshalling workload of the events from the hot path of processing messages.

The proper fix I believe would be to move the event marshalling into the publisher and send the already marshalled data to the subscriber. This change might take some work (I haven’t yet gone through the code to see what this implies) and will affect the whole codebase because the event system is shared by other components too.

The quick fix is to just clone all the events that marshal data, but might increase resource usage.

Checklist

  • Scope: The referenced issue is addressed, there are no unrelated changes.
  • Compatibility: The changes are backwards compatible with existing API, storage, configuration and CLI, according to the compatibility commitments in README.md for the chosen target branch.
  • Documentation: Relevant documentation is added or updated.
  • Testing: The steps/process to test this feature are clearly explained including testing for regressions.
  • Infrastructure: If infrastructural changes (e.g., new RPC, configuration) are needed, a separate issue is created in the infrastructural repositories.
  • Changelog: Significant features, behavior changes, deprecations and fixes are added to CHANGELOG.md.
  • Commits: Commit messages follow guidelines in CONTRIBUTING.md, there are no fixup commits left.

@vlasebian vlasebian self-assigned this Jul 25, 2025
@vlasebian vlasebian requested review from a team as code owners July 25, 2025 14:28
@vlasebian vlasebian requested a review from johanstokking July 25, 2025 14:28
@github-actions github-actions bot added the c/gateway server This is related to the Gateway Server label Jul 25, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/gateway server This is related to the Gateway Server
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants