@wtoorop wtoorop commented Oct 13, 2025

This fixes a bug where the reload process crashes when it tries to delete a node from a zone's NSEC3 hash tree while the tree does not exist, or no longer exists, i.e. zone->hashtree == NULL.
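
For illustration, here is a minimal sketch of the kind of guard described above. It uses hypothetical stand-in types and a hypothetical function name (`struct hash_tree`, `struct zone`, `zone_del_hash_node`), not the project's actual data structures or the code in this pull request; only the `zone->hashtree == NULL` condition comes from the description.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the project's real types. */
struct hash_tree {
    int node_count;            /* number of nodes currently in the tree */
};

struct zone {
    struct hash_tree *hashtree; /* may be NULL if the tree was never built or was dropped */
};

/*
 * Remove one node from the zone's hash tree.
 * Returns 1 if the tree exists and the deletion was attempted,
 * 0 if it was skipped because there is no tree to delete from.
 */
static int zone_del_hash_node(struct zone *zone)
{
    /* The guard: without it, the dereference below crashes on a NULL tree. */
    if (zone == NULL || zone->hashtree == NULL)
        return 0;

    if (zone->hashtree->node_count > 0)
        zone->hashtree->node_count--;
    return 1;
}
```

With a guard like this, a missing tree turns the deletion into a no-op instead of a NULL dereference, which is what keeps the reload process alive.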

I still do not know the root cause of zone->hashtree being NULL, but the consequences of the crash are quite severe. Not only is the zone in question not updated, but because the xfr that triggered the crash is not marked faulty, the reload is retried again and again with all the outstanding xfrs. The crash then repeats every time, and none of the other outstanding xfrs are applied, which stalls those zones.

We may consider conveying the xfr number being processed by the reload process to the old-main process, so that the old-main can mark it as faulty if a crash happens during processing of that xfr, so that the other xfrs can still be processed (and will not be blocked). @wcawijngaards @mozzieongit WDYT?
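
A rough sketch of how that idea could work, assuming a pipe between the two processes: the reload process announces each xfr number before applying it, and the old-main process remembers the last number it saw so it knows which xfr was in flight if the child dies from a signal. The function name `run_reload`, the `crash_on` parameter (used only to simulate a crash), and the pipe protocol are all hypothetical and not taken from the actual code base.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical sketch. Forks a "reload" child that walks xfrs[0..n-1],
 * writing each xfr number to a pipe before applying it, and simulating
 * a crash when it reaches crash_on.
 *
 * Returns 1 and stores the in-flight xfr number in *faulty if the child
 * was killed by a signal, 0 if it exited normally, -1 on setup failure.
 */
static int run_reload(const uint32_t *xfrs, size_t n, uint32_t crash_on,
                      uint32_t *faulty)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {                      /* reload process */
        close(fds[0]);
        for (size_t i = 0; i < n; i++) {
            /* Announce the xfr before touching it. */
            if (write(fds[1], &xfrs[i], sizeof xfrs[i])
                    != (ssize_t)sizeof xfrs[i])
                _exit(2);
            if (xfrs[i] == crash_on)
                abort();                 /* simulated crash while applying */
        }
        _exit(0);
    }

    /* old-main process: track the most recently announced xfr. */
    close(fds[1]);
    uint32_t current = 0, value;
    int seen = 0;
    while (read(fds[0], &value, sizeof value) == (ssize_t)sizeof value) {
        current = value;
        seen = 1;
    }
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (WIFSIGNALED(status) && seen) {
        *faulty = current;               /* caller can now mark this xfr faulty */
        return 1;
    }
    return 0;
}
```

In this sketch the caller would mark the returned xfr number as faulty and start the reload again with the remaining xfrs, so one bad transfer no longer blocks the others.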

@mozzieongit

This seems like a decent immediate solution, and finding the root cause later should be fine.

> We may consider conveying the xfr number being processed by the reload process to the old-main process, so that the old-main can mark it as faulty if a crash happens during processing of that xfr, so that the other xfrs can still be processed (and will not be blocked).

That sounds like a good idea.


@wcawijngaards wcawijngaards left a comment

This code change is fine. It is also good to have as defense in depth: even if other fixes are later found for the root cause, this check remains useful.
