
Conversation

TobyTheHutt
Contributor

@TobyTheHutt TobyTheHutt commented Sep 3, 2025

What does it do?

Improves test coverage on the controller.

  • Before (parent e78b385): 65.0% on controller, 79.2% on project
  • After (commit a7bb468): 84.5% on controller, 79.9% on project

Motivation

The referenced ticket (#5150), which aims to improve code coverage.

More

  • Yes, this PR title follows Conventional Commits
  • Yes, I added unit tests
  • Yes, I updated end user documentation accordingly

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 3, 2025
@k8s-ci-robot k8s-ci-robot added the controller Issues or PRs related to the controller label Sep 3, 2025
@k8s-ci-robot k8s-ci-robot requested a review from szuecs September 3, 2025 12:27
@k8s-ci-robot
Contributor

Welcome @TobyTheHutt!

It looks like this is your first PR to kubernetes-sigs/external-dns 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/external-dns has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 3, 2025
@k8s-ci-robot
Contributor

Hi @TobyTheHutt. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Sep 3, 2025
@mloiseleur mloiseleur changed the title chore(controller): Test controller Execute function (kubernetes-sigs#5150) test(controller): improve coverage on Execute() Sep 3, 2025
@mloiseleur
Collaborator

Thanks @TobyTheHutt 👍
/ok-to-test
Would you please complete your description with coverage before and after this PR?

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 3, 2025
@ivankatliarchuk
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 5, 2025
@ivankatliarchuk
Contributor

Seems like the Prow tests are catching some race conditions.

/lgtm cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 5, 2025
@ivankatliarchuk
Contributor

I'm not sure exactly how to reproduce the environment where the tests execute. The pull-external-dns-unit-test job runs on Kubernetes infra outside of our control, and it runs on Linux. On a local machine, limiting resources to 1 or 0.5 CPU may help.

@TobyTheHutt
Contributor Author

@ivankatliarchuk I'm working on it. I wrote a Docker setup to emulate the test pipeline and am able to reproduce the issue. Would the project be interested in the setup? If yes, I'll open an issue for that. It would additionally allow OS-independent testing and makefile runs.

…5150)

Prevent data races and premature process terminations by managing signal
capture and synchronization.

Signed-off-by: Tobias Harnickell <[email protected]>
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from ivankatliarchuk. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ivankatliarchuk
Contributor

Open a PR, let's have a look.

@TobyTheHutt
Contributor Author

I'll clean up the files and create a PR for the Docker setup soon.
In the meantime, this PR is ready for review. I tested the project multiple times and did not encounter any more failures or flakes.

@ivankatliarchuk
Contributor

I'm unsure about the proposed changes. Let's say I don't have much expertise in this area, and mainly because the change isn't paired with thorough end-to-end (smoke) testing of the shutdown paths.

This is my understanding, and why I have raised questions to myself about service stability:

  1. Signal handling changes
  • The PR moves from a relatively simple signal.Notify → os.Exit model to a more complex context-based cancellation approach (see the sketch after this list).
  • While that’s testable and idiomatic, it also means that if the context isn’t wired correctly, parts of ExternalDNS (like metrics serving, informer loops, or the controller run loop) might not stop or clean up consistently. That can leave goroutines hanging.
  2. Reliance on context for shutdown
  • Before, a SIGTERM resulted in a hard stop (not graceful, but very predictable).
  • Now, graceful shutdown depends on multiple components respecting context cancellation. If any one doesn’t, shutdown can stall or behave inconsistently. That’s less robust in production.
  3. Increased moving parts
  • Tests improve, but production gets more code paths where cancellation or cleanup could be mishandled. More opportunities for subtle bugs.
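
To make the contrast concrete, here is a minimal Go sketch of the two models being compared. It is not code from this PR; runController and serveMetrics are hypothetical placeholders for the controller run loop and the metrics server.

package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"
)

// Old model: SIGTERM hard-stops the process. Predictable, but nothing gets a
// chance to clean up, and it is hard to exercise in unit tests.
func hardExit() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	go func() {
		<-sigs
		os.Exit(0)
	}()
}

// New model: SIGTERM cancels a context instead. Shutdown now depends on every
// component (metrics server, informers, run loop) honouring ctx.Done(); any
// component that ignores it leaves goroutines hanging.
func contextCancel(runController, serveMetrics func(context.Context)) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	go func() {
		<-sigs
		cancel()
	}()

	go serveMetrics(ctx)
	runController(ctx) // expected to return once ctx is cancelled
}

func main() {
	// Shown only for contrast; a real binary would pick one model.
	contextCancel(
		func(ctx context.Context) { <-ctx.Done() }, // placeholder run loop
		func(ctx context.Context) { <-ctx.Done() }, // placeholder metrics server
	)
	_ = hardExit
}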

Why It’s Still an Improvement (in theory and from my view)

  • Graceful shutdown is generally preferable: avoids dropped in-flight changes, lets metrics flush, etc.
  • Context propagation is the standard in Go — so aligning with ecosystem best practices makes sense long-term.

So:

  • For tests: most likely a win (controllable shutdown, better coverage).
    • The init + global channel version works, but it’s less testable, less flexible, and harder to maintain
  • For production stability: riskier unless carefully validated.
    • The hard-exit model was crude but predictable.
    • The context model is correct but adds complexity that could fail in edge cases.

So I leave it for other reviewers to review and make a decision.

…t-controller-execute

Signed-off-by: Tobias Harnickell <[email protected]>
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 8, 2025
@TobyTheHutt
Contributor Author

TobyTheHutt commented Sep 8, 2025

I gave the feedback some consideration.

  1. Concerns:

It's true that my changes to signal handling and shutdown introduce more components which participate in graceful shutdown, and therefore more potential points of failure.

Also, Execute does not wait for background goroutines to finish. This might lead to a situation in which the metrics and event controller are still shutting down when main exits. This won't cause a lot of trouble, since main will just tear them down forcefully. But it has nothing to do with a "graceful" shutdown.

  2. Benefits:
  • Moving away from os.Exit improves testability
  • The issue with the graceful metrics and event controller shutdown can be fixed using a goroutine wait with a fixed timeout
  • The changes, as you already mentioned, do improve testability
    • They were introduced because the new tests uncovered a data race on DefaultServeMux under specific CPU throttling conditions
  3. Next steps:

Depending on whether we decide to go down this road, I will add more commits to smooth out the edges I introduced with my change (a rough sketch follows this list):

  • Introduce a goroutine wait and timeout for graceful metrics and event controller shutdown
    • The shutdown timeout could even be made configurable
  • Replace the global signal channel with NotifyContext
    • This should go easy on consumers and tests that don't expect SIGTERM to be intercepted
    • It reduces code
  • Handle both SIGINT and SIGTERM
  • Any other improvements from other reviewers.
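
A rough sketch of what the NotifyContext and bounded-wait steps could look like, assuming hypothetical serveMetrics and runEventController helpers; this is not the final implementation:

package main

import (
	"context"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// execute sketches the proposed shape: signal.NotifyContext instead of a
// global signal channel and init(), both SIGINT and SIGTERM handled, and a
// bounded wait for background goroutines before returning.
func execute(serveMetrics, runEventController func(context.Context)) {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); serveMetrics(ctx) }()
	go func() { defer wg.Done(); runEventController(ctx) }()

	<-ctx.Done() // block until a signal arrives (or the parent context ends)

	// Give the background goroutines a fixed window to finish; the timeout
	// could later be made configurable.
	done := make(chan struct{})
	go func() { wg.Wait(); close(done) }()
	select {
	case <-done:
		// clean shutdown
	case <-time.After(30 * time.Second):
		// timed out; main tears the rest down forcefully, as today
	}
}

func main() {
	execute(
		func(ctx context.Context) { <-ctx.Done() }, // placeholder metrics server
		func(ctx context.Context) { <-ctx.Done() }, // placeholder event controller
	)
}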

@ivankatliarchuk
Contributor

Main risk

  • this needs explicit testing, i.e. running on a cluster before merge, and sharing the results

And a concern that would be better to avoid

  • added init and shared state

@TobyTheHutt
Contributor Author

I tested the code in a new VM with a Kind cluster.

Steps to reproduce:

Prepare the cluster:

kubectl create ns dns
helm repo add bitnami https://charts.bitnami.com/bitnami
helm -n dns install etcd bitnami/etcd --set auth.rbac.create=false

Create file coredns-values.yaml:

isClusterService: false
serviceType: ClusterIP
servers:
  - zones:
      - zone: example.test.
    port: 53
    plugins:
      - name: errors
      - name: log
      - name: etcd
        parameters: example.test
        configBlock: |
          stubzones
          path /skydns
          endpoint http://etcd.dns.svc.cluster.local:2379
      - name: forward
        parameters: . /etc/resolv.conf
      - name: cache
        parameters: 30
      - name: health
      - name: ready

Create coredns instance:
helm -n dns install coredns-ext coredns/coredns -f coredns-values.yaml

Create & deploy extdns.yaml:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-dns
  namespace: dns
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-dns
rules:
- apiGroups: [""]
  resources: ["services","endpoints","pods","nodes"]
  verbs: ["get","watch","list"]
- apiGroups: ["networking.k8s.io","discovery.k8s.io","extensions"]
  resources: ["ingresses","endpointslices"]
  verbs: ["get","watch","list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: external-dns-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-dns
subjects:
- kind: ServiceAccount
  name: external-dns
  namespace: dns
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
  namespace: dns
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: external-dns
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      serviceAccountName: external-dns
      containers:
      - name: external-dns
        image: extdns:dev
        imagePullPolicy: Never
        args:
          - --provider=coredns
          - --source=ingress
          - --source=service
          - --domain-filter=example.test
          - --log-level=debug
        env:
        - name: ETCD_URLS
          value: http://etcd.dns.svc.cluster.local:2379

Create and deploy hello.yaml:

apiVersion: v1
kind: Service
metadata:
  name: hello
  annotations:
    external-dns.alpha.kubernetes.io/hostname: hello.example.test
spec:
  type: NodePort
  selector:
    app: hello
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
        - name: app
          image: nginx:stable
          ports:
          - containerPort: 8080

Of course, this is not a production-like environment. However, so far it could handle:

  • adding/removing multiple services and ingresses
  • kubectl delete pod and kubectl delete pod --grace-period=0
  • Scaling the deployment replicas up and down rapidly

Hope this helps.

@mloiseleur mloiseleur changed the title test(controller): improve coverage on Execute() reafctor(controller): signal handling and shutdown Sep 17, 2025
@mloiseleur mloiseleur changed the title reafctor(controller): signal handling and shutdown refactor(controller): signal handling and shutdown Sep 17, 2025
@mloiseleur
Collaborator

@TobyTheHutt I tried to give a better title to this PR, feel free to amend it if I missed something.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 17, 2025
@TobyTheHutt
Contributor Author

Going through my changes, I just realised that syscall is not available on Windows. I will look for a better, OS-agnostic solution and add a new commit.

@mloiseleur
Collaborator

/lgtm cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 17, 2025
@TobyTheHutt
Contributor Author

Result of my tests with the new commit (analogous to my last tests):

@mloiseleur
Collaborator

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 21, 2025
@ivankatliarchuk
Contributor

Main concern here:

  • We are refactoring from one logic to somewhat similar logic.

In controller-runtime, the OS signals are already wired up: https://github.com/kubernetes-sigs/controller-runtime/blob/961fc2c233d64e40b5bdf449f12bfa22cf2d7b28/pkg/manager/signals/signal.go#L30

We are quite far from implementing a proper Kubernetes controller manager, and at the moment we lack a strategy in this area.
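
For reference, a minimal sketch of how the controller-runtime helper linked above is typically consumed; run is a hypothetical placeholder for the controller's main loop:

package main

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/manager/signals"
)

// run stands in for the controller's main loop.
func run(ctx context.Context) {
	<-ctx.Done()
}

func main() {
	// SetupSignalHandler returns a context that is cancelled on SIGINT or
	// SIGTERM; a second signal makes the process exit immediately.
	ctx := signals.SetupSignalHandler()
	run(ctx)
}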
