
Conversation

TobyTheHutt
Contributor

@TobyTheHutt TobyTheHutt commented Sep 3, 2025

What does it do?

Improves test coverage on the controller.

  • Before (parent e78b385): 65.0% on controller, 79.2% on project
  • After (commit a7bb468): 84.5% on controller, 79.9% on project

Motivation

The referenced ticket (#5150), which aims to improve code coverage.

More

  • Yes, this PR title follows Conventional Commits
  • Yes, I added unit tests
  • Yes, I updated end user documentation accordingly

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 3, 2025
@k8s-ci-robot k8s-ci-robot added the controller Issues or PRs related to the controller label Sep 3, 2025
@k8s-ci-robot k8s-ci-robot requested a review from szuecs September 3, 2025 12:27
@k8s-ci-robot
Contributor

Welcome @TobyTheHutt!

It looks like this is your first PR to kubernetes-sigs/external-dns 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/external-dns has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 3, 2025
@k8s-ci-robot
Contributor

Hi @TobyTheHutt. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Sep 3, 2025
@mloiseleur mloiseleur changed the title chore(controller): Test controller Execute function (kubernetes-sigs#5150) test(controller): improve coverage on Execute() Sep 3, 2025
@mloiseleur
Collaborator

Thanks @TobyTheHutt 👍
/ok-to-test
Would you please complete your description with coverage before and after this PR?

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 3, 2025
@ivankatliarchuk
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 5, 2025
@ivankatliarchuk
Contributor

Seems like the Prow tests are catching some race conditions.

/lgtm cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 5, 2025
@ivankatliarchuk
Contributor

I'm not sure exactly how to reproduce the environment where the tests execute. The pull-external-dns-unit-test job runs on Kubernetes infra outside of our control, and it runs on Linux. On a local machine, limiting resources to 1 or 0.5 CPU may help.

@TobyTheHutt
Contributor Author

@ivankatliarchuk I'm working on it. I wrote a Docker setup to emulate the test pipeline and am able to reproduce the issue. Would the project be interested in the setup? If yes, I'll open an issue for that. It would additionally allow OS-independent testing and makefile runs.

…5150)

Prevent data races and premature process terminations by managing signal
capture and synchronization.

Signed-off-by: Tobias Harnickell <[email protected]>
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from ivankatliarchuk. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ivankatliarchuk
Contributor

Open a PR, let's have a look.

@TobyTheHutt
Contributor Author

I'll clean up the files and create a PR for the Docker setup soon.
In the meantime, this PR is ready for review. I tested the project multiple times and did not encounter any more failures or flakes.

@ivankatliarchuk
Contributor

I'm unsure about the proposed changes. Let's say I don't have much expertise in this area, and mainly because the change isn't paired with thorough end-to-end (smoke) testing of the shutdown paths.

This is my understanding, and why I have raised questions to myself about service stability:

  1. Signal handling changes
  • The PR moves from a relatively simple signal.Notify → os.Exit model to a more complex context-based cancellation approach (see the sketch after this list).
  • While that’s testable and idiomatic, it also means that if the context isn’t wired correctly, parts of ExternalDNS (like metrics serving, informer loops, or the controller run loop) might not stop or clean up consistently. That can leave goroutines hanging.
  2. Reliance on context for shutdown
  • Before, a SIGTERM resulted in a hard stop (not graceful, but very predictable).
  • Now, graceful shutdown depends on multiple components respecting context cancellation. If any one doesn’t, shutdown can stall or behave inconsistently. That’s less robust in production.
  3. Increased moving parts
  • Tests improve, but production gets more code paths where cancellation or cleanup could be mishandled. More opportunities for subtle bugs.
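
To make the contrast concrete, here is a minimal Go sketch of the two models being compared. It is not code from this PR; runController and serveMetrics are hypothetical placeholders for the controller run loop and the metrics server.

package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"
)

// Old model: SIGTERM hard-stops the process. Predictable, but nothing gets a
// chance to clean up, and it is hard to exercise in unit tests.
func hardExit() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	go func() {
		<-sigs
		os.Exit(0)
	}()
}

// New model: SIGTERM cancels a context instead. Shutdown now depends on every
// component (metrics server, informers, run loop) honouring ctx.Done(); any
// component that ignores it leaves goroutines hanging.
func contextCancel(runController, serveMetrics func(context.Context)) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	go func() {
		<-sigs
		cancel()
	}()

	go serveMetrics(ctx)
	runController(ctx) // expected to return once ctx is cancelled
}

func main() {
	// Shown only for contrast; a real binary would pick one model.
	contextCancel(
		func(ctx context.Context) { <-ctx.Done() }, // placeholder run loop
		func(ctx context.Context) { <-ctx.Done() }, // placeholder metrics server
	)
	_ = hardExit
}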

Why It’s Still an Improvement (in theory and from my view)

  • Graceful shutdown is generally preferable: avoids dropped in-flight changes, lets metrics flush, etc.
  • Context propagation is the standard in Go — so aligning with ecosystem best practices makes sense long-term.

So:

  • For tests: most likely a win (controllable shutdown, better coverage).
    • The init + global channel version works, but it’s less testable, less flexible, and harder to maintain
  • For production stability: riskier unless carefully validated.
    • The hard-exit model was crude but predictable.
    • The context model is correct but adds complexity that could fail in edge cases.

So I leave it for other reviewers to review and make a decision.

…t-controller-execute

Signed-off-by: Tobias Harnickell <[email protected]>
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 8, 2025
@TobyTheHutt
Contributor Author

TobyTheHutt commented Sep 8, 2025

I gave the feedback some consideration.

  1. Concerns:

It's true that my changes to signal handling and shutdown introduce more components which participate in graceful shutdown, and therefore more potential points of failure.

Also, Execute does not wait for background goroutines to finish. This might lead to a situation in which the metrics and event controller are still shutting down when main exits. This won't cause a lot of trouble, since main will just tear them down forcefully. But it has nothing to do with a "graceful" shutdown.

  2. Benefits:
  • Moving away from os.Exit improves testability
  • The issue with the graceful metrics and event controller shutdown can be fixed using a goroutine wait with a fixed timeout
  • The changes, as you already mentioned, do improve testability
    • They were introduced because the new tests uncovered a data race on DefaultServeMux under specific CPU throttling conditions
  3. Next steps:

Depending on whether we decide to go down this road, I will add more commits to smooth out the edges I introduced with my change (a rough sketch follows this list):

  • Introduce a goroutine wait and timeout for graceful metrics and event controller shutdown
    • The shutdown timeout could even be made configurable
  • Replace the global signal channel with NotifyContext
    • This should go easy on consumers and tests that don't expect SIGTERM to be intercepted
    • It reduces code
  • Handle both SIGINT and SIGTERM
  • Any other improvements from other reviewers.
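
A rough sketch of what the NotifyContext and bounded-wait steps could look like, assuming hypothetical serveMetrics and runEventController helpers; this is not the final implementation:

package main

import (
	"context"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// execute sketches the proposed shape: signal.NotifyContext instead of a
// global signal channel and init(), both SIGINT and SIGTERM handled, and a
// bounded wait for background goroutines before returning.
func execute(serveMetrics, runEventController func(context.Context)) {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); serveMetrics(ctx) }()
	go func() { defer wg.Done(); runEventController(ctx) }()

	<-ctx.Done() // block until a signal arrives (or the parent context ends)

	// Give the background goroutines a fixed window to finish; the timeout
	// could later be made configurable.
	done := make(chan struct{})
	go func() { wg.Wait(); close(done) }()
	select {
	case <-done:
		// clean shutdown
	case <-time.After(30 * time.Second):
		// timed out; main tears the rest down forcefully, as today
	}
}

func main() {
	execute(
		func(ctx context.Context) { <-ctx.Done() }, // placeholder metrics server
		func(ctx context.Context) { <-ctx.Done() }, // placeholder event controller
	)
}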

@ivankatliarchuk
Contributor

Main risk

  • this needs explicit testing, i.e. running on a cluster before merge, and sharing the results

And a concern that would be better to avoid

  • added init and shared state

@TobyTheHutt
Contributor Author

I tested the code in a new VM with a Kind cluster.

Steps to reproduce:

Prepare the cluster:

kubectl create ns dns
helm repo add bitnami https://charts.bitnami.com/bitnami
helm -n dns install etcd bitnami/etcd --set auth.rbac.create=false

Create file coredns-values.yaml:

isClusterService: false
serviceType: ClusterIP
servers:
  - zones:
      - zone: example.test.
    port: 53
    plugins:
      - name: errors
      - name: log
      - name: etcd
        parameters: example.test
        configBlock: |
          stubzones
          path /skydns
          endpoint http://etcd.dns.svc.cluster.local:2379
      - name: forward
        parameters: . /etc/resolv.conf
      - name: cache
        parameters: 30
      - name: health
      - name: ready

Create coredns instance:
helm -n dns install coredns-ext coredns/coredns -f coredns-values.yaml

Create & deploy extdns.yaml:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-dns
  namespace: dns
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-dns
rules:
- apiGroups: [""]
  resources: ["services","endpoints","pods","nodes"]
  verbs: ["get","watch","list"]
- apiGroups: ["networking.k8s.io","discovery.k8s.io","extensions"]
  resources: ["ingresses","endpointslices"]
  verbs: ["get","watch","list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: external-dns-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-dns
subjects:
- kind: ServiceAccount
  name: external-dns
  namespace: dns
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
  namespace: dns
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: external-dns
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      serviceAccountName: external-dns
      containers:
      - name: external-dns
        image: extdns:dev
        imagePullPolicy: Never
        args:
          - --provider=coredns
          - --source=ingress
          - --source=service
          - --domain-filter=example.test
          - --log-level=debug
        env:
        - name: ETCD_URLS
          value: http://etcd.dns.svc.cluster.local:2379

Create and deploy hello.yaml:

apiVersion: v1
kind: Service
metadata:
  name: hello
  annotations:
    external-dns.alpha.kubernetes.io/hostname: hello.example.test
spec:
  type: NodePort
  selector:
    app: hello
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
        - name: app
          image: nginx:stable
          ports:
          - containerPort: 8080

Of course, this is not a production-like environment. However, so far it could handle:

  • adding/removing multiple services and ingresses
  • kubectl delete pod and kubectl delete pod --grace-period=0
  • Scaling the deployment replicas up and down rapidly

Hope this helps.

@mloiseleur mloiseleur changed the title test(controller): improve coverage on Execute() reafctor(controller): signal handling and shutdown Sep 17, 2025
@mloiseleur mloiseleur changed the title reafctor(controller): signal handling and shutdown refactor(controller): signal handling and shutdown Sep 17, 2025
@mloiseleur
Collaborator

@TobyTheHutt I tried to give a better title to this PR, feel free to amend it if I missed something.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 17, 2025
@TobyTheHutt
Contributor Author

Going through my changes, I just realised that syscall is not available on Windows. I will look for a better, OS-agnostic solution and add a new commit.

@mloiseleur
Collaborator

/lgtm cancel

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 17, 2025
@TobyTheHutt
Contributor Author

Result of my tests with the new commit (analogous to my last tests):

@mloiseleur
Collaborator

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 21, 2025
@ivankatliarchuk
Contributor

Main concern here:

  • We are refactoring from one logic to somewhat similar logic.

In controller-runtime, the OS signals are already wired up: https://github.com/kubernetes-sigs/controller-runtime/blob/961fc2c233d64e40b5bdf449f12bfa22cf2d7b28/pkg/manager/signals/signal.go#L30

We are quite far from implementing a proper Kubernetes controller manager, and at the moment we lack a strategy in this area.
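
For reference, a minimal sketch of how the controller-runtime helper linked above is typically consumed; run is a hypothetical placeholder for the controller's main loop:

package main

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/manager/signals"
)

// run stands in for the controller's main loop.
func run(ctx context.Context) {
	<-ctx.Done()
}

func main() {
	// SetupSignalHandler returns a context that is cancelled on SIGINT or
	// SIGTERM; a second signal makes the process exit immediately.
	ctx := signals.SetupSignalHandler()
	run(ctx)
}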
