Skip to content

Conversation

@jakobmoellerdev
Copy link
Member

@jakobmoellerdev jakobmoellerdev commented Jul 23, 2025

Watch child objects and trigger instance reconciliation on change.

Alternative implementation proposal draft as discussed in KRO community call (23. Jul)

Compares to #576 by:

  • replacing watchset with dynamic registrations on the shared informers used by KRO
  • using the native factory (as of now) for handler registration tracking the shared informer factory does not allow for efficient dynamic shutdown and restart of informers
  • reusing our existing informer setup as much as possible
  • taking over the existing approach for tracking gvrs for the graphs resource list
  • adding a feature flag based on a label in the RGD ( we might want to enable by default, open for opinions, community call feedback was mixed on this )
  • testing that this flag actually works

based on reviews:

  • metadata informer based dynamic informers now instead of unstructured

* watchset package to implement watches
  * Abstracts changes away from kro controllers
  * Reusable package that can be used for other controllers
  * Can be upstreamed to k8s ecosystem at some point.
* watchset implementation
  * watchset sets up watches per GVK and uses a map[id]callbacks to
    demux event object to parent object.
  * When an event is detected (for a GVK), check callback is called to
    see if it matches a specific resource. If it matches, trigger
    callback is invoked to generate a parent (instance) event for the
    reconciler.
  * Uses a dynamic interface that intercepts reconciler calls on behalf
    of a parent and added the resources to a watchset.
  * watchset manager interface is used by dynamic interface to register
    handlers for an object.
  * objectWatchManager - sets up callback per object-parent pair.
  * labelWatchManager - sets up callback per parent based on labels.
* Changes in kro controller
  * Create watchset in RGD reconciler with a label selector
  * Use watchset dynamic client for managing resources in instance
    reconciler
  * During reconcile setup watchManager using object and a trigger
    callback to Enqueue the parent to dynamic controller.
* examples/kubernetes
  * Add a simple-deployment example for testing. The current webapp does
    not successfully reconcile in a kind cluster.
* Makefile changes to help with e2e test locally.
@jakobmoellerdev jakobmoellerdev force-pushed the replace-watchset-with-informers branch 2 times, most recently from a7512ae to e499afe Compare July 23, 2025 18:29
@jakobmoellerdev jakobmoellerdev force-pushed the replace-watchset-with-informers branch from e499afe to 94eafd1 Compare July 23, 2025 18:37
@jakobmoellerdev jakobmoellerdev force-pushed the replace-watchset-with-informers branch from 6b28826 to 41738d0 Compare July 24, 2025 16:10
@jakobmoellerdev jakobmoellerdev marked this pull request as ready for review July 24, 2025 16:18
@fabianburth
Copy link
Contributor

@a-hilaly, we found out why the tests were failing and fixed that up!

@barney-s barney-s requested review from barney-s and matthchr August 8, 2025 18:09
@a-hilaly
Copy link
Member

solves #323

@k8s-triage-robot
Copy link

Unknown CLA label state. Rechecking for CLA labels.

Send feedback to sig-contributor-experience at kubernetes/community.

/check-cla
/easycla

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 12, 2025
Copy link
Member

@a-hilaly a-hilaly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakobmoellerdev Solid work - thanks a lot! I really like the informer based implementation - it feel to like the right architectural choice. +1 for the feature flag as annotation although i'm down to make it a spec field considering other things stated below.

I've dropped a few comments below - got some thoughts on the startup reconciliation behavior and wondering if we could potentially share informers between RGDs to reduce API server load. Curious to hear your thoughts on these, but overall this is heading in a great direction!

- Updated `Set` to include a metadata client and added corresponding methods.
- Switched `DynamicController` to use `metadata.Interface` over `dynamic.Interface`.
- Updated tests to utilize `metadata` fake clients.
- Removed unused dynamic client utilities and transformations.

Signed-off-by: Jakob Möller <[email protected]>
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jakobmoellerdev
Once this PR has been reviewed and has the lgtm label, please assign bridgetkromhout for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…on policy

- Introduced `instancePolicy` in `ReconcileSpec` to specify periodic or reactive reconciliation for instances.
- Removed `instance-watch-resources` label and its associated logic.
- Updated controller logic to utilize `instancePolicy` for configuring resource watchers dynamically.
- Adjusted integration tests to validate behavior based on `instancePolicy`.
- Streamlined event handler registration in `DynamicController`.

Signed-off-by: Jakob Möller <[email protected]>
…th-informers

Signed-off-by: Jakob Möller <[email protected]>

# Conflicts:
#	test/integration/environment/setup.go
#	test/integration/suites/core/crd_test.go
#	test/integration/suites/core/setup_test.go
@jakobmoellerdev
Copy link
Member Author

/hold

I believe I spotted a bug on restart that is not properly handling watches. This might be due to the factory handler. have to investigate some more

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 2, 2025
@jakobmoellerdev jakobmoellerdev force-pushed the replace-watchset-with-informers branch from 7faa49b to cd1e97f Compare October 2, 2025 18:19
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 3, 2025
Copy link
Member

@a-hilaly a-hilaly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool stuff, thank you @jakobmoellerdev! left a few comments in line

@a-hilaly a-hilaly added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Oct 8, 2025
…th-informers

Signed-off-by: Jakob Möller <[email protected]>

# Conflicts:
#	api/v1alpha1/resourcegraphdefinition_types.go
#	cmd/controller/main.go
#	config/crd/bases/kro.run_resourcegraphdefinitions.yaml
#	helm/crds/kro.run_resourcegraphdefinitions.yaml
#	pkg/controller/resourcegraphdefinition/controller_reconcile.go
#	pkg/dynamiccontroller/dynamic_controller.go
#	pkg/dynamiccontroller/dynamic_controller_test.go
…proved lifecycle management

- Added logger integration to LazyInformer for structured error reporting.
- Introduced context reset on LazyInformer to facilitate re-initialization after shutdown.
- Streamlined event handler lifecycle management in DynamicController.
- Replaced `NamespacedKey` with `NamespacedName` in `ObjectIdentifiers` for consistency.

Signed-off-by: Jakob Möller <[email protected]>
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 9, 2025
…on clarity

- Set default `dynamic-controller-default-resync-period` to 0, effectively disabling periodic resyncs.
- Enhanced comments and struct documentation for better readability and maintainability.
- Updated integration test environment to reflect the disabled resync configuration.

Signed-off-by: Jakob Möller <[email protected]>
@jakobmoellerdev jakobmoellerdev force-pushed the replace-watchset-with-informers branch from 200f29c to cc988bb Compare October 9, 2025 13:20
- Introduced `QueueShutdownTimeout` to DynamicController configuration for clean shutdown handling.
- Adjusted shutdown logic to respect the configured timeout.
- Improved code clarity in shutdown coordination.

Signed-off-by: Jakob Möller <[email protected]>
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 16, 2025
…th-informers

# Conflicts:
#	pkg/client/set.go
Updated `TestStatus` kind and schema definitions to append "Reactive" or "Periodic" suffix based on the reactive behavior, improving test clarity and flexibility in ConfigMap watch behavior.

Signed-off-by: Jakob Möller <[email protected]>
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 20, 2025
Comment on lines 47 to 50
// +kubebuilder:validation:Optional
// +kubebuilder:default:=Periodic
// +kubebuilder:validation:Enum=Periodic;Reactive
InstancePolicy ResourceGraphDefinitionInstancePolicy `json:"instancePolicy,omitempty"`
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this shouldn't be exposed.

There are at least 2 cases which makes the current behaviour unpredictable. Either when the KRO pod is restarted or when the RGD is updated. Both cases trigger a reconciliation of each instance and hence override any changes on the children.

Making KRO always reactive helps bring predictability in the system as any manual change on the children would be overridden almost immediately, just as soon as the child is updated.
With this, I would expect the person doing the manual change to wonder why the manual changes were overridden rather than taking the manual change for done.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idk about this. I think this policy is not about the instance itself but resources it manages. Periodic behaves more like traditional gitops systems such as flux/argo with a resync interval, whereas reactive is based on dynamic watches purely.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still i agree that default should probably be reactive.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Periodic behaves more like traditional gitops systems such as flux/argo with a resync interval, whereas reactive is based on dynamic watches purely.

I get this. but I their context is different. gitops handles external sources and hardly have a way to get notified when the source changes. That's the main reason why they have to go for resyncs.

I may be mistaking, but I don't think this is a problem we have with KRO. In the KRO case, all resources are k8s resources and should be watchable.

Copy link

@tjamet tjamet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great to see reconciliation based on child objects.
I think the code can get easier to reason about, specially removing many locks and hence risks of deadlocks, but the feature is worth giving it a try

- Refactor `Register` method to use variadic parameters for `resourceGVRsToWatch`.
- Add `meta.RESTMapper` to `DynamicController` for precise `GroupVersionKind` resolution.
- Update integration and unit tests to incorporate `RESTMapper` usage.
- Introduce comprehensive tests for lazy informer and dynamic controller behaviors.
@jakobmoellerdev
Copy link
Member Author

/unhold
PR should be fine now.

Probably only thing left is if we want the spec extension or not. Considering what I received as feedback and our choice on applysets, I will remove the spec field again.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants