Conversation

u-kai
Contributor

@u-kai u-kai commented Sep 3, 2025

What does it do ?

Add warning logs when AWS Route53 provider-specific routing policy properties (weight, region, failover, geolocation, geoproximity, multi-value) are specified without the required setIdentifier.

This helps users identify misconfigurations where their routing policies are silently ignored by Route53.

Motivation

Fixes #5775

When users configure AWS Route53 routing policies using ExternalDNS annotations like external-dns.alpha.kubernetes.io/aws-weight: "200" but forget to include external-dns.alpha.kubernetes.io/set-identifier, Route53 silently ignores the routing policy and creates a standard DNS record instead.

This leads to confusion as users expect weighted/failover routing to be active but see no effect.

More

  • Yes, this PR title follows Conventional Commits
  • Yes, I added unit tests
  • Yes, I updated end user documentation accordingly

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign szuecs for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 3, 2025
@k8s-ci-robot k8s-ci-robot requested a review from szuecs September 3, 2025 12:37
@k8s-ci-robot k8s-ci-robot added provider Issues or PRs related to a provider needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 3, 2025
@k8s-ci-robot
Contributor

Hi @u-kai. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 3, 2025
Collaborator

@mloiseleur mloiseleur left a comment

@u-kai Thanks for this PR to improve the user experience.

The use case looks good to me 👍.

For the implementation, it seems quite inefficient to scan all those fields again.
What do you think of adding this debug log inside this loop instead of creating a dedicated loop? Would it work, or did I miss something?

@u-kai
Contributor Author

u-kai commented Sep 3, 2025

@mloiseleur

Thank you for the suggestion!
I checked the loop you mentioned, but it's in the records method for reading existing records from Route53,
while my implementation is in the newChange method for creating/updating records.
These serve different purposes.

However, I agree about efficiency.
I can move providerSpecificRequiringSetIdentifier to a variable to avoid recreating it on every call.
This maintains the current approach (single consolidated warning) while improving performance.

@mloiseleur
Collaborator

mloiseleur commented Sep 4, 2025

I can move providerSpecificRequiringSetIdentifier to a variable to avoid recreating it on every call.

That would be a good first step.

I also noticed that GetProviderSpecificProperty iterates over all provider-specific properties, so calling it once per required property is a double loop with O(n*m) complexity.

=> What do you think of adding an intersect method?

Using a hash for providerSpecificRequiringSetIdentifier, you may reach O(n*x) complexity, with x between 1 and 2 (source)

@u-kai
Contributor Author

u-kai commented Sep 4, 2025

@mloiseleur

Thanks — does this implementation look correct?

var providerSpecificRequiringSetIdentifier = []string{
	providerSpecificWeight,
	providerSpecificRegion,
	providerSpecificFailover,
	providerSpecificGeolocationContinentCode,
	providerSpecificGeolocationCountryCode,
	providerSpecificGeolocationSubdivisionCode,
	providerSpecificGeoProximityLocationAWSRegion,
	providerSpecificGeoProximityLocationBias,
	providerSpecificGeoProximityLocationCoordinates,
	providerSpecificGeoProximityLocationLocalZoneGroup,
	providerSpecificMultiValueAnswer,
}

....
	if setIdentifier == "" {
		ignoredProperties := make([]string, 0, len(providerSpecificRequiringSetIdentifier))
		tmpMap := make(map[string]struct{}, len(ep.ProviderSpecific))
		for _, ps := range ep.ProviderSpecific {
			tmpMap[ps.Name] = struct{}{}
		}
		for _, prop := range providerSpecificRequiringSetIdentifier {
			if _, ok := tmpMap[prop]; ok {
				ignoredProperties = append(ignoredProperties, prop)
			}
		}
		if len(ignoredProperties) > 0 {
			log.Warnf("Endpoint %s has provider-specific properties %v that require a setIdentifier, but none was set; ignoring these properties",
				ep.DNSName, ignoredProperties)
		}
	}

It should indeed be faster. However, I have a couple of concerns:

We’re accessing the ProviderSpecific field directly and building a temporary map, rather than calling GetProviderSpecificProperty.
That makes the implementation slightly less idiomatic / a bit harder to follow at first glance.

Given the numbers involved — providerSpecificRequiringSetIdentifier is 11 items and ProviderSpecific is about 15 items for AWS — the absolute work is small, so the practical performance gain is modest even in the worst case.

So this is a readability vs. optimization trade-off.
I’d appreciate your opinion: prefer this small optimization now, or keep the simpler/clearer approach (e.g. keep using GetProviderSpecificProperty or extract a small helper) for better readability?

@vflaux
Contributor

vflaux commented Sep 4, 2025

@u-kai you can check k8s.io/utils/set package. There is an Intersection() method.

@u-kai
Contributor Author

u-kai commented Sep 4, 2025

@vflaux
Thanks — I didn’t know about that. How about something like this?

if setIdentifier == "" {
	providerSpecificSet := make(set.Set[string], len(ep.ProviderSpecific))
	for _, s := range ep.ProviderSpecific {
		providerSpecificSet.Insert(s.Name)
	}
	ignoredProperties := providerSpecificRequiringSetIdentifier.Intersection(providerSpecificSet)
	if len(ignoredProperties) > 0 {
		pMsg := ignoredProperties.SortedList()
		log.Warnf("Endpoint %s has provider-specific properties %v that require a setIdentifier, but none was set; ignoring these properties",
			ep.DNSName, pMsg)
	}
}

@vflaux
Contributor

vflaux commented Sep 5, 2025

No need to range over ep.ProviderSpecific, there is a constructor:

providerSpecificSet := set.New(ep.ProviderSpecific...)

@u-kai
Contributor Author

u-kai commented Sep 5, 2025

@vflaux

Thanks! Just to clarify: set.New expects ...string (since it’s defined as func New[E ordered](items ...E) Set[E]). Meanwhile, ep.ProviderSpecific is a slice of structs ([]ProviderSpecificProperty), each with fields like Name and Value. So we can’t pass ep.ProviderSpecific directly to set.New[string](...).

@vflaux
Contributor

vflaux commented Sep 5, 2025

@u-kai You're right, my mistake.

@u-kai u-kai requested a review from mloiseleur September 6, 2025 05:00
Contributor

@ivankatliarchuk ivankatliarchuk left a comment

What I actually see here: this change exposes the current design, which at the moment looks like this:

// ProviderSpecific holds configuration which is specific to individual DNS providers
type ProviderSpecific []ProviderSpecificProperty

func (e *Endpoint) GetProviderSpecificProperty(key string) (string, bool) {
	for _, providerSpecific := range e.ProviderSpecific {
		if providerSpecific.Name == key {
			return providerSpecific.Value, true
		}
	}
	return "", false
}

func (e *Endpoint) SetProviderSpecificProperty(key string, value string) {
	for i, providerSpecific := range e.ProviderSpecific {
		if providerSpecific.Name == key {
			e.ProviderSpecific[i] = ProviderSpecificProperty{
				Name:  key,
				Value: value,
			}
			return
		}
	}

	e.ProviderSpecific = append(e.ProviderSpecific, ProviderSpecificProperty{Name: key, Value: value})
}

Throughout the codebase, whenever external-dns needs to retrieve a provider-specific property, it iterates over the whole slice. In large environments, this could become a performance bottleneck.

Before adding a warning log (not something super critical), we should most likely first consider which data structures would better fit the ProviderSpecific use case.


setIdentifier := ep.SetIdentifier

// Check if provider-specific values requiring setIdentifier are present but setIdentifier is empty
Contributor

There are probably other design solutions. But could we at least move this into a private method? I see no point in increasing complexity the way it currently does.

@u-kai
Contributor Author

u-kai commented Sep 6, 2025

@ivankatliarchuk

You're right, the current slice-based design might not be the most efficient, and switching to a map could be a better fit here.
I'll give it a try to see if I can refactor it that way.
If it works out, I'll open a separate PR for it.

@u-kai
Contributor Author

u-kai commented Sep 6, 2025

@ivankatliarchuk

I looked into it, and changing ProviderSpecific directly to a map would be a breaking change since it’s part of the CRD schema.

To avoid that, my plan is:

Keep the CRD interface as-is ([]ProviderSpecificProperty) for backward compatibility.

Define a separate internal Endpoint type, where ProviderSpecific is represented as map[string][]string.

Add conversion helpers between the CRD type and the internal type.

This way we can improve performance and readability without introducing a breaking change to users.

Does that sound reasonable to you?

@ivankatliarchuk
Contributor

Makes sense on paper)))

@ivankatliarchuk
Contributor

If the other change (the O(1) logic) isn't approved, it doesn't make sense to add a warning message, as that would only increase complexity. So most likely this is going to be on hold

@u-kai
Contributor Author

u-kai commented Oct 2, 2025

Could you clarify a bit more on your first point? 🙇

If the other change (the O(1) logic) isn't approved, it doesn't make sense to add a warning message, as that would only increase complexity.

I’d like to better understand why the warning message would lose its meaning or increase complexity if the O(1) logic isn’t approved.

From my perspective, based on some quick benchmarks I ran, for our typical workload in external-dns (3–5 ProviderSpecific keys, ~30 membership checks in the AWS provider paths), a simple for loop over the slice is actually faster and simpler than building a set/map.

So my current plan is to adjust this PR to keep the underlying representation as a slice and switch back to a straightforward loop, while still adding the warning log as originally intended.

I’d appreciate a quick sanity check on this assumption. If this looks reasonable, I’d like to proceed in that direction.
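
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch, not the benchmark actually run for this PR. The function names, key names, and workload sizes (4 properties, 30 checks) are assumptions chosen to resemble the numbers above.

```go
package main

import (
	"fmt"
	"testing"
)

// ProviderSpecificProperty mirrors the Name/Value pair discussed in the thread.
type ProviderSpecificProperty struct {
	Name  string
	Value string
}

// countHitsSlice answers each membership check with a linear scan.
func countHitsSlice(props []ProviderSpecificProperty, keys []string) int {
	hits := 0
	for _, k := range keys {
		for _, p := range props {
			if p.Name == k {
				hits++
				break
			}
		}
	}
	return hits
}

// countHitsMap builds a temporary set first, then answers each check by lookup.
func countHitsMap(props []ProviderSpecificProperty, keys []string) int {
	set := make(map[string]struct{}, len(props))
	for _, p := range props {
		set[p.Name] = struct{}{}
	}
	hits := 0
	for _, k := range keys {
		if _, ok := set[k]; ok {
			hits++
		}
	}
	return hits
}

// sampleWorkload returns nProps properties and nKeys lookup keys with
// made-up names. It assumes nProps <= nKeys.
func sampleWorkload(nProps, nKeys int) ([]ProviderSpecificProperty, []string) {
	keys := make([]string, nKeys)
	for i := range keys {
		keys[i] = fmt.Sprintf("example/key-%d", i)
	}
	props := make([]ProviderSpecificProperty, nProps)
	for i := range props {
		props[i] = ProviderSpecificProperty{Name: keys[i], Value: "x"}
	}
	return props, keys
}

// sink keeps the results alive so the calls are not optimized away.
var sink int

func main() {
	props, keys := sampleWorkload(4, 30)
	viaSlice := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += countHitsSlice(props, keys)
		}
	})
	viaMap := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink += countHitsMap(props, keys)
		}
	})
	fmt.Println("slice scan:   ", viaSlice)
	fmt.Println("temporary map:", viaMap)
}
```

The numbers will vary by machine and Go version, so the sketch only shows how to measure, not what the result is.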

@ivankatliarchuk
Contributor

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 5, 2025
@mloiseleur
Collaborator

mloiseleur commented Oct 5, 2025

@u-kai I'll try to explain.

  1. AWS users are only a part of all ExternalDNS users.
  2. Users adding the provider-specific weight property are only a part of all ExternalDNS AWS users.

So let's try to take a high-level point of view: is adding this kind of check for all resources loaded by ExternalDNS an improvement to the user experience, or overengineering (i.e. adding more complexity to the code base)?

Normally, when a required parameter is missing, the API being called should fail. In this specific case, the call to the AWS API should fail and return an error, so the log would be displayed to the user, as in all the other error cases with this provider.

We can improve our documentation but, like @ivankatliarchuk, I'm not sure we should complicate our code base because of this specific edge case. The AWS APIs are evolving, and maybe this case will return an error in a few months.

But, maybe I missed something. Feel free to share your thoughts.

@u-kai
Contributor Author

u-kai commented Oct 5, 2025

@mloiseleur

Thanks for the detailed feedback.

Is adding this kind of check for all resources loaded by ExternalDNS an improvement on UserXP or overengineering?

I understand your concern, but just to clarify — this change only applies to the AWS provider, and only when setIdentifier is empty.

You're right that it adds some complexity, but as the issue reporter mentioned, users currently get no warning at all when this happens, which can be confusing. I think this small check improves UX in such cases.
In terms of code complexity, it doesn’t really differ between slice and map implementations — both access the same exposed interface of ProviderSpecific.
I’ve just pushed the latest version, so please take a look and see if the complexity feels acceptable. 🙇
This version no longer uses a set Intersection; based on the earlier benchmarks I shared, iterating over a slice is actually faster for the expected number of ProviderSpecific keys.

Normally, when a required parameter is not there, the API called should fail

Yes, the AWS API already returns an error like below.

 Error: An error occurred (InvalidInput) when calling the ChangeResourceRecordSets
     operation: Invalid request: Missing field 'SetIdentifier' in Change with [Action=CREATE,
     Name=test-failover.example.com, Type=A, SetIdentifier=null]

However, ExternalDNS currently only sets the providerSpecific property when setIdentifier is non-empty, which means the AWS API never gets a chance to validate the invalid case.
If we changed it to always set the property, AWS would indeed return an error — but that would break existing systems that currently succeed under this behavior.

As an alternative approach, in a future version, we could consider setting the property regardless of setIdentifier and let the AWS API error out.
That would make the behavior more explicit, though it might break setups that are currently “working” by accident.

For now, I believe adding a warn-level log strikes a good balance between UX improvement and safety.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. provider Issues or PRs related to a provider size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ingressClassName alb does not set aws-weight
5 participants