Skip to content

Conversation

krishicks
Copy link

What does it do ?

Creates a credentials provider which loads the default config on each retrieval when not using AssumeRole.

Fixes #3713

Motivation

When supplying credentials with Hashicorp Vault, users create a file with static credentials that's referenced by the environment variable AWS_SHARED_CREDENTIALS_FILE. The contents of this file are temporary credentials that are rotated by Vault on a continuous basis to ensure they're always valid.

AWS SDK for Go (v2) presently only reads this file once on load, so when the credentials expire it will never see the updates supplied by Vault.

More

  • Yes, this PR title follows Conventional Commits
  • Yes, I added unit tests
  • Yes, I updated end user documentation accordingly

@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jun 26, 2025
Copy link

linux-foundation-easycla bot commented Jun 26, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jun 26, 2025
@k8s-ci-robot k8s-ci-robot requested a review from mloiseleur June 26, 2025 16:58
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ivankatliarchuk for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from szuecs June 26, 2025 16:58
@k8s-ci-robot k8s-ci-robot added the provider Issues or PRs related to a provider label Jun 26, 2025
@k8s-ci-robot
Copy link
Contributor

Welcome @krishicks!

It looks like this is your first PR to kubernetes-sigs/external-dns 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/external-dns has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 26, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @krishicks. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 26, 2025
@krishicks krishicks force-pushed the master branch 2 times, most recently from b324e3c to b53c707 Compare June 26, 2025 16:59
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jun 26, 2025
@ivankatliarchuk
Copy link
Member

What prevents from using established methods like Vault Agent (https://www.hashicorp.com/en/blog/refresh-secrets-for-kubernetes-applications-with-vault-agent) to restart a service upon credential updates? Furthermore, why are we not leveraging Pod Identity, IRSA, or similar robust identity management solutions?

@krishicks
Copy link
Author

Good question!

Normally, we use amazon-eks-pod-identity-webhook with IRSA to provide temporary credentials to pods on CSP-managed Kubernetes. However, for non-CSP-managed Kubernetes, we're investigating using Vault configured with JWT validation pubkeys to have Vault issue these temporary credentials via the external-secrets operator's VaultDynamicSecret. This is to avoid having to expose the OIDC discovery information for the cluster, which by definition must be publicly accessible and unauthenticated.

Using Vault Agent would mean having to run a sidecar, which means running yet another component that needs to be monitored, maintained, and updated. Additionally, the way the Vault Agent works is to disrupt the running container to get it to reload, which isn't great as external-dns runs as a singleton. It would also be restarted up to 24 times per day as the temporary credentials we supply have a TTL of one hour.

This PR implements a solution that makes it just work if someone wants to utilize AWS_SHARED_CREDENTIALS_FILE with temporary credentials. No worrying about disrupting the process, or having to ignore increased container restarts, or having to run a sidecar. It's a better user experience, and one that people expect, without which people are surprised by:

aws/aws-sdk-go#3163
boto/botocore#2450

This PR implements the same kind of automatic refreshing of the credentials file that has been implemented in aws-sdk-java-v2:

https://github.com/aws/aws-sdk-java-v2/blob/cd49f1f9a49c3c5aa9f7e463ed5435cfbbd23943/core/auth/src/main/java/software/amazon/awssdk/auth/credentials/ProfileCredentialsProvider.java#L121

When supplying credentials with Hashicorp Vault, users create a file
with static credentials that's referenced by the environment variable
`AWS_SHARED_CREDENTIALS_FILE`. The contents of this file are temporary
credentials that are rotated by Vault on a continuous basis to ensure
they're always valid.

AWS SDK for Go (v2) presently only reads this file once on load, so when
the credentials expire it will never see the updates supplied by Vault.

To fix this, we create a credentials provider which loads the default
config on each retrieval. As a result, the credentials should always be
the latest credentials supplied by Vault.
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 27, 2025
@ivankatliarchuk
Copy link
Member

I'm not sure where or not unit test is correct. In external-dns the CreateV2Configs exected only once on provider creation. I'm not sure about my solution, but could be something like

func Test_newV2Config_WithRefresh(t *testing.T) {

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	done := make(chan struct{})

	var cfg awsv2.Config
	var err error

	dir := t.TempDir()
	credsFile := filepath.Join(dir, "credentials")
	t.Setenv("AWS_SHARED_CREDENTIALS_FILE", credsFile)

	err = os.WriteFile(credsFile, []byte(`
[default]
aws_access_key_id=AKID1234
aws_secret_access_key=SECRET1
`), 0777)
	require.NoError(t, err)

	cfg, err = newV2Config(AWSSessionConfig{})

	go func() {
		defer close(done) // Ensure the channel is closed at the end
		// Execute the function
		cfg, err = newV2Config(AWSSessionConfig{})
		require.NoError(t, err)
	}()

	// Wait for the channel to close or context to be canceled
	select {
	case <-done:
		// Test completed successfully
	case <-ctx.Done():
		t.Fatal("Test timed out or was canceled")
	}

	fmt.Println("creds first call >>>")
	fmt.Println(cfg.Credentials.Retrieve(context.Background()))

	err = os.WriteFile(credsFile, []byte(`
[default]
aws_access_key_id=AKID2567
aws_secret_access_key=SECRET2
`), 0777)
	require.NoError(t, err)

	fmt.Println("creds second call >>>")
	fmt.Println(cfg.Credentials.Retrieve(context.Background()))
}

Go routine most likely not required.

@krishicks
Copy link
Author

I'm not sure I understand the concern with my test. In my test as committed it only calls newV2Config once, and calls retrieve on it twice; once after writing the original creds, and then once after updating the creds. This proves that without reinstantiating the V2 Config it always loads the credentials from the file, or in other words does not cache the credentials.

The test you wrote will block on done being closed, which means it will wait for the goroutine to complete, so the first instantiation of cfg will never be used. newV2Config should not be called more than once in the test to more accurately portray the behavior of the production code.

Can you restate the concern with my test?

@ivankatliarchuk
Copy link
Member

Tests are fine.

My primary concern with this Pull Request is that it appears to read the file system on every API request to AWS, which I don't believe is a recommended approach for credential handling. Typically, credentials should only be read or refreshed when necessary.

In my opinion, this behavior should be supported at the AWS SDK level. I would suggest opening a new issue or PR in the aws-sdk-go-v2 repository first to address this. Additionally, credentials should primarily be refreshed based on some sort of a strategy, like creds expired or the API returns an ExpiredToken or a similar error, with middleware, rather than on every request.

What is required;

  • open an issue or PR in aws-sdk-go-v2 (I think the maintainers will recommend what and how to do it)

Feature itself make sense as well as the use case.

Therefore, this feature will be put on hold for now.

Relevant issue aws/aws-sdk-go-v2#2135, and similar which outlines how credential refresh decisions are currently made.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 28, 2025
@krishicks
Copy link
Author

I don't think getting it added in aws-sdk-go-v2 is viable given they closed that issue saying to do exactly what I've done here, implementing our own provider. How about we implement a different kind of provider, then?

What we really want here is to be able to get credentials from Vault, so what if we implemented a provider that would do that? It could use JWT or Kubernetes auth methods and only fetch new creds from Vault when the creds it already has expires.

If that sounds like a better option, I'm happy to implement that. It would mean adding config so the user could say "for AWS, use Vault, and here's its address and the auth method/mount path/role to use".

@ivankatliarchuk
Copy link
Member

We actually moving providers out-of-tree #4347

So only a webhook is an option at the moment for bespoke setup. https://kubernetes-sigs.github.io/external-dns/latest/docs/tutorials/webhook-provider/

My concern is that adding this as a custom credentials resolver, without first engaging the go-sdk2 team for a library improvement, is not a scalable solution.

In a common Kubernetes environment, ExternalDNS is rarely the only add-on installed. Clusters often run components like autoscaler, Karpenter, VPC CNI, EBS CSI, API Gateway, and many more. Implementing a custom credential resolver in every single one of these projects is unsustainable.

Therefore, there's a strong case for integrating this functionality directly into the SDK library as one of its standard configuration options.

At the meantime, there is a compensation from external-secrets team https://github.com/external-secrets-inc/reloader
This will reload service when needed, probably not what exactly required, but should work for any service.

I strongly feel this feature is critically needed in the SDK,

@ivankatliarchuk
Copy link
Member

I have not tried, but if static credentials expired, the library should in theory refresh credentials when an API call explicitly returns an ExpiredToken or a similar authentication error. Is this currently not happening?

@mloiseleur
Copy link
Collaborator

I don't think getting it added in aws-sdk-go-v2 is viable given they closed that issue

It seems it has moved as a community FR and needs upvote to be prioritized: aws/aws-cli#9034

Did you take a look at external process for credentials ? like what exists for terraform provider ?

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. provider Issues or PRs related to a provider size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Expired Token on refreshed AWS keys

5 participants