SPLAT-2137: Support Security Group on NLB for Default router on AWS #1802
Conversation
@mtulio: This pull request references SPLAT-2137, which is a valid Jira issue.
Thanks for sharing this enhancement - it's great to see this progress ahead.
I'm just starting to review but had basic questions on the summary so want to wait before I proceed.
this is generally making sense to me, i've left some comments and questions.
- Configure Ingress rules in the Security Group to allow traffic on the ports defined in the Service's `spec.ports`. The source for these rules will be determined by the `service.beta.kubernetes.io/load-balancer-source-ranges` annotation on the Service (if present, otherwise default to allowing from all IPs).
- Configure Egress rules in the Security Group to allow traffic to the backend pods on the targetPort specified in the Service's `spec.ports` and the health check port. Initially, this should be restricted to the cluster's VPC CIDR or the specific CIDRs of the worker nodes.
- When creating the NLB using the AWS ELBv2 API, the CCM will include the ID of the newly created Security Group in the `SecurityGroups` parameter of the `CreateLoadBalancerInput`.
- When the Service is deleted, the CCM will also delete the associated Security Group, ensuring proper cleanup.
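The ingress-rule step above could be sketched roughly as follows. This is an illustrative sketch only, not the actual CCM implementation: the `ServicePort`/`IpPermission` types are simplified stand-ins for the Kubernetes Service port spec and the EC2 `IpPermission` structure, and the helper names are assumptions.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes ServicePort and EC2 IpPermission types.
type ServicePort struct {
	Port     int64
	Protocol string
}

type IpPermission struct {
	Protocol string
	FromPort int64
	ToPort   int64
	CIDRs    []string
}

// buildIngressPermissions opens each Service port to the configured source
// ranges, defaulting to 0.0.0.0/0 when no load-balancer-source-ranges are
// set, mirroring the behaviour described in the bullet above.
func buildIngressPermissions(ports []ServicePort, sourceRanges []string) []IpPermission {
	if len(sourceRanges) == 0 {
		sourceRanges = []string{"0.0.0.0/0"}
	}
	perms := make([]IpPermission, 0, len(ports))
	for _, p := range ports {
		perms = append(perms, IpPermission{
			Protocol: p.Protocol,
			FromPort: p.Port,
			ToPort:   p.Port,
			CIDRs:    sourceRanges,
		})
	}
	return perms
}

func main() {
	perms := buildIngressPermissions([]ServicePort{{Port: 443, Protocol: "tcp"}}, nil)
	fmt.Println(perms[0].CIDRs[0], perms[0].FromPort)
}
```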
what happens if this annotation is added after the Service is created? (ie what happens on update)
I am working on it, ensuring I follow the current state of the CCM alongside the ALBC so I can document it correctly. Thanks for raising that question.
We need to be able to answer these questions for upstream, but downstream we could prevent those transitions with VAP
- Logic in the service controller within the CCM (`pkg/providers/v1/aws.go` and `pkg/providers/v1/aws_loadbalancer.go`) to recognize and handle the new annotation when the service type is `NLB` (`ServiceAnnotationLoadBalancerType = "service.beta.kubernetes.io/aws-load-balancer-type"`).
- Functionality within the CCM to create and manage the lifecycle of AWS Security Groups for NLBs, including creating ingress and egress rules based on the service specification. This would likely involve using the AWS SDK for Go to interact with the EC2 API for creating and managing security groups.
i wonder if the upstream changes to CCM will need to be behind a feature gate?
given that this will be a new feature to the CCM that the ALBO already controls, i'm guessing that we will need a way to ensure that users who run CCM and ALBO in the same cluster can control the behavior. perhaps a flag to the CCM.
Good catch, Mike, thanks! That makes sense. I will look into how the CCM handles this and document it here.
Cross-ref to the thread that we could enhance the config to improve the EP goal.
- The CCM's service controller will watch for Service creations and updates.
- When it encounters a Service with the annotation `service.beta.kubernetes.io/aws-load-balancer-managed-security-group: "true"` and `service.beta.kubernetes.io/aws-load-balancer-type: nlb`, the CCM will:
  - Create a new AWS Security Group for the NLB. The name should follow a convention like `k8s-elb-a<generated-name-from-service-uid>`.
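For illustration, the naming convention referenced above could be sketched as below. This is a hedged sketch, not the CCM's actual code: it assumes the legacy cloud-provider convention of deriving the LB name from "a" plus the Service UID with dashes removed, truncated to 32 characters, and the helper names are invented for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultLoadBalancerName sketches the assumed legacy convention:
// "a" + Service UID with dashes stripped, truncated to 32 characters.
func defaultLoadBalancerName(serviceUID string) string {
	name := "a" + strings.ReplaceAll(serviceUID, "-", "")
	if len(name) > 32 {
		name = name[:32]
	}
	return name
}

// securityGroupName prefixes the LB name with "k8s-elb-", matching the
// `k8s-elb-a<generated-name-from-service-uid>` convention quoted above.
func securityGroupName(serviceUID string) string {
	return "k8s-elb-" + defaultLoadBalancerName(serviceUID)
}

func main() {
	fmt.Println(securityGroupName("2208b9d6-5c4d-4f9e-8a1b-0123456789ab"))
}
```

By contrast, the ALBC pattern mentioned in this thread (`k8s-<namespace>-<service_name>-<id>`) encodes the namespace and Service name instead of the UID.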
> The name should follow a convention like `k8s-elb-a<generated-name-from-service-uid>`

This is an interesting point: the convention the CCM uses to create NLBs from Services is different from the ALBC's, which follows the pattern `k8s-<namespace>-<service_name>-<id>`.
Furthermore, I see NLB tags aren't standardized either:

CCM:
- `kubernetes.io/cluster/clusterID: owned`
- `kubernetes.io/service-name: namespace/service-name`

ALBC:
- `elbv2.k8s.aws/cluster: clusterID`
- `service.k8s.aws/resource: LoadBalancer`
- `service.k8s.aws/stack: namespace/service-name`
Question to @JoelSpeed @elmiko - do we want to standardize the NLB tags between controllers too?

IIUC `kubernetes.io/cluster/clusterID: owned` was not added in my ALBC exploration because the service was created by ALBO/ALBC, which seems not to enforce cluster tags.
region: us-east-1
lbType: NLB <-- deprecated by platform.aws.ingressController.loadBalancerType?
ingressController: <-- proposing to aggregate CIO configurations
  securityGroupEnabled: True <-- new field
What if I want to have different security groups for ingress vs the rest of the cluster? Is that possible?
Do we need the option for this to be automatic (use the same as you'd expect for default) but also a BYO option where users can specify specific SG IDs to be used?
> What if I want to have different security groups for ingress vs the rest of the cluster? Is that possible?

Would you mind elaborating? I am not sure I followed correctly, as the proposal already adds a dedicated SG to the NLB, separate from the rest of the cluster.

> Do we need the option for this to be automatic (use the same as you'd expect for default) but also a BYO option where users can specify specific SG IDs to be used?

That's a fair point, but I am not sure we have a customer use case for BYO SG on CIO, and I also wonder if supporting BYO SG would diverge from the main focus of this EP: enabling NLB with a security group.

BYO SG would increase the implementation scope a bit, especially in the CCM. IIUC, by definition, when SG IDs are added (BYO SG) through annotations, the CCM (Classic LB), or ALBC, won't manage those SGs' lifecycle. The ALBC also provides an extra annotation (`manage-backend-security-group-rules`) to allow managing node rules:

> If you specify this annotation, you need to configure the security groups on your Node/Pod to allow inbound traffic from the load balancer. You could also set the manage-backend-security-group-rules if you want the controller to manage the access rule

So what we are targeting is providing the initial ability to enable an SG on the NLB, similar to how a CLB is deployed by default, as requested by Managed Services. I am thinking any additional feature/parity with the ALBC would fall into the long-term planning we've been discussing with PMs. Do you think we could phase it? Thoughts?
in the latest version I added the BYO SG workflow as a later phase, as an opt-in on the Service object, removing the installer/CIO option/API.
annotations:
  service.beta.kubernetes.io/aws-load-balancer-type: nlb
  service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
  service.beta.kubernetes.io/aws-load-balancer-managed-security-group: "true" <-- new annotation
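In context, a full Service manifest carrying these annotations might look like the following. This is an illustrative sketch only: the `managed-security-group` annotation is the one proposed (not yet merged) in this EP, and the name, namespace, and selector are assumed values, not taken from the EP.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: router-default          # hypothetical example Service
  namespace: openshift-ingress
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    # Proposed opt-in for a CCM-managed front-end Security Group:
    service.beta.kubernetes.io/aws-load-balancer-managed-security-group: "true"
spec:
  type: LoadBalancer
  selector:
    app: router                  # hypothetical selector
  ports:
  - name: https
    port: 443
    targetPort: 443
```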
What does the annotation scheme look like in the AWS LBC? I thought it just allowed you to specify IDs
I think the upstream change to the CCM wants to mimic the behaviour described in https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/guide/service/annotations/#security-groups
Is our described behaviour here compatible with that, if not, have we deliberately deviated from that pattern?
> Is our described behaviour here compatible with that

It is not. The proposed annotation `service.beta.kubernetes.io/aws-load-balancer-managed-security-group` is not the same as the BYO SG annotations. To recap the BYO SG annotations:

On ALBC:
- `service.beta.kubernetes.io/aws-load-balancer-security-groups`
- `alb.ingress.kubernetes.io/frontend-nlb-security-groups` (I didn't really understand how it differs from the previous one)

On CCM:
- `service.beta.kubernetes.io/aws-load-balancer-security-groups`
- `service.beta.kubernetes.io/aws-load-balancer-extra-security-groups`

> , if not, have we deliberately deviated from that pattern?

Yes, this intentionally proposes a new annotation to signal the CCM to manage the SG for NLBs (allowing users to transition to this config: opt-in). It was added mainly to prevent changing the default behavior of the CCM when provisioning NLBs.

AFAICT the ALBC does not provide this option, as it has defaulted to SGs since v2.6.0 (Aug 10, 2023), and it's not possible to disable it (?).

Alternatively, I can see:
- Explicitly changing the default behavior of NLB to always create SGs (do we want that?)

I believe we can converge on the thread https://github.com/openshift/enhancements/pull/1802/files#r2111532244 where you mentioned the transition and suggested configuration changes.
In the latest version of this EP we are moving to a global configuration (cloud-config) for the CCM, enforced in OpenShift by CCCMO, instead of a "managed" annotation as described above.

The BYO SG flow is also covered in a later phase of this EP, ensuring customers can opt out of the enforced managed SG on NLBs, following the existing ALBC flow.
// ServiceAnnotationLoadBalancerManagedSecurityGroup is the annotation used
// on the service to instruct the CCM to manage the security group when creating a Network Load Balancer. When enabled,
// the CCM creates the security group and its rules. This option cannot be used with the annotations
// "service.beta.kubernetes.io/aws-load-balancer-security-groups" and "service.beta.kubernetes.io/aws-load-balancer-extra-security-groups".
const ServiceAnnotationLoadBalancerManagedSecurityGroup = "service.beta.kubernetes.io/aws-load-balancer-managed-security-group"
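The mutual-exclusion constraint in that comment could be enforced with a check along these lines. This is a hedged sketch, not the actual CCM code: the annotation keys come from the EP text, but `validateSGAnnotations` and its shape are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Annotation keys quoted from the EP; the validation helper is illustrative.
const (
	annManagedSG = "service.beta.kubernetes.io/aws-load-balancer-managed-security-group"
	annSGs       = "service.beta.kubernetes.io/aws-load-balancer-security-groups"
	annExtraSGs  = "service.beta.kubernetes.io/aws-load-balancer-extra-security-groups"
)

// validateSGAnnotations rejects combining the proposed managed-SG annotation
// with the existing BYO security-group annotations, as the doc comment requires.
func validateSGAnnotations(annotations map[string]string) error {
	if annotations[annManagedSG] != "true" {
		return nil
	}
	if _, ok := annotations[annSGs]; ok {
		return errors.New(annManagedSG + " cannot be combined with " + annSGs)
	}
	if _, ok := annotations[annExtraSGs]; ok {
		return errors.New(annManagedSG + " cannot be combined with " + annExtraSGs)
	}
	return nil
}

func main() {
	err := validateSGAnnotations(map[string]string{
		annManagedSG: "true",
		annSGs:       "sg-0123456789abcdef0",
	})
	fmt.Println(err != nil)
}
```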
So this doesn't exist in LBC right? Is this being introduced to allow a transition from a CCM where it does not currently create a security group, to enabling users to opt-in to creating security groups?
Have you considered if it might be better to make this a CCM configuration that an admin would set for the cluster, rather than setting it for each service?
I could see in the future OpenShift changing the default to say that all new NLBs should have a security group created automatically for them
> So this doesn't exist in LBC right? Is this being introduced to allow a transition from a CCM where it does not currently create a security group, to enabling users to opt-in to creating security groups?

Yes and yes. The idea was to avoid disrupting the existing flow when creating services with NLB.

> Have you considered if it might be better to make this a CCM configuration that an admin would set for the cluster, rather than setting it for each service?

I didn't, but this is an excellent idea. It would greatly decrease the amount of API changes proposed in this EP, furthermore helping us in the future by (if) transitioning to the ALBC.

@elmiko mentioned requiring the CCM changes to be under a feature gate; what if we introduce an FG that enables SGs by default when provisioning NLBs in the CCM, so we can enable it on OCP and remove most of the API proposals, and annotations, in this EP?

It would also decrease the UX overhead and laser-focus on the initial problem.

Would the workflow look like the following options (superficially)?

openshift-install:
- user sets `platform.aws.lbType` to `NLB` (currently opt-in)
- CCM config is added on OCP deployments (do we need/expose it through installer manifests?)
- CCM creates the SG when the gate is enabled when provisioning an NLB

ROSA Classic or HCP:
- ensure the CCM config is updated (or will it be enabled by default when the API FG is set?)
- (same CCM flow)

No changes in CIO.

Does that make sense?
I just finished the exploration, and this is the main idea (tl;dr):

- Create a new configuration in the cloud config (upstream CCM). Example
- Enforce the configuration in the CCCMO. Example

Once a new Service of type LoadBalancer with NLB is created, the controller will manage a Security Group, attaching it to the new LB.
I think we want to follow the pattern set out in LBC (https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/deploy/security_groups/#security-groups-for-load-balancers)
Which means:
- `service.beta.kubernetes.io/aws-load-balancer-security-groups` on the service allows a user to specify a pre-existing set of security groups to attach to the front-end of the LB
- If the annotation is not set, create and manage a front-end security group for each LB automatically

We don't want to just enable this create-and-manage front-end SG by default, since that would be a major change.
So, this is where the CCM config option would come in, and allow users to opt-in/out of having a default security group created for each service.
I think that mostly aligns with your suggestions above in this thread, but I think we still want to have the annotation to allow the user to override the behaviour?
Do we need to also account for the shared backend SG behaviour of LBC?
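The decision flow described here (BYO annotation wins, otherwise the cloud-config option decides) could be sketched as follows. This is a rough sketch of the proposed behaviour only; `sgDecision` and `decideFrontendSG` are invented names, not part of the CCM.

```go
package main

import "fmt"

// sgDecision captures what the controller would do for a new NLB Service.
type sgDecision struct {
	Managed bool     // CCM creates, attaches, and later deletes the SG
	SGIDs   []string // pre-existing SGs supplied by the user (BYO)
}

// decideFrontendSG: a BYO annotation always wins; otherwise the global
// cloud-config option decides whether a managed front-end SG is created.
func decideFrontendSG(byoAnnotation string, manageByDefault bool) sgDecision {
	if byoAnnotation != "" {
		// User-supplied SGs: attach them, but do not manage their lifecycle.
		return sgDecision{Managed: false, SGIDs: []string{byoAnnotation}}
	}
	if manageByDefault {
		// Cloud-config opt-in: create and manage a front-end SG per LB.
		return sgDecision{Managed: true}
	}
	// Legacy behaviour: NLB without a directly attached SG.
	return sgDecision{Managed: false}
}

func main() {
	fmt.Println(decideFrontendSG("", true).Managed)       // opt-in, no BYO
	fmt.Println(decideFrontendSG("sg-abc", true).Managed) // BYO overrides
}
```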
> but I think we still want to have the annotation to allow the user to override the behaviour?

Are you referring to opting out of / overriding the global config that manages front-end SG creation (the proposal of this EP) without a BYO SG approach? Do we need, or have a strong reason/use case, to do so, considering that assigning an SG to an NLB is a best practice/recommendation? I also wonder if we would be going against the ALBC strategy used in v2.6.0+ (I really didn't find a configuration to opt out of SGs on NLBs in recent ALBC versions).

> Do we need to also account for the shared backend SG behaviour of LBC?

I think it would benefit clusters with a high number of services, but if we don't have a strong use case for it in the short term, I would not increase the number of features to incorporate into the CCM in this EP, as the long-term approach on OCP is TBD.

LMK WDYT +@elmiko
> once enabled, will the ccm try to create SGs for older LB services?

That's not the idea; only the new Services/LBs, especially since NLBs support SGs only at creation, and we don't want to roll out all services when this feature is enabled - at least it was not discussed yet, as it would disrupt user workloads.

> is this saying we don't want to autocreate SGs once the feature is enabled?

Defer to @JoelSpeed to expand; see my #1802 (comment)

> is it realistic to expect that we can address the backend SG stuff in another change?

I believe yes. My opinion is to focus on critical issues now (following ALBC behaviour); long-term discussions will figure out those features and inherit implementations/improvements added to the ALBC. WDYT @elmiko @JoelSpeed ?
> once enabled, will the ccm try to create SGs for older LB services?

It is important not to change the old services; they should remain without the SG.

> is this saying we don't want to autocreate SGs once the feature is enabled?

We should not just blanketly change this in the upstream CCM; it needs to be introduced slowly and opt-in at first. Later we may change the default, though.
Joel and Mike - thanks for your thoughts.

> it needs to be introduced slowly and opt-in at first

ACK. My understanding is that the cloud-config flag covers that expectation upstream, and on OCP we can gate it until we think it's ready to default to the SG enforced by CCCMO (current proposal).

Looks like we have a plan/scope defined for this EP. My takeaways from this thread and the Slack conversation are:

- we are introducing a global cloud-config option in the CCM allowing opt-in to the managed SG by default across all Services of type LoadBalancer with NLB
- in a later phase (still in this EP) we are introducing/enabling a BYO SG annotation on NLB, and this one will be available at the Service level (not planning to change CIO/Installer)
- we don't need an additional annotation to opt out of the SG at the Service level
- we are not introducing a backend/shared SG, as it would be covered in long-term research - and it is not a use case we are working on in this EP

LMK if I missed something to wrap up this thread. Thanks!
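For the first takeaway, the global opt-in might look like the following cloud-config fragment. This is purely illustrative: the AWS CCM cloud config is INI-style with a `[Global]` section, but the field name below is a hypothetical placeholder, not a merged option.

```ini
# Hypothetical cloud-config fragment (field name illustrative, not final):
[Global]
# Opt in to a CCM-managed front-end Security Group for every new
# Service of type LoadBalancer provisioned as an NLB. Existing LBs
# are left untouched, per the discussion above.
NLBSecurityGroupMode = Managed
```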
> in later phase (still in this EP) we are introducing/enabling a BYO SG annotation on NLB, and this one will be available in the Service level (not planning to change CIO/Installer)

I would expect users to want to be able to configure this through CIO eventually, cc @Miciah @alebedev87 who might have opinions
Otherwise all agreed
> in later phase (still in this EP) we are introducing/enabling a BYO SG annotation on NLB, and this one will be available in the Service level (not planning to change CIO/Installer)
It seems to me that this is primarily a question of the EP’s scope. From a quick review, I understand the intent of the EP is to support SG for the load balancer that sits in front of the OCP router. If that’s the case, then the cluster ingress operator should be able to determine when to apply the new annotation (which adds the BYO frontend SG) to the publishing service - similar to what we did for the subnet configuration.
However, if the EP’s scope is more generic and aims to enable frontend SG support for NLB services in CCM, then we likely don’t need to configure the router during installation (as part of this EP, can be done as a follow-up EP).
> WIP/TBReviewed

- The implementation in CCM should handle the case where the `service.beta.kubernetes.io/aws-load-balancer-managed-security-group` annotation is set to `true` but the service type is not NLB (`aws-load-balancer-type: nlb`). In this scenario, the CCM should likely log a warning mentioning the annotation is supported only on NLB.
What does the CCM do today for annotations that don't apply? I suspect it ignores them
We can use VAP downstream to prevent this
I suspect so too, so we don't need to warn/log. I will confirm the existing approach and update this thread. Thanks
Customers deploying OpenShift on AWS using Network Load Balancers (NLBs) for the default router have expressed the need for a similar security configuration as provided by Classic Load Balancers (CLBs), where a security group is created by the CCM and associated with the load balancer. This allows for more granular control over inbound and outbound traffic at the load balancer level, aligning with AWS security best practices and addressing security findings that flag the lack of security groups on NLBs provisioned by the default CCM.

The default router in OpenShift, an IngressController object managed by the Cluster Ingress Controller Operator (CIO), can be created with a Service of type LoadBalancer backed by an NLB instead of the default Classic Load Balancer (CLB) during installation by enabling it in the `install-config.yaml`. Currently, the Cloud Controller Manager (CCM), which satisfies Service resources, provisions an AWS Load Balancer of type NLB without a Security Group (SG) directly attached to it. Instead, security rules are managed on the worker nodes' security groups.
> Instead, security rules are managed on the worker nodes' security groups.
What are the benefits of relying on LB security groups over the node sg? Do we get more fine-grained rules that are managed corresponding to the services? Can we reduce the current rules on compute nodes?
> What are the benefits of relying on LB security groups over the node sg? Do we get more fine-grained rules that are managed corresponding to the services?

Users can improve security rules by targeting the LB only, instead of opening rules on the nodes' SG. It is also a best practice to associate an SG with an NLB (minimum-privileges approach):

> "We recommend that you associate a security group with your Network Load Balancer when you create it."

> Can we reduce the current rules on compute nodes?

I don't think this would be a primary goal, but we can review whether there are duplicated/unused rules on the nodes' SG.

Action item: I will keep this thread open to make sure this is reflected in the EP.
#### ROSA Classic

- TBD: API changes in Hive to read and process the new install-config option.
Hive itself should not need any changes, as we consume the install-config.yaml as a black box. I think Clusters Service is the component that would deal with this.
That said, it is customary for us to create a QE-only HIVE card to be executed once the upstream work is done, just to make sure we didn't miss anything.
Thanks @2uasimojo , I will make sure this is an action item in the test phase.
Thanks @patrickdillon and @JoelSpeed for the review/suggestions. Hopefully I've addressed your questions.

Perhaps we could focus on the thread where it is suggested to change the CCM configuration to enable SGs by default on NLBs? If this is the path forward for this EP (I personally think it is an excellent idea), we could decrease the scope of changes in many components here.

Please let me know your thoughts.
Thank you all for the feedback. The EP has been revised per the comments, updating the proposal to limit changes to the CCM by introducing a cloud-config (global configuration) option to opt in to the managed front-end security group when creating a Service of type LoadBalancer with NLB, allowing CCCMO to enforce the default on OpenShift. The proposal also introduces an optional Service annotation for BYO SGs that opts out of the managed SG. This PR is ready for review.
this is reading well to me, we probably need to chat about the TBD items but i have a couple suggestions/questions.
AWS [announced support for Security Groups when deploying an NLB in August 2023][nlb-supports-sg], but the CCM for AWS (within kubernetes/cloud-provider-aws) does not currently implement the feature of automatically creating and managing security groups for `Service` resources of type LoadBalancer using NLBs. While the [AWS Load Balancer Controller (ALBC/LBC)][aws-lbc] project already supports deploying security groups for NLBs, this enhancement focuses on adding minimal, opt-in support to the existing CCM to address immediate customer needs without a full migration to the LBC. This approach aims to provide the necessary functionality without requiring significant changes in other OpenShift components like the Ingress Controller, installer, ROSA, etc.

Using a Network Load Balancer is a recommended network-based Load Balancer by AWS, and attaching a Security Group to an NLB is a security best practice. NLBs also do not support attaching security groups after they are created.
the beginning of this sentence is a little confusing:
Using a Network Load Balancer is a recommended network-based Load Balancer by AWS,
is this saying that NLB is the recommended way to do load balancing?
it's the recommended way for network-based LBs. Currently AWS offers two LBs replacing ELB/Classic (the CCM default): NLB (network-based) and ALB (application-based). So the idea is to mention that NLB is the recommended one. Do you think I need to state that replacement to improve the reading?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that makes sense, perhaps to make the sentence clearer you could say:
Using a Network Load Balancer is a recommended network-based Load Balancer by AWS, and attaching a Security Group to an NLB is a security best practice. NLBs also do not support attaching security groups after they are created.
Using a Network Load Balancer, as opposed to an Application Load Balancer, is the recommended way to do network-based load balancing by AWS, and attaching a Security Group to an NLB is a security best practice. NLBs also do not support attaching security groups after they are created.
is that accurate?
Hey @elmiko, what about this?
Using a Network Load Balancer is a recommended network-based Load Balancer by AWS, and attaching a Security Group to an NLB is a security best practice. NLBs also do not support attaching security groups after they are created.
Using a Network Load Balancer, as opposed to a Classic Load Balancer, is the recommended way to do network-based load balancing by AWS, and attaching a Security Group to an NLB is a security best practice. NLBs also do not support attaching security groups after they are created.
We can compare NLB with CLB.
- a) decreases the amount of provider-specific changes on CIO;
- b) decreases the amount of maintained code/projects by the team (e.g., ALBC);
- c) enhances new configurations to the Ingress Controller when using NLB;
- d) decreases the amount of images in the core payload;
is this decrease in reference to the ALBC?
Correct, ALBC + ALBO would be required if CIO defaults to ALBC
i might say this as "does not increase the amount of images in the core payload"
**Phase 1: CCM Support managed security group for Service type-LoadBalancer NLB**

- Implement support of cloud provider configuration on CCM to managed Security Group by default when creating resource Service type-LoadBalancer NLB.
will there be a feature gate in phase 1?
Yes, feature gate is already created/merged: openshift/api#2354
Wasn't sure if I needed to mention here, I added a reminder in the graduation criteria: https://github.com/openshift/enhancements/pull/1802/files#diff-84882e6fc6fb023742b0ac09960b79620cfea983c45def4739a89fd404cdc05aR359
TODO/reminder: mention the FG when we have the target (DP or TP).
- When the configuration is present in the NLB flow, the CCM will:
  - Create a new Security Group instance for the NLB. The name should follow a convention like `k8s-elb-a<generated-name-from-service-uid>`.
  - Create Ingress and Egress rules in the Security Group based on the NLB Listeners' and Target Groups' ports. Egress rules should be restricted to the necessary ports for backend communication (traffic and health check ports).
  - Delete the Security Group when the corresponding service is deleted.
do we need any extra checks for the SG's to make sure they are deleted on destruction of a cluster? (i'm thinking about when we destroy clusters using the installer)
AFAIK no. The SGs must have the cluster-owned tags discovered by the installer on destroy, like the regular resources created by CCM/Service/Ingresses.
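For context on the destroy flow mentioned above: the installer discovers cluster-owned resources by the well-known `kubernetes.io/cluster/<infraID>: owned` tag, so security groups created by the CCM would be swept as long as they carry it. A minimal sketch of the tag pair (the infra ID below is hypothetical):

```go
package main

import "fmt"

// clusterOwnedTag returns the tag key/value pair the installer uses to
// discover cluster-owned resources on destroy. Resources created by the
// CCM (load balancers, and the security groups proposed here) carry it.
func clusterOwnedTag(infraID string) (key, value string) {
	return "kubernetes.io/cluster/" + infraID, "owned"
}

func main() {
	// Hypothetical infra ID, for illustration only.
	k, v := clusterOwnedTag("mycluster-abc12")
	// An EC2 DescribeSecurityGroups filter name would be "tag:" + k.
	fmt.Printf("tag:%s=%s\n", k, v)
	// → tag:kubernetes.io/cluster/mycluster-abc12=owned
}
```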
- Change the default OpenShift IPI install flow when deploying the default router using IPI (users still need to explicitly set the `lbType` configuration to `nlb` to automatically consume this feature).
- Change any ROSA code base, both HCP or Classic, to support this feature.

## Proposal
these changes are going to be proposed upstream for the CCM?
Correct, everything related to CCM is upstream first.
TODO review the following items:

- The Security Group naming convention should be consistent and informative, including the cluster ID, namespace, and service name to aid in identification and management in the AWS console. (TODO: need review, perhaps we just create the same name as LB?)
keeping the names consistent between Service, load balancer, and security group seems like a good option to make this easier to triage.
The big question here is if we'll need consistency with ELB/Classic (current CCM), or change in CCM to adapt to the pattern used in ALBC?
IIRC ALBC uses a different naming convention on NLBs, I need to check if SGs follows the same pattern.
I will keep this thread open to bring more information.
sounds good, we should probably follow the ALBC behavior where we can, but also try not to surprise the user.
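For reference in this naming thread, the convention quoted in the proposal (`k8s-elb-a<generated-name-from-service-uid>`) follows the upstream cloud-provider default load balancer name. A sketch of how the SG name could be derived, assuming the `k8s-elb-` prefix and the UID-based truncation stay as in the current CLB path (the Service UID below is hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// defaultLoadBalancerName mirrors the upstream cloud-provider convention:
// "a" + the Service UID with dashes removed, truncated to 32 characters.
func defaultLoadBalancerName(serviceUID string) string {
	name := "a" + strings.ReplaceAll(serviceUID, "-", "")
	if len(name) > 32 {
		name = name[:32]
	}
	return name
}

// securityGroupName applies the "k8s-elb-" prefix quoted in the proposal
// (`k8s-elb-a<generated-name-from-service-uid>`). Whether NLB SGs keep
// this CLB-era prefix or adopt an ALBC-style name is still open above.
func securityGroupName(serviceUID string) string {
	return "k8s-elb-" + defaultLoadBalancerName(serviceUID)
}

func main() {
	// Hypothetical Service UID, for illustration only.
	fmt.Println(securityGroupName("2abf4d05-7392-11ec-9fa4-0ab2e061c6bc"))
	// → k8s-elb-a2abf4d05739211ec9fa40ab2e061c6b
}
```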
## Alternatives (Not Implemented)

> TODO/TBD
i think it's worth mentioning the idea of making the ALBC functionality into a module that can be imported into the CCM as something we should investigate for the future.
Co-authored-by: Michael McCune <[email protected]>
## Graduation Criteria

> TODO/TBD

### Dev Preview -> Tech Preview

N/A. This feature will be introduced as Tech Preview (TBReviewed).

### Tech Preview -> GA

The E2E tests should be consistently passing, and a PR will be created to enable the feature gate by default.
Expand the FG added here openshift/api#2354
Initially we've been asked to go directly to TP, but considering the impact of this change (default to SG) we are considering starting from DP. We are evaluating the velocity in upstream and how fast we can move it.
|
||
**Phase 1: CCM Support managed security group for Service type-LoadBalancer NLB** | ||
|
||
- Implement support of cloud provider configuration on CCM to managed Security Group by default when creating resource Service type-LoadBalancer NLB. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, feature gate is already created/merged: openshift/api#2354
Wasn't sure if I needed to mention here, I added a reminder in the graduation criteria: https://github.com/openshift/enhancements/pull/1802/files#diff-84882e6fc6fb023742b0ac09960b79620cfea983c45def4739a89fd404cdc05aR359
TODO/reminder: mention the FG when we have the target (DP or TP).
- Change the default OpenShift IPI install flow when deploying the default router using IPI (users still need to explicitly set the `lbType` configuration to `nlb` to automatically consume this feature). | ||
- Change any ROSA code base, both HCP or Classic, to support this feature. | ||
|
||
## Proposal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, everything related to CCM is upstream first.
- When the configuration is present in the NLB flow, the CCM will: | ||
- Create a new Security Group instance for the NLB. The name should follow a convention like `k8s-elb-a<generated-name-from-service-uid>`. | ||
- Create Ingress and Egress rules in the Security Group based on the NLB Listeners' and Target Groups' ports. Egress rules should be restricted to the necessary ports for backend communication (traffic and health check ports). | ||
- Delete the Security Group when the corresponding service is deleted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK no. The SGs must have the cluster owned tags discovered by installer on destroy, like the regular resources created by CCM/Service/Ingresses.
|
||
TODO review the following items: | ||
|
||
- The Security Group naming convention should be consistent and informative, including the cluster ID, namespace, and service name to aid in identification and management in the AWS console. (TODO: need review, perhaps we just create the same name as LB?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The big question here is if we'll need consistency with ELB/Classic (current CCM), or change in CCM to adapt to the pattern used in ALBC?
IIRC ALBC uses a different naming convention on NLBs, I need to check if SGs follows the same pattern.
I will keep this thread open to bring more information.
@mtulio: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
> WIP/TBReviewed

- **Increased complexity in CCM**: Adding security group management to CCM increases its complexity. Mitigation: Focus on a minimal and well-tested implementation, drawing inspiration from the existing CLB security group management logic in CCM.
- **Potential for inconsistencies with ALBC**: If users later decide to migrate to ALBC, there might be inconsistencies in how security groups are managed. Mitigation: Clearly document the limitations of this approach and the benefits of using ALBC for more advanced scenarios or a broader range of features.
Shouldn't we try to stay consistent with ALBC? Maybe I'm missing some details, but the CCM seems to be in a situation similar to ALBC when it started implementing the SG feature for NLBs before the 2.6.0 release. Even if the scope of this EP will be only the front-end SG, it can follow the same pattern (auto-manage by default, BYO if annotation is used) and use the same knob(s) (annotation(s)).
https://issues.redhat.com/browse/OCPSTRAT-1553
https://issues.redhat.com/browse/SPLAT-2137