SPLAT-2137: Support Security Group on NLB for Default router on AWS #1802
Conversation
@mtulio: This pull request references SPLAT-2137, which is a valid Jira issue.
Thanks for sharing this enhancement - it's great to see this progress ahead.
I'm just starting to review but had basic questions on the summary so want to wait before I proceed.
this is generally making sense to me, i've left some comments and questions.
- Configure Ingress rules in the Security Group to allow traffic on the ports defined in the Service's `spec.ports`. The source for these rules will be determined by the `service.beta.kubernetes.io/load-balancer-source-ranges` annotation on the Service (if present, otherwise default to allowing from all IPs).
- Configure Egress rules in the Security Group to allow traffic to the backend pods on the targetPort specified in the Service's `spec.ports` and the health check port. Initially, this should be restricted to the cluster's VPC CIDR or the specific CIDRs of the worker nodes.
- When creating the NLB using the AWS ELBv2 API, the CCM will include the ID of the newly created Security Group in the `SecurityGroups` parameter of the `CreateLoadBalancerInput`.
- When the Service is deleted, the CCM will also delete the associated Security Group, ensuring proper cleanup.
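The ingress-rule step above could be sketched roughly as follows. This is an illustrative sketch only, not the actual CCM implementation: the `ServicePort`/`IpPermission` types are simplified stand-ins for the Kubernetes Service port spec and the EC2 `IpPermission` structure, and the helper names are assumptions.

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes ServicePort and EC2 IpPermission types.
type ServicePort struct {
	Port     int64
	Protocol string
}

type IpPermission struct {
	Protocol string
	FromPort int64
	ToPort   int64
	CIDRs    []string
}

// buildIngressPermissions opens each Service port to the configured source
// ranges, defaulting to 0.0.0.0/0 when no load-balancer-source-ranges are
// set, mirroring the behaviour described in the bullet above.
func buildIngressPermissions(ports []ServicePort, sourceRanges []string) []IpPermission {
	if len(sourceRanges) == 0 {
		sourceRanges = []string{"0.0.0.0/0"}
	}
	perms := make([]IpPermission, 0, len(ports))
	for _, p := range ports {
		perms = append(perms, IpPermission{
			Protocol: p.Protocol,
			FromPort: p.Port,
			ToPort:   p.Port,
			CIDRs:    sourceRanges,
		})
	}
	return perms
}

func main() {
	perms := buildIngressPermissions([]ServicePort{{Port: 443, Protocol: "tcp"}}, nil)
	fmt.Println(perms[0].CIDRs[0], perms[0].FromPort)
}
```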
what happens if this annotation is added after the Service is created? (ie what happens on update)
I am working on it, ensuring I follow the current state of the CCM alongside the ALBC so I can document it correctly. Thanks for raising that question.
We need to be able to answer these questions for upstream, but downstream we could prevent those transitions with VAP
- Logic in the service controller within the CCM (`pkg/providers/v1/aws.go` and `pkg/providers/v1/aws_loadbalancer.go`) to recognize and handle the new annotation when the service type is `NLB` (`ServiceAnnotationLoadBalancerType = "service.beta.kubernetes.io/aws-load-balancer-type"`).
- Functionality within the CCM to create and manage the lifecycle of AWS Security Groups for NLBs, including creating ingress and egress rules based on the service specification. This would likely involve using the AWS SDK for Go to interact with the EC2 API for creating and managing security groups.
i wonder if the upstream changes to CCM will need to be behind a feature gate?
given that this will be a new feature to the CCM that the ALBO already controls, i'm guessing that we will need a way to ensure that users who run CCM and ALBO in the same cluster can control the behavior. perhaps a flag to the CCM.
Good catch, Mike, thanks! That makes sense. I will look into how the CCM handles this and document it here.
Cross-ref to the thread that we could enhance the config to improve the EP goal.
- The CCM's service controller will watch for Service creations and updates.
- When it encounters a Service with the annotation `service.beta.kubernetes.io/aws-load-balancer-managed-security-group: "true"` and `service.beta.kubernetes.io/aws-load-balancer-type: nlb`, the CCM will:
  - Create a new AWS Security Group for the NLB. The name should follow a convention like `k8s-elb-a<generated-name-from-service-uid>`.
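For illustration, the naming convention referenced above could be sketched as below. This is a hedged sketch, not the CCM's actual code: it assumes the legacy cloud-provider convention of deriving the LB name from "a" plus the Service UID with dashes removed, truncated to 32 characters, and the helper names are invented for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultLoadBalancerName sketches the assumed legacy convention:
// "a" + Service UID with dashes stripped, truncated to 32 characters.
func defaultLoadBalancerName(serviceUID string) string {
	name := "a" + strings.ReplaceAll(serviceUID, "-", "")
	if len(name) > 32 {
		name = name[:32]
	}
	return name
}

// securityGroupName prefixes the LB name with "k8s-elb-", matching the
// `k8s-elb-a<generated-name-from-service-uid>` convention quoted above.
func securityGroupName(serviceUID string) string {
	return "k8s-elb-" + defaultLoadBalancerName(serviceUID)
}

func main() {
	fmt.Println(securityGroupName("2208b9d6-5c4d-4f9e-8a1b-0123456789ab"))
}
```

By contrast, the ALBC pattern mentioned in this thread (`k8s-<namespace>-<service_name>-<id>`) encodes the namespace and Service name instead of the UID.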
> The name should follow a convention like `k8s-elb-a<generated-name-from-service-uid>`

This is an interesting point: the convention the CCM uses to create NLBs from Services is different from the ALBC's, which follows the pattern `k8s-<namespace>-<service_name>-<id>`.
Furthermore, I see NLB tags aren't standardized either:

CCM:
- `kubernetes.io/cluster/clusterID: owned`
- `kubernetes.io/service-name: namespace/service-name`

ALBC:
- `elbv2.k8s.aws/cluster: clusterID`
- `service.k8s.aws/resource: LoadBalancer`
- `service.k8s.aws/stack: namespace/service-name`
Question to @JoelSpeed @elmiko - do we want to standardize the NLB tags between controllers too?

IIUC `kubernetes.io/cluster/clusterID: owned` was not added in my ALBC exploration because the service was created by ALBO/ALBC, which seems not to enforce cluster tags.
region: us-east-1
lbType: NLB <-- deprecated by platform.aws.ingressController.loadBalancerType?
ingressController: <-- proposing to aggregate CIO configurations
  securityGroupEnabled: True <-- new field
What if I want to have different security groups for ingress vs the rest of the cluster? Is that possible?
Do we need the option for this to be automatic (use the same as you'd expect for default) but also a BYO option where users can specify specific SG IDs to be used?
> What if I want to have different security groups for ingress vs the rest of the cluster? Is that possible?

Would you mind elaborating? I am not sure I followed correctly, as the proposal already adds a dedicated SG to the NLB, separate from the rest of the cluster.

> Do we need the option for this to be automatic (use the same as you'd expect for default) but also a BYO option where users can specify specific SG IDs to be used?

That's a fair point, but I am not sure we have a customer use case for BYO SG on CIO, and I also wonder if supporting BYO SG would diverge from the main focus of this EP: enabling NLB with a security group.

BYO SG would increase the implementation scope a bit, especially in the CCM. IIUC, by definition, when SG IDs are added (BYO SG) through annotations, the CCM (Classic LB), or ALBC, won't manage those SGs' lifecycle. The ALBC also provides an extra annotation (`manage-backend-security-group-rules`) to allow managing node rules:

> If you specify this annotation, you need to configure the security groups on your Node/Pod to allow inbound traffic from the load balancer. You could also set the manage-backend-security-group-rules if you want the controller to manage the access rule

So what we are targeting is providing the initial ability to enable an SG on the NLB, similar to how a CLB is deployed by default, as requested by Managed Services. I am thinking any additional feature/parity with the ALBC would fall into the long-term planning we've been discussing with PMs. Do you think we could phase it? Thoughts?
in the latest version I added the BYO SG workflow as a later phase, as an opt-in on the Service object, removing the installer/CIO option/API.
annotations:
  service.beta.kubernetes.io/aws-load-balancer-type: nlb
  service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
  service.beta.kubernetes.io/aws-load-balancer-managed-security-group: "true" <-- new annotation
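In context, a full Service manifest carrying these annotations might look like the following. This is an illustrative sketch only: the `managed-security-group` annotation is the one proposed (not yet merged) in this EP, and the name, namespace, and selector are assumed values, not taken from the EP.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: router-default          # hypothetical example Service
  namespace: openshift-ingress
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    # Proposed opt-in for a CCM-managed front-end Security Group:
    service.beta.kubernetes.io/aws-load-balancer-managed-security-group: "true"
spec:
  type: LoadBalancer
  selector:
    app: router                  # hypothetical selector
  ports:
  - name: https
    port: 443
    targetPort: 443
```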
What does the annotation scheme look like in the AWS LBC? I thought it just allowed you to specify IDs
I think the upstream change to the CCM wants to mimic the behaviour described in https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/guide/service/annotations/#security-groups
Is our described behaviour here compatible with that, if not, have we deliberately deviated from that pattern?
> Is our described behaviour here compatible with that

It is not. The proposed annotation `service.beta.kubernetes.io/aws-load-balancer-managed-security-group` is not the same as the BYO SG annotations. To recap the BYO SG annotations:

On ALBC:
- `service.beta.kubernetes.io/aws-load-balancer-security-groups`
- `alb.ingress.kubernetes.io/frontend-nlb-security-groups` (I didn't really understand how it differs from the previous one)

On CCM:
- `service.beta.kubernetes.io/aws-load-balancer-security-groups`
- `service.beta.kubernetes.io/aws-load-balancer-extra-security-groups`

> , if not, have we deliberately deviated from that pattern?

Yes, this intentionally proposes a new annotation to signal the CCM to manage the SG for NLBs (allowing users to transition to this config: opt-in). It was added mainly to prevent changing the default behavior of the CCM when provisioning NLBs.

AFAICT the ALBC does not provide this option, as it has defaulted to SGs since v2.6.0 (Aug 10, 2023), and it's not possible to disable it (?).

Alternatively, I can see:
- Explicitly changing the default behavior of NLB to always create SGs (do we want that?)

I believe we can converge on the thread https://github.com/openshift/enhancements/pull/1802/files#r2111532244 where you mentioned the transition and suggested configuration changes.
In the latest version of this EP we are moving to a global configuration (cloud-config) for the CCM, enforced in OpenShift by CCCMO, instead of a "managed" annotation as described above.

The BYO SG flow is also covered in a later phase of this EP, ensuring customers can opt out of the enforced managed SG on NLBs, following the existing ALBC flow.
// ServiceAnnotationLoadBalancerManagedSecurityGroup is the annotation used
// on the service to instruct the CCM to manage the security group when creating a Network Load Balancer. When enabled,
// the CCM creates the security group and its rules. This option cannot be used with the annotations
// "service.beta.kubernetes.io/aws-load-balancer-security-groups" and "service.beta.kubernetes.io/aws-load-balancer-extra-security-groups".
const ServiceAnnotationLoadBalancerManagedSecurityGroup = "service.beta.kubernetes.io/aws-load-balancer-managed-security-group"
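The mutual-exclusion constraint in that comment could be enforced with a check along these lines. This is a hedged sketch, not the actual CCM code: the annotation keys come from the EP text, but `validateSGAnnotations` and its shape are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Annotation keys quoted from the EP; the validation helper is illustrative.
const (
	annManagedSG = "service.beta.kubernetes.io/aws-load-balancer-managed-security-group"
	annSGs       = "service.beta.kubernetes.io/aws-load-balancer-security-groups"
	annExtraSGs  = "service.beta.kubernetes.io/aws-load-balancer-extra-security-groups"
)

// validateSGAnnotations rejects combining the proposed managed-SG annotation
// with the existing BYO security-group annotations, as the doc comment requires.
func validateSGAnnotations(annotations map[string]string) error {
	if annotations[annManagedSG] != "true" {
		return nil
	}
	if _, ok := annotations[annSGs]; ok {
		return errors.New(annManagedSG + " cannot be combined with " + annSGs)
	}
	if _, ok := annotations[annExtraSGs]; ok {
		return errors.New(annManagedSG + " cannot be combined with " + annExtraSGs)
	}
	return nil
}

func main() {
	err := validateSGAnnotations(map[string]string{
		annManagedSG: "true",
		annSGs:       "sg-0123456789abcdef0",
	})
	fmt.Println(err != nil)
}
```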
So this doesn't exist in LBC right? Is this being introduced to allow a transition from a CCM where it does not currently create a security group, to enabling users to opt-in to creating security groups?
Have you considered if it might be better to make this a CCM configuration that an admin would set for the cluster, rather than setting it for each service?
I could see in the future OpenShift changing the default to say that all new NLBs should have a security group created automatically for them
> So this doesn't exist in LBC right? Is this being introduced to allow a transition from a CCM where it does not currently create a security group, to enabling users to opt-in to creating security groups?

Yes and yes. The idea was to avoid disrupting the existing flow when creating services with NLB.

> Have you considered if it might be better to make this a CCM configuration that an admin would set for the cluster, rather than setting it for each service?

I didn't, but this is an excellent idea. It would greatly decrease the amount of API changes proposed in this EP, furthermore helping us in the future by (if) transitioning to the ALBC.

@elmiko mentioned requiring the CCM changes to be under a feature gate; what if we introduce an FG that enables SGs by default when provisioning NLBs in the CCM, so we can enable it on OCP and remove most of the API proposals, and annotations, in this EP?

It would also decrease the UX overhead and laser-focus on the initial problem.

Would the workflow look like the following options (superficially)?

openshift-install:
- user sets `platform.aws.lbType` to `NLB` (currently opt-in)
- CCM config is added on OCP deployments (do we need/expose it through installer manifests?)
- CCM creates the SG when the gate is enabled when provisioning an NLB

ROSA Classic or HCP:
- ensure the CCM config is updated (or will it be enabled by default when the API FG is set?)
- (same CCM flow)

No changes in CIO.

Does that make sense?
I just finished the exploration, and this is the main idea (tl;dr):

- Create a new configuration in the cloud config (upstream CCM). Example
- Enforce the configuration in the CCCMO. Example

Once a new Service of type LoadBalancer with NLB is created, the controller will manage a Security Group, attaching it to the new LB.
I think we want to follow the pattern set out in LBC (https://kubernetes-sigs.github.io/aws-load-balancer-controller/latest/deploy/security_groups/#security-groups-for-load-balancers)
Which means:
- `service.beta.kubernetes.io/aws-load-balancer-security-groups` on the service allows a user to specify a pre-existing set of security groups to attach to the front-end of the LB
- If the annotation is not set, create and manage a front-end security group for each LB automatically

We don't want to just enable this create-and-manage front-end SG by default, since that would be a major change.
So, this is where the CCM config option would come in, and allow users to opt-in/out of having a default security group created for each service.
I think that mostly aligns with your suggestions above in this thread, but I think we still want to have the annotation to allow the user to override the behaviour?
Do we need to also account for the shared backend SG behaviour of LBC?
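The decision flow described here (BYO annotation wins, otherwise the cloud-config option decides) could be sketched as follows. This is a rough sketch of the proposed behaviour only; `sgDecision` and `decideFrontendSG` are invented names, not part of the CCM.

```go
package main

import "fmt"

// sgDecision captures what the controller would do for a new NLB Service.
type sgDecision struct {
	Managed bool     // CCM creates, attaches, and later deletes the SG
	SGIDs   []string // pre-existing SGs supplied by the user (BYO)
}

// decideFrontendSG: a BYO annotation always wins; otherwise the global
// cloud-config option decides whether a managed front-end SG is created.
func decideFrontendSG(byoAnnotation string, manageByDefault bool) sgDecision {
	if byoAnnotation != "" {
		// User-supplied SGs: attach them, but do not manage their lifecycle.
		return sgDecision{Managed: false, SGIDs: []string{byoAnnotation}}
	}
	if manageByDefault {
		// Cloud-config opt-in: create and manage a front-end SG per LB.
		return sgDecision{Managed: true}
	}
	// Legacy behaviour: NLB without a directly attached SG.
	return sgDecision{Managed: false}
}

func main() {
	fmt.Println(decideFrontendSG("", true).Managed)       // opt-in, no BYO
	fmt.Println(decideFrontendSG("sg-abc", true).Managed) // BYO overrides
}
```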
> but I think we still want to have the annotation to allow the user to override the behaviour?

Are you referring to opting out of / overriding the global config that manages front-end SG creation (the proposal of this EP) without a BYO SG approach? Do we need, or have a strong reason/use case, to do so, considering that assigning an SG to an NLB is a best practice/recommendation? I also wonder if we would be going against the ALBC strategy used in v2.6.0+ (I really didn't find a configuration to opt out of SGs on NLBs in recent ALBC versions).

> Do we need to also account for the shared backend SG behaviour of LBC?

I think it would benefit clusters with a high number of services, but if we don't have a strong use case for it in the short term, I would not increase the number of features to incorporate into the CCM in this EP, as the long-term approach on OCP is TBD.

LMK WDYT +@elmiko
> once enabled, will the ccm try to create SGs for older LB services?

That's not the idea; only the new Services/LBs, especially since NLBs support SGs only at creation, and we don't want to roll out all services when this feature is enabled - at least it was not discussed yet, as it would disrupt user workloads.

> is this saying we don't want to autocreate SGs once the feature is enabled?

Defer to @JoelSpeed to expand; see my #1802 (comment)

> is it realistic to expect that we can address the backend SG stuff in another change?

I believe yes. My opinion is to focus on critical issues now (following ALBC behaviour); long-term discussions will figure out those features and inherit implementations/improvements added to the ALBC. WDYT @elmiko @JoelSpeed ?
> once enabled, will the ccm try to create SGs for older LB services?

It is important not to change the old services; they should remain without the SG.

> is this saying we don't want to autocreate SGs once the feature is enabled?

We should not just blanketly change this in the upstream CCM; it needs to be introduced slowly and opt-in at first. Later we may change the default, though.
Joel and Mike - thanks for your thoughts.

> it needs to be introduced slowly and opt-in at first

ACK. My understanding is that the cloud-config flag covers that expectation upstream, and on OCP we can gate it until we think it's ready to default to the SG enforced by CCCMO (current proposal).

Looks like we have a plan/scope defined for this EP. My takeaways from this thread and the Slack conversation are:

- we are introducing a global cloud-config option in the CCM allowing opt-in to the managed SG by default across all Services of type LoadBalancer with NLB
- in a later phase (still in this EP) we are introducing/enabling a BYO SG annotation on NLB, and this one will be available at the Service level (not planning to change CIO/Installer)
- we don't need an additional annotation to opt out of the SG at the Service level
- we are not introducing a backend/shared SG, as it would be covered in long-term research - and it is not a use case we are working on in this EP

LMK if I missed something to wrap up this thread. Thanks!
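For the first takeaway, the global opt-in might look like the following cloud-config fragment. This is purely illustrative: the AWS CCM cloud config is INI-style with a `[Global]` section, but the field name below is a hypothetical placeholder, not a merged option.

```ini
# Hypothetical cloud-config fragment (field name illustrative, not final):
[Global]
# Opt in to a CCM-managed front-end Security Group for every new
# Service of type LoadBalancer provisioned as an NLB. Existing LBs
# are left untouched, per the discussion above.
NLBSecurityGroupMode = Managed
```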
> in later phase (still in this EP) we are introducing/enabling a BYO SG annotation on NLB, and this one will be available in the Service level (not planning to change CIO/Installer)

I would expect users to want to be able to configure this through CIO eventually, cc @Miciah @alebedev87 who might have opinions
Otherwise all agreed
> in later phase (still in this EP) we are introducing/enabling a BYO SG annotation on NLB, and this one will be available in the Service level (not planning to change CIO/Installer)
It seems to me that this is primarily a question of the EP’s scope. From a quick review, I understand the intent of the EP is to support SG for the load balancer that sits in front of the OCP router. If that’s the case, then the cluster ingress operator should be able to determine when to apply the new annotation (which adds the BYO frontend SG) to the publishing service - similar to what we did for the subnet configuration.
However, if the EP’s scope is more generic and aims to enable frontend SG support for NLB services in CCM, then we likely don’t need to configure the router during installation (as part of this EP, can be done as a follow-up EP).
> WIP/TBReviewed

- The implementation in CCM should handle the case where the `service.beta.kubernetes.io/aws-load-balancer-managed-security-group` annotation is set to `true` but the service type is not NLB (`aws-load-balancer-type: nlb`). In this scenario, the CCM should likely log a warning mentioning the annotation is supported only on NLB.
What does the CCM do today for annotations that don't apply? I suspect it ignores them
We can use VAP downstream to prevent this
I suspect so too, so we don't need to warn/log. I will confirm the existing approach and update this thread. Thanks
Customers deploying OpenShift on AWS using Network Load Balancers (NLBs) for the default router have expressed the need for a similar security configuration as provided by Classic Load Balancers (CLBs), where a security group is created by the CCM and associated with the load balancer. This allows for more granular control over inbound and outbound traffic at the load balancer level, aligning with AWS security best practices and addressing security findings that flag the lack of security groups on NLBs provisioned by the default CCM.

The default router in OpenShift, an IngressController object managed by the Cluster Ingress Controller Operator (CIO), can be created with a Service of type LoadBalancer backed by an NLB instead of the default Classic Load Balancer (CLB) during installation by enabling it in the `install-config.yaml`. Currently, the Cloud Controller Manager (CCM), which satisfies Service resources, provisions an AWS Load Balancer of type NLB without a Security Group (SG) directly attached to it. Instead, security rules are managed on the worker nodes' security groups.
> Instead, security rules are managed on the worker nodes' security groups.
What are the benefits of relying on LB security groups over the node sg? Do we get more fine-grained rules that are managed corresponding to the services? Can we reduce the current rules on compute nodes?
> What are the benefits of relying on LB security groups over the node sg? Do we get more fine-grained rules that are managed corresponding to the services?

Users can improve security rules by targeting the LB only, instead of opening rules on the nodes' SG. It is also a best practice to associate an SG with an NLB (minimum-privileges approach):

> "We recommend that you associate a security group with your Network Load Balancer when you create it."

> Can we reduce the current rules on compute nodes?

I don't think this would be a primary goal, but we can review whether there are duplicated/unused rules on the nodes' SG.

Action item: I will keep this thread open to make sure this is reflected in the EP.
#### ROSA Classic

- TBD: API changes in Hive to read and process the new install-config option.
Hive itself should not need any changes, as we consume the install-config.yaml as a black box. I think Clusters Service is the component that would deal with this.
That said, it is customary for us to create a QE-only HIVE card to be executed once the upstream work is done, just to make sure we didn't miss anything.
Thanks @2uasimojo , I will make sure this is an action item in the test phase.
Thanks @patrickdillon and @JoelSpeed for the review/suggestions. Hopefully I've addressed your questions.

Perhaps we could focus on the thread where it is suggested to change the CCM configuration to enable SGs by default on NLBs? If this is the path forward for this EP (I personally think it is an excellent idea), we could decrease the scope of changes in many components here.

Please let me know your thoughts.
Thank you all for the feedback. The EP has been revised per the comments, updating the proposal to limit changes to the CCM by introducing a cloud-config (global configuration) option to opt in to the managed front-end security group when creating a Service of type LoadBalancer with NLB, allowing CCCMO to enforce the default on OpenShift. The proposal also introduces an optional Service annotation for BYO SGs that opts out of the managed SG. This PR is ready for review.
this is reading well to me, we probably need to chat about the TBD items but i have a couple suggestions/questions.
AWS [announced support for Security Groups when deploying an NLB in August 2023][nlb-supports-sg], but the CCM for AWS (within kubernetes/cloud-provider-aws) does not currently implement the feature of automatically creating and managing security groups for `Service` resources of type LoadBalancer using NLBs. While the [AWS Load Balancer Controller (ALBC/LBC)][aws-lbc] project already supports deploying security groups for NLBs, this enhancement focuses on adding minimal, opt-in support to the existing CCM to address immediate customer needs without a full migration to the LBC. This approach aims to provide the necessary functionality without requiring significant changes in other OpenShift components like the Ingress Controller, installer, ROSA, etc.

Using a Network Load Balancer is a recommended network-based Load Balancer by AWS, and attaching a Security Group to an NLB is a security best practice. NLBs also do not support attaching security groups after they are created.
the beginning of this sentence is a little confusing:
Using a Network Load Balancer is a recommended network-based Load Balancer by AWS,
is this saying that NLB is the recommended way to do load balancing?
it's the recommended way for network-based LBs. Currently AWS offers two LBs replacing ELB/Classic (the CCM default): NLB (network-based) and ALB (application-based). So the idea is to mention that NLB is the recommended one. Do you think I need to state that replacement to improve the reading?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that makes sense, perhaps to make the sentence clearer you could say:
Using a Network Load Balancer is a recommended network-based Load Balancer by AWS, and attaching a Security Group to an NLB is a security best practice. NLBs also do not support attaching security groups after they are created.
Using a Network Load Balancer, as opposed to an Application Load Balancer, is the recommended way to do network-based load balancing by AWS, and attaching a Security Group to an NLB is a security best practice. NLBs also do not support attaching security groups after they are created.
is that accurate?
Hey @elmiko, what about this?
Using a Network Load Balancer is a recommended network-based Load Balancer by AWS, and attaching a Security Group to an NLB is a security best practice. NLBs also do not support attaching security groups after they are created.
Using a Network Load Balancer, as opposed to a Classic Load Balancer, is the recommended way to do network-based load balancing by AWS, and attaching a Security Group to an NLB is a security best practice. NLBs also do not support attaching security groups after they are created.
We can compare NLB with CLB.
- a) decreases the amount of provider-specific changes on CIO;
- b) decreases the amount of maintained code/projects by the team (e.g., ALBC);
- c) enhances new configurations to the Ingress Controller when using NLB;
- d) decreases the amount of images in the core payload;
is this decrease in reference to the ALBC?
Correct, ALBC + ALBO would be required if CIO defaults to ALBC
i might say this as "does not increase the amount of images in the core payload"
**Phase 1: CCM Support managed security group for Service type-LoadBalancer NLB**

- Implement support of cloud provider configuration on CCM to managed Security Group by default when creating resource Service type-LoadBalancer NLB.
will there be a feature gate in phase 1?
Yes, feature gate is already created/merged: openshift/api#2354
Wasn't sure if I needed to mention here, I added a reminder in the graduation criteria: https://github.com/openshift/enhancements/pull/1802/files#diff-84882e6fc6fb023742b0ac09960b79620cfea983c45def4739a89fd404cdc05aR359
TODO/reminder: mention the FG when we have the target (DP or TP).
- When the configuration is present in the NLB flow, the CCM will:
  - Create a new Security Group instance for the NLB. The name should follow a convention like `k8s-elb-a<generated-name-from-service-uid>`.
  - Create Ingress and Egress rules in the Security Group based on the NLB Listeners' and Target Groups' ports. Egress rules should be restricted to the necessary ports for backend communication (traffic and health check ports).
  - Delete the Security Group when the corresponding service is deleted.
do we need any extra checks for the SG's to make sure they are deleted on destruction of a cluster? (i'm thinking about when we destroy clusters using the installer)
AFAIK no. The SGs must have the cluster-owned tags discovered by the installer on destroy, like the regular resources created by CCM/Service/Ingresses.
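For context on the destroy flow mentioned above: the installer discovers cluster-owned resources by the well-known `kubernetes.io/cluster/<infraID>: owned` tag, so security groups created by the CCM would be swept as long as they carry it. A minimal sketch of the tag pair (the infra ID below is hypothetical):

```go
package main

import "fmt"

// clusterOwnedTag returns the tag key/value pair the installer uses to
// discover cluster-owned resources on destroy. Resources created by the
// CCM (load balancers, and the security groups proposed here) carry it.
func clusterOwnedTag(infraID string) (key, value string) {
	return "kubernetes.io/cluster/" + infraID, "owned"
}

func main() {
	// Hypothetical infra ID, for illustration only.
	k, v := clusterOwnedTag("mycluster-abc12")
	// An EC2 DescribeSecurityGroups filter name would be "tag:" + k.
	fmt.Printf("tag:%s=%s\n", k, v)
	// → tag:kubernetes.io/cluster/mycluster-abc12=owned
}
```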
- Change the default OpenShift IPI install flow when deploying the default router using IPI (users still need to explicitly set the `lbType` configuration to `nlb` to automatically consume this feature).
- Change any ROSA code base, both HCP or Classic, to support this feature.

## Proposal
these changes are going to be proposed upstream for the CCM?
Correct, everything related to CCM is upstream first.
TODO review the following items:

- The Security Group naming convention should be consistent and informative, including the cluster ID, namespace, and service name to aid in identification and management in the AWS console. (TODO: need review, perhaps we just create the same name as LB?)
keeping the names consistent between Service, load balancer, and security group seems like a good option to make this easier to triage.
The big question here is if we'll need consistency with ELB/Classic (current CCM), or change in CCM to adapt to the pattern used in ALBC?
IIRC ALBC uses a different naming convention on NLBs, I need to check if SGs follows the same pattern.
I will keep this thread open to bring more information.
sounds good, we should probably follow the ALBC behavior where we can, but also try not to surprise the user.
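For reference in this naming thread, the convention quoted in the proposal (`k8s-elb-a<generated-name-from-service-uid>`) follows the upstream cloud-provider default load balancer name. A sketch of how the SG name could be derived, assuming the `k8s-elb-` prefix and the UID-based truncation stay as in the current CLB path (the Service UID below is hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// defaultLoadBalancerName mirrors the upstream cloud-provider convention:
// "a" + the Service UID with dashes removed, truncated to 32 characters.
func defaultLoadBalancerName(serviceUID string) string {
	name := "a" + strings.ReplaceAll(serviceUID, "-", "")
	if len(name) > 32 {
		name = name[:32]
	}
	return name
}

// securityGroupName applies the "k8s-elb-" prefix quoted in the proposal
// (`k8s-elb-a<generated-name-from-service-uid>`). Whether NLB SGs keep
// this CLB-era prefix or adopt an ALBC-style name is still open above.
func securityGroupName(serviceUID string) string {
	return "k8s-elb-" + defaultLoadBalancerName(serviceUID)
}

func main() {
	// Hypothetical Service UID, for illustration only.
	fmt.Println(securityGroupName("2abf4d05-7392-11ec-9fa4-0ab2e061c6bc"))
	// → k8s-elb-a2abf4d05739211ec9fa40ab2e061c6b
}
```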
## Alternatives (Not Implemented)

> TODO/TBD
i think it's worth mentioning the idea of making the ALBC functionality into a module that can be imported into the CCM as something we should investigate for the future.
Co-authored-by: Michael McCune <[email protected]>
## Graduation Criteria

> TODO/TBD

### Dev Preview -> Tech Preview

N/A. This feature will be introduced as Tech Preview (TBReviewed).

### Tech Preview -> GA

The E2E tests should be consistently passing, and a PR will be created to enable the feature gate by default.
Expand the FG added here openshift/api#2354
Initially we've been asked to go directly to TP, but considering the impact of this change (default to SG) we are considering starting from DP. We are evaluating the velocity in upstream and how fast we can move it.
|
||
**Phase 1: CCM Support managed security group for Service type-LoadBalancer NLB** | ||
|
||
- Implement support of cloud provider configuration on CCM to managed Security Group by default when creating resource Service type-LoadBalancer NLB. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, feature gate is already created/merged: openshift/api#2354
Wasn't sure if I needed to mention here, I added a reminder in the graduation criteria: https://github.com/openshift/enhancements/pull/1802/files#diff-84882e6fc6fb023742b0ac09960b79620cfea983c45def4739a89fd404cdc05aR359
TODO/reminder: mention the FG when we have the target (DP or TP).
- Change the default OpenShift IPI install flow when deploying the default router using IPI (users still need to explicitly set the `lbType` configuration to `nlb` to automatically consume this feature). | ||
- Change any ROSA code base, both HCP or Classic, to support this feature. | ||
|
||
## Proposal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, everything related to CCM is upstream first.
- When the configuration is present in the NLB flow, the CCM will: | ||
- Create a new Security Group instance for the NLB. The name should follow a convention like `k8s-elb-a<generated-name-from-service-uid>`. | ||
- Create Ingress and Egress rules in the Security Group based on the NLB Listeners' and Target Groups' ports. Egress rules should be restricted to the necessary ports for backend communication (traffic and health check ports). | ||
- Delete the Security Group when the corresponding service is deleted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK no. The SGs must have the cluster owned tags discovered by installer on destroy, like the regular resources created by CCM/Service/Ingresses.
|
||
TODO review the following items: | ||
|
||
- The Security Group naming convention should be consistent and informative, including the cluster ID, namespace, and service name to aid in identification and management in the AWS console. (TODO: need review, perhaps we just create the same name as LB?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The big question here is if we'll need consistency with ELB/Classic (current CCM), or change in CCM to adapt to the pattern used in ALBC?
IIRC ALBC uses a different naming convention on NLBs, I need to check if SGs follows the same pattern.
I will keep this thread open to bring more information.
@mtulio: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
> WIP/TBReviewed

- **Increased complexity in CCM**: Adding security group management to CCM increases its complexity. Mitigation: Focus on a minimal and well-tested implementation, drawing inspiration from the existing CLB security group management logic in CCM.
- **Potential for inconsistencies with ALBC**: If users later decide to migrate to ALBC, there might be inconsistencies in how security groups are managed. Mitigation: Clearly document the limitations of this approach and the benefits of using ALBC for more advanced scenarios or a broader range of features.
Shouldn't we try to stay consistent with ALBC? Maybe I'm missing some details, but the CCM seems to be in a situation similar to ALBC when it started implementing the SG feature for NLBs before the 2.6.0 release. Even if the scope of this EP will be only the front-end SG, it can follow the same pattern (auto-manage by default, BYO if annotation is used) and use the same knob(s) (annotation(s)).
https://issues.redhat.com/browse/OCPSTRAT-1553
https://issues.redhat.com/browse/SPLAT-2137