
Egress Connection Timeouts Design #63


Open · wants to merge 10 commits into main
Conversation

@rari459 rari459 commented Jan 16, 2025

As mentioned in the discussion with the Cilium community on 1/8/2025, we propose an additional timeouts field in CiliumEgressGatewayPolicy to allow users to control egress connection timeout values at the CEGP level. This document lays out approaches for designing this feature.

@rari459 rari459 changed the title Create Egress Connection Timeouts Design for Review Egress Connection Timeouts Design Jan 16, 2025
Replace the struct of connection timeouts with a single __u32 value which represents the lifetime of the connection in seconds.

Signed-off-by: rari459 <[email protected]>
Change back to a struct of connection timeout values instead of a single uint32 timeout value.

Signed-off-by: rari459 <[email protected]>

@julianwiedmann julianwiedmann left a comment


Thank you, this sort of low-level datapath optimization is an interesting idea! I left some initial thoughts inline to better understand the proposal.

Comment on lines +20 to +25

Currently, Cilium relies heavily on default operating system connection timeout settings and offers control over connection timeouts at the cluster or node level. It is not optimal for all workloads to be bound to these node-level timeouts, especially with respect to egress gateways, where prolonged idle connections can contribute to port exhaustion on the NAT gateway.

Modifying CiliumEgressGatewayPolicy to include an optional timeout field would allow us to ingest custom timeouts and give users additional control over egress connections.


Member

Some high-level comments:

  1. Do you have feedback from potential users on whether they're comfortable fiddling with such low-level values? Making this a per-CEGP configuration feels like a reasonable granularity, but I'm unsure whether admins will know what timeouts to use, and how to tune them.
  2. I haven't grasped whether you're proposing to use the custom timeouts only on the GW node, or also for the CT entry on the worker nodes?
  3. As your motivation is to avoid port exhaustion on the gateway - what interaction do you see with the CT GC engine? Getting reliable results from configuring low CT lifetimes would require an appropriate GC timer, no?

Author

  1. I will try to collect some data on potential users via the Slack Channel and update.
  2. Since the Egress Policy Map is available on both nodes, I think it makes the most sense to update timeouts for CT entries on both the source and GW nodes.
  3. Right now we do not make any changes to the GC engine. The report interval for GC is set to 5s by default and is configurable. I think this config can be left for the user and out of the scope for this feature, if the user sets low CT lifetimes they should configure the CT report interval to be lower.

Member

Right now we do not make any changes to the GC engine. The report interval for GC is set to 5s by default and is configurable.

huh, I thought we were defaulting to 0s (== dynamic interval) ?

I think this config can be left to the user and out of scope for this feature; if the user sets low CT lifetimes, they should configure the CT report interval to be lower.

But we're dealing with different personas here. Setting a per-policy GC option will only be effective if the node-wide configuration is also tuned accordingly, so I don't think we can just declare this aspect out of scope.

Author

@rari459 rari459 Feb 5, 2025

Ah yes, I confused the GC interval with the CT report interval. The CT GC interval is set dynamically and proportionally, taking into account the percentage of entries deleted from the ct_table in each cycle.

I believe this would be sufficient for the initial implementation, as it keeps the footprint of this change small: if a large number of CT entries are being deleted in each cycle on the GW node due to low CT timeouts, GC will run more frequently.

However, I agree that a more comprehensive review of the GC interval mechanism could be beneficial down the road. Perhaps we can explore that as a follow-up enhancement to this feature?



* Add an optional timeout field to CEGP.
* Modify CEGP ingestion logic to add new timeout fields to EGRESS_POLICY_MAP
Member

I believe extending the map value will require a map migration. At that point, let's consider what additional values we should place into the map value - I've wanted to store the ifindex of the egress interface for a long time :).

Author

Sounds good. Where can I get a list of proposed additional values to add to this map?

Member

I don't think such a list exists; it's something we can raise with @cilium/egress-gateway once work on an implementation has started.


Any such upgrade will require creating a new policy map at upgrade time, populating it with the realized EGW state, and then finally doing a "swap" of the maps by loading the new datapath and repinning the maps.

@julianwiedmann Aside from including ifindex for the short term, should we just invest in making that process easily repeatable in the future, instead of racking our brains trying to think of what fields we would like to add to the policy k/v?

Engineering Investment: Modify SNAT datapath and conntrack creation to ingest and write custom timeouts to egress NAT conntrack entries.


In the context of EGW connections, what would be the implication of a conntrack entry timing out via one of these timeouts?

Without a timeout at the socket level, wouldn't we possibly see hanging connections?


To broaden a bit, I think we'd need to detail how the various mechanisms for enforcing timeouts would play out in practice - especially with regards to intended use cases.

For example, are we looking for timeout functionality from/in parity with kernel sysctl settings (e.g. the ipv4 timeout sysctls), such that the connection is immediately terminated following expiry?

That might help with making a decision.

Author

We want to keep the behavior consistent with the current Cilium implementation. Today, CT timeouts are configurable at the cluster/node level, and these timeouts are not passed up to the socket. This feature just looks to improve the granularity of the existing functionality.

CT timeouts are especially useful on the EGW node because they can free up ports from stale connections and prevent exhaustion. This is the primary motivator for the feature, so passing these timeouts up to the socket level is not a goal.
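The port-exhaustion motivation can be made concrete with a back-of-the-envelope Little's-law estimate: steady-state SNAT ports in use are roughly the new-connection rate times the CT entry lifetime. The rates, timeout values, and port budget below are illustrative assumptions, not measurements.

```go
package main

import "fmt"

// portsInUse estimates steady-state SNAT port usage on the gateway via
// Little's law: ports in use ≈ new connections per second × CT entry
// lifetime in seconds. All numbers used below are illustrative.
func portsInUse(connPerSec, ctLifetimeSec int) int {
	return connPerSec * ctLifetimeSec
}

func main() {
	// Rough per-(egress IP, dst IP, dst port) source-port budget.
	const available = 64510
	for _, lifetime := range []int{390, 60} {
		used := portsInUse(200, lifetime)
		fmt.Printf("lifetime=%ds used=%d exhausted=%v\n",
			lifetime, used, used > available)
	}
}
```

With these assumed numbers, a ~390s lifetime overruns a single egress IP's port budget at 200 new connections per second, while a 60s per-policy timeout keeps usage well inside it; this is the effect the proposal is after.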

struct egress_gw_policy_entry {
	__u32 egress_ip;
	__u32 gateway_ip;
	__u8  custom_timeouts_specified;


Migrating the entry code could present some complexity with regard to upgrading Cilium. A possible alternative approach would be to enable the lookup code only if the feature is enabled.

Author

Makes sense; we can introduce a new Cilium config flag to enable this feature.

@tommyp1ckles tommyp1ckles self-requested a review January 30, 2025 18:29
Co-authored-by: Bill Mulligan <[email protected]>
Signed-off-by: rari459 <[email protected]>
@nathanjsweet
Member

Ongoing discussion is taking place here: cilium/cilium#37454

@nathanjsweet nathanjsweet added the dont-merge/discussion Active discussions should be resolved before merging this PR label Mar 12, 2025
@julianwiedmann julianwiedmann self-requested a review April 9, 2025 09:01
@tommyp1ckles

@rari459 Just a few more comments; it seems there are a couple of things we need to update based on the discussion over on cilium/cilium#37454.

Overall, this proposal makes sense from my perspective. With the UX changes regarding the EgressGatewayPolicy, I think this provides a good way to expose this somewhat low-level configuration that some users may be interested in.

name: egress-sample
spec:
  [other fields] ...
  connection-timeouts: |


Per this discussion, it was suggested that we instead use a general "custom" profile approach where users can specify fine-grained controls explicitly if they desire.

Are we still planning on having that here?

Author

Since the expected users of this field are power users, I wanted to see if we could simplify the field to accept only custom timeouts. The advantage of keeping the profile and custom fields is that we could offer preset connection timeout settings, but that leads to the issue of actually having a hard-coded set of connection timeout configurations.

To avoid this, I merged the custom field into the connection-timeouts field, but I am open to keeping the profile and custom fields if we feel that presets are needed. In that case there is still the question of what those presets should be.
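As a sketch of how the merged connection-timeouts field might be validated by the agent, here is hypothetical Go parsing code. The accepted keys, the "key=seconds" format, and the bounds are all assumptions for illustration, since the proposal deliberately leaves the settings implementation-defined per Cilium version.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseConnectionTimeouts sketches agent-side validation of the merged
// connection-timeouts field: whitespace-separated "key=seconds" pairs
// whose accepted keys are defined per Cilium version rather than in the
// formal API. Key names and bounds here are hypothetical.
func parseConnectionTimeouts(raw string) (map[string]uint32, error) {
	known := map[string]bool{"tcp": true, "udp": true, "tcp-syn": true, "tcp-fin": true}
	out := make(map[string]uint32)
	for _, field := range strings.Fields(raw) {
		key, val, ok := strings.Cut(field, "=")
		if !ok || !known[key] {
			return nil, fmt.Errorf("unknown timeout setting %q", field)
		}
		secs, err := strconv.ParseUint(val, 10, 32)
		if err != nil || secs == 0 {
			return nil, fmt.Errorf("invalid value in %q", field)
		}
		out[key] = uint32(secs)
	}
	return out, nil
}

func main() {
	timeouts, err := parseConnectionTimeouts("tcp=600 udp=60")
	fmt.Println(timeouts["tcp"], timeouts["udp"], err) // 600 60 <nil>
}
```

Rejecting unknown keys outright, rather than ignoring them, keeps the door open to reinterpreting or renaming settings in a later Cilium version without silently accepting stale configuration.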





The custom field within the custom timeouts configuration offers a mechanism for advanced users to inject specific, implementation-defined settings for Connection Tracking (CT). This field accepts arbitrary text, which the agent can interpret as a set of parameters specific to the currently running Cilium version. This approach allows for fine-grained control over CT behavior for power users who have a deep understanding of the underlying system. By keeping these advanced settings outside of the formal API, we gain flexibility to evolve and potentially rework CT parameters in future Cilium versions without introducing breaking API changes.

It is crucial to understand that utilizing this custom field signifies an acceptance of risk. Users who delve into these implementation details are expected to (a) fully acknowledge and bear the responsibility for any instability or unexpected behavior resulting from their custom configurations, and (b) proactively monitor upstream Cilium development for changes in these settings. Furthermore, these power users are encouraged to actively participate in the Cilium community, providing feedback on their experiences and helping to guide the evolution of these advanced options. While we will strive to release-note any modifications to the interpretation of this field, there are fewer guarantees regarding the stability and compatibility of these settings across different Cilium versions. Strong and consistent feedback from multiple users regarding the utility and effectiveness of specific custom settings may lead to their consideration for formal inclusion in future API revisions.


We can likely express this by putting it into a separate documentation page and having a warning about the implications of fiddling with these settings.

Author

Sounds good, I'll move this out.

@rari459 rari459 requested a review from tommyp1ckles April 25, 2025 17:33
@joestringer joestringer left a comment


Seems like this latest version takes into account the concerns I raised in cilium/cilium#37454 regarding API commitments. Minor suggestion below to reinforce that.

Other than that I defer to @cilium/egress-gateway reviewers for feedback.
