
Conversation

@FloThinksPi (Member) commented Jun 25, 2025

Click Here for a better reviewable/readable version.

Related RFC-0041

@FloThinksPi changed the title from "RFC-0040 Enhance Stack Handling" to "[RFC] Enhance Stack Handling" Jun 25, 2025
@beyhan beyhan added the rfc CFF community RFC label Jun 25, 2025
@beyhan beyhan requested review from a team, rkoster, beyhan, stephanme, ameowlia and ChrisMcGowan and removed request for a team June 25, 2025 13:12
@rkoster (Contributor) commented Jun 25, 2025

Hi @FloThinksPi,

Thank you for drafting RFC-0040! It provides a comprehensive overview of enhancements to stack handling in Cloud Foundry, and I appreciate the effort you’ve put into addressing these critical issues.

However, I feel the RFC could benefit from being split into individual, logically scoped RFCs to improve focus and facilitate discussion. For example:

  1. Improved Logical Stack Management: Covering changes to CF API for stack states like deprecated, locked, disabled, and their timestamps.

  2. Bring Your Own Stack: Discussing the introduction of user-managed stacks via container image references and the related CF API changes.

  3. Stack Release Cycle Alignment: Addressing the proposal to build stacks for every Ubuntu LTS release.

  4. Org-Scoped Stack Management: Introducing organization-specific stack visibility and control.

Each topic is significant and detailed enough to warrant its own RFC. Splitting them would allow contributors to dive deeper into specific areas, streamline discussions, and prioritize implementation efforts more effectively.

On a higher level, I'm also worried about putting responsibility for stacks into the hands of app developers. One of the great features of CF is that platform operators can fix CVEs in, for example, OpenSSL with a single bosh deploy.

I understand the desire to fix this, but maybe a compromise could be found in making stack management an admin-only feature (using the cf CLI). Ideally using a blobstore-first approach so it works in air-gapped environments.

Last but not least, I'm wondering about the rollout mechanism for handling custom stack updates. Given these changes are not orchestrated through BOSH, how do we ensure we don't overwhelm Diego? There needs to be some sort of global max_in_flight setting to control the total number of apps that are being restarted at the same time.
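The global max_in_flight idea could be sketched as a platform-wide throttle that every restart has to pass through, regardless of what triggered it. The following is a minimal illustrative sketch, not an existing CF component; the class and setting names are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: a platform-wide cap on how many apps may be
# restarting at once, so a custom-stack update rolling over many apps
# cannot overwhelm the Diego cells.
MAX_IN_FLIGHT = 2

class RestartThrottle:
    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest observed concurrency, for verification

    def restart(self, app_name):
        with self._slots:              # block until a global slot is free
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            time.sleep(0.01)           # stand-in for the actual restart work
            with self._lock:
                self._active -= 1
        return app_name

throttle = RestartThrottle(MAX_IN_FLIGHT)
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(throttle.restart, [f"app-{i}" for i in range(20)]))
```

Even with ten workers submitting restarts, at most `MAX_IN_FLIGHT` apps are ever restarting simultaneously; the rest queue on the semaphore.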

@Gerg Gerg self-requested a review June 25, 2025 21:26
@rkoster rkoster moved this from Inbox to In Progress in CF Community Jul 1, 2025
@FloThinksPi (Member Author)

Thanks for the feedback, @rkoster,

I can split it into multiple RFCs, however I think the big picture for which the RFCs are made would then be lost, since only the sum of the three proposals together addresses the problem. I would rather accept or dismiss individual proposals as the result of the review, document in the proposals which ones were accepted and which were not, and then accept or dismiss the RFC as a whole. Maybe we can open multiple https://github.com/cloudfoundry/community/discussions threads, one per proposal, and link them here; would that help structure the review process better?

E.g. for the Proposals:
RFC-0040 - Proposal: Improve logical stack management in CF API (cloud_controller_ng)
RFC-0040 - Proposal: Bring your own Stack
RFC-0040 - Proposal: Provide a stack with every Ubuntu LTS

The last proposal is just a "note for future reference", not an actual proposal to commit to, but something to pull out in case it's desired in some months or years as an extension scenario.

I'll answer the other questions you had in the Discussions, just to also try that out :)

@beyhan beyhan requested review from cweibel and removed request for ChrisMcGowan July 15, 2025 05:58
@beyhan beyhan added the toc label Jul 15, 2025
@stephanme (Member)

I'm not sure that additional discussion threads help. The RFC README clearly states that the discussion shall happen in the PR.

I share @rkoster's opinion that this RFC is rather big and deserves a split. "Improve logical stack management in CF API" and "Bring your own stack" are proposals that bring value w/o each other and which can be implemented independently.

"Provide a stack with every ubuntu LTS" is in my eyes not a good RFC candidate. The idea is clear and has been brought up in the past but in the end it is a question of resources and commitment. Concrete RFCs for the next stack like rfc-0039-noble-based-cflinuxfs5 are more helpful as they indicate commitment by the author and make clear that work will really start.

The RFC README doesn't say anything about the scope/size of an RFC, and we have RFCs of all sizes (from smaller process-related RFCs like introducing the reviewer role up to long-running ones like CF API v2 removal and manifest v2). I would strive for not-too-big RFCs so that the corresponding implementation issues can get closed one day.

@FloThinksPi (Member Author)

@rkoster @stephanme alright, split into #1251 and dropped the stack release proposal entirely.

@FloThinksPi (Member Author)

@rkoster, to answer your comment:

> I understand the desire to fix this, but maybe a compromise could be found in making stack management an admin-only feature (using the cf CLI). Ideally using a blobstore-first approach so it works in air-gapped environments.

We could extend the feature flag to have three modes: off, admin-only, and on.
We could also allow an admin to set a docker:// URL stack as a system stack in the stacks table; this can only be done by admins.
So consuming a stack from the outside could actually be an admin-only feature. However, I'm not sure how much value this adds for an operator/admin. Note that the feature flag will be disabled by default anyway, so an operator first has to deliberately decide to provide this function in his foundation. In an air-gapped environment an operator would not enable this feature, I guess, since it's pointless: if they ran their own container registry in that environment, they could also just use the usual BOSH mechanism to deploy the stack onto the Diego cells.

> On a higher level, I'm also worried about putting responsibility for stacks into the hands of app developers. One of the great features of CF is that platform operators can fix CVEs in, for example, OpenSSL with a single bosh deploy.

I also thought about that intensively. In the end, we actually already allowed this to some degree when we introduced "bring your own buildpack". If you run your own buildpack, or even a system one, you have to restage regularly anyway to consume patches to your buildpack or language libs, at least if your application does not pin dependency versions at all, or pins direct dependencies but not transitive ones (patched log4j libs are an example everyone might know). Even though CF gives you some things automatically, in my experience users didn't know that they sometimes have to restage to consume new libs; they (naively) thought the system takes care of it. Take the Python buildpack, for example: you only get a CVE patched in the Python interpreter if you restage! The same goes for the Go compiler, Ruby, or the Java JRE.

What may thus be desirable when allowing freedom like custom buildpacks or custom stacks is to explain the boundary conditions very well and make them transparent to the user. Maybe it could be added to this RFC, or a new one, that we make this more transparent than today, either in the API with a special flag (e.g. security-lifecycle-status or something like that, visible in cf apps, which shows you that a restage is required) or as documentation.

  • System Buildpack ➕ System Stack ➡️ You regularly have to restage to deploy new updates of the buildpack or your app's dependencies (if not pinned)
  • Custom Buildpack ➕ System Stack ➡️ You regularly have to bump the buildpack version AND restage to deploy new updates of the buildpack or your app's dependencies (if not pinned)
  • Custom Buildpack ➕ Custom Stack ➡️ You regularly have to bump the buildpack version AND the stack version AND restage to deploy new updates of the buildpack or your app's dependencies (if not pinned)

The RFC deliberately supports only custom stack + custom buildpack, i.e. bring your own everything. First, the operator has to allow users to use this feature via the feature flag. Second, a user has to willingly move away from the system buildpack and system stack; this cannot happen accidentally. So, if better documented, I think that if a user deliberately wants to care for his buildpack and stack, an operator may be enabled to allow him that. As the operator of a foundation, compliance can still be validated programmatically with a scan over the CF API in case the operator is responsible for the apps on the foundation, or he can decide not to enable the feature at all in case he has concerns.
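The programmatic compliance scan mentioned above could be as simple as filtering the lifecycle data that `GET /v3/apps` already returns per app. A minimal sketch, assuming app resources have been fetched as JSON; the allow-list contents and the docker:// naming for custom stacks follow the proposal, not an existing CF API contract.

```python
# Hypothetical compliance scan: flag apps whose lifecycle stack is not
# in the operator's allow-list of system stacks. The input mirrors the
# shape of app resources from GET /v3/apps (lifecycle.data.stack).
SYSTEM_STACKS = {"cflinuxfs3", "cflinuxfs4"}

def non_system_stack_apps(apps):
    """Return (name, stack) pairs for apps using a non-system stack."""
    flagged = []
    for app in apps:
        stack = app.get("lifecycle", {}).get("data", {}).get("stack")
        if stack and stack not in SYSTEM_STACKS:
            flagged.append((app["name"], stack))
    return flagged
```

An operator could run such a scan periodically against the CF API to keep an inventory of which apps opted into bring-your-own-stack.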

> Last but not least, I'm wondering about the rollout mechanism for handling custom stack updates. Given these changes are not orchestrated through BOSH, how do we ensure we don't overwhelm Diego? There needs to be some sort of global max_in_flight setting to control the total number of apps that are being restarted at the same time.

Since we run at large scale with the docker feature flag enabled on our foundations, we already have quite some experience. From a Diego point of view, the custom stack proposal is nothing else than a docker app. Diego is, as outlined in the RFC, unaware of lifecycles; that's a CF API concept. It just knows LRPs and Tasks with a base layer either being on disk or coming from a container registry, see https://github.com/cloudfoundry/bbs/blob/main/docs/031-defining-lrps.md. Maybe I missed something here, but that means we can derive the behaviour of custom stacks for Diego from the already known behaviour of docker apps within CF, since from a Diego point of view they are identical.
The update behaviour of the stack is also the same as with lifecycle docker: e.g. if you specify a container with the tag latest, on an app restart you run with the newest stack, similar to when we bump the system stack currently.
We simply allow combining the docker and buildpack lifecycles, in the sense that the CF API schedules a staging task with a base layer from a container registry.
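The point that Diego only sees a base-layer reference can be illustrated with a small sketch. Following the BBS LRP docs linked above, a preloaded rootfs is referenced as `preloaded:<stack>` while a registry image is referenced by its URL; the function below and the `"custom-stack"` lifecycle name are simplified illustrations, not the actual BBS schema.

```python
# Hypothetical helper showing that, from Diego's perspective, a custom
# stack and a docker app resolve to the same kind of root_fs reference,
# while only the system-stack buildpack lifecycle uses a preloaded rootfs.
def desired_lrp_rootfs(lifecycle_type, stack_or_image):
    if lifecycle_type == "buildpack":
        # System stack: the rootfs is preloaded on every Diego cell.
        return f"preloaded:{stack_or_image}"
    if lifecycle_type in ("docker", "custom-stack"):
        # Docker apps and (proposed) custom stacks: pulled from a registry.
        return stack_or_image  # e.g. "docker://registry/ubuntu:24.04"
    raise ValueError(f"unknown lifecycle: {lifecycle_type}")
```

This is why the RFC can reuse the operational experience with docker apps: the scheduler's view is identical in both cases.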

Thinking about it now: maybe it makes sense for the custom stacks feature flag to only be activatable when the diego_docker feature flag (https://docs.cloudfoundry.org/adminguide/docker.html) is also set. Opinions on that?

@beyhan (Member) left a comment

Please follow the RFC draft creation process and change the name of the file to rfc-draft-enhance-stack-handling.md because our automation generates and assigns the RFC number when it is accepted and merged.

@FloThinksPi (Member Author) commented Jul 24, 2025

For reference, I spent a few minutes today making the adaptation in cloud_controller_ng: https://github.com/cloudfoundry/cloud_controller_ng/pull/4475/files. It's not ready yet and likely has failing tests, but it gives a rough overview of what is about to change code-wise with this RFC.

programmed client side by creating new CF Applications.
However all existing apps using a locked stack SHOULD continue to run.

- Mark a stack as disabled -> prevent using the stack for any app
Member

I wonder if we would get much value from having both "locked" and "disabled", especially since "locked" will de-facto break some blue-green deployments. It'd be simpler to understand (and implement) if we reduced it down to two functional states: locked/unlocked.

Member Author

We brought in this intermediary state to have an (optional) step in between. An operator can also decide to move straight from deprecated to disabled without setting a timestamp for locked. That being said, the locked state is important for very large foundations, where the disabled state would create too much disturbance at once and too much support load. With the gradual exposure of a locked state, one can do this in at least two steps. Optimally, I would have liked to propose this as a per-organisation setting, however that is a much larger change in the CC and in the user experience, I thought, so I'd like to first bring this in on the same logical level that exists today: globally, as stacks are only global assets. After that, and after gathering more experience, one may come forward with additional optimizations to the workflow/UX/process like:

  • Individual stack states per org
  • A stack visibility mapping, like with services, which steers which org/space can use which stack
  • etc.

But I think it's more valuable to first improve something at the global stack level before going into an org-scoped world. So the locked state, in short, exists to have not one big step in the deprecation process but rather two smaller ones, to reduce/distribute effects on large consumer bases.
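The two-step deprecation flow discussed here can be sketched as a small state machine. This is an illustrative sketch only; the allowed-transition table (including skipping locked, and allowing active straight to disabled) is my reading of the discussion, not a finalized part of the RFC.

```python
from enum import Enum

# Hypothetical stack lifecycle states, following the discussion above.
class StackState(Enum):
    ACTIVE = "active"
    DEPRECATED = "deprecated"  # still usable, signals upcoming removal
    LOCKED = "locked"          # no new apps; existing apps keep running
    DISABLED = "disabled"      # stack unusable for any app

# Assumed transitions: "locked" is an optional intermediate step that an
# operator may skip by going straight to "disabled".
_ALLOWED = {
    StackState.ACTIVE: {StackState.DEPRECATED, StackState.DISABLED},
    StackState.DEPRECATED: {StackState.LOCKED, StackState.DISABLED},
    StackState.LOCKED: {StackState.DISABLED},
    StackState.DISABLED: set(),
}

def transition(current, target):
    if target not in _ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target

def may_create_app(state):
    # Creating *new* apps on the stack is refused from "locked" onwards;
    # this is the first, smaller deprecation step.
    return state in (StackState.ACTIVE, StackState.DEPRECATED)
```

The value of locked falls out of `may_create_app`: new workloads stop landing on the stack long before existing apps are affected.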

@stephanme (Member) left a comment

Minor comments (reference to a wrong RFC number), otherwise LGTM.

occur. This RFC proposes improvements in CF to shift this unavailability
towards lifecycle operations early and not actual app downtime - making it a
more pleasant experience for CF users and operators alike.
To mitigate the downsides of this approach, RFC-0041 proposes to provide custom stacks functionality.
Member

The RFC number of #1251 is not yet known. Maybe use the PR as reference.
(there are multiple references to RFC-0041 further down)

@stephanme (Member)

We start the final comment period with the goal of accepting this RFC on Sept 9th.

Labels
rfc CFF community RFC toc
Projects
Status: In Progress
5 participants