[RFC] Enhance Stack Handling #1220

FloThinksPi · 2025-06-25T12:28:44Z

Click Here for a better reviewable/readable version.

rkoster · 2025-06-25T14:14:52Z

Thank you for drafting RFC-0040! It provides a comprehensive overview of enhancements to stack handling in Cloud Foundry, and I appreciate the effort you’ve put into addressing these critical issues.

However, I feel the RFC could benefit from being split into individual, logically scoped RFCs to improve focus and facilitate discussion. For example:

Improved Logical Stack Management: Covering changes to CF API for stack states like deprecated, locked, disabled, and their timestamps.
Bring Your Own Stack: Discussing the introduction of user-managed stacks via container image references and the related CF API changes.
Stack Release Cycle Alignment: Addressing the proposal to build stacks for every Ubuntu LTS release.
Org-Scoped Stack Management: Introducing organization-specific stack visibility and control.

Each topic is significant and detailed enough to warrant its own RFC. Splitting them would allow contributors to dive deeper into specific areas, streamline discussions, and prioritize implementation efforts more effectively.

On a higher level I'm also worried about handing responsibility for stacks in the hands of app developers. One of the great features of CF is that platform operators can fix CVEs in for example OpenSSL with a single bosh deploy.

I understand the desire to fix this, but maybe a compromise could be found in making stack management an admin only feature (using the cf cli). Ideally using a blobstore first approach so it works in airgapped environments.

Last but not least I'm wondering about the roll out mechanisms for handeling custom stack updates, given these changes are not orchestrated through bosh, how do we ensure we don't over we Diego? There needs to be some sort of global max_in_flight setting to control the total number of apps that are being restarted at the same time.

FloThinksPi · 2025-07-08T14:15:02Z

Thanks for the Feedback @rkoster,

I can split it in multiple RFCs however i think then the big picture is lost for which the RFCs are made as the sum of the 3 RFCs/Proposals only together brings improvements for the problem. I would rather like to accept or dismiss individual proposals as the result of the review and then document it in the proposals which one is accepted and which one is not. And then Accept or Dismiss the RFC as a whole. Maybe we can open multiple https://github.com/cloudfoundry/community/discussions for each proposal and link them here, would that help in the review process to structure it better ?

E.g. for the Proposals:
RFC-0040 - Proposal: Improve logical stack management in CF API (cloud_controller_ng)
RFC-0040 - Proposal: Bring your own Stack
RFC-0040 - Proposal: Provide a stack with every ubuntu LTS

The last proposal is just a "Note for Future Reference" not an actual proposal to commit to but to pull out in case its desired in some months/years as extension scenario.

I`ll answer in the Discussions on the other questions you had just to try that also out :)

stephanme · 2025-07-15T14:30:08Z

I'm not sure that additional discussion threads help. The RFC README clearly states that the discussion shall happen in the PR.

I share @rkoster's opinion that this RFC is rather big and deserves a split. "Improve logical stack management in CF API" and "Bring your own stack" are proposals that bring value w/o each other and which can be implemented independently.

"Provide a stack with every ubuntu LTS" is in my eyes not a good RFC candidate. The idea is clear and has been brought up in the past but in the end it is a question of resources and commitment. Concrete RFCs for the next stack like rfc-0039-noble-based-cflinuxfs5 are more helpful as they indicate commitment by the author and make clear that work will really start.

The RFC README doesn't say anything about the scope/size of an RFC and we have RFC of all sizes (from smaller process related RFC like introducing the reviewer role up to long running CF API v2 removal and manifest v2). I would strive for not-too-big RFCs so that the corresponding implementation issues can get closed one day.

FloThinksPi · 2025-07-16T14:28:11Z

@rkoster @stephanme alright split to #1251 and dropped the stack release proposal entirely.

FloThinksPi · 2025-07-16T14:30:02Z

@rkoster to answer to your comment

I understand the desire to fix this, but maybe a compromise could be found in making stack management an admin only feature (using the cf cli). Ideally using a blobstore first approach so it works in airgapped environments.

We could extend the feature flag to have 3 modes - off, admin-only and on.
Also we could allow an admin to set a docker://URL stack as system stack in the stacks table. This one can only be done by admins.
So then this feature to consume a stack from the outside could actually be an admin only thing. However not sure how much value this adds for an operator/admin. Note that by default the feature-flag anyway will be disabled so an operator firstly has to decide to provide this function in his foundation deliberately. In an airgapped environment an operator would not enable this feature i guess since its pointless. When they would run their own container registry in that environment they could also just use the usual bosh mechanism to deploy the stack onto the diego cells anyway.

On a higher level I'm also worried about handing responsibility for stacks in the hands of app developers. One of the great features of CF is that platform operators can fix CVEs in for example OpenSSL with a single bosh deploy.

I also thought intensively over that, in the end we actually already allowed this when we introduced "bring your own buildpack" to some degree. If you run your own buildpack - or even a system one - you anyway have to regularly restage to consume patches to your buildpack or languages libs. In case your application does not do dependency version pinning at all or does just pin direct dependencies but not transitive ones. E.g. patched log4j libs as an example everyone might know of. Even though CF gives you some things automatically in my experience users didn't know that sometimes they have to restage to consume new libs as they (naively)thought the system takes care if it. For example take the python buildpack - you only get a CVE patched in the python interpreter if you restage! Same goes for the Go Compiler, Ruby or Java JRE.

What may thus be desirable when allowing freedom like custom buildpacks or custom stacks is to explain the boundary conditions very well and make them transparent to the user. Can be maybe added to the RFC/a new one that we make this more transparent than today either in the API with a special flag(e.g. security-lifecycle-status: or something like that that shows you a restage is required) visible in cf apps or as Documentation.

System Buildpack ➕ System Stack ➡️ You regularly have to restage to deploy new updates of the buildpack or you apps dependencies(if not pinned)
Custom Buildpack ➕ System Stack ➡️ You regularly have to bump the buildpack version AND have to restage to deploy new updates to the buildpack or you apps dependencies(if not pinned)
Custom Buildpack ➕ Custom Stack ➡️ You regularly have to bump the buildpack version AND the Stack version AND have to restage to deploy new updates of the buildpack or you apps dependencies(if not pinned)

Deliberately in the RFC only Custom Stack + Custom buildpack shall be supported so bring your own everything. And firstly the operator has to allow its users with the feature flag to use this feature. Secondly a user has to willingly move away from the system buildpack and system stack - this cannot be an accidental thing. So if better documented i think if a user wants to deliberately care for his buildpack and stack an operator may be enabled to allow him that. As an operator of a foundation compliance can still be validated with a scan over the CF API programmatically in case the operator is responsible for the apps on the foundation. Or decide to not enable that feature at all in case he as concerns.

Last but not least I'm wondering about the roll out mechanisms for handeling custom stack updates, given these changes are not orchestrated through bosh, how do we ensure we don't over we Diego? There needs to be some sort of global max_in_flight setting to control the total number of apps that are being restarted at the same time.

Since we run at large scale with the docker feature flag enabled on our Foundations we got quite some experience already. In CF the custom stack proposal from a diego point of view is nothing else than a docker app. Diego is - as outlined in the RFC - unware of lifecycles thats a CF API concept. It just knows LRPs and Tasks with either a baseLayer beeing on the disk or from a container registry see https://github.com/cloudfoundry/bbs/blob/main/docs/031-defining-lrps.md maybe i missed something here but that means we can derive the behaviour of custom stacks for diego from the already known behaviour of docker apps withing CF since from a diego point of view they are identical.
Also the update behaviour of the stack is the same as with lifecycle docker. E.g. if you specify a container with tag latest on a app restart you run with the newest stack similar as when we bump the system stack currently.
We simply just allow to combine lifecycle docker and buildpack in the sense that the CF API schedules a staging task with a baseLayer from a container registry.

Thinking about that now - maybe it makes sense for the feature flag for custom stacks to be only activate-able when diego_docker https://docs.cloudfoundry.org/adminguide/docker.html feature flag is also set. Opinions on that ?

beyhan

Please follow the RFC draft creation process and change the name of the file to rfc-draft-enhance-stack-handling.md because our automation generates and assigns the RFC number when it is accepted and merged.

toc/rfc/rfc-0040-enhance-stack-handling.md

clearer

FloThinksPi · 2025-07-24T14:52:05Z

For reference is spend a few minutes today to make the adoption in the cloud_controller_ng https://github.com/cloudfoundry/cloud_controller_ng/pull/4475/files its not ready yet, likely has failing tests etc. but gives a quick overview whats about to change with this RFC code wise roughly.