[RFC] Enhance Stack Handling #1220
Conversation
Hi @FloThinksPi, thank you for drafting RFC-0040! It provides a comprehensive overview of enhancements to stack handling in Cloud Foundry, and I appreciate the effort you've put into addressing these critical issues. However, I feel the RFC could benefit from being split into individual, logically scoped RFCs to improve focus and facilitate discussion. For example:
Each topic is significant and detailed enough to warrant its own RFC. Splitting them would allow contributors to dive deeper into specific areas, streamline discussions, and prioritize implementation efforts more effectively.

On a higher level, I'm also worried about putting responsibility for stacks into the hands of app developers. One of the great features of CF is that platform operators can fix CVEs in, for example, OpenSSL with a single bosh deploy. I understand the desire to fix this, but maybe a compromise could be found in making stack management an admin-only feature (using the cf CLI), ideally with a blobstore-first approach so it works in airgapped environments.

Last but not least, I'm wondering about the rollout mechanisms for handling custom stack updates: given these changes are not orchestrated through bosh, how do we ensure we don't overwhelm Diego? There needs to be some sort of global max_in_flight setting to control the total number of apps that are being restarted at the same time.
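For illustration, a client-side sketch of such a cap, assuming jq, a logged-in cf CLI, and the v3 API's `stacks` filter (pagination and error handling omitted). In the scenario described above the limit would have to live server-side, in Cloud Controller or Diego, rather than in a script:

```sh
# Hedged sketch: restart all apps on a given stack while keeping the number
# of restarts in flight bounded. A real implementation would enforce
# max_in_flight inside the platform, not in a client loop.
MAX_IN_FLIGHT=5
cf curl "/v3/apps?stacks=cflinuxfs4&per_page=100" \
  | jq -r '.resources[].guid' \
  | xargs -P "$MAX_IN_FLIGHT" -I {} \
      cf curl -X POST "/v3/apps/{}/actions/restart"
```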
Thanks for the feedback @rkoster. I can split it into multiple RFCs; however, I think the big picture the RFCs are made for would then get lost, as only the sum of the three proposals together improves the problem. I would rather accept or dismiss individual proposals as the result of the review, document in the proposals which ones are accepted and which are not, and then accept or dismiss the RFC as a whole. Maybe we can open multiple https://github.com/cloudfoundry/community/discussions for each proposal and link them here; would that help structure the review process better? E.g. for the proposals:

The last proposal is just a "note for future reference", not an actual proposal to commit to, but something to pull out in case it is desired in some months/years as an extension scenario. I'll answer your other questions in the Discussions, just to try that out as well :)
I'm not sure that additional discussion threads help; the RFC README clearly states that the discussion shall happen in the PR. I share @rkoster's opinion that this RFC is rather big and deserves a split. "Improve logical stack management in CF API" and "Bring your own stack" are proposals that bring value without each other and can be implemented independently.

"Provide a stack with every Ubuntu LTS" is, in my eyes, not a good RFC candidate. The idea is clear and has been brought up in the past, but in the end it is a question of resources and commitment. Concrete RFCs for the next stack, like rfc-0039-noble-based-cflinuxfs5, are more helpful as they indicate commitment by the author and make clear that work will really start.

The RFC README doesn't say anything about the scope/size of an RFC, and we have RFCs of all sizes (from smaller process-related RFCs like introducing the reviewer role up to the long-running CF API v2 removal and manifest v2). I would strive for not-too-big RFCs so that the corresponding implementation issues can get closed one day.
@rkoster @stephanme Alright, split into #1251 and dropped the stack release proposal entirely.
@rkoster, to answer your comment:
We could extend the feature flag to have three modes:
I also thought intensively about that. In the end, we actually already allowed this to some degree when we introduced "bring your own buildpack". If you run your own buildpack, or even a system one, you have to restage regularly anyway to consume patches to your buildpack or language libs, at least if your application does not pin dependency versions at all, or pins only direct dependencies but not transitive ones (patched log4j libs being an example everyone might know). Even though CF gives you some things automatically, in my experience users didn't know that they sometimes have to restage to consume new libs; they (naively) thought the system takes care of it. Take the Python buildpack, for example: you only get a CVE patched in the Python interpreter if you restage! The same goes for the Go compiler, Ruby or the Java JRE. What may thus be desirable when allowing freedom like custom buildpacks or custom stacks is to explain the boundary conditions very well and make them transparent to the user. This could maybe be added to this RFC/a new one: that we make this more transparent than today, either in the API with a special flag (e.g.
Deliberately, the RFC only supports custom stack + custom buildpack together, so "bring your own everything". Firstly, the operator has to allow users to use this feature via the feature flag. Secondly, a user has to willingly move away from the system buildpack and system stack; this cannot happen accidentally. So, if better documented, I think an operator may be enabled to allow a user who deliberately wants to care for his buildpack and stack to do so. As the operator of a foundation, compliance can still be validated programmatically with a scan over the CF API in case the operator is responsible for the apps on the foundation, or the operator can decide not to enable the feature at all in case he has concerns.
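As a minimal sketch of such a compliance scan, assuming jq, a logged-in cf CLI, and treating the cflinuxfs* names as the system stacks (an assumption; pagination is omitted for brevity):

```sh
# List every app whose buildpack lifecycle references a non-system stack,
# as a starting point for a compliance report.
cf curl "/v3/apps?per_page=100" \
  | jq -r '.resources[]
      | select(.lifecycle.data.stack? != null
               and ((.lifecycle.data.stack | startswith("cflinuxfs")) | not))
      | "\(.name)\t\(.lifecycle.data.stack)"'
```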
Since we run at large scale with the docker feature flag enabled on our foundations, we already have quite some experience. From a Diego point of view, the custom stack proposal is nothing else than a docker app. Diego is, as outlined in the RFC, unaware of lifecycles; that's a CF API concept. It just knows LRPs and Tasks with a base layer either on disk or from a container registry, see https://github.com/cloudfoundry/bbs/blob/main/docs/031-defining-lrps.md. Maybe I missed something here, but that means we can derive the behaviour of custom stacks for Diego from the already known behaviour of docker apps within CF, since from a Diego point of view they are identical. Thinking about that now: maybe it makes sense for the feature flag for custom stacks to be only activate-able when
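To make that equivalence concrete: per the BBS docs linked above, a DesiredLRP carries a rootfs URI that is either a preloaded stack on the cell's disk or a container-registry reference, and a custom stack would simply use the registry form, exactly like a docker app. A quick way to observe this on a running foundation, assuming access to cfdot on a Diego cell (field names follow the BBS JSON output):

```sh
# Print each LRP's rootfs. System-stack apps show a preloaded URI such as
#   preloaded:cflinuxfs4
# while docker apps (and, under this proposal, custom-stack apps) show a
# registry URI such as
#   docker:///myregistry/my-custom-stack   (illustrative image name)
cfdot desired-lrps | jq -r '[.process_guid, .rootfs] | @tsv'
```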
Please follow the RFC draft creation process and change the name of the file to rfc-draft-enhance-stack-handling.md
because our automation generates and assigns the RFC number when it is accepted and merged.
For reference, I spent a few minutes today to make the adaptation in cloud_controller_ng: https://github.com/cloudfoundry/cloud_controller_ng/pull/4475/files. It's not ready yet and likely has failing tests etc., but it gives a rough overview of what's about to change code-wise with this RFC.
> programmed client side by creating new CF Applications.
> However all existing apps using a locked stack SHOULD continue to run.
>
> - Mark a stack as disabled -> prevent using the stack for any app
I wonder if we would get much value from having both "locked" and "disabled", especially since "locked" will de-facto break some blue-green deployments. It'd be simpler to understand (and implement) if we reduced it down to two functional states: locked/unlocked.
We brought in this intermediary state to have an (optional) step in between. An operator can also decide to move straight from deprecated to disabled without setting a timestamp for locked. That being said, the locked state is important for very, very large foundations, where the disabled state would create too much disturbance at once and too much support load. With the gradual exposure of a locked state, one can do this in at least two steps. Ideally I would have liked to propose this as a per-organisation setting, but that is a much larger change in the CC and in the user experience, I thought, so I'd like to first bring this in on the same logical level that exists today: globally, as stacks are only global assets. After that, and after gathering more experience, one may come forward with additional optimizations to the workflow/UX/process, like:
- Individual stack states per org
- A stack visibility mapping, like with services, which steers which org/space can use which stack
- etc.

But I think it's more valuable to first improve something on the global stack level before going into an org-scoped world. So the locked state, in short, is just there to have not one big step in the deprecation process but rather two smaller ones, to reduce/distribute the effects on large consumer bases.
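To illustrate the two smaller steps, a purely hypothetical request shape (the `state` field and its values are illustrative; the RFC text, not this comment, defines the actual API):

```sh
# Step 1: lock the stack; existing apps keep running, new usage is blocked.
cf curl -X PATCH "/v3/stacks/$STACK_GUID" -d '{"state": "locked"}'

# Step 2: once consumers have migrated, disable the stack entirely.
cf curl -X PATCH "/v3/stacks/$STACK_GUID" -d '{"state": "disabled"}'
```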
Minor comments (reference to wrong RFC number), otherwise LGTM.
> occur. This RFC proposes improvements in CF to shift this unavailability
> towards lifecycle operations early and not actual app downtime - making it a
> more pleasant experience for CF users and operators alike.
> To mitigate the downsides of this approach, RFC-0041 proposes to provide custom stacks functionality.
The RFC number of #1251 is not yet known. Maybe use the PR as reference.
(there are multiple references to RFC-0041 further down)
We start the final comment period with the goal of accepting this RFC on Sept 9th.
Click Here for a better reviewable/readable version.
Related RFC-0041