Revert "build_docker_config added, enables augmentation of the build pod's docker config" #1293
Conversation
If we need some time to figure out why this didn't work, I'd propose merging this and making a second attempt to land the feature. Maybe we can even find a test that would have caught this, to stop it from coming back?
oh. 😬
The change in secret name is breaking (again)...
Hmmm, this would have happened if the binderhub pods didn't restart as part of the upgrade and start using the new k8s secret name for the new build pods being created. Since the build pods keep increasing and stay Pending because the old k8s secret is no longer around, it indicates to me that the software that creates the build pods didn't reconfigure itself with the new k8s secret name. Is it the binderhub pod that creates the build pods?
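To illustrate the mechanism, here is a minimal sketch (not BinderHub's actual code) of a build pod that mounts its docker config from a k8s secret referenced by name, written with the kubernetes Python client; the secret name, image, and mount path are illustrative assumptions. If a chart upgrade renames the secret while the process creating these pods keeps using the old name, kubelet cannot set up the volume and the pod never leaves Pending/ContainerCreating.

```python
from kubernetes import client

# Illustrative name only; not necessarily the name used by the chart or by BinderHub.
SECRET_NAME = "binder-build-docker-config"


def make_build_pod(name: str) -> client.V1Pod:
    # The docker config is mounted from a k8s secret referenced *by name*.
    docker_config = client.V1Volume(
        name="docker-config",
        secret=client.V1SecretVolumeSource(secret_name=SECRET_NAME),
    )
    build_container = client.V1Container(
        name="image-build",
        image="quay.io/jupyterhub/repo2docker:main",  # illustrative image
        volume_mounts=[
            client.V1VolumeMount(name="docker-config", mount_path="/root/.docker"),
        ],
    )
    # If SECRET_NAME no longer exists in the cluster (e.g. it was renamed by a
    # chart upgrade), kubelet cannot mount the volume and this pod stays
    # Pending/ContainerCreating until the secret reappears or the pod is deleted.
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1PodSpec(
            containers=[build_container],
            volumes=[docker_config],
            restart_policy="Never",
        ),
    )
```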
The build pods should be created by the BinderHub application; see lines 239 to 240 at commit 0b4462c.
It sounds like restarting the binderhub pods should be the way to fix this / keep it from happening. The weird thing is that a change like this (new binderhub code -> new binderhub container image) means the binderhub pods will restart. I unfortunately didn't look at the age of the binderhub pods when things were broken, so I don't know for sure whether they restarted or not. However, the bhub pods are now ~15h old, which I think means they restarted when we reverted the deploy. So probably they also restarted when we originally deployed it. Does anyone understand why, even with restarting the bhub pods, we still ended up with build pods not being able to start over a period of a few hours?
This is known:
Conclusion: Mybinder.org deployed a new version. This worked out well for all federation members except for GKE. A rolling upgrade was made, but the new binder replicas that started failed to become Ready.

Why did the GKE deployment fail, why didn't the binderhub pod replica become ready? I'm not sure. I note that GKE prod uses 2 replicas, while other deployments use 1 and staging uses 1-3 based on an HPA (horizontal pod autoscaler).

Suggested action point: I suggest trying to redeploy the latest binderhub version to mybinder.org-deploy and babysitting the deployment, observing specifically how the rollout of binderhub turns out and for what reason.
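As a rough sketch of what "babysitting the deployment" could look like with the kubernetes Python client (the namespace, deployment name, and label selector below are assumptions for illustration): poll the binder Deployment until all updated replicas are Ready, and dump pod phases if the rollout stalls.

```python
import time

from kubernetes import client, config

NAMESPACE = "prod"     # assumption for illustration
DEPLOYMENT = "binder"  # assumption for illustration

config.load_kube_config()
apps = client.AppsV1Api()
core = client.CoreV1Api()

for _ in range(60):  # poll for up to ~10 minutes
    dep = apps.read_namespaced_deployment(DEPLOYMENT, NAMESPACE)
    desired = dep.spec.replicas or 0
    updated = dep.status.updated_replicas or 0
    ready = dep.status.ready_replicas or 0
    print(f"updated {updated}/{desired}, ready {ready}/{desired}")
    if desired and updated == desired and ready == desired:
        print("rollout complete")
        break
    time.sleep(10)
else:
    # Rollout stalled: show why the new replicas aren't Ready.
    pods = core.list_namespaced_pod(NAMESPACE, label_selector="component=binder")
    for pod in pods.items:
        print(pod.metadata.name, pod.status.phase)
```

This is essentially what `kubectl rollout status` reports; the point is just to watch it actively rather than assume the CI deploy succeeded.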
Is there some magic we can add to the mybinder.org-deploy pipeline to scream/tweet/post a message to gitter when a production deploy fails?
Attempted another deploy, same result. Digging around the helm chart a bit, I found binderhub/helm-chart/binderhub/templates/deployment.yaml, lines 55 to 58 at commit 0b4462c.
@betatim what are the logs from the binder pods? Have both restarted and successfully become ready - in other words, are both fully updated?
No, they remain in the "ContainerCreating" status
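A small debugging sketch, assuming kubeconfig access, for pulling the events of build pods stuck in Pending/ContainerCreating with the kubernetes Python client; the namespace and label selector are assumptions for illustration.

```python
from kubernetes import client, config

NAMESPACE = "prod"  # assumption for illustration
BUILD_POD_SELECTOR = "component=binderhub-build"  # assumed label

config.load_kube_config()
core = client.CoreV1Api()

pods = core.list_namespaced_pod(NAMESPACE, label_selector=BUILD_POD_SELECTOR)
for pod in pods.items:
    if pod.status.phase != "Pending":
        continue
    events = core.list_namespaced_event(
        NAMESPACE,
        field_selector=f"involvedObject.name={pod.metadata.name}",
    )
    for event in events.items:
        # FailedMount events, for example, would point at a missing secret.
        print(pod.metadata.name, event.reason, event.message)
```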
jupyterhub/mybinder.org-deploy#1900 (comment) has some notes
Sorry guys, again! This was my sloppiness!
@g-braeunlich bugs happen :) @betatim I merged the bugfix in #1294 and closed this PR by doing so, hoping to help anyone using the latest version avoid running into issues.
@consideRatio is your change in #1294 what I found in #1293 (comment)? Thanks for the fix. Goes to show that even "simple" changes can be tricky to get right :D
Ahhhh yes it is, I missed your comment! Sorry for seemingly ignoring it @betatim!
No worries. It remains an unsolved mystery (for me) why this didn't fail on mybinder.org's staging deployment, but yeah. Let's see what happens this time 😂
This proposes to revert #1255. When we deployed this to mybinder.org today we ended up with all(?) build pods getting stuck in a pending state for about two hours, at which point we reverted the deploy.
The marker just before 15:30h is the deploy. The blue shaded graph shows the number of "running" build requests ("running" includes "container creating"). At 17:00h I deleted all build pods and reverted the deploy.
Build pods that were stuck during this time contained the following in the events:
The summary of the original PR says that there is no breaking change and that maybe for a brief moment new build pods would be using the wrong secret/a secret that doesn't exist. Does someone involved in the PR have an idea what happened/went wrong when deploying this?