
Conversation

@betatim (Member) commented May 3, 2021

This proposes to revert #1255. When we deployed this to mybinder.org today we ended up with all(?) build pods getting stuck in a pending state for about two hours. At which point we reverted the deploy.

[Screenshot 2021-05-03 at 17:05: graph of running build requests around the deploy]

The marker just before 15:30h is the deploy. The blue shaded graph shows the number of "running" build requests ("running" includes "container creating"). At 17:00h I deleted all build pods and reverted the deploy.

Build pods that were stuck during this time contained the following in the events:

Events:
  Type     Reason       Age                 From                                         Message
  ----     ------       ----                ----                                         -------
  Warning  FailedMount  42m (x3 over 56m)   kubelet, gke-prod-user-202009-b9c03ca0-vivm  Unable to attach or mount volumes: unmounted volumes=[docker-push-secret], unattached volumes=[default-token-jcbqc docker-socket docker-push-secret]: timed out waiting for the condition
  Warning  FailedMount  26m (x3 over 31m)   kubelet, gke-prod-user-202009-b9c03ca0-vivm  Unable to attach or mount volumes: unmounted volumes=[docker-push-secret], unattached volumes=[docker-push-secret default-token-jcbqc docker-socket]: timed out waiting for the condition
  Warning  FailedMount  17m (x11 over 61m)  kubelet, gke-prod-user-202009-b9c03ca0-vivm  Unable to attach or mount volumes: unmounted volumes=[docker-push-secret], unattached volumes=[docker-socket docker-push-secret default-token-jcbqc]: timed out waiting for the condition
  Warning  FailedMount  2m (x38 over 63m)   kubelet, gke-prod-user-202009-b9c03ca0-vivm  MountVolume.SetUp failed for volume "docker-push-secret" : secret "binder-push-secret" not found

The summary of the original PR says that there is no breaking change and that, at most, new build pods might briefly use the wrong secret or a secret that doesn't exist. Does anyone involved in that PR have an idea of what happened/went wrong when we deployed this?
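
For anyone who wants to reproduce the diagnosis, a rough sketch of listing the stuck build pods and their mount events with the kubernetes Python client could look like this (the namespace and label selector are guesses, not copied from the deployment):

from kubernetes import client, config

# Rough sketch: list build pods stuck in Pending and print their mount-related
# events. The namespace and label selector below are illustrative guesses.
config.load_kube_config()
v1 = client.CoreV1Api()
namespace = "prod"

pods = v1.list_namespaced_pod(namespace, label_selector="component=binderhub-build")
for pod in pods.items:
    if pod.status.phase != "Pending":
        continue
    events = v1.list_namespaced_event(
        namespace, field_selector=f"involvedObject.name={pod.metadata.name}"
    )
    for event in events.items:
        if event.reason in ("FailedMount", "FailedAttachVolume"):
            print(pod.metadata.name, event.reason, event.message)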

@betatim (Member, Author) commented May 3, 2021

If we need some time to figure out why this didn't work, I'd propose merging this and making a second attempt to land the feature later. Maybe we can even add a test that would have caught this, to stop it from coming back?

@g-braeunlich (Contributor):

oh. 😬

@manics (Member) commented May 3, 2021

The change in secret name is breaking (#1255 (review)), though the error unmounted volumes=[docker-push-secret] doesn't match it, so maybe that's a red herring?

Edit (again)...
https://github.com/jupyterhub/binderhub/pull/1255/files#diff-398e15cdaefcf366bd9fde04a2bcb2cd82840b68c28b5a9c632e55fad93abe0f
changes docker-push-secret to docker-config

@consideRatio (Member):

Hmmm, this would have happened if the binderhub pods didn't restart as part of the upgrade, and therefore never started using the new k8s secret name for the new build pods they create.

Since the build pods keep piling up and stay Pending because the old k8s secret no longer exists, it indicates that the software creating the build pods didn't reconfigure itself with the new k8s secret name. Is it the binderhub pod that creates the build pods?

@manics (Member) commented May 3, 2021

The build pods should be created by the BinderHub application

def submit(self):
    """Submit a build pod to create the image for the repository."""

@betatim (Member, Author) commented May 4, 2021

It sounds like restarting the binderhub pods should be the way to fix this/keep it from happening. The weird thing is that a change like this (new binderhub code -> new binderhub container image) means the binderhub pods will restart. Unfortunately I didn't look at the age of the binderhub pods when things were broken, so I don't know for sure whether they restarted or not. However, the bhub pods are now ~15h old, which I think means they restarted when we reverted the deploy. So they probably also restarted when we originally deployed this.

Does anyone understand why, even with the bhub pods restarting, we still ended up with build pods unable to start for a period of a few hours?
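
(For next time, roughly how we could check whether the binder pods actually rolled over, e.g. with the kubernetes Python client; the namespace and label here are guesses:)

from kubernetes import client, config

# Sketch: print start time and image of each binder pod to see whether the
# replicas actually restarted during the upgrade. Namespace/label are guesses.
config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("prod", label_selector="component=binder").items:
    print(
        pod.metadata.name,
        pod.status.phase,
        pod.status.start_time,
        [c.image for c in pod.spec.containers],
    )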

@consideRatio (Member) commented May 4, 2021

This is known:

  1. Tests succeeded in the binderhub CI/CD pipeline, including the creation of a build pod, so new pods successfully referenced the new k8s Secret name. The tests we have are for fresh installations though.
  2. mybinder.org makes use of two replicas of the binder pod (running binderhub) and performs a rolling upgrade of the binder pod running the binderhub software.
  3. helm upgrade --wait, as used by the mybinder.org-deploy CI/CD script, should ensure that deployments start up successfully and become ready.
  4. The new version contains no reference to the old k8s Secret name, so the errors referencing it indicate that an old replica of the binder pod was still running.
  5. This suggests that the upgrade must have failed. Inspecting mybinder.org-deploy pipelines, it seems like it did! It failed waiting for the binder Deployment!

Conclusion

mybinder.org deployed a new version. This worked out well for all federation members except GKE, where a rolling upgrade was performed but the new binder replicas failed to become Ready. The helm upgrade had meanwhile removed the k8s secret with the old name and added the one with the new name.

Why did the GKE deployment fail, and why didn't the binderhub pod replicas become ready? I'm not sure. I note that GKE prod uses 2 replicas, while the other deployments use 1 and staging uses 1-3 based on an HPA (horizontal pod autoscaler).

Suggested action point

I suggest we try to redeploy the latest binderhub version via mybinder.org-deploy and babysit the deployment, observing specifically how the binderhub rollout turns out and, if it fails, why.
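
A rough sketch of what that babysitting could look like programmatically (the deployment name, namespace, and timeout below are assumptions, not taken from mybinder.org-deploy):

import time
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

def wait_for_rollout(name="binder", namespace="prod", timeout=600):
    """Poll the Deployment until all replicas are updated and ready."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        dep = apps.read_namespaced_deployment(name, namespace)
        spec_replicas = dep.spec.replicas or 0
        status = dep.status
        if (
            (status.updated_replicas or 0) >= spec_replicas
            and (status.ready_replicas or 0) >= spec_replicas
            and (status.unavailable_replicas or 0) == 0
        ):
            print("rollout complete")
            return True
        print(
            f"waiting: updated={status.updated_replicas} "
            f"ready={status.ready_replicas} unavailable={status.unavailable_replicas}"
        )
        time.sleep(10)
    print("rollout did not complete in time; inspect pod events/logs")
    return False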

@manics (Member) commented May 4, 2021

Is there some magic we can add to the mybinder.org-deploy pipeline to scream/tweet/post a message to gitter when a production deploy fails?
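
Something along these lines could work as a final "on failure" step in the pipeline; this is only a sketch, assuming the Gitter REST API's chatMessages endpoint and made-up environment variable names for the room id and token:

import os
import requests

def notify_gitter(message):
    """Post a message to a Gitter room; room id and token come from CI secrets."""
    room_id = os.environ["GITTER_ROOM_ID"]  # made-up variable name
    token = os.environ["GITTER_TOKEN"]      # made-up variable name
    resp = requests.post(
        f"https://api.gitter.im/v1/rooms/{room_id}/chatMessages",
        headers={"Authorization": f"Bearer {token}"},
        json={"text": message},
        timeout=10,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    notify_gitter(":rotating_light: production deploy to mybinder.org failed, see the CI logs")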

@betatim (Member, Author) commented May 7, 2021

Attempted another deploy, same result.

Digging around the helm chart a bit I found

{{- if .Values.config.BinderHub.use_registry }}
- name: docker-secret
  secret:
    secretName: binder-push-secret
which makes me think the name of the secret was changed in some but not all places.

@consideRatio (Member):

@betatim what do the logs from the binder pods say? Have both restarted and successfully become ready - in other words, are both fully updated?

@betatim (Member, Author) commented May 7, 2021

No, they remain in the "ContainerCreating" status.

@betatim (Member, Author) commented May 7, 2021

jupyterhub/mybinder.org-deploy#1900 (comment) has some notes

@consideRatio (Member):

@betatim I think I found the issue. I thought I had reviewed this by doing a full search across the git repo, but I probably only searched within some folder...

I think #1294 will fix this.

@g-braeunlich (Contributor):

Sorry guys, again! This was my sloppiness!

@consideRatio (Member):

@g-braeunlich bugs happen :)

@betatim I merged the bugfix in #1294, which closed this PR, hoping to help anyone using the latest version who runs into this issue.

@betatim (Member, Author) commented May 11, 2021

@consideRatio is your change in #1294 what I found in #1293 (comment)?

Thanks for the fix. Goes to show that even "simple" changes can be tricky to get right :D

@consideRatio (Member) commented May 11, 2021

> @consideRatio is your change in #1294 what I found in #1293 (comment)?
>
> Thanks for the fix. Goes to show that even "simple" changes can be tricky to get right :D

Ahhhh yes it is, I missed your comment! Sorry for seemingly ignoring it, @betatim!

@betatim (Member, Author) commented May 11, 2021

No worries.

It remains an unsolved mystery (for me) why this didn't fail on mybinder.org's staging deployment, but yeah. Let's see what happens this time 😂

@consideRatio deleted the revert-1255-master branch May 25, 2022 14:19
