
Conversation

@betatim (Member) commented May 3, 2021

This proposes to revert #1255. When we deployed this to mybinder.org today we ended up with all(?) build pods getting stuck in a pending state for about two hours. At which point we reverted the deploy.

[Screenshot 2021-05-03 at 17:05: graph of running build requests around the deploy]

The marker just before 15:30h is the deploy. The blue shaded graph shows the number of "running" build requests ("running" includes "container creating"). At 17:00h I deleted all build pods and reverted the deploy.

Build pods that were stuck during this time contained the following in the events:

Events:
  Type     Reason       Age                 From                                         Message
  ----     ------       ----                ----                                         -------
  Warning  FailedMount  42m (x3 over 56m)   kubelet, gke-prod-user-202009-b9c03ca0-vivm  Unable to attach or mount volumes: unmounted volumes=[docker-push-secret], unattached volumes=[default-token-jcbqc docker-socket docker-push-secret]: timed out waiting for the condition
  Warning  FailedMount  26m (x3 over 31m)   kubelet, gke-prod-user-202009-b9c03ca0-vivm  Unable to attach or mount volumes: unmounted volumes=[docker-push-secret], unattached volumes=[docker-push-secret default-token-jcbqc docker-socket]: timed out waiting for the condition
  Warning  FailedMount  17m (x11 over 61m)  kubelet, gke-prod-user-202009-b9c03ca0-vivm  Unable to attach or mount volumes: unmounted volumes=[docker-push-secret], unattached volumes=[docker-socket docker-push-secret default-token-jcbqc]: timed out waiting for the condition
  Warning  FailedMount  2m (x38 over 63m)   kubelet, gke-prod-user-202009-b9c03ca0-vivm  MountVolume.SetUp failed for volume "docker-push-secret" : secret "binder-push-secret" not found

The summary of the original PR says that there is no breaking change and that, at most, new build pods might briefly use the wrong secret or a secret that doesn't exist. Does anyone involved in that PR have an idea of what happened/went wrong when we deployed this?
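
For anyone who wants to reproduce the diagnosis, a rough sketch of listing the stuck build pods and their mount events with the kubernetes Python client could look like this (the namespace and label selector are guesses, not copied from the deployment):

from kubernetes import client, config

# Rough sketch: list build pods stuck in Pending and print their mount-related
# events. The namespace and label selector below are illustrative guesses.
config.load_kube_config()
v1 = client.CoreV1Api()
namespace = "prod"

pods = v1.list_namespaced_pod(namespace, label_selector="component=binderhub-build")
for pod in pods.items:
    if pod.status.phase != "Pending":
        continue
    events = v1.list_namespaced_event(
        namespace, field_selector=f"involvedObject.name={pod.metadata.name}"
    )
    for event in events.items:
        if event.reason in ("FailedMount", "FailedAttachVolume"):
            print(pod.metadata.name, event.reason, event.message)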

@betatim (Member, Author) commented May 3, 2021

If we need some time to figure out why this didn't work, I'd propose merging this and making a second attempt to land the feature later. Maybe we can even add a test that would have caught this, to stop it from coming back?

@g-braeunlich (Contributor):

oh. 😬

@manics (Member) commented May 3, 2021

The change in secret name is breaking (#1255 (review)), though the error unmounted volumes=[docker-push-secret] doesn't match it, so maybe that's a red herring?

Edit (again)...
https://github.com/jupyterhub/binderhub/pull/1255/files#diff-398e15cdaefcf366bd9fde04a2bcb2cd82840b68c28b5a9c632e55fad93abe0f
changes docker-push-secret to docker-config

@consideRatio (Member):

Hmmm, this would have happened if the binderhub pods didn't restart as part of the upgrade, and therefore never started using the new k8s secret name for the new build pods they create.

Since the build pods keep piling up and stay Pending because the old k8s secret no longer exists, it indicates that the software creating the build pods didn't reconfigure itself with the new k8s secret name. Is it the binderhub pod that creates the build pods?

@manics (Member) commented May 3, 2021

The build pods should be created by the BinderHub application

def submit(self):
    """Submit a build pod to create the image for the repository."""

@betatim (Member, Author) commented May 4, 2021

It sounds like restarting the binderhub pods should be the way to fix this/keep it from happening. The weird thing is that a change like this (new binderhub code -> new binderhub container image) means the binderhub pods will restart. Unfortunately I didn't look at the age of the binderhub pods when things were broken, so I don't know for sure whether they restarted or not. However, the bhub pods are now ~15h old, which I think means they restarted when we reverted the deploy. So they probably also restarted when we originally deployed this.

Does anyone understand why, even with the bhub pods restarting, we still ended up with build pods unable to start for a period of a few hours?
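
(For next time, roughly how we could check whether the binder pods actually rolled over, e.g. with the kubernetes Python client; the namespace and label here are guesses:)

from kubernetes import client, config

# Sketch: print start time and image of each binder pod to see whether the
# replicas actually restarted during the upgrade. Namespace/label are guesses.
config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod("prod", label_selector="component=binder").items:
    print(
        pod.metadata.name,
        pod.status.phase,
        pod.status.start_time,
        [c.image for c in pod.spec.containers],
    )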

@consideRatio (Member) commented May 4, 2021

This is known:

  1. Tests succeeded in the binderhub CI/CD pipeline, including the creation of a build pod, so new pods successfully referenced the new k8s Secret name. The tests we have are for fresh installations though.
  2. mybinder.org makes use of two replicas of the binder pod (running binderhub) and performs a rolling upgrade of the binder pod running the binderhub software.
  3. helm upgrade --wait, as used by the mybinder.org-deploy CI/CD script, should ensure that deployments start up successfully and become ready.
  4. The new version contains no reference to the old k8s Secret name, so the errors referencing it indicate that an old replica of the binder pod was still running.
  5. This suggests that the upgrade must have failed. Inspecting mybinder.org-deploy pipelines, it seems like it did! It failed waiting for the binder Deployment!

Conclusion

mybinder.org deployed a new version. This worked out well for all federation members except GKE, where a rolling upgrade was performed but the new binder replicas failed to become Ready. The helm upgrade had meanwhile removed the k8s secret with the old name and added the one with the new name.

Why did the GKE deployment fail, and why didn't the binderhub pod replicas become ready? I'm not sure. I note that GKE prod uses 2 replicas, while the other deployments use 1 and staging uses 1-3 based on an HPA (horizontal pod autoscaler).

Suggested action point

I suggest we try to redeploy the latest binderhub version via mybinder.org-deploy and babysit the deployment, observing specifically how the binderhub rollout turns out and, if it fails, why.
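
A rough sketch of what that babysitting could look like programmatically (the deployment name, namespace, and timeout below are assumptions, not taken from mybinder.org-deploy):

import time
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

def wait_for_rollout(name="binder", namespace="prod", timeout=600):
    """Poll the Deployment until all replicas are updated and ready."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        dep = apps.read_namespaced_deployment(name, namespace)
        spec_replicas = dep.spec.replicas or 0
        status = dep.status
        if (
            (status.updated_replicas or 0) >= spec_replicas
            and (status.ready_replicas or 0) >= spec_replicas
            and (status.unavailable_replicas or 0) == 0
        ):
            print("rollout complete")
            return True
        print(
            f"waiting: updated={status.updated_replicas} "
            f"ready={status.ready_replicas} unavailable={status.unavailable_replicas}"
        )
        time.sleep(10)
    print("rollout did not complete in time; inspect pod events/logs")
    return False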

@manics (Member) commented May 4, 2021

Is there some magic we can add to the mybinder.org-deploy pipeline to scream/tweet/post a message to gitter when a production deploy fails?
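
Something along these lines could work as a final "on failure" step in the pipeline; this is only a sketch, assuming the Gitter REST API's chatMessages endpoint and made-up environment variable names for the room id and token:

import os
import requests

def notify_gitter(message):
    """Post a message to a Gitter room; room id and token come from CI secrets."""
    room_id = os.environ["GITTER_ROOM_ID"]  # made-up variable name
    token = os.environ["GITTER_TOKEN"]      # made-up variable name
    resp = requests.post(
        f"https://api.gitter.im/v1/rooms/{room_id}/chatMessages",
        headers={"Authorization": f"Bearer {token}"},
        json={"text": message},
        timeout=10,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    notify_gitter(":rotating_light: production deploy to mybinder.org failed, see the CI logs")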

@betatim (Member, Author) commented May 7, 2021

Attempted another deploy, same result.

Digging around the helm chart a bit I found

{{- if .Values.config.BinderHub.use_registry }}
- name: docker-secret
  secret:
    secretName: binder-push-secret
which makes me think the name of the secret was changed in some but not all places.

@consideRatio (Member):

@betatim what do the logs from the binder pods say? Have both restarted and successfully become ready - in other words, are both fully updated?

@betatim (Member, Author) commented May 7, 2021

No, they remain in the "ContainerCreating" status.

@betatim (Member, Author) commented May 7, 2021

jupyterhub/mybinder.org-deploy#1900 (comment) has some notes

@consideRatio (Member):

@betatim I think I found the issue. I thought I had reviewed this by doing a full search across the git repo, but I probably only searched within some folder...

I think #1294 will fix this.

@g-braeunlich (Contributor):

Sorry guys, again! This was my sloppiness!

@consideRatio (Member):

@g-braeunlich bugs happen :)

@betatim I merged the bugfix in #1294, which closed this PR, hoping to help anyone using the latest version who runs into this issue.

@betatim (Member, Author) commented May 11, 2021

@consideRatio is your change in #1294 what I found in #1293 (comment)?

Thanks for the fix. Goes to show that even "simple" changes can be tricky to get right :D

@consideRatio (Member) commented May 11, 2021

> @consideRatio is your change in #1294 what I found in #1293 (comment)?
>
> Thanks for the fix. Goes to show that even "simple" changes can be tricky to get right :D

Ahhhh yes it is, I missed your comment! Sorry for seemingly ignoring it, @betatim!

@betatim (Member, Author) commented May 11, 2021

No worries.

It remains an unsolved mystery (for me) why this didn't fail on mybinder.org's staging deployment, but yeah. Let's see what happens this time 😂

@consideRatio deleted the revert-1255-master branch May 25, 2022 14:19
