
Conversation

mvandenburgh
Member

Docs for these two settings: https://docs.gitlab.com/runner/executors/kubernetes/#configure-the-number-of-request-attempts-to-the-kubernetes-api

Job system failures like this one, i.e. an error that looks like "error dialing backend: remote error: tls: internal error", indicate that the pipeline pod failed to receive a response from the k8s/EKS API server. It's still unclear why this is happening, but one potential explanation is that the default timeout for EKS API requests (2 seconds) is being exceeded.

Long term, I would like to set up https://docs.aws.amazon.com/eks/latest/best-practices/control_plane_monitoring.html so we can get more insight into what's going on with the control plane.

Comment on lines +104 to +115
retry_backoff_max = 30000
# This is the default retry limit. We override this for specific classes of
# errors below.
retry_limit = 5
[runners.kubernetes.retry_limits]
# Retry this type of error 10 times instead of 5.
# This error usually occurs when the EKS API server times out or
# is unreachable. Presumably the server will eventually become
# available again, so we want to give the pod plenty of time to retry.
"tls: internal error" = 10
Collaborator


retry_backoff_max seems to just control the maximum value the retry interval can reach. Do you know what value the retry interval starts at? And how the backoff is incremented? Is it doubled each time, etc.?
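
For reference, my working assumption (not verified against the runner source) is that this is a standard capped exponential backoff, along the lines of the sketch below. The 500 ms starting interval and the doubling are guesses; only the 30 s cap corresponds to retry_backoff_max in this config.

package main

import (
	"fmt"
	"time"
)

// backoff returns the wait before the given retry attempt, doubling an
// assumed 500 ms initial interval and capping it at retry_backoff_max.
func backoff(attempt int, initial, max time.Duration) time.Duration {
	d := initial
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	return d
}

func main() {
	// With retry_backoff_max = 30000 (30 s), the assumed doubling would hit
	// the cap around the 6th retry: 0.5s, 1s, 2s, 4s, 8s, 16s, 30s, 30s, ...
	for attempt := 0; attempt < 10; attempt++ {
		fmt.Printf("attempt %d: wait %v\n", attempt,
			backoff(attempt, 500*time.Millisecond, 30*time.Second))
	}
}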
