Buildkite Agent Scaler

An AWS Lambda function that scales an Amazon EC2 Auto Scaling group (ASG) based on metrics provided by the Buildkite Agent Metrics API.

In practice, we've seen 300% faster initial scale-ups with this lambda vs native AutoScaling rules. 🚀

Why?

The Elastic CI Stack depends on being able to scale up quickly from zero instances in response to scheduled Buildkite jobs. Amazon's AutoScaling primitives have a number of limitations that we wanted more granular control over:

The median time for a scaling event to be triggered was 2 minutes, because it needs two samples with a minimum period of 60 seconds between them.
Scaling can either be by a fixed rate, a fixed step size or tracking, but tracking doesn't work well with custom metrics like we use.

How does it work?

The Lambda (or CLI version) polls the Buildkite Metrics API every 10 seconds and, based on the results, sets the DesiredCount to exactly what is needed. This allows a much faster scale-up.
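As a rough illustration of "exactly what is needed", the calculation can be sketched as a ceiling division of the job count by the number of agents per instance. This is a minimal sketch, not the scaler's actual code: the function name desiredInstances, the choice to count both scheduled and running jobs, and the minSize/maxSize clamp are assumptions made for the example.

```go
package main

import "fmt"

// desiredInstances is a hypothetical helper: it returns how many instances
// are needed for the given job counts, assuming every instance runs
// agentsPerInstance agents and each agent runs one job at a time.
// The result is clamped to the range [minSize, maxSize].
func desiredInstances(scheduledJobs, runningJobs, agentsPerInstance, minSize, maxSize int) int {
	if agentsPerInstance <= 0 {
		return minSize // invalid configuration: fall back to the minimum
	}
	jobs := scheduledJobs + runningJobs
	// Integer ceiling division, so a partially filled instance still counts.
	desired := (jobs + agentsPerInstance - 1) / agentsPerInstance
	if desired < minSize {
		desired = minSize
	}
	if desired > maxSize {
		desired = maxSize
	}
	return desired
}

func main() {
	// 7 scheduled + 2 running jobs with 4 agents per instance.
	fmt.Println(desiredInstances(7, 2, 4, 0, 10))
}
```

Because the count is recomputed from the current metrics on every poll, the group can jump straight to the required size rather than stepping up one alarm period at a time.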

Configuration

Availability-based scaling

The scaler monitors agent availability to handle situations where EC2 instances are healthy but Buildkite agents aren't connecting. This can happen due to network issues, agent configuration problems, or instance startup delays.

AVAILABILITY_THRESHOLD (default: 0.5)

When jobs are queued, the scaler checks whether the fraction of connected agents meets this threshold. For example, with 4 agents per instance and 2 instances running (8 expected agents), if only 3 agents are online, that's 37.5% availability (0.375), which is below the default threshold of 0.5.

When availability drops below the threshold and the ASG has converged (actual instances match desired), the scaler adds one instance to help recover availability.

Set AVAILABILITY_THRESHOLD=0 to disable availability-based scaling. The scaler will then scale based only on job count.
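The rules above can be sketched as a small decision function. This is an illustration under stated assumptions rather than the scaler's real implementation: the names availability and shouldAddInstance and their parameters are hypothetical, and the behaviour when no agents are expected is a choice made for the example.

```go
package main

import "fmt"

// availability returns the fraction of expected agents that are connected.
// When no agents are expected it reports 1.0 (an assumption for this sketch).
func availability(onlineAgents, runningInstances, agentsPerInstance int) float64 {
	expected := runningInstances * agentsPerInstance
	if expected <= 0 {
		return 1.0
	}
	return float64(onlineAgents) / float64(expected)
}

// shouldAddInstance reports whether one extra instance should be requested:
// jobs must be queued, the ASG must have converged (actual == desired),
// and availability must be below the threshold. A threshold of 0 disables
// the check entirely.
func shouldAddInstance(queuedJobs int, avail, threshold float64, actualInstances, desiredInstances int) bool {
	if threshold <= 0 {
		return false // availability-based scaling disabled
	}
	if queuedJobs == 0 {
		return false // nothing is waiting for an agent
	}
	if actualInstances != desiredInstances {
		return false // ASG is still converging
	}
	return avail < threshold
}

func main() {
	// The example from the text: 2 instances, 4 agents each, 3 agents online.
	a := availability(3, 2, 4)
	fmt.Printf("availability: %.3f\n", a)
	fmt.Println("add instance:", shouldAddInstance(5, a, 0.5, 2, 2))
}
```

Waiting for the ASG to converge before adding an instance avoids stacking extra capacity on top of instances that are still booting.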

Threshold tuning:

Lower threshold (e.g., 0.3): Tolerates slower agent connection times, reduces instance churn
Higher threshold (e.g., 0.8): Aggressive scaling to maintain high availability when agents are expected to connect quickly
Disabled (0): Job-based scaling only, suitable when agents connect reliably

Gracefully scaling in

🚧 For the Elastic CI Stack, a dedicated, experimental mode is now available, configured with the ELASTIC_CI_MODE variable. You can read more about it here. 🚧

Whilst the lambda does support scaling in via setting DesiredCount, Amazon ASGs appear to not send Lifecycle Hooks before terminating instances, so jobs in progress are interrupted.

Instead, in the Elastic CI Stack we run the scaler with scale-in disabled (DISABLE_SCALE_IN) and rely on the --disconnect-after-idle-timeout option, added to the Agent in buildkite-agent v3.10.0, combined with a systemd PostStop script that terminates the instance and atomically decreases the DesiredCount once the agent has been idle for a set period. We've found this works really well, and it is less complicated than relying on lifecycled and Lifecycle Hooks.
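On the scaler's side, disabling scale-in amounts to never lowering the capacity it requests. A minimal sketch of that guard follows, assuming a hypothetical function nextDesired; the real scaler's internals may differ.

```go
package main

import "fmt"

// nextDesired is a hypothetical helper: it returns the capacity the scaler
// should request given the current capacity and the freshly computed target.
// With disableScaleIn set, the target is never allowed to drop below the
// current capacity; idle instances are expected to remove themselves.
func nextDesired(current, computed int, disableScaleIn bool) int {
	if disableScaleIn && computed < current {
		return current
	}
	return computed
}

func main() {
	fmt.Println(nextDesired(5, 2, true))  // scale-in suppressed
	fmt.Println(nextDesired(5, 2, false)) // scale-in allowed
	fmt.Println(nextDesired(2, 5, true))  // scale-out is unaffected
}
```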

See the forum post for more details.

Publishing CloudWatch Metrics

The scaler collects its own metrics and doesn't require buildkite-agent-metrics. It can optionally publish the metrics it collects back to CloudWatch, although it only supports a subset of the metrics that the buildkite-agent-metrics binary collects:

Buildkite > (Org, Queue) > ScheduledJobsCount
Buildkite > (Org, Queue) > RunningJobCount

Running as an AWS Lambda

An AWS Lambda bundle is created and published as part of the build process. The lambda will require the following IAM permissions:

cloudwatch:PutMetricData
autoscaling:DescribeAutoScalingGroups
autoscaling:DescribeScalingActivities
autoscaling:SetDesiredCapacity

Its handler is bootstrap, it uses the provided.al2 runtime, and it requires the following environment variables:

BUILDKITE_AGENT_TOKEN or BUILDKITE_AGENT_TOKEN_SSM_KEY
BUILDKITE_QUEUE
AGENTS_PER_INSTANCE
ASG_NAME

If BUILDKITE_AGENT_TOKEN_SSM_KEY is set, the token is read from AWS Systems Manager Parameter Store using GetParameter, which can also read from AWS Secrets Manager.
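To make the requirements concrete, here is a hedged sketch of how these variables might be validated at startup. The config struct and loadConfig function are hypothetical and are not taken from the scaler's source; only the environment variable names come from the list above. The lookup function is passed in so the example runs without touching the real environment (in a real program it could be os.Getenv).

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// config holds the settings named in the list above (hypothetical type).
type config struct {
	AgentToken        string
	AgentTokenSSMKey  string
	Queue             string
	AgentsPerInstance int
	ASGName           string
}

// loadConfig reads and validates the required variables using getenv.
func loadConfig(getenv func(string) string) (config, error) {
	cfg := config{
		AgentToken:       getenv("BUILDKITE_AGENT_TOKEN"),
		AgentTokenSSMKey: getenv("BUILDKITE_AGENT_TOKEN_SSM_KEY"),
		Queue:            getenv("BUILDKITE_QUEUE"),
		ASGName:          getenv("ASG_NAME"),
	}
	// Either the token itself or the SSM key that points at it is required.
	if cfg.AgentToken == "" && cfg.AgentTokenSSMKey == "" {
		return config{}, errors.New("set BUILDKITE_AGENT_TOKEN or BUILDKITE_AGENT_TOKEN_SSM_KEY")
	}
	if cfg.Queue == "" {
		return config{}, errors.New("BUILDKITE_QUEUE is required")
	}
	if cfg.ASGName == "" {
		return config{}, errors.New("ASG_NAME is required")
	}
	raw := getenv("AGENTS_PER_INSTANCE")
	n, err := strconv.Atoi(raw)
	if err != nil || n < 1 {
		return config{}, fmt.Errorf("AGENTS_PER_INSTANCE must be a positive integer, got %q", raw)
	}
	cfg.AgentsPerInstance = n
	return cfg, nil
}

func main() {
	// Placeholder values for illustration only.
	env := map[string]string{
		"BUILDKITE_AGENT_TOKEN_SSM_KEY": "/example/agent-token",
		"BUILDKITE_QUEUE":               "default",
		"AGENTS_PER_INSTANCE":           "4",
		"ASG_NAME":                      "example-asg",
	}
	cfg, err := loadConfig(func(key string) string { return env[key] })
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("queue=%s agents/instance=%d asg=%s\n", cfg.Queue, cfg.AgentsPerInstance, cfg.ASGName)
}
```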

aws lambda create-function \
  --function-name buildkite-agent-scaler \
  --memory-size 128 \
  --role arn:aws:iam::account-id:role/execution_role \
  --runtime provided.al2 \
  --zip-file fileb://handler.zip \
  --handler bootstrap

Development

This project uses mise to manage development tooling, so that everything needed is installed in one step and at the expected versions. To install mise, run the ./bin/mise bootstrap script or follow the mise documentation. Then run mise install to install all the required tooling defined in mise.toml.

Running agent-scaler locally

$ mise exec go -- go run . \
  --asg-name elastic-runners-AgentAutoScaleGroup-XXXXX \
  --agent-token "$BUILDKITE_AGENT_TOKEN"

Using Clusters

The BUILDKITE_AGENT_TOKEN is scoped to a specific cluster. It's best to create a unique token for the cluster being targeted by the scaler.

The scaler is set up automatically by the Elastic CI Stack's CloudFormation templates, which reference the agent token and a queue name. A Lambda function running the scaler is then generated using these references (e.g., BUILDKITE_AGENT_TOKEN_SSM_KEY and BUILDKITE_QUEUE).
