Skip to content

bug: GetRegularWorkspaceCreateMetrics creates too much DB load #1115

@cstyan

Description

@cstyan

As of November 6 this query is our 2nd most expensive query over the last 7d in terms of load over time.

It's not called too frequently (30k calls in a week) or returning a large data set (avg 92 rows), but it's avg. latency is > 160ms and the p99 is more like 740ms. The main source of the cost here is sequential scans of provisioner_jobs and workspace_builds in order to find the first successful build per workspace.

Image

Currently for workspace builds we have an explicit index on only the initiator ID and the actual build ID (PK), then implicit indices as a result of UNIQUE constraints for (workspace_id, name), (workspace_id, build_number), and (job_id). For provisioner_jobs we have explicit indices on started_at, job_status, and worker_id, organization_id, completed_at).

We do not have indices on either workspace_builds.transition nor provisioner_jobs.job_status and thus have to do sequential scans of both tables in order to fetch the relevant data. For ourselves internally this is scans of ~80k rows (33k returned) and ~92k rows (70k returned) respectively. The join and sort after this are also relatively expensive.

The join (from the plan) on pj.id = wb.job_id also results in a relatively high # of matches (~30k, so roughly 90% of workspace_build start builds succeeded).

I would propose adding indices similar to these:

  1. btree index for workspace_builds on (workspace_id, build_number, id). This index will already be ordered so should be able to eliminate the need for the sort after the join. AI is also telling me that if we include the other fields from the query in the index, those being template_version_preset_id, initiator_id, and job_id, we can speed the query up/reduce the cost further by eliminating the need for heap fetches from the workspace_builds table itself, but I'm not sure of this optimization nor whether we want the extra index size that would come from this.
  2. an index for provisioner_jobs either on (job_status, id) OR on (id) WHERE job_status = 'succeeded'. The latter would help only this query, a more general index on (job_status, id) would help potentially multiple queries such as GetWorkspacesEligibleForTransition, GetFailedWorkspaceBuildsByTemplateID, CountPendingNonActivePrebuilds, and more.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions