
Conversation

cristiangreco

Execute ListMetrics calls in separate goroutines (one for each metric),
in a similar way to how GetMetricData requests are handled.

This change removes semaphore usage (it's not used in GetMetricData
either) for the sake of making the code easier to reason about. It can
easily be added back if we hit any issues around e.g. rate limits.

In local tests, the speedup is particularly noticeable when requesting
4 or more metrics in parallel (especially with EC2/EBS).
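
Roughly, the fan-out described above looks like the sketch below. The names here (metric, listMetrics, fetchAll) are hypothetical stand-ins, not the exporter's real identifiers: each metric gets its own goroutine and, with the semaphore removed, nothing bounds concurrency.

```go
package main

import (
	"context"
	"log"
	"sync"
)

type metric struct{ Name string }

// listMetrics stands in for the real ListMetrics call against CloudWatch.
func listMetrics(ctx context.Context, namespace string, m metric) ([]string, error) {
	return []string{m.Name}, nil // placeholder
}

// fetchAll spawns one goroutine per metric, mirroring how GetMetricData
// requests are already parallelised. Errors are only logged, not returned.
func fetchAll(ctx context.Context, namespace string, metrics []metric) [][]string {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		results [][]string
	)
	for _, m := range metrics {
		wg.Add(1)
		go func(m metric) {
			defer wg.Done()
			list, err := listMetrics(ctx, namespace, m)
			if err != nil {
				log.Printf("failed to list metrics for %s: %v", m.Name, err)
				return
			}
			mu.Lock()
			results = append(results, list)
			mu.Unlock()
		}(m)
	}
	wg.Wait()
	return results
}
```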

kgeckhart and others added 13 commits October 7, 2021 08:24
Eliminate global labelMap and build a labelSet for metrics in ensureLabelConsistencyForMetrics
* Pass logger to the different structs

* Use a builder for the AWS services

* Use job specific loggers to propagate properties
* Merge latest release 0.34.0-alpha

* Add missing metric
* Move AWS metrics back to prometheus.go and expose them as array in update.go

* The go workflow should run on PR's for live

* Remove unused field
Execute ListMetrics calls in separate goroutines (one for each metric), in a similar way to how GetMetricData requests are handled.

@kgeckhart left a comment


> This change removes semaphore usage (it's not used in GetMetricData either) for the sake of making the code easier to reason about. It can easily be added back if we hit any issues around e.g. rate limits.

Slight concern: errors aren't propagated out of the parallel sections of the code at the moment, so we might want to monitor errors a little more closely after this change.
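
If propagating those errors does become a requirement, one option (not part of this PR) is golang.org/x/sync/errgroup. A minimal sketch, again with hypothetical names:

```go
package main

import (
	"context"

	"golang.org/x/sync/errgroup"
)

type metric struct{ Name string }

// listMetrics stands in for the real ListMetrics call.
func listMetrics(ctx context.Context, namespace string, m metric) error { return nil }

// fetchAllMetrics runs one goroutine per metric but returns the first error,
// so failures surface to the caller instead of only appearing in the logs.
func fetchAllMetrics(ctx context.Context, namespace string, metrics []metric) error {
	g, gctx := errgroup.WithContext(ctx)
	for _, m := range metrics {
		m := m // capture loop variable (needed before Go 1.22)
		g.Go(func() error {
			return listMetrics(gctx, namespace, m)
		})
	}
	return g.Wait() // first non-nil error, if any
}
```

The first failing call cancels gctx and is returned from Wait, so monitoring wouldn't have to rely on log scraping alone.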


@IfSentient left a comment


Do we need to worry about rate limiting from AWS in the case of large numbers of metrics?
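
For what it's worth, the semaphore the description mentions could be reintroduced as a buffered channel if throttling does show up. A minimal sketch with hypothetical names (metric, listMetrics, maxConcurrency):

```go
package main

import (
	"context"
	"sync"
)

type metric struct{ Name string }

// listMetrics stands in for the real ListMetrics call.
func listMetrics(ctx context.Context, namespace string, m metric) error { return nil }

// fetchWithLimit still fans out one goroutine per metric, but a buffered
// channel caps the number of ListMetrics calls in flight at once.
func fetchWithLimit(ctx context.Context, namespace string, metrics []metric, maxConcurrency int) {
	sem := make(chan struct{}, maxConcurrency)
	var wg sync.WaitGroup
	for _, m := range metrics {
		wg.Add(1)
		go func(m metric) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			_ = listMetrics(ctx, namespace, m)
		}(m)
	}
	wg.Wait()
}
```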

metricsList, err := getFullMetricsList(ctx, svc.Namespace, m, clientCloudwatch)
if err != nil {
	level.Error(logger).Log("msg", "Failed to get full metric list", "err", err, "metric_name", m.Name, "namespace", svc.Namespace)
}


Maybe we could add a prom metric to track both these errors and the scenario below with zero resources?
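
Something along these lines could work with client_golang; the metric names and labels below are made up for illustration, not existing exporter metrics:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical counters for the two cases above: failed ListMetrics calls
// and discovery jobs that resolve to zero resources.
var (
	listMetricsErrors = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "yace_cloudwatch_list_metrics_errors_total",
			Help: "Number of failed ListMetrics calls.",
		},
		[]string{"namespace", "metric_name"},
	)
	zeroResourceJobs = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "yace_jobs_with_zero_resources_total",
			Help: "Number of discovery jobs that matched zero resources.",
		},
		[]string{"namespace"},
	)
)

func init() {
	prometheus.MustRegister(listMetricsErrors, zeroResourceJobs)
}
```

The error branch would then call listMetricsErrors.WithLabelValues(svc.Namespace, m.Name).Inc() next to the log line, and the zero-resource branch would increment zeroResourceJobs.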


@ferruvich left a comment


Can we roll this out to a single cluster first to observe any impact?

LGTM, as we're going to do so.

@CLAassistant commented Jun 15, 2022

CLA assistant check
All committers have signed the CLA.

