feat: reimplement history data export #1642

woehrl01 · 2025-02-25T07:53:15Z

This is a reimplementation of #986 by @luismy based on the latest codebase and the suggestions from @kgeckhart

Your comments are highly appreciated

Signed-off-by: Lukas Wöhrl <[email protected]>

pkg/config/config.go

jotweh · 2025-02-28T09:45:00Z

Could you also adjust one of the examples with the new AddHistoricalMetrics flag and different values for period and length?

Signed-off-by: Lukas Wöhrl <[email protected]>

kgeckhart

Great work on updating this! I had a few minor comments and will see about getting some more attention on it.

docs/configuration.md

pkg/promutil/migrate.go

kgeckhart · 2025-03-06T14:09:41Z

pkg/promutil/migrate.go

+					if dataPoint == nil && metric.MetricMigrationParams.AddCloudwatchTimestamp {
+						// If we did not get a datapoint then the timestamp is a default value making it unusable in the
+						// exported metric. Attempting to put a fake timestamp on the metric will likely conflict with
+						// future CloudWatch timestamps which are always in the past. It's safer to skip here than guess
+						continue
+					}


I think there's a slight change in behavior here, if the feature is disabled and the first datapoint is nil we will bypass the break at the end potentially exporting an old data point as current. Could you add a test for this use case to see if it happens?

Your right, there must be a different behaviour here depending if we want to export historic data. Thanks for pointing that out!

pkg/clients/cloudwatch/v2/client.go

pkg/clients/cloudwatch/v1/client.go

woehrl01 · 2025-03-06T14:32:03Z

Thanks for your comments @kgeckhart. Sounds good, will do.

Signed-off-by: Lukas Wöhrl <[email protected]>

woehrl01 · 2025-03-28T10:36:42Z

@kgeckhart anything I can help with?

cristiangreco · 2025-03-28T14:46:33Z

docs/configuration.md

-| `-metrics-per-query` | Number of metrics made in a single GetMetricsData request | `500` |
-| `-labels-snake-case`  | Output labels on metrics in snake case instead of camel case | `false` |
-| `-profiling.enabled` | Enable the /debug/pprof endpoints for profiling | `false` |
+| Flag                                                  | Description                                                                                                                          | Default value    |


Do you mind keeping the changes here to the minimum please?

cristiangreco · 2025-03-28T14:49:40Z

pkg/clients/cloudwatch/client.go

+	Datapoints []DatapointWithTimestamp
+}
+
+type DatapointWithTimestamp struct {


DatapointWithTimestamp seems redundant, can we drop the WithTimestamp suffix?

Also, please pick either DataPoint or Datapoint and change it consistently through the PR (including the config option).

It's a data point with a timestamp, there is already a data point without a timestamp. I'm happy for any suggestion of a better name. But I think this is the most expressive one and easy to understand. What do you suggest?

I personally think it's more readable something like

type Datapoint struct { Value *float64 Timestamp time.Time }

cristiangreco · 2025-03-28T14:53:01Z

pkg/clients/cloudwatch/client.go

+func NewDataPoint(datapoint *float64, timestamp time.Time) DatapointWithTimestamp {
+	return DatapointWithTimestamp{
+		Timestamp: timestamp,
+		Datapoint: datapoint,
+	}
+}
+
+func SingleDataPoint(datapoint *float64, timestamp time.Time) []DatapointWithTimestamp {
+	return []DatapointWithTimestamp{NewDataPoint(datapoint, timestamp)}
+}


These helpers seems a bit of an overkill for such a small struct, can we avoid them?

I don't mind, I believe this improves readability especially in the tests.

cristiangreco · 2025-03-28T14:53:55Z

pkg/model/model.go

 type GetMetricDataResult struct {
-	Statistic string
-	Datapoint *float64
+	Statistic  string
+	Datapoints []DatapointWithTimestamp
+}
+
+type DatapointWithTimestamp struct {
 	Timestamp time.Time
+	Datapoint *float64
+}
+
+func NewDataPoint(datapoint *float64, timestamp time.Time) DatapointWithTimestamp {
+	return DatapointWithTimestamp{
+		Timestamp: timestamp,
+		Datapoint: datapoint,
+	}
+}
+
+func SingleDataPoint(datapoint *float64, timestamp time.Time) []DatapointWithTimestamp {
+	return []DatapointWithTimestamp{NewDataPoint(datapoint, timestamp)}
 }


Same comments as in config.go apply here.

cristiangreco · 2025-03-28T15:06:33Z

pkg/config/config.go

 func (c *ScrapeConf) Validate(logger *slog.Logger) (model.JobsConfig, error) {
 	if c.Discovery.Jobs == nil && c.Static == nil && c.CustomNamespace == nil {
-		return model.JobsConfig{}, fmt.Errorf("At least 1 Discovery job, 1 Static or one CustomNamespace must be defined")
+		return model.JobsConfig{}, fmt.Errorf("at least 1 Discovery job, 1 Static or one CustomNamespace must be defined")


Nitpick: this change is unrelated to the PR.

It unrelated but is a lint error. Should I revert?

What lint error is this solving?

The standard golang lint error that an error text should start with a lowercase letter.

cristiangreco · 2025-03-28T15:55:02Z

pkg/clients/cloudwatch/v2/client.go

+			ID:         *metricDataResult.Id,
+			Datapoints: make([]cloudwatch_client.DatapointWithTimestamp, 0, len(metricDataResult.Timestamps)),
+		}
+		for i := 0; i < len(metricDataResult.Timestamps); i++ {


I feel this new behaviour should be used only if exportAllDataPoints is enabled. If not, we should keep the previous behaviour and avoid copying around the additional values/timestamps from the CW response.

This is required because this function is not aware of any configuration. This is in my opinion the cleaned implementation. See comments from @kgeckhart in the linked pr

I think Kyle's comment was mainly around allowing the client to return a slice of Datapoint + Timestamp, but he can confirm.

My concern is around useless mem allocations when exportAllDataPoints is not enabled. I think the option can be made available to the aws client, e.g. via CloudwatchData or GetMetricDataProcessingParams.

I'm pretty sure there is no significant useless memory allocation. All the values are already in memory because it's part of the api response. So we just allocate the slice a bit bigger, because we use a struct of slice and not a reference to the struct slice, there is a single memory alloc for the underlying array, everything else is just copying over values. So this just keeps the already returned data for a few function calls longer in memory, to unify the struct across v1 and v2.

Feels like an unessary add of complexity, to pass around those parameters and adding additional tests for validate that behaviour.

Following up on my earlier comment: even if we assume returning a slice of 5 structs increases memory usage, the overhead is still reasonable, around ~128 MB for 1 million metrics. That comes from storing 4 additional data points per metric, each at 32 bytes (a time.Time value + a *float64 pointer). It's just a larger backing array and a few more pointers to scan during GC. This only becomes relevant if we're allocating and retaining millions of these slices, which we're not.

And if we were dealing with that many, querying 1 million CloudWatch metrics, the real issue wouldn't be memory. Even with 5 data points per request, that's 200K metric fetches, costing ~$2 per query. At that scale, CloudWatch costs and API limits are a far bigger concern than a ~100 MB memory difference.

We run YACE in a stateless multi-tenant fashion so querying a very large amount of metrics in a short period of time often happens. I was concerned about the potential memory overhead for this change and had shared that with Cristian.

This wasn't the area I was concerned about initially as we keep our length + period setup to only produce a single datapoint as much as possible. I do agree with Cristian, if we stick to only mapping a single datapoint when the setting is disabled it will ensure there's a minimal overhead to those who upgrade to latest without using the feature.

I was primarily concerned about overhead from switching to a single datapoint -> a slice for our use case. I don't think we should do anything about it now. I wrote a OneOrMany[T] that benchmarks nicely vs a single entry slice and would PR it separately if needed.

@kgeckhart I just blindly merged yesterday your suggested changes, to get this PR of my plate. Still I had a look today, about what you changed around your memory concerns. Looking at the current changes, I don't see any change which would reduce memory usage at all. Despite your suggestion, it still creates a slice everytime and it always creates it with full size for all data points. There is now just added complexity around passing over a flag, to stop the loop early, which you could argue reduced CPU overhead, but is likely neglectable at the scale where the cloudwatch costs would explode.

Before we are moving into premature optimisations, have you actually measured and proved that this actually is an issue? (see my comments above why I doubt that based on numbers). You mention that you are running this at large scale. Maybe you can run the initial PR, side by side for 1-5% of your workload (depending on the scale) and share some realworld memory impact?

@woehrl01 thanks for taking a look at it and providing feedback.

Despite your suggestion, it still creates a slice everytime and it always creates it with full size for all data points.

This was an oversight on my part as I was hastily making the change before I needed to go to an event yesterday 😅. My intention was to make sure to only allocate exactly what is necessary and nothing more which I can do with another minor change

There is now just added complexity around passing over a flag, to stop the loop early, which you could argue reduced CPU overhead, but is likely neglectable at the scale where the cloudwatch costs would explode.

IMO the added complexity is rather minimal and easily testable which is why I went through with the change.

Before we are moving into premature optimisations, have you actually measured and proved that this actually is an issue? (see my comments above why I doubt that based on numbers). You mention that you are running this at large scale. Maybe you can run the initial PR, side by side for 1-5% of your workload (depending on the scale) and share some realworld memory impact?

When Cristian brought it up I didn't quite get it at first because we do run it at scale but we do it in such a way that we should only ever get one data point back. We do this intentionally for performance reasons, why ask for data you won't use? I don't know if it was Cristian's intent but my realization was that introducing this feature should not present a noticeable negative impact on the larger community just by upgrading. This exporter is embedded in to places like https://grafana.com/docs/alloy/latest/reference/components/prometheus/prometheus.exporter.cloudwatch/ which I know is used by customers at a scale large enough to incur some rather hefty CloudWatch costs.

Going from a single data point to a slice presents some memory increase that is unlikely to be noticeable for most. We don't know how many people are setting up their configs with a length that is larger than their period (length > period = more than 1 data point) and to what degree they are doing it (2x, 3x, 10x?). This is part of the complexity of building the configs CloudWatch has some incredibly odd behaviors for different metrics so the configs get cargo culted in a way that is not optimal but works.

Could this guard be unnecessary, yes but the complexity of adding it feels acceptable as a means to try to provide a stable experience to existing users.

Since I messed up the DCO + need another change we will have to figure out what to do with this PR. But first it would probably be good to have @cristiangreco take a look to make sure there's nothing further.

@kgeckhart @cristiangreco what's the plan with this PR?
Don't get me wrong, I understand we're all busy and this is OSS after all. So no pressure here please... ;) I would just highly appreciate some short indication of you guys if and when you plan to move ahead with this PR to be able to make plans on our end as well.

From what I see, this PR seems to be waiting on some action from your side and there is nothing really left the community could support with? If that is wrong and there is still something todo, please let me know, I'm happy to contribute as well.

cristiangreco

Hey, thanks for the PR! I've left a few comments, please let me know what you think. I'd like to have another pass after those points are addressed.

kgeckhart · 2025-05-13T14:05:44Z

@woehrl01 just wanted to check-in to see if this is something you're still interested in continuing. It's a really valuable feature to have and your contribution is greatly appreciated!

woehrl01 · 2025-05-16T18:31:25Z

@kgeckhart I'm open doing the minimal changes discussed above. If you expect to change the whole memory stuff, someone else has to continue based in this PR.

If you are fine with those changes, I can allocate some time next week.

ecktom · 2025-07-13T16:01:58Z

First of all, Thanks @woehrl01 for picking this topic up again.
We've been playing around with this branch in our environments and it does a decent job and allows us to reduce AWS costs significantly. I can't really see any negative things around the memory allocation but we're not even close to some scale of millions of metrics though. So what's missing here to get this over the line before it's outdated again? ;)

kgeckhart · 2025-07-18T12:43:13Z

I am not a maintainer ATM but quite frankly I was rather baffled by the response from @woehrl01 regarding the suggested change to only map a single value unless the flag is enabled. I didn't see it as a large or overly complicated requested change. I will also say if this PR was merged without it we would likely immediately follow it with the change which seems rather silly to have to do.

woehrl01 · 2025-07-18T14:56:42Z

Just to clarify, I work on contributions like this in my spare time, so I try to keep the scope focused and the time investment reasonable. I’d appreciate it if that could be respected when commenting on PRs.

If you believe the suggested change is necessary, you're absolutely welcome to take it on and follow up with your own PR. That’s the beauty of open source collaboration.

Also, I want to point out that the wording in your comment came across as unnecessarily confrontational. Using phrases like “baffled by the response” doesn't help foster a respectful or constructive discussion. Please keep the tone collaborative. We're all here to improve the project together.

kgeckhart · 2025-07-18T17:17:07Z

@woehrl01 a change was request to the PR and after some back and forth regarding the change and why it was necessary you ultimately rejected to make it stating,

If you expect to change the whole memory stuff, someone else has to continue based in this PR.

Based on your latest reply it seems as though your concern with making the requested change is that you might not have the time to do it, is that correct? It's your work up to this point and I think it's important you keep your name on it vs someone else copying it to make further changes. If you are able to invest the time necessary to resolve as many of the requested changes as possible that would be helpful to the conversation regarding next steps.

woehrl01 · 2025-07-18T17:27:24Z

I don't want to invest more time into modifying this PR. I also don't care if my name is afterwards in any commit history, as it's ultimately based on two other people's ideas as mentioned in the PR body. I only care that this is a useful addition to the project. I consider here two ways, either fork the whole PR and open up a third one. Or create a PR to this PR which adds the changes you want to make. Then I will simply merge that PR, so that this PR is ready to be merged.

kgeckhart · 2025-07-18T19:50:46Z

@woehrl01 I have created a PR to your branch which should resolve all the code review feedback, woehrl01#1.

Thank you again for your effort on this feature.

Code review feedback

woehrl01 · 2025-07-18T20:16:56Z

@kgeckhart I just merged your PR.

cristiangreco

Thanks @woehrl01 and @kgeckhart for your great work here. I feel at this point it's worth merging this work as-is and move on from there.

cristiangreco · 2025-09-22T15:14:11Z

@kgeckhart I went ahead and merged it even without DCO signature, as you've contributed to this repo in the past anyway and we had talked about this specific change in person.

woehrl01 force-pushed the historical-data branch 4 times, most recently from a084066 to a2cf6b3 Compare February 25, 2025 09:00

woehrl01 marked this pull request as ready for review February 26, 2025 21:02

feat: reimplement history data export

6401f0f

Signed-off-by: Lukas Wöhrl <[email protected]>

woehrl01 force-pushed the historical-data branch from a2cf6b3 to 6401f0f Compare February 26, 2025 21:19

fix: use struct directly instead of referenes to reduce gc preassure

e089bca

Signed-off-by: Lukas Wöhrl <[email protected]>

woehrl01 force-pushed the historical-data branch from f0365f1 to e089bca Compare February 27, 2025 05:15

jotweh reviewed Feb 28, 2025

View reviewed changes

pkg/config/config.go Outdated Show resolved Hide resolved

improve comments and add example

ce858fb

Signed-off-by: Lukas Wöhrl <[email protected]>

woehrl01 force-pushed the historical-data branch from b584713 to ce858fb Compare February 28, 2025 09:56

kgeckhart reviewed Mar 6, 2025

View reviewed changes

woehrl01 added 2 commits March 7, 2025 08:23

rename to exportAllDataPoints and added test

1d3536c

Signed-off-by: Lukas Wöhrl <[email protected]>

correctly handle nil value and timestamp combinations

96e5285

Signed-off-by: Lukas Wöhrl <[email protected]>

woehrl01 force-pushed the historical-data branch from 81f15f0 to 96e5285 Compare March 7, 2025 07:42

cristiangreco requested changes Mar 28, 2025

View reviewed changes

Code review feedback

52f7b4e

Merge pull request #1 from kgeckhart/historical-data

88404b3

Code review feedback

cristiangreco approved these changes Sep 22, 2025

View reviewed changes

cristiangreco merged commit 8d9850d into prometheus-community:master Sep 22, 2025
1 check failed

Uh oh!

feat: reimplement history data export #1642

feat: reimplement history data export #1642

Conversation

woehrl01 commented Feb 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

jotweh commented Feb 28, 2025

Uh oh!

kgeckhart left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

woehrl01 commented Mar 6, 2025

Uh oh!

woehrl01 commented Mar 28, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

woehrl01 Mar 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

woehrl01 Jul 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cristiangreco left a comment

Choose a reason for hiding this comment

Uh oh!

kgeckhart commented May 13, 2025

Uh oh!

woehrl01 commented May 16, 2025

Uh oh!

ecktom commented Jul 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

woehrl01 commented Feb 25, 2025 •

edited

Loading

woehrl01 Mar 29, 2025 •

edited

Loading

woehrl01 Jul 19, 2025 •

edited

Loading

ecktom commented Jul 13, 2025 •

edited

Loading

kgeckhart commented Jul 18, 2025 •

edited

Loading