Exponential histogram observations mismatch #6006

Open
sfc-gh-dguy opened this issue Mar 7, 2025 · 8 comments
Labels
registry: otlp OpenTelemetry Protocol (OTLP) registry-related waiting for feedback We need additional information before we can continue

Comments

@sfc-gh-dguy

Describe the bug
We upgraded to Micrometer 1.14.0 and started using exponential histograms around 2 months ago.
Since day 1 of using exponential histograms we noticed this error on our metrics backend (Grafana GEM):

ts=2025-03-05T05:07:37.899310792Z caller=grpc_logging.go:76 level=warn method=/cortex.Ingester/Push duration=3.033186ms msg=gRPC err="user=default-tenant: err: 8 observations found in buckets, but the Count field is 10: histogram's observation count should equal the number of observations found in the buckets (in absence of NaN). timestamp=2025-03-05T05:07:36.402Z, series=[{__name__ proxy_convergence_duration}] (err-mimir-native-histogram-count-mismatch) (sampled 1/10)

It's saying the total count of observations in the buckets doesn't match the Count field. This seems like a bug.

  • Micrometer version: 1.14.0
  • Micrometer registry: OTLP
  • OS: linux
  • Java version: 11
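The check the ingester is applying can be sketched as follows. This is a plain-Java illustration of the Prometheus native-histogram consistency rule behind `err-mimir-native-histogram-count-mismatch`, not Micrometer or Mimir code; the method and parameter names are assumptions:

```java
import java.util.Arrays;

// Sketch of the native-histogram consistency check: in the absence of
// NaN observations, the Count field must equal the zero-bucket count
// plus the sum of all positive and negative bucket counts.
public class HistogramCountCheck {

    static boolean isConsistent(long count, long zeroCount,
                                long[] positiveBuckets, long[] negativeBuckets) {
        long observed = zeroCount
                + Arrays.stream(positiveBuckets).sum()
                + Arrays.stream(negativeBuckets).sum();
        return observed == count;
    }

    public static void main(String[] args) {
        // Matches the reported error shape: 8 observations found in
        // buckets, but Count is 10. If two zero-bucket observations were
        // dropped somewhere along the pipeline, the totals would disagree
        // by exactly the zero count.
        System.out.println(isConsistent(10, 0, new long[]{5, 3}, new long[]{})); // false
        System.out.println(isConsistent(10, 2, new long[]{5, 3}, new long[]{})); // true
    }
}
```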
@shakuzen
Member

I don't know if it matters, but to get more information, is this with delta or cumulative aggregation temporality?

@shakuzen
Member

Could you also share more info about how exponential histogram metrics get from your application to the backend? Is the OtlpMeterRegistry publishing to a Grafana endpoint for OTLP metrics?

@sfc-gh-dguy
Author

I don't know if it matters, but to get more information, is this with delta or cumulative aggregation temporality?

cumulative

@sfc-gh-dguy
Author

Could you also share more info about how exponential histogram metrics get from your application to the backend? Is the OtlpMeterRegistry publishing to a Grafana endpoint for OTLP metrics?

OtlpMeterRegistry is publishing to a local OTel Collector on the machine, which forwards it over OTLP to another OTel Collector on a different machine, which sends it to the backend (GEM) via Prometheus remote write.
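For reference, the second hop in this setup would look roughly like the following OTel Collector configuration. This is a sketch of the described pipeline, not the actual config; the endpoint value is a placeholder:

```yaml
# Second collector: receives OTLP from the first collector and
# forwards metrics to GEM via Prometheus remote write.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  prometheusremotewrite:
    endpoint: https://gem.example.com/api/prom/push  # placeholder

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```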

@shakuzen
Member

Thanks for the added details, @sfc-gh-dguy.
If that is the setup, can you verify the issue at the first OTel Collector that is receiving OTLP metrics from Micrometer? Is it logging a similar error? I'm not sure if it will make a difference, but could you also please try with the latest patch version? That's 1.14.5 at the time of this comment.

If you could get the metric data Micrometer is sending to the collector when the issue happens, we can determine whether the cause is in Micrometer or the collector. I tried to look for possible causes in the code, but I'm not sure what could cause this. Maybe the conversion the collector is doing from OTLP to Prometheus remote write is not taking the zero count into consideration.
/cc @lenin-jaganathan

@shakuzen shakuzen added waiting for feedback We need additional information before we can continue registry: otlp OpenTelemetry Protocol (OTLP) registry-related and removed waiting-for-triage labels Mar 27, 2025

github-actions bot commented Apr 4, 2025

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 7 days this issue will be closed.

@sfc-gh-dguy
Author

The error comes from a Grafana GEM component (the ingester). I don't see this error in the OTel Collector that is receiving OTLP metrics from Micrometer.
I'll try upgrading to 1.14.5 and see if that resolves it.

@shakuzen
Member

shakuzen commented Apr 8, 2025

If you can reproduce the issue with a low volume (ideally a single exponential histogram per publish), and with logging configured on the OTel Collector, you can check whether the data for the exponential histogram is valid at each step (Micrometer → OTel Collector 1, Collector 1 → Collector 2, Collector 2 → GEM) to identify where the problem is.
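One way to capture the data at a given hop is the collector's debug exporter. This is a hedged sketch assuming a reasonably recent collector version (the `debug` exporter superseded the deprecated `logging` exporter); wire it into the metrics pipeline alongside the existing exporter:

```yaml
# Log every received data point in full so the exponential histogram
# coming from Micrometer (or the previous collector) can be inspected.
exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [debug]
```

Comparing the logged Count, ZeroCount, and bucket counts at each hop should show where the mismatch is introduced.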
