Add metric to check the current number of VT's #6009

Closed
ceremo wants to merge 4 commits

Conversation


@ceremo commented Mar 11, 2025

Related to #5950

Signed-off-by: César Revert <[email protected]>
@ceremo marked this pull request as ready for review March 11, 2025 16:08
Member

@shakuzen left a comment


Thanks for the pull request.

this.recordingStream.onEvent(END_EVENT, event -> activeCounter.decrement());

Gauge.builder(VT_ACTIVE_METRIC_NAME, activeCounter::doubleValue)
    .description("The number of active virtual threads")
Member

Rather than a gauge, I wonder if it might be more insightful to have two counters, one for started, one for ended. The active could be derived by taking the difference. And this would allow tracking the rate of virtual threads being started/stopped. Thoughts? Maybe in some use cases we would have to worry about overflow (assuming cumulative counters)?
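
A minimal sketch of the two-counter idea, for illustration only: the meter names below are made up (not from this PR), registry and recordingStream are assumed to be in scope as in the snippet quoted above, and the JFR event names are the standard jdk.VirtualThreadStart/jdk.VirtualThreadEnd events.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import jdk.jfr.consumer.RecordingStream;

// Hypothetical method; meter names are illustrative.
void registerStartStopCounters(MeterRegistry registry, RecordingStream recordingStream) {
    Counter started = Counter.builder("jvm.threads.virtual.started")
        .description("Virtual threads started")
        .register(registry);
    Counter ended = Counter.builder("jvm.threads.virtual.ended")
        .description("Virtual threads ended")
        .register(registry);

    // jdk.VirtualThreadStart/End are the standard JDK Flight Recorder events.
    recordingStream.enable("jdk.VirtualThreadStart");
    recordingStream.enable("jdk.VirtualThreadEnd");
    recordingStream.onEvent("jdk.VirtualThreadStart", event -> started.increment());
    recordingStream.onEvent("jdk.VirtualThreadEnd", event -> ended.increment());
    // "active" can then be derived in the backend as started - ended,
    // and the rate of either counter gives start/stop throughput.
}
```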

Author

It would be better to have both counters. However, these types of events (start/end) are numerous, so they would increase rapidly, and overflow could become a common scenario.

@Indresh2410 commented Mar 14, 2025

@ceremo @shakuzen IMO, if we can track the rate of virtual threads being started along with the active thread count, we could autoscale JVM apps based on that rate.

Member

I think for autoscaling, the important signal is whether the active (virtual) thread count is continually increasing. If you only look at the rate of threads starting without the rate of threads finishing, it doesn't tell you much that matters for autoscaling: a high rate of starting threads is no problem if finishing keeps up, so no autoscaling is needed. Given that, the rate of starting or stopping on its own isn't that interesting, and the gauge of active virtual threads is the most useful.


Yes @shakuzen, the above makes sense. But let's say we have the active virtual thread count. Based on the traffic, we'd have to fine-tune our thresholds for active virtual threads accordingly.

Whereas if we calculate the % of virtual threads in use, i.e. (active threads / total threads created) * 100, we can do a one-time setup: if the ratio exceeds a threshold, say 60% (an example), autoscaling should kick in, which generally means active threads have increased (the rate of finishing threads has slowed down).

Member

> Based on the traffic, we'd have to fine-tune our thresholds for active virtual threads accordingly.

I don't think so. You might want to check for an absolute threshold of active virtual threads that is cause for concern, and you may want to check over multiple step intervals that the active virtual thread count is increasing significantly.

> Whereas if we calculate the % of virtual threads in use, i.e. (active threads / total threads created) * 100, we can do a one-time setup: if the ratio exceeds a threshold, say 60%

Say you normally have between 10-100 active threads, but when there is an issue it climbs to 300, 500, 2000, etc. If your application just started and there is some issue, checking the ratio of active threads to total created may catch it. But if the issue happens long after the application started, the ratio no longer catches it, because the total threads created is monotonic; it increases over the lifetime of the application. For this reason, such a ratio is not a good way to catch irregularities.
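
To make the monotonicity point concrete with purely illustrative numbers: if a service has created 10^8 virtual threads over a week of uptime, even a problematic spike to 2,000 active threads yields a ratio of 2000 / 10^8 = 0.002%, far below any fixed percentage threshold chosen at startup.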


Thanks @shakuzen. One final question; apologies for dragging out the conversation.

> But if the issue happens long after the application started, the ratio no longer catches it, because the total threads created is monotonic; it increases over the lifetime of the application. For this reason, such a ratio is not a good way to catch irregularities.

Since virtual threads created will be a counter, won't the expression below work for us?

sum(active_thread_count)/sum(rate(virtual_threads_created[1m]))

Because taking the rate of a counter gives the number of threads created over the interval (1m), right? Even if the application runs for days or months? (Assuming active_thread_count is a gauge.)

@Indresh2410

@shakuzen / @ceremo any confirmation on whether we are going with two metrics, for creation and deletion? Having them would be really helpful.

@jonatan-ivanov
Member

I was wondering whether we should make RecordingConfig public, and if we do, whether these additional config + enable/disable parameters will get out of hand over time. I came up with a few ideas:

  1. Abstract JfrMeterBinder
    I think this is something we could do regardless of this PR to support JFR in general. We could move the common logic of RecordingStream handling into a generic class, and implementations can enable/disable what they want and register their meters and event handlers; see this draft: main...jonatan-ivanov:micrometer:jfr-binder. This also means that we would make RecordingStream part of our public API surface.

  2. Generic JfrMeterBinder
    Similarly to the abstract class, we can let users inject their own "registrar" (BiConsumer<MeterRegistry, RecordingStream>) and "configurer" (Consumer<RecordingStream>). See the previous draft: main...jonatan-ivanov:micrometer:jfr-binder, but instead of the method definitions we would have injectable functional interfaces (a rough sketch of this idea follows below).

  3. In addition to either #1 or #2, or neither of them, we could have an enum for these events, and users could define in the constructor which ones should be enabled (or disabled) (something similar to IgnoredMeters in DefaultMeterObservationHandler).

What do you think?
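
A rough, hypothetical sketch of idea 2 (not taken from the linked draft branch), using the "configurer"/"registrar" names from the comment above:

```java
import java.util.function.BiConsumer;
import java.util.function.Consumer;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.MeterBinder;
import jdk.jfr.consumer.RecordingStream;

// Hypothetical class; the constructor parameters mirror the functional interfaces mentioned above.
public class JfrMeterBinder implements MeterBinder, AutoCloseable {

    private final Consumer<RecordingStream> configurer;                 // e.g. rs -> rs.enable("jdk.VirtualThreadStart")
    private final BiConsumer<MeterRegistry, RecordingStream> registrar; // registers meters and onEvent handlers

    private final RecordingStream recordingStream = new RecordingStream();

    public JfrMeterBinder(Consumer<RecordingStream> configurer,
            BiConsumer<MeterRegistry, RecordingStream> registrar) {
        this.configurer = configurer;
        this.registrar = registrar;
    }

    @Override
    public void bindTo(MeterRegistry registry) {
        configurer.accept(recordingStream);          // enable/disable the desired JFR events
        registrar.accept(registry, recordingStream); // register meters and event handlers
        recordingStream.startAsync();                // stream events on a background thread
    }

    @Override
    public void close() {
        recordingStream.close();
    }
}
```

Whether the binder should own and start the stream, as sketched here, or accept a shared one is the trade-off discussed further down in the thread.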

@ceremo
Author

ceremo commented Mar 24, 2025

> I was wondering whether we should make RecordingConfig public, and if we do, whether these additional config + enable/disable parameters will get out of hand over time. I came up with a few ideas:
> ...
> What do you think?

The way I see it, if we want to create an abstract solution that's not only useful for listening to virtual thread JFR events, we should allow reusing the RecordingStream across the different implementations of the JfrMeterBinder; having a separate RecordingStream for each type does not sound optimal. Related to this, the abstract class should not call start on the RecordingStream: if the stream is shared and start executes more than once, it will fail:

java.lang.IllegalStateException: Recording can only be started once.

@shakuzen
Member

> having a separate RecordingStream for each type does not sound optimal.

Is it costly to have multiple RecordingStream instances? I wondered about the same thing in the original PR adding the virtual thread metrics, but I didn't look into it to get an answer. While being more efficient is a goal, the benefits of the isolation from not sharing it (sharing may allow one binder to cause issues for another, depending on how things are exposed) feel more important if the only cost is allocating several bytes for one more RecordingStream object per binder. Presumably different binders will be listening to different JFR events anyway.

As for the "enabled" booleans, I don't think we need these for the existing metrics; those should have near-negligible overhead, and users don't need to bind this MeterBinder if they don't want it. The reason it is important to be able to decide separately about enabling the start/stop virtual thread metrics is that the overhead can be more significant with high throughput using virtual threads.

We have MeterFilter, which lets users disable metrics they don't want, but there may be instrumentation overhead involved even if the metric is disabled. We could avoid calling RecordingStream#onEvent if the meter is disabled (a no-op). There may be overhead in enabling the event as well. If we could lazily enable the event only when the meter that needs it is not a no-op, then I think we can eliminate the overhead via a MeterFilter without needing an explicit "enabled" boolean.

That said, it would be better for instrumentation with potentially significant overhead to be opt-in separately from the rest of the instrumentation, so maybe an "enabled" boolean is still best for this, unless we want to make a separate binder for instrumenting virtual thread start/stop.
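
A hypothetical sketch of the "lazily enable the event only when the meter is not a no-op" idea; the meter name is made up, and registry and recordingStream are assumed to be in scope:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.noop.NoopCounter;
import jdk.jfr.consumer.RecordingStream;

// Pay the JFR event cost only if the meter survived the registry's MeterFilters.
void bindStartCounter(MeterRegistry registry, RecordingStream recordingStream) {
    Counter started = Counter.builder("jvm.threads.virtual.started").register(registry);

    // A meter denied by a MeterFilter comes back as a no-op implementation;
    // in that case, skip enabling the event and registering the handler.
    if (!(started instanceof NoopCounter)) {
        recordingStream.enable("jdk.VirtualThreadStart");
        recordingStream.onEvent("jdk.VirtualThreadStart", event -> started.increment());
    }
}
```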

@ceremo
Author

ceremo commented Mar 26, 2025

> Is it costly to have multiple RecordingStream instances? I wondered about the same thing in the original PR adding the virtual thread metrics, but I didn't look into it to get an answer. While being more efficient is a goal, the benefits of the isolation from not sharing it (sharing may allow one binder to cause issues for another, depending on how things are exposed) feel more important if the only cost is allocating several bytes for one more RecordingStream object per binder. Presumably different binders will be listening to different JFR events anyway.

Each RecordingStream requires its own memory allocation and appears to need at least 3 threads ("JFR Periodic Tasks", "JFR Event Stream" and "JFR Recorder Thread") to process the events. What I'm suggesting is to support both scenarios:

  • the default one: a RecordingStream created internally for each binder, not reusable
  • the custom one: allowing users to set a RecordingStream, enabling reuse across different binders
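
A hypothetical sketch of the two construction modes described above (these constructors are illustrative, not the existing VirtualThreadMetrics API):

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.MeterBinder;
import jdk.jfr.consumer.RecordingStream;

public class VirtualThreadMetrics implements MeterBinder, AutoCloseable {

    private final RecordingStream recordingStream;
    private final boolean ownsStream;

    public VirtualThreadMetrics() {                        // default: internal stream, not reusable
        this.recordingStream = new RecordingStream();
        this.ownsStream = true;
    }

    public VirtualThreadMetrics(RecordingStream shared) {  // custom: stream reused across binders
        this.recordingStream = shared;
        this.ownsStream = false;                           // the caller starts and closes it exactly once
    }

    @Override
    public void bindTo(MeterRegistry registry) {
        // enable events, register meters and onEvent handlers here ...
        if (ownsStream) {
            recordingStream.startAsync(); // avoids "Recording can only be started once" on a shared stream
        }
    }

    @Override
    public void close() {
        if (ownsStream) {
            recordingStream.close();
        }
    }
}
```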

@shakuzen
Member

There are some valid points of feedback being discussed, but it feels like we're getting farther from delivering on the original feature request rather than closer. That may be for the better in the long term, but I want to lay out the options to proceed as I see them:

  1. Deliver the original feature (an additional opt-in meter for the current number of virtual threads) without delivering the other feature (a generic JFR event-based meter binder abstraction). However, if we intend to introduce the generic JFR MeterBinder later and apply it to VirtualThreadMetrics, we need to keep that in mind when considering API changes now.
  2. Put this new meter feature on hold while we figure out the generic JFR MeterBinder.
  3. Try to tackle both features at once.

There is also the release schedule to consider. A feature like a generic JFR MeterBinder would probably be better delivered in a milestone so there is more opportunity for feedback; right now our next feature release is a Release Candidate. So it is probably best to delay a generic JFR MeterBinder until Micrometer 1.16.

I think my preference is option 2. If we do that, we should focus in a separate PR on reviewing what the API for a generic JFR MeterBinder looks like before deciding how this PR ends up, unless we can deliver this PR in a way that would not cause trouble for adapting to the generic JFR MeterBinder later in a backward-compatible way.

@ceremo what are your thoughts? Are you okay putting this on hold or would you like to try to get this PR merged in time for 1.15 (without tackling a generic JFR MeterBinder feature)?

@ceremo
Author

ceremo commented Mar 27, 2025

> @ceremo what are your thoughts? Are you okay putting this on hold or would you like to try to get this PR merged in time for 1.15 (without tackling a generic JFR MeterBinder feature)?

No problem keeping this on hold. Defining a new API for a generic JFR MeterBinder makes sense, so I would also choose the second alternative.

@shakuzen added the blocked label (An issue that's blocked on an external project change) Mar 27, 2025
@shakuzen
Member

I've marked this as blocked and opened #6065, targeted at 1.16.0-M1.

@shakuzen
Member

Superseded by #6104, which is based on a new MBean in Java 24 rather than JFR events.

@shakuzen closed this Apr 14, 2025
@jonatan-ivanov added the superseded label (An issue that has been superseded by another) and removed the blocked label Apr 14, 2025