Add metric to check the current number of VT's #6009
Conversation
Signed-off-by: César Revert <[email protected]>
Thanks for the pull request.
```java
this.recordingStream.onEvent(END_EVENT, event -> activeCounter.decrement());

Gauge.builder(VT_ACTIVE_METRIC_NAME, activeCounter::doubleValue)
    .description("The number of active virtual threads")
```
Rather than a gauge, I wonder if it might be more insightful to have two counters, one for started, one for ended. The active could be derived by taking the difference. And this would allow tracking the rate of virtual threads being started/stopped. Thoughts? Maybe in some use cases we would have to worry about overflow (assuming cumulative counters)?
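A minimal sketch of the two-counter idea (the names and wiring here are illustrative, not the actual Micrometer API): two cumulative counters driven by the JFR start/end events, with the active count derived as their difference. Plain `AtomicLong`s stand in for Micrometer `Counter`s to keep the sketch self-contained.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: in Micrometer these would be two Counters and a Gauge.
class VirtualThreadCounts {

    private final AtomicLong started = new AtomicLong();
    private final AtomicLong ended = new AtomicLong();

    // Would be wired to recordingStream.onEvent("jdk.VirtualThreadStart", ...)
    void onStart() {
        started.incrementAndGet();
    }

    // Would be wired to recordingStream.onEvent("jdk.VirtualThreadEnd", ...)
    void onEnd() {
        ended.incrementAndGet();
    }

    // Active count derived from the two cumulative counters. Exposing both
    // counters also lets the backend compute start/stop rates.
    long active() {
        return started.get() - ended.get();
    }
}
```

One detail worth noting: with two's-complement longs, the difference `started - ended` stays correct even across wraparound, as long as the true active count itself fits in a long.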
It would be better to have both counters. However, these types of events (start/end) are numerous, so they would increase rapidly, and overflow could become a common scenario.
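To make the overflow concern concrete: a cumulative counter kept as a Java long wraps only after 2^63 increments, but when it does wrap, the sign flips, so any consumer treating the value as monotonic sees a discontinuity. A tiny illustration:

```java
public class CounterOverflowDemo {
    public static void main(String[] args) {
        long counter = Long.MAX_VALUE; // 9_223_372_036_854_775_807
        counter++;                     // wraps around to Long.MIN_VALUE
        System.out.println(counter < 0);
    }
}
```

In practice, 2^63 increments is a very high bar even at millions of virtual thread starts per second; the earlier practical limit is that backends storing counters as doubles lose integer precision past 2^53.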
I think for autoscaling, the important metric is whether the active (virtual) thread count is continually increasing. Looking at the rate of threads starting without the rate of threads finishing doesn't tell you much that matters for autoscaling: if the start rate is high but finishing is keeping up, there is no problem and no autoscaling is needed. Given that, maybe the rate of starting or stopping on its own isn't that interesting, and the gauge of active virtual threads is the most useful.
Yes @shakuzen, the above makes sense. But let's say we have the active virtual thread count: based on traffic, we would have to fine-tune our thresholds for active virtual threads accordingly.
Whereas if we calculate the percentage of virtual threads in use, like (active threads / total threads created) * 100, we can do a one-time setup: if the value exceeds a threshold, say 60%, autoscaling should kick in, which generally means active threads have increased (the rate of finishing threads has slowed down).
> Based on the traffic, we'll have to finetune our thresholds accordingly for active virtual threads
I don't think so. You might want to check for an absolute threshold of active virtual threads that is cause for concern, and you may want to check over multiple step intervals that the active virtual thread count is increasing significantly.
> Whereas if we calculate % of virtual threads being used like (active threads/total threads created)*100, we can do 1 time setup where if threshold is greater than 60%
Say normally you have between 10-100 active threads, but when there is an issue it climbs to 300, 500, 2000, etc active threads. If your application just started and there is some issue, checking the ratio of active threads to total created may catch the issue. But if the issue happens long after the application started, the ratio no longer catches it because the total threads created is monotonic; it increases over the lifetime of the application. For this reason, using such a ratio is not a good way to catch irregularities.
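A quick worked example with made-up numbers shows how the monotonic denominator hides a late-occurring problem:

```java
public class RatioDemo {
    public static void main(String[] args) {
        int problematicActive = 700; // an abnormally high active count (made-up)

        // Shortly after startup: only 1_000 threads ever created,
        // so the ratio easily trips a 60% alarm.
        double early = 100.0 * problematicActive / 1_000;
        System.out.printf("early: %.2f%%%n", early); // 70.00%

        // Weeks later: 50_000_000 threads ever created. Same problem,
        // same active count, but the ratio is now near zero.
        double late = 100.0 * problematicActive / 50_000_000;
        System.out.printf("late:  %.5f%%%n", late); // 0.00140%
    }
}
```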
Thanks @shakuzen. One final doubt, and apologies for dragging the conversation out.
> But if the issue happens long after the application started, the ratio no longer catches it because the total threads created is monotonic; it increases over the lifetime of the application. For this reason, using such a ratio is not a good way to catch irregularities.
Since virtual threads created will be a counter, won't the expression below help us?

```promql
sum(active_thread_count) / sum(rate(virtual_threads_created[1m]))
```

Taking the rate of a counter gives the number of threads created over the interval (1m here), right? So it should work even if the application runs for days or months. (Assuming active_thread_count is a gauge.)
Signed-off-by: César Revert <[email protected]>
I was wondering if we should make
What do you think?
The way I see it, if we want to create an abstract solution that's not only useful for listening to Virtual Threads JFR events, we should allow reusing the
Is it costly to have multiple RecordingStream instances? I wondered about the same thing in the original PR adding the virtual thread metrics, but I didn't look into it to get an answer. While being more efficient is a goal, the isolation of not sharing a stream (sharing may let one binder cause issues for another, depending on how things are exposed) feels more important if the only saving is allocating a few bytes for one more RecordingStream object per binder. Presumably different binders will be listening to different JFR events anyway.

As for the "enabled" booleans, I don't think we need these for the existing metrics - those should have near-negligible overhead, and users don't need to bind this MeterBinder if they don't want them. The reason it is important to be able to decide separately about enabling the start/stop virtual thread metrics is that the overhead can be more significant with high throughput on virtual threads. We have MeterFilter, which lets users disable metrics they don't want, but there may be instrumentation overhead involved even if the metric is disabled. We could avoid calling
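The opt-in idea above could look roughly like the following sketch. This is not the actual Micrometer implementation: the class name and the `detailedMetrics` flag are hypothetical, though `jdk.jfr.consumer.RecordingStream` and the `jdk.VirtualThread*` event names are real JDK 21 APIs. The point is that when the flag is off, the high-frequency events are never enabled on the stream, so the JVM never emits them and no instrumentation overhead is paid.

```java
import jdk.jfr.consumer.RecordingStream;

// Hypothetical sketch of an opt-in binder; not the actual Micrometer API.
public class VirtualThreadMetricsSketch implements AutoCloseable {

    private final RecordingStream recordingStream = new RecordingStream();

    VirtualThreadMetricsSketch(boolean detailedMetrics) {
        // Always-on, low-frequency event (pinning).
        recordingStream.enable("jdk.VirtualThreadPinned");
        recordingStream.onEvent("jdk.VirtualThreadPinned",
                event -> { /* record pinned duration */ });

        if (detailedMetrics) {
            // High-frequency start/end events are only enabled when opted in,
            // so no overhead is paid otherwise.
            recordingStream.enable("jdk.VirtualThreadStart");
            recordingStream.enable("jdk.VirtualThreadEnd");
            recordingStream.onEvent("jdk.VirtualThreadStart",
                    event -> { /* increment started counter */ });
            recordingStream.onEvent("jdk.VirtualThreadEnd",
                    event -> { /* increment ended counter */ });
        }
        recordingStream.startAsync();
    }

    @Override
    public void close() {
        recordingStream.close();
    }
}
```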
Each RecordingStream requires its own memory allocation and appears to need at least 3 threads ("JFR Periodic Tasks",
There are some valid points of feedback being discussed, but it feels like we're getting farther from delivering the original feature request rather than closer. That may be for the better in the long term, but I want to lay out the options to proceed as I see them:

1. Deliver the original feature (an additional opt-in meter for the current number of virtual threads) without delivering the other feature (a generic JFR event-based MeterBinder abstraction). However, if we intend to introduce the generic JFR MeterBinder later and apply it to VirtualThreadMetrics, we need to keep that in mind when considering API changes now.
2. Put this new meter feature on hold while we figure out the generic JFR MeterBinder.
3. Try to tackle both features at once.

There is also the release schedule to consider. A feature like a generic JFR MeterBinder would probably be better delivered in a milestone so there is more opportunity for feedback - right now our next feature release is a Release Candidate. So delivering a generic JFR MeterBinder is probably best delayed until Micrometer 1.16.

I think my preference is for 2, and if we do that we should focus, in a separate PR, on reviewing what the API for a generic JFR MeterBinder looks like before deciding how this PR ends up - unless we can deliver this PR in a way that would not cause trouble for adapting to the generic JFR MeterBinder later in a backward-compatible way. @ceremo what are your thoughts? Are you okay putting this on hold, or would you like to try to get this PR merged in time for 1.15 (without tackling a generic JFR MeterBinder feature)?
No problem keeping this on hold. Defining a new API for a generic JFR MeterBinder makes sense, so I would also choose the second alternative.
I've marked this as blocked and opened #6065, targeted at 1.16.0-M1.
Superseded by #6104, which is based on a new MBean in Java 24 rather than JFR events.
Related to #5950