[DRAFT] Describe Trace Snapshot Profiling #343
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -51,3 +51,69 @@ | |
|
|
||
| See [integration_context.md](integration_context.md) for specifics about | ||
| exchanging additional context between AppD and splunk-otel based agents. | ||
|
|
||
| ## Trace Snapshot Profiling | ||
|
|
||
| **Status**: [Experimental](../README.md#versioning-and-status-of-the-specification) | ||
|
|
||
| This section describes the behavior for Splunk instrumentation libraries | ||
|
Check failure on line 59 in specification/behaviors.md
|
||
| that contain trace snapshot profiling features. | ||
|
|
||
| ### Instrumentation Source | ||
|
|
||
| Agents MUST set the `profiling.instrumentation.source` value to `snapshot`. | ||
|
|
||
| ### Starting Trace Profiler | ||
|
|
||
| The OpenTelemetry Baggage entry for `splunk.trace.snapshot.volume` MUST be used to | ||
|
Check failure on line 68 in specification/behaviors.md
|
||
| decide whether to profile a trace. A value of `highest` is the signal to begin | ||
|
Check failure on line 69 in specification/behaviors.md
|
||
| profiling whereas a value of `off` is an explicit signal to not profile. | ||
|
|
||
| When profiling a trace, the profiler SHOULD be started when an entry span is | ||
|
Check failure on line 72 in specification/behaviors.md
|
||
| detected. An entry span is defined as either the root span of the trace or | ||
|
Check failure on line 73 in specification/behaviors.md
|
||
| any other span within a trace whose parent span is remote. | ||
|
|
||
| When a trace is profiled, agents MUST add the span attribute `splunk.snapshot.profiling` | ||
|
Check failure on line 76 in specification/behaviors.md
|
||
| with a value of `true` to the entry span. | ||
|
|
||
| Agents SHOULD take an initial stack trace sample when starting to profile a trace. | ||
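The start conditions above can be sketched as two small predicates. This is only an illustration, not the agents' actual API: the plain `dict` standing in for OpenTelemetry Baggage and the function names `should_profile` and `is_entry_span` are hypothetical. It also assumes the signal value is spelled `highest`, and that any value other than `highest` means "do not profile", which the text does not state.

```python
from typing import Mapping, Optional

# Baggage key named in the spec text.
BAGGAGE_KEY = "splunk.trace.snapshot.volume"


def should_profile(baggage: Mapping[str, str]) -> bool:
    """Return True only when the baggage entry signals profiling.

    Assumption: missing or unknown values are treated like `off`.
    """
    return baggage.get(BAGGAGE_KEY) == "highest"


def is_entry_span(parent_span_id: Optional[str], parent_is_remote: bool) -> bool:
    """An entry span is the trace's root span or a span with a remote parent."""
    return parent_span_id is None or parent_is_remote
```

An agent would start the profiler only when both predicates hold for a newly started span.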
|
|
||
| ### Trace Profiling | ||
|
Check failure on line 81 in specification/behaviors.md
|
||
| An instrumentation library that has trace snapshot profiling capabilities MUST | ||
| be able to sample call stacks for specific trace ids at a fixed interval. | ||
|
|
||
| When a language runtime supports threading, stacks MUST be sampled only for | ||
|
Check failure on line 85 in specification/behaviors.md
|
||
| trace ids selected for snapshotting. The samples for profiled threads SHOULD be | ||
| taken instantaneously and MAY be taken at separate times. | ||
|
Contributor
I don't follow the meaning of this sentence. Also taking a stack trace is not an instantaneous operation by any means. In java for example stack trace is taken at a safepoint, which means that all threads are suspended (apparently recent vms are able to suspend only a single thread, but I don't know whether it is used when taking a stack trace). |
||
|
|
||
| Agents MUST sample threads associated with the entry span for the duration of | ||
| the span's life. | ||
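As a rough sketch of sampling only the threads selected for snapshotting, one sampling pass in CPython might look like the following. The function name `sample_stacks` and the set of profiled thread ids are hypothetical; `sys._current_frames()` is a CPython-specific call, and other runtimes will need a different mechanism. A real agent would invoke a pass like this on a timer at the configured interval.

```python
import sys
import traceback
from typing import Dict, Iterable, List


def sample_stacks(profiled_thread_ids: Iterable[int]) -> Dict[int, List[str]]:
    """Capture one stack sample for each profiled thread that is still alive."""
    # Maps thread id -> topmost frame for every running thread (CPython only).
    frames = sys._current_frames()
    samples: Dict[int, List[str]] = {}
    for thread_id in profiled_thread_ids:
        frame = frames.get(thread_id)
        if frame is None:
            continue  # thread already finished; nothing to sample
        samples[thread_id] = [
            f"{entry.name} ({entry.filename}:{entry.lineno})"
            for entry in traceback.extract_stack(frame)
        ]
    return samples
```

Threads that are not in the profiled set are never inspected, which matches the requirement that stacks are sampled only for selected trace ids.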
|
|
||
| ### Call Stack Span Association | ||
|
|
||
| Agents SHOULD keep track of the current span for each profiled thread. Agents | ||
| are RECOMMENDED to use the OpenTelemetry Context for determining when the current | ||
| span changes. | ||
|
|
||
| When available, agents MUST use the span id from the profiled thread's current span | ||
| as the span id. | ||
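One possible shape for the per-thread bookkeeping described above is a small thread-safe map from thread id to current span id. The class `SpanTracker` and its methods are hypothetical; in a real agent the updates would be driven by OpenTelemetry Context changes rather than by direct calls.

```python
import threading
from typing import Dict, Optional


class SpanTracker:
    """Tracks the current span id for each profiled thread."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._current: Dict[int, str] = {}

    def set_current(self, thread_id: int, span_id: Optional[str]) -> None:
        """Record a span change; None means the thread has no active span."""
        with self._lock:
            if span_id is None:
                self._current.pop(thread_id, None)
            else:
                self._current[thread_id] = span_id

    def span_id_for(self, thread_id: int) -> Optional[str]:
        """Return the span id to attach to a stack sample, if one is known."""
        with self._lock:
            return self._current.get(thread_id)
```

When a stack sample is taken, the sampler would look up `span_id_for(thread_id)` and attach the result when it is available.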
|
|
||
| ### Stopping Trace Profiler | ||
| Trace profiling MUST be stopped when the entry span of a service ends. | ||
|
|
||
| Agents SHOULD take a final stack trace sample when stopping profiling | ||
| for a trace. | ||
|
|
||
| ### Exporting Stack Traces | ||
| It is RECOMMENDED to export stack traces in batches to take advantage of the pprof | ||
| data format. | ||
|
|
||
| Agents SHOULD attempt to export any remaining stack traces during the Agent shutdown phase. | ||
|
Contributor
Not sure whether this requirement makes sense, idk whether it can be easily implemented for all languages. |
||
|
|
||
| ### Call Stack Ingest | ||
|
|
||
| Call stacks MUST be ingested as [OpenTelemetry | ||
| Logs](https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/logs). | ||
| The logs containing profiling data MUST be sent via OTLP. Instrumentation | ||
| libraries SHOULD reuse persistent OTLP connections from other signals (traces, | ||
| metrics). | ||
|
Comment on lines +117 to +119
Contributor
Although it is copied from https://github.com/signalfx/gdi-specification/blob/main/specification/behaviors.md#call-stack-ingest wanted to point out that I suspect that this is not true for the java implementation. |
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -173,6 +173,7 @@ instance using the following environment variables: | |
| | `SPLUNK_PROFILER_MEMORY_ENABLED` | false | Whether memory profiling is enabled. [2] [6] | | ||
| | `SPLUNK_REALM` | `none` | Which realm to send exported data. [3] | | ||
| | `SPLUNK_TRACE_RESPONSE_HEADER_ENABLED` | true | Whether `Server-Timing` header is added to HTTP responses. [4] | | ||
| | `SPLUNK_SNAPSHOT_PROFILER_ENABLED` | false | Whether Trace Snapshot CPU profiling is enabled. [2] [5] | | ||
|
Contributor
This will be covered/superseded by #353 . |
||
|
|
||
| - [1]: Not user required if another system performs the authentication. For | ||
| example, instrumentation libraries SHOULD send data to a locally running | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -272,3 +272,20 @@ For each `cpu` sample: | |
| in milliseconds if this sample represents a periodic event | ||
| - label `thread.state` of type `string` OPTIONALLY can be set to describe | ||
| the state of the thread | ||
|
|
||
| ## Trace Snapshot Profiling | ||
|
|
||
| **Status**: [Experimental](../README.md#versioning-and-status-of-the-specification) | ||
|
|
||
| Unless stated otherwise Agents MUST follow the `Profiling `ResourceLogs` Message` | ||
|
Contributor
is this a typo? |
||
| semantic conventions. | ||
|
|
||
| Trace Snapshot Profiling Configuration Options | ||
|
|
||
| | Name | Type | Description | Valid Values | Default | | ||
| |----------------------------------------------|--------|-----------------------------------------------------|------------------------------| ------- | | ||
| | `splunk.snapshot.profiler.enabled` | string | Enable or Disable trace snapshot profiling | `true` or `false` | `false` | | ||
| | `splunk.snapshot.profiler.sampling.interval` | string | Interval in which to take trace stack trace samples | Any valid duration `string` | `10ms` | | ||
|
Comment on lines +287 to +288
Contributor
Superseded in #353. |
||
|
|
||
| The span attribute `splunk.snapshot.profiling` with a value of `true` indicates that | ||
| a trace within a service has been profiled. | ||
|
Comment on lines +290 to +291
Contributor
maybe point out that the attribute should be set on the local root span? |
||
As discussed before the initial sample will not contain user code. The stack trace is likely to be identical for all requests. What use does it have?
In the case with Node.js, these stack traces will never be correlated to the given trace ID, as we're only collecting stacktraces that are sampled during an active span.