From 496a88c8c6060920a38ea1948b3461fc4845df6e Mon Sep 17 00:00:00 2001 From: jvegarodriguez Date: Wed, 15 Oct 2025 15:02:48 +0200 Subject: [PATCH 1/7] feat: Create 8tcol bring your own cache config --- .../bring-your-own-cache.mdx | 213 ++++++++++++++++++ .../infinite-tracing-introduction.mdx | 12 + 2 files changed, 225 insertions(+) create mode 100644 src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx create mode 100644 src/content/docs/distributed-tracing/infinite-tracing-on-premise/infinite-tracing-introduction.mdx diff --git a/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx b/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx new file mode 100644 index 00000000000..1975cfa4a86 --- /dev/null +++ b/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx @@ -0,0 +1,213 @@ +--- +title: Bring your own cache +tags: + - Distributed tracing + - Infinite Tracing + - On-premise + - Redis + - Cache configuration +metaDescription: 'Configure Redis-compatible caches for Infinite Tracing on-premise tail sampling processor to enable high-availability and distributed processing' +redirects: [] +freshnessValidatedDate: never +--- + + +New Relic's Infinite Tracing Processor is an implementation of the OpenTelemetry Collector [tailsamplingprocessor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor). In addition to upstream features, it supports highly scalable distributed processing by using a distributed cache for shared state storage. This documentation describes the supported cache implementations and their configuration. + +# Supported caches + +The processor supports any Redis-compatible cache implementation. It has been tested and validated with Redis and Valkey in both single-instance and cluster configurations. + +For production deployments, we recommend using cluster mode (sharded) to ensure high availability and scalability. To enable the cache, add the following configuration to your `tail_sampling` processor section: + +```yaml + tail_sampling: + redis: + enabled: true + addr: redis://localhost:6379/0 + prefix: "itc" + max_traces_per_batch: 50 # this impacts instance memory, because batches are loaded in memory +``` + +When `enabled` is set to `false`, the collector will use in-memory processing instead. The `addr` parameter must specify a valid Redis-compatible server address using the standard format: + +```shell +redis[s]://[[username][:password]@][host][:port][/db-number] +``` + +Alternatively, you can embed the password directly in the `addr` parameter and omit the separate `password` field: + +```yaml + tail_sampling: + redis: + enabled: true + addr: redis://:local@localhost:6379/0 + password: '' + prefix: "itc" +``` + +The processor is implemented in Go and uses the [go-redis](https://github.com/redis/go-redis/tree/v9) client library. + +# Redis-compatible cache requirements + +The processor uses the cache as distributed storage for the following trace data: + +- Trace and span attributes +- Active trace data +- Sampling decision cache + +The processor executes **Lua scripts** to interact with the Redis cache atomically. Lua script support is typically enabled by default in Redis-compatible caches. No additional configuration is required unless you have explicitly disabled this feature. 
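You can verify Lua script support up front by issuing a trivial `EVAL` with the same client library the processor uses. The following is a minimal sketch using go-redis, not part of the processor itself; the address is a placeholder for your own instance:

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Parse the same URL format the processor accepts in `addr`.
	opts, err := redis.ParseURL("redis://localhost:6379/0")
	if err != nil {
		panic(err)
	}
	rdb := redis.NewClient(opts)
	defer rdb.Close()

	// Run a trivial Lua script; an error here usually means EVAL is
	// disabled or restricted on the target cache.
	res, err := rdb.Eval(ctx, "return 1", []string{}).Result()
	if err != nil {
		fmt.Println("Lua scripting unavailable:", err)
		return
	}
	fmt.Println("Lua scripting OK, result:", res)
}
```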
+ +## Sizing and performance + +Proper Redis instance sizing is critical for optimal performance. The following example demonstrates how to calculate memory requirements based on a sample `tail_sampling` configuration: + +```yaml + tail_sampling: + decision_wait: 30s + decision_cache: + non_sampled_cache_size: 1_000_000 + sampled_cache_size: 1_000_000 + redis: + enabled: true + addr: redis://localhost:6379/0 + password: local + prefix: "itc" + max_traces_per_batch: 50 +``` + +To complete the calculation, you must also estimate your workload characteristics: +- **Spans per second**: Assumed throughput of 10,000 spans/sec +- **Average span size**: Assumed size of 900 bytes (marshaled protobuf format) + +### Memory estimation formula + +``` +Total Memory = (Trace Data) + (Decision Caches) + (Overhead) +``` + +#### 1. Trace data storage + +Trace data is stored temporarily in Redis during the `decision_wait` period: + +- **Per-span storage**: ~900 bytes (marshaled protobuf) +- **Storage duration**: `decision_wait * 2` (default TTL) +- **Formula**: `Memory = spans_per_second × decision_wait_seconds × 900 bytes` + +**Example calculation**: At 10,000 spans/second with a 30-second `decision_wait`: +``` +10,000 spans/sec × 30 sec × 900 bytes = 270 MB +``` + +#### 2. Decision cache storage + +The number of cached entries is controlled by the `sampled_cache_size` and `non_sampled_cache_size` configuration parameters. The LRU caches store trace IDs (16 bytes each) plus Redis data structure overhead: + +- **Sampled cache**: `1,000,000 trace IDs × ~50 bytes = ~50 MB` +- **Non-sampled cache**: `1,000,000 trace IDs × ~50 bytes = ~50 MB` +- **Total decision cache**: ~100 MB (default) + + + +#### 3. Batch processing overhead + +- **Current batch queue**: Minimal (trace IDs + scores in sorted set) +- **In-flight batches**: `max_traces_per_batch × average_spans_per_trace × 900 bytes` + +**Example calculation**: 50 traces per batch with 20 spans per trace on average: +``` +50 × 20 × 900 bytes = 900 KB per batch +``` + +Batch size also impacts memory usage and processing efficiency. + +### Complete sizing example + +Based on the configuration above with the following workload parameters: +- **Throughput**: 10,000 spans/second +- **Average span size**: 900 bytes + +| Component | Memory Required | +|-----------|----------------| +| Trace data (active) | 270 MB | +| Decision caches | 100 MB | +| Batch processing | ~1 MB | +| Redis overhead (20%) | ~74 MB | +| **Total** | **~445 MB** | + + + **Sizing guidance**: The calculations above serve as an estimation example. We recommend performing your own capacity planning based on your specific workload characteristics. For production deployments, consider: + - Provisioning **2-3x the calculated memory** to accommodate traffic spikes and growth + - Using Redis cluster mode for horizontal scaling + - Monitoring actual memory usage and adjusting capacity accordingly + + +### Performance considerations + +- **Network latency**: Round-trip time between the collector and Redis directly impacts sampling throughput. Deploy Redis instances with low-latency network connectivity to the collector. +- **Lua script execution**: All cache operations use atomic Lua scripts executed server-side, ensuring data consistency and optimal performance. +- **Cluster mode**: Distributing load across multiple Redis nodes increases throughput and provides fault tolerance for high-availability deployments. 
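Because round-trip time bounds how quickly the processor can read and write shared state, it can be worth measuring latency from the collector host before sizing your deployment. A minimal go-redis sketch, assuming a reachable instance at `localhost:6379`:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	defer rdb.Close()

	// Average PING round-trip time over a small sample; this approximates
	// the per-operation latency floor the processor will see.
	const samples = 100
	start := time.Now()
	for i := 0; i < samples; i++ {
		if err := rdb.Ping(ctx).Err(); err != nil {
			panic(err)
		}
	}
	fmt.Printf("average round trip: %v\n", time.Since(start)/samples)
}
```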
+ +# Limitations and evictions + + + **Performance bottleneck**: Redis and network communication are typically the limiting factors for processor performance. The speed and reliability of your Redis cache are essential for proper collector operation. Ensure your Redis instance has sufficient resources and maintains low-latency network connectivity to the collector. + + +The processor stores trace data temporarily in Redis while making sampling decisions. Understanding data management and eviction policies is critical for optimal performance. + +## Data stored in Redis + +The processor stores the following data structures in Redis: + +1. **Trace spans**: Stored as lists using protobuf-marshaled trace data +2. **Decision cache**: Separate LRU caches for sampled and non-sampled trace IDs +3. **Current batch queue**: Sorted set tracking traces waiting for sampling decisions +4. **In-flight batches**: Temporary storage for traces being evaluated + +## TTL and expiration + + + **Important**: The `decision_wait` configuration parameter directly impacts all TTL values in this section. Adjusting `decision_wait` will proportionally change Redis memory usage and data retention times. For example, doubling `decision_wait` from 30s to 60s will approximately double your Redis memory requirements for trace data. + + +- **Trace data TTL**: Set to `decision_wait * 2` by default + - Ensures trace data persists long enough for evaluation + - Automatically expires after the decision is made + - Example: With `decision_wait: 30s`, traces expire after 60 seconds + +- **In-flight timeout**: Set to `decision_wait * 4` by default + - Protects against orphaned batches from processor failures + - Orphaned batches are automatically recovered and re-queued + - Example: With `decision_wait: 30s`, in-flight batches timeout after 120 seconds + +## LRU eviction for decision caches + +The decision caches implement a Least Recently Used (LRU) eviction strategy using Lua scripts: + +- **Sampled cache**: Default capacity of 1,000,000 trace IDs +- **Non-sampled cache**: Default capacity of 1,000,000 trace IDs + +When a cache reaches its maximum capacity, the least recently accessed trace IDs are automatically evicted. 
This approach ensures: +- Recent sampling decisions remain available for late-arriving spans +- Memory usage remains within configured bounds +- Consistent cache performance under load + +Configure cache sizes through the following parameters: + +```yaml +tail_sampling: + decision_cache: + sampled_cache_size: 1000000 + non_sampled_cache_size: 1000000 +``` + +## Batch processing + +The processor handles traces in batches to optimize performance: + +- **Maximum traces per batch**: Default of 50, configurable via `max_traces_per_batch` +- **Atomic batch operations**: Batches are retrieved atomically from the current queue +- **Failure recovery**: Failed batches are automatically recovered and re-queued after the in-flight timeout expires + + diff --git a/src/content/docs/distributed-tracing/infinite-tracing-on-premise/infinite-tracing-introduction.mdx b/src/content/docs/distributed-tracing/infinite-tracing-on-premise/infinite-tracing-introduction.mdx new file mode 100644 index 00000000000..1678b24bea6 --- /dev/null +++ b/src/content/docs/distributed-tracing/infinite-tracing-on-premise/infinite-tracing-introduction.mdx @@ -0,0 +1,12 @@ +--- +title: (fix this titlle) Introduction to Infinite Tracing +tags: + - Understand dependencies + - Distributed tracing + - Infinite Tracing +metaDescription: 'FIXME' +redirects: + - /docs/on-premise-infinite-tracing +freshnessValidatedDate: never +--- + From d3184a40f80a3ff248a2e86cdc4d9459bfc875bb Mon Sep 17 00:00:00 2001 From: jvegarodriguez Date: Wed, 15 Oct 2025 17:20:09 +0200 Subject: [PATCH 2/7] Create menu --- src/nav/distributed-tracing.yml | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/src/nav/distributed-tracing.yml b/src/nav/distributed-tracing.yml index afea1a07fd9..8ab5338f9df 100644 --- a/src/nav/distributed-tracing.yml +++ b/src/nav/distributed-tracing.yml @@ -35,6 +35,12 @@ pages: path: /docs/distributed-tracing/infinite-tracing/infinite-tracing-configure-proxy-support - title: 'Configure SSL for Java 7, 8' path: /docs/distributed-tracing/other-requirements/infinite-tracing-configuring-ssl-java-7-8 + - title: Infinite Tracing Collector + pages: + - title: Introduction to Infinite Tracing Collector + path: /docs/distributed-tracing/infinite-tracing-on-premise/infinite-tracing-introduction + - title: Bring your own distributed cache + path: /docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache - title: Trace API pages: - title: Introduction to the Trace API From 208678628c2636389ce60755de436a90e3f0561e Mon Sep 17 00:00:00 2001 From: jvegarodriguez Date: Thu, 16 Oct 2025 16:41:27 +0200 Subject: [PATCH 3/7] adjust doc to new config --- .../bring-your-own-cache.mdx | 145 +++++++++++------- 1 file changed, 90 insertions(+), 55 deletions(-) diff --git a/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx b/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx index 1975cfa4a86..23a44ccc91b 100644 --- a/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx +++ b/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx @@ -18,32 +18,37 @@ New Relic's Infinite Tracing Processor is an implementation of the OpenTelemetry The processor supports any Redis-compatible cache implementation. It has been tested and validated with Redis and Valkey in both single-instance and cluster configurations. 
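As an illustration of the cluster topology recommended below, the following go-redis sketch connects to a sharded deployment; the node addresses are placeholders for your own cluster:

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()

	// Placeholder addresses: point these at your own cluster nodes.
	rdb := redis.NewClusterClient(&redis.ClusterOptions{
		Addrs: []string{"node1:6379", "node2:6379", "node3:6379"},
	})
	defer rdb.Close()

	if err := rdb.Ping(ctx).Err(); err != nil {
		panic(err)
	}
	fmt.Println("connected to Redis-compatible cluster")
}
```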
-For production deployments, we recommend using cluster mode (sharded) to ensure high availability and scalability. To enable the cache, add the following configuration to your `tail_sampling` processor section: +For production deployments, we recommend using cluster mode (sharded) to ensure high availability and scalability. To enable distributed caching, add the `distributed_cache` configuration to your `tail_sampling` processor section: ```yaml tail_sampling: - redis: - enabled: true - addr: redis://localhost:6379/0 - prefix: "itc" - max_traces_per_batch: 50 # this impacts instance memory, because batches are loaded in memory + decision_wait: 30s + distributed_cache: + connection: + address: redis://localhost:6379/0 + password: 'local' + trace_window_expiration: 30s + suffix: "itc" + max_traces_per_batch: 50 ``` -When `enabled` is set to `false`, the collector will use in-memory processing instead. The `addr` parameter must specify a valid Redis-compatible server address using the standard format: + + **Configuration behavior**: When `distributed_cache` is configured, the processor automatically uses the distributed cache for state management. If `distributed_cache` is omitted entirely, the collector will use in-memory processing instead. There is no separate `enabled` flag. + + +The `address` parameter must specify a valid Redis-compatible server address using the standard format: ```shell redis[s]://[[username][:password]@][host][:port][/db-number] ``` -Alternatively, you can embed the password directly in the `addr` parameter and omit the separate `password` field: +Alternatively, you can embed credentials directly in the `address` parameter: ```yaml tail_sampling: - redis: - enabled: true - addr: redis://:local@localhost:6379/0 - password: '' - prefix: "itc" + distributed_cache: + connection: + address: redis://:yourpassword@localhost:6379/0 ``` The processor is implemented in Go and uses the [go-redis](https://github.com/redis/go-redis/tree/v9) client library. @@ -65,14 +70,12 @@ Proper Redis instance sizing is critical for optimal performance. The following ```yaml tail_sampling: decision_wait: 30s - decision_cache: - non_sampled_cache_size: 1_000_000 - sampled_cache_size: 1_000_000 - redis: - enabled: true - addr: redis://localhost:6379/0 - password: local - prefix: "itc" + distributed_cache: + connection: + address: redis://localhost:6379/0 + password: 'local' + trace_window_expiration: 30s + suffix: "itc" max_traces_per_batch: 50 ``` @@ -88,24 +91,31 @@ Total Memory = (Trace Data) + (Decision Caches) + (Overhead) #### 1. Trace data storage -Trace data is stored temporarily in Redis during the `decision_wait` period: +Trace data is stored temporarily in Redis during the trace window period: - **Per-span storage**: ~900 bytes (marshaled protobuf) -- **Storage duration**: `decision_wait * 2` (default TTL) -- **Formula**: `Memory = spans_per_second × decision_wait_seconds × 900 bytes` +- **Storage duration**: Controlled by `traces_ttl` (default: 240s) +- **Active window**: Controlled by `trace_window_expiration` (default: 30s) +- **Formula**: `Memory ≈ spans_per_second × trace_window_expiration × 900 bytes` -**Example calculation**: At 10,000 spans/second with a 30-second `decision_wait`: +**Example calculation**: At 10,000 spans/second with a 30-second `trace_window_expiration`: ``` 10,000 spans/sec × 30 sec × 900 bytes = 270 MB ``` +Note: This calculation estimates memory for actively accumulating traces. 
The actual Redis memory may be higher due to traces waiting in the evaluation queue or being processed. + #### 2. Decision cache storage -The number of cached entries is controlled by the `sampled_cache_size` and `non_sampled_cache_size` configuration parameters. The LRU caches store trace IDs (16 bytes each) plus Redis data structure overhead: +When using `distributed_cache`, the decision caches are stored in Redis without explicit size limits. Instead, Redis uses its native LRU eviction policy (configured via `maxmemory-policy`) to manage memory. Each trace ID requires approximately 50 bytes of storage: + +- **Sampled cache**: Managed by Redis LRU eviction +- **Non-sampled cache**: Managed by Redis LRU eviction +- **Typical overhead per trace ID**: ~50 bytes -- **Sampled cache**: `1,000,000 trace IDs × ~50 bytes = ~50 MB` -- **Non-sampled cache**: `1,000,000 trace IDs × ~50 bytes = ~50 MB` -- **Total decision cache**: ~100 MB (default) + + **Memory management**: Configure Redis with `maxmemory-policy allkeys-lru` to allow automatic eviction of old decision cache entries when memory limits are reached. The decision cache keys use TTL-based expiration (controlled by `cache_ttl`) rather than fixed size limits. + @@ -130,10 +140,10 @@ Based on the configuration above with the following workload parameters: | Component | Memory Required | |-----------|----------------| | Trace data (active) | 270 MB | -| Decision caches | 100 MB | +| Decision caches | Variable (LRU-managed) | | Batch processing | ~1 MB | -| Redis overhead (20%) | ~74 MB | -| **Total** | **~445 MB** | +| Redis overhead (20%) | ~54 MB | +| **Total (minimum)** | **~325 MB + decision cache** | **Sizing guidance**: The calculations above serve as an estimation example. We recommend performing your own capacity planning based on your specific workload characteristics. For production deployments, consider: @@ -167,41 +177,66 @@ The processor stores the following data structures in Redis: ## TTL and expiration - - **Important**: The `decision_wait` configuration parameter directly impacts all TTL values in this section. Adjusting `decision_wait` will proportionally change Redis memory usage and data retention times. For example, doubling `decision_wait` from 30s to 60s will approximately double your Redis memory requirements for trace data. +When using `distributed_cache`, the TTL configuration differs from the in-memory processor. The following parameters control data expiration: + + + **Key difference from in-memory mode**: When `distributed_cache` is configured, `trace_window_expiration` replaces `decision_wait` for determining when traces are evaluated. The `trace_window_expiration` parameter defines a sliding window: each time new spans arrive for a trace, the trace remains active for another `trace_window_expiration` period. This incremental approach keeps traces with ongoing activity alive longer than those that have stopped receiving spans. 
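To make the sliding window concrete, the sketch below models one way it could work: each arriving span pushes the trace's evaluation time into the future via a sorted-set score. This is a hedged illustration only; the key name and layout are hypothetical, not the processor's actual schema:

```go
package main

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// scheduleEvaluation postpones a trace's evaluation time by `window` every
// time a span arrives. Because ZADD overwrites the member's score, traces
// that keep receiving spans keep sliding into the future, while idle traces
// become due once `window` has elapsed since their last span.
func scheduleEvaluation(ctx context.Context, rdb *redis.Client, traceID string, window time.Duration) error {
	readyAt := float64(time.Now().Add(window).UnixMilli())
	return rdb.ZAdd(ctx, "itc:eval-queue", redis.Z{Score: readyAt, Member: traceID}).Err()
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	defer rdb.Close()

	// Each span arrival for trace "abc123" resets its 30-second window.
	if err := scheduleEvaluation(ctx, rdb, "abc123", 30*time.Second); err != nil {
		panic(err)
	}
}
```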
-- **Trace data TTL**: Set to `decision_wait * 2` by default - - Ensures trace data persists long enough for evaluation - - Automatically expires after the decision is made - - Example: With `decision_wait: 30s`, traces expire after 60 seconds +### TTL hierarchy and defaults -- **In-flight timeout**: Set to `decision_wait * 4` by default - - Protects against orphaned batches from processor failures - - Orphaned batches are automatically recovered and re-queued - - Example: With `decision_wait: 30s`, in-flight batches timeout after 120 seconds +The processor uses a cascading TTL structure, with each level providing protection for the layer below: -## LRU eviction for decision caches +1. **`trace_window_expiration`** (default: 30s) + - Configures how long to wait after the last span arrives before evaluating a trace + - Acts as a sliding window: resets each time new spans arrive for a trace + - Defined via `distributed_cache.trace_window_expiration` -The decision caches implement a Least Recently Used (LRU) eviction strategy using Lua scripts: +2. **`in_flight_timeout`** (default: `trace_window_expiration * 4` = 120s) + - Maximum time a batch can be processed before being considered orphaned + - Orphaned batches are automatically recovered and re-queued + - Defined via `distributed_cache.in_flight_timeout` -- **Sampled cache**: Default capacity of 1,000,000 trace IDs -- **Non-sampled cache**: Default capacity of 1,000,000 trace IDs +3. **`traces_ttl`** (default: `in_flight_timeout * 2` = 240s) + - Redis key expiration for trace span data + - Ensures trace data persists long enough for evaluation and recovery + - Defined via `distributed_cache.traces_ttl` -When a cache reaches its maximum capacity, the least recently accessed trace IDs are automatically evicted. This approach ensures: -- Recent sampling decisions remain available for late-arriving spans -- Memory usage remains within configured bounds -- Consistent cache performance under load +4. **`cache_ttl`** (default: `traces_ttl * 2` = 480s) + - Redis key expiration for decision cache entries (sampled/non-sampled) + - Prevents duplicate evaluation for late-arriving spans + - Defined via `distributed_cache.cache_ttl` -Configure cache sizes through the following parameters: +### Example configuration ```yaml -tail_sampling: - decision_cache: - sampled_cache_size: 1000000 - non_sampled_cache_size: 1000000 + tail_sampling: + distributed_cache: + trace_window_expiration: 30s # Primary control + in_flight_timeout: 120s # Optional: defaults to trace_window_expiration * 4 + traces_ttl: 240s # Optional: defaults to in_flight_timeout * 2 + cache_ttl: 480s # Optional: defaults to traces_ttl * 2 ``` +## LRU eviction for decision caches + +When using `distributed_cache`, the decision caches rely on Redis's native LRU eviction rather than application-managed size limits: + + + **Redis LRU configuration required**: Configure your Redis instance with `maxmemory-policy allkeys-lru` to enable automatic eviction of old entries when memory limits are reached. The decision cache keys are stored in Redis with TTL-based expiration (controlled by `cache_ttl`), and Redis will automatically evict the least recently used keys when memory pressure occurs. 
+ + +- **Sampled cache**: TTL-managed (default: 480s via `cache_ttl`) +- **Non-sampled cache**: TTL-managed (default: 480s via `cache_ttl`) + +This approach provides several benefits: +- Recent sampling decisions remain available for late-arriving spans +- No hard limit on cache size—Redis manages memory automatically +- Consistent cache performance under load +- Simpler configuration without manual cache sizing + +The decision caches use Lua scripts to atomically check for key existence and refresh TTLs, ensuring data consistency across distributed processor instances. + ## Batch processing The processor handles traces in batches to optimize performance: From e7cdc2e09e8bafcccec2c087272a6d6fb2ec3dc4 Mon Sep 17 00:00:00 2001 From: jvegarodriguez Date: Thu, 6 Nov 2025 11:57:39 +0100 Subject: [PATCH 4/7] include latest config --- .../bring-your-own-cache.mdx | 165 +++++++++++------- 1 file changed, 105 insertions(+), 60 deletions(-) diff --git a/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx b/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx index 23a44ccc91b..89ed917c349 100644 --- a/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx +++ b/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx @@ -29,7 +29,10 @@ For production deployments, we recommend using cluster mode (sharded) to ensure password: 'local' trace_window_expiration: 30s suffix: "itc" - max_traces_per_batch: 50 + max_traces_per_batch: 500 + evaluation_interval: 1s + data_compression: + format: lz4 ``` @@ -53,6 +56,45 @@ Alternatively, you can embed credentials directly in the `address` parameter: The processor is implemented in Go and uses the [go-redis](https://github.com/redis/go-redis/tree/v9) client library. +# Configuration parameters + +The `distributed_cache` section supports the following parameters: + +## Connection parameters + +- **`connection.address`** (required): Redis server address in format `redis[s]://[[username][:password]@][host][:port][/db-number]` +- **`connection.password`** (optional): Redis password (alternative to embedding in address) + +## Trace evaluation parameters + +- **`trace_window_expiration`** (default: 30s): Time window after the last span arrives before a trace is evaluated for sampling decisions +- **`evaluation_interval`** (default: 1s): How frequently the processor evaluates pending traces for sampling decisions +- **`evaluation_workers`** (default: number of CPU cores): Number of parallel worker threads for evaluating sampling policies. Higher values increase throughput but consume more resources. + +## TTL and expiration parameters + +- **`in_flight_timeout`** (default: equals `trace_window_expiration`): Maximum time a batch can remain in processing before being considered orphaned and recovered +- **`traces_ttl`** (default: 1 hour): Redis key expiration time for trace span data +- **`cache_ttl`** (default: 2 hours): Redis key expiration time for sampling decision cache entries + +## Storage parameters + +- **`max_traces_per_batch`** (default: 500): Maximum number of traces processed in a single evaluation cycle. Higher values improve throughput but increase memory usage. 
+- **`suffix`** (default: "tsp"): Prefix for Redis keys to avoid collisions when multiple processors share the same Redis instance +- **`data_compression`** (optional): Compression settings for trace data stored in Redis + - **`format`** (default: none): Compression format: `none`, `snappy`, `zstd`, or `lz4` + + + **Compression tradeoffs**: Enabling compression reduces network bandwidth between the processor and Redis and lowers Redis memory requirements. However, compression increases CPU and memory usage on the processor instance during compression/decompression operations. + + **Format recommendations**: + - **`zstd`**: Maximum compression ratio, best for bandwidth-constrained environments but highest CPU overhead during decompression + - **`lz4`**: Balanced option with good compression and near-negligible decompression overhead—recommended for most deployments + - **`snappy`**: Fastest compression/decompression with lowest CPU cost, but lower compression ratios than lz4 + + Choose based on your bottleneck: network bandwidth and Redis storage vs. processor CPU availability. + + # Redis-compatible cache requirements The processor uses the cache as distributed storage for the following trace data: @@ -65,21 +107,7 @@ The processor executes **Lua scripts** to interact with the Redis cache atomical ## Sizing and performance -Proper Redis instance sizing is critical for optimal performance. The following example demonstrates how to calculate memory requirements based on a sample `tail_sampling` configuration: - -```yaml - tail_sampling: - decision_wait: 30s - distributed_cache: - connection: - address: redis://localhost:6379/0 - password: 'local' - trace_window_expiration: 30s - suffix: "itc" - max_traces_per_batch: 50 -``` - -To complete the calculation, you must also estimate your workload characteristics: +Proper Redis instance sizing is critical for optimal performance. Use the configuration example from "Supported caches" above. To calculate memory requirements, you must estimate your workload characteristics: - **Spans per second**: Assumed throughput of 10,000 spans/sec - **Average span size**: Assumed size of 900 bytes (marshaled protobuf format) @@ -91,19 +119,28 @@ Total Memory = (Trace Data) + (Decision Caches) + (Overhead) #### 1. Trace data storage -Trace data is stored temporarily in Redis during the trace window period: +Trace data is stored in Redis for the full `traces_ttl` period to support late-arriving spans and trace recovery: - **Per-span storage**: ~900 bytes (marshaled protobuf) -- **Storage duration**: Controlled by `traces_ttl` (default: 240s) -- **Active window**: Controlled by `trace_window_expiration` (default: 30s) -- **Formula**: `Memory ≈ spans_per_second × trace_window_expiration × 900 bytes` +- **Storage duration**: Controlled by `traces_ttl` (default: **1 hour**) +- **Active collection window**: Controlled by `trace_window_expiration` (default: 30s) +- **Formula**: `Memory ≈ spans_per_second × traces_ttl × 900 bytes` + + + **Active window vs. full retention**: Traces are collected during a ~30-second active window (`trace_window_expiration`), but persist in Redis for the full 1-hour `traces_ttl` period. This allows the processor to handle late-arriving spans and recover orphaned traces. Your Redis sizing must account for the **full retention period**, not just the active window. 
+ -**Example calculation**: At 10,000 spans/second with a 30-second `trace_window_expiration`: +**Example calculation**: At 10,000 spans/second with 1-hour `traces_ttl`: ``` -10,000 spans/sec × 30 sec × 900 bytes = 270 MB +10,000 spans/sec × 3600 sec × 900 bytes = 32.4 GB ``` -Note: This calculation estimates memory for actively accumulating traces. The actual Redis memory may be higher due to traces waiting in the evaluation queue or being processed. +**With lz4 compression** (we have observed 25% reduction): +``` +32.4 GB × 0.75 = 24.3 GB +``` + +Note: This calculation represents the primary memory consumer. Actual Redis memory may be slightly higher due to decision caches and internal data structures. #### 2. Decision cache storage @@ -124,30 +161,43 @@ When using `distributed_cache`, the decision caches are stored in Redis without - **Current batch queue**: Minimal (trace IDs + scores in sorted set) - **In-flight batches**: `max_traces_per_batch × average_spans_per_trace × 900 bytes` -**Example calculation**: 50 traces per batch with 20 spans per trace on average: +**Example calculation**: 500 traces per batch (default) with 20 spans per trace on average: ``` -50 × 20 × 900 bytes = 900 KB per batch +500 × 20 × 900 bytes = 9 MB per batch ``` -Batch size also impacts memory usage and processing efficiency. +Batch size impacts memory usage during evaluation. In-flight batch memory is temporary and released after processing completes. ### Complete sizing example Based on the configuration above with the following workload parameters: - **Throughput**: 10,000 spans/second - **Average span size**: 900 bytes +- **Storage period**: 1 hour (`traces_ttl`) + +**Without compression:** + +| Component | Memory Required | +|-----------|----------------| +| Trace data (1-hour retention) | 32.4 GB | +| Decision caches | Variable (LRU-managed) | +| Batch processing | ~10 MB | +| Redis overhead (25%) | ~8.1 GB | +| **Total (minimum)** | **~40.5 GB + decision cache** | + +**With lz4 compression (25% reduction):** | Component | Memory Required | |-----------|----------------| -| Trace data (active) | 270 MB | +| Trace data (1-hour retention) | 24.3 GB | | Decision caches | Variable (LRU-managed) | -| Batch processing | ~1 MB | -| Redis overhead (20%) | ~54 MB | -| **Total (minimum)** | **~325 MB + decision cache** | +| Batch processing | ~7 MB | +| Redis overhead (25%) | ~6.1 GB | +| **Total (minimum)** | **~30.4 GB + decision cache** | **Sizing guidance**: The calculations above serve as an estimation example. We recommend performing your own capacity planning based on your specific workload characteristics. For production deployments, consider: - - Provisioning **2-3x the calculated memory** to accommodate traffic spikes and growth + - Provisioning **10-15% additional memory** beyond calculated requirements to accommodate traffic spikes and transient overhead - Using Redis cluster mode for horizontal scaling - Monitoring actual memory usage and adjusting capacity accordingly @@ -155,25 +205,15 @@ Based on the configuration above with the following workload parameters: ### Performance considerations - **Network latency**: Round-trip time between the collector and Redis directly impacts sampling throughput. Deploy Redis instances with low-latency network connectivity to the collector. -- **Lua script execution**: All cache operations use atomic Lua scripts executed server-side, ensuring data consistency and optimal performance. 
- **Cluster mode**: Distributing load across multiple Redis nodes increases throughput and provides fault tolerance for high-availability deployments. -# Limitations and evictions +# Data Management and Performance **Performance bottleneck**: Redis and network communication are typically the limiting factors for processor performance. The speed and reliability of your Redis cache are essential for proper collector operation. Ensure your Redis instance has sufficient resources and maintains low-latency network connectivity to the collector. -The processor stores trace data temporarily in Redis while making sampling decisions. Understanding data management and eviction policies is critical for optimal performance. - -## Data stored in Redis - -The processor stores the following data structures in Redis: - -1. **Trace spans**: Stored as lists using protobuf-marshaled trace data -2. **Decision cache**: Separate LRU caches for sampled and non-sampled trace IDs -3. **Current batch queue**: Sorted set tracking traces waiting for sampling decisions -4. **In-flight batches**: Temporary storage for traces being evaluated +The processor stores trace data temporarily in Redis while making sampling decisions. Understanding data expiration and cache eviction policies is critical for optimal performance. ## TTL and expiration @@ -192,17 +232,17 @@ The processor uses a cascading TTL structure, with each level providing protecti - Acts as a sliding window: resets each time new spans arrive for a trace - Defined via `distributed_cache.trace_window_expiration` -2. **`in_flight_timeout`** (default: `trace_window_expiration * 4` = 120s) +2. **`in_flight_timeout`** (default: equals `trace_window_expiration` if not specified) - Maximum time a batch can be processed before being considered orphaned - Orphaned batches are automatically recovered and re-queued - - Defined via `distributed_cache.in_flight_timeout` + - Can be overridden via `distributed_cache.in_flight_timeout` -3. **`traces_ttl`** (default: `in_flight_timeout * 2` = 240s) +3. **`traces_ttl`** (default: 1 hour) - Redis key expiration for trace span data - Ensures trace data persists long enough for evaluation and recovery - Defined via `distributed_cache.traces_ttl` -4. **`cache_ttl`** (default: `traces_ttl * 2` = 480s) +4. 
**`cache_ttl`** (default: 2 hours) - Redis key expiration for decision cache entries (sampled/non-sampled) - Prevents duplicate evaluation for late-arriving spans - Defined via `distributed_cache.cache_ttl` @@ -212,10 +252,19 @@ The processor uses a cascading TTL structure, with each level providing protecti ```yaml tail_sampling: distributed_cache: - trace_window_expiration: 30s # Primary control - in_flight_timeout: 120s # Optional: defaults to trace_window_expiration * 4 - traces_ttl: 240s # Optional: defaults to in_flight_timeout * 2 - cache_ttl: 480s # Optional: defaults to traces_ttl * 2 + connection: + address: redis://localhost:6379/0 + password: 'local' + trace_window_expiration: 30s # Default: how long to wait after last span before evaluating + in_flight_timeout: 120s # Optional: defaults to trace_window_expiration if not set + traces_ttl: 3600s # Optional: default 1 hour + cache_ttl: 7200s # Optional: default 2 hours + suffix: "itc" # Redis key prefix + max_traces_per_batch: 500 # Default: traces processed per evaluation cycle + evaluation_interval: 1s # Default: evaluation frequency + evaluation_workers: 4 # Default: number of parallel workers (defaults to CPU count) + data_compression: + format: lz4 # Optional: compression format (none, snappy, zstd, lz4); lz4 recommended ``` ## LRU eviction for decision caches @@ -223,11 +272,11 @@ The processor uses a cascading TTL structure, with each level providing protecti When using `distributed_cache`, the decision caches rely on Redis's native LRU eviction rather than application-managed size limits: - **Redis LRU configuration required**: Configure your Redis instance with `maxmemory-policy allkeys-lru` to enable automatic eviction of old entries when memory limits are reached. The decision cache keys are stored in Redis with TTL-based expiration (controlled by `cache_ttl`), and Redis will automatically evict the least recently used keys when memory pressure occurs. +**Redis LRU configuration required**: Configure your Redis instance with `maxmemory-policy allkeys-lru` to enable automatic eviction of old entries when memory limits are reached. The decision cache keys are stored in Redis with TTL-based expiration (controlled by `cache_ttl`), and Redis will automatically evict the least recently used keys when memory pressure occurs. -- **Sampled cache**: TTL-managed (default: 480s via `cache_ttl`) -- **Non-sampled cache**: TTL-managed (default: 480s via `cache_ttl`) +- **Sampled cache**: TTL-managed (default: 2 hours via `cache_ttl`) +- **Non-sampled cache**: TTL-managed (default: 2 hours via `cache_ttl`) This approach provides several benefits: - Recent sampling decisions remain available for late-arriving spans @@ -235,14 +284,10 @@ This approach provides several benefits: - Consistent cache performance under load - Simpler configuration without manual cache sizing -The decision caches use Lua scripts to atomically check for key existence and refresh TTLs, ensuring data consistency across distributed processor instances. 
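The exact scripts are internal to the processor, but the pattern can be illustrated as follows. This is a hedged sketch, not the processor's actual implementation; the key name is hypothetical, and the TTL mirrors the default `cache_ttl` of 2 hours:

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// checkAndRefresh returns 1 if the key exists, refreshing its TTL in the
// same atomic step, and 0 otherwise. Running EXISTS and EXPIRE inside one
// Lua script means no other processor instance can observe an intermediate
// state between the check and the refresh.
var checkAndRefresh = redis.NewScript(`
if redis.call("EXISTS", KEYS[1]) == 1 then
  redis.call("EXPIRE", KEYS[1], ARGV[1])
  return 1
end
return 0
`)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	defer rdb.Close()

	// Hypothetical decision-cache key; 7200s mirrors the default cache_ttl.
	hit, err := checkAndRefresh.Run(ctx, rdb, []string{"itc:sampled:abc123"}, 7200).Int()
	if err != nil {
		panic(err)
	}
	fmt.Println("decision cached:", hit == 1)
}
```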
- ## Batch processing The processor handles traces in batches to optimize performance: -- **Maximum traces per batch**: Default of 50, configurable via `max_traces_per_batch` +- **Maximum traces per batch**: Default of 500, configurable via `max_traces_per_batch` - **Atomic batch operations**: Batches are retrieved atomically from the current queue -- **Failure recovery**: Failed batches are automatically recovered and re-queued after the in-flight timeout expires - - +- **Failure recovery**: Failed batches are automatically recovered and re-queued after the in-flight timeout expires \ No newline at end of file From 60ebf388b2ac66b85f6cf6af5af78aa63ef71f81 Mon Sep 17 00:00:00 2001 From: jvegarodriguez Date: Thu, 6 Nov 2025 17:49:53 +0100 Subject: [PATCH 5/7] remove duplicate content, move config to the top --- .../bring-your-own-cache.mdx | 64 +++---------------- 1 file changed, 10 insertions(+), 54 deletions(-) diff --git a/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx b/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx index 89ed917c349..becd6f2b1ce 100644 --- a/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx +++ b/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx @@ -22,17 +22,20 @@ For production deployments, we recommend using cluster mode (sharded) to ensure ```yaml tail_sampling: - decision_wait: 30s distributed_cache: connection: address: redis://localhost:6379/0 password: 'local' - trace_window_expiration: 30s - suffix: "itc" - max_traces_per_batch: 500 - evaluation_interval: 1s + trace_window_expiration: 30s # Default: how long to wait after last span before evaluating + in_flight_timeout: 120s # Optional: defaults to trace_window_expiration if not set + traces_ttl: 3600s # Optional: default 1 hour + cache_ttl: 7200s # Optional: default 2 hours + suffix: "itc" # Redis key prefix + max_traces_per_batch: 500 # Default: traces processed per evaluation cycle + evaluation_interval: 1s # Default: evaluation frequency + evaluation_workers: 4 # Default: number of parallel workers (defaults to CPU count) data_compression: - format: lz4 + format: lz4 # Optional: compression format (none, snappy, zstd, lz4); lz4 recommended ``` @@ -154,8 +157,6 @@ When using `distributed_cache`, the decision caches are stored in Redis without **Memory management**: Configure Redis with `maxmemory-policy allkeys-lru` to allow automatic eviction of old decision cache entries when memory limits are reached. The decision cache keys use TTL-based expiration (controlled by `cache_ttl`) rather than fixed size limits. - - #### 3. Batch processing overhead - **Current batch queue**: Minimal (trace IDs + scores in sorted set) @@ -245,49 +246,4 @@ The processor uses a cascading TTL structure, with each level providing protecti 4. 
**`cache_ttl`** (default: 2 hours) - Redis key expiration for decision cache entries (sampled/non-sampled) - Prevents duplicate evaluation for late-arriving spans - - Defined via `distributed_cache.cache_ttl` - -### Example configuration - -```yaml - tail_sampling: - distributed_cache: - connection: - address: redis://localhost:6379/0 - password: 'local' - trace_window_expiration: 30s # Default: how long to wait after last span before evaluating - in_flight_timeout: 120s # Optional: defaults to trace_window_expiration if not set - traces_ttl: 3600s # Optional: default 1 hour - cache_ttl: 7200s # Optional: default 2 hours - suffix: "itc" # Redis key prefix - max_traces_per_batch: 500 # Default: traces processed per evaluation cycle - evaluation_interval: 1s # Default: evaluation frequency - evaluation_workers: 4 # Default: number of parallel workers (defaults to CPU count) - data_compression: - format: lz4 # Optional: compression format (none, snappy, zstd, lz4); lz4 recommended -``` - -## LRU eviction for decision caches - -When using `distributed_cache`, the decision caches rely on Redis's native LRU eviction rather than application-managed size limits: - - -**Redis LRU configuration required**: Configure your Redis instance with `maxmemory-policy allkeys-lru` to enable automatic eviction of old entries when memory limits are reached. The decision cache keys are stored in Redis with TTL-based expiration (controlled by `cache_ttl`), and Redis will automatically evict the least recently used keys when memory pressure occurs. - - -- **Sampled cache**: TTL-managed (default: 2 hours via `cache_ttl`) -- **Non-sampled cache**: TTL-managed (default: 2 hours via `cache_ttl`) - -This approach provides several benefits: -- Recent sampling decisions remain available for late-arriving spans -- No hard limit on cache size—Redis manages memory automatically -- Consistent cache performance under load -- Simpler configuration without manual cache sizing - -## Batch processing - -The processor handles traces in batches to optimize performance: - -- **Maximum traces per batch**: Default of 500, configurable via `max_traces_per_batch` -- **Atomic batch operations**: Batches are retrieved atomically from the current queue -- **Failure recovery**: Failed batches are automatically recovered and re-queued after the in-flight timeout expires \ No newline at end of file + - Defined via `distributed_cache.cache_ttl` \ No newline at end of file From 30ec135374af87fdf075f9019c8dc8a142ba50eb Mon Sep 17 00:00:00 2001 From: jvegarodriguez Date: Mon, 10 Nov 2025 09:58:32 +0100 Subject: [PATCH 6/7] remove additional pages and nav config --- .../infinite-tracing-introduction.mdx | 12 ------------ src/nav/distributed-tracing.yml | 6 ------ 2 files changed, 18 deletions(-) delete mode 100644 src/content/docs/distributed-tracing/infinite-tracing-on-premise/infinite-tracing-introduction.mdx diff --git a/src/content/docs/distributed-tracing/infinite-tracing-on-premise/infinite-tracing-introduction.mdx b/src/content/docs/distributed-tracing/infinite-tracing-on-premise/infinite-tracing-introduction.mdx deleted file mode 100644 index 1678b24bea6..00000000000 --- a/src/content/docs/distributed-tracing/infinite-tracing-on-premise/infinite-tracing-introduction.mdx +++ /dev/null @@ -1,12 +0,0 @@ ---- -title: (fix this titlle) Introduction to Infinite Tracing -tags: - - Understand dependencies - - Distributed tracing - - Infinite Tracing -metaDescription: 'FIXME' -redirects: - - /docs/on-premise-infinite-tracing 
-freshnessValidatedDate: never ---- - diff --git a/src/nav/distributed-tracing.yml b/src/nav/distributed-tracing.yml index 8ab5338f9df..afea1a07fd9 100644 --- a/src/nav/distributed-tracing.yml +++ b/src/nav/distributed-tracing.yml @@ -35,12 +35,6 @@ pages: path: /docs/distributed-tracing/infinite-tracing/infinite-tracing-configure-proxy-support - title: 'Configure SSL for Java 7, 8' path: /docs/distributed-tracing/other-requirements/infinite-tracing-configuring-ssl-java-7-8 - - title: Infinite Tracing Collector - pages: - - title: Introduction to Infinite Tracing Collector - path: /docs/distributed-tracing/infinite-tracing-on-premise/infinite-tracing-introduction - - title: Bring your own distributed cache - path: /docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache - title: Trace API pages: - title: Introduction to the Trace API From ad1e75312f081b827b3bac79a076b800cc7ec5e7 Mon Sep 17 00:00:00 2001 From: Abhilash Dutta Date: Mon, 10 Nov 2025 15:38:24 +0530 Subject: [PATCH 7/7] Updated the indentation and fixed other issues --- .../bring-your-own-cache.mdx | 165 +++++++++--------- 1 file changed, 83 insertions(+), 82 deletions(-) diff --git a/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx b/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx index becd6f2b1ce..9e39ae8d33f 100644 --- a/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx +++ b/src/content/docs/distributed-tracing/infinite-tracing-on-premise/bring-your-own-cache.mdx @@ -7,59 +7,57 @@ tags: - Redis - Cache configuration metaDescription: 'Configure Redis-compatible caches for Infinite Tracing on-premise tail sampling processor to enable high-availability and distributed processing' -redirects: [] freshnessValidatedDate: never --- - New Relic's Infinite Tracing Processor is an implementation of the OpenTelemetry Collector [tailsamplingprocessor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor). In addition to upstream features, it supports highly scalable distributed processing by using a distributed cache for shared state storage. This documentation describes the supported cache implementations and their configuration. -# Supported caches +## Supported caches The processor supports any Redis-compatible cache implementation. It has been tested and validated with Redis and Valkey in both single-instance and cluster configurations. For production deployments, we recommend using cluster mode (sharded) to ensure high availability and scalability. 
To enable distributed caching, add the `distributed_cache` configuration to your `tail_sampling` processor section: -```yaml - tail_sampling: - distributed_cache: - connection: - address: redis://localhost:6379/0 - password: 'local' - trace_window_expiration: 30s # Default: how long to wait after last span before evaluating - in_flight_timeout: 120s # Optional: defaults to trace_window_expiration if not set - traces_ttl: 3600s # Optional: default 1 hour - cache_ttl: 7200s # Optional: default 2 hours - suffix: "itc" # Redis key prefix - max_traces_per_batch: 500 # Default: traces processed per evaluation cycle - evaluation_interval: 1s # Default: evaluation frequency - evaluation_workers: 4 # Default: number of parallel workers (defaults to CPU count) - data_compression: - format: lz4 # Optional: compression format (none, snappy, zstd, lz4); lz4 recommended -``` - - - **Configuration behavior**: When `distributed_cache` is configured, the processor automatically uses the distributed cache for state management. If `distributed_cache` is omitted entirely, the collector will use in-memory processing instead. There is no separate `enabled` flag. - + ```yaml + tail_sampling: + distributed_cache: + connection: + address: redis://localhost:6379/0 + password: 'local' + trace_window_expiration: 30s # Default: how long to wait after last span before evaluating + in_flight_timeout: 120s # Optional: defaults to trace_window_expiration if not set + traces_ttl: 3600s # Optional: default 1 hour + cache_ttl: 7200s # Optional: default 2 hours + suffix: "itc" # Redis key prefix + max_traces_per_batch: 500 # Default: traces processed per evaluation cycle + evaluation_interval: 1s # Default: evaluation frequency + evaluation_workers: 4 # Default: number of parallel workers (defaults to CPU count) + data_compression: + format: lz4 # Optional: compression format (none, snappy, zstd, lz4); lz4 recommended + ``` + + + **Configuration behavior**: When `distributed_cache` is configured, the processor automatically uses the distributed cache for state management. If `distributed_cache` is omitted entirely, the collector will use in-memory processing instead. There is no separate `enabled` flag. + The `address` parameter must specify a valid Redis-compatible server address using the standard format: -```shell -redis[s]://[[username][:password]@][host][:port][/db-number] -``` + ```shell + redis[s]://[[username][:password]@][host][:port][/db-number] + ``` Alternatively, you can embed credentials directly in the `address` parameter: -```yaml - tail_sampling: - distributed_cache: - connection: - address: redis://:yourpassword@localhost:6379/0 -``` + ```yaml + tail_sampling: + distributed_cache: + connection: + address: redis://:yourpassword@localhost:6379/0 + ``` The processor is implemented in Go and uses the [go-redis](https://github.com/redis/go-redis/tree/v9) client library. -# Configuration parameters +## Configuration parameters The `distributed_cache` section supports the following parameters: @@ -87,18 +85,18 @@ The `distributed_cache` section supports the following parameters: - **`data_compression`** (optional): Compression settings for trace data stored in Redis - **`format`** (default: none): Compression format: `none`, `snappy`, `zstd`, or `lz4` - - **Compression tradeoffs**: Enabling compression reduces network bandwidth between the processor and Redis and lowers Redis memory requirements. 
However, compression increases CPU and memory usage on the processor instance during compression/decompression operations. + + **Compression tradeoffs**: Enabling compression reduces network bandwidth between the processor and Redis and lowers Redis memory requirements. However, compression increases CPU and memory usage on the processor instance during compression/decompression operations. - **Format recommendations**: - - **`zstd`**: Maximum compression ratio, best for bandwidth-constrained environments but highest CPU overhead during decompression - - **`lz4`**: Balanced option with good compression and near-negligible decompression overhead—recommended for most deployments - - **`snappy`**: Fastest compression/decompression with lowest CPU cost, but lower compression ratios than lz4 + **Format recommendations**: + - **`zstd`**: Maximum compression ratio, best for bandwidth-constrained environments but highest CPU overhead during decompression + - **`lz4`**: Balanced option with good compression and near-negligible decompression overhead—recommended for most deployments + - **`snappy`**: Fastest compression/decompression with lowest CPU cost, but lower compression ratios than lz4 - Choose based on your bottleneck: network bandwidth and Redis storage vs. processor CPU availability. - + Choose based on your bottleneck: network bandwidth and Redis storage vs. processor CPU availability. + -# Redis-compatible cache requirements +## Redis-compatible cache requirements The processor uses the cache as distributed storage for the following trace data: @@ -116,32 +114,34 @@ Proper Redis instance sizing is critical for optimal performance. Use the config ### Memory estimation formula -``` -Total Memory = (Trace Data) + (Decision Caches) + (Overhead) -``` + ```shell + Total Memory = (Trace Data) + (Decision Caches) + (Overhead) + ``` #### 1. Trace data storage Trace data is stored in Redis for the full `traces_ttl` period to support late-arriving spans and trace recovery: -- **Per-span storage**: ~900 bytes (marshaled protobuf) +- **Per-span storage**: `~900 bytes` (marshaled protobuf) - **Storage duration**: Controlled by `traces_ttl` (default: **1 hour**) - **Active collection window**: Controlled by `trace_window_expiration` (default: 30s) - **Formula**: `Memory ≈ spans_per_second × traces_ttl × 900 bytes` - - **Active window vs. full retention**: Traces are collected during a ~30-second active window (`trace_window_expiration`), but persist in Redis for the full 1-hour `traces_ttl` period. This allows the processor to handle late-arriving spans and recover orphaned traces. Your Redis sizing must account for the **full retention period**, not just the active window. - + + **Active window vs. full retention**: Traces are collected during a `~30-second` active window (`trace_window_expiration`), but persist in Redis for the full 1-hour `traces_ttl` period. This allows the processor to handle late-arriving spans and recover orphaned traces. Your Redis sizing must account for the **full retention period**, not just the active window. + **Example calculation**: At 10,000 spans/second with 1-hour `traces_ttl`: -``` -10,000 spans/sec × 3600 sec × 900 bytes = 32.4 GB -``` + + ```shell + 10,000 spans/sec × 3600 sec × 900 bytes = 32.4 GB + ``` **With lz4 compression** (we have observed 25% reduction): -``` -32.4 GB × 0.75 = 24.3 GB -``` + + ```shell + 32.4 GB × 0.75 = 24.3 GB + ``` Note: This calculation represents the primary memory consumer. 
Actual Redis memory may be slightly higher due to decision caches and internal data structures.

#### 2. Decision cache storage

When using `distributed_cache`, the decision caches are stored in Redis without explicit size limits. Instead, Redis uses its native LRU eviction policy (configured via `maxmemory-policy`) to manage memory. Each trace ID requires approximately 50 bytes of storage:

- **Sampled cache**: Managed by Redis LRU eviction
- **Non-sampled cache**: Managed by Redis LRU eviction
- **Typical overhead per trace ID**: `~50 bytes`

  **Memory management**: Configure Redis with `maxmemory-policy allkeys-lru` to allow automatic eviction of old decision cache entries when memory limits are reached. The decision cache keys use TTL-based expiration (controlled by `cache_ttl`) rather than fixed size limits.

#### 3. Batch processing overhead

- **Current batch queue**: Minimal (trace IDs + scores in sorted set)
- **In-flight batches**: `max_traces_per_batch × average_spans_per_trace × 900 bytes`

**Example calculation**: 500 traces per batch (default) with 20 spans per trace on average:

  ```shell
  500 × 20 × 900 bytes = 9 MB per batch
  ```

Batch size impacts memory usage during evaluation. In-flight batch memory is temporary and released after processing completes.

### Complete sizing example

Based on the configuration above with the following workload parameters:
- **Throughput**: 10,000 spans/second
- **Average span size**: 900 bytes
- **Storage period**: 1 hour (`traces_ttl`)

**Without compression:**

| Component | Memory Required |
|-----------|----------------|
| Trace data (1-hour retention) | 32.4 GB |
| Decision caches | Variable (LRU-managed) |
| Batch processing | `~10 MB` |
| Redis overhead (25%) | `~8.1 GB` |
| **Total (minimum)** | **~40.5 GB + decision cache** |

**With lz4 compression (25% reduction):**

| Component | Memory Required |
|-----------|----------------|
| Trace data (1-hour retention) | 24.3 GB |
| Decision caches | Variable (LRU-managed) |
| Batch processing | `~7 MB` |
| Redis overhead (25%) | `~6.1 GB` |
| **Total (minimum)** | **~30.4 GB + decision cache** |

  **Sizing guidance**: The calculations above serve as an estimation example. We recommend performing your own capacity planning based on your specific workload characteristics.
For production deployments, consider:
  - Provisioning **10-15% additional memory** beyond calculated requirements to accommodate traffic spikes and transient overhead
  - Using Redis cluster mode for horizontal scaling
  - Monitoring actual memory usage and adjusting capacity accordingly

### Performance considerations

- **Network latency**: Round-trip time between the collector and Redis directly impacts sampling throughput. Deploy Redis instances with low-latency network connectivity to the collector.
- **Cluster mode**: Distributing load across multiple Redis nodes increases throughput and provides fault tolerance for high-availability deployments.

## Data management and performance

  **Performance bottleneck**: Redis and network communication are typically the limiting factors for processor performance. The speed and reliability of your Redis cache are essential for proper collector operation. Ensure your Redis instance has sufficient resources and maintains low-latency network connectivity to the collector.

The processor stores trace data temporarily in Redis while making sampling decisions. Understanding data expiration and cache eviction policies is critical for optimal performance.

## TTL and expiration

When using `distributed_cache`, the TTL configuration differs from the in-memory processor. The following parameters control data expiration:

  **Key difference from in-memory mode**: When `distributed_cache` is configured, `trace_window_expiration` replaces `decision_wait` for determining when traces are evaluated. The `trace_window_expiration` parameter defines a sliding window: each time new spans arrive for a trace, the trace remains active for another `trace_window_expiration` period. This incremental approach keeps traces with ongoing activity alive longer than those that have stopped receiving spans.

### TTL hierarchy and defaults