---
title: Bring your own cache
tags:
- Distributed tracing
- Infinite Tracing
- On-premise
- Redis
- Cache configuration
metaDescription: 'Configure Redis-compatible caches for the Infinite Tracing on-premise tail sampling processor to enable high availability and distributed processing'
freshnessValidatedDate: never
---

New Relic's Infinite Tracing Processor is an implementation of the OpenTelemetry Collector [tailsamplingprocessor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor). In addition to upstream features, it supports highly scalable distributed processing by using a distributed cache for shared state storage. This documentation describes the supported cache implementations and their configuration.

## Supported caches

The processor supports any Redis-compatible cache implementation. It has been tested and validated with Redis and Valkey in both single-instance and cluster configurations.

For production deployments, we recommend using cluster mode (sharded) to ensure high availability and scalability. To enable distributed caching, add the `distributed_cache` configuration to your `tail_sampling` processor section:

```yaml
tail_sampling:
  distributed_cache:
    connection:
      address: redis://localhost:6379/0
      password: 'local'
    trace_window_expiration: 30s # Default: how long to wait after the last span before evaluating
    in_flight_timeout: 120s # Optional: defaults to trace_window_expiration if not set
    traces_ttl: 3600s # Optional: default 1 hour
    cache_ttl: 7200s # Optional: default 2 hours
    suffix: "itc" # Redis key prefix
    max_traces_per_batch: 500 # Default: traces processed per evaluation cycle
    evaluation_interval: 1s # Default: evaluation frequency
    evaluation_workers: 4 # Optional: number of parallel workers (defaults to CPU count)
    data_compression:
      format: lz4 # Optional: compression format (none, snappy, zstd, lz4); lz4 recommended
```

<Callout variant="important">
**Configuration behavior**: When `distributed_cache` is configured, the processor automatically uses the distributed cache for state management. If `distributed_cache` is omitted entirely, the collector uses in-memory processing instead. There is no separate `enabled` flag.
</Callout>

The `address` parameter must specify a valid Redis-compatible server address using the standard format:

```shell
redis[s]://[[username][:password]@][host][:port][/db-number]
```

Alternatively, you can embed credentials directly in the `address` parameter:

```yaml
tail_sampling:
  distributed_cache:
    connection:
      address: redis://:yourpassword@localhost:6379/0
```

The processor is implemented in Go and uses the [go-redis](https://github.com/redis/go-redis/tree/v9) client library.

## Configuration parameters

The `distributed_cache` section supports the following parameters:

### Connection parameters

- **`connection.address`** (required): Redis server address in the format `redis[s]://[[username][:password]@][host][:port][/db-number]`
- **`connection.password`** (optional): Redis password (alternative to embedding it in the address)
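
For example, to connect to a hypothetical TLS-enabled endpoint with a username, all connection details can live in the address itself (the host and credentials below are placeholders, not real values):

```yaml
tail_sampling:
  distributed_cache:
    connection:
      # rediss:// selects TLS; user, password, host, port, and db are placeholders
      address: rediss://app-user:s3cret@redis-cluster.example.internal:6380/0
```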

### Trace evaluation parameters

- **`trace_window_expiration`** (default: 30s): Time window after the last span arrives before a trace is evaluated for sampling decisions
- **`evaluation_interval`** (default: 1s): How frequently the processor evaluates pending traces for sampling decisions
- **`evaluation_workers`** (default: number of CPU cores): Number of parallel worker threads for evaluating sampling policies. Higher values increase throughput but consume more resources.
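
As a sketch, a deployment that trades extra CPU for higher evaluation throughput might raise the worker count and batch size (the values below are illustrative assumptions, not recommendations):

```yaml
tail_sampling:
  distributed_cache:
    connection:
      address: redis://localhost:6379/0
    evaluation_interval: 1s      # evaluate pending traces every second
    evaluation_workers: 8        # assumed: more workers than the CPU-count default
    max_traces_per_batch: 1000   # assumed: larger batches per evaluation cycle
```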

### TTL and expiration parameters

- **`in_flight_timeout`** (default: equals `trace_window_expiration`): Maximum time a batch can remain in processing before being considered orphaned and recovered
- **`traces_ttl`** (default: 1 hour): Redis key expiration time for trace span data
- **`cache_ttl`** (default: 2 hours): Redis key expiration time for sampling decision cache entries

### Storage parameters

- **`max_traces_per_batch`** (default: 500): Maximum number of traces processed in a single evaluation cycle. Higher values improve throughput but increase memory usage.
- **`suffix`** (default: "tsp"): Prefix for Redis keys to avoid collisions when multiple processors share the same Redis instance
- **`data_compression`** (optional): Compression settings for trace data stored in Redis
  - **`format`** (default: none): Compression format: `none`, `snappy`, `zstd`, or `lz4`

<Callout variant="tip">
**Compression tradeoffs**: Enabling compression reduces network bandwidth between the processor and Redis and lowers Redis memory requirements. However, compression increases CPU and memory usage on the processor instance during compression/decompression operations.

**Format recommendations**:
- **`zstd`**: Maximum compression ratio, best for bandwidth-constrained environments but highest CPU overhead during decompression
- **`lz4`**: Balanced option with good compression and near-negligible decompression overhead; recommended for most deployments
- **`snappy`**: Fastest compression/decompression with lowest CPU cost, but lower compression ratios than lz4

Choose based on your bottleneck: network bandwidth and Redis storage vs. processor CPU availability.
</Callout>
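
For instance, a bandwidth-constrained deployment might select `zstd` instead of the recommended `lz4` (a sketch; the address is a placeholder):

```yaml
tail_sampling:
  distributed_cache:
    connection:
      address: rediss://redis.example.internal:6379/0 # placeholder address
    data_compression:
      format: zstd # maximum compression ratio, at higher decompression CPU cost
```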

## Redis-compatible cache requirements

The processor uses the cache as distributed storage for the following trace data:

- Trace and span attributes
- Active trace data
- Sampling decision cache

The processor executes **Lua scripts** to interact with the Redis cache atomically. Lua script support is typically enabled by default in Redis-compatible caches. No additional configuration is required unless you have explicitly disabled this feature.
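
To confirm that your cache accepts Lua scripts, you can run a trivial `EVAL` against it, shown here with `redis-cli` (the host is a placeholder):

```shell
# Should return (integer) 1 if Lua scripting is available
redis-cli -h redis.example.internal EVAL "return 1" 0
```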

## Sizing and performance

Proper Redis instance sizing is critical for optimal performance. The estimates below use the configuration example from "Supported caches" above and assume the following workload characteristics:

- **Spans per second**: 10,000 spans/sec
- **Average span size**: 900 bytes (marshaled protobuf format)

### Memory estimation formula

```shell
Total Memory = (Trace Data) + (Decision Caches) + (Overhead)
```

#### 1. Trace data storage

Trace data is stored in Redis for the full `traces_ttl` period to support late-arriving spans and trace recovery:

- **Per-span storage**: `~900 bytes` (marshaled protobuf)
- **Storage duration**: Controlled by `traces_ttl` (default: **1 hour**)
- **Active collection window**: Controlled by `trace_window_expiration` (default: 30s)
- **Formula**: `Memory ≈ spans_per_second × traces_ttl × 900 bytes`

<Callout variant="important">
**Active window vs. full retention**: Traces are collected during a `~30-second` active window (`trace_window_expiration`), but persist in Redis for the full 1-hour `traces_ttl` period. This allows the processor to handle late-arriving spans and recover orphaned traces. Your Redis sizing must account for the **full retention period**, not just the active window.
</Callout>

**Example calculation**: At 10,000 spans/second with a 1-hour `traces_ttl`:

```shell
10,000 spans/sec × 3600 sec × 900 bytes = 32.4 GB
```

**With lz4 compression** (we have observed a 25% reduction):

```shell
32.4 GB × 0.75 = 24.3 GB
```

Note: This calculation represents the primary memory consumer. Actual Redis memory may be slightly higher due to decision caches and internal data structures.
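
To plug in your own numbers, a small shell sketch reproduces the arithmetic (the variable names are ours, not part of the product):

```shell
# Rough trace-data sizing: spans/sec × traces_ttl × bytes/span
SPANS_PER_SEC=10000
TRACES_TTL_SEC=3600
SPAN_BYTES=900
BYTES=$((SPANS_PER_SEC * TRACES_TTL_SEC * SPAN_BYTES))
echo "Trace data: $BYTES bytes (~$((BYTES / 1000000000)) GB before compression)"
```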

#### 2. Decision cache storage

When using `distributed_cache`, the decision caches are stored in Redis without explicit size limits. Instead, Redis uses its native LRU eviction policy (configured via `maxmemory-policy`) to manage memory. Each trace ID requires approximately 50 bytes of storage:

- **Sampled cache**: Managed by Redis LRU eviction
- **Non-sampled cache**: Managed by Redis LRU eviction
- **Typical overhead per trace ID**: `~50 bytes`

<Callout variant="tip">
**Memory management**: Configure Redis with `maxmemory-policy allkeys-lru` to allow automatic eviction of old decision cache entries when memory limits are reached. The decision cache keys use TTL-based expiration (controlled by `cache_ttl`) rather than fixed size limits.
</Callout>
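
A matching `redis.conf` fragment might look like the following (the `maxmemory` value is an assumption; derive yours from the sizing calculations in this section):

```shell
# redis.conf fragment: cap memory and evict least-recently-used keys
maxmemory 40gb
maxmemory-policy allkeys-lru
```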

#### 3. Batch processing overhead

- **Current batch queue**: Minimal (trace IDs + scores in a sorted set)
- **In-flight batches**: `max_traces_per_batch × average_spans_per_trace × 900 bytes`

**Example calculation**: 500 traces per batch (default) with 20 spans per trace on average:

```shell
500 × 20 × 900 bytes = 9 MB per batch
```

Batch size impacts memory usage during evaluation. In-flight batch memory is temporary and is released after processing completes.

### Complete sizing example

Based on the configuration above with the following workload parameters:

- **Throughput**: 10,000 spans/second
- **Average span size**: 900 bytes
- **Storage period**: 1 hour (`traces_ttl`)

**Without compression:**

| Component | Memory Required |
|-----------|-----------------|
| Trace data (1-hour retention) | 32.4 GB |
| Decision caches | Variable (LRU-managed) |
| Batch processing | `~10 MB` |
| Redis overhead (25%) | `~8.1 GB` |
| **Total (minimum)** | **~40.5 GB + decision cache** |

**With lz4 compression (25% reduction):**

| Component | Memory Required |
|-----------|-----------------|
| Trace data (1-hour retention) | 24.3 GB |
| Decision caches | Variable (LRU-managed) |
| Batch processing | `~7 MB` |
| Redis overhead (25%) | `~6.1 GB` |
| **Total (minimum)** | **~30.4 GB + decision cache** |

<Callout variant="important">
**Sizing guidance**: The calculations above serve as an estimation example. We recommend performing your own capacity planning based on your specific workload characteristics. For production deployments, consider:

- Provisioning **10-15% additional memory** beyond calculated requirements to accommodate traffic spikes and transient overhead
- Using Redis cluster mode for horizontal scaling
- Monitoring actual memory usage and adjusting capacity accordingly
</Callout>

### Performance considerations

- **Network latency**: Round-trip time between the collector and Redis directly impacts sampling throughput. Deploy Redis instances with low-latency network connectivity to the collector.
- **Cluster mode**: Distributing load across multiple Redis nodes increases throughput and provides fault tolerance for high-availability deployments.

## Data management and performance

<Callout variant="caution">
**Performance bottleneck**: Redis and network communication are typically the limiting factors for processor performance. The speed and reliability of your Redis cache are essential for proper collector operation. Ensure your Redis instance has sufficient resources and maintains low-latency network connectivity to the collector.
</Callout>

The processor stores trace data temporarily in Redis while making sampling decisions. Understanding data expiration and cache eviction policies is critical for optimal performance.

## TTL and expiration

When using `distributed_cache`, the TTL configuration differs from that of the in-memory processor. The following parameters control data expiration:

<Callout variant="important">
**Key difference from in-memory mode**: When `distributed_cache` is configured, `trace_window_expiration` replaces `decision_wait` for determining when traces are evaluated. The `trace_window_expiration` parameter defines a sliding window: each time new spans arrive for a trace, the trace remains active for another `trace_window_expiration` period. This incremental approach keeps traces with ongoing activity alive longer than those that have stopped receiving spans.
</Callout>
227+
228+
### TTL hierarchy and defaults
229+
230+
The processor uses a cascading TTL structure, with each level providing protection for the layer below:
231+
232+
1. **`trace_window_expiration`** (default: 30s)
233+
- Configures how long to wait after the last span arrives before evaluating a trace
234+
- Acts as a sliding window: resets each time new spans arrive for a trace
235+
- Defined via `distributed_cache.trace_window_expiration`
236+
237+
2. **`in_flight_timeout`** (default: equals `trace_window_expiration` if not specified)
238+
- Maximum time a batch can be processed before being considered orphaned
239+
- Orphaned batches are automatically recovered and re-queued
240+
- Can be overridden via `distributed_cache.in_flight_timeout`
241+
242+
3. **`traces_ttl`** (default: 1 hour)
243+
- Redis key expiration for trace span data
244+
- Ensures trace data persists long enough for evaluation and recovery
245+
- Defined via `distributed_cache.traces_ttl`
246+
247+
4. **`cache_ttl`** (default: 2 hours)
248+
- Redis key expiration for decision cache entries (sampled/non-sampled)
249+
- Prevents duplicate evaluation for late-arriving spans
250+
- Defined via `distributed_cache.cache_ttl`
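
As a sketch, the four levels map onto configuration like this (the values are the documented defaults written out explicitly, and the address is a placeholder):

```yaml
tail_sampling:
  distributed_cache:
    connection:
      address: redis://localhost:6379/0
    trace_window_expiration: 30s # 1. sliding evaluation window
    in_flight_timeout: 30s       # 2. orphaned-batch recovery (defaults to trace_window_expiration)
    traces_ttl: 3600s            # 3. span data retention
    cache_ttl: 7200s             # 4. decision cache retention
```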
