Commit 1fef6ce

Update with sumeerbhola feedback (1)
1 parent 4d532f6 commit 1fef6ce

2 files changed, 3 additions and 10 deletions

src/current/_includes/v25.3/wal-failover-intro.md

Lines changed: 3 additions & 4 deletions
@@ -5,12 +5,11 @@ Failing over the WAL may allow some operations against a store to continue to co
 When WAL failover is enabled, CockroachDB:

 - Pairs each primary store with a secondary failover store at node startup.
-- Monitors primary WAL `fsync` latency. If any sync exceeds [`storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node redirects new WAL writes to the secondary store.
-- Probes the primary store while failed over by `fsync`ing a small internal 'probe file' on its volume. This file contains no user data and exists only when WAL failover is enabled.
-- Switches back to the primary store once a probe `fsync` on its volume completes within [`COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`]({% link {{ page.version.version }}/wal-failover.md %}#important-environment-variables). If a probe `fsync` blocks longer than this duration, CockroachDB emits a log like: `disk stall detected: sync on file probe-file has been ongoing for 40.0s` and, if the stall persists, the node exits (fatals) to [shed leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#how-leases-are-transferred-from-a-dead-node) and allow recovery elsewhere.
+- Monitors latency of all write operations against the primary WAL. If any operation exceeds [`storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node redirects new WAL writes to the secondary store.
+- Checks the primary store while failed over by performing a set of filesystem operations against a small internal 'probe file' on its volume. This file contains no user data and exists only when WAL failover is enabled.
+- Switches back to the primary store once the set of filesystem operations against the probe file on its volume starts consuming less than a latency threshold (order of 10s of milliseconds). If a probe `fsync` blocks longer than [`COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`]({% link {{ page.version.version }}/wal-failover.md %}#important-environment-variables), CockroachDB emits a log like: `disk stall detected: sync on file probe-file has been ongoing for 40.0s` and, if the stall persists, the node exits (fatals) to [shed leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#how-leases-are-transferred-from-a-dead-node) and allow recovery elsewhere.
 - Exposes status at [`/_status/stores`]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#store-status-endpoint) so you can monitor each store's health and failover state.

 {{site.data.alerts.callout_info}}
 - WAL failover only relocates the WAL. Data files remain on the primary volume. Reads that miss the Pebble block cache and the OS page cache can still stall if the primary disk is stalled; caches typically limit blast radius, but some reads may see elevated latency.
-- [`COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`]({% link {{ page.version.version }}/wal-failover.md %}#important-environment-variables) is chosen to bound long cloud disk stalls without flapping; tune with care. High tail-latency cloud volumes (for example, oversubscribed [AWS EBS gp3](https://docs.aws.amazon.com/ebs/latest/userguide/general-purpose.html#gp3-ebs-volume-type)) are more prone to transient stalls.
 {{site.data.alerts.end}}
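
The failover and fail-back behavior described in the added lines of this hunk can be pictured as a small state machine. The following is a minimal sketch for illustration only, not CockroachDB or Pebble code: the class name, method names, and all three numeric defaults are assumptions (the fail-back threshold is only described as "order of 10s of milliseconds", and the 40-second value is taken from the sample log line).

```python
from dataclasses import dataclass


@dataclass
class WalFailoverModel:
    """Toy model of the WAL failover decisions described in the diff (hypothetical)."""

    # Assumed stand-in for storage.wal_failover.unhealthy_op_threshold.
    unhealthy_op_threshold_s: float = 0.1
    # Assumed fail-back threshold ("order of 10s of milliseconds").
    healthy_probe_threshold_s: float = 0.025
    # Assumed stand-in for COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT.
    max_sync_duration_s: float = 40.0
    # Which store currently receives new WAL writes.
    active: str = "primary"

    def observe_wal_op(self, latency_s: float) -> str:
        """Redirect new WAL writes if any primary WAL write operation is too slow."""
        if self.active == "primary" and latency_s > self.unhealthy_op_threshold_s:
            self.active = "secondary"
        return self.active

    def observe_probe(self, latency_s: float) -> str:
        """Evaluate one round of probe-file operations against the primary volume."""
        if latency_s > self.max_sync_duration_s:
            # Corresponds to the "disk stall detected" log; a persistent stall
            # would cause the node to exit.
            return "stall"
        if self.active == "secondary" and latency_s < self.healthy_probe_threshold_s:
            self.active = "primary"
        return self.active
```

In this model, a single slow WAL operation triggers failover, while fail-back requires the probe latency to drop below a much smaller threshold, which is the asymmetry the revised wording introduces.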

src/current/_includes/v25.3/wal-failover-metrics.md

Lines changed: 0 additions & 6 deletions
@@ -10,9 +10,3 @@ You can access these metrics via the following methods:

 - The [**Custom Chart** debug page]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}) in [DB Console]({% link {{ page.version.version }}/ui-custom-chart-debug-page.md %}).
 - By [monitoring CockroachDB with Prometheus]({% link {{ page.version.version }}/monitor-cockroachdb-with-prometheus.md %}).
-
-In addition to metrics, logs help identify disk stalls during WAL failover. The following message indicates a disk stall on the primary store's volume:
-
-~~~
-disk stall detected: sync on file probe-file has been ongoing for 40.0s
-~~~
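
For readers who want to search logs for the disk stall message shown in this hunk, here is a minimal sketch of a parser. The function name is hypothetical, and the pattern is inferred from the single sample line in this commit; it assumes the duration is always printed as decimal seconds followed by `s`, which may not hold for every log line.

```python
import re
from typing import Optional, Tuple

# Pattern inferred from the one sample message in the diff (an assumption,
# not a documented log format).
_DISK_STALL_RE = re.compile(
    r"disk stall detected: sync on file (\S+) has been ongoing for ([0-9]+(?:\.[0-9]+)?)s"
)


def parse_disk_stall(line: str) -> Optional[Tuple[str, float]]:
    """Return (file name, stall duration in seconds), or None if the line does not match."""
    match = _DISK_STALL_RE.search(line)
    if match is None:
        return None
    return match.group(1), float(match.group(2))
```

Applied to the sample message, `parse_disk_stall` returns the file name `probe-file` and a duration of `40.0` seconds; unrelated lines return `None`.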
