Commit 589bdb3

Update with taroface feedback (1)
1 parent 1fef6ce commit 589bdb3

1 file changed, +5 -5 lines changed

src/current/_includes/v25.3/wal-failover-intro.md

Lines changed: 5 additions & 5 deletions
@@ -2,14 +2,14 @@ On a CockroachDB [node]({% link {{ page.version.version }}/architecture/overview
 
 Failing over the WAL may allow some operations against a store to continue to complete despite temporary unavailability of the underlying storage. For example, if the node's primary store is stalled, and the node can't read from or write to it, the node can still write to the WAL on another store. This can allow the node to continue to service requests during momentary unavailability of the underlying storage device.
 
-When WAL failover is enabled, CockroachDB:
+When WAL failover is enabled, CockroachDB does the following:
 
 - Pairs each primary store with a secondary failover store at node startup.
-- Monitors latency of all write operations against the primary WAL. If any operation exceeds [`storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node redirects new WAL writes to the secondary store.
-- Checks the primary store while failed over by performing a set of filesystem operations against a small internal 'probe file' on its volume. This file contains no user data and exists only when WAL failover is enabled.
-- Switches back to the primary store once the set of filesystem operations against the probe file on its volume starts consuming less than a latency threshold (order of 10s of milliseconds). If a probe `fsync` blocks longer than [`COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`]({% link {{ page.version.version }}/wal-failover.md %}#important-environment-variables), CockroachDB emits a log like: `disk stall detected: sync on file probe-file has been ongoing for 40.0s` and, if the stall persists, the node exits (fatals) to [shed leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#how-leases-are-transferred-from-a-dead-node) and allow recovery elsewhere.
+- Monitors latency of all write operations against the primary WAL. If any operation exceeds the duration of [`storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node redirects new WAL writes to the secondary store.
+- Checks the primary store while failed over by performing a set of filesystem operations against a small internal "probe file" on its volume. This file contains no user data and exists only when WAL failover is enabled.
+- Switches back to the primary store once the set of filesystem operations against the probe file on its volume starts consuming less than a latency threshold (order of tens of milliseconds). If a probe `fsync` blocks longer than [`COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`]({% link {{ page.version.version }}/wal-failover.md %}#important-environment-variables), CockroachDB emits a log like: `disk stall detected: sync on file probe-file has been ongoing for 40.0s` and, if the stall persists, the node exits (fatals) to [shed leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#how-leases-are-transferred-from-a-dead-node) and allow recovery elsewhere.
 - Exposes status at [`/_status/stores`]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#store-status-endpoint) so you can monitor each store's health and failover state.
 
 {{site.data.alerts.callout_info}}
-- WAL failover only relocates the WAL. Data files remain on the primary volume. Reads that miss the Pebble block cache and the OS page cache can still stall if the primary disk is stalled; caches typically limit blast radius, but some reads may see elevated latency.
+- WAL failover only relocates the WAL. Data files remain on the primary volume. Reads that miss the Pebble block cache and the OS page cache can still stall if the primary disk is stalled. Caches typically limit blast radius, but some reads may see elevated latency.
 {{site.data.alerts.end}}
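
The failover and failback decisions described in the changed bullets can be read as a small state machine. The following is a minimal sketch of that reading, not CockroachDB or Pebble code. The class name, method names, and all three numeric defaults are hypothetical: `unhealthy_op_threshold_s` stands in for `storage.wal_failover.unhealthy_op_threshold`, `healthy_probe_threshold_s` is an assumed value for the "tens of milliseconds" threshold, and `max_sync_duration_s` borrows the 40.0s figure from the example log line. The sketch also simplifies "if the stall persists" to a single over-limit probe.

```python
from dataclasses import dataclass


@dataclass
class WalFailoverSketch:
    """Toy model of the decisions in the doc text; all names and values are assumptions."""

    unhealthy_op_threshold_s: float = 0.1     # assumed; models storage.wal_failover.unhealthy_op_threshold
    healthy_probe_threshold_s: float = 0.025  # assumed; "order of tens of milliseconds"
    max_sync_duration_s: float = 40.0         # assumed; taken from the example log line
    active: str = "primary"                   # which store currently receives new WAL writes

    def on_wal_write(self, latency_s: float) -> str:
        """Observe one WAL write against the primary; fail over if it was too slow."""
        if self.active == "primary" and latency_s > self.unhealthy_op_threshold_s:
            self.active = "secondary"
        return self.active

    def on_probe(self, latency_s: float) -> str:
        """Observe one probe-file check; probes only matter while failed over."""
        if self.active != "secondary":
            return self.active
        if latency_s > self.max_sync_duration_s:
            # Stands in for the node exiting (fataling) so leases can move elsewhere.
            raise RuntimeError("disk stall detected")
        if latency_s < self.healthy_probe_threshold_s:
            self.active = "primary"
        return self.active
```

In this model a slow write moves new WAL writes to the secondary store, a fast probe moves them back, and a probe that exceeds the maximum sync duration ends the process rather than failing back.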
