src/current/_includes/v25.3/wal-failover-intro.md (5 additions, 5 deletions)
@@ -2,14 +2,14 @@ On a CockroachDB [node]({% link {{ page.version.version }}/architecture/overview

 Failing over the WAL may allow some operations against a store to continue to complete despite temporary unavailability of the underlying storage. For example, if the node's primary store is stalled, and the node can't read from or write to it, the node can still write to the WAL on another store. This can allow the node to continue to service requests during momentary unavailability of the underlying storage device.

-When WAL failover is enabled, CockroachDB:
+When WAL failover is enabled, CockroachDB does the following:

 - Pairs each primary store with a secondary failover store at node startup.
-- Monitors latency of all write operations against the primary WAL. If any operation exceeds [`storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node redirects new WAL writes to the secondary store.
-- Checks the primary store while failed over by performing a set of filesystem operations against a small internal 'probe file' on its volume. This file contains no user data and exists only when WAL failover is enabled.
-- Switches back to the primary store once the set of filesystem operations against the probe file on its volume starts consuming less than a latency threshold (order of 10s of milliseconds). If a probe `fsync` blocks longer than [`COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`]({% link {{ page.version.version }}/wal-failover.md %}#important-environment-variables), CockroachDB emits a log like: `disk stall detected: sync on file probe-file has been ongoing for 40.0s` and, if the stall persists, the node exits (fatals) to [shed leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#how-leases-are-transferred-from-a-dead-node) and allow recovery elsewhere.
+- Monitors latency of all write operations against the primary WAL. If any operation exceeds the duration of [`storage.wal_failover.unhealthy_op_threshold`]({% link {{page.version.version}}/cluster-settings.md %}#setting-storage-wal-failover-unhealthy-op-threshold), the node redirects new WAL writes to the secondary store.
+- Checks the primary store while failed over by performing a set of filesystem operations against a small internal "probe file" on its volume. This file contains no user data and exists only when WAL failover is enabled.
+- Switches back to the primary store once the set of filesystem operations against the probe file on its volume starts consuming less than a latency threshold (order of tens of milliseconds). If a probe `fsync` blocks longer than [`COCKROACH_ENGINE_MAX_SYNC_DURATION_DEFAULT`]({% link {{ page.version.version }}/wal-failover.md %}#important-environment-variables), CockroachDB emits a log like: `disk stall detected: sync on file probe-file has been ongoing for 40.0s` and, if the stall persists, the node exits (fatals) to [shed leases]({% link {{ page.version.version }}/architecture/replication-layer.md %}#how-leases-are-transferred-from-a-dead-node) and allow recovery elsewhere.
 - Exposes status at [`/_status/stores`]({% link {{ page.version.version }}/monitoring-and-alerting.md %}#store-status-endpoint) so you can monitor each store's health and failover state.

 {{site.data.alerts.callout_info}}
-- WAL failover only relocates the WAL. Data files remain on the primary volume. Reads that miss the Pebble block cache and the OS page cache can still stall if the primary disk is stalled; caches typically limit blast radius, but some reads may see elevated latency.
+- WAL failover only relocates the WAL. Data files remain on the primary volume. Reads that miss the Pebble block cache and the OS page cache can still stall if the primary disk is stalled. Caches typically limit blast radius, but some reads may see elevated latency.
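
For reviewers who want to check the wording of the bullets against the behavior they describe, the failover and failback decisions can be sketched as a small state machine. This is a rough illustration only, not CockroachDB's implementation: the class name, method names, and all three threshold values are hypothetical, and the "node exits" step is reduced to a flag that is set as soon as one probe exceeds the limit, without modeling the "if the stall persists" condition.

```python
from dataclasses import dataclass
from enum import Enum


class WalTarget(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"


@dataclass
class WalFailoverSketch:
    # All three thresholds are illustrative assumptions, not CockroachDB defaults.
    unhealthy_op_threshold_s: float = 0.100   # fail over above this WAL write latency
    healthy_probe_threshold_s: float = 0.025  # fail back below this probe latency
    max_sync_duration_s: float = 20.0         # longer probe syncs count as a disk stall
    target: WalTarget = WalTarget.PRIMARY
    stalled: bool = False

    def observe_wal_write(self, latency_s: float) -> WalTarget:
        """Redirect new WAL writes once a primary write exceeds the threshold."""
        if self.target is WalTarget.PRIMARY and latency_s > self.unhealthy_op_threshold_s:
            self.target = WalTarget.SECONDARY
        return self.target

    def observe_probe(self, latency_s: float) -> WalTarget:
        """While failed over, use probe-file latency to decide whether to fail back."""
        if self.target is not WalTarget.SECONDARY:
            return self.target  # probes only matter while failed over
        if latency_s > self.max_sync_duration_s:
            self.stalled = True  # stand-in for "disk stall detected"
        elif latency_s < self.healthy_probe_threshold_s:
            self.target = WalTarget.PRIMARY
        return self.target
```

A short walk-through under the same assumed thresholds:

```python
m = WalFailoverSketch()
m.observe_wal_write(0.005)  # fast write: stays on the primary
m.observe_wal_write(0.250)  # slow write: new WAL writes go to the secondary
m.observe_probe(0.500)      # primary still slow: stays on the secondary
m.observe_probe(0.010)      # probe is fast again: switches back to the primary
```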