Clarify the role of Cluster ID and other inputs #6

Open · wants to merge 2 commits into base: 2no
4 changes: 2 additions & 2 deletions enhancements/two-node-fencing/tnf.md
@@ -125,7 +125,7 @@ Upon rebooting, the RHEL-HA components ensure that a node remains inert (not run
If the failed peer is likely to remain offline for an extended period, admin confirmation is required on the remaining node to allow it to start OpenShift.
This functionality exists within RHEL-HA, but a wrapper will be provided to take care of the details.

-When starting etcd, the OCF script will use etcd's cluster ID and version counter to determine whether the existing data directory can be reused, or must be erased before joining an active peer.
+When starting etcd, the OCF script will use data on disk (e.g. etcd's cluster ID) and the current state of the cluster (e.g. which resource agent is already running) to determine whether the existing data directory can be reused, or must be erased before joining an active peer.
Collaborator: I would have thought the reverse... that the cluster ID was almost useless, and the version counter the only interesting/useful detail.

Collaborator (PR author): The cluster ID is still the only source on disk that shows whether the peer node forced a new cluster while the starting node was stopped, so even though it is expected to stay the same most of the time, I would still consider it valuable to check at startup.
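
To make the check under discussion concrete, here is a minimal sketch (not the actual OCF agent) of comparing a recorded cluster ID against the peer's current one before deciding whether the local data directory can be reused. The state-file path, peer endpoint, and overall flow are assumptions for illustration; the real agent derives its inputs from on-disk data and pacemaker's view of the cluster.

```go
// Sketch only: decide whether the local etcd data directory can be reused
// before rejoining the peer. Assumes (hypothetically) that the agent records
// the ID of the cluster it last belonged to in a small state file.
package main

import (
	"context"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

const lastClusterIDFile = "/var/lib/etcd/.last-cluster-id" // hypothetical path

// peerClusterID asks the running peer for the ID of the cluster it serves.
func peerClusterID(endpoint string) (uint64, error) {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return 0, err
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	status, err := cli.Status(ctx, endpoint)
	if err != nil {
		return 0, err
	}
	return status.Header.ClusterId, nil
}

func main() {
	raw, err := os.ReadFile(lastClusterIDFile)
	if err != nil {
		fmt.Println("no recorded cluster ID: wipe the data directory and join as a new member")
		return
	}
	recorded, err := strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
	if err != nil {
		fmt.Println("unreadable state file: treat the data directory as stale")
		return
	}

	current, err := peerClusterID("https://peer.example:2379") // placeholder endpoint
	if err != nil {
		fmt.Println("peer unreachable: leave the data directory alone and retry later")
		return
	}

	if current == recorded {
		fmt.Println("same cluster: the existing data directory can be reused")
	} else {
		fmt.Println("peer forced a new cluster: erase the data directory before rejoining")
	}
}
```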



### Summary of Changes
@@ -441,7 +441,7 @@ For `platform: none` clusters, this will require customers to provide an ingress
#### Graceful vs. Unplanned Reboots
Events that have to be handled uniquely by a two-node cluster can largely be categorized into one of two buckets. In the first bucket, we have things that trigger graceful reboots. This includes events like upgrades, MCO-triggered reboots, and users sending a shutdown command to one of the nodes. In each of these cases - assuming a functioning two-node cluster - the node that is shutting down must wait for pacemaker to signal to etcd to remove the node from the etcd quorum to maintain e-quorum. When the node reboots, it must rejoin the etcd cluster and sync its database to the active node.

-Unplanned reboots include any event where one of the nodes cannot signal to etcd that it needs to leave the cluster. This includes situations such as a network disconnection between the nodes, power outages, or turning off a machine using a command like `poweroff -f`. The point is that a machine needs to be fenced so that the other node can perform a special recovery operation. This recovery involves pacemaker restarting the etcd on the surviving node with a new cluster ID as a cluster-of-one. This way, when the other node rejoins, it must reconcile its data directory and resync to the new cluster before it can rejoin as an active peer.
+Unplanned reboots include any event where one of the nodes cannot signal to etcd that it needs to leave the cluster. This includes situations such as a network disconnection between the nodes, power outages, or turning off a machine using a command like `poweroff -f`. The point is that a machine needs to be fenced so that the other node can perform a special recovery operation. This recovery involves pacemaker restarting the etcd on the surviving node as a cluster-of-one. This way, when the other node rejoins, it must reconcile its data directory and resync to the new cluster before it can rejoin as an active peer.

#### Failure Scenario Timelines:
This section provides specific steps for how two-node clusters would handle interesting events.
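
For reference, the graceful path described in the hunk above (the departing member is removed from etcd so the survivor keeps e-quorum) can be sketched with the etcd v3 client; the endpoint and member name are placeholders, and in the enhancement this removal is driven by pacemaker rather than a standalone program. The unplanned path offers no such opportunity: after fencing, the survivor's etcd is restarted as a cluster-of-one (for example via etcd's `--force-new-cluster` startup flag), and the fenced peer must reconcile its data directory and resync when it returns.

```go
// Sketch only: remove the member that is shutting down gracefully so the
// remaining node retains quorum. Endpoint and member name are placeholders.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://surviving-node.example:2379"}, // placeholder
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Look up the departing member by name and remove it from the membership,
	// shrinking the voting set so the surviving node alone still has quorum.
	members, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range members.Members {
		if m.Name == "node-shutting-down" { // placeholder member name
			if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
				log.Fatal(err)
			}
			fmt.Println("member removed; e-quorum preserved on the surviving node")
		}
	}
}
```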