Initial tests for Two Nodes OCP with Fencing (TNF) cluster #29833
base: main
Conversation
Temporarily converting it to draft to investigate a crash.
Ready again for review.
Force-pushed from 0d87592 to fe755b3.
Job Failure Risk Analysis for sha: fe755b3
test/extended/tnf/recovery.go (Outdated)
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" | ||
) | ||
|
||
var _ = g.Describe("[sig-node][apigroup:config.openshift.io] Two Nodes OCP with fencing recovery", func() { |
Can you please add the annotation [OCPFeatureGate:DualReplica] to the test names, to allow the feature gate to be captured?
Sure 👍
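For reference, a minimal sketch of what the requested annotation could look like on the Describe block quoted above (the package name and the final test wording are assumptions, not the PR's exact code):

```go
package tnf

import (
	g "github.com/onsi/ginkgo/v2"
)

// Adding the feature-gate marker to the spec name lets the framework associate
// these tests with the DualReplica feature gate.
var _ = g.Describe("[sig-node][apigroup:config.openshift.io][OCPFeatureGate:DualReplica] Two Nodes OCP with fencing recovery", func() {
	// existing specs unchanged
})
```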
/test e2e-metal-ipi-ovn-two-node-arbiter e2e-metal-ipi-ovn-two-node-fencing
@eggfoobar: The specified target(s) for /test were not found.
/test e2e-metal-ipi-ovn-two-node-arbiter
@eggfoobar: The specified target(s) for /test were not found.
/test e2e-metal-ovn-two-node-arbiter
Looking good :), I just had some small suggestions around the helpers.
test/extended/include.go (Outdated)
@@ -57,6 +57,7 @@ import (
	_ "github.com/openshift/origin/test/extended/storage"
	_ "github.com/openshift/origin/test/extended/tbr_health"
	_ "github.com/openshift/origin/test/extended/templates"
	_ "github.com/openshift/origin/test/extended/tnf"
We should be good to delete this now
Forgot, thank you for noticing :)
test/extended/two_node/common.go (Outdated)
	}
}

func getInfraStatus(oc *exutil.CLI) (*v1.InfrastructureStatus, error) {
Thanks for cleaning this up. While you're already here, I think we can simplify this a bit more: I had missed that we already have a helper for control-plane topology. Would you mind removing this and using https://github.com/openshift/origin/blob/main/test/extended/util/framework.go#L2125?
Good point, I'll replace this and the one below
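A minimal sketch of that simplification, assuming the linked exutil.GetControlPlaneTopology helper returns a *v1.TopologyMode (the package and function names below are illustrative):

```go
package two_node

import (
	v1 "github.com/openshift/api/config/v1"
	exutil "github.com/openshift/origin/test/extended/util"
)

// isDualReplica replaces the local getInfraStatus lookup with the shared
// control-plane-topology helper from test/extended/util/framework.go.
func isDualReplica(oc *exutil.CLI) (bool, error) {
	topology, err := exutil.GetControlPlaneTopology(oc)
	if err != nil {
		return false, err
	}
	return *topology == v1.DualReplicaTopologyMode, nil
}
```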
test/extended/two_node/common.go (Outdated)
return &infra.Status, nil | ||
} | ||
|
||
func runOnNodeNS(oc *exutil.CLI, nodeName, namespace, command string) (string, string, error) { |
Same thing here: I noticed we already have a helper function with a retry wrapper, https://github.com/openshift/origin/blob/main/test/extended/util/nodes.go#L38
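The exact helper in test/extended/util/nodes.go is not reproduced here; the sketch below only illustrates the retry-wrapper idea around the PR's own runOnNodeNS helper, with arbitrary poll values:

```go
package two_node

import (
	"time"

	exutil "github.com/openshift/origin/test/extended/util"
	"k8s.io/apimachinery/pkg/util/wait"
)

// runOnNodeWithRetry re-runs a node command until it succeeds or the timeout
// expires, mirroring what a retry-wrapped helper gives for free.
func runOnNodeWithRetry(oc *exutil.CLI, nodeName, namespace, command string) (string, string, error) {
	var stdout, stderr string
	err := wait.PollImmediate(5*time.Second, 2*time.Minute, func() (bool, error) {
		var cmdErr error
		stdout, stderr, cmdErr = runOnNodeNS(oc, nodeName, namespace, command)
		if cmdErr != nil {
			// treat failures as transient and retry
			return false, nil
		}
		return true, nil
	})
	return stdout, stderr, err
}
```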
Job Failure Risk Analysis for sha: e540ed0
Force-pushed from b7943dd to 03c3232.
I really love how this is coming together. My biggest concern is how we need to properly label the tests that have the potential to affect other tests.
In other words - the node reboot/restart tests need to have some kind of label to indicate that they should be run serially and/or are disruptive.
The existing logic for that is in origin/pkg/testsuites/standard_suites.go, lines 59 to 92 (at 9508b94):
{
	Name: "openshift/conformance/serial",
	Description: templates.LongDesc(`
	Only the portion of the openshift/conformance test suite that run serially.
	`),
	Matches: func(name string) bool {
		if isDisabled(name) {
			return false
		}
		return strings.Contains(name, "[Suite:openshift/conformance/serial") || isStandardEarlyOrLateTest(name)
	},
	TestTimeout: 40 * time.Minute,
},
{
	Name: "openshift/disruptive",
	Description: templates.LongDesc(`
	The disruptive test suite. Disruptive tests interrupt the cluster function such as by stopping/restarting the control plane or
	changing the global cluster configuration in a way that can affect other tests.
	`),
	Matches: func(name string) bool {
		if isDisabled(name) {
			return false
		}
		// excluded due to stopped instance handling until https://bugzilla.redhat.com/show_bug.cgi?id=1905709 is fixed
		if strings.Contains(name, "Cluster should survive master and worker failure and recover with machine health checks") {
			return false
		}
		return strings.Contains(name, "[Feature:EtcdRecovery]") || strings.Contains(name, "[Feature:NodeRecovery]") || isStandardEarlyTest(name)
	},
	// Duration of the quorum restore test exceeds 60 minutes.
	TestTimeout: 90 * time.Minute,
	ClusterStabilityDuringTest: ginkgo.Disruptive,
},
I'm not familiar enough with how the final test list is composed to know if it's sane to tag the graceful shutdown test as part of the serial suite.
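Going only by the Matches functions quoted above, a marker such as [Feature:NodeRecovery] in the spec name is what would pull the graceful-shutdown test into the openshift/disruptive suite. A hypothetical sketch (test names illustrative, not the PR's final wording; [Disruptive] is the conventional extra marker for specs that interrupt the cluster):

```go
package two_node

import (
	g "github.com/onsi/ginkgo/v2"
)

// [Feature:NodeRecovery] is matched by the openshift/disruptive suite quoted
// above; [Disruptive] flags the spec as one that interrupts cluster function.
var _ = g.Describe("[sig-etcd][OCPFeatureGate:DualReplica][Feature:NodeRecovery] Two Node with Fencing", func() {
	g.It("should recover etcd after a graceful shutdown of one control-plane node [Disruptive]", func() {
		// reboot one node, then assert etcd membership recovers
	})
})
```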
func skipIfNotTopology(oc *exutil.CLI, wanted v1.TopologyMode) {
	current, err := exutil.GetControlPlaneTopology(oc)
	if err != nil {
		e2eskipper.Skip(fmt.Sprintf("Could not get current topology, skipping test: error %v", err))
This may be a little strange, but I think we should default to running the test when we don't know the topology. The reason is - this will likely lead to the tests running and failing, which will help us identify misconfigured clusters.
We want to avoid failing silently.
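A sketch of that behavior change, keeping the helper's signature from the snippet above (the *v1.TopologyMode return type of GetControlPlaneTopology is an assumption):

```go
package two_node

import (
	"fmt"

	g "github.com/onsi/ginkgo/v2"
	v1 "github.com/openshift/api/config/v1"
	exutil "github.com/openshift/origin/test/extended/util"
	e2eskipper "k8s.io/kubernetes/test/e2e/framework/skipper"
)

// skipIfNotTopology only skips when the topology is known and different from
// the wanted one; on a lookup error it logs and lets the test run, so a
// misconfigured cluster fails loudly instead of being skipped silently.
func skipIfNotTopology(oc *exutil.CLI, wanted v1.TopologyMode) {
	current, err := exutil.GetControlPlaneTopology(oc)
	if err != nil {
		g.GinkgoT().Logf("could not determine control-plane topology, running the test anyway: %v", err)
		return
	}
	if *current != wanted {
		e2eskipper.Skip(fmt.Sprintf("cluster topology is %q, test requires %q", *current, wanted))
	}
}
```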
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" | ||
) | ||
|
||
var _ = g.Describe("[sig-etcd][apigroup:config.openshift.io][OCPFeatureGate:DualReplica] Two Node with Fencing etcd recovery", func() { |
I think there is an existing convention for naming "disruptive" tests or tests that should be run in serial. I'm not sure graceful recovery qualifies as the latter, but it definitely qualifies as the former, since API requests may be routed to the dead node depending on how the load-balancer handles the rebooting node.
nodes, err := oc.AdminKubeClient().CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
o.Expect(err).ShouldNot(o.HaveOccurred(), "Expected to retrieve nodes without error")
o.Expect(len(nodes.Items)).To(o.BeNumerically("==", 2), "Expected to find 2 Nodes only")
Should we add a filter for control-plane nodes? I'm imagining a future where we're asked to support/test compute nodes.
It might also be good to verify that there are 0 arbiter nodes.
Looking below, we have another test for that. :) So I think just verifying 2 control-plane nodes should be sufficient.
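A sketch of the control-plane-only count, assuming the standard node-role.kubernetes.io/control-plane label (the helper name is illustrative; an arbiter check could use the arbiter role label the same way):

```go
package two_node

import (
	"context"

	o "github.com/onsi/gomega"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	exutil "github.com/openshift/origin/test/extended/util"
)

// expectTwoControlPlaneNodes counts only nodes carrying the control-plane role
// label instead of all nodes, so future compute nodes don't break the test.
func expectTwoControlPlaneNodes(oc *exutil.CLI) {
	nodes, err := oc.AdminKubeClient().CoreV1().Nodes().List(context.Background(), metav1.ListOptions{
		LabelSelector: "node-role.kubernetes.io/control-plane",
	})
	o.Expect(err).ShouldNot(o.HaveOccurred(), "Expected to retrieve control-plane nodes without error")
	o.Expect(nodes.Items).To(o.HaveLen(2), "Expected exactly 2 control-plane nodes")
}
```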
	return fmt.Errorf("Expected node: %s to be a started and voting member. Membership: %+v", nodeA.Name, members)
}

// Ensure GNS node is unstarted and a learner member (i.e. !learner)
I am confused by this comment. Wouldn't !learner mean that it's not a learner?
	g.GinkgoT().Logf("membership: %+v", members)
	return nil
}, 2*time.Minute, 15*time.Second).ShouldNot(o.HaveOccurred())
We may want to pull these timing values out to a shared place in the file to keep them consistent with our non-graceful shutdown test - or even just to quickly be able to adjust timeouts and check frequency across the test suite.
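For example, the 2*time.Minute / 15*time.Second values from the snippet above could live in shared constants (names are illustrative):

```go
package two_node

import "time"

// Shared poll settings so the graceful and non-graceful shutdown tests use the
// same timeout and check frequency, and both can be tuned in one place.
const (
	etcdRecoveryTimeout      = 2 * time.Minute
	etcdRecoveryPollInterval = 15 * time.Second
)
```

The assertion above would then read `}, etcdRecoveryTimeout, etcdRecoveryPollInterval).ShouldNot(o.HaveOccurred())`.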
	skipIfNotTopology(oc, v1.DualReplicaTopologyMode)
})

g.It("Should validate the number of control-planes, workers and arbiters as configured", func() {
My only concern here is the potential for compute nodes being introduced down the line. I would keep the check to 2 control-plane nodes, and omit the general node-count check.
/test e2e-metal-ovn-two-node-fencing
Job Failure Risk Analysis for sha: 03c3232
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: clobrano. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/test e2e-metal-ovn-two-node-fencing
Add initial topology tests
* Ensure correct number of ControlPlanes, Workers, Arbiters
* Ensure correct number of static etcd pod containers
* Ensure correct number of podman etcd containers

Add initial behavior tests
* Ensure the cluster can handle a graceful node shutdown

Closes: OCPEDGE-1481, OCPEDGE-1482
Signed-off-by: Carlo Lobrano <[email protected]>
Force-pushed from 69f084e to 6b9291a.
As a starting point for test integration within this new cluster environment, this commit enables only a minimal set of monitors. These monitors are known to be reliable in general, but are currently exhibiting unexpected behavior in this specific cluster. This approach allows us to establish a foundational test base. Further investigation into the reasons for their misbehavior in this new cluster will be conducted once this initial test setup is merged.
/test e2e-metal-ovn-two-node-fencing
…alReplica" This reverts commit ff3ce03.
/test e2e-metal-ovn-two-node-fencing
@clobrano: The following tests failed, say /retest to rerun all failed tests.
Add initial topology tests
Add initial behavior tests
Closes: OCPEDGE-1481, OCPEDGE-1482
As a starting point for test integration within this new cluster
environment, this change enables only a minimal set of monitors. These
monitors are known to be reliable in general, but are currently
exhibiting unexpected behavior in this specific cluster.
This approach allows us to establish a foundational test base. Further
investigation into the reasons for their misbehavior in this new cluster
will be conducted once this initial test setup is merged.