Initial tests for Two Nodes OCP with Fencing (TNF) cluster #29833

Draft
wants to merge 4 commits into
base: main

@clobrano clobrano commented May 21, 2025

Add initial topology tests

  • Ensure correct number of ControlPlanes, Workers, Arbiters
  • Ensure correct number of static etcd pod containers
  • Ensure correct number of podman etcd containers

Add initial behavior tests

  • Ensure the cluster can handle a graceful node shutdown

Closes: OCPEDGE-1481, OCPEDGE-1482


As a starting point for test integration within this new cluster
environment, this change enables only a minimal set of monitors. The
monitors left disabled are known to be reliable in general, but
currently exhibit unexpected behavior in this specific cluster.

This approach allows us to establish a foundational test base. Further
investigation into the reasons for their misbehavior in this new cluster
will be conducted once this initial test setup is merged.

@openshift-ci openshift-ci bot requested review from p0lyn0mial and sjenning May 21, 2025 09:32
@clobrano clobrano marked this pull request as draft May 22, 2025 07:55
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 22, 2025
@clobrano
Author

Temporarily converting it to draft to investigate a crash

@clobrano clobrano marked this pull request as ready for review May 22, 2025 13:18
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 22, 2025
@clobrano
Author

Ready again for review

@clobrano clobrano force-pushed the tnf-tests branch 2 times, most recently from 0d87592 to fe755b3 Compare May 22, 2025 17:32

openshift-trt bot commented May 22, 2025

Job Failure Risk Analysis for sha: fe755b3

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws-disruptive Medium
[bz-Etcd] clusteroperator/etcd should not change condition/Available
Potential external regression detected for High Risk Test analysis
---
[sig-node] static pods should start after being created
Potential external regression detected for High Risk Test analysis
pull-ci-openshift-origin-main-e2e-azure-ovn-etcd-scaling Low
[bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Degraded
This test has passed 0.00% of 1 runs on release 4.20 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:azure SecurityMode:default Topology:ha Upgrade:none] in the last week.
---
[bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available
This test has passed 0.00% of 1 runs on release 4.20 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:azure SecurityMode:default Topology:ha Upgrade:none] in the last week.

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

var _ = g.Describe("[sig-node][apigroup:config.openshift.io] Two Nodes OCP with fencing recovery", func() {
Contributor

Can you please add the annotation [OCPFeatureGate:DualReplica] to the test names to allow the feature gate to be captured

Author

Sure 👍
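For illustration, here is a minimal sketch of how an annotation like this could be picked up from a test name. The regular expression and the `featureGates` helper are assumptions made for this example; they are not the tooling origin actually uses to capture feature gates.

```go
package main

import (
	"fmt"
	"regexp"
)

// featureGateTag matches an [OCPFeatureGate:<name>] annotation in a test name.
// The pattern is an assumption for this sketch, not the real capture logic.
var featureGateTag = regexp.MustCompile(`\[OCPFeatureGate:([^\]]+)\]`)

// featureGates returns every gate name annotated in the given test name.
func featureGates(testName string) []string {
	var gates []string
	for _, m := range featureGateTag.FindAllStringSubmatch(testName, -1) {
		gates = append(gates, m[1])
	}
	return gates
}

func main() {
	name := "[sig-etcd][apigroup:config.openshift.io][OCPFeatureGate:DualReplica] Two Node with Fencing etcd recovery"
	fmt.Println(featureGates(name))
}
```

Because the tag lives in the test name itself, putting it in the `g.Describe` string applies it to every `g.It` nested below.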

@eggfoobar
Contributor

/test e2e-metal-ipi-ovn-two-node-arbiter e2e-metal-ipi-ovn-two-node-fencing

Contributor

openshift-ci bot commented May 29, 2025

@eggfoobar: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test e2e-aws-jenkins
/test e2e-aws-ovn-edge-zones
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-image-registry
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-ovn
/test e2e-gcp-ovn-builds
/test e2e-gcp-ovn-image-ecosystem
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi
/test images
/test lint
/test okd-scos-images
/test unit
/test verify
/test verify-deps

The following commands are available to trigger optional jobs:

/test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback
/test e2e-agnostic-ovn-cmd
/test e2e-aws
/test e2e-aws-csi
/test e2e-aws-disruptive
/test e2e-aws-etcd-certrotation
/test e2e-aws-etcd-recovery
/test e2e-aws-ovn
/test e2e-aws-ovn-cgroupsv2
/test e2e-aws-ovn-etcd-scaling
/test e2e-aws-ovn-ipsec-serial
/test e2e-aws-ovn-kube-apiserver-rollout
/test e2e-aws-ovn-kubevirt
/test e2e-aws-ovn-serial-publicnet-1of2
/test e2e-aws-ovn-serial-publicnet-2of2
/test e2e-aws-ovn-single-node
/test e2e-aws-ovn-single-node-serial
/test e2e-aws-ovn-single-node-techpreview
/test e2e-aws-ovn-single-node-techpreview-serial
/test e2e-aws-ovn-single-node-upgrade
/test e2e-aws-ovn-upgrade
/test e2e-aws-ovn-upgrade-rollback
/test e2e-aws-ovn-upi
/test e2e-aws-ovn-virt-techpreview
/test e2e-aws-proxy
/test e2e-azure
/test e2e-azure-ovn-etcd-scaling
/test e2e-azure-ovn-upgrade
/test e2e-baremetalds-kubevirt
/test e2e-external-aws
/test e2e-external-aws-ccm
/test e2e-external-vsphere-ccm
/test e2e-gcp-csi
/test e2e-gcp-disruptive
/test e2e-gcp-fips-serial-1of2
/test e2e-gcp-fips-serial-2of2
/test e2e-gcp-ovn-etcd-scaling
/test e2e-gcp-ovn-rt-upgrade
/test e2e-gcp-ovn-techpreview
/test e2e-gcp-ovn-techpreview-serial-1of2
/test e2e-gcp-ovn-techpreview-serial-2of2
/test e2e-gcp-ovn-usernamespace
/test e2e-hypershift-conformance
/test e2e-metal-ipi-ovn
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-dualstack-bgp-local-gw-techpreview
/test e2e-metal-ipi-ovn-dualstack-bgp-techpreview
/test e2e-metal-ipi-ovn-dualstack-local-gateway
/test e2e-metal-ipi-ovn-kube-apiserver-rollout
/test e2e-metal-ipi-serial-1of2
/test e2e-metal-ipi-serial-2of2
/test e2e-metal-ipi-serial-ovn-ipv6-1of2
/test e2e-metal-ipi-serial-ovn-ipv6-2of2
/test e2e-metal-ipi-virtualmedia
/test e2e-metal-ovn-single-node-live-iso
/test e2e-metal-ovn-single-node-with-worker-live-iso
/test e2e-metal-ovn-two-node-arbiter
/test e2e-metal-ovn-two-node-fencing
/test e2e-openstack-ovn
/test e2e-openstack-serial
/test e2e-vsphere-ovn-dualstack-primaryv6
/test e2e-vsphere-ovn-etcd-scaling
/test okd-e2e-gcp
/test okd-scos-e2e-aws-ovn

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-origin-main-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback
pull-ci-openshift-origin-main-e2e-agnostic-ovn-cmd
pull-ci-openshift-origin-main-e2e-aws
pull-ci-openshift-origin-main-e2e-aws-csi
pull-ci-openshift-origin-main-e2e-aws-disruptive
pull-ci-openshift-origin-main-e2e-aws-ovn
pull-ci-openshift-origin-main-e2e-aws-ovn-cgroupsv2
pull-ci-openshift-origin-main-e2e-aws-ovn-edge-zones
pull-ci-openshift-origin-main-e2e-aws-ovn-etcd-scaling
pull-ci-openshift-origin-main-e2e-aws-ovn-fips
pull-ci-openshift-origin-main-e2e-aws-ovn-kube-apiserver-rollout
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift
pull-ci-openshift-origin-main-e2e-aws-ovn-microshift-serial
pull-ci-openshift-origin-main-e2e-aws-ovn-serial-1of2
pull-ci-openshift-origin-main-e2e-aws-ovn-serial-2of2
pull-ci-openshift-origin-main-e2e-aws-ovn-serial-publicnet-1of2
pull-ci-openshift-origin-main-e2e-aws-ovn-serial-publicnet-2of2
pull-ci-openshift-origin-main-e2e-aws-ovn-single-node
pull-ci-openshift-origin-main-e2e-aws-ovn-single-node-serial
pull-ci-openshift-origin-main-e2e-aws-ovn-single-node-upgrade
pull-ci-openshift-origin-main-e2e-aws-ovn-upgrade
pull-ci-openshift-origin-main-e2e-aws-proxy
pull-ci-openshift-origin-main-e2e-azure
pull-ci-openshift-origin-main-e2e-azure-ovn-etcd-scaling
pull-ci-openshift-origin-main-e2e-azure-ovn-upgrade
pull-ci-openshift-origin-main-e2e-gcp-csi
pull-ci-openshift-origin-main-e2e-gcp-disruptive
pull-ci-openshift-origin-main-e2e-gcp-fips-serial-1of2
pull-ci-openshift-origin-main-e2e-gcp-fips-serial-2of2
pull-ci-openshift-origin-main-e2e-gcp-ovn
pull-ci-openshift-origin-main-e2e-gcp-ovn-etcd-scaling
pull-ci-openshift-origin-main-e2e-gcp-ovn-rt-upgrade
pull-ci-openshift-origin-main-e2e-gcp-ovn-upgrade
pull-ci-openshift-origin-main-e2e-hypershift-conformance
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack-local-gateway
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-ipv6
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-kube-apiserver-rollout
pull-ci-openshift-origin-main-e2e-metal-ipi-serial-1of2
pull-ci-openshift-origin-main-e2e-metal-ipi-serial-2of2
pull-ci-openshift-origin-main-e2e-metal-ipi-serial-ovn-ipv6-1of2
pull-ci-openshift-origin-main-e2e-metal-ipi-serial-ovn-ipv6-2of2
pull-ci-openshift-origin-main-e2e-metal-ipi-virtualmedia
pull-ci-openshift-origin-main-e2e-openstack-ovn
pull-ci-openshift-origin-main-e2e-openstack-serial
pull-ci-openshift-origin-main-e2e-vsphere-ovn
pull-ci-openshift-origin-main-e2e-vsphere-ovn-dualstack-primaryv6
pull-ci-openshift-origin-main-e2e-vsphere-ovn-etcd-scaling
pull-ci-openshift-origin-main-e2e-vsphere-ovn-upi
pull-ci-openshift-origin-main-images
pull-ci-openshift-origin-main-lint
pull-ci-openshift-origin-main-okd-e2e-gcp
pull-ci-openshift-origin-main-okd-scos-e2e-aws-ovn
pull-ci-openshift-origin-main-okd-scos-images
pull-ci-openshift-origin-main-unit
pull-ci-openshift-origin-main-verify
pull-ci-openshift-origin-main-verify-deps

In response to this:

/test e2e-metal-ipi-ovn-two-node-arbiter e2e-metal-ipi-ovn-two-node-fencing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@eggfoobar
Contributor

/test e2e-metal-ipi-ovn-two-node-arbiter
/test e2e-metal-ipi-ovn-two-node-fencing

Contributor

openshift-ci bot commented May 29, 2025

@eggfoobar: The specified target(s) for /test were not found.

In response to this:

/test e2e-metal-ipi-ovn-two-node-arbiter
/test e2e-metal-ipi-ovn-two-node-fencing

@eggfoobar
Contributor

/test e2e-metal-ovn-two-node-arbiter
/test e2e-metal-ovn-two-node-fencing

Contributor

@eggfoobar eggfoobar left a comment

Looking good :), just had some small suggestions around the helpers.

@@ -57,6 +57,7 @@ import (
_ "github.com/openshift/origin/test/extended/storage"
_ "github.com/openshift/origin/test/extended/tbr_health"
_ "github.com/openshift/origin/test/extended/templates"
_ "github.com/openshift/origin/test/extended/tnf"
Contributor

We should be good to delete this now

Author

Forgot, thank you for noticing :)

}
}

func getInfraStatus(oc *exutil.CLI) (*v1.InfrastructureStatus, error) {
Contributor

Thanks for cleaning this up. While you're already here, I think we can simplify this a bit more: I had missed that we already have a helper for control plane topology. Would you mind removing this and using https://github.com/openshift/origin/blob/main/test/extended/util/framework.go#L2125 instead?

Author

Good point, I'll replace this and the one below

return &infra.Status, nil
}

func runOnNodeNS(oc *exutil.CLI, nodeName, namespace, command string) (string, string, error) {
Contributor

Same thing here: I noticed we already have a helper function with a retry wrapper, https://github.com/openshift/origin/blob/main/test/extended/util/nodes.go#L38


openshift-trt bot commented May 29, 2025

Job Failure Risk Analysis for sha: e540ed0

Job Name Failure Risk
pull-ci-openshift-origin-main-4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback High
operator conditions network
This test has passed 98.99% of 3859 runs on release 4.20 [Overall] in the last week.
pull-ci-openshift-origin-main-e2e-azure-ovn-upgrade IncompleteTests
Tests for this run (18) are below the historical average (2883): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

@clobrano clobrano force-pushed the tnf-tests branch 2 times, most recently from b7943dd to 03c3232 Compare May 30, 2025 10:11
Contributor

@jaypoulz jaypoulz left a comment

I really love how this is coming together. My biggest concern is how we need to properly label the tests that have the potential to affect other tests.

In other words - the node reboot/restart tests need to have some kind of label to indicate that they should be run serially and/or are disruptive.
The existing logic for that is

{
	Name: "openshift/conformance/serial",
	Description: templates.LongDesc(`
	Only the portion of the openshift/conformance test suite that run serially.
	`),
	Matches: func(name string) bool {
		if isDisabled(name) {
			return false
		}
		return strings.Contains(name, "[Suite:openshift/conformance/serial") || isStandardEarlyOrLateTest(name)
	},
	TestTimeout: 40 * time.Minute,
},
{
	Name: "openshift/disruptive",
	Description: templates.LongDesc(`
	The disruptive test suite. Disruptive tests interrupt the cluster function such as by stopping/restarting the control plane or
	changing the global cluster configuration in a way that can affect other tests.
	`),
	Matches: func(name string) bool {
		if isDisabled(name) {
			return false
		}
		// excluded due to stopped instance handling until https://bugzilla.redhat.com/show_bug.cgi?id=1905709 is fixed
		if strings.Contains(name, "Cluster should survive master and worker failure and recover with machine health checks") {
			return false
		}
		return strings.Contains(name, "[Feature:EtcdRecovery]") || strings.Contains(name, "[Feature:NodeRecovery]") || isStandardEarlyTest(name)
	},
	// Duration of the quorum restore test exceeds 60 minutes.
	TestTimeout: 90 * time.Minute,
	ClusterStabilityDuringTest: ginkgo.Disruptive,
},

I'm not familiar enough with how the final test list is composed to know if it's sane to tag the graceful shutdown test as part of the serial suite.
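
To make the quoted matching logic concrete: suite membership comes down to substring checks on the full test name. The sketch below is a simplified, hypothetical stand-in for the two `Matches` functions; it drops `isDisabled` and the early/late test checks, and the test names are invented for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesDisruptive is a simplified stand-in for the "openshift/disruptive"
// Matches function quoted above. It keeps only the tag checks.
func matchesDisruptive(name string) bool {
	return strings.Contains(name, "[Feature:EtcdRecovery]") ||
		strings.Contains(name, "[Feature:NodeRecovery]")
}

// matchesSerial keeps only the tag check of the serial suite matcher.
func matchesSerial(name string) bool {
	return strings.Contains(name, "[Suite:openshift/conformance/serial")
}

func main() {
	// Hypothetical test names, used only for illustration.
	untagged := "[sig-etcd] Two Node with Fencing etcd recovery should survive a graceful node shutdown"
	tagged := untagged + " [Feature:NodeRecovery]"

	fmt.Println("untagged disruptive:", matchesDisruptive(untagged))
	fmt.Println("tagged disruptive:  ", matchesDisruptive(tagged))
	fmt.Println("untagged serial:    ", matchesSerial(untagged))
}
```

Under these assumptions, a graceful shutdown test with neither tag in its name would fall into neither suite, which is why the labelling question matters.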

func skipIfNotTopology(oc *exutil.CLI, wanted v1.TopologyMode) {
current, err := exutil.GetControlPlaneTopology(oc)
if err != nil {
e2eskipper.Skip(fmt.Sprintf("Could not get current topology, skipping test: error %v", err))
Contributor

This may be a little strange, but I think we should default to running the test when we don't know the topology. The reason is - this will likely lead to the tests running and failing, which will help us identify misconfigured clusters.

We want to avoid failing silently.
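
A minimal sketch of the suggested behavior, assuming a local `TopologyMode` type and a hypothetical `shouldSkip` helper in place of the real `exutil` and `e2eskipper` calls; the topology strings are placeholders for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// TopologyMode is a local stand-in for the OpenShift config API type.
type TopologyMode string

// errLookup simulates a failed topology lookup.
var errLookup = errors.New("could not get current topology")

// shouldSkip reports whether a test should be skipped. It skips only when the
// topology is known and differs from the wanted one. When the lookup fails,
// it returns false so the test runs and a misconfigured cluster shows up as a
// failure instead of a silent skip.
func shouldSkip(current, wanted TopologyMode, err error) bool {
	if err != nil {
		return false
	}
	return current != wanted
}

func main() {
	wanted := TopologyMode("DualReplica") // placeholder value
	fmt.Println("lookup failed:  ", shouldSkip("", wanted, errLookup))
	fmt.Println("other topology: ", shouldSkip("HighlyAvailable", wanted, nil))
	fmt.Println("wanted topology:", shouldSkip(wanted, wanted, nil))
}
```

Keeping the decision in a small pure function like this also makes the skip policy easy to test separately from the cluster client.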

metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

var _ = g.Describe("[sig-etcd][apigroup:config.openshift.io][OCPFeatureGate:DualReplica] Two Node with Fencing etcd recovery", func() {
Contributor

I think there is an existing convention for naming tests that are "disruptive" or that should be run serially. I'm not sure graceful recovery qualifies as the latter, but it definitely qualifies as the former, since API requests may be routed to the dead node depending on how the load balancer handles the rebooting node.


nodes, err := oc.AdminKubeClient().CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
o.Expect(err).ShouldNot(o.HaveOccurred(), "Expected to retrieve nodes without error")
o.Expect(len(nodes.Items)).To(o.BeNumerically("==", 2), "Expected to find 2 Nodes only")
Contributor

Should we add a filter for control-plane nodes? I'm imagining a future where we're asked to support/test compute nodes.

Contributor

It might also be good to verify that there are 0 arbiter nodes.

Contributor

Looking below, we have another test for that. :) So I think just verifying 2 control-plane nodes should be sufficient.
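
One way to express the control-plane-only check is to count nodes by role label rather than by total. The sketch below uses a local `node` struct instead of the Kubernetes client types, and it assumes the control-plane nodes carry the `node-role.kubernetes.io/control-plane` label; the helper name and node names are hypothetical.

```go
package main

import "fmt"

// node is a minimal stand-in for a Kubernetes Node; only labels matter here.
type node struct {
	Name   string
	Labels map[string]string
}

// controlPlaneLabel is assumed to be set on every control-plane node.
const controlPlaneLabel = "node-role.kubernetes.io/control-plane"

// countWithLabel returns how many nodes carry the given label key,
// regardless of the label's value.
func countWithLabel(nodes []node, label string) int {
	count := 0
	for _, n := range nodes {
		if _, ok := n.Labels[label]; ok {
			count++
		}
	}
	return count
}

func main() {
	// A hypothetical future cluster: two control-plane nodes plus one compute node.
	nodes := []node{
		{Name: "master-0", Labels: map[string]string{controlPlaneLabel: ""}},
		{Name: "master-1", Labels: map[string]string{controlPlaneLabel: ""}},
		{Name: "worker-0", Labels: map[string]string{"node-role.kubernetes.io/worker": ""}},
	}
	fmt.Println("total nodes:        ", len(nodes))
	fmt.Println("control-plane nodes:", countWithLabel(nodes, controlPlaneLabel))
}
```

With this approach the assertion on two control-plane nodes keeps holding if compute nodes are added later, whereas a total node count of 2 would start failing.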

return fmt.Errorf("Expected node: %s to be a started and voting member. Membership: %+v", nodeA.Name, members)
}

// Ensure GNS node is unstarted and a learner member (i.e. !learner)
Contributor

I am confused by this comment.
Wouldn't !learner mean that it's not a learner?


g.GinkgoT().Logf("membership: %+v", members)
return nil
}, 2*time.Minute, 15*time.Second).ShouldNot(o.HaveOccurred())
Contributor

We may want to pull these timing values out to a shared place in the file to keep them consistent with our non-graceful shutdown test - or even just to quickly be able to adjust timeouts and check frequency across the test suite.

skipIfNotTopology(oc, v1.DualReplicaTopologyMode)
})

g.It("Should validate the number of control-planes, workers and arbiters as configured", func() {
Contributor

Only concern here is the potential for compute nodes introduced down the line. I would keep the check to 2 control-plane nodes, and omit the general node-count check.

@clobrano clobrano marked this pull request as draft June 16, 2025 14:16
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 16, 2025
@xueqzhan
Contributor

/test e2e-metal-ovn-two-node-fencing


openshift-trt bot commented Jun 17, 2025

Job Failure Risk Analysis for sha: 03c3232

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-single-node-serial IncompleteTests
Tests for this run (27) are below the historical average (1762): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn IncompleteTests
Tests for this run (103) are below the historical average (3084): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

Contributor

openshift-ci bot commented Jun 17, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: clobrano
Once this PR has been reviewed and has the lgtm label, please assign neisw for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@clobrano
Author

/test e2e-metal-ovn-two-node-fencing

clobrano added 2 commits June 18, 2025 12:35
Add initial topology tests
* Ensure correct number of ControlPlanes, Workers, Arbiters
* Ensure correct number of static etcd pod containers
* Ensure correct number of podman etcd containers

Add initial behavior tests
* Ensure the cluster can handle a graceful node shutdown

Closes: OCPEDGE-1481, OCPEDGE-1482
Signed-off-by: Carlo Lobrano <[email protected]>
@clobrano clobrano force-pushed the tnf-tests branch 2 times, most recently from 69f084e to 6b9291a Compare June 19, 2025 08:05
As a starting point for test integration within this new cluster
environment, this commit enables only a minimal set of monitors. The
monitors left disabled are known to be reliable in general, but
currently exhibit unexpected behavior in this specific cluster.

This approach allows us to establish a foundational test base. Further
investigation into the reasons for their misbehavior in this new cluster
will be conducted once this initial test setup is merged.
@clobrano
Author

/test e2e-metal-ovn-two-node-fencing

@clobrano
Author

/test e2e-metal-ovn-two-node-fencing

Contributor

openshift-ci bot commented Jun 22, 2025

@clobrano: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-ovn-two-node-arbiter e540ed0 link false /test e2e-metal-ovn-two-node-arbiter
ci/prow/e2e-aws-ovn-single-node 03c3232 link false /test e2e-aws-ovn-single-node
ci/prow/e2e-metal-ipi-ovn 03c3232 link false /test e2e-metal-ipi-ovn
ci/prow/e2e-vsphere-ovn-dualstack-primaryv6 03c3232 link false /test e2e-vsphere-ovn-dualstack-primaryv6
ci/prow/e2e-metal-ipi-serial-ovn-ipv6-2of2 03c3232 link false /test e2e-metal-ipi-serial-ovn-ipv6-2of2
ci/prow/e2e-vsphere-ovn-etcd-scaling 03c3232 link false /test e2e-vsphere-ovn-etcd-scaling
ci/prow/4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback 03c3232 link false /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback
ci/prow/e2e-azure-ovn-upgrade 03c3232 link false /test e2e-azure-ovn-upgrade
ci/prow/e2e-aws-ovn-kube-apiserver-rollout 03c3232 link false /test e2e-aws-ovn-kube-apiserver-rollout
ci/prow/e2e-aws-ovn-serial-publicnet-1of2 03c3232 link false /test e2e-aws-ovn-serial-publicnet-1of2
ci/prow/e2e-aws-ovn-etcd-scaling 03c3232 link false /test e2e-aws-ovn-etcd-scaling
ci/prow/e2e-gcp-fips-serial-1of2 03c3232 link false /test e2e-gcp-fips-serial-1of2
ci/prow/e2e-gcp-disruptive 03c3232 link false /test e2e-gcp-disruptive
ci/prow/okd-e2e-gcp 03c3232 link false /test okd-e2e-gcp
ci/prow/e2e-azure-ovn-etcd-scaling 03c3232 link false /test e2e-azure-ovn-etcd-scaling
ci/prow/e2e-metal-ipi-ovn-dualstack 03c3232 link false /test e2e-metal-ipi-ovn-dualstack
ci/prow/e2e-gcp-ovn-etcd-scaling 03c3232 link false /test e2e-gcp-ovn-etcd-scaling
ci/prow/e2e-gcp-ovn-upgrade 03c3232 link true /test e2e-gcp-ovn-upgrade
ci/prow/e2e-metal-ipi-ovn-ipv6 03c3232 link true /test e2e-metal-ipi-ovn-ipv6
ci/prow/verify-deps 03c3232 link true /test verify-deps
ci/prow/verify 03c3232 link true /test verify
ci/prow/e2e-aws-disruptive 03c3232 link false /test e2e-aws-disruptive
ci/prow/lint 03c3232 link true /test lint
ci/prow/e2e-aws-ovn-microshift-serial 03c3232 link true /test e2e-aws-ovn-microshift-serial
ci/prow/okd-scos-e2e-aws-ovn 03c3232 link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-gcp-fips-serial-2of2 03c3232 link false /test e2e-gcp-fips-serial-2of2
ci/prow/e2e-aws-ovn-microshift 03c3232 link true /test e2e-aws-ovn-microshift
ci/prow/e2e-aws-ovn-single-node-serial 03c3232 link false /test e2e-aws-ovn-single-node-serial
ci/prow/e2e-metal-ipi-ovn-dualstack-local-gateway 03c3232 link false /test e2e-metal-ipi-ovn-dualstack-local-gateway
ci/prow/e2e-aws-ovn-single-node-upgrade 03c3232 link false /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-metal-ipi-virtualmedia 03c3232 link false /test e2e-metal-ipi-virtualmedia
ci/prow/e2e-metal-ovn-two-node-fencing 5ceaa9d link false /test e2e-metal-ovn-two-node-fencing

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

4 participants