Initial tests for Two Nodes OCP with Fencing (TNF) cluster #29833
base: main
Conversation
Temporarily converting it to draft to investigate a crash.
Ready again for review.
Force-pushed from 0d87592 to fe755b3.
Job Failure Risk Analysis for sha: fe755b3
test/extended/tnf/recovery.go (Outdated)
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" | ||
) | ||
|
||
var _ = g.Describe("[sig-node][apigroup:config.openshift.io] Two Nodes OCP with fencing recovery", func() { |
Can you please add the annotation [OCPFeatureGate:DualReplica] to the test names, to allow the feature gate to be captured?
Sure 👍
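For reference, a minimal sketch of what the requested annotation could look like on the Describe block quoted above (the package name and the final test wording are assumptions, not the PR's exact code):

```go
package tnf

import (
	g "github.com/onsi/ginkgo/v2"
)

// Adding the feature-gate marker to the spec name lets the framework associate
// these tests with the DualReplica feature gate.
var _ = g.Describe("[sig-node][apigroup:config.openshift.io][OCPFeatureGate:DualReplica] Two Nodes OCP with fencing recovery", func() {
	// existing specs unchanged
})
```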
/test e2e-metal-ipi-ovn-two-node-arbiter e2e-metal-ipi-ovn-two-node-fencing
@eggfoobar: The specified target(s) for /test were not found.
/test e2e-metal-ipi-ovn-two-node-arbiter
@eggfoobar: The specified target(s) for /test were not found.
/test e2e-metal-ovn-two-node-arbiter
Looking good :), I just had some small suggestions around the helpers.
test/extended/include.go (Outdated)
@@ -57,6 +57,7 @@ import (
	_ "github.com/openshift/origin/test/extended/storage"
	_ "github.com/openshift/origin/test/extended/tbr_health"
	_ "github.com/openshift/origin/test/extended/templates"
	_ "github.com/openshift/origin/test/extended/tnf"
We should be good to delete this now
Forgot, thank you for noticing :)
test/extended/two_node/common.go (Outdated)
	}
}

func getInfraStatus(oc *exutil.CLI) (*v1.InfrastructureStatus, error) {
Thanks for cleaning this up. While you're already here, I think we can simplify this a bit more: I had missed that we already have a helper for control-plane topology. Would you mind removing this and using https://github.com/openshift/origin/blob/main/test/extended/util/framework.go#L2125?
Good point, I'll replace this and the one below
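A minimal sketch of that simplification, assuming the linked exutil.GetControlPlaneTopology helper returns a *v1.TopologyMode (the package and function names below are illustrative):

```go
package two_node

import (
	v1 "github.com/openshift/api/config/v1"
	exutil "github.com/openshift/origin/test/extended/util"
)

// isDualReplica replaces the local getInfraStatus lookup with the shared
// control-plane-topology helper from test/extended/util/framework.go.
func isDualReplica(oc *exutil.CLI) (bool, error) {
	topology, err := exutil.GetControlPlaneTopology(oc)
	if err != nil {
		return false, err
	}
	return *topology == v1.DualReplicaTopologyMode, nil
}
```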
test/extended/two_node/common.go (Outdated)
return &infra.Status, nil | ||
} | ||
|
||
func runOnNodeNS(oc *exutil.CLI, nodeName, namespace, command string) (string, string, error) { |
Same thing here: I noticed we already have a helper function with a retry wrapper, https://github.com/openshift/origin/blob/main/test/extended/util/nodes.go#L38
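The exact helper in test/extended/util/nodes.go is not reproduced here; the sketch below only illustrates the retry-wrapper idea around the PR's own runOnNodeNS helper, with arbitrary poll values:

```go
package two_node

import (
	"time"

	exutil "github.com/openshift/origin/test/extended/util"
	"k8s.io/apimachinery/pkg/util/wait"
)

// runOnNodeWithRetry re-runs a node command until it succeeds or the timeout
// expires, mirroring what a retry-wrapped helper gives for free.
func runOnNodeWithRetry(oc *exutil.CLI, nodeName, namespace, command string) (string, string, error) {
	var stdout, stderr string
	err := wait.PollImmediate(5*time.Second, 2*time.Minute, func() (bool, error) {
		var cmdErr error
		stdout, stderr, cmdErr = runOnNodeNS(oc, nodeName, namespace, command)
		if cmdErr != nil {
			// treat failures as transient and retry
			return false, nil
		}
		return true, nil
	})
	return stdout, stderr, err
}
```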
Job Failure Risk Analysis for sha: e540ed0
Force-pushed from b7943dd to 03c3232.
I really love how this is coming together. My biggest concern is how we need to properly label the tests that have the potential to affect other tests.
In other words - the node reboot/restart tests need to have some kind of label to indicate that they should be run serially and/or are disruptive.
The existing logic for that is in origin/pkg/testsuites/standard_suites.go, lines 59 to 92 (at 9508b94):
{
	Name: "openshift/conformance/serial",
	Description: templates.LongDesc(`
	Only the portion of the openshift/conformance test suite that run serially.
	`),
	Matches: func(name string) bool {
		if isDisabled(name) {
			return false
		}
		return strings.Contains(name, "[Suite:openshift/conformance/serial") || isStandardEarlyOrLateTest(name)
	},
	TestTimeout: 40 * time.Minute,
},
{
	Name: "openshift/disruptive",
	Description: templates.LongDesc(`
	The disruptive test suite. Disruptive tests interrupt the cluster function such as by stopping/restarting the control plane or
	changing the global cluster configuration in a way that can affect other tests.
	`),
	Matches: func(name string) bool {
		if isDisabled(name) {
			return false
		}
		// excluded due to stopped instance handling until https://bugzilla.redhat.com/show_bug.cgi?id=1905709 is fixed
		if strings.Contains(name, "Cluster should survive master and worker failure and recover with machine health checks") {
			return false
		}
		return strings.Contains(name, "[Feature:EtcdRecovery]") || strings.Contains(name, "[Feature:NodeRecovery]") || isStandardEarlyTest(name)
	},
	// Duration of the quorum restore test exceeds 60 minutes.
	TestTimeout: 90 * time.Minute,
	ClusterStabilityDuringTest: ginkgo.Disruptive,
},
I'm not familiar enough with how the final test list is composed to know if it's sane to tag the graceful shutdown test as part of the serial suite.
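Going only by the Matches functions quoted above, a marker such as [Feature:NodeRecovery] in the spec name is what would pull the graceful-shutdown test into the openshift/disruptive suite. A hypothetical sketch (test names illustrative, not the PR's final wording; [Disruptive] is the conventional extra marker for specs that interrupt the cluster):

```go
package two_node

import (
	g "github.com/onsi/ginkgo/v2"
)

// [Feature:NodeRecovery] is matched by the openshift/disruptive suite quoted
// above; [Disruptive] flags the spec as one that interrupts cluster function.
var _ = g.Describe("[sig-etcd][OCPFeatureGate:DualReplica][Feature:NodeRecovery] Two Node with Fencing", func() {
	g.It("should recover etcd after a graceful shutdown of one control-plane node [Disruptive]", func() {
		// reboot one node, then assert etcd membership recovers
	})
})
```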
func skipIfNotTopology(oc *exutil.CLI, wanted v1.TopologyMode) {
	current, err := exutil.GetControlPlaneTopology(oc)
	if err != nil {
		e2eskipper.Skip(fmt.Sprintf("Could not get current topology, skipping test: error %v", err))
This may be a little strange, but I think we should default to running the test when we don't know the topology. The reason is - this will likely lead to the tests running and failing, which will help us identify misconfigured clusters.
We want to avoid failing silently.
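A sketch of that behavior change, keeping the helper's signature from the snippet above (the *v1.TopologyMode return type of GetControlPlaneTopology is an assumption):

```go
package two_node

import (
	"fmt"

	g "github.com/onsi/ginkgo/v2"
	v1 "github.com/openshift/api/config/v1"
	exutil "github.com/openshift/origin/test/extended/util"
	e2eskipper "k8s.io/kubernetes/test/e2e/framework/skipper"
)

// skipIfNotTopology only skips when the topology is known and different from
// the wanted one; on a lookup error it logs and lets the test run, so a
// misconfigured cluster fails loudly instead of being skipped silently.
func skipIfNotTopology(oc *exutil.CLI, wanted v1.TopologyMode) {
	current, err := exutil.GetControlPlaneTopology(oc)
	if err != nil {
		g.GinkgoT().Logf("could not determine control-plane topology, running the test anyway: %v", err)
		return
	}
	if *current != wanted {
		e2eskipper.Skip(fmt.Sprintf("cluster topology is %q, test requires %q", *current, wanted))
	}
}
```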
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1" | ||
) | ||
|
||
var _ = g.Describe("[sig-etcd][apigroup:config.openshift.io][OCPFeatureGate:DualReplica] Two Node with Fencing etcd recovery", func() { |
I think there is an existing convention for naming "disruptive" tests or tests that should be run in serial. I'm not sure graceful recovery qualifies as the latter, but it definitely qualifies as the former, since API requests may be routed to the dead node depending on how the load-balancer handles the rebooting node.
nodes, err := oc.AdminKubeClient().CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
o.Expect(err).ShouldNot(o.HaveOccurred(), "Expected to retrieve nodes without error")
o.Expect(len(nodes.Items)).To(o.BeNumerically("==", 2), "Expected to find 2 Nodes only")
Should we add a filter for control-plane nodes? I'm imagining a future where we're asked to support/test compute nodes.
It might also be good to verify that there are 0 arbiter nodes.
Looking below, we have another test for that. :) So I think just verifying 2 control-plane nodes should be sufficient.
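A sketch of the control-plane-only count, assuming the standard node-role.kubernetes.io/control-plane label (the helper name is illustrative; an arbiter check could use the arbiter role label the same way):

```go
package two_node

import (
	"context"

	o "github.com/onsi/gomega"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	exutil "github.com/openshift/origin/test/extended/util"
)

// expectTwoControlPlaneNodes counts only nodes carrying the control-plane role
// label instead of all nodes, so future compute nodes don't break the test.
func expectTwoControlPlaneNodes(oc *exutil.CLI) {
	nodes, err := oc.AdminKubeClient().CoreV1().Nodes().List(context.Background(), metav1.ListOptions{
		LabelSelector: "node-role.kubernetes.io/control-plane",
	})
	o.Expect(err).ShouldNot(o.HaveOccurred(), "Expected to retrieve control-plane nodes without error")
	o.Expect(nodes.Items).To(o.HaveLen(2), "Expected exactly 2 control-plane nodes")
}
```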
	return fmt.Errorf("Expected node: %s to be a started and voting member. Membership: %+v", nodeA.Name, members)
}

// Ensure GNS node is unstarted and a learner member (i.e. !learner)
I am confused by this comment. Wouldn't !learner mean that it's not a learner?
	g.GinkgoT().Logf("membership: %+v", members)
	return nil
}, 2*time.Minute, 15*time.Second).ShouldNot(o.HaveOccurred())
We may want to pull these timing values out to a shared place in the file to keep them consistent with our non-graceful shutdown test - or even just to quickly be able to adjust timeouts and check frequency across the test suite.
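For example, the 2*time.Minute / 15*time.Second values from the snippet above could live in shared constants (names are illustrative):

```go
package two_node

import "time"

// Shared poll settings so the graceful and non-graceful shutdown tests use the
// same timeout and check frequency, and both can be tuned in one place.
const (
	etcdRecoveryTimeout      = 2 * time.Minute
	etcdRecoveryPollInterval = 15 * time.Second
)
```

The assertion above would then read `}, etcdRecoveryTimeout, etcdRecoveryPollInterval).ShouldNot(o.HaveOccurred())`.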
	skipIfNotTopology(oc, v1.DualReplicaTopologyMode)
})

g.It("Should validate the number of control-planes, workers and arbiters as configured", func() {
My only concern here is the potential for compute nodes being introduced down the line. I would keep the check to 2 control-plane nodes, and omit the general node-count check.
/test e2e-metal-ovn-two-node-fencing
Job Failure Risk Analysis for sha: 03c3232
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: clobrano. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/test e2e-metal-ovn-two-node-fencing
Add initial topology tests
* Ensure correct number of ControlPlanes, Workers, Arbiters
* Ensure correct number of static etcd pod containers
* Ensure correct number of podman etcd containers

Add initial behavior tests
* Ensure the cluster can handle a graceful node shutdown

Closes: OCPEDGE-1481, OCPEDGE-1482
Signed-off-by: Carlo Lobrano <[email protected]>
Force-pushed from 69f084e to 6b9291a.
As a starting point for test integration within this new cluster environment, this commit enables only a minimal set of monitors. These monitors are known to be reliable in general, but are currently exhibiting unexpected behavior in this specific cluster. This approach allows us to establish a foundational test base. Further investigation into the reasons for their misbehavior in this new cluster will be conducted once this initial test setup is merged.
/test e2e-metal-ovn-two-node-fencing
…alReplica" This reverts commit ff3ce03.
/test e2e-metal-ovn-two-node-fencing
@clobrano: The following tests failed, say /retest to rerun all failed tests.
Add initial topology tests
Add initial behavior tests
Closes: OCPEDGE-1481, OCPEDGE-1482
As a starting point for test integration within this new cluster
environment, this change enables only a minimal set of monitors. These
monitors are known to be reliable in general, but are currently
exhibiting unexpected behavior in this specific cluster.
This approach allows us to establish a foundational test base. Further
investigation into the reasons for their misbehavior in this new cluster
will be conducted once this initial test setup is merged.