feature(scale_test): new scale test with empty cluster #11981
base: master
Conversation
Force-pushed from a4efd61 to 5d8927a (compare)
Pull Request Overview
This PR introduces a new scale test framework for ScyllaDB that enables testing cluster resizing operations with empty clusters. It provides infrastructure for growing and shrinking clusters while maintaining schema without active workloads.
Key changes:
- Adds a `ScaleClusterTest` class with methods for cluster resizing operations (sketched below)
- Introduces an `idle_duration` configuration parameter for workload-free testing
- Creates YAML test configurations for multi-datacenter and single-datacenter scenarios
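A rough skeleton of the new class, inferred from this summary (the base class and anything beyond the method names are assumptions; the full method bodies appear in the review excerpts further down):

```python
# Skeleton inferred from the PR summary; ClusterTester as the base class is an assumption.
class ScaleClusterTest(ClusterTester):
    """Grow/shrink an empty cluster (schema only, no workload) to per-DC target sizes."""

    def grow_to_cluster_target_size(self, cluster_target_size: list[int]) -> None:
        """Bootstrap nodes per DC and rack until each DC reaches its target size."""

    def shrink_to_cluster_target_size(self, cluster_target_size: list[int]) -> None:
        """Decommission nodes per DC and rack until each DC is back at its target size."""
```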
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| scale_cluster_test.py | New test class implementing cluster grow/shrink operations with schema management |
| sdcm/sct_config.py | Adds the idle_duration configuration parameter definition |
| test-cases/scale/scale-multi-dc-100-empty-tables-cluster-resize.yaml | Multi-DC test config with 100 empty tables and cluster resizing |
| test-cases/scale/scale-20-200-20-cluster-resize.yaml | Single-DC test config for scaling from 20 to 200 nodes |
| jenkins-pipelines/oss/scale/scale-20-200-nodes-cluster.jenkinsfile | Jenkins pipeline for the 20-200 node scaling test |
| jenkins-pipelines/oss/scale/scale-120-multidc-cluster-resize.jenkinsfile | Jenkins pipeline for multi-DC cluster resizing |
| docs/configuration_options.md | Documentation for the new idle_duration parameter |
| defaults/test_default.yaml | Default value for the idle_duration parameter |
| data_dir/templated_100_table.yaml | Template definition for creating 100 test tables |
Force-pushed from c70fe4a to fe74b08 (compare)
New scale tests have been developed to reproduce and validate the issues referenced below. The current implementation of LongevityTest does not support extended execution without a workload. To address this limitation, a new ScaleClusterTest has been introduced. It allows tests to run without workloads in various scenarios:

- Initializing a large cluster to a specified target size (e.g., from 10 to 100 nodes).
- Scaling the cluster down to a desired size (e.g., from 100 to 10 nodes).
- Creating a large number of keyspaces and tables with predefined columns, or utilizing the cs-profile-template.
- Running tests with nemesis but without any payload, for a duration specified by the new 'idle_duration' parameter.

These tests were developed to reduce the complexity associated with the LongevityTest object and to keep future scale testing compatible with Kubernetes (K8s), Docker, and other cloud providers.

Refs: scylladb/scylladb#24790, scylladb/scylla-enterprise#5626, scylladb/scylla-enterprise#5624
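As a rough illustration of how these scenarios fit together, here is a minimal sketch of a grow/idle/shrink flow. It reuses names that appear in this PR (grow_to_cluster_target_size, shrink_to_cluster_target_size, idle_duration, cluster_target_size), but the test method itself, the init_cluster_size helper, and the module-level time import are assumptions, not the PR's actual code:

```python
# Minimal sketch (an assumption, not the PR's verbatim code) of a ScaleClusterTest
# method combining the scenarios above: grow, idle without workload, shrink back.
def test_grow_shrink_empty_cluster(self):
    init_cluster_size = self.init_cluster_size                   # hypothetical helper: starting per-DC sizes
    self.grow_to_cluster_target_size(self.cluster_target_size)   # per-DC targets from the PR's property
    idle_minutes = self.params.get('idle_duration')               # new parameter introduced in this PR
    time.sleep(idle_minutes * 60)                                 # idle phase: no stress workload running
    self.shrink_to_cluster_target_size(init_cluster_size)         # return to the initial size
```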
Force-pushed from fe74b08 to 9dd2a7d (compare)
Pull Request Overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
```python
        except Exception as ex:  # noqa: BLE001
            self.log.error(f"Failed to grow cluster: {ex}")
            InfoEvent(f"Grow cluster failed with error: {ex}", severity=Severity.ERROR).publish()

        try:
            InfoEvent("Start shrink cluster").publish()
            self.shrink_to_cluster_target_size(init_cluster_size)
        except Exception as ex:  # noqa: BLE001
```
Using bare Exception catch is overly broad. Consider catching more specific exceptions like cluster operation failures or timeout exceptions to provide better error handling and debugging information.
Suggested change:

```python
        except (KeyboardInterrupt, SystemExit):
            raise
        except Exception as ex:
            self.log.error(f"Failed to grow cluster: {ex}")
            InfoEvent(f"Grow cluster failed with error: {ex}", severity=Severity.ERROR).publish()

        try:
            InfoEvent("Start shrink cluster").publish()
            self.shrink_to_cluster_target_size(init_cluster_size)
        except (KeyboardInterrupt, SystemExit):
            raise
        except Exception as ex:
```
```python
        except Exception as ex:  # noqa: BLE001
            self.log.error(f"Failed to grow cluster: {ex}")
            InfoEvent(f"Grow cluster failed with error: {ex}", severity=Severity.ERROR).publish()

        try:
            InfoEvent("Start shrink cluster").publish()
            self.shrink_to_cluster_target_size(init_cluster_size)
        except Exception as ex:  # noqa: BLE001
            self.log.error(f"Failed to shrink cluster: {ex}")
            InfoEvent(f"Shrink cluster failed with error: {ex}", severity=Severity.ERROR).publish()
```
Using bare Exception catch is overly broad. Consider catching more specific exceptions like cluster operation failures or timeout exceptions to provide better error handling and debugging information.
Suggested change:

```python
        except TimeoutError as ex:
            self.log.error(f"Timeout while growing cluster: {ex}")
            InfoEvent(f"Grow cluster failed with timeout: {ex}", severity=Severity.ERROR).publish()
        except RuntimeError as ex:
            self.log.error(f"Runtime error while growing cluster: {ex}")
            InfoEvent(f"Grow cluster failed with runtime error: {ex}", severity=Severity.ERROR).publish()
        except Exception as ex:
            self.log.critical(f"Unexpected error while growing cluster: {ex}", exc_info=True)
            InfoEvent(f"Grow cluster failed with unexpected error: {ex}", severity=Severity.ERROR).publish()

        try:
            InfoEvent("Start shrink cluster").publish()
            self.shrink_to_cluster_target_size(init_cluster_size)
        except TimeoutError as ex:
            self.log.error(f"Timeout while shrinking cluster: {ex}")
            InfoEvent(f"Shrink cluster failed with timeout: {ex}", severity=Severity.ERROR).publish()
        except RuntimeError as ex:
            self.log.error(f"Runtime error while shrinking cluster: {ex}")
            InfoEvent(f"Shrink cluster failed with runtime error: {ex}", severity=Severity.ERROR).publish()
        except Exception as ex:
            self.log.critical(f"Unexpected error while shrinking cluster: {ex}", exc_info=True)
            InfoEvent(f"Shrink cluster failed with unexpected error: {ex}", severity=Severity.ERROR).publish()
```
InfoEvent(f"Wait {duration} minutes while cluster resizing").publish() | ||
time.sleep(duration * 60) | ||
|
||
self.shrink_to_cluster_target_size(self.params.total_db_nodes) |
The method expects a list[int] parameter but self.params.total_db_nodes is likely not in the correct format. This should be converted to match the expected format, similar to how cluster_target_size is handled in the property method.
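One possible fix in the spirit of this comment, sketched here as an assumption (only n_db_nodes, total_db_nodes, and shrink_to_cluster_target_size come from the PR or SCT config; the parsing itself is hypothetical), is to normalize the configured node counts into a per-DC list before the call:

```python
# Hypothetical sketch: build the initial per-DC size list the same way the
# cluster_target_size property does, instead of passing total_db_nodes directly.
n_db_nodes = self.params.get('n_db_nodes')      # e.g. "10 10" for two DCs, or 10 for a single DC
if isinstance(n_db_nodes, str):
    init_cluster_size = [int(n) for n in n_db_nodes.split()]
else:
    init_cluster_size = [int(n_db_nodes)]
self.shrink_to_cluster_target_size(init_cluster_size)
```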
I am missing the reason for a new TestCase:
- Existing scale tests use normal longevities.
- Creating empty tables is done via a custom profile.
- Grow/shrink can be implemented by a nemesis.

test_no_workloads_idle_custom_time is a normal longevity with empty stress.
```python
    def grow_to_cluster_target_size(self, cluster_target_size: list[int]):
        """Bootstrap node in each dc in each rack while cluster size less than target size"""
        nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
        current_cluster_size = [len(nodes_by_dcx[dcx]) for dcx in sorted(nodes_by_dcx)]
        if self.is_target_reached(current_cluster_size, cluster_target_size):
            self.log.debug("Cluster has required size, no need to grow")
            return
        InfoEvent(
            message=f"Starting to grow cluster from {current_cluster_size} to {cluster_target_size}").publish()

        add_node_cnt = self.params.get('add_node_cnt')
        try:
            while not self.is_target_reached(current_cluster_size, cluster_target_size):
                for dcx, target in enumerate(cluster_target_size):
                    if current_cluster_size[dcx] >= target:
                        continue
                    add_nodes_num = add_node_cnt if (
                        target - current_cluster_size[dcx]) >= add_node_cnt else target - current_cluster_size[dcx]

                    for rack in range(self.db_cluster.racks_count):
                        added_nodes = []
                        InfoEvent(
                            message=f"Adding next number of nodes {add_nodes_num} to dc_idx {dcx} and rack {rack}").publish()
                        added_nodes.extend(self.db_cluster.add_nodes(
                            count=add_nodes_num, enable_auto_bootstrap=True, dc_idx=dcx, rack=rack))
                        self.monitors.reconfigure_scylla_monitoring()
                        up_timeout = MAX_TIME_WAIT_FOR_NEW_NODE_UP
                        with adaptive_timeout(Operations.NEW_NODE, node=self.db_cluster.data_nodes[0], timeout=up_timeout):
                            self.db_cluster.wait_for_init(
                                node_list=added_nodes, timeout=up_timeout, check_node_health=False)
                        self.db_cluster.wait_for_nodes_up_and_normal(nodes=added_nodes)
                        InfoEvent(f"New nodes up and normal {[node.name for node in added_nodes]}").publish()
                nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
                current_cluster_size = [len(nodes_by_dcx[dcx]) for dcx in sorted(nodes_by_dcx)]
        finally:
            nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
            current_cluster_size = [len(nodes_by_dcx[dcx]) for dcx in sorted(nodes_by_dcx)]
            InfoEvent(message=f"Grow cluster finished, cluster size is {current_cluster_size}").publish()

    def shrink_to_cluster_target_size(self, cluster_target_size: list[int]):
        """Decommission node in each dc in each rack while cluster size more than target size"""
        nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
        current_cluster_size = [len(nodes_by_dcx[dcx]) for dcx in sorted(nodes_by_dcx)]
        if self.is_target_reached(cluster_target_size, current_cluster_size):
            self.log.debug("Cluster has required size, no need to shrink")
            return
        InfoEvent(
            message=f"Starting to shrink cluster from {current_cluster_size} to {cluster_target_size}").publish()
        try:
            nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
            while not self.is_target_reached(cluster_target_size, current_cluster_size):
                for dcx, _ in enumerate(current_cluster_size):
                    nodes_by_racks = self.db_cluster.get_nodes_per_datacenter_and_rack_idx(nodes_by_dcx[dcx])
                    for nodes in nodes_by_racks.values():
                        decommissioning_node = nodes[-1]
                        decommissioning_node.running_nemesis = "Decommissioning node"
                        self.db_cluster.decommission(node=decommissioning_node, timeout=7200)
                nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
                current_cluster_size = [len(nodes_by_dcx[dcx]) for dcx in sorted(nodes_by_dcx)]
        finally:
            nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
            current_cluster_size = [len(nodes_by_dcx[dcx]) for dcx in sorted(nodes_by_dcx)]
            InfoEvent(
                message=f"Reached cluster size {current_cluster_size}").publish()
```
Can't this be achieved by the grow_shrink nemesis?