Conversation

@aleksbykov aleksbykov commented Sep 16, 2025

New scale tests have been developed to reproduce and validate the issues referenced below.

The current implementation of LongevityTest does not support extended runs without a workload. To address this limitation, a new ScaleClusterTest has been introduced.

This test class supports running workload-free tests in several scenarios:

  • Growing a cluster from its initial size to a specified target size (e.g., from 10 to 100 nodes).
  • Scaling down the cluster to a desired size (e.g., from 100 to 10 nodes).
  • Creating a large number of keyspaces and tables, either with predefined columns or
    using the cs-profile-template.
  • Running tests with Nemesis but without any payload, for a duration set via the new
    'idle_duration' parameter (a minimal sketch follows this list).
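
The sketch referenced above: a hypothetical reconstruction of the idle test body. Only the 'idle_duration' name, InfoEvent, and self.params.get() appear in this PR; the method body itself is an assumption, not the PR's actual code:

    import time

    from sdcm.sct_events.system import InfoEvent  # import path as used in the SCT tree

    def test_no_workloads_idle_custom_time(self):
        # 'idle_duration' is the new parameter from this PR, expressed in minutes
        idle_duration = self.params.get("idle_duration")
        InfoEvent(f"Keep cluster idle for {idle_duration} minutes: nemesis only, no workload").publish()
        time.sleep(idle_duration * 60)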

These new tests were developed to avoid the complexity of the LongevityTest object
and to remain compatible with future scale-testing efforts
on Kubernetes (K8s), Docker, and other cloud providers.

Refs: scylladb/scylladb#24790, scylladb/scylla-enterprise#5626, scylladb/scylla-enterprise#5624

Testing

  • Jobs are running

PR pre-checks (self review)

  • I added the relevant backport labels
  • I didn't leave commented-out/debugging code

Reminders

  • Add new configuration options and document them (in sdcm/sct_config.py)
  • Add unit tests to cover my changes (under unit-test/ folder)
  • Update the Readme/doc folder relevant to this change (if needed)

@aleksbykov aleksbykov force-pushed the scale-cluster-tests branch 2 times, most recently from a4efd61 to 5d8927a on September 17, 2025 05:56
@aleksbykov aleksbykov requested a review from Copilot September 17, 2025 08:59

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces a new scale test framework for ScyllaDB that enables testing cluster resizing operations with empty clusters. It provides infrastructure for growing and shrinking clusters while maintaining schema without active workloads.

Key changes:

  • Adds ScaleClusterTest class with methods for cluster resizing operations
  • Introduces idle_duration configuration parameter for workload-free testing
  • Creates YAML test configurations for multi-datacenter and single datacenter scenarios

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Summary per file:

  • scale_cluster_test.py: new test class implementing cluster grow/shrink operations with schema management
  • sdcm/sct_config.py: adds the idle_duration configuration parameter definition (a config sketch follows this list)
  • test-cases/scale/scale-multi-dc-100-empty-tables-cluster-resize.yaml: multi-DC test config with 100 empty tables and cluster resizing
  • test-cases/scale/scale-20-200-20-cluster-resize.yaml: single-DC test config for scaling from 20 to 200 nodes
  • jenkins-pipelines/oss/scale/scale-20-200-nodes-cluster.jenkinsfile: Jenkins pipeline for the 20-200 node scaling test
  • jenkins-pipelines/oss/scale/scale-120-multidc-cluster-resize.jenkinsfile: Jenkins pipeline for multi-DC cluster resizing
  • docs/configuration_options.md: documentation for the new idle_duration parameter
  • defaults/test_default.yaml: default value for the idle_duration parameter
  • data_dir/templated_100_table.yaml: template definition for creating 100 test tables
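
For the sdcm/sct_config.py entry, SCT declares options as dicts in its config_options list; a rough sketch of what the idle_duration definition could look like (the env name, type, and help text here are assumptions, not the PR's actual entry):

    dict(name="idle_duration", env="SCT_IDLE_DURATION", type=int,
         # assumed semantics: minutes to keep the cluster idle, without any workload
         help="Duration (minutes) to keep the cluster idle without any workload"),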

@aleksbykov aleksbykov force-pushed the scale-cluster-tests branch 7 times, most recently from c70fe4a to fe74b08 on September 22, 2025 05:11
@aleksbykov aleksbykov marked this pull request as ready for review September 22, 2025 05:17

@Copilot Copilot AI left a comment

Pull Request Overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.


Comment on lines +144 to +151
        except Exception as ex:  # noqa: BLE001
            self.log.error(f"Failed to grow cluster: {ex}")
            InfoEvent(f"Grow cluster failed with error: {ex}", severity=Severity.ERROR).publish()

        try:
            InfoEvent("Start shrink cluster").publish()
            self.shrink_to_cluster_target_size(init_cluster_size)
        except Exception as ex:  # noqa: BLE001

Copilot AI Sep 22, 2025

Using bare Exception catch is overly broad. Consider catching more specific exceptions like cluster operation failures or timeout exceptions to provide better error handling and debugging information.

Suggested change
-        except Exception as ex:  # noqa: BLE001
-            self.log.error(f"Failed to grow cluster: {ex}")
-            InfoEvent(f"Grow cluster failed with error: {ex}", severity=Severity.ERROR).publish()
-
-        try:
-            InfoEvent("Start shrink cluster").publish()
-            self.shrink_to_cluster_target_size(init_cluster_size)
-        except Exception as ex:  # noqa: BLE001
+        except (KeyboardInterrupt, SystemExit):
+            raise
+        except Exception as ex:
+            self.log.error(f"Failed to grow cluster: {ex}")
+            InfoEvent(f"Grow cluster failed with error: {ex}", severity=Severity.ERROR).publish()
+
+        try:
+            InfoEvent("Start shrink cluster").publish()
+            self.shrink_to_cluster_target_size(init_cluster_size)
+        except (KeyboardInterrupt, SystemExit):
+            raise
+        except Exception as ex:

Comment on lines +144 to +153
        except Exception as ex:  # noqa: BLE001
            self.log.error(f"Failed to grow cluster: {ex}")
            InfoEvent(f"Grow cluster failed with error: {ex}", severity=Severity.ERROR).publish()

        try:
            InfoEvent("Start shrink cluster").publish()
            self.shrink_to_cluster_target_size(init_cluster_size)
        except Exception as ex:  # noqa: BLE001
            self.log.error(f"Failed to shrink cluster: {ex}")
            InfoEvent(f"Shrink cluster failed with error: {ex}", severity=Severity.ERROR).publish()

Copilot AI Sep 22, 2025

Using bare Exception catch is overly broad. Consider catching more specific exceptions like cluster operation failures or timeout exceptions to provide better error handling and debugging information.

Suggested change
-        except Exception as ex:  # noqa: BLE001
-            self.log.error(f"Failed to grow cluster: {ex}")
-            InfoEvent(f"Grow cluster failed with error: {ex}", severity=Severity.ERROR).publish()
-
-        try:
-            InfoEvent("Start shrink cluster").publish()
-            self.shrink_to_cluster_target_size(init_cluster_size)
-        except Exception as ex:  # noqa: BLE001
-            self.log.error(f"Failed to shrink cluster: {ex}")
-            InfoEvent(f"Shrink cluster failed with error: {ex}", severity=Severity.ERROR).publish()
+        except TimeoutError as ex:
+            self.log.error(f"Timeout while growing cluster: {ex}")
+            InfoEvent(f"Grow cluster failed with timeout: {ex}", severity=Severity.ERROR).publish()
+        except RuntimeError as ex:
+            self.log.error(f"Runtime error while growing cluster: {ex}")
+            InfoEvent(f"Grow cluster failed with runtime error: {ex}", severity=Severity.ERROR).publish()
+        except Exception as ex:
+            self.log.critical(f"Unexpected error while growing cluster: {ex}", exc_info=True)
+            InfoEvent(f"Grow cluster failed with unexpected error: {ex}", severity=Severity.ERROR).publish()
+
+        try:
+            InfoEvent("Start shrink cluster").publish()
+            self.shrink_to_cluster_target_size(init_cluster_size)
+        except TimeoutError as ex:
+            self.log.error(f"Timeout while shrinking cluster: {ex}")
+            InfoEvent(f"Shrink cluster failed with timeout: {ex}", severity=Severity.ERROR).publish()
+        except RuntimeError as ex:
+            self.log.error(f"Runtime error while shrinking cluster: {ex}")
+            InfoEvent(f"Shrink cluster failed with runtime error: {ex}", severity=Severity.ERROR).publish()
+        except Exception as ex:
+            self.log.critical(f"Unexpected error while shrinking cluster: {ex}", exc_info=True)
+            InfoEvent(f"Shrink cluster failed with unexpected error: {ex}", severity=Severity.ERROR).publish()

InfoEvent(f"Wait {duration} minutes while cluster resizing").publish()
time.sleep(duration * 60)

self.shrink_to_cluster_target_size(self.params.total_db_nodes)

Copilot AI Sep 22, 2025

The method expects a list[int] parameter but self.params.total_db_nodes is likely not in the correct format. This should be converted to match the expected format, similar to how cluster_target_size is handled in the property method.
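
To make the suggestion concrete, the normalization might look roughly like the sketch below; the actual type of self.params.total_db_nodes is not visible in this diff, so the parsing is an assumption:

        # Hypothetical fix: normalize total_db_nodes (a single int or a per-DC list) to list[int]
        raw_nodes = self.params.total_db_nodes
        init_size = [int(n) for n in raw_nodes] if isinstance(raw_nodes, (list, tuple)) else [int(raw_nodes)]
        self.shrink_to_cluster_target_size(init_size)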

@pehala pehala left a comment

I am missing the reason for a new TestCase:

  • Existing scale tests use normal longevities.
  • Creating empty tables is done via a custom profile.
  • Grow_shrink can be implemented by a nemesis.
  • test_no_workloads_idle_custom_time is a normal longevity with empty stress.

Comment on lines +42 to +105
    def grow_to_cluster_target_size(self, cluster_target_size: list[int]):
        """Bootstrap node in each dc in each rack while cluster size less than target size"""
        nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
        current_cluster_size = [len(nodes_by_dcx[dcx]) for dcx in sorted(nodes_by_dcx)]
        if self.is_target_reached(current_cluster_size, cluster_target_size):
            self.log.debug("Cluster has required size, no need to grow")
            return
        InfoEvent(
            message=f"Starting to grow cluster from {current_cluster_size} to {cluster_target_size}").publish()

        add_node_cnt = self.params.get('add_node_cnt')
        try:
            while not self.is_target_reached(current_cluster_size, cluster_target_size):
                for dcx, target in enumerate(cluster_target_size):
                    if current_cluster_size[dcx] >= target:
                        continue
                    add_nodes_num = add_node_cnt if (
                        target - current_cluster_size[dcx]) >= add_node_cnt else target - current_cluster_size[dcx]

                    for rack in range(self.db_cluster.racks_count):
                        added_nodes = []
                        InfoEvent(
                            message=f"Adding next number of nodes {add_nodes_num} to dc_idx {dcx} and rack {rack}").publish()
                        added_nodes.extend(self.db_cluster.add_nodes(
                            count=add_nodes_num, enable_auto_bootstrap=True, dc_idx=dcx, rack=rack))
                        self.monitors.reconfigure_scylla_monitoring()
                        up_timeout = MAX_TIME_WAIT_FOR_NEW_NODE_UP
                        with adaptive_timeout(Operations.NEW_NODE, node=self.db_cluster.data_nodes[0], timeout=up_timeout):
                            self.db_cluster.wait_for_init(
                                node_list=added_nodes, timeout=up_timeout, check_node_health=False)
                        self.db_cluster.wait_for_nodes_up_and_normal(nodes=added_nodes)
                        InfoEvent(f"New nodes up and normal {[node.name for node in added_nodes]}").publish()
                nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
                current_cluster_size = [len(nodes_by_dcx[dcx]) for dcx in sorted(nodes_by_dcx)]
        finally:
            nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
            current_cluster_size = [len(nodes_by_dcx[dcx]) for dcx in sorted(nodes_by_dcx)]
            InfoEvent(message=f"Grow cluster finished, cluster size is {current_cluster_size}").publish()

    def shrink_to_cluster_target_size(self, cluster_target_size: list[int]):
        """Decommission node in each dc in each rack while cluster size more than target size"""
        nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
        current_cluster_size = [len(nodes_by_dcx[dcx]) for dcx in sorted(nodes_by_dcx)]
        if self.is_target_reached(cluster_target_size, current_cluster_size):
            self.log.debug("Cluster has required size, no need to shrink")
            return
        InfoEvent(
            message=f"Starting to shrink cluster from {current_cluster_size} to {cluster_target_size}").publish()
        try:
            nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
            while not self.is_target_reached(cluster_target_size, current_cluster_size):
                for dcx, _ in enumerate(current_cluster_size):
                    nodes_by_racks = self.db_cluster.get_nodes_per_datacenter_and_rack_idx(nodes_by_dcx[dcx])
                    for nodes in nodes_by_racks.values():
                        decommissioning_node = nodes[-1]
                        decommissioning_node.running_nemesis = "Decommissioning node"
                        self.db_cluster.decommission(node=decommissioning_node, timeout=7200)
                nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
                current_cluster_size = [len(nodes_by_dcx[dcx]) for dcx in sorted(nodes_by_dcx)]
        finally:
            nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
            current_cluster_size = [len(nodes_by_dcx[dcx]) for dcx in sorted(nodes_by_dcx)]
            InfoEvent(
                message=f"Reached cluster size {current_cluster_size}").publish()
Contributor

Can't this be achieved by the grow_shrink nemesis?
