feature(scale_test): new scale test with empty cluster #11981
base: master
Conversation
Force-pushed from a4efd61 to 5d8927a (compare)
Pull Request Overview
This PR introduces a new scale test framework for ScyllaDB that enables testing cluster resizing operations with empty clusters. It provides infrastructure for growing and shrinking clusters while maintaining schema without active workloads.
Key changes:
- Adds a `ScaleClusterTest` class with methods for cluster resizing operations (sketched below)
- Introduces an `idle_duration` configuration parameter for workload-free testing
- Creates YAML test configurations for multi-datacenter and single-datacenter scenarios
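A rough skeleton of the new class, inferred from this summary (the base class and anything beyond the method names are assumptions; the full method bodies appear in the review excerpts further down):

```python
# Skeleton inferred from the PR summary; ClusterTester as the base class is an assumption.
class ScaleClusterTest(ClusterTester):
    """Grow/shrink an empty cluster (schema only, no workload) to per-DC target sizes."""

    def grow_to_cluster_target_size(self, cluster_target_size: list[int]) -> None:
        """Bootstrap nodes per DC and rack until each DC reaches its target size."""

    def shrink_to_cluster_target_size(self, cluster_target_size: list[int]) -> None:
        """Decommission nodes per DC and rack until each DC is back at its target size."""
```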
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| scale_cluster_test.py | New test class implementing cluster grow/shrink operations with schema management |
| sdcm/sct_config.py | Adds the idle_duration configuration parameter definition |
| test-cases/scale/scale-multi-dc-100-empty-tables-cluster-resize.yaml | Multi-DC test config with 100 empty tables and cluster resizing |
| test-cases/scale/scale-20-200-20-cluster-resize.yaml | Single-DC test config for scaling from 20 to 200 nodes |
| jenkins-pipelines/oss/scale/scale-20-200-nodes-cluster.jenkinsfile | Jenkins pipeline for the 20-200 node scaling test |
| jenkins-pipelines/oss/scale/scale-120-multidc-cluster-resize.jenkinsfile | Jenkins pipeline for multi-DC cluster resizing |
| docs/configuration_options.md | Documentation for the new idle_duration parameter |
| defaults/test_default.yaml | Default value for the idle_duration parameter |
| data_dir/templated_100_table.yaml | Template definition for creating 100 test tables |
Force-pushed from c70fe4a to fe74b08 (compare)
New scale tests have been developed to reproduce and validate the issues referenced below. The current implementation of LongevityTest does not support extended execution without a workload. To address this limitation, a new ScaleClusterTest has been introduced. It allows tests to run without workloads in various scenarios:

- Initializing a large cluster to a specified target size (e.g., from 10 to 100 nodes).
- Scaling the cluster down to a desired size (e.g., from 100 to 10 nodes).
- Creating a large number of keyspaces and tables with predefined columns, or utilizing the cs-profile-template.
- Running tests with nemesis but without any payload, for a duration specified by the new 'idle_duration' parameter.

These tests were developed to reduce the complexity associated with the LongevityTest object and to keep future scale testing compatible with Kubernetes (K8s), Docker, and other cloud providers.

Refs: scylladb/scylladb#24790, scylladb/scylla-enterprise#5626, scylladb/scylla-enterprise#5624
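As a rough illustration of how these scenarios fit together, here is a minimal sketch of a grow/idle/shrink flow. It reuses names that appear in this PR (grow_to_cluster_target_size, shrink_to_cluster_target_size, idle_duration, cluster_target_size), but the test method itself, the init_cluster_size helper, and the module-level time import are assumptions, not the PR's actual code:

```python
# Minimal sketch (an assumption, not the PR's verbatim code) of a ScaleClusterTest
# method combining the scenarios above: grow, idle without workload, shrink back.
def test_grow_shrink_empty_cluster(self):
    init_cluster_size = self.init_cluster_size                   # hypothetical helper: starting per-DC sizes
    self.grow_to_cluster_target_size(self.cluster_target_size)   # per-DC targets from the PR's property
    idle_minutes = self.params.get('idle_duration')               # new parameter introduced in this PR
    time.sleep(idle_minutes * 60)                                 # idle phase: no stress workload running
    self.shrink_to_cluster_target_size(init_cluster_size)         # return to the initial size
```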
Force-pushed from fe74b08 to 9dd2a7d (compare)
Pull Request Overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
```python
        except Exception as ex:  # noqa: BLE001
            self.log.error(f"Failed to grow cluster: {ex}")
            InfoEvent(f"Grow cluster failed with error: {ex}", severity=Severity.ERROR).publish()

        try:
            InfoEvent("Start shrink cluster").publish()
            self.shrink_to_cluster_target_size(init_cluster_size)
        except Exception as ex:  # noqa: BLE001
```
Using bare Exception catch is overly broad. Consider catching more specific exceptions like cluster operation failures or timeout exceptions to provide better error handling and debugging information.
Suggested change:

```python
        except (KeyboardInterrupt, SystemExit):
            raise
        except Exception as ex:
            self.log.error(f"Failed to grow cluster: {ex}")
            InfoEvent(f"Grow cluster failed with error: {ex}", severity=Severity.ERROR).publish()

        try:
            InfoEvent("Start shrink cluster").publish()
            self.shrink_to_cluster_target_size(init_cluster_size)
        except (KeyboardInterrupt, SystemExit):
            raise
        except Exception as ex:
```
```python
        except Exception as ex:  # noqa: BLE001
            self.log.error(f"Failed to grow cluster: {ex}")
            InfoEvent(f"Grow cluster failed with error: {ex}", severity=Severity.ERROR).publish()

        try:
            InfoEvent("Start shrink cluster").publish()
            self.shrink_to_cluster_target_size(init_cluster_size)
        except Exception as ex:  # noqa: BLE001
            self.log.error(f"Failed to shrink cluster: {ex}")
            InfoEvent(f"Shrink cluster failed with error: {ex}", severity=Severity.ERROR).publish()
```
Using bare Exception catch is overly broad. Consider catching more specific exceptions like cluster operation failures or timeout exceptions to provide better error handling and debugging information.
Suggested change:

```python
        except TimeoutError as ex:
            self.log.error(f"Timeout while growing cluster: {ex}")
            InfoEvent(f"Grow cluster failed with timeout: {ex}", severity=Severity.ERROR).publish()
        except RuntimeError as ex:
            self.log.error(f"Runtime error while growing cluster: {ex}")
            InfoEvent(f"Grow cluster failed with runtime error: {ex}", severity=Severity.ERROR).publish()
        except Exception as ex:
            self.log.critical(f"Unexpected error while growing cluster: {ex}", exc_info=True)
            InfoEvent(f"Grow cluster failed with unexpected error: {ex}", severity=Severity.ERROR).publish()

        try:
            InfoEvent("Start shrink cluster").publish()
            self.shrink_to_cluster_target_size(init_cluster_size)
        except TimeoutError as ex:
            self.log.error(f"Timeout while shrinking cluster: {ex}")
            InfoEvent(f"Shrink cluster failed with timeout: {ex}", severity=Severity.ERROR).publish()
        except RuntimeError as ex:
            self.log.error(f"Runtime error while shrinking cluster: {ex}")
            InfoEvent(f"Shrink cluster failed with runtime error: {ex}", severity=Severity.ERROR).publish()
        except Exception as ex:
            self.log.critical(f"Unexpected error while shrinking cluster: {ex}", exc_info=True)
            InfoEvent(f"Shrink cluster failed with unexpected error: {ex}", severity=Severity.ERROR).publish()
```
InfoEvent(f"Wait {duration} minutes while cluster resizing").publish() | ||
time.sleep(duration * 60) | ||
|
||
self.shrink_to_cluster_target_size(self.params.total_db_nodes) |
The method expects a list[int] parameter but self.params.total_db_nodes is likely not in the correct format. This should be converted to match the expected format, similar to how cluster_target_size is handled in the property method.
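One possible fix in the spirit of this comment, sketched here as an assumption (only n_db_nodes, total_db_nodes, and shrink_to_cluster_target_size come from the PR or SCT config; the parsing itself is hypothetical), is to normalize the configured node counts into a per-DC list before the call:

```python
# Hypothetical sketch: build the initial per-DC size list the same way the
# cluster_target_size property does, instead of passing total_db_nodes directly.
n_db_nodes = self.params.get('n_db_nodes')      # e.g. "10 10" for two DCs, or 10 for a single DC
if isinstance(n_db_nodes, str):
    init_cluster_size = [int(n) for n in n_db_nodes.split()]
else:
    init_cluster_size = [int(n_db_nodes)]
self.shrink_to_cluster_target_size(init_cluster_size)
```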
I am missing the reason for a new TestCase:
- Existing scale tests use normal longevities.
- Creating empty tables is done via a custom profile.
- Grow/shrink can be implemented by a nemesis.

test_no_workloads_idle_custom_time is a normal longevity with empty stress.
```python
    def grow_to_cluster_target_size(self, cluster_target_size: list[int]):
        """Bootstrap node in each dc in each rack while cluster size less than target size"""
        nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
        current_cluster_size = [len(nodes_by_dcx[dcx]) for dcx in sorted(nodes_by_dcx)]
        if self.is_target_reached(current_cluster_size, cluster_target_size):
            self.log.debug("Cluster has required size, no need to grow")
            return
        InfoEvent(
            message=f"Starting to grow cluster from {current_cluster_size} to {cluster_target_size}").publish()

        add_node_cnt = self.params.get('add_node_cnt')
        try:
            while not self.is_target_reached(current_cluster_size, cluster_target_size):
                for dcx, target in enumerate(cluster_target_size):
                    if current_cluster_size[dcx] >= target:
                        continue
                    add_nodes_num = add_node_cnt if (
                        target - current_cluster_size[dcx]) >= add_node_cnt else target - current_cluster_size[dcx]

                    for rack in range(self.db_cluster.racks_count):
                        added_nodes = []
                        InfoEvent(
                            message=f"Adding next number of nodes {add_nodes_num} to dc_idx {dcx} and rack {rack}").publish()
                        added_nodes.extend(self.db_cluster.add_nodes(
                            count=add_nodes_num, enable_auto_bootstrap=True, dc_idx=dcx, rack=rack))
                        self.monitors.reconfigure_scylla_monitoring()
                        up_timeout = MAX_TIME_WAIT_FOR_NEW_NODE_UP
                        with adaptive_timeout(Operations.NEW_NODE, node=self.db_cluster.data_nodes[0], timeout=up_timeout):
                            self.db_cluster.wait_for_init(
                                node_list=added_nodes, timeout=up_timeout, check_node_health=False)
                        self.db_cluster.wait_for_nodes_up_and_normal(nodes=added_nodes)
                        InfoEvent(f"New nodes up and normal {[node.name for node in added_nodes]}").publish()
                nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
                current_cluster_size = [len(nodes_by_dcx[dcx]) for dcx in sorted(nodes_by_dcx)]
        finally:
            nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
            current_cluster_size = [len(nodes_by_dcx[dcx]) for dcx in sorted(nodes_by_dcx)]
            InfoEvent(message=f"Grow cluster finished, cluster size is {current_cluster_size}").publish()

    def shrink_to_cluster_target_size(self, cluster_target_size: list[int]):
        """Decommission node in each dc in each rack while cluster size more than target size"""
        nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
        current_cluster_size = [len(nodes_by_dcx[dcx]) for dcx in sorted(nodes_by_dcx)]
        if self.is_target_reached(cluster_target_size, current_cluster_size):
            self.log.debug("Cluster has required size, no need to shrink")
            return
        InfoEvent(
            message=f"Starting to shrink cluster from {current_cluster_size} to {cluster_target_size}").publish()
        try:
            nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
            while not self.is_target_reached(cluster_target_size, current_cluster_size):
                for dcx, _ in enumerate(current_cluster_size):
                    nodes_by_racks = self.db_cluster.get_nodes_per_datacenter_and_rack_idx(nodes_by_dcx[dcx])
                    for nodes in nodes_by_racks.values():
                        decommissioning_node = nodes[-1]
                        decommissioning_node.running_nemesis = "Decommissioning node"
                        self.db_cluster.decommission(node=decommissioning_node, timeout=7200)
                nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
                current_cluster_size = [len(nodes_by_dcx[dcx]) for dcx in sorted(nodes_by_dcx)]
        finally:
            nodes_by_dcx = group_nodes_by_dc_idx(self.db_cluster.data_nodes)
            current_cluster_size = [len(nodes_by_dcx[dcx]) for dcx in sorted(nodes_by_dcx)]
            InfoEvent(
                message=f"Reached cluster size {current_cluster_size}").publish()
```
Can't this be achieved by the grow_shrink nemesis?