Added network benchmarking feature #311
Conversation
Wow, the variance in individual node results is shocking. It seems like this would definitely impact our various performance tests.
```fsharp
open BenchmarkDaemonSet
open ApiRateLimit
open System
```
Some more unused imports: `BenchmarkDaemonSet`, `ApiRateLimit`, and `System`.
I think this is from an earlier commit, as they're all used now
`BenchmarkDaemonSet` and `ApiRateLimit` are still showing up as unused because you're using them fully qualified in this file (for example, `BenchmarkDaemonSet.createTcpTuningDaemonSet`). With the import, you can just call `createTcpTuningDaemonSet` directly. If all usages are fully qualified, you don't need the import. Style-wise, I think either is fine, but doing both is unnecessary.
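For illustration, a minimal sketch of the two styles (assuming `createTcpTuningDaemonSet` takes the network config as its only argument):

```fsharp
// Style 1: fully qualified call; no import needed.
module UsesQualifiedCall =
    let make nCfg = BenchmarkDaemonSet.createTcpTuningDaemonSet nCfg

// Style 2: open the module once, then call unqualified.
module UsesOpen =
    open BenchmarkDaemonSet
    let make nCfg = createTcpTuningDaemonSet nCfg
```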
Ya, looking into this a bit more, I'm fairly confident it has to do with the default TCP buffer sizes being too small given the way we simulate latency. That being said, it looks like a lot of other blockchains recommend turning kernel network settings way up, which might be something we want to do as well. Currently working on a test to see if I can verify this behavior.
Pull Request Overview
This PR introduces a network infrastructure benchmarking feature that mirrors the stellar-core network topology to measure network performance characteristics. The benchmark uses iperf3 to simulate P2P traffic patterns and provides detailed performance metrics before or instead of running stellar-core tests.
- Bidirectional performance testing using iperf3 with coordinated traffic generation
- Infrastructure topology mirroring with identical peer connections and network delays
- Comprehensive results parsing and reporting with per-node and aggregate metrics
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/scripts/start-benchmark-client.sh | Client script for coordinated iperf3 bidirectional testing |
| src/scripts/parse_benchmark_results.py | Python parser for aggregating and formatting benchmark results |
| src/FSLibrary/StellarSupercluster.fs | Main entry point integration for benchmark execution |
| src/FSLibrary/StellarStatefulSets.fs | Core benchmark orchestration and result collection |
| src/FSLibrary/StellarMissionContext.fs | Mission context configuration fields |
| src/FSLibrary/StellarDestination.fs | File path utility enhancement |
| src/FSLibrary/StellarCoreCfg.fs | Stellar core configuration update |
| src/FSLibrary/FSLibrary.fsproj | Project file dependency addition |
| src/FSLibrary/BenchmarkDaemonSet.fs | Kubernetes resource creation for benchmark pods |
| src/FSLibrary.Tests/Tests.fs | Test configuration updates |
| src/App/Program.fs | Command line argument parsing for benchmark flags |
```python
imbalance_percent = abs(total_network_send - total_network_recv) / total_network_send * 100
if imbalance_percent > 5.0:  # Alert if more than 5% imbalance
```
Copilot AI · Sep 2, 2025
Division by zero vulnerability when `total_network_send` is 0. This check should be moved inside the existing `if total_network_send > 0:` condition on line 299.
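A minimal sketch of the guarded version (variable names are taken from the snippet above; the warning message is an illustrative assumption, not the PR's actual output):

```python
if total_network_send > 0:
    imbalance_percent = abs(total_network_send - total_network_recv) / total_network_send * 100
    if imbalance_percent > 5.0:  # Alert if more than 5% imbalance
        # Hypothetical report line; the real script's output format may differ.
        print(f"WARNING: send/recv imbalance of {imbalance_percent:.1f}%")
```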
I've identified the issue with the bandwidth and added a fix in the last comment. When installing latency, the default TCP buffer configs were very small (the container fills the buffers, but then artificially waits before servicing them to simulate latency). This was destroying the node bandwidth. I've added a flag; without the flag vs. with the flag:

This was definitely a significant bottleneck. To run performance tests, add the following flags to your command:

I'm not sure if this is the best way to do this from an implementation perspective, and it's kinda hacky. I couldn't find a way to change these kernel settings from the container, but did it on the node itself. This means that setting changes persist run-to-run. To account for this, if the TCP optimization flag is not set, I still "upgrade" the TCP settings, but back to the non-optimal Linux defaults. This is purely for A-B testing. I think longer term this shouldn't be a flag, but should just automatically upgrade TCP settings unconditionally to the better variant. However, given that we have lots of perf testing in flight right now, I think we still need the A-B test.
I've addressed all the comments and rebased, so this should be ready for review.

I've finally got the network stable. There were two required changes. First, we needed to modify the TCP settings of the workers themselves to provide suitable bandwidth when artificial delay was enabled. Second, we needed to add fq for peer fairness. Without fq, given the way we install latency, whichever node has the lowest latency dominates our bandwidth. With fq, peers are served more fairly regardless of their latency. To install the cluster-level TCP improvements, run with

Because the TCP fix is cluster-wide and persistent, there's really no good way of A-B testing older images without interfering with other runs on the cluster. I did some benchmarking against the 23.0.1 release.
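For reference, node-level tuning along these lines typically combines larger TCP buffers with the fq queueing discipline. The specific values and the interface name below are illustrative assumptions, not the PR's actual settings:

```sh
#!/bin/sh
# Illustrative node-level TCP tuning sketch (values are assumptions).
# Raise max socket buffer sizes so a large bandwidth-delay product
# (from the artificial latency) doesn't cap per-connection throughput.
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 65536 134217728"

# Use fq so flows with different installed latencies share bandwidth fairly.
sysctl -w net.core.default_qdisc=fq
tc qdisc replace dev eth0 root fq  # interface name is an assumption
```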
I think this is mostly looking good. It's a little hard to review the correctness of the new options, but they seem properly contained behind flags and nothing jumped out to me as incorrect.
Feel free to ignore the stylistic nitpicks if you disagree with them, but I do think the comments around the TCP tuning flow are worth addressing.
```fsharp
V1DaemonSet(
    metadata = V1ObjectMeta(name = name, labels = labels, namespaceProperty = nCfg.NamespaceProperty),
    spec =
        V1DaemonSetSpec(
            selector = V1LabelSelector(matchLabels = dict [ ("app", name) ]),
            template =
                V1PodTemplateSpec(
                    metadata = V1ObjectMeta(labels = dict [ ("app", name) ]),
                    spec =
                        V1PodSpec(
                            hostNetwork = System.Nullable<bool>(true),
                            hostPID = System.Nullable<bool>(true),
                            containers =
                                [| V1Container(
                                       name = name,
                                       image = "busybox:latest",
                                       command = [| "/bin/sh" |],
                                       args = [| sprintf "/scripts/%s" scriptName; "--daemon" |],
                                       volumeMounts =
                                           [| V1VolumeMount(
                                                  name = "tcp-tuning-script",
                                                  mountPath = "/scripts",
                                                  readOnlyProperty = System.Nullable<bool>(true)
                                              ) |],
                                       securityContext =
                                           V1SecurityContext(
                                               privileged = System.Nullable<bool>(true),
                                               capabilities = V1Capabilities(add = [| "SYS_ADMIN"; "NET_ADMIN" |])
                                           ),
                                       resources =
                                           V1ResourceRequirements(
                                               requests =
                                                   dict [ ("memory", ResourceQuantity("50Mi"))
                                                          ("cpu", ResourceQuantity("10m")) ],
                                               limits =
                                                   dict [ ("memory", ResourceQuantity("100Mi"))
                                                          ("cpu", ResourceQuantity("50m")) ]
                                           )
                                   ) |],
                            volumes =
                                [| V1Volume(
                                       name = "tcp-tuning-script",
                                       configMap =
                                           V1ConfigMapVolumeSource(
                                               name = "tcp-tuning-script",
                                               defaultMode = System.Nullable<int>(0o755)
                                           )
                                   ) |],
                            tolerations = [| V1Toleration(operatorProperty = "Exists") |]
                        )
                )
        )
)
```
Nit: The nesting gets pretty deep here. This might be more readable by building up some of these inner records as local variables.
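As a rough sketch of that suggestion (the local names are hypothetical, and only a couple of the inner records are shown; elided fields are marked):

```fsharp
// Hypothetical refactor: name the inner records before assembling the DaemonSet.
let benchmarkContainer =
    V1Container(
        name = name,
        image = "busybox:latest",
        command = [| "/bin/sh" |],
        args = [| sprintf "/scripts/%s" scriptName; "--daemon" |]
        // ... volumeMounts, securityContext, resources as before ...
    )

let podSpec =
    V1PodSpec(
        hostNetwork = System.Nullable<bool>(true),
        hostPID = System.Nullable<bool>(true),
        containers = [| benchmarkContainer |]
        // ... volumes, tolerations as before ...
    )

V1DaemonSet(
    metadata = V1ObjectMeta(name = name, labels = labels, namespaceProperty = nCfg.NamespaceProperty),
    spec =
        V1DaemonSetSpec(
            selector = V1LabelSelector(matchLabels = dict [ ("app", name) ]),
            template = V1PodTemplateSpec(metadata = V1ObjectMeta(labels = dict [ ("app", name) ]), spec = podSpec)
        )
)
```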
| "Failed to retrieve results from pod %s (kubectl exec failed): %s" | ||
| pod.Metadata.Name | ||
| stderr | ||
| else if String.IsNullOrWhiteSpace(output) then |
Probably not worth fixing throughout this PR, but as a general style nit for the future: the parentheses after this function call are unnecessary. This can be `String.IsNullOrWhiteSpace output`, which is the idiomatic form for ML-like languages. Parentheses are only needed around more complex expressions, not variable lookups or literals.
```sh
# Keep running for a bit to ensure settings propagate
sleep 10
echo "TCP settings verification complete on node $(hostname)"
sleep infinity
```
Why do we need this pseudo-daemon mode? Does supercluster just not like it if a pod exits on its own?
```fsharp
    namespaceContent = NamespaceContent(self, nCfg.missionContext.apiRateLimit, nCfg.NamespaceProperty)
)

// services, ingresses or anything. Optionally sets up TCP tuning daemonsets.
member self.MakeEmptyFormation(nCfg: NetworkCfg, ?skipTcpConfig: bool) : StellarFormation =
```
Why do we need the `skipTcpConfig` argument? Can we use `nCfg.missionContext.enableTcpTuning` instead?
```fsharp
member self.DeployTcpTuningDaemonSet() : unit =
    if self.NetworkCfg.missionContext.enableTcpTuning then
```
I think that this function should either be renamed to something like `MaybeDeployTcpTuningDaemonSet`, or the conditional on `enableTcpTuning` should be removed. As it is, the call sites for `DeployTcpTuningDaemonSet` look a little strange.
Overview
This PR introduces a network infrastructure benchmark that measures the network performance of supercluster without actually running stellar-core. It installs an identical network topology, with the same peer-to-peer overlay, ingress, DNS, and artificial delay, but instead of running stellar-core containers, it runs iperf3 containers for a network stress test. All nodes flood as much bandwidth as possible to all their peers simultaneously and bidirectionally to simulate p2p networking. We then report bandwidth and latency as reported by iperf3. This test can be run standalone, or immediately before any ssc mission. We might want to add this to our MaxTPS flow, for example, to get a better idea of run-to-run cluster variance.
Usage
- The `--benchmark-infra true` flag enables the network test prior to the supercluster mission.
- If `--benchmark-only` is set, the test will exit after running just the network load test.
- `--benchmark-duration-seconds` determines how long to generate load for the stress tests, and defaults to 30 seconds.

Example output:
Design
Bidirectional Performance Testing via iperf3
- `--bidir` flag to run simultaneous bidirectional tests between node pairs
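As a sketch of what such a client invocation can look like (the peer address and the environment variables here are illustrative assumptions, not the PR's exact script):

```sh
# One connection, run from the client container; -J emits JSON for the parser.
# PEER, SOURCE_NODE_INDEX, and BENCHMARK_DURATION are hypothetical variables.
PEER=peer-node-3.stellar-supercluster.svc
PORT=$((5201 + SOURCE_NODE_INDEX))   # unique server port per source node
iperf3 -c "$PEER" -p "$PORT" --bidir -t "$BENCHMARK_DURATION" -J > /results/peer-3.json
```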
Topology Mirroring

Pod Structure
Each benchmark pod contains three containers:
Server Container
- Runs an iperf3 server on port `5201 + source_node_index`, where each node has a unique index
Network Delay Container (optional)
tcrulesSince there is some overhead in running a server vs. a client, we use a hash of
each pod name in a given connection to randomly decide who will be the client vs.
server for that particular connection.
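A minimal sketch of that selection (the function and argument names are hypothetical, not the PR's actual helper):

```fsharp
// Hypothetical sketch: deterministically pick the client side of a connection
// by hashing the sorted pair of pod names. Both pods compute the same answer,
// and the client/server role is roughly random across connections.
let isClientInConnection (selfPod: string) (peerPod: string) : bool =
    let first, second = if selfPod < peerPod then selfPod, peerPod else peerPod, selfPod
    let firstIsClient = ((hash (first + second)) &&& 1) = 0
    if selfPod = first then firstIsClient else not firstIsClient
```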
Results Parsing
A Python script (`parse_benchmark_results.py`) processes the raw iperf3 JSON output into per-node and aggregate metrics.
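The iperf3 `-J` output is plain JSON; a minimal sketch of pulling the summary throughput from one result file (field names follow iperf3's standard JSON schema; the file path is a hypothetical placeholder):

```python
import json

# Read one node's iperf3 JSON result (path is hypothetical).
with open("results/node-0.json") as f:
    result = json.load(f)

# iperf3 reports end-of-run summaries under the "end" key.
sent_mbps = result["end"]["sum_sent"]["bits_per_second"] / 1e6
recv_mbps = result["end"]["sum_received"]["bits_per_second"] / 1e6
print(f"send: {sent_mbps:.1f} Mbps, recv: {recv_mbps:.1f} Mbps")
```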
Observations from EKS cluster:
Here are a few samples I collected from a few different topologies, all with 5-minute runs:
Small, 7-node topology (`theoretical-max-tps.json`):
Default, 22-node topology:
Larger, 100-node topology (`generated-overlay-topology-2.json`):
It seems like the total bandwidth of each node is pretty variable, and lower than I expected. There were several runs where even the small, 7-node max TPS topology had nodes with < 100 Mbps total bandwidth, which seems like it would negatively affect the MaxTPS test. I think we need to do a little more digging to figure out what's causing this variance. This might be related to the way we install network latency, as bandwidth is much higher when we don't install manual latency. It might also be an artifact of iperf3 in high-latency environments, I'm not sure. I think it probably has to do with TCP socket buffer sizes or something related. Here's the small 7-node topology with no latency: