Skip to content

Conversation

@alekswn
Copy link
Contributor

@alekswn alekswn commented Oct 15, 2025

Replace global timespec variables with a flexible ft_timer union that supports pause/resume functionality. This allows for more accurate timing measurements in tests by excluding client-server synchronization time.

In particular, this change fixes issue with fi_rma_bw -o writedata -p efa -f efa-direct.
Currently bandwidth numbers produced by that benchmark are wrong. In case of writedata client sends a stream of write messages to server and server replies with a send after each window. The purpose of the benchmark is to measure bandwidth of writedata operation, not the synchronization time. However in some cases synchronization time can take up to 30% of overall timing, therefore bandwidth calculation is off by the same 30%.

Key changes:

  • Add ft_timer union with pause/resume capabilities
  • Replace timespec usage across all test files with new timer API
  • Update timing functions in benchmark and test utilities
  • bandwidth_rma -o writedata excludes reverse sync time from total timing
  • bandwidth_rma starts timer only after all completions from warm-up iterations had been received
  • Enable payload validation for warm-up iterations
  • Increase default warmup iterations from 10 to 64 to mach default window size
  • Fix a bug in uni_bandwidth due to dafaulting to a global var targs
  • Fixed a bug with timing in efa_gda fabtest

This improves timing precision in fabtests by allowing tests to exclude non-measurement periods from their timing calculations.

Edit: Crossed out changes caused increase in BW wall clock timings causing CI timeouts. Removing for now

@alekswn alekswn force-pushed the pausable-timer branch 2 times, most recently from fadd60e to 892eff7 Compare October 16, 2025 23:13
Replace global timespec variables with a flexible ft_timer union that
supports pause/resume functionality. This allows for more accurate
timing measurements in tests by excluding syncronization time.

Key changes:
- Add ft_timer union with pause/resume capabilities
- Replace timespec usage across all test files with new timer API
- Update timing functions in benchmark and test utilities
- `bandwidth_rma -o writedata` exclude reverse sync time from timing
- Fix a bug in `uni_bandwidth` due to dafaulting to a global var `targs`
- Fixed a bug with timing in `efa_gda` fabtest

This improves timing precision in fabtests by allowing tests to exclude
non-measurement periods from their timing calculations.

Signed-off-by: Alexey Novikov <[email protected]>
Copy link
Contributor

@aingerson aingerson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fun changes! Thank you!

Comment on lines +532 to +539
static inline void ft_timer_pause(union ft_timer *timer)
{
timer->elapsed = ft_gettime_ns() - timer->start;
}
static inline void ft_timer_resume(union ft_timer *timer)
{
timer->start = ft_gettime_ns() - timer->elapsed;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we simplify this to just a start and stop where stop is
elapsed = gettime - start

and start is
start = gettime - elapsed

and elapsed is initialized to 0?

or else you could consider a model where
start is
start = gettime

and stop is
elapsed += gettime - start

either way, I don't think you need the pause/resume to be separate. I would rather keep it as simple as possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can have just two state transition functions (start and stop). Do we want some initialization function to set elapsed to zero or should let caller to just zero initialize the timer ?
The later is not not going to work if we returned to global variable model, so the caller would have to reset the global timer before calling start.

Would not it be more convenient to reset timer in start() and still have resume() function which would just update elapsed?

In that case callers which do not need pause functionality can just call start() and stop() as before.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with start/stop/resume with an init in start

}
static inline uint64_t ft_timer_get_elapsed(union ft_timer timer, enum precision p)
{
// assiming timer is stopped
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

}
static inline bool ft_timer_is_elapsed(union ft_timer timer, uint64_t timeout, enum precision p)
{
// assuming timer is running
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo, also couldn't hurt to add some asserts in the timer instead of having to comment this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially implemented ft_timer as struct and added asserts here, but converted it to a union then.
There is no way to add asserts having it as a union. Probably that was a premature optimization and a bad idea.
I'll revert it back.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I missed the union! A fun optimization! But I think I would prefer the struct, especially if that adds the option of an assert to catch misuses of it. I wouldn't want performance numbers to get reported incorrectly if someone forgot to stop/resume in the correct order or something.

}

static inline void ft_start(void)
static inline void ft_timer_start(union ft_timer *timer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a specific reason you changed this to pass in a timer instead of using a global? I'm not necessarily against it if you have a reason you want to update it like if you have a test you want to add that has multiple timers. It's just not necessary right now and makes for a lot of code changes without a clear reason.

Copy link
Contributor Author

@alekswn alekswn Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for that is to make it useful for cases like timeout in your previous comment. There are few places like that in fabtests.
Also, I have noticed global vars s and e are shadowed at lots of places across fabtests code. show_perf() function does function takes s and e as arguments and does not use globals. So, in my opinion, keeping it global might lead to tricky bugs in the future.
I'm fixing a similar shadowing bug with targs variable in the present PR rdm_bw_mt.c:444.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable. Thanks!

Comment on lines +473 to +478
if (timer)
ft_timer_pause(timer);
ret = ft_rx(ep, FT_RMA_SYNC_MSG_BYTES);
if (timer)
ft_timer_resume(timer);
return ret;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this give you false performance numbers? Depending on your completion level, the send may have not been completely sent so you could be reporting higher bandwidth. Same on the receive side for the RMA ops. The RMA ops aren't going to generate completions so if you stop the timer right after get_rx_comp you're not actually waiting for all the messages to come in. Unfortunately, for RMA ops you kind of have to wait for the sync message to ensure you're stopping the timer when all the data has arrived.

Copy link
Contributor Author

@alekswn alekswn Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. The rationale here was to exclude server -> client send latency from time calculation. For writedata benchmark we observed synchronization overhead of the same order of magnitude as overall timing due to connection setup delay for EFA.

Connection setup delay is only seen for a first few synchronizations, but we see 30% error in bandwidth measurements attributed to connection setup latency. The delay is negligible after first 5 synchronizations.

The reason the synchronization delay affects writedata test so much is asymmetry of the test. Client makes a lot of writes in the loop, but server replies with just a single send at the end of each window. We are warming up client->server path with the warm-up iterations, but do not warm up server->client connection.

If we want keep synchronization timing included, I think an alternative approach would be to warm up server->client path as well. Does it sound like a better approach?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, thanks for the context. I think it's ok to warmup both sides. In my opinion, it's a little weird that the writedata function has such a different model than the other functions. I understand why (because of the rx completion) but I think it would be ok to have all ops behave in the same pattern. Can we change writedata to be two sided? Would that help as well? Or are you explicitly looking to have it remain a one-sided test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for suggestion!
Will try to make it symmetric.

@alekswn
Copy link
Contributor Author

alekswn commented Oct 21, 2025

Fun changes! Thank you!

Thanks for reviewing this PR and detailed feedback, Alexia!

If we don't want exclude synchronization time from bandwidth_rma timing introducing a new timer is not making much sense.
This PR also fixes a couple of timer-related bugs in efa_gda.c and multy_ep_mt.c. I think I'll cut separate PRs with fixes using the old timer.

@alekswn alekswn marked this pull request as draft October 22, 2025 23:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants