I'm reviewing whether our current methods for calculating and aggregating correctness and performance scores make sense. To help with this, I am comparing BackendBench with KernelBench. See the table below for an easy comparison:
| | BackendBench | KernelBench |
|---|---|---|
| Correctness score per op | Numeric score (ratio of passed tests) | Binary score (whether all tests passed) |
| Correctness score aggregation | Mean | Mean |
| Number of performance tests per op | Many | 1 |
| Number of runs per performance test | Many | Many |
| Performance score per op | Geometric mean of speedup (incorrect = 1) | Amortized speedup (multiple runs) |
| Performance score aggregation | Geometric mean | Geometric mean (correct tests only) |
| Number of tests per op | Varies (e.g. opinfo) | Fixed |
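
To make the table concrete, here is a minimal sketch of how the two correctness schemes and the geometric-mean performance aggregation could be computed. The op names, pass/fail outcomes, and speedup numbers are made up for illustration and are not taken from either harness.

```python
import math

# Hypothetical per-op results: each op maps to a list of (passed, speedup) test outcomes.
# All values here are illustrative, not from BackendBench or KernelBench.
ops = {
    "add": [(True, 1.2), (True, 0.9), (False, 1.5)],
    "matmul": [(True, 2.0), (True, 1.8)],
}

def correctness_ratio(tests):
    # BackendBench-style numeric score: fraction of tests that passed.
    return sum(passed for passed, _ in tests) / len(tests)

def correctness_binary(tests):
    # KernelBench-style binary score: 1 only if every test passed.
    return 1.0 if all(passed for passed, _ in tests) else 0.0

def geomean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

def perf_per_op(tests):
    # Geometric mean of speedups, counting incorrect tests as speedup = 1.
    return geomean([s if passed else 1.0 for passed, s in tests])

# Aggregate across ops: arithmetic mean for correctness, geometric mean for performance.
corr_ratio = sum(correctness_ratio(t) for t in ops.values()) / len(ops)
corr_binary = sum(correctness_binary(t) for t in ops.values()) / len(ops)
perf = geomean([perf_per_op(t) for t in ops.values()])

print(f"correctness (ratio of passed tests): {corr_ratio:.3f}")
print(f"correctness (all-tests-pass binary): {corr_binary:.3f}")
print(f"performance (geomean of per-op geomeans): {perf:.3f}")
```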
Based on this comparison, below are some questions I have for analysis:
- Which correctness scoring method is better? Should we use a simple correct/incorrect result or the ratio of passed tests?
- For performance, does running the same test multiple times give us more accurate speedup results?
- How should we treat incorrect tests, and how does that treatment affect performance scores?
- Since the number of tests per op varies, does this give some ops more weight in the final score? (See the sketch after this list.)
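
As a quick illustration of the weighting question (with made-up numbers): if correctness is first averaged per op and then across ops, every op counts equally; if instead all tests are pooled, ops with more tests dominate the score.

```python
# Illustrative numbers only: op A has many tests, op B has few.
op_a = [True] * 90 + [False] * 10   # 100 tests, 90% pass rate
op_b = [True, False]                # 2 tests, 50% pass rate

# Mean of per-op pass rates: each op counts equally regardless of test count.
per_op_mean = (sum(op_a) / len(op_a) + sum(op_b) / len(op_b)) / 2

# Pooled pass rate over all tests: ops with more tests dominate the score.
pooled = (sum(op_a) + sum(op_b)) / (len(op_a) + len(op_b))

print(f"per-op mean: {per_op_mean:.3f}")   # 0.700
print(f"pooled:      {pooled:.3f}")        # 0.892
```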
Edit 1: BackendBench does multiple runs to measure performance as well.