
Survey on correctness and performance score calculation and aggregation #97

@jiannanWang

Description

I'm reviewing whether our current methods for calculating and aggregating correctness and performance scores make sense. To help with this, I am comparing BackendBench with KernelBench; see the table below for a side-by-side comparison:

| | BackendBench | KernelBench |
| --- | --- | --- |
| Correctness score per op | Numeric score (ratio of passed tests) | Binary score (whether all tests passed) |
| Correctness score aggregation | Mean | Mean |
| Number of performance tests per op | Many | 1 |
| Number of runs per performance test | Many | Many |
| Performance score per op | Geometric mean of speedup (incorrect = 1) | Amortized speedup (multiple runs) |
| Performance score aggregation | Geometric mean | Geometric mean (correct tests only) |
| Number of tests per op | Varies (e.g. opinfo) | Fixed |
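
For concreteness, here is a minimal sketch of the per-op scoring schemes as I read them from the table. The `results` structure and the helper names are hypothetical, not taken from either codebase:

```python
import math

# Hypothetical per-op test results: a list of (passed, speedup) pairs, e.g.
# results = [(True, 1.8), (False, 0.0), (True, 1.2)]

def correctness_backendbench(results):
    # Numeric score: ratio of passed tests for this op.
    return sum(passed for passed, _ in results) / len(results)

def correctness_kernelbench(results):
    # Binary score: 1 only if every test for this op passed.
    return float(all(passed for passed, _ in results))

def perf_backendbench(results):
    # Geometric mean of per-test speedups, with incorrect tests
    # counted as speedup 1 (no credit, no penalty).
    speedups = [s if passed else 1.0 for passed, s in results]
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

One consequence of `incorrect = 1` is that an op failing every test still gets a neutral performance score of 1.0, which ties into question 3 below.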

Based on this comparison, below are some questions I have for analysis:

  1. Which correctness scoring method is better? Should we use a simple correct/incorrect result or the ratio of passed tests?
  2. For performance, does running the same test multiple times give us more accurate speedup results?
  3. How should we treat incorrect tests, and how does that treatment affect performance scores?
  4. Since the number of tests per op varies, does this give some ops more weight in the final score? (See the toy example after this list.)
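
To make question 4 concrete, here is a toy example with made-up numbers. Aggregating by the mean of per-op ratios weights every op equally, while pooling all tests into one ratio lets ops with large test suites dominate:

```python
# Hypothetical pass/fail results for two ops with very different test counts.
per_op_results = {
    "op_a": [True] * 2,                  # 2 tests, all pass   -> ratio 1.0
    "op_b": [True] * 10 + [False] * 90,  # 100 tests, 10% pass -> ratio 0.1
}

# Mean of per-op ratios: each op counts once regardless of test count.
per_op_scores = [sum(r) / len(r) for r in per_op_results.values()]
print(sum(per_op_scores) / len(per_op_scores))  # 0.55

# Pooled ratio: op_b's 100 tests dominate op_a's 2.
all_tests = [t for r in per_op_results.values() for t in r]
print(sum(all_tests) / len(all_tests))  # 12 / 102 ≈ 0.118
```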

Edit 1: BackendBench does multiple runs to measure performance as well.
