If we refactor the current metrics logic to compute pass@1 for each sampled file in isolation and keep the full array of per-sample values, we can get a proper standard deviation estimate that works for an arbitrary metric class. The estimate added in #757 only has partial coverage, because the current structure gives no clean access to all of the per-sample pass@1 values.
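
A rough sketch of the idea, not the existing code: it assumes each sampled file yields one list of predictions and that a metric can be reduced to a per-prediction scoring function. The names `per_sample_scores`, `mean_and_std`, and `score_fn` are hypothetical and only illustrate the shape of the refactor.

```python
import statistics
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def per_sample_scores(
    predictions_per_sample: Sequence[Sequence[T]],
    score_fn: Callable[[T], float],
) -> list[float]:
    """Compute pass@1 for each sampled file in isolation.

    predictions_per_sample[i] holds the predictions from the i-th sampled file.
    score_fn is a hypothetical per-prediction scorer (1.0 if correct, else 0.0).
    """
    return [
        sum(score_fn(p) for p in preds) / len(preds)
        for preds in predictions_per_sample
    ]


def mean_and_std(scores: Sequence[float]) -> tuple[float, float]:
    """Mean and sample standard deviation over the per-sample pass@1 values."""
    mean = statistics.fmean(scores)
    # The sample standard deviation needs at least two values.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


# Toy data: 3 sampled files, 4 problems each (1 = correct, 0 = incorrect).
samples = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0]]
scores = per_sample_scores(samples, score_fn=float)
mean, std = mean_and_std(scores)
```

Because the full array of per-sample values is kept, the same `mean_and_std` step would apply to any metric that can produce one value per sampled file, which is what would make the estimate independent of the metric class.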