Reproducibility of WER results in "Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic" #158

I am trying to reproduce the inference results of this paper:
https://arxiv.org/pdf/2507.13977

I followed the data processing steps provided in `datasets_configs/arabic/*`.

After running inference with the Hugging Face model provided by the paper,
I got a WER of 32.3% for `masc/test` and 39.3% for `masc_noisy/test`.

However, the paper reports 11.63% for the combined masc test set, which is a huge gap.
For the other test sets, my WERs are not identical either, but they are much closer to what the paper reports.
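
To make the comparison with the combined figure concrete, here is a minimal sketch of how a WER over two splits could be pooled. It is not the paper's evaluation script: the function names `edit_distance` and `corpus_wer` and the toy transcripts are hypothetical, and no text normalization is applied. It assumes WER is computed at corpus level, as total word-level edits divided by total reference words.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """Corpus-level WER: total edits / total reference words."""
    errors = 0
    words = 0
    for ref, hyp in pairs:
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    return errors / words if words else 0.0


# Hypothetical toy data standing in for the two splits.
clean = [("a b c d", "a b c d"), ("e f", "e x")]
noisy = [("a b", "x y")]

wer_clean = corpus_wer(clean)
wer_noisy = corpus_wer(noisy)
wer_combined = corpus_wer(clean + noisy)  # pooled, not a plain average
```

A pooled WER is a weighted average of the per-split WERs (weighted by reference word count), so it always lies between them. Under that assumption, combining 32.3% and 39.3% cannot produce 11.63%, which suggests the gap comes from somewhere other than how the two splits are merged.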

**Am I missing any crucial step to reproduce the masc WER results mentioned in the paper?**
Thanks.
