I am trying to reproduce the inference results reported in this paper
(https://arxiv.org/pdf/2507.13977).
I have followed the same data processing steps provided in datasets_configs/arabic/*.
After running inference with the Hugging Face model provided by the paper,
I got WERs of 32.3% on masc/test and 39.3% on masc_noisy/test, respectively.
However, the paper reports 11.63% for the combined masc test set, a gap of more than 20 percentage points.
For the other test sets, my WERs are still not identical, but they are much closer to what the paper reports.
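To rule out a difference in how the two masc numbers are combined, here is a minimal sketch of how I assume a combined WER would be computed: word-level edit errors summed over both subsets and divided by the total number of reference words. The function names and the toy sentences are my own, not taken from the paper or its code, and the paper may combine the subsets differently. Under this assumption the combined value always lies between the two per-subset WERs, so pooling alone cannot bring 32.3% and 39.3% down to 11.63%.

```python
def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words, keeping only one DP row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def pooled_wer(pairs: list[tuple[str, str]]) -> float:
    """WER over (reference, hypothesis) pairs: total errors / total reference words."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return errors / words if words else 0.0


# Toy placeholder data (hypothetical, not from masc): (reference, hypothesis) pairs.
clean = [("a b c d", "a b c d"), ("a b", "a x")]
noisy = [("a b c d", "a x y d")]

wer_clean = pooled_wer(clean)
wer_noisy = pooled_wer(noisy)
wer_combined = pooled_wer(clean + noisy)
print(f"clean={wer_clean:.3f} noisy={wer_noisy:.3f} combined={wer_combined:.3f}")
```

Note that this sketch applies no text normalization before scoring; it only illustrates the pooling step.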
Am I missing a crucial step needed to reproduce the masc WER results in the paper?
Thanks.