Author here. We built this because we kept seeing different word error rates (WER) for the same models depending on who was testing and how.
Normalization rules turned out to be a major source of these discrepancies, so we decided to release a fully reproducible evaluation framework. You can reproduce everything yourself from our full repo.
It includes:

- The normalization rules we use
- Scoring scripts
- Dataset coverage (conversational, noisy, multilingual)
- The full eval pipeline
We also published a detailed comparison using this framework across 8 leading STT providers, 7 datasets, and 74 hours of audio. You can see it here: https://www.gladia.io/competitors/benchmarks
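To make the normalization point concrete, here's a minimal sketch (not our actual pipeline; the `normalize` rules here are a toy stand-in for illustration) of how the same hypothesis can score a 100% or a 0% WER depending purely on text normalization:

```python
import re

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    # Standard edit-distance DP over word sequences.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

def normalize(text: str) -> str:
    """Toy normalizer: lowercase, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

ref = "Hello, Dr. Smith!"
hyp = "hello dr smith"
print(wer(ref, hyp))                          # 1.0 — every word "wrong" on casing/punctuation
print(wer(normalize(ref), normalize(hyp)))    # 0.0 — identical after normalization
```

A perfectly intelligible transcript scores 100% WER raw and 0% WER normalized, which is exactly why comparisons without a shared normalization scheme don't line up.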
Feedback welcome!