Far-Field Multi-Speaker Audio Is the Hardest ASR Problem

Audio quality is the largest source of variance in speech-to-text performance. Far-field, multi-speaker environments in particular remain difficult even for fine-tuned models.

Concrete numbers from market research transcription:

| Environment | Custom Model WER | Pay Service WER |
|---|---|---|
| Webcam (near-field, single speaker) | 13.2% | 21.0% |
| Far-field multi-speaker | 47.2% | 62.5% |

The custom model still wins in both settings, but its WER more than triples, from 13.2% in the webcam setting to 47.2% in the far-field multi-speaker setting. The authors note that “far-field multi-speaker performance is not suitable for productization.”
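For readers who want to reproduce this kind of comparison on their own recordings, here is a minimal sketch of how word error rate (WER) is commonly computed: the word-level edit distance between a reference transcript and the system output, divided by the number of reference words. The source does not say how its WER figures were calculated, so the function names, the whitespace tokenization, and the absence of text normalization are all assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper: assumes plain whitespace tokenization and no
    casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


def degradation_factor(baseline_wer: float, degraded_wer: float) -> float:
    """How many times larger the degraded WER is than the baseline WER."""
    return degraded_wer / baseline_wer


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Near-field to far-field degradation, using the WER figures quoted above.
    print(degradation_factor(13.2, 47.2))  # custom model
    print(degradation_factor(21.0, 62.5))  # pay service
```

Applied to the figures above, the far-field WER is roughly 3.6 times the near-field WER for the custom model and roughly 3.0 times for the pay service, so both systems degrade by a similar factor.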

This has practical implications: if your use case involves conference rooms, focus groups, or any setting where speakers are far from microphones, expect transcription quality to suffer. Hardware investments (better microphones, speaker isolation) may matter more than model improvements.
