Automatic speech recognition makes great strides

Who recognizes speech better, humans or machines? In noisy environments such as pubs, modern automatic speech recognition (ASR) systems achieve impressive accuracy and in some scenarios even outperform humans. Yet the comparison also shows just how remarkable human performance actually is.

Modern speech recognition systems achieve impressive accuracy in noisy environments. (Stock image: Unsplash.com)

In a recent study, UZH computational linguist Eleanor Chodroff and Chloe Patman from the University of Cambridge investigated how well modern ASR systems cope with challenging listening conditions. The systems tested were "wav2vec 2.0" from Meta and "Whisper large-v3" from OpenAI. The benchmark: the performance of native British English listeners.
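Both systems are publicly released, so their behavior is easy to probe. The following minimal sketch, which is purely illustrative and not the study's evaluation setup, transcribes an audio file with both model families via the Hugging Face transformers pipeline; the file name "sample.wav" is a placeholder, and the specific checkpoints are assumptions that appear to match the systems named in the article.

```python
# Illustrative sketch only; not the study's evaluation pipeline.
# Assumes the Hugging Face `transformers` library and a local file
# "sample.wav" (hypothetical).
from transformers import pipeline

# Whisper large-v3 (OpenAI)
whisper = pipeline("automatic-speech-recognition",
                   model="openai/whisper-large-v3")

# wav2vec 2.0 (Meta), the variant fine-tuned on 960 hours of LibriSpeech
wav2vec = pipeline("automatic-speech-recognition",
                   model="facebook/wav2vec2-base-960h")

for name, asr in [("whisper", whisper), ("wav2vec", wav2vec)]:
    print(name, "->", asr("sample.wav")["text"])
```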

The tests took place under adverse conditions, from speech-shaped noise to realistic pub noise, each with and without a cotton face mask. The result: humans performed better than the standard systems overall, but OpenAI's large model "Whisper large-v3" outperformed them in almost all scenarios. Only in pub noise was it on a par with human hearing.

Particularly striking was the ability of "Whisper large-v3" to recognize speech correctly even without supporting sentence context.

The decisive difference

Whisper's strong performance rests on gigantic amounts of training data. While Meta's "wav2vec 2.0" was trained on 960 hours of speech, OpenAI drew on more than 75 years of speech data (around 680,000 hours) for its standard system. The most powerful model even used more than 500 years of speech data. Humans, by contrast, develop comparable skills within just a few years, a point study leader Eleanor Chodroff finds remarkable. "In addition, automatic speech recognition in almost all other languages remains a major challenge."

Different sources of error

The study also showed that humans and machines fail in different ways. Humans almost always produced grammatically correct English, but often wrote down only sentence fragments. "wav2vec 2.0", in contrast, often generated incomprehensible gibberish under difficult conditions. "Whisper" also produced grammatically correct sentences, but filled gaps in the content with completely wrong information. (pd/swi)
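These failure modes can be made concrete with the standard ASR metric, word error rate (WER). The sketch below is a hedged illustration with invented example sentences, not data from the study; it simply shows that fragments, gibberish, and fluent but wrong output are all penalized, each for different reasons.

```python
# Hedged illustration with invented example sentences (not study data).
# Word error rate (WER) counts word substitutions, insertions, and
# deletions against a reference transcript. Requires `pip install jiwer`.
from jiwer import wer

reference = "the dog chased the ball across the yard"

hypotheses = {
    # human-style error: grammatical but incomplete (deletions)
    "fragment": "the dog chased the ball",
    # wav2vec-2.0-style error under heavy noise: garbled output
    "gibberish": "the dok chase da bal acros de yad",
    # Whisper-style error: fluent sentence, wrong content (substitutions)
    "hallucination": "the fog raced the call across the guard",
}

for name, hyp in hypotheses.items():
    print(f"{name:13s} WER = {wer(reference, hyp):.2f}")
```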


The study "Speech recognition in adverse conditions by humans and machines" by Chloe Patman and Eleanor Chodroff can be read in detail here.
