A new national benchmark for speech recognition in India, ‘Voice of India’, has exposed a critical performance crisis for global AI models in the Indian market. As voice becomes the primary digital interface for millions in India, the benchmark reveals that leading global systems, including those from OpenAI and Microsoft, struggle to accurately recognise how Indians actually speak, raising concerns about the readiness of voice-based AI models for one of the world’s largest and fastest-growing voice-first markets.
Developed by Josh Talks in collaboration with AI4Bharat at IIT Madras, Voice of India establishes a national standard for evaluating Automatic Speech Recognition (ASR) systems in India, delivering a comprehensive, methodologically rigorous evaluation framework designed specifically for Indian languages and real-world deployment conditions. Drawing on 15 languages and roughly 35,000 speakers, the results show that global "multilingual AI" claims often fall apart when tested against Indian accents, regional dialects, and code-switched speech.
Key Findings from the 'Voice of India' Report:
1. Sarvam Dominance in Indian Languages: Sarvam's models (Sarvam Audio) consistently rank #1 or #2 across almost every language and dialect tested, from major languages like Hindi and Bengali to regional ones like Odia and Assamese.
2. The "OpenAI Gap": There is a massive performance disparity for OpenAI models in Indian-language transcription. While Google Gemini remains competitive with Sarvam, OpenAI's GPT-4o models trail Sarvam by over 50 percentage points in accuracy on the overall average.
3. Dravidian vs. Indo-Aryan Performance: All models, including Sarvam, perform significantly better in Indo-Aryan languages (Hindi and Bengali at ~5-6% WER) than in Dravidian languages (Tamil, Telugu, Malayalam, and Kannada).
4. Dialect Difficulty: Global speech systems often treat "Hindi" as a single, standardized language. In reality, Hindi encompasses major dialects such as Bhojpuri and Chhattisgarhi, each spoken by tens of millions of people; Bhojpuri alone has over 50 million speakers, more than the population of most European countries. Yet these dialects remain among the most challenging for AI systems. Even the best models see a sharp decline in performance, with error rates jumping to 20-30%, compared with the sub-10% seen for standard Hindi.
5. Global Player Struggles: Large global tech players like Meta and Microsoft struggle significantly with regional Indian languages. For example, in Tamil and Malayalam, Meta's error rates are often double or triple those of Sarvam and Google.
6. Urdu Performance: Despite being linguistically similar to Hindi, OpenAI models perform poorly in Urdu (35.4% WER), while Sarvam Audio maintains high accuracy (6.95% WER).
7. Meta’s Efficiency Gap: Meta’s massive 7B-parameter model is only ~4% more accurate than its much smaller 1B-parameter model on average across Indian languages.
8. Limited Language Support: Microsoft STT returns "Not Supported" for nearly half the languages tested (6 of 15), including major regional languages like Punjabi, Odia, and Kannada.
9. The Functional Failure: Despite ChatGPT's global popularity, OpenAI's latest transcription model (GPT-4o mini transcribe) struggles immensely with Indian speech, with over 55% WER. In languages like Maithili and Tamil, it fails to transcribe nearly two out of every three words correctly.
Note: Full language-wise and demographic leaderboards are available in the public release.
Testing AI on how India actually speaks
The benchmark evaluates ASR performance using conversational speech collected from approximately 2,000 speakers per language. The dataset spans a wide range of age groups, genders, regions, socio-economic backgrounds, device types, and acoustic environments. Unlike many existing evaluations, Voice of India explicitly includes code-switched speech, such as Hindi-English, Tamil-English, and Urdu-Hindi, as well as background noise and the informal speaking styles common in everyday Indian conversations.
Beyond dialect labels, the benchmark incorporates cluster-based geographic sampling across districts to capture how speech actually varies within a language’s footprint. In India, pronunciation and vocabulary can shift significantly within 50–100 kilometers. By enforcing structured geographic clusters, the evaluation measures not just language support but robustness across regional variation, a dimension often invisible in global benchmarks.
This design reflects how Indians actually interact with voice systems, rather than how models perform under idealised conditions.
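Accuracy throughout the report is expressed as word error rate (WER): the word-level edit distance between a model's transcript and a reference transcript, divided by the number of reference words. The sketch below is a minimal, illustrative Python implementation of this standard metric; the utterance is a hypothetical Hindi-English code-switched sentence, not drawn from the benchmark data.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # edit distances against an empty reference
    for i in range(1, len(ref) + 1):
        cur = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cur[j] = min(
                prev[j] + 1,      # a reference word was dropped
                cur[j - 1] + 1,   # a spurious word was inserted
                prev[j - 1] + (ref[i - 1] != hyp[j - 1]),  # a word was substituted
            )
        prev = cur
    return prev[-1] / len(ref)

# Two misrecognised words out of eight already put a short utterance
# into the 20-30% error band reported above for Hindi dialects.
reference  = "mujhe kal ka train ticket book karna hai"
hypothesis = "mujhe kal ka train ticket look karna he"
print(f"WER = {wer(reference, hypothesis):.1%}")  # WER = 25.0%
```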
Prof. Mitesh Khapra of AI4Bharat at IIT Madras said, “This is one of the most rigorous large-scale evaluations of speech recognition for Indian languages, containing district-level cohorts with balanced representation across gender and age to truly reflect India’s diversity. Further, recognising that conventional word error rate can unfairly penalize code-mixed and multilingual speech, we manually curated multiple valid spelling variants for transcripts, ensuring models are judged for linguistic correctness rather than orthographic variation. This human-intensive effort sets a new benchmark for fair and representative ASR evaluation in India.”
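The variant-aware scoring Prof. Khapra describes can be pictured with a short extension of the same sketch. Assuming each transcript carries several manually curated reference spellings (the utterances below are invented for illustration), a model is judged against whichever variant it matches best:

```python
def wer(reference: str, hypothesis: str) -> float:
    # Same word-level edit-distance WER as in the previous sketch.
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        cur = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (ref[i - 1] != hyp[j - 1]))
        prev = cur
    return prev[-1] / len(ref)

def min_wer(variants: list[str], hypothesis: str) -> float:
    """Score against the closest curated reference variant, so valid
    spellings of code-mixed words are not penalised as errors."""
    return min(wer(v, hypothesis) for v in variants)

# Hypothetical example: "bank balance" can be validly written in Latin
# or Devanagari script, so both references below are correct transcripts.
variants = ["mera bank balance check karo",
            "mera बैंक बैलेंस check karo"]
hyp = "mera बैंक बैलेंस check karo"
print(f"naive WER:     {wer(variants[0], hyp):.0%}")   # 40% -- unfairly penalised
print(f"variant-aware: {min_wer(variants, hyp):.0%}")  # 0%  -- judged correct
```

Taking the minimum over variants means a model that picks the Devanagari spelling of an English loanword is not penalised relative to one that picks the Latin spelling; only genuine recognition mistakes count against it.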
Speaking on the benchmark, Shobhit Banga, Co-Founder of Josh Talks, said, “The Voice of India benchmark is less about the gaps of today and more about the roadmap for tomorrow. The data shows that when we build AI that understands the soul of Indian speech, our dialects, our accents, and our rural context, we can unlock a level of digital inclusion that was previously unimaginable. We are moving towards a future where voice isn't just a feature, but a reliable bridge to opportunity for every Indian.”
Why this matters: voice as critical infrastructure
The release of the benchmark comes ahead of the India AI Summit, as global technology companies increasingly position voice as a key interface for digital services.
As voice increasingly becomes the primary interface for accessing banking, healthcare, and government services, a word error rate of 20–30% is not merely a technical metric. In practice, it can mean a welfare application misunderstood, a medical symptom mis-transcribed, a customer complaint routed incorrectly, or a farmer’s query answered in the wrong language. When ASR fails in India, the cost is often borne quietly by the user.
A Benchmark for Public Conversation
Voice of India follows a hybrid release model: the methodology and benchmark design are published openly alongside a limited public validation split, while a predominantly private blind test set is retained to prevent training leakage and leaderboard overfitting, ensuring results are methodologically rigorous, trustworthy, and reflective of true generalization to unseen, real-world Indian speech. The intent is not to single out individual systems, but to provide neutral measurement infrastructure that grounds claims about voice AI in evidence.
By making disparities visible, the benchmark aims to encourage deeper investment in India-specific evaluation and model optimization, and to inform discussions around standards, accountability, and the responsible deployment of voice AI in public-facing systems.
As voice-driven AI adoption accelerates, the benchmark raises a clear challenge for global labs: speech systems cannot scale in India unless they can reliably recognise Indian voices, languages, and ways of speaking.