ZentroTECH
Voice AI · 11 min read

Voice AI in Hindi, Tamil, Kannada, Telugu: What 2026 Changed (Sarvam, Krutrim, AI4Bharat, ElevenLabs)

ZentroTECH Team · May 24, 2026

For the better part of a decade, "voice AI in Indian languages" was the demo that worked beautifully on stage in Bangalore and broke the moment you tried it on a real call from a real customer in Sangli. The latency was 1.5–3 seconds. Hinglish broke the speech recogniser. Tamil-English code-mixing was an absolute disaster. And the cost per minute was higher than just hiring a human.

2026 is the year that wall finally fell. We've put real production voice agents into Indian SMB phone trees this quarter — for clinics, salons, lead-recovery campaigns, and B2B sales — and the experience is no longer "demo that mostly works." It's "thing that holds a conversation with a Marathi-speaking aunty for four minutes and books an appointment correctly 9 times out of 10."

This is the playbook. We'll cover what changed technically, who's winning (Sarvam, AI4Bharat, ElevenLabs), what happened to Krutrim, what it actually costs in INR, three SMB use cases that pay back fast, and the TRAI/DPDP compliance corners you must not cut.

What changed in 2026

Three structural improvements landed together this year and changed the economics.

Latency finally crossed the conversational floor

Voice AI feels natural under ~400ms of end-to-end latency (user stops speaking → agent starts responding). Above ~700ms, it feels like a video call on a bad connection. Above 1 second, customers hang up.

Until late 2025, Indian-language voice agents almost always sat in the 800–1500ms band because the speech-to-text → LLM → text-to-speech pipeline had to round-trip through US-hosted models that weren't trained on Indian phonemes.

In 2026, Sarvam's Saaras STT and Bulbul v3 TTS run inside Indian data centres, frontier LLMs route through Asia regions, and the whole pipeline can be tuned to sub-400ms on a stable broadband line. This is the single biggest unlock.
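A latency budget makes these thresholds concrete. The sketch below sums per-stage latencies for one turn and classifies the total against the ~400ms, ~700ms, and 1 second marks quoted above. The stage breakdown, the `TurnLatency` and `classify` names, and the "borderline" label for the 400–700ms band are illustrative assumptions, not measurements from any provider.

```python
from dataclasses import dataclass

# Thresholds quoted in the article (milliseconds).
NATURAL_MS = 400    # at or under this, the exchange feels natural
LAGGY_MS = 700      # above this, it feels like a bad video call
HANGUP_MS = 1000    # above this, callers tend to hang up


@dataclass
class TurnLatency:
    """Hypothetical per-stage latency for one conversational turn, in ms."""
    stt_ms: float      # end of user speech -> final transcript
    llm_ms: float      # transcript -> first response token
    tts_ms: float      # first token -> first audio byte
    network_ms: float  # telephony and inter-service overhead

    def total_ms(self) -> float:
        return self.stt_ms + self.llm_ms + self.tts_ms + self.network_ms


def classify(total_ms: float) -> str:
    """Map an end-to-end latency onto the bands described in the text."""
    if total_ms <= NATURAL_MS:
        return "natural"
    if total_ms <= LAGGY_MS:
        return "borderline"  # assumption: the article does not name this band
    if total_ms <= HANGUP_MS:
        return "laggy"
    return "hangup-risk"


turn = TurnLatency(stt_ms=120, llm_ms=150, tts_ms=90, network_ms=30)
print(turn.total_ms(), classify(turn.total_ms()))
```

Tracking the stages separately shows which one to tune first when a turn drifts out of the natural band.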

Code-mixed speech (Hinglish, Tanglish, Manglish) actually works

The way Indians actually speak is not "Hindi" or "English" — it's a fluid mix where the verb might be English, the noun Hindi, the politeness particles Hindi, and a number switched mid-sentence to English. Until 2026, every speech recogniser would either over-correct to one language or hallucinate gibberish.

Sarvam's Saaras STT and Bulbul v3 TTS crossed the threshold this year, alongside open-source foundations such as AI4Bharat's IndicTrans2 translation model (covering all 22 scheduled Indian languages). Hinglish that flips three times in a sentence is now reliably transcribed and reliably synthesised back. We see this work cleanly across Hindi-English, Tamil-English, Telugu-English, Kannada-English, and Marathi-English. Bengali and Malayalam are close behind.

Cost dropped by roughly 5×

A minute of conversational voice AI in an Indian language now costs ₹2–6 all-in (STT + LLM + TTS + telephony minute) depending on which stack you pick. In 2024 the same minute cost ₹15–25. The human-agent equivalent is ₹15–20 per active call minute when you bake in fully-loaded cost.

The cost crossover is real. For the first time, AI voice agents are cheaper than human voice agents for any workflow that doesn't strictly require human judgment.

The Indian voice-AI landscape, May 2026

Four providers matter for SMB use cases.

Sarvam AI — the default for production

Sarvam is now the default pick for Indian-language voice AI production workloads, and it earned that position fairly.

  • Bulbul v3 TTS — Sarvam's text-to-speech model. Supports 11 languages (10 Indian + English with Indian accent): Hindi, Bengali, Tamil, Telugu, Gujarati, Kannada, Malayalam, Marathi, Punjabi, Odia, and English. 30+ speaker voices. Up to 2,500 characters per request.
  • Saaras STT — Sarvam's speech-to-text. Production-grade across the same language set.
  • Pricing (May 2026): ~₹30/hour for STT, ₹15–30 per 10K characters for TTS, ₹20 per 10K characters for translation. ₹1,000 in free credits on signup.
  • Sarvam offered unlimited Bulbul v3 access through February 2026 to build adoption. That window has closed, but pay-per-use pricing remains aggressive.
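To see how the listed prices translate into a per-minute figure, here is a rough sketch of the speech portion only (STT plus TTS). The 400 characters of agent speech per call minute is a hypothetical input, and LLM and telephony costs are deliberately left out, so this is a lower bound on the all-in number rather than a quote.

```python
# Prices as listed above (May 2026).
STT_INR_PER_HOUR = 30.0
TTS_INR_PER_10K_CHARS = (15.0, 30.0)  # low and high end of the quoted range


def speech_cost_per_minute(agent_chars_per_minute: float) -> tuple[float, float]:
    """Return (low, high) INR per call minute for STT + TTS only.

    agent_chars_per_minute is an assumption you should measure on your own
    calls; it is the number of characters the agent speaks per call minute.
    """
    stt = STT_INR_PER_HOUR / 60  # STT is billed on audio duration
    tts_units = agent_chars_per_minute / 10_000
    low = stt + tts_units * TTS_INR_PER_10K_CHARS[0]
    high = stt + tts_units * TTS_INR_PER_10K_CHARS[1]
    return (round(low, 2), round(high, 2))


print(speech_cost_per_minute(400))  # (1.1, 1.7)
```

Under these assumptions the speech layer comes to roughly ₹1–2 per minute, and the LLM and telephony would make up the rest of the per-minute cost.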

For most SMB voice-agent builds in 2026, Sarvam is what we install by default unless there's a specific reason to deviate.
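One practical consequence of the 2,500-character request limit listed above: longer agent responses need to be split before synthesis. The helper below is a minimal sketch that packs whole sentences into chunks under the limit. The function name and the sentence-splitting rule (including the Devanagari danda "।") are our own assumptions, and no Sarvam API call is shown.

```python
import re

MAX_CHARS = 2500  # per-request limit quoted for Bulbul v3


def chunk_for_tts(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    # Split after ., !, ? or the Devanagari danda, consuming the whitespace.
    sentences = re.split(r"(?<=[.!?।])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # Hard-split any single sentence that is itself over the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Most conversational turns are far shorter than the limit, so this mainly matters for long read-outs such as terms, policy text, or itemised statements.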

AI4Bharat — the open-source backbone

AI4Bharat is the research lab at the Wadhwani School of Data Science & AI, IIT Madras that has been quietly producing the open-source foundations the entire Indian voice-AI ecosystem stands on.

  • IndicTrans2 — translation model that covers all 22 scheduled languages of India.
  • IndicBERT, IndicBART, Airavata — multilingual LLMs tailored for Indian languages.
  • Plus speech models, transliteration tools, and a 15,000-hour Bhashini data initiative spanning 400+ districts.

For SMBs, AI4Bharat is rarely something you "use directly" — it's the layer that Sarvam, Krutrim, and a dozen indie startups build on top of. But if you have data-residency or open-source requirements, you can self-host AI4Bharat models and avoid every commercial API.

Krutrim — the cautionary tale and the pivot

This story moved fast and a lot of Indian founders are still working off old information.

  • 2024: Ola's Krutrim becomes India's first GenAI unicorn. Big ambitions on chips, foundation models, and a consumer AI assistant called Kruti.
  • April 2026: Krutrim quietly takes Kruti offline across app stores and the web. Per SensorTower, lifetime downloads were under 500,000. The consumer play didn't work.
  • May 2026: Bhavish Aggarwal-led Krutrim formally announces it has paused chip and foundation-model work and is pivoting entirely to B2B cloud infrastructure (Krutrim Cloud).
  • The company now reports ~25 enterprise customers across telecom, finance, healthcare, logistics, and manufacturing. It also reports ~₹300 crore in FY26 revenue (3× year-on-year) and a first-time profit. Headcount dropped from ~550 (Aug 2025) to ~150–160 (Mar 2026).

What does this mean for SMBs? Don't build new voice-AI workloads on Krutrim today. The company is mid-pivot and the consumer-facing models are dead. If you want Indian-hosted cloud GPUs for your own self-hosted models, Krutrim Cloud is a viable option alongside Yotta and CtrlS. But for STT/TTS in production today, look elsewhere.

ElevenLabs — for English-heavy or premium voices

ElevenLabs remains the gold standard for English voice quality and offers some Indian-accent English voices, but its native Indian-language support lags Sarvam in 2026. We use ElevenLabs in voice agents where the conversation is predominantly English (B2B sales for global customers, premium concierge use cases) and Sarvam everywhere else.

A common pattern in production: Sarvam for STT and Indian-language TTS, ElevenLabs for the rare English-only persona where premium voice quality matters more than ₹0.50 per minute of cost difference.
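That routing rule is simple enough to express in a few lines. The sketch below is illustrative only: the function name, the language codes, and the `premium_english_persona` flag are assumptions, and the return values are plain labels rather than real SDK clients.

```python
def pick_tts_provider(language: str, premium_english_persona: bool = False) -> str:
    """Choose a TTS provider label for one agent persona.

    Mirrors the pattern described above: ElevenLabs only for an English-only
    persona where premium voice quality is required, Sarvam everywhere else.
    """
    is_english = language.strip().lower() in {"en", "english"}
    if is_english and premium_english_persona:
        return "elevenlabs"
    return "sarvam"


print(pick_tts_provider("hi"))                                # sarvam
print(pick_tts_provider("en", premium_english_persona=True))  # elevenlabs
```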

Three SMB voice-agent use cases that pay back fast

We've deployed these for Bangalore SMBs in the last two quarters. Each one has a clear ROI window under 60 days.

Use case 1: Payment recovery / collections calls

The pain: a Whitefield-based small lender has 800 active loans and 12% are 30+ days past due. Each collection call requires a human agent, takes 4–6 minutes, costs ~₹80–120 fully loaded, and the success rate of getting a same-day commitment is roughly 18%.

The voice-agent build:

  • Sarvam Saaras + Bulbul v3 in Hindi-English code-mixed mode.
  • Claude Sonnet 4.6 as the conversation brain.
  • A short, scripted call: greet, identify the borrower, state the overdue amount in INR, offer a payment plan, send a Razorpay link via WhatsApp if they agree, handle three common objections.
  • All-in cost per call: ~₹5–8.

The result we typically see: a 60% pickup rate, a ~14% same-day payment commitment rate (somewhat below the ~18% human baseline), and roughly 10× better cost efficiency than human agents. Our payment recovery automation clients use this pattern at scale.
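The cost-efficiency claim can be checked with the figures given in this use case. The sketch below compares cost per same-day commitment, using midpoints of the quoted ranges (₹100 per human call, ₹6.50 per AI call). It assumes both commitment rates are measured per call attempted, which the figures above do not spell out, so treat the ratio as indicative.

```python
def cost_per_commitment(cost_per_call_inr: float, commitment_rate: float) -> float:
    """Cost in INR of securing one same-day payment commitment."""
    if not 0 < commitment_rate <= 1:
        raise ValueError("commitment_rate must be in (0, 1]")
    return cost_per_call_inr / commitment_rate


# Midpoints of the ranges quoted above; the midpoint choice is ours.
human = cost_per_commitment(cost_per_call_inr=100.0, commitment_rate=0.18)
agent = cost_per_commitment(cost_per_call_inr=6.5, commitment_rate=0.14)

print(round(human, 2))          # 555.56
print(round(agent, 2))          # 46.43
print(round(human / agent, 1))  # 12.0
```

Even with a lower commitment rate, the per-commitment cost lands in the same order of magnitude as the roughly 10× figure.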

Use case 2: Inbound customer support for high-volume verticals

The pain: a Koramangala D2C brand gets 200 customer-service calls a day, 70% of which are "where's my order?" or "I need to change my delivery address."

The voice-agent build:

  • Sarvam STT/TTS, multilingual prompt covering Hindi, Tamil, Kannada, English.
  • Connected via MCP to the brand's Shopify backend (order status), Shiprocket (shipping ETA), and Zoho Desk (ticket creation if escalation needed).
  • Falls back to a human agent if the customer says "speak to someone" or if the AI confidence drops below threshold.
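The fallback rule in the last bullet might look like the sketch below. The phrase list, the 0.6 threshold, and the idea of a single per-turn confidence score are all assumptions for illustration. Real stacks expose confidence in different ways, and a production phrase list would cover each supported language.

```python
# Hypothetical escalation phrases; extend per language in a real deployment.
ESCALATION_PHRASES = ("speak to someone", "talk to a human", "real person")

# Hypothetical threshold; tune it against reviewed call transcripts.
CONFIDENCE_THRESHOLD = 0.6


def should_escalate(transcript: str, confidence: float,
                    threshold: float = CONFIDENCE_THRESHOLD) -> bool:
    """Return True if this turn should be handed to a human agent."""
    text = transcript.lower()
    if any(phrase in text for phrase in ESCALATION_PHRASES):
        return True  # an explicit request always wins
    return confidence < threshold


print(should_escalate("I want to speak to someone please", 0.95))  # True
print(should_escalate("Where is my order?", 0.90))                 # False
print(should_escalate("Where is my order?", 0.40))                 # True
```

Checking the explicit request before the confidence score means a caller who asks for a human is never kept with the AI because the model happened to be confident.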

Result: ~85% of calls handled fully autonomously. Human agents now handle only the 15% that require judgment. CSAT actually went up because the AI never made customers wait in a queue.

Use case 3: Outbound sales qualification

The pain: an Indiranagar coaching business buys 3,000 leads a month from various sources. Their inside-sales team can only meaningfully follow up on the first ~600. The other 2,400 leads die in the CRM.

The voice-agent build:

  • Sarvam in Hindi-English code-mixed mode.
  • 90-second qualifying call asking 3 questions: are you the decision-maker, are you considering for self or someone else, what's your timeline.
  • Hot leads booked into the sales team's calendar via the CRM integration.
  • Lukewarm leads dropped into a 7-touch WhatsApp nurture sequence.
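The routing step after the three questions can be sketched as a small rule. The "hot" criteria below (decision-maker with a timeline of 30 days or less) are a hypothetical example, not the rule used in the deployment described here, and the field and function names are ours.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QualifyingAnswers:
    """Answers captured during the 90-second qualifying call."""
    is_decision_maker: bool
    for_self: bool                 # recorded for the sales team, not scored here
    timeline_days: Optional[int]   # None if the lead gave no timeline


def route_lead(answers: QualifyingAnswers, hot_timeline_days: int = 30) -> str:
    """Return 'book_meeting' for hot leads, 'whatsapp_nurture' otherwise."""
    has_near_timeline = (
        answers.timeline_days is not None
        and answers.timeline_days <= hot_timeline_days
    )
    if answers.is_decision_maker and has_near_timeline:
        return "book_meeting"
    return "whatsapp_nurture"


print(route_lead(QualifyingAnswers(True, True, 14)))     # book_meeting
print(route_lead(QualifyingAnswers(True, False, None)))  # whatsapp_nurture
```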

Result: the dead 2,400 leads now produce ~80–120 qualified meetings per month at a fully-loaded cost of ~₹15,000. The human sales team only talks to leads that have already self-qualified.

TRAI, DND, and recording-consent — the compliance corners

This is the section most voice-AI vendors quietly skip. Skipping it can cost you ₹10,000 per violation per call, plus reputational damage. Please don't.

TRAI TCCCPR 2018 + the 2024–2025 amendments

India's telecom commercial communications rules now require:

  • Caller ID must clearly identify your business. No spoofed numbers, no "private number" tricks.
  • Outbound calls during 9am–9pm only. Period. Voice AI doesn't get an exemption.
  • DND scrubbing before every outbound campaign. If a number is on the National DND register (Cat A/B/etc) and you don't have explicit opt-in for the category you're calling about, you cannot call it.
  • A free opt-out mechanism on every call. The voice agent must reliably honour "don't call me again" — and your CRM must record the suppression.
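A pre-call gate that enforces the calling window, DND scrubbing, and opt-out suppression could be sketched as follows. The in-memory sets stand in for whatever register lookup and CRM suppression list you actually use, opt-in is simplified to one set rather than tracked per category, and the function name is an assumption. This is an illustration of the checks, not legal advice.

```python
from datetime import datetime, time, timedelta, timezone

IST = timezone(timedelta(hours=5, minutes=30))
WINDOW_START = time(9, 0)   # 9am IST
WINDOW_END = time(21, 0)    # 9pm IST


def can_place_call(number: str, now: datetime, dnd_numbers: set[str],
                   opted_in: set[str], suppressed: set[str]) -> tuple[bool, str]:
    """Decide whether an outbound call is allowed right now.

    `now` must be timezone-aware; it is converted to IST before checking.
    """
    local_time = now.astimezone(IST).time()
    if not (WINDOW_START <= local_time < WINDOW_END):
        return False, "outside calling window"
    if number in suppressed:
        return False, "previously opted out"
    if number in dnd_numbers and number not in opted_in:
        return False, "on DND register without opt-in"
    return True, "ok"


morning_utc = datetime(2026, 5, 24, 4, 0, tzinfo=timezone.utc)  # 09:30 IST
print(can_place_call("+910000000001", morning_utc, set(), set(), set()))
```

Running this gate immediately before dialling, rather than once per campaign, also catches opt-outs recorded earlier the same day.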

Recording consent

Two-party consent rules in India are unclear at the edges, but the safe pattern is: announce that the call is being recorded for quality and training purposes at the start of every call, in the caller's language. Most production voice-agent stacks bake this into the system prompt.

DPDP Act considerations

Voice transcripts contain personal data. Storage, retention, and processing of transcripts now fall under the Digital Personal Data Protection Act. Practical implications:

  • Have a clear retention policy (90 days is a common default).
  • If your STT provider stores audio, get a DPA in place.
  • Don't ship transcripts to non-Indian regions without a lawful basis.
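A retention policy is only useful if something enforces it. The sketch below selects transcripts older than the retention window for deletion. The dictionary stands in for your real transcript store, the function name is ours, and the 90-day default mirrors the common default mentioned above rather than a figure mandated by the Act.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90  # common default mentioned above; set per your own policy


def expired_transcript_ids(transcripts: dict[str, datetime], now: datetime,
                           retention_days: int = RETENTION_DAYS) -> list[str]:
    """Return IDs of transcripts created before the retention cutoff.

    `transcripts` maps transcript ID -> timezone-aware creation time.
    """
    cutoff = now - timedelta(days=retention_days)
    return sorted(tid for tid, created in transcripts.items() if created < cutoff)


now = datetime(2026, 5, 24, tzinfo=timezone.utc)
store = {
    "call-001": datetime(2026, 1, 10, tzinfo=timezone.utc),  # older than 90 days
    "call-002": datetime(2026, 5, 1, tzinfo=timezone.utc),   # within retention
}
print(expired_transcript_ids(store, now))  # ['call-001']
```

A scheduled job would call this daily and delete both the transcript and any stored audio for the returned IDs.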

This is exactly the compliance layer we install on top of voice-agent builds — and it's the layer that most cheap "voice AI agency" offers leave out. Skip at your own risk.

The build vs buy question

Three options for an Indian SMB that wants a voice agent in production this quarter:

  1. Buy a vertical SaaS (e.g., a "voice agent for clinics" product). Fastest to deploy. ~₹30,000–₹80,000/month, often with hard limits on customisation and integrations.
  2. Use a horizontal voice-agent platform (Vapi, Retell, Sarvam Agents). Mid-effort. ~₹10,000/month platform + per-minute. More flexible. Still some lock-in.
  3. Build on the primitives (Sarvam STT/TTS + your LLM of choice + your own MCP-connected backend + telephony like Plivo or Exotel). Highest upfront effort, lowest run-cost (₹2–6/min), full control, no lock-in.
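A quick break-even check helps frame the choice between option 1 and option 3. The sketch below computes the monthly call minutes at which a flat SaaS fee equals the per-minute run-cost of a primitives build. It ignores the upfront build effort of option 3, and it leaves option 2 out because its per-minute rate is not stated above.

```python
def breakeven_minutes(flat_monthly_fee_inr: float, per_minute_inr: float) -> float:
    """Monthly minutes at which per-minute run-cost equals a flat fee."""
    if per_minute_inr <= 0:
        raise ValueError("per_minute_inr must be positive")
    return flat_monthly_fee_inr / per_minute_inr


# Cheapest SaaS plan against the most expensive per-minute rate, and vice versa.
print(breakeven_minutes(30_000, 6.0))  # 5000.0
print(breakeven_minutes(80_000, 2.0))  # 40000.0
```

Below the break-even volume the primitives build has the lower run-cost, and above it the flat fee does, before accounting for the SaaS plan's usage limits and the one-time build cost.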

We almost always recommend option 3 for SMBs serious about voice as a long-term channel — but we recommend they don't build it alone. The integrations, the compliance layer, and the eval suite are the parts that take three weeks instead of three months when you've done it before. This is exactly what our Bangalore AI consulting team ships as a fixed-scope engagement.

What to do this week

Three concrete actions for any Indian SMB owner reading this:

  1. Pick the one phone workflow you most wish was automated. Collections, appointment reminders, lead qualification, inbound FAQs — pick one, not all.
  2. Estimate the unit economics today. What does it cost per call to do this with humans? What's the success rate? How many calls per month?
  3. Compare against ₹2–6 per AI minute at ~80% success rate. If the math is wildly in favour of AI, that's your first voice-agent build.

The era of "Indian voice AI is a future thing" ended in early 2026. The era of "Indian voice AI is a competitive moat for the SMBs who deploy it first" started immediately after. The next 12 months will sort the SMBs who treated this seriously from the ones who waited.

A reference stack we deploy in 2026

For most Indian SMB voice-agent builds in 2026, the reference architecture we install looks like:

  • Telephony: Plivo or Exotel (Indian numbers, programmable SIP, sensible rates).
  • STT: Sarvam Saaras for any Indian-language traffic.
  • LLM: Claude Sonnet 4.6 for conversation quality; Gemini Flash for cheap intent classification.
  • TTS: Sarvam Bulbul v3 (Indian-language voices); ElevenLabs only when premium English is required.
  • Tools layer: MCP servers for Razorpay, Zoho, the client's own CRM, and WhatsApp Business.
  • Observability: call recording, transcripts, and per-turn latency captured to a Postgres table; weekly review of edge-case calls.
  • Compliance: TRAI scrubbing pre-call, consent announcement at call start, DPDP retention policy on transcripts, opt-out auto-honoured in the CRM.
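The observability item is the easiest to start with. The sketch below records per-turn latency and lists calls worth reviewing. It uses Python's built-in sqlite3 as a stand-in for the Postgres table mentioned above so that it runs anywhere. The table name, columns, and 700ms review threshold are assumptions.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory table for per-turn latency (stand-in for Postgres)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE turn_latency (
            call_id    TEXT    NOT NULL,
            turn_index INTEGER NOT NULL,
            latency_ms REAL    NOT NULL,
            PRIMARY KEY (call_id, turn_index)
        )
        """
    )
    return conn


def record_turn(conn: sqlite3.Connection, call_id: str,
                turn_index: int, latency_ms: float) -> None:
    conn.execute(
        "INSERT INTO turn_latency (call_id, turn_index, latency_ms) VALUES (?, ?, ?)",
        (call_id, turn_index, latency_ms),
    )


def slow_calls(conn: sqlite3.Connection, threshold_ms: float = 700.0) -> list[str]:
    """Return IDs of calls where any turn exceeded the threshold."""
    rows = conn.execute(
        "SELECT call_id FROM turn_latency GROUP BY call_id "
        "HAVING MAX(latency_ms) > ? ORDER BY call_id",
        (threshold_ms,),
    ).fetchall()
    return [row[0] for row in rows]


conn = open_store()
record_turn(conn, "call-a", 0, 380.0)
record_turn(conn, "call-a", 1, 910.0)
record_turn(conn, "call-b", 0, 350.0)
print(slow_calls(conn))  # ['call-a']
```

The output of `slow_calls` is a natural input for the weekly edge-case review.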

This stack runs end-to-end at ₹3–6 per active minute all in, scales to thousands of concurrent calls without breaking, and has been battle-tested across half a dozen Bangalore SMB clients this quarter.

The honest limitations

Indian-language voice AI in 2026 is genuinely good, but it is not magic. The corners where it still breaks:

  • Very noisy environments. Construction sites, busy markets, moving auto-rickshaws — STT accuracy drops sharply. Human handoff is the only good answer.
  • Heavy regional accents. Sarvam handles standard urban accents brilliantly but can stumble on heavy Bhojpuri, deep Tirunelveli Tamil, or coastal Karnataka dialects. Most calls work; some don't. Build the fallback path.
  • Long emotional conversations. Collections from a genuinely distressed borrower, sensitive medical conversations — the model is technically capable, but ethically you want a human.
  • First-call cold outbound to strangers. Cultural friction is high. Voice AI works much better on warm leads or inbound calls where the customer initiated contact.

Knowing where not to deploy is as important as knowing where to deploy. SMBs who try to force voice AI into every workflow end up with worse outcomes than SMBs who pick three to five workflows and execute them well.


If you want a 45-minute call to scope your first voice-agent workflow — including the compliance layer, the integration map, and a realistic cost model — that's a conversation we have with five Bangalore SMBs every week. Get in touch and we'll come prepared with examples from your specific vertical.
