From Machine Learning

[P] Added 8 Indian languages to Chatterbox TTS via LoRA: 1.4% of parameters, no phoneme engineering

TL;DR:
Fine-tuned Chatterbox-Multilingual (Resemble AI's open-source TTS) to support Telugu, Kannada, Bengali, Tamil, Malayalam, Marathi, Gujarati, and Hindi using LoRA adapters + tokenizer extension. Only 7.8M of 544M parameters (about 1.4%) are trained. Model + audio samples available.

---

The Problem

Chatterbox-Multilingual supports 23 languages with zero-shot voice cloning, but no Dravidian languages (Telugu, Kannada, Tamil, Malayalam) and limited Indo-Aryan coverage beyond Hindi. That's 500M+ speakers with no representation.

The conventional approach would be to build a G2P (grapheme-to-phoneme) front end for each language, retrain the full model, and spend months on it. Hindi schwa deletion alone is an unsolved problem, and Bengali G2P is notoriously hard.

The Approach

Instead of phonemes, I went grapheme-level:

  1. Tokenizer extension: Extended the BPE tokenizer with Indic script characters (2454 → 2871 tokens, i.e. 417 new tokens). Telugu, Kannada, Bengali, Tamil, Malayalam, and Gujarati graphemes were added alongside the existing Devanagari.

  2. Brahmic warm-start: Initialized the new character embeddings from phonetically equivalent Devanagari characters. Telugu "క" (ka) gets initialized from Hindi "क" (ka). This works because Brahmic scripts share phonetic structure: same sounds, different glyphs. The model starts with a reasonable prior instead of random noise.

  3. LoRA on T3 backbone: Rank-32 adapters on the q/k/v/o projections of the Llama-based T3 module. ~7.8M trainable params (1.4% of 544M total). Everything else is frozen: vocoder (S3Gen), speaker encoder, speech tokenizer.

  4. Incremental language training: Added languages one at a time with weighted sampling. Started with Hindi only (to validate the pipeline), then Telugu+Hindi, then Kannada+Telugu+Hindi, and finally all 8 languages. This helps avoid catastrophic forgetting: Hindi CER actually improved after the 7 other languages were added.
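To make the warm-start step concrete, here is a minimal sketch of one way to find the Devanagari counterpart of a new character and copy its embedding. It is not the code used for this model. It assumes the mapping can be derived from Unicode block offsets (the Indic blocks share an ISCII-derived layout, so "క" U+0C15 and "क" U+0915 sit at the same offset), and it uses plain Python lists in place of the real embedding matrix. The names `devanagari_counterpart` and `warm_start` are hypothetical.

```python
import random
import unicodedata

# Start of each 128-code-point Unicode block. The Indic blocks follow a
# common ISCII-derived layout, so equal offsets usually mean the same letter.
DEVANAGARI_BASE = 0x0900
BLOCK_SIZE = 0x80
SCRIPT_BASES = {
    "bengali": 0x0980,
    "gujarati": 0x0A80,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}


def devanagari_counterpart(ch):
    """Return the Devanagari character at the same block offset, or None."""
    cp = ord(ch)
    for base in SCRIPT_BASES.values():
        if base <= cp < base + BLOCK_SIZE:
            candidate = chr(DEVANAGARI_BASE + (cp - base))
            # Guard against offsets that are unassigned in the Devanagari block.
            return candidate if unicodedata.name(candidate, "") else None
    return None  # not in one of the listed Indic blocks


def warm_start(embeddings, vocab, new_chars, dim, rng):
    """Append one embedding row per new character (hypothetical helper).

    embeddings: list of vectors (stand-in for the real embedding matrix)
    vocab: dict mapping token -> row index
    """
    for ch in new_chars:
        src = devanagari_counterpart(ch)
        if src is not None and src in vocab:
            vec = list(embeddings[vocab[src]])  # copy the Devanagari row
        else:
            # Fallback (assumption): small random init when no counterpart exists.
            vec = [rng.gauss(0.0, 0.02) for _ in range(dim)]
        vocab[ch] = len(embeddings)
        embeddings.append(vec)


vocab = {"क": 0}
embeddings = [[0.5, -0.25]]
warm_start(embeddings, vocab, ["క", "ಕ"], dim=2, rng=random.Random(0))
```

Offset matching is only a heuristic. Some code points do not line up phonetically across blocks (the Malayalam chillu letters are one example), so a real mapping would likely need a small table of manual overrides.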

Results

CER (Character Error Rate) via Whisper large-v3 ASR on 100 held-out samples per language:

| Language | CER | Notes |
|---|---|---|
| Hindi | 0.1058 | Improved from 0.29 baseline |
| Kannada | 0.1434 | |
| Tamil | 0.1608 | |
| Marathi | 0.1976 | |
| Gujarati | 0.2377 | |
| Bengali | 0.2450 | |
| Telugu | 0.2853 | |
| Malayalam | 0.8593 | Experimental, needs more data |

Malayalam struggles significantly; it likely needs more training data or a dedicated training round. The other seven languages produce intelligible, natural-sounding speech.
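For readers unfamiliar with the metric, CER is the character-level edit distance between the reference text and the ASR transcript, divided by the reference length. The sketch below shows only that calculation. It is not the evaluation script used here: the Whisper transcription step is left out, and it assumes code-point-level comparison with no text normalization and edits pooled across samples, none of which is specified in this post. The function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate for one (reference, hypothesis) pair."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def corpus_cer(pairs):
    """Pooled CER: total edits divided by total reference characters."""
    edits = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    chars = sum(len(ref) for ref, _ in pairs)
    return edits / chars
```

Because insertions count as errors, CER can exceed 1.0 when the transcript is much longer than the reference. For Indic scripts, counting code points rather than grapheme clusters also means a single wrong vowel sign counts as one full error.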

What Didn't Work / Limitations

- Malayalam: CER 0.86 is essentially unintelligible. Possibly the script complexity (many conjuncts) or insufficient data.
- No MOS evaluation yet: CER tells you the words are right, not that it sounds natural. Subjective eval is pending.
- 2 speakers per language: Male + female from IndicTTS. Won't generalize to all voice types.
- No code-mixing: Hindi+English mixed sentences not specifically trained yet.

Links

- Model + audio samples: https://huggingface.co/reenigne314/chatterbox-indic-lora
- Article (full writeup): https://theatomsofai.substack.com/p/teaching-an-ai-to-speak-indian-languages
- Base model: [ResembleAI/chatterbox](https://github.com/resemble-ai/chatterbox) (MIT license)

Quick Start

```python
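from chatterbox.mtl_tts import ChatterboxMultilingualTTS

# Load the model with the Indic LoRA adapter, selecting the te_female (Telugu female) speaker.
model = ChatterboxMultilingualTTS.from_indic_lora(device="cuda", speaker="te_female")
# Synthesize Telugu speech ("Hello, how are you?").
wav = model.generate("నమస్కారం, మీరు ఎలా ఉన్నారు?", language_id="te")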
```

Training Details

- Hardware: 1x RTX PRO 6000 Blackwell (96GB)
- Data: SPRINGLab IndicTTS + ai4bharat Rasa
- 6 training rounds, incremental language addition
- LoRA rank 32, alpha 64, bf16
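
As a sanity check on the ~7.8M figure, the arithmetic can be sketched as follows. A rank-r LoRA adapter on a linear layer of shape d_in × d_out adds r × (d_in + d_out) parameters. The hidden size (1024) and layer count (30) below are assumptions about the T3 backbone and are not stated in this post; the rank, alpha, and 544M total are taken from the post.

```python
def lora_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


HIDDEN = 1024   # assumption: T3 hidden size (not stated in the post)
LAYERS = 30     # assumption: number of transformer layers (not stated in the post)
PROJECTIONS = 4  # q, k, v, o
RANK, ALPHA = 32, 64
TOTAL_PARAMS = 544_000_000  # total model size quoted in the post

per_projection = lora_params(HIDDEN, HIDDEN, RANK)
trainable = per_projection * PROJECTIONS * LAYERS
fraction = trainable / TOTAL_PARAMS
scaling = ALPHA / RANK  # factor applied to the low-rank update in standard LoRA
```

Under these assumptions `trainable` comes to 7,864,320, which is consistent with the quoted ~7.8M and roughly 1.4% of the total, and the LoRA update is scaled by alpha / rank = 2.0.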

Part 2 (technical deep-dive with code) coming this week. Happy to answer questions about the approach.

submitted by /u/Icy_Gas8807