arXiv cs.CL
6/16/2026

Equity with Efficiency: An Empirical Study of Tokenizers for Multilingual Large Language Models
Short summary
Researchers systematically compared tokenization approaches for multilingual LLMs across 11 Southeast Asian languages, finding that standard tokenizers inflate inference costs for underrepresented languages. Parity-aware BPE balances compression and fairness, while Morphology-Driven Byte Encoding prioritizes semantic performance at higher compute cost. The study demonstrates that cross-lingual equity and computational efficiency are not fundamentally at odds.
- First systematic comparison of equitable tokenizers for 11 Southeast Asian languages
- Parity-aware BPE achieves strong compression while maintaining fairness; the morphology-driven approach prioritizes reasoning
- Shows that tokenization choices can achieve both efficiency and cross-lingual equity
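
To make the cost-inflation point concrete, here is a minimal sketch that is not taken from the paper. It assumes a plain byte-level tokenizer (one token per UTF-8 byte) as a stand-in for the tokenizers the study compares, and the sample greetings and the helper names `byte_token_count` and `relative_cost` are hypothetical, chosen only for illustration. It shows how the same short greeting can cost several times more tokens in a non-Latin script than in English.

```python
import unicodedata

# Illustrative greetings only; these are NOT data from the study.
SAMPLES = {
    "en": "hello",
    "vi": "xin chào",
    "th": "สวัสดี",
}


def byte_token_count(text: str) -> int:
    """Count tokens under an assumed byte-level tokenizer (1 token per UTF-8 byte)."""
    # Normalize to NFC so accented letters are counted in composed form.
    normalized = unicodedata.normalize("NFC", text)
    return len(normalized.encode("utf-8"))


def relative_cost(samples: dict[str, str], reference: str = "en") -> dict[str, float]:
    """Token count of each sample divided by the reference language's count."""
    ref_count = byte_token_count(samples[reference])
    return {lang: byte_token_count(text) / ref_count for lang, text in samples.items()}


if __name__ == "__main__":
    for lang, ratio in relative_cost(SAMPLES).items():
        print(f"{lang}: {byte_token_count(SAMPLES[lang])} tokens, {ratio:.1f}x English")
```

Under this assumption each Thai code point takes three bytes, so the Thai greeting uses more than three times as many tokens as the English one. Real subword tokenizers behave differently, but a ratio of this kind is one simple way to compare per-language token counts.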