Table of Links
3 SUTRA Approach
4 Training Multilingual Tokenizers
5 Multilingual MMLU
5.1 Massive Multitask Language Understanding
5.2 Extending MMLU to Multiple Languages and 5.3 Consistent Performance across Languages
5.4 Comparing with leading models for Multilingual Performance
6 Quantitative Evaluation for Real-Time Queries
7 Discussion and Conclusion, and References
5.2 Extending MMLU to Multiple Languages
To assess our models’ effectiveness across tasks and languages, we developed a multilingual evaluation suite that broadens the linguistic scope of evaluation. We adopted the multilingual assessment framework proposed by Lai et al. [2023] and Üstün et al. [2024], with certain distinctions. Notably, while Okapi uses a 25-shot evaluation, our methodology employs the 5-shot evaluation of the original benchmark by Hendrycks et al. [2021]. Because a 5-shot evaluation provides fewer in-context examples, we consider it the more challenging setting. Recognizing that there are over 200 major languages worldwide, our evaluation focuses on a distinct set of languages: English, Korean, Japanese, Arabic, and Indian languages. Although this selection is not exhaustive, it covers a significant portion of linguistic diversity, enabling a thorough analysis of the models’ multilingual capabilities. These languages represent a substantial demographic, accounting for more than half of the global population as primary or secondary speakers, and they are key languages of global business, ensuring our evaluation has broad relevance.
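As a concrete illustration of this setup, the sketch below shows how a 5-shot MMLU prompt can be assembled and scored for one language split. It is a minimal sketch, not the authors’ evaluation harness: the item fields ("question", "choices", "answer") and the `model.generate` call are assumptions for illustration.

```python
# Minimal sketch of a 5-shot MMLU evaluation for one language split.
# Dataset fields and `model.generate` are hypothetical placeholders.

CHOICE_LABELS = ["A", "B", "C", "D"]

def format_question(item, include_answer=False):
    """Render one multiple-choice item as a prompt block."""
    lines = [item["question"]]
    lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, item["choices"])]
    answer = f" {CHOICE_LABELS[item['answer']]}" if include_answer else ""
    lines.append(f"Answer:{answer}")
    return "\n".join(lines)

def build_5shot_prompt(dev_items, test_item):
    """Prepend five solved examples from the dev split before the test question."""
    shots = [format_question(x, include_answer=True) for x in dev_items[:5]]
    return "\n\n".join(shots + [format_question(test_item)])

def evaluate_language(model, dev_items, test_items):
    """Return accuracy on one language split; random chance is 0.25."""
    correct = 0
    for item in test_items:
        prediction = model.generate(build_5shot_prompt(dev_items, item)).strip()[:1]
        correct += prediction == CHOICE_LABELS[item["answer"]]
    return correct / len(test_items)
```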
5.3 Consistent Performance across Languages
The SUTRA model demonstrates notable consistency in performance across languages, as evidenced by the MMLU benchmark results. It exhibits minimal deviation from its English-language results when evaluated in languages such as Hindi, Gujarati, and Arabic, highlighting robust multilingual capabilities that are critical for applications at a global scale.

Superior concept and language modeling underpin the SUTRA model’s ability to maintain performance across languages, distinguishing it from other leading models, including GPT-4, GPT-3.5, and Llama2. Many existing model architectures (including purpose-built multilingual models) show a pronounced decline in performance on non-English languages, often regressing toward random-chance performance, as detailed in Table 5. Note that random-chance performance on the MMLU benchmark is 25%. In contrast, SUTRA consistently achieves stable scores across languages, setting it apart particularly in languages that are less commonly represented in language models, such as Hindi, Gujarati, Tamil, and Korean. The SUTRA model therefore not only excels in individual language performance but also promotes a more universal, language-agnostic approach to AI. It serves as a robust solution for international businesses, educational platforms, and cross-cultural communication, setting a new benchmark for LLMs in a multilingual, interconnected world.
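To make this consistency comparison concrete, the snippet below computes each language’s drop relative to English and flags scores near the 25% random-chance floor. The accuracy values are illustrative placeholders, not figures from Table 5.

```python
# Illustrative consistency check; the accuracy values below are placeholders,
# not reported results. Only the 25% random-chance floor comes from the text.
RANDOM_CHANCE = 0.25

def consistency_report(scores):
    """Print each language's drop relative to English and flag near-chance scores."""
    english = scores["English"]
    for language, accuracy in scores.items():
        drop = english - accuracy
        flag = "  <- near random chance" if accuracy <= RANDOM_CHANCE + 0.05 else ""
        print(f"{language:10s} acc={accuracy:.2f} drop_vs_English={drop:+.2f}{flag}")

# Placeholder numbers for illustration only.
consistency_report({"English": 0.70, "Hindi": 0.68, "Gujarati": 0.66, "Korean": 0.67})
```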
Authors:
(1) Abhijit Bendale, Two Platforms (abhijit@two.ai);
(2) Michael Sapienza, Two Platforms (michael@two.ai);
(3) Steven Ripplinger, Two Platforms (steven@two.ai);
(4) Simon Gibbs, Two Platforms (simon@two.ai);
(5) Jaewon Lee, Two Platforms (jaewon@two.ai);
(6) Pranav Mistry, Two Platforms (pranav@two.ai).