SUTRA Outperforms Leading LLMs on Multilingual MMLU Benchmark

27 Jun 2025

5.4 Comparing with leading models for Multilingual Performance

For our evaluation, we compare the performance of multiple state-of-the-art models on the multilingual MMLU benchmark, as shown in Table 6. We considered leading models such as GPT-4 and GPT-3.5 from OpenAI, Mixtral-8x7b from Mistral, Llama2-13b, Llama2-70b and Llama3-70b from Meta, sonar-medium from Perplexity, HyperClovaX from Naver, and the Airavata model from Sarvam AI. Of these, GPT-4, GPT-3.5, Mixtral, the Llama series and the Perplexity model are general-purpose models, i.e. they were not trained to optimize for specific languages. HyperClovaX was specifically trained to optimize performance in Korean, while Airavata was specifically trained to optimize performance in Hindi.

Table 6: Multilingual performance of various leading models on the MMLU benchmark for multiple languages. SUTRA has competitive performance in English while maintaining strong multilingual performance in other languages. Many leading language models’ MMLU scores for non-English languages fall close to random chance (25% is random chance on the MMLU task).
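The 25% random-chance figure quoted in the Table 6 caption follows from MMLU's four-option multiple-choice format: a model that guesses uniformly is right one time in four. The snippet below is a minimal illustrative sketch of that baseline, not the authors' evaluation code. The function names (`accuracy`, `margin_over_chance`, `simulate_random_guesser`) are hypothetical, and the simulated answers are synthetic rather than taken from the benchmark.

```python
import random

NUM_CHOICES = 4                      # MMLU questions offer four answer options
CHANCE_ACCURACY = 1 / NUM_CHOICES    # uniform guessing is right 25% of the time


def accuracy(predictions, answers):
    """Return the fraction of questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def margin_over_chance(acc, num_choices=NUM_CHOICES):
    """Return how far an accuracy sits above the uniform-guessing baseline."""
    return acc - 1 / num_choices


def simulate_random_guesser(num_questions, seed=0):
    """Score a guesser that picks uniformly at random on synthetic questions."""
    rng = random.Random(seed)
    answers = [rng.randrange(NUM_CHOICES) for _ in range(num_questions)]
    guesses = [rng.randrange(NUM_CHOICES) for _ in range(num_questions)]
    return accuracy(guesses, answers)


if __name__ == "__main__":
    simulated = simulate_random_guesser(20_000)
    print(f"chance baseline:       {CHANCE_ACCURACY:.2%}")
    print(f"simulated random model: {simulated:.2%}")
    print(f"margin over chance:     {margin_over_chance(simulated):+.2%}")
```

Read this way, a score near 25% in a given language indicates that a model is doing little better than guessing in that language, whatever its English score is.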

Overall, the evaluation results demonstrate that our SUTRA models can match and even outperform GPT-3.5 and Llama-7b on TWO-related use cases, particularly in providing natural and engaging responses across languages. Although GPT-4 is still state of the art in terms of performance, its cost remains a major hindrance to wide-scale deployment in cost-sensitive markets. SUTRA models surpass GPT-3.5's multilingual performance by 20-30% on the leading MMLU benchmark and excel in comprehending and generating responses across numerous languages. SUTRA also does well compared to models that were specifically optimized for a particular language, as shown in Table 7, which shows promise for the approach followed by SUTRA. More detailed results showing MMLU scores across groups of categories such as STEM and humanities are listed in Table 8.

Authors:

(1) Abhijit Bendale, Two Platforms (abhijit@two.ai);

(2) Michael Sapienza, Two Platforms (michael@two.ai);

(3) Steven Ripplinger, Two Platforms (steven@two.ai);

(4) Simon Gibbs, Two Platforms (simon@two.ai);

(5) Jaewon Lee, Two Platforms (jaewon@two.ai);

(6) Pranav Mistry, Two Platforms (pranav@two.ai).

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.
