SUTRA Architecture: Extended Context & Mixture of Experts for Multilingual LLMs

25 Jun 2025

Abstract and 1 Introduction

2 Related Work

3 SUTRA Approach

3.1 What is SUTRA?

3.2 Architecture

3.3 Training Data

4 Training Multilingual Tokenizers

5 Multilingual MMLU

5.1 Massive Multitask Language Understanding

5.2 Extending MMLU to Multiple Languages and 5.3 Consistent Performance across Languages

5.4 Comparing with leading models for Multilingual Performance

6 Quantitative Evaluation for Real-Time Queries

7 Discussion and Conclusion, and References

3.2 Architecture

The architecture of our model, referred to herein as SUTRA, is built upon the foundational principles of the transformer architecture delineated by Vaswani et al. [2017]. Our model retains the enhancements specified by Jiang et al. [2023], with the critical adaptation that it supports an extended dense context length of up to 32k tokens. Moreover, we employ MoE layers, which activate only a subset of experts per token and thereby improve efficiency in computation and memory consumption, as shown in Figure 2. The key architectural parameters of SUTRA are summarized in Table 2.
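
As an illustration only, the architectural knobs named in this section (32k dense context, 8 experts per MoE layer, top-2 routing per Figure 2) might be captured in a small configuration sketch like the one below; the remaining fields are hypothetical placeholders and do not reflect the values in Table 2.

```python
# Hypothetical configuration sketch: only the first three values come from the
# text (32k context, 8 experts, top-2 routing). The rest are placeholders and
# are not the parameters reported in Table 2.
from dataclasses import dataclass

@dataclass
class SutraLikeConfig:
    max_seq_len: int = 32_768      # extended dense context length (32k tokens)
    num_experts: int = 8           # experts per MoE layer (Figure 2)
    experts_per_token: int = 2     # top-k experts activated per token (Figure 2)
    d_model: int = 4096            # placeholder hidden size, not from the paper
    n_layers: int = 32             # placeholder layer count, not from the paper
```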

Given an input x, the output of the expert-mixture module is the sum of each expert network's contribution, weighted by the gating network. Formally, for n experts {E_0, E_1, ..., E_{n−1}}, the resultant output is:

y = Σ_{i=0}^{n−1} G(x)_i · E_i(x)

Figure 2: Expert Mixture Layer Configuration. Input vectors are routed to a subset of the available experts, specifically 2 out of 8, by a specialized router. The aggregate output of this layer is the sum of the individual outputs, each weighted accordingly. Each expert comprises a feedforward module similar to those found in conventional transformer models.

Table 2: Selected model parameters of SUTRA.

where G(x)_i is the i-th component of the gating function's n-dimensional output, i.e., the gating weight assigned to the i-th expert, and E_i(x) is the i-th expert network's output. The model capitalizes on sparsity by skipping the computation of inactive experts, thereby conserving computational resources. Several mechanisms for constructing the gating function G(x) exist [Clark et al., 2022, Hazimeh et al., 2021, Zhou et al., 2022]; our implementation opts for the efficient approach of selecting the Top-K values of a linear projection and applying a softmax over them [Shazeer et al., 2017]:

G(x) = Softmax(TopK(x · W_g)),

where W_g is the weight matrix of the gating projection and TopK(·) retains the K largest logits, so that only the selected experts receive non-zero weight.
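
To make the routing concrete, here is a minimal PyTorch sketch (not the authors' implementation) of the computation described above: a linear gate scores all experts, the top-K scores (here 2 of 8, as in Figure 2) are softmax-normalized, and the weighted outputs of the selected experts are summed. The feed-forward shape and GELU activation inside each expert are illustrative assumptions.

```python
# Minimal sketch of a top-k mixture-of-experts feed-forward layer.
# Not the authors' code; expert internals are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # x · W_g
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        logits = self.gate(x)                                # (tokens, n_experts)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)  # keep the K largest scores
        weights = F.softmax(top_vals, dim=-1)                # softmax over the kept scores
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = top_idx[:, slot]                           # expert chosen for this slot
            w = weights[:, slot].unsqueeze(-1)               # its gating weight G(x)_i
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(x[mask])   # G(x)_i * E_i(x)
        return out
```

Because only K experts run per token, the per-token feed-forward cost scales with K rather than with the total number of experts, which is the source of the computational savings noted above.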

Authors:

(1) Abhijit Bendale, Two Platforms (abhijit@two.ai);

(2) Michael Sapienza, Two Platforms (michael@two.ai);

(3) Steven Ripplinger, Two Platforms (steven@two.ai);

(4) Simon Gibbs, Two Platforms (simon@two.ai);

(5) Jaewon Lee, Two Platforms (jaewon@two.ai);

(6) Pranav Mistry, Two Platforms (pranav@two.ai).


This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.