Translation as a Scalable Proxy for Multilingual Evaluation
Abstract
The rapid proliferation of LLMs has created a critical evaluation paradox: while LLMs are widely claimed to be multilingually proficient, comprehensive non-machine-translated benchmarks exist for fewer than 30 languages, leaving more than 98% of the world's roughly 7,000 languages in an empirical void. Traditional benchmark construction faces scaling challenges such as cost, scarcity of domain experts, and data contamination. We evaluate the validity of a simpler alternative: can translation quality alone indicate a model's broader multilingual capabilities? Through systematic evaluation of 14 models (1B–72B parameters) across 9 diverse benchmarks and 7 translation metrics, we find that translation performance is a strong indicator of downstream task success (e.g., for Phi-4, median Pearson r: METRICX = 0.89, XCOMET = 0.91, SSA-COMET = 0.87). These results suggest that the representational abilities supporting faithful translation overlap with those required for multilingual understanding. Translation quality thus emerges as a strong, inexpensive first-pass proxy for multilingual performance.
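
To make the reported statistic concrete, the following is a minimal Python sketch of how a median Pearson correlation between per-language translation-quality scores and downstream task accuracies could be computed. All benchmark names and numbers below are hypothetical placeholders, not the paper's data or code.

# Minimal sketch, assuming aligned per-language score lists for one model:
# translation quality (one metric, e.g. XCOMET) and downstream accuracy,
# grouped by benchmark. Values are illustrative only.
import numpy as np
from scipy.stats import pearsonr

benchmarks = {
    "benchmark_a": {
        "translation_quality": [0.71, 0.83, 0.65, 0.90, 0.58],
        "task_accuracy":       [0.42, 0.55, 0.38, 0.61, 0.33],
    },
    "benchmark_b": {
        "translation_quality": [0.69, 0.80, 0.62, 0.88, 0.55],
        "task_accuracy":       [0.47, 0.52, 0.41, 0.58, 0.36],
    },
}

# Pearson r between translation quality and task accuracy, per benchmark.
correlations = [
    pearsonr(scores["translation_quality"], scores["task_accuracy"])[0]
    for scores in benchmarks.values()
]

# Median across benchmarks, analogous to the per-metric values in the abstract.
print(f"median Pearson r = {np.median(correlations):.2f}")

In this setup, a high median r across benchmarks would indicate that ranking languages by translation quality closely tracks ranking them by downstream performance.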
BibTeX
@article{issaka2026translation,
  title={Translation as a Scalable Proxy for Multilingual Evaluation},
  author={Issaka, Sheriff and Rosas Gonzalez, Erick and Liu, Lieqi and Agyei, Evans Kofi and Bandarkar, Lucas and Peng, Nanyun and Adelani, David Ifeoluwa and Guzmán, Francisco and Gabriel, Saadia},
  journal={Preprint},
  year={2026},
  url={https://translation-as-multilingual-proxy.github.io/}
}