HPLT-E: Comprehensive Multilingual LLM Evaluation

HPLT-E is a framework for comprehensive multilingual and multi-prompt k-shot evaluation across 124 tasks in nine typologically diverse languages: Catalan, Spanish, Basque, Galician, French, Norwegian, Ukrainian, Czech, and Finnish.

🚀 Updates

  • 19.11.2025: We update HPLT-E and release our results comparing HPLT 3.0, HPLT v2, FineWeb2, and MADLAD-400.
  • 02.11.2025: Our pre-print is available on arXiv.
  • 08.10.2025: We release HPLT 3.0 together with HPLT-E.

🗺️ Overview

HPLT-E combines existing monolingual benchmarks for Catalan (CatalanBench), Spanish (SpanishBench), Basque (BasqueBench), Galician (GalicianBench), French (FrenchBench), Norwegian (NorEval), Finnish (FinBench v2), and Czech (BenCzechMark). In addition, we create a multi-task benchmark for Ukrainian (UkrainianBench) and extend single-prompt benchmarks to the multi-prompt scenario (French, Catalan, Spanish, Basque, Galician, and Ukrainian). HPLT-E covers a diverse set of 124 natural language understanding and generation tasks, each supporting 3-7 human-written prompts. Our main evaluation principles include:

  • Diversity: broader representation of lesser-resourced languages in the context of pretraining corpora comparison.
  • Data quality: use of human-curated datasets to ensure reliable evaluation.
  • Robust evaluation: evaluation across 500+ prompts written by native speakers to account for prompt sensitivity.
  • Reproducibility: full integration of HPLT-E into LM Evaluation Harness for user-friendly standardized evaluation.

🌐 Multilingual evaluation suite

HPLT-E covers a broad range of task categories across the supported languages: entailment, causal reasoning, mathematical reasoning, commonsense reasoning, language knowledge, language-specific & world knowledge, paraphrase detection, reading comprehension, sentiment analysis, toxicity detection, machine translation, and truthfulness. The supported tasks for each language are summarized below.

Catalan
Name LM Evaluation Harness Task type Task category
ARC-ca arc_ca_challenge_p[0-2] Multiple-choice QA Language-specific & world knowledge
ARC-ca arc_ca_easy_p[0-2] Multiple-choice QA Language-specific & world knowledge
Belebele catbelebele_p[0-2] Multiple-choice QA Reading comprehension
CatalanQA catalanqa_p[0-2] Generative QA Language-specific & world knowledge
CatCoLA catcola_p[0-2] Text classification Language knowledge
COPA-ca copa_ca_p[0-2] Text classification Commonsense reasoning
CoQCat coqcat_p[0-2] Generative QA Reading comprehension
MGSM-cat mgsm_direct_ca_p[0-2] Generative QA Mathematical reasoning
OpenBookQA-cat openbookqa_ca_p[0-2] Multiple-choice QA Language-specific & world knowledge
Parafraseja parafraseja_p[0-2] Text classification Paraphrase detection
PAWS-ca paws_ca_p[0-2] Text classification Paraphrase detection
PIQA-ca piqa_ca_p[0-2] Multiple-choice QA Commonsense reasoning
SIQA-ca siqa_ca_p[0-2] Multiple-choice QA Commonsense reasoning
TE-ca teca_p[0-2] Text classification Entailment
VeritasQA-cat Generation veritasqa_ca_gen_p[0-2] Generative QA Truthfulness
VeritasQA-cat Multiple-choice veritasqa_ca_mc1_p[0-2] Multiple-choice QA Truthfulness
VeritasQA-cat Multiple-choice veritasqa_ca_mc2_p[0-2] Multiple-choice QA Truthfulness
WNLI wnli_ca_p[0-2] Text classification Entailment
XNLI xnli_ca_p[0-2] Text classification Entailment
XQuAD xquad_ca_p[0-2] Generative QA Reading comprehension
xStoryCloze xstorycloze_ca_p[0-2] Multiple-choice QA Commonsense reasoning
Cocoteros cocoteros_va_p[0-2] Text generation Commonsense reasoning
FLORES flores_en-ca_p[0-2] Sequence-to-sequence generation Machine translation
Spanish
Name LM Evaluation Harness Task type Task category
Belebele spabelebele_p[0-2] Multiple-choice QA Reading comprehension
COPA copa_es_p[0-2] Text classification Commonsense reasoning
ESCoLA escola_p[0-2] Text classification Language knowledge
MGSM-es mgsm_direct_es_p[0-2] Generative QA Mathematical reasoning
OpenBookQA-es openbookqa_es_p[0-2] Multiple-choice QA Language-specific & world knowledge
PAWS-es paws_es_p[0-2] Text classification Paraphrase detection
VeritasQA-es Generation veritasqa_es_gen_p[0-2] Generative QA Truthfulness
VeritasQA-es Multiple-choice veritasqa_es_mc1_p[0-2] Multiple-choice QA Truthfulness
VeritasQA-es Multiple-choice veritasqa_es_mc2_p[0-2] Multiple-choice QA Truthfulness
XNLI xnli_es_p[0-2] Text classification Entailment
XQuAD xquad_es_p[0-2] Generative QA Reading comprehension
xStoryCloze xstorycloze_es_p[0-2] Multiple-choice QA Commonsense reasoning
Cocoteros cocoteros_es_p[0-2] Text generation Commonsense reasoning
FLORES flores_en-es_p[0-2] Sequence-to-sequence generation Machine translation
INCLUDE include_spanish_p[0-2] Multiple-choice QA Language-specific & world knowledge
Global-MMLU global_mmlu_spanish_p[0-2] Multiple-choice QA Language-specific & world knowledge
Galician
Name LM Evaluation Harness Task type Task category
Belebele glgbelebele_p[0-2] Multiple-choice QA Reading comprehension
MGSM-gl mgsm_direct_gl_p[0-2] Generative QA Mathematical reasoning
GalCoLA galcola_p[0-2] Text classification Language knowledge
OpenBookQA-gl openbookqa_gl_p[0-2] Multiple-choice QA Language-specific & world knowledge
Parafrases-gl parafrases_gl_p[0-2] Text classification Paraphrase detection
PAWS-gl paws_gl_p[0-2] Text classification Paraphrase detection
FLORES flores_en-glg_p[0-2] Sequence-to-sequence generation Machine translation
VeritasQA-gl Generation veritasqa_gl_gen_p[0-2] Generative QA Truthfulness
VeritasQA-gl Multiple-choice veritasqa_gl_mc1_p[0-2] Multiple-choice QA Truthfulness
VeritasQA-gl Multiple-choice veritasqa_gl_mc2_p[0-2] Multiple-choice QA Truthfulness
Basque
Name LM Evaluation Harness Task type Task category
Belebele eusbelebele_p[0-2] Multiple-choice QA Reading comprehension
EusExams eus_exams_eu_p[0-2] Multiple-choice QA Language-specific & world knowledge
EusProficiency eus_proficiency_p[0-2] Multiple-choice QA Language-specific & world knowledge
EusReading eus_reading_p[0-2] Multiple-choice QA Reading comprehension
EusTrivia eus_trivia_p[0-2] Multiple-choice QA Language-specific & world knowledge
MGSM-eu mgsm_direct_eu_p[0-2] Generative QA Mathematical reasoning
PIQA-eu piqa_eu_p[0-2] Multiple-choice QA Commonsense reasoning
WNLI wnli_eu_p[0-2] Text classification Entailment
XCOPA xcopa_eu_p[0-2] Text classification Commonsense reasoning
XNLI xnli_eu_native_p[0-2] Text classification Entailment
xStoryCloze xstorycloze_eu_p[0-2] Multiple-choice QA Commonsense reasoning
PAWS-eu paws_eu_p[0-2] Text classification Paraphrase detection
ARC-eu arc_eu_easy_p[0-2] Multiple-choice QA Language-specific & world knowledge
ARC-eu arc_eu_challenge_p[0-2] Multiple-choice QA Language-specific & world knowledge
FLORES flores_en-eu_p[0-2] Sequence-to-sequence generation Machine translation
INCLUDE include_basque_p[0-2] Multiple-choice QA Language-specific & world knowledge
Norwegian
Name LM Evaluation Harness (Bokmål) LM Evaluation Harness (Nynorsk) Task type Task category
NoReC Sentence norec_sentence_p[0-4] Text classification Sentiment analysis
NoReC Document norec_document_p[0-4] Text classification Sentiment analysis
NorIdiom noridiom_nob_p[0-4] noridiom_nno_p[0-4] Sentence completion Language knowledge
Belebele norbelebele_p[0-4] Multiple-choice QA Reading comprehension
NRK-Quiz-QA nrk_quiz_qa_nob_p[0-4] nrk_quiz_qa_nno_p[0-4] Multiple-choice QA Language-specific & world knowledge
NorOpenBookQA noropenbookqa_nob_p[0-4] noropenbookqa_nno_p[0-4] Multiple-choice QA Language-specific & world knowledge
NorCommonsenseQA norcommonsenseqa_nob_p[0-4] norcommonsenseqa_nno_p[0-4] Multiple-choice QA Commonsense reasoning
NorTruthfulQA Multiple choice nortruthfulqa_mc_nob_p[0-4] nortruthfulqa_mc_nno_p[0-4] Multiple-choice QA Truthfulness
NorQuAD norquad_p[0-4] Generative QA Reading comprehension
NorTruthfulQA Generation nortruthfulqa_gen_nob_p[0-4] nortruthfulqa_gen_nno_p[0-4] Generative QA Truthfulness
Tatoeba (English → Bokmål/Nynorsk) tatoeba_eng_nob_p[0-4] tatoeba_eng_nno_p[0-4] Sequence-to-sequence generation Machine translation
Ukrainian
Name LM Evaluation Harness Task type Task category
Global-MMLU global_mmlu_ukrainian_p[0-2] Multiple-choice QA Language-specific & world knowledge
ZNO zno_p[0-2] Multiple-choice QA Language-specific & world knowledge
INCLUDE include_ukrainian_p[0-2] Multiple-choice QA Language-specific & world knowledge
TextDetox textdetox_ukr_p[0-2] Text classification Toxicity detection
UA-SQuAD ua_squad_p[0-2] Generative QA Reading comprehension
Belebele ukrbelebele_p[0-2] Multiple-choice QA Reading comprehension
WMT24PP wmt24pp_en-uk_p[0-2] Sequence-to-sequence generation Machine translation
Czech

NB: we update BenCzechMark to enable support for the latest LM Evaluation Harness versions and create new prompts for Global-MMLU.

Name LM Evaluation Harness Task type Task category
Belebele cesbelebele_p[0-4] Multiple-choice QA Reading comprehension
Global-MMLU global_mmlu_czech_p[0-4] Multiple-choice QA Language-specific & world knowledge
SQAD3.2 cs_sqad32_p[0-4] Generative QA Reading comprehension
Umimeto umimeto_p[0-4] Multiple-choice QA Language-specific & world knowledge
CERMAT OPEN cermat_czech_open_p[0-4] Generative QA Language knowledge
CERMAT TF cermat_czech_tf_p[0-4] Multiple-choice QA Language knowledge
CERMAT MC cermat_czech_mc_p[0-4] Multiple-choice QA Language knowledge
Klokan QA klokan_qa_p[0-4] Multiple-choice QA Mathematical reasoning
CERMAT (Math) MC cermat_czmath_mc_p[0-4] Multiple-choice QA Mathematical reasoning
CERMAT (Math) OPEN cermat_czmath_open_p[0-4] Generative QA Mathematical reasoning
CTKFacts ctkfacts_nli_p[0-4] Text classification Entailment
Subjectivity ces_subjectivity_p[0-4] Text classification Sentiment analysis
CzechSentiment - Mall sentiment_mall_p[0-4] Text classification Sentiment analysis
CzechSentiment - CSFD sentiment_csfd_p[0-4] Text classification Sentiment analysis
CzechSentiment - FB sentiment_fb_p[0-4] Text classification Sentiment analysis
French
Name LM Evaluation Harness Task type Task category
FQuaD fquad_p[0-2] Generative QA Reading comprehension
French Language Test: Grammar french_bench_grammar_p[0-2] Multiple-choice QA Language knowledge
French Language Test: Vocabulary french_bench_vocabulary_p[0-2] Multiple-choice QA Language knowledge
French Language Test: Reading french_bench_reading_p[0-2] Multiple-choice QA Reading comprehension
Belebele frabelebele_p[0-2] Multiple-choice QA Reading comprehension
French NLI topic_based_nli_p[0-2] Text classification Entailment
XNLI french_xnli_p[0-2] Text classification Entailment
INCLUDE include_french_p[0-2] Multiple-choice QA Language-specific & world knowledge
Global-MMLU global_mmlu_french_p[0-2] Multiple-choice QA Language-specific & world knowledge
Finnish
Name Formulation LM Evaluation Harness Task type Task category FinBench v2 dataset version
ARC-challenge-fi Multiple-choice arc_challenge_fi_mcf_fbv2_p[0-4] Multiple-choice QA Language-specific & world knowledge finbenchv2-arc-c-fi-ht
Cloze arc_challenge_fi_cf_fbv2_p[0-4]
Belebele Multiple-choice belebele_fin_Latn_mcf_fbv2_p[0-4] Multiple-choice QA Reading comprehension finbenchv2-belebele-fi-og
Cloze belebele_fin_Latn_cf_fbv2_p[0-4]
GoldenSwag Multiple-choice goldenswag_ht_fi_mcf_fbv2_p[0-4] Sentence completion Commonsense reasoning finbenchv2-goldenswag-fi-ht
Cloze goldenswag_ht_fi_cf_fbv2_p[0-4]
FIN-Bench Multiple-choice finbench_analogies_mcf_fbv2_p[0-4] Multiple-choice QA Relational reasoning FIN-bench
Cloze finbench_analogies_cf_fbv2_p[0-4]
Multiple-choice finbench_emotions_mcf_fbv2_p[0-4] Text classification Sentiment analysis FIN-bench
Cloze finbench_emotions_cf_fbv2_p[0-4]
Multiple-choice finbench_empirical_judgments_mcf_fbv2_p[0-4] Text classification Causal reasoning FIN-bench
Cloze finbench_empirical_judgments_cf_fbv2_p[0-4]
Multiple-choice finbench_general_knowledge_mcf_fbv2_p[0-4] Multiple-choice QA Language-specific & world knowledge FIN-bench
Cloze finbench_general_knowledge_cf_fbv2_p[0-4]
Multiple-choice finbench_hhh_alignment_mcf_fbv2_p[0-4] Multiple-choice QA Alignment and safety FIN-bench
Cloze finbench_hhh_alignment_cf_fbv2_p[0-4]
Multiple-choice finbench_paraphrase_mcf_fbv2_p[0-4] Text classification Paraphrase detection FIN-bench
Cloze finbench_paraphrase_cf_fbv2_p[0-4]
Multiple-choice finbench_similarities_abstraction_mcf_fbv2_p[0-4] Multiple-choice QA Commonsense reasoning FIN-bench
Cloze finbench_similarities_abstraction_cf_fbv2_p[0-4]

🧪 Multilingual Evaluation Recipe

We provide the results of our ablation studies evaluating different corpora and sampling strategies across multiple languages.

Our multilingual evaluation recipe consists of three main components:

  • 🧩 Pretraining: Pretraining ablation models on various corpora configurations for the target languages.
  • 🎯 Task selection: Selecting tasks that provide reliable pretraining evaluation signal based on the maximum performance across prompts.
  • 📊 Performance aggregation: Aggregating performance on the selected tasks across languages.

🧩 Pretraining

Each evaluation series involves pretraining individual 2.15B-parameter models for every language, following a fixed pretraining setup. All models follow the Llama architecture with 24 layers, 32 attention heads, and a sequence length of 2048. The tokenizer is Gemma-3 with a vocabulary size of 262K tokens. For lower-resource languages with fewer than 30B/100B tokens of available data, datasets are uniformly upsampled (repeated) following Muennighoff et al. (2023). Pretraining is run using the Megatron-LM framework on the LUMI supercomputer, employing 16 AMD MI250x nodes and totaling approximately 1k GPU hours.
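
For illustration, here is a minimal sketch (not the authors' pretraining code) of what uniform upsampling amounts to: the corpus is repeated until the target token budget is reached, in the spirit of Muennighoff et al. (2023). The token counts below are hypothetical placeholders.

def upsampling_factor(available_tokens: float, budget_tokens: float) -> float:
    """Number of times the corpus must be repeated to fill the token budget."""
    return max(1.0, budget_tokens / available_tokens)

# Hypothetical example: 18B tokens available, 30BT pretraining budget.
corpus_tokens = 18e9
budget = 30e9

factor = upsampling_factor(corpus_tokens, budget)
print(f"Each document is seen ~{factor:.2f} times (~{factor:.1f} epochs)")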

🎯 Task selection

We use the standard task-specific metrics and report the maximum score across the prompts as the main performance aggregation method. We adapt the FineWeb2 evaluation design to examine the signal HPLT-E tasks provide based on the criteria and statistics summarized below.

  • Monotonicity: performance should improve as pretraining progresses, even if the improvement differs across pretraining corpora. Tasks with fluctuating scores offer limited reliability.
  • Stable pretraining: relative variability of performance across checkpoints should be low, reflecting smooth pretraining dynamics.
  • Ranking consistency: relative ranking of models should remain consistent across consecutive pretraining intervals.
  • Prompt sensitivity: performance should be consistent across various prompt formulations.
  • Prompt-switch rate: frequent switches in the best-performing prompt further reflect low evaluation reliability due to a potential prompt lottery.
  • Signal-to-Noise ratio: differences in task performance should primarily reflect differences in corpora quality, not random variation due to prompt choice.
  • Non-randomness: final checkpoints should achieve performance above a random guessing baseline. Tasks where all models perform near random provide low discriminative power.

Specific evaluation criteria requirements are detailed on the corresponding evaluation page.
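
As an illustration only, the sketch below computes rough proxies for some of the criteria above on a toy score array. The actual definitions and thresholds used in HPLT-E are implemented in our task selection codebase (see get_normalized_results in the task selection example below); all names and numbers here are assumptions.

import numpy as np
from scipy.stats import spearmanr

# Toy scores[c, k, p]: corpus c, checkpoint k, prompt p (illustrative data only).
rng = np.random.default_rng(0)
ckpts = np.arange(10)
trend = 0.30 + 0.04 * ckpts                              # improves as pretraining progresses
scores = (trend[None, :, None]
          + rng.normal(0, 0.02, size=(3, 10, 3))         # prompt/checkpoint noise
          + rng.normal(0, 0.03, size=(3, 1, 1)))         # corpus-level offsets

best_prompt = scores.max(axis=-1)                        # main aggregation: max over prompts

# Monotonicity proxy: rank correlation between checkpoint index and score, per corpus.
monotonicity = np.mean([spearmanr(ckpts, s)[0] for s in best_prompt])

# Stable pretraining proxy: coefficient of variation across checkpoints (lower is better).
cv = np.mean(best_prompt.std(axis=1) / best_prompt.mean(axis=1)) * 100

# Prompt sensitivity proxy: spread across prompts at the final checkpoint.
prompt_spread = scores[:, -1, :].std(axis=-1).mean()

# Signal-to-noise proxy: between-corpus spread vs. within-corpus prompt spread.
snr = scores[:, -1, :].mean(axis=-1).std() / (prompt_spread + 1e-9)

# Non-randomness proxy: final best-prompt scores vs. a hypothetical random baseline.
random_baseline = 0.25
non_random = bool((best_prompt[:, -1] > random_baseline).all())

print(monotonicity, cv, snr, non_random)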

📊 Performance aggregation

We select tasks that provide the pretraining evaluation signal and aggregate the performance using a combination of several approaches.

🔤 Language score

We compute language scores across the selected tasks as follows (a small worked sketch is given after the list):

  1. Rescaling: Normalize performance scores relative to a random baseline using min–max normalization.
  2. Category averaging: Compute the average of normalized scores within each task category.
  3. Language score: Derive the final language-level score as the mean of the category averages.
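
The following minimal sketch illustrates these three steps; the task names, scores, and random baselines are hypothetical placeholders, not taken from the HPLT-E codebase or results.

from collections import defaultdict

def rescale(score, random_baseline, upper=100.0):
    """Min-max normalization of a score relative to the random-guessing baseline."""
    return 100.0 * (score - random_baseline) / (upper - random_baseline)

# (task, category, raw score, random baseline) -- hypothetical values for illustration
results = [
    ("copa_es",        "Commonsense reasoning",                61.0, 50.0),
    ("xstorycloze_es", "Commonsense reasoning",                58.5, 50.0),
    ("openbookqa_es",  "Language-specific & world knowledge",  31.0, 25.0),
    ("xnli_es",        "Entailment",                           42.0, 33.3),
]

# 1. Rescaling + 2. Category averaging
per_category = defaultdict(list)
for task, category, score, baseline in results:
    per_category[category].append(rescale(score, baseline))
category_avgs = {c: sum(v) / len(v) for c, v in per_category.items()}

# 3. Language score: mean of the category averages
language_score = sum(category_avgs.values()) / len(category_avgs)
print(category_avgs, round(language_score, 2))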

🌍 "Multilingual" score

To compute the multilingual score, we use several approaches (a toy illustration is given after the list):

  1. Average normalized score: We average min-max normalized language scores.
  2. Average rank: We rank the final checkpoints' language scores across all corpora configurations and average their ranks.
  3. Borda count: First, we rank the final checkpoints for each language; second, we apply the Borda count on the language-wise rankings to compute the final ranking. We utilize the Vote'n'Rank framework.
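
Below is a toy illustration of the two rank-based aggregations (average rank and a simple Borda count). It uses plain NumPy rather than the Vote'n'Rank implementation, and the corpus names and language scores are hypothetical.

import numpy as np

corpora = ["corpus_A", "corpus_B", "corpus_C"]
# Final-checkpoint language scores per corpus (higher is better); hypothetical values.
language_scores = {
    "spa": [54.1, 57.3, 52.8],
    "ukr": [40.2, 44.5, 43.9],
    "fin": [47.0, 45.2, 48.8],
}

# Rank the corpora within each language (1 = best corpus for that language).
ranks = {lang: np.argsort(np.argsort(-np.asarray(s))) + 1
         for lang, s in language_scores.items()}

# Average rank across languages (lower is better).
avg_rank = np.mean(list(ranks.values()), axis=0)

# Borda count: a corpus ranked r-th out of n receives n - r points per language.
n = len(corpora)
borda = sum(n - r for r in ranks.values())

for i, corpus in enumerate(corpora):
    print(corpus, "average rank:", round(float(avg_rank[i]), 2), "Borda points:", int(borda[i]))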

😎 HPLT-E Tasks

Based on our large-scale evaluations, we find that the set of selected tasks for each language can slightly differ depending on the number of corpora and checkpoints included in the comparison. Although we encourage users to perform task selection on their own data using our codebase, we also release a set of HPLT-E tasks for languages that are less represented in multilingual evaluations, derived from our 100BT model evaluation results.

😎 HPLT-E tasks
Language Name LM Evaluation Harness Task type Task category
Spanish COPA copa_es_p[0-2] Text classification Commonsense reasoning
Spanish OpenBookQA-es openbookqa_es_p[0-2] Multiple-choice QA Language-specific & world knowledge
Spanish XNLI xnli_es_p[0-2] Text classification Entailment
Spanish xStoryCloze xstorycloze_es_p[0-2] Multiple-choice QA Commonsense reasoning
Catalan ARC-ca arc_ca_easy_p[0-2] Multiple-choice QA Language-specific & world knowledge
Catalan COPA-ca copa_ca_p[0-2] Text classification Commonsense reasoning
Catalan CoQCat coqcat_p[0-2] Generative QA Reading comprehension
Catalan PIQA-ca piqa_ca_p[0-2] Multiple-choice QA Commonsense reasoning
Catalan SIQA-ca siqa_ca_p[0-2] Multiple-choice QA Commonsense reasoning
Catalan TE-ca teca_p[0-2] Text classification Entailment
Catalan xStoryCloze xstorycloze_ca_p[0-2] Multiple-choice QA Commonsense reasoning
Ukrainian ZNO zno_p[0-2] Multiple-choice QA Language-specific & world knowledge
Ukrainian UA-SQuAD ua_squad_p[0-2] Generative QA Reading comprehension
Czech SQAD3.2 cs_sqad32_p[0-4] Generative QA Reading comprehension
Finnish ARC-challenge-fi arc_challenge_fi_cf_fbv2_p[0-4] Multiple-choice QA Language-specific & world knowledge
Finnish Belebele belebele_fin_Latn_cf_fbv2_p[0-4] Multiple-choice QA Reading comprehension
Finnish GoldenSwag goldenswag_ht_fi_cf_fbv2_p[0-4] Sentence completion Commonsense reasoning
Finnish FIN-Bench Analogies finbench_analogies_cf_fbv2_p[0-4] Multiple-choice QA Language-specific & world knowledge
Finnish FIN-Bench General Knowledge finbench_general_knowledge_cf_fbv2_p[0-4] Multiple-choice QA Language-specific & world knowledge
French FQuaD fquad_p[0-2] Generative QA Reading comprehension
French French Language Test: Vocabulary french_bench_vocabulary_p[0-2] Multiple-choice QA Language knowledge
Norwegian NorIdiom Nynorsk noridiom_nno_p[0-4] Sentence completion Language knowledge
Norwegian NRK-Quiz-QA Bokmål nrk_quiz_qa_nob_p[0-4] Multiple-choice QA Language-specific & world knowledge
Norwegian NRK-Quiz-QA Nynorsk nrk_quiz_qa_nno_p[0-4] Multiple-choice QA Language-specific & world knowledge
Norwegian NorCommonsenseQA Bokmål norcommonsenseqa_nob_p[0-4] Multiple-choice QA Commonsense reasoning
Norwegian NorQuAD norquad_p[0-4] Generative QA Reading comprehension

⚙️ Installation and usage

  1. Install LM Evaluation Harness as described here.
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
  2. Clone our HPLT-E GitHub repository.
git clone https://github.com/hplt-project/hplt-e.git
cd hplt-e
  3. Get the finbench_v2 folder from the FinBench v2 GitHub repository.
💻 How to run evaluation?

Detailed guidelines on how to use LM Evaluation Harness can be found here. The task names can be found in the LM Evaluation Harness column of the language-specific task tables provided above. The suffix _p[i-j] denotes the range of supported prompt indices.

Basic usage

Below is an example of basic framework usage with the required arguments. The evaluation requires the include_path argument to register our tasks in the framework, as they are not yet part of LM Evaluation Harness:

lm_eval \
  --model hf \
  --model_args pretrained=my_hf_model_name \
  --tasks global_mmlu_ukrainian_p0 \
  --include_path ./ \
  --output_path results/ukrainian/ \
  --log_samples \
  --show_config \
  --write_out \
  --batch_size auto \
  --num_fewshot 0

An example of a Slurm script for the LUMI supercomputer is provided here.

sbatch scripts/run.sh <model_name> <task_name>

Task groups

An alternative approach to running all tasks of interest at once is to create a task group. LM Evaluation Harness allows grouping tasks as described here. An example for the Ukrainian global_mmlu_ukrainian_p0 task group can be found here.

🗂️ How to select tasks?

Please find an example of how to select tasks and visualize the results below.

from datasets import load_dataset
from utils import *
from viz import *

dataset = load_dataset("HPLT/corpora-comparison-evals", "spa_Latn", split="results").to_pandas()
dataset_criteria_results, dataset_normalized_results, dataset_filtered_tasks = get_normalized_results(
    dataset,
    score_col="max_score",
    monotonicity_threshold=0.5,
    snr_threshold=3,
    mad_threshold=5,
    cv_threshold=15,
    higher_bound=100,
    thresholds=THRESHOLDS_100BT
)
# Viz. example 1: Plotting the results on the selected tasks.
plot_normalized_results(
  dataset_normalized_results,
  tick_style="100BT" # tick_style="30BT" for the 30BT models 
) 
# Viz. example 2: Plotting the results on a particular task (works for both selected and non-selected tasks).
plot_results_by_task(
  raw_results=dataset,
  task="spabelebele",
  score_col="max_score", # available: "max_score", "median_score", and "mean_score"
  tick_style="100BT" # tick_style="30BT" for the 30BT models 
)

We provide the detailed task selection codebase here, the visualization codebase here, and the recommended thresholds for the 30BT and 100BT models.

🧾 Citation

Our pre-print is available on arXiv.

@article{oepen2025hplt,
  title={HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models},
  author={Oepen, Stephan and Arefev, Nikolay and Aulamo, Mikko and Ba{\~n}{\'o}n, Marta and Buljan, Maja and Burchell, Laurie and Charpentier, Lucas and Chen, Pinzhen and Fedorova, Mariya and de Gibert, Ona and others},
  journal={arXiv preprint arXiv:2511.01066},
  year={2025}
}

🙏 Acknowledgements

We thank Étienne Simon (UiO), Lucas Georges Gabriel Charpentier (UiO), and Daryna Dementieva (TUM) for their contribution to our prompt collection for French and Ukrainian.
