HPLT-E: Comprehensive Multilingual LLM Evaluation

HPLT-E is a framework for comprehensive multilingual and multi-prompt k-shot evaluation across 124 tasks in nine typologically diverse languages: Catalan, Spanish, Basque, Galician, French, Norwegian, Ukrainian, Czech, and Finnish.

🚀 Updates

  • 19.11.2025: We update HPLT-E and release our results comparing HPLT 3.0, HPLT v2, FineWeb2, and MADLAD-400.
  • 02.11.2025: Our pre-print is available on arXiv.
  • 08.10.2025: We release HPLT 3.0 together with HPLT-E.

🗺️ Overview

HPLT-E combines existing monolingual benchmarks for Catalan (CatalanBench), Spanish (SpanishBench), Basque (BasqueBench), Galician (GalicianBench), French (FrenchBench), Norwegian (NorEval), Finnish (FinBench v2), and Czech (BenCzechMark). In addition, we create a multi-task benchmark for Ukrainian (UkrainianBench) and extend single-prompt benchmarks to the multi-prompt scenario (French, Catalan, Spanish, Basque, Galician, and Ukrainian). HPLT-E covers a diverse set of 124 natural language understanding and generation tasks, each supporting 3-7 human-written prompts. Our main evaluation principles include:

  • Diversity: broader representation of lesser-resourced languages in the context of pretraining corpora comparison.
  • Data quality: use of human-curated datasets to ensure reliable evaluation.
  • Robust evaluation: evaluation across 500+ prompts written by native speakers to account for prompt sensitivity.
  • Reproducibility: full integration of HPLT-E into LM Evaluation Harness for user-friendly standardized evaluation.

🌐 Multilingual evaluation suite

HPLT-E covers a broad range of task categories across the supported languages: entailment, causal reasoning, mathematical reasoning, commonsense reasoning, language knowledge, language-specific & world knowledge, paraphrase detection, reading comprehension, sentiment analysis, toxicity detection, machine translation, and truthfulness. The supported tasks for each language are summarized below.

Catalan
Name LM Evaluation Harness Task type Task category
ARC-ca arc_ca_challenge_p[0-2] Multiple-choice QA Language-specific & world knowledge
ARC-ca arc_ca_easy_p[0-2] Multiple-choice QA Language-specific & world knowledge
Belebele catbelebele_p[0-2] Multiple-choice QA Reading comprehension
CatalanQA catalanqa_p[0-2] Generative QA Language-specific & world knowledge
CatCoLA catcola_p[0-2] Text classification Language knowledge
COPA-ca copa_ca_p[0-2] Text classification Commonsense reasoning
CoQCat coqcat_p[0-2] Generative QA Reading comprehension
MGSM-cat mgsm_direct_ca_p[0-2] Generative QA Mathematical reasoning
OpenBookQA-cat openbookqa_ca_p[0-2] Multiple-choice QA Language-specific & world knowledge
Parafraseja parafraseja_p[0-2] Text classification Paraphrase detection
PAWS-ca paws_ca_p[0-2] Text classification Paraphrase detection
PIQA-ca piqa_ca_p[0-2] Multiple-choice QA Commonsense reasoning
SIQA-ca siqa_ca_p[0-2] Multiple-choice QA Commonsense reasoning
TE-ca teca_p[0-2] Text classification Entailment
VeritasQA-cat Generation veritasqa_ca_gen_p[0-2] Generative QA Truthfulness
VeritasQA-cat Multiple-choice veritasqa_ca_mc1_p[0-2] Multiple-choice QA Truthfulness
VeritasQA-cat Multiple-choice veritasqa_ca_mc2_p[0-2] Multiple-choice QA Truthfulness
WNLI wnli_ca_p[0-2] Text classification Entailment
XNLI xnli_ca_p[0-2] Text classification Entailment
XQuAD xquad_ca_p[0-2] Generative QA Reading comprehension
xStoryCloze xstorycloze_ca_p[0-2] Multiple-choice QA Commonsense reasoning
Cocoteros cocoteros_va_p[0-2] Text generation Commonsense reasoning
FLORES flores_en-ca_p[0-2] Sequence-to-sequence generation Machine translation
Spanish
Name LM Evaluation Harness Task type Task category
Belebele spabelebele_p[0-2] Multiple-choice QA Reading comprehension
COPA copa_es_p[0-2] Text classification Commonsense reasoning
ESCoLA escola_p[0-2] Text classification Language knowledge
MGSM-es mgsm_direct_es_p[0-2] Generative QA Mathematical reasoning
OpenBookQA-es openbookqa_es_p[0-2] Multiple-choice QA Language-specific & world knowledge
PAWS-es paws_es_p[0-2] Text classification Paraphrase detection
VeritasQA-es Generation veritasqa_es_gen_p[0-2] Generative QA Truthfulness
VeritasQA-es Multiple-choice veritasqa_es_mc1_p[0-2] Multiple-choice QA Truthfulness
VeritasQA-es Multiple-choice veritasqa_es_mc2_p[0-2] Multiple-choice QA Truthfulness
XNLI xnli_es_p[0-2] Text classification Entailment
XQuAD xquad_es_p[0-2] Generative QA Reading comprehension
xStoryCloze xstorycloze_es_p[0-2] Multiple-choice QA Commonsense reasoning
Cocoteros cocoteros_es_p[0-2] Text generation Commonsense reasoning
FLORES flores_en-es_p[0-2] Sequence-to-sequence generation Machine translation
INCLUDE include_spanish_p[0-2] Multiple-choice QA Language-specific & world knowledge
Global-MMLU global_mmlu_spanish_p[0-2] Multiple-choice QA Language-specific & world knowledge
Galician
Name LM Evaluation Harness Task type Task category
Belebele glgbelebele_p[0-2] Multiple-choice QA Reading comprehension
MGSM-gl mgsm_direct_gl_p[0-2] Generative QA Mathematical reasoning
GalCoLA galcola_p[0-2] Text classification Language knowledge
OpenBookQA-gl openbookqa_gl_p[0-2] Multiple-choice QA Language-specific & world knowledge
Parafrases-gl parafrases_gl_p[0-2] Text classification Paraphrase detection
PAWS-gl paws_gl_p[0-2] Text classification Paraphrase detection
FLORES flores_en-glg_p[0-2] Sequence-to-sequence generation Machine translation
VeritasQA-gl Generation veritasqa_gl_gen_p[0-2] Generative QA Truthfulness
VeritasQA-gl Multiple-choice veritasqa_gl_mc1_p[0-2] Multiple-choice QA Truthfulness
VeritasQA-gl Multiple-choice veritasqa_gl_mc2_p[0-2] Multiple-choice QA Truthfulness
Basque
Name LM Evaluation Harness Task type Task category
Belebele eusbelebele_p[0-2] Multiple-choice QA Reading comprehension
EusExams eus_exams_eu_p[0-2] Multiple-choice QA Language-specific & world knowledge
EusProficiency eus_proficiency_p[0-2] Multiple-choice QA Language-specific & world knowledge
EusReading eus_reading_p[0-2] Multiple-choice QA Reading comprehension
EusTrivia eus_trivia_p[0-2] Multiple-choice QA Language-specific & world knowledge
MGSM-eu mgsm_direct_eu_p[0-2] Generative QA Mathematical reasoning
PIQA-eu piqa_eu_p[0-2] Multiple-choice QA Commonsense reasoning
WNLI wnli_eu_p[0-2] Text classification Entailment
XCOPA xcopa_eu_p[0-2] Text classification Commonsense reasoning
XNLI xnli_eu_native_p[0-2] Text classification Entailment
xStoryCloze xstorycloze_eu_p[0-2] Multiple-choice QA Commonsense reasoning
PAWS-eu paws_eu_p[0-2] Text classification Paraphrase detection
ARC-eu arc_eu_easy_p[0-2] Multiple-choice QA Language-specific & world knowledge
ARC-eu arc_eu_challenge_p[0-2] Multiple-choice QA Language-specific & world knowledge
FLORES flores_en-eu_p[0-2] Sequence-to-sequence generation Machine translation
INCLUDE include_basque_p[0-2] Multiple-choice QA Language-specific & world knowledge
Norwegian
Name LM Evaluation Harness (Bokmål) LM Evaluation Harness (Nynorsk) Task type Task category
NoReC Sentence norec_sentence_p[0-4] Text classification Sentiment analysis
NoReC Document norec_document_p[0-4] Text classification Sentiment analysis
NorIdiom noridiom_nob_p[0-4] noridiom_nno_p[0-4] Sentence completion Language knowledge
Belebele norbelebele_p[0-4] Multiple-choice QA Reading comprehension
NRK-Quiz-QA nrk_quiz_qa_nob_p[0-4] nrk_quiz_qa_nno_p[0-4] Multiple-choice QA Language-specific & world knowledge
NorOpenBookQA noropenbookqa_nob_p[0-4] noropenbookqa_nno_p[0-4] Multiple-choice QA Language-specific & world knowledge
NorCommonsenseQA norcommonsenseqa_nob_p[0-4] norcommonsenseqa_nno_p[0-4] Multiple-choice QA Commonsense reasoning
NorTruthfulQA Multiple choice nortruthfulqa_mc_nob_p[0-4] nortruthfulqa_mc_nno_p[0-4] Multiple-choice QA Truthfulness
NorQuAD norquad_p[0-4] Generative QA Reading comprehension
NorTruthfulQA Generation nortruthfulqa_gen_nob_p[0-4] nortruthfulqa_gen_nno_p[0-4] Generative QA Truthfulness
Tatoeba (English → Bokmål/Nynorsk) tatoeba_eng_nob_p[0-4] tatoeba_eng_nno_p[0-4] Sequence-to-sequence generation Machine translation
Ukrainian
Name LM Evaluation Harness Task type Task category
Global-MMLU global_mmlu_ukrainian_p[0-2] Multiple-choice QA Language-specific & world knowledge
ZNO zno_p[0-2] Multiple-choice QA Language-specific & world knowledge
INCLUDE include_ukrainian_p[0-2] Multiple-choice QA Language-specific & world knowledge
TextDetox textdetox_ukr_p[0-2] Text classification Toxicity detection
UA-SQuAD ua_squad_p[0-2] Generative QA Reading comprehension
Belebele ukrbelebele_p[0-2] Multiple-choice QA Reading comprehension
WMT24PP wmt24pp_en-uk_p[0-2] Sequence-to-sequence generation Machine translation
Czech

NB: we update BenCzechMark to enable support for the latest LM Evaluation Harness versions and create new prompts for Global-MMLU.

Name LM Evaluation Harness Task type Task category
Belebele cesbelebele_p[0-4] Multiple-choice QA Reading comprehension
Global-MMLU global_mmlu_czech_p[0-4] Multiple-choice QA Language-specific & world knowledge
SQAD3.2 cs_sqad32_p[0-4] Generative QA Reading comprehension
Umimeto umimeto_p[0-4] Multiple-choice QA Language-specific & world knowledge
CERMAT OPEN cermat_czech_open_p[0-4] Generative QA Language knowledge
CERMAT TF cermat_czech_tf_p[0-4] Multiple-choice QA Language knowledge
CERMAT MC cermat_czech_mc_p[0-4] Multiple-choice QA Language knowledge
Klokan QA klokan_qa_p[0-4] Multiple-choice QA Mathematical reasoning
CERMAT (Math) MC cermat_czmath_mc_p[0-4] Multiple-choice QA Mathematical reasoning
CERMAT (Math) OPEN cermat_czmath_open_p[0-4] Generative QA Mathematical reasoning
CTKFacts ctkfacts_nli_p[0-4] Text classification Entailment
Subjectivity ces_subjectivity_p[0-4] Text classification Sentiment analysis
CzechSentiment - Mall sentiment_mall_p[0-4] Text classification Sentiment analysis
CzechSentiment - CSFD sentiment_csfd_p[0-4] Text classification Sentiment analysis
CzechSentiment - FB sentiment_fb_p[0-4] Text classification Sentiment analysis
French
Name LM Evaluation Harness Task type Task category
FQuaD fquad_p[0-2] Generative QA Reading comprehension
French Language Test: Grammar french_bench_grammar_p[0-2] Multiple-choice QA Language knowledge
French Language Test: Vocabulary french_bench_vocabulary_p[0-2] Multiple-choice QA Language knowledge
French Language Test: Reading french_bench_reading_p[0-2] Multiple-choice QA Reading comprehension
Belebele frabelebele_p[0-2] Multiple-choice QA Reading comprehension
French NLI topic_based_nli_p[0-2] Text classification Entailment
XNLI french_xnli_p[0-2] Text classification Entailment
INCLUDE include_french_p[0-2] Multiple-choice QA Language-specific & world knowledge
Global-MMLU global_mmlu_french_p[0-2] Multiple-choice QA Language-specific & world knowledge
Finnish
Name Formulation LM Evaluation Harness Task type Task category FinBench v2 dataset version
ARC-challenge-fi Multiple-choice arc_challenge_fi_mcf_fbv2_p[0-4] Multiple-choice QA Language-specific & world knowledge finbenchv2-arc-c-fi-ht
Cloze arc_challenge_fi_cf_fbv2_p[0-4]
Belebele Multiple-choice belebele_fin_Latn_mcf_fbv2_p[0-4] Multiple-choice QA Reading comprehension finbenchv2-belebele-fi-og
Cloze belebele_fin_Latn_cf_fbv2_p[0-4]
GoldenSwag Multiple-choice goldenswag_ht_fi_mcf_fbv2_p[0-4] Sentence completion Commonsense reasoning finbenchv2-goldenswag-fi-ht
Cloze goldenswag_ht_fi_cf_fbv2_p[0-4]
FIN-Bench Multiple-choice finbench_analogies_mcf_fbv2_p[0-4] Multiple-choice QA Relational reasoning FIN-bench
Cloze finbench_analogies_cf_fbv2_p[0-4]
Multiple-choice finbench_emotions_mcf_fbv2_p[0-4] Text classification Sentiment analysis FIN-bench
Cloze finbench_emotions_cf_fbv2_p[0-4]
Multiple-choice finbench_empirical_judgments_mcf_fbv2_p[0-4] Text classification Causal reasoning FIN-bench
Cloze finbench_empirical_judgments_cf_fbv2_p[0-4]
Multiple-choice finbench_general_knowledge_mcf_fbv2_p[0-4] Multiple-choice QA Language-specific & world knowledge FIN-bench
Cloze finbench_general_knowledge_cf_fbv2_p[0-4]
Multiple-choice finbench_hhh_alignment_mcf_fbv2_p[0-4] Multiple-choice QA Alignment and safety FIN-bench
Cloze finbench_hhh_alignment_cf_fbv2_p[0-4]
Multiple-choice finbench_paraphrase_mcf_fbv2_p[0-4] Text classification Paraphrase detection FIN-bench
Cloze finbench_paraphrase_cf_fbv2_p[0-4]
Multiple-choice finbench_similarities_abstraction_mcf_fbv2_p[0-4] Multiple-choice QA Commonsense reasoning FIN-bench
Cloze finbench_similarities_abstraction_cf_fbv2_p[0-4]

🧪 Multilingual Evaluation Recipe

We provide the results of our ablation studies evaluating different corpora and sampling strategies across multiple languages.

Our multilingual evaluation recipe consists of three main components:

  • 🧩 Pretraining: Pretraining ablation models on various corpora configurations for the target languages.
  • 🎯 Task selection: Selecting tasks that provide reliable pretraining evaluation signal based on the maximum performance across prompts.
  • 📊 Performance aggregation: Aggregating performance on the selected tasks across languages.

🧩 Pretraining

Each evaluation series involves pretraining individual 2.15B-parameter models for every language, following a fixed pretraining setup. All models follow the Llama architecture with 24 layers, 32 attention heads, and a sequence length of 2048. The tokenizer is Gemma-3 with a vocabulary size of 262K tokens. For lower-resource languages with fewer than 30B/100B tokens of available data, datasets are uniformly upsampled (repeated) following Muennighoff et al. (2023). Pretraining is run using the Megatron-LM framework on the LUMI supercomputer, employing 16 AMD MI250x nodes and totaling approximately 1k GPU hours.
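
For illustration, here is a minimal sketch (not the authors' pretraining code) of what uniform upsampling amounts to: the corpus is repeated until the target token budget is reached, in the spirit of Muennighoff et al. (2023). The token counts below are hypothetical placeholders.

def upsampling_factor(available_tokens: float, budget_tokens: float) -> float:
    """Number of times the corpus must be repeated to fill the token budget."""
    return max(1.0, budget_tokens / available_tokens)

# Hypothetical example: 18B tokens available, 30BT pretraining budget.
corpus_tokens = 18e9
budget = 30e9

factor = upsampling_factor(corpus_tokens, budget)
print(f"Each document is seen ~{factor:.2f} times (~{factor:.1f} epochs)")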

🎯 Task selection

We use the standard task-specific metrics and report the maximum score across the prompts as the main performance aggregation method. We adapt the FineWeb2 evaluation design to examine the signal HPLT-E tasks provide based on the criteria and statistics summarized below.

  • Monotonicity: performance should improve as pretraining progresses, even if the improvement differs across pretraining corpora. Tasks with fluctuating scores offer limited reliability.
  • Stable pretraining: relative variability of performance across checkpoints should be low, reflecting smooth pretraining dynamics.
  • Ranking consistency: relative ranking of models should remain consistent across consecutive pretraining intervals.
  • Prompt sensitivity: performance should be consistent across various prompt formulations.
  • Prompt-switch rate: frequent switches in the best-performing prompt further reflect low evaluation reliability due to a potential prompt lottery.
  • Signal-to-Noise ratio: differences in task performance should primarily reflect differences in corpora quality, not random variation due to prompt choice.
  • Non-randomness: final checkpoints should achieve performance above a random guessing baseline. Tasks where all models perform near random provide low discriminative power.

Specific evaluation criteria requirements are detailed on the corresponding evaluation page.
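
As an illustration only, the sketch below computes rough proxies for some of the criteria above on a toy score array. The actual definitions and thresholds used in HPLT-E are implemented in our task selection codebase (see get_normalized_results in the task selection example below); all names and numbers here are assumptions.

import numpy as np
from scipy.stats import spearmanr

# Toy scores[c, k, p]: corpus c, checkpoint k, prompt p (illustrative data only).
rng = np.random.default_rng(0)
ckpts = np.arange(10)
trend = 0.30 + 0.04 * ckpts                              # improves as pretraining progresses
scores = (trend[None, :, None]
          + rng.normal(0, 0.02, size=(3, 10, 3))         # prompt/checkpoint noise
          + rng.normal(0, 0.03, size=(3, 1, 1)))         # corpus-level offsets

best_prompt = scores.max(axis=-1)                        # main aggregation: max over prompts

# Monotonicity proxy: rank correlation between checkpoint index and score, per corpus.
monotonicity = np.mean([spearmanr(ckpts, s)[0] for s in best_prompt])

# Stable pretraining proxy: coefficient of variation across checkpoints (lower is better).
cv = np.mean(best_prompt.std(axis=1) / best_prompt.mean(axis=1)) * 100

# Prompt sensitivity proxy: spread across prompts at the final checkpoint.
prompt_spread = scores[:, -1, :].std(axis=-1).mean()

# Signal-to-noise proxy: between-corpus spread vs. within-corpus prompt spread.
snr = scores[:, -1, :].mean(axis=-1).std() / (prompt_spread + 1e-9)

# Non-randomness proxy: final best-prompt scores vs. a hypothetical random baseline.
random_baseline = 0.25
non_random = bool((best_prompt[:, -1] > random_baseline).all())

print(monotonicity, cv, snr, non_random)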

📊 Performance aggregation

We select tasks that provide the pretraining evaluation signal and aggregate the performance using a combination of several approaches.

🔤 Language score

We compute language scores across the selected tasks as follows (a small worked sketch is given after the list):

  1. Rescaling: Normalize performance scores relative to a random baseline using min–max normalization.
  2. Category averaging: Compute the average of normalized scores within each task category.
  3. Language score: Derive the final language-level score as the mean of the category averages.
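
The following minimal sketch illustrates these three steps; the task names, scores, and random baselines are hypothetical placeholders, not taken from the HPLT-E codebase or results.

from collections import defaultdict

def rescale(score, random_baseline, upper=100.0):
    """Min-max normalization of a score relative to the random-guessing baseline."""
    return 100.0 * (score - random_baseline) / (upper - random_baseline)

# (task, category, raw score, random baseline) -- hypothetical values for illustration
results = [
    ("copa_es",        "Commonsense reasoning",                61.0, 50.0),
    ("xstorycloze_es", "Commonsense reasoning",                58.5, 50.0),
    ("openbookqa_es",  "Language-specific & world knowledge",  31.0, 25.0),
    ("xnli_es",        "Entailment",                           42.0, 33.3),
]

# 1. Rescaling + 2. Category averaging
per_category = defaultdict(list)
for task, category, score, baseline in results:
    per_category[category].append(rescale(score, baseline))
category_avgs = {c: sum(v) / len(v) for c, v in per_category.items()}

# 3. Language score: mean of the category averages
language_score = sum(category_avgs.values()) / len(category_avgs)
print(category_avgs, round(language_score, 2))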

🌍 "Multilingual" score

To compute the multilingual score, we use several approaches (a toy illustration is given after the list):

  1. Average normalized score: We average min-max normalized language scores.
  2. Average rank: We rank the final checkpoints' language scores across all corpora configurations and average their ranks.
  3. Borda count: First, we rank the final checkpoints for each language; second, we apply the Borda count on the language-wise rankings to compute the final ranking. We utilize the Vote'n'Rank framework.
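
Below is a toy illustration of the two rank-based aggregations (average rank and a simple Borda count). It uses plain NumPy rather than the Vote'n'Rank implementation, and the corpus names and language scores are hypothetical.

import numpy as np

corpora = ["corpus_A", "corpus_B", "corpus_C"]
# Final-checkpoint language scores per corpus (higher is better); hypothetical values.
language_scores = {
    "spa": [54.1, 57.3, 52.8],
    "ukr": [40.2, 44.5, 43.9],
    "fin": [47.0, 45.2, 48.8],
}

# Rank the corpora within each language (1 = best corpus for that language).
ranks = {lang: np.argsort(np.argsort(-np.asarray(s))) + 1
         for lang, s in language_scores.items()}

# Average rank across languages (lower is better).
avg_rank = np.mean(list(ranks.values()), axis=0)

# Borda count: a corpus ranked r-th out of n receives n - r points per language.
n = len(corpora)
borda = sum(n - r for r in ranks.values())

for i, corpus in enumerate(corpora):
    print(corpus, "average rank:", round(float(avg_rank[i]), 2), "Borda points:", int(borda[i]))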

😎 HPLT-E Tasks

Based on our large-scale evaluations, we find that the set of selected tasks for each language can slightly differ depending on the number of corpora and checkpoints included in the comparison. Although we encourage users to perform task selection on their own data using our codebase, we also release a set of HPLT-E tasks for languages that are less represented in multilingual evaluations, derived from our 100BT model evaluation results.

😎 HPLT-E tasks
Language Name LM Evaluation Harness Task type Task category
Spanish COPA copa_es_p[0-2] Text classification Commonsense reasoning
Spanish OpenBookQA-es openbookqa_es_p[0-2] Multiple-choice QA Language-specific & world knowledge
Spanish XNLI xnli_es_p[0-2] Text classification Entailment
Spanish xStoryCloze xstorycloze_es_p[0-2] Multiple-choice QA Commonsense reasoning
Catalan ARC-ca arc_ca_easy_p[0-2] Multiple-choice QA Language-specific & world knowledge
Catalan COPA-ca copa_ca_p[0-2] Text classification Commonsense reasoning
Catalan CoQCat coqcat_p[0-2] Generative QA Reading comprehension
Catalan PIQA-ca piqa_ca_p[0-2] Multiple-choice QA Commonsense reasoning
Catalan SIQA-ca siqa_ca_p[0-2] Multiple-choice QA Commonsense reasoning
Catalan TE-ca teca_p[0-2] Text classification Entailment
Catalan xStoryCloze xstorycloze_ca_p[0-2] Multiple-choice QA Commonsense reasoning
Ukrainian ZNO zno_p[0-2] Multiple-choice QA Language-specific & world knowledge
Ukrainian UA-SQuAD ua_squad_p[0-2] Generative QA Reading comprehension
Czech SQAD3.2 cs_sqad32_p[0-4] Generative QA Reading comprehension
Finnish ARC-challenge-fi arc_challenge_fi_cf_fbv2_p[0-4] Multiple-choice QA Language-specific & world knowledge
Finnish Belebele belebele_fin_Latn_cf_fbv2_p[0-4] Multiple-choice QA Reading comprehension
Finnish GoldenSwag goldenswag_ht_fi_cf_fbv2_p[0-4] Sentence completion Commonsense reasoning
Finnish FIN-Bench Analogies finbench_analogies_cf_fbv2_p[0-4] Multiple-choice QA Language-specific & world knowledge
Finnish FIN-Bench General Knowledge finbench_general_knowledge_cf_fbv2_p[0-4] Multiple-choice QA Language-specific & world knowledge
French FQuaD fquad_p[0-2] Generative QA Reading comprehension
French French Language Test: Vocabulary french_bench_vocabulary_p[0-2] Multiple-choice QA Language knowledge
Norwegian NorIdiom Nynorsk noridiom_nno_p[0-4] Sentence completion Language knowledge
Norwegian NRK-Quiz-QA Bokmål nrk_quiz_qa_nob_p[0-4] Multiple-choice QA Language-specific & world knowledge
Norwegian NRK-Quiz-QA Nynorsk nrk_quiz_qa_nno_p[0-4] Multiple-choice QA Language-specific & world knowledge
Norwegian NorCommonsenseQA Bokmål norcommonsenseqa_nob_p[0-4] Multiple-choice QA Commonsense reasoning
Norwegian NorQuAD norquad_p[0-4] Generative QA Reading comprehension

⚙️ Installation and usage

  1. Install LM Evaluation Harness as described here.
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
  2. Clone our HPLT-E GitHub repository.
git clone https://github.com/hplt-project/hplt-e.git
cd hplt-e
  3. Get the finbench_v2 folder from the FinBench v2 GitHub repository.
💻 How to run evaluation?

Detailed guidelines on how to use LM Evaluation Harness can be found here. The task names can be found in the LM Evaluation Harness column of the language-specific task tables provided above. The suffix _p[i-j] denotes the range of supported prompt indices.

Basic usage

Below is an example of basic framework usage with the required arguments. The evaluation requires the include_path argument to register our tasks in the framework, as they are not yet part of LM Evaluation Harness:

lm_eval \
  --model hf \
  --model_args pretrained=my_hf_model_name \
  --tasks global_mmlu_ukrainian_p0 \
  --include_path ./ \
  --output_path results/ukrainian/ \
  --log_samples \
  --show_config \
  --write_out \
  --batch_size auto \
  --num_fewshot 0

An example of a Slurm script for the LUMI supercomputer is provided here.

sbatch scripts/run.sh <model_name> <task_name>

Task groups

An alternative approach to running all tasks of interest at once is to create a task group. LM Evaluation Harness allows grouping tasks as described here. An example for the Ukrainian global_mmlu_ukrainian_p0 task group can be found here.

🗂️ How to select tasks?

Please find an example of how to select tasks and visualize the results below.

from datasets import load_dataset
from utils import *
from viz import *

dataset = load_dataset("HPLT/corpora-comparison-evals", "spa_Latn", split="results").to_pandas()
dataset_criteria_results, dataset_normalized_results, dataset_filtered_tasks = get_normalized_results(
    dataset,
    score_col="max_score",
    monotonicity_threshold=0.5,
    snr_threshold=3,
    mad_threshold=5,
    cv_threshold=15,
    higher_bound=100,
    thresholds=THRESHOLDS_100BT
)
# Viz. example 1: Plotting the results on the selected tasks.
plot_normalized_results(
  dataset_normalized_results,
  tick_style="100BT" # tick_style="30BT" for the 30BT models 
) 
# Viz. example 2: Plotting the results on a particular task (works for both selected and non-selected tasks).
plot_results_by_task(
  raw_results=dataset,
  task="spabelebele",
  score_col="max_score", # available: "max_score", "median_score", and "mean_score"
  tick_style="100BT" # tick_style="30BT" for the 30BT models 
)

We provide the detailed task selection codebase here, the visualization codebase here, and the recommended thresholds for the 30BT and 100BT models.

🧾 Citation

Our pre-print is available on arXiv.

@article{oepen2025hplt,
  title={HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models},
  author={Oepen, Stephan and Arefev, Nikolay and Aulamo, Mikko and Ba{\~n}{\'o}n, Marta and Buljan, Maja and Burchell, Laurie and Charpentier, Lucas and Chen, Pinzhen and Fedorova, Mariya and de Gibert, Ona and others},
  journal={arXiv preprint arXiv:2511.01066},
  year={2025}
}

🙏 Acknowledgements

We thank Étienne Simon (UiO), Lucas Georges Gabriel Charpentier (UiO), and Daryna Dementieva (TUM) for their contribution to our prompt collection for French and Ukrainian.
