Mixed parallel corpora
A curated benchmark dataset for creative localization evaluation
60k+ segment evaluations in the pipeline (62,203 measured rows) across 299 language pairs. Built from open sources with an emphasis on creative content.
What is this dataset
Mixed parallel corpora is the evaluation dataset behind the Algebras fluency benchmark. It contains 510 source segments per target language, translated across 299 language pairs by 20 frontier LLMs. Each translation is scored on four metrics: fluency v2.0, chrF, BLEU, and COMET.
The dataset was curated from open sources with a deliberate emphasis on creative content: marketing copy, game dialogue, product descriptions, and culturally-loaded text where literal translation fails. A smaller portion includes technical and informational content for genre diversity.
Why not just use WMT
WMT test sets are the standard benchmark for machine translation evaluation. We use WMT25 human evaluation data to validate our metrics. But WMT has limitations as a benchmark for creative localization:
| Dimension | WMT25 | Mixed parallel corpora |
|---|---|---|
| Domain | News, primarily | Creative, marketing, game, product |
| Language pairs | 16 (EN→X + X→X) | 299 directions (8 sources × 44 targets in the grid corpus) |
| Segments | ~1,000 per pair | 510 per target language |
| Content type | Formal, informational | Creative, culturally-loaded |
| Evaluation | Human ESA (error span annotation) | Automated 4-metric |
| Availability | Fully public | Description public, data by request |
WMT25 is designed to test translation systems on news text. It does this well. But a game publisher localizing dialogue, or a brand localizing marketing copy, faces different challenges. Calques, register mismatches, and collocational errors matter more in creative content than in news.
Mixed parallel corpora fills this gap: a benchmark dataset where the content resembles what localization teams actually translate.
Dataset statistics
Median length and pair counts reflect the source side of the parallel grid corpus used in the pipeline. The creative share is not a stored field in the public schema; the build favors creative and marketing-style segments with an intentional genre mix.
Source languages: English, Arabic, Spanish, French, Hindi, Portuguese, Russian, Chinese.
Genre distribution: genre labels are not exported in the public segment table. Curation prioritizes creative content (marketing, game, product) over news/Wikipedia, adding informational and technical slices for diversity; the result is a creative-heavy mix rather than a single genre.
Segment characteristics (source texts in the grid corpus):
- Median length: 12 words
- Typical range (5th–95th percentile): 3–36 words
- Mix of short (UI strings, slogans) and long (paragraphs, dialogue)
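The length summary above can be reproduced from any list of source segments. A minimal sketch, using a nearest-rank percentile and toy placeholder segments (not actual dataset rows):

```python
import statistics

def length_summary(segments):
    """Word-count summary (median, 5th/95th percentile) for source segments."""
    lengths = sorted(len(s.split()) for s in segments)

    def percentile(p):
        # Nearest-rank percentile over the sorted word counts.
        k = max(0, min(len(lengths) - 1, round(p / 100 * (len(lengths) - 1))))
        return lengths[k]

    return {
        "median": statistics.median(lengths),
        "p5": percentile(5),
        "p95": percentile(95),
    }

# Toy segments standing in for real dataset rows.
sample = [
    "Buy now and save big",
    "A hero rises from the ashes of a forgotten kingdom",
    "New",
    "Limited-time offer on all premium plans this weekend only",
]
print(length_summary(sample))
```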
How it was built
Segments were randomly sampled from open multilingual sources. The collection process prioritized:
- Creative and marketing content over news/Wikipedia
- Culturally-loaded expressions that test localization quality
- Genre diversity (not 100% one type)
- Practical segment lengths matching real localization workloads
The dataset is not publicly available. For access or partnership inquiries, contact support@algebras.ai.
How it's used in the benchmark
Each source segment is translated by all 20 models in the benchmark. Each translation is evaluated on:
- fluency v2.0 (LLM-judge, 5 subdimensions, 0–10)
- chrF (character n-gram, sacrebleu)
- BLEU (token n-gram, sacrebleu)
- COMET (neural, wmt22-comet-da, 0–1)
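Of the four metrics, chrF is simple enough to sketch from scratch. The snippet below is an illustrative re-implementation of the character n-gram F-score (β = 2, n = 1..6), not the sacrebleu code the pipeline actually uses, and it omits sacrebleu's word n-gram and whitespace options:

```python
from collections import Counter

def char_ngrams(text, n):
    # chrF operates on characters; spaces are removed first.
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hyp, ref, max_n=6, beta=2.0):
    """Illustrative chrF: character n-gram F-beta scores averaged over n."""
    f_scores = []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        total_h, total_r = sum(h.values()), sum(r.values())
        if total_h == 0 or total_r == 0:
            continue  # segment too short for this n-gram order
        overlap = sum((h & r).values())  # clipped n-gram matches
        prec, rec = overlap / total_h, overlap / total_r
        if prec + rec == 0:
            f_scores.append(0.0)
        else:
            # beta=2 weights recall twice as heavily as precision.
            f_scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(f_scores) / len(f_scores) if f_scores else 0.0

print(chrf("the cat sat", "the cat sat"))
```

An identical hypothesis and reference score 100; fully disjoint strings score 0.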
The router selects the best model per language pair based on fluency scores across 10 evaluation segments per pair.
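The routing step reduces to an argmax over mean fluency per (language pair, model). A minimal sketch with invented scores; the pair and model names are placeholders, not benchmark results:

```python
from collections import defaultdict
from statistics import mean

def route(scores):
    """Pick the best model per language pair by mean fluency.

    scores maps (language_pair, model) -> list of per-segment fluency
    values (0-10), e.g. the 10 evaluation segments per pair.
    """
    by_pair = defaultdict(dict)
    for (pair, model), vals in scores.items():
        by_pair[pair][model] = mean(vals)
    # Highest mean fluency wins for each pair.
    return {pair: max(models, key=models.get) for pair, models in by_pair.items()}

# Toy scores: two pairs, two models, a few segments each.
scores = {
    ("en-de", "model_a"): [8.1, 7.9, 8.4],
    ("en-de", "model_b"): [7.2, 7.5, 7.1],
    ("en-ja", "model_a"): [6.0, 6.2, 5.8],
    ("en-ja", "model_b"): [7.8, 8.0, 7.6],
}
print(route(scores))
```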