Mixed parallel corpora
A curated benchmark dataset for creative localization evaluation
60k+ segment evaluations in the pipeline (62,203 measured rows) across 299 language pairs. Built from open sources with an emphasis on creative content.
What is this dataset
Mixed parallel corpora is the evaluation dataset behind the Algebras fluency benchmark. It contains 510 source segments per target language, translated across 299 language pairs by 20 frontier LLMs. Each translation is scored on four metrics: fluency v2.0, chrF, BLEU, and COMET.
The dataset was curated from open sources with a deliberate emphasis on creative content: marketing copy, game dialogue, product descriptions, and culturally-loaded text where literal translation fails. A smaller portion includes technical and informational content for genre diversity.
Why not just use WMT
WMT test sets are the standard benchmark for machine translation evaluation. We use WMT25 human evaluation data to validate our metrics. But WMT has limitations as a benchmark for creative localization:
| Dimension | WMT25 | Mixed parallel corpora |
|---|---|---|
| Domain | News, primarily | Creative, marketing, game, product |
| Language pairs | 16 (EN→X + X→X) | 299 directions (8 sources × 44 targets in the grid corpus) |
| Segments | ~1,000 per pair | 510 per target language |
| Content type | Formal, informational | Creative, culturally-loaded |
| Evaluation | Human ESA (error span annotation) | Automated 4-metric |
| Availability | Fully public | Description public, data by request |
WMT25 is designed to test translation systems on news text. It does this well. But a game publisher localizing dialogue, or a brand localizing marketing copy, faces different challenges. Calques, register mismatches, and collocational errors matter more in creative content than in news.
Mixed parallel corpora fills this gap: a benchmark dataset where the content resembles what localization teams actually translate.
Dataset statistics
Median length and pair counts reflect the source side of the parallel grid corpus used in the pipeline. The creative share is not a stored field in the public schema; the build favors creative and marketing-style segments with an intentional genre mix.
Source languages: English, Arabic, Spanish, French, Hindi, Portuguese, Russian, Chinese.
Genre distribution: genre labels are not exported in the public segment table. Curation prioritizes creative content (marketing, game, product) over news/Wikipedia, adding informational and technical slices for diversity; the result is a creative-heavy mix rather than a single genre.
Segment characteristics (source texts in the grid corpus):
- Median length: 12 words
- Typical range (5th–95th percentile): 3–36 words
- Mix of short (UI strings, slogans) and long (paragraphs, dialogue)
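The length summary above can be reproduced from any list of source segments. A minimal sketch, using a nearest-rank percentile and toy placeholder segments (not actual dataset rows):

```python
import statistics

def length_summary(segments):
    """Word-count summary (median, 5th/95th percentile) for source segments."""
    lengths = sorted(len(s.split()) for s in segments)

    def percentile(p):
        # Nearest-rank percentile over the sorted word counts.
        k = max(0, min(len(lengths) - 1, round(p / 100 * (len(lengths) - 1))))
        return lengths[k]

    return {
        "median": statistics.median(lengths),
        "p5": percentile(5),
        "p95": percentile(95),
    }

# Toy segments standing in for real dataset rows.
sample = [
    "Buy now and save big",
    "A hero rises from the ashes of a forgotten kingdom",
    "New",
    "Limited-time offer on all premium plans this weekend only",
]
print(length_summary(sample))
```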
How it was built
Segments were randomly sampled from open multilingual sources. The collection process prioritized:
- Creative and marketing content over news/Wikipedia
- Culturally-loaded expressions that test localization quality
- Genre diversity (not 100% one type)
- Practical segment lengths matching real localization workloads
The dataset is not publicly available. For access or partnership inquiries, contact support@algebras.ai.
How it's used in the benchmark
Each source segment is translated by all 20 models in the benchmark. Each translation is evaluated on:
- fluency v2.0 (LLM-judge, 5 subdimensions, 0–10)
- chrF (character n-gram, sacrebleu)
- BLEU (token n-gram, sacrebleu)
- COMET (neural, wmt22-comet-da, 0–1)
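Of the four metrics, chrF is simple enough to sketch from scratch. The snippet below is an illustrative re-implementation of the character n-gram F-score (β = 2, n = 1..6), not the sacrebleu code the pipeline actually uses, and it omits sacrebleu's word n-gram and whitespace options:

```python
from collections import Counter

def char_ngrams(text, n):
    # chrF operates on characters; spaces are removed first.
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hyp, ref, max_n=6, beta=2.0):
    """Illustrative chrF: character n-gram F-beta scores averaged over n."""
    f_scores = []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        total_h, total_r = sum(h.values()), sum(r.values())
        if total_h == 0 or total_r == 0:
            continue  # segment too short for this n-gram order
        overlap = sum((h & r).values())  # clipped n-gram matches
        prec, rec = overlap / total_h, overlap / total_r
        if prec + rec == 0:
            f_scores.append(0.0)
        else:
            # beta=2 weights recall twice as heavily as precision.
            f_scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return 100 * sum(f_scores) / len(f_scores) if f_scores else 0.0

print(chrf("the cat sat", "the cat sat"))
```

An identical hypothesis and reference score 100; fully disjoint strings score 0.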
The router selects the best model per language pair based on fluency scores across 10 evaluation segments per pair.
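The routing step reduces to an argmax over mean fluency per (language pair, model). A minimal sketch with invented scores; the pair and model names are placeholders, not benchmark results:

```python
from collections import defaultdict
from statistics import mean

def route(scores):
    """Pick the best model per language pair by mean fluency.

    scores maps (language_pair, model) -> list of per-segment fluency
    values (0-10), e.g. the 10 evaluation segments per pair.
    """
    by_pair = defaultdict(dict)
    for (pair, model), vals in scores.items():
        by_pair[pair][model] = mean(vals)
    # Highest mean fluency wins for each pair.
    return {pair: max(models, key=models.get) for pair, models in by_pair.items()}

# Toy scores: two pairs, two models, a few segments each.
scores = {
    ("en-de", "model_a"): [8.1, 7.9, 8.4],
    ("en-de", "model_b"): [7.2, 7.5, 7.1],
    ("en-ja", "model_a"): [6.0, 6.2, 5.8],
    ("en-ja", "model_b"): [7.8, 8.0, 7.6],
}
print(route(scores))
```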