fluency
translation quality beyond a single score
An LLM-judge metric that evaluates translations across five explainable dimensions. Validated on WMT25. 299 language pairs.
What it measures
Most translation metrics produce a single number. That number tells you that something is wrong, but not what. fluency evaluates translation quality across five dimensions that matter for creative and product localization.
Each dimension receives a separate score from 0 to 10. The final score aggregates all five. When a translation scores 9.2 overall but 4.1 on calque detection, you know exactly where the problem is: the translation follows source-language structure instead of sounding native.
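To make the per-dimension reporting concrete, here is a minimal sketch of aggregating five 0–10 dimension scores and surfacing the weakest one. The unweighted mean is an assumption for illustration; the document's own example (9.2 overall with 4.1 on calque) suggests the production aggregation weights dimensions differently.

```python
from statistics import mean

# The five subdimensions described below; each scored 0-10.
DIMENSIONS = ["calque", "idiomaticity", "collocation", "discourse", "pragmatics"]

def aggregate(scores: dict[str, float]) -> tuple[float, str]:
    """Return an overall score and the weakest dimension.

    An unweighted mean is assumed here for illustration; the
    actual fluency aggregation is not specified in this document.
    """
    overall = mean(scores[d] for d in DIMENSIONS)
    weakest = min(DIMENSIONS, key=lambda d: scores[d])
    return round(overall, 2), weakest

# A translation that reads fluently overall but is calque-heavy:
overall, weakest = aggregate(
    {"calque": 4.1, "idiomaticity": 9.5, "collocation": 9.0,
     "discourse": 9.8, "pragmatics": 9.6}
)
```

The point of keeping the per-dimension scores alongside the aggregate is exactly this: `weakest` tells a reviewer where to look, which a single scalar cannot.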
fluency is an LLM-judge metric. It uses a frontier language model to evaluate translations the way a professional linguist would: reading for naturalness, not just correctness.
Two versions
fluency v2.0 (frontier)
- Backbone: Gemini 3 Flash
- Reference: not required
- Latency: sub-second per segment
- Coverage: 299 language pairs, 8 source languages
- Best for: continuous evaluation, CI/CD pipelines, high-volume QA
fluency pro v1 (enterprise)
- Backbone: proprietary Algebras model
- Reference: optional
- Deployment: cloud API or on-premise container
- Audit trail: full scoring breakdown per segment
- Pricing: monthly subscription + per-word usage
- Best for: LocQA workflows, compliance-sensitive content, publisher sign-off
Five subdimensions
Calque detection
Identifies literal transfer of source-language structure into the target. A calque is grammatically correct but structurally foreign. Example: English "make a decision" translated to Spanish as "hacer una decisión" instead of the natural "tomar una decisión." Surface metrics like chrF cannot detect this because the words are individually correct.
Idiomatic naturalness
Measures whether a native speaker would produce this phrasing. A translation can be accurate and grammatical but still sound translated. This dimension catches the gap between "correct" and "natural" that separates machine output from human-quality text.
Collocational accuracy
Evaluates whether words combine the way they do in natural target-language usage. "Strong tea" is natural in English; "powerful tea" is not, even though "powerful" is a valid synonym of "strong." Collocational errors are invisible to reference-based metrics when the reference uses a different paraphrase.
Discourse coherence
Assesses whether sentences work together as a text, not just individually. A paragraph where each sentence is well-translated but the connections between them are broken scores high on segment-level metrics and low on discourse coherence. This dimension matters most for longer content: game dialogue, marketing copy, documentation.
Pragmatic adequacy
Checks whether the translation uses the right register, tone, and level of formality for the intended audience. A mobile game localized with formal academic language fails pragmatically even if every word is correct. This dimension is especially important for creative localization where tone carries meaning.
Why not COMET alone
COMET is a strong metric for European language pairs. We use it ourselves as one of four evaluation signals. But COMET has a structural limitation: its reliability depends on how similar the source and target languages are.
We validated fluency v2.0 against WMT25 human evaluation data across 11 English-to-X language pairs.
On typologically distant pairs (EN→JA, EN→AR, EN→Maasai, EN→CS): fluency outperforms COMET on pairwise accuracy and Spearman correlation with human judgments. The judge wins on 4 of 4 pairs with syntactic distance above 0.30.
On typologically close pairs (EN→UK, EN→RU, EN→ET, EN→SR): COMET performs comparably or better.
The correlation between syntactic distance and judge advantage is ρ = 0.77 (p = 0.005). This means metric reliability is not uniform across languages.
| Finding | Evidence |
|---|---|
| Judge outperforms COMET on distant pairs | 4/4 pairs with syntax distance > 0.30 |
| COMET outperforms judge on close pairs | 4/5 pairs with syntax distance < 0.20 |
| Syntactic distance predicts advantage | ρ = 0.77, p = 0.005 |
| Pattern holds across surface metrics | chrF ρ = 0.77, BLEU ρ = 0.76 |
| Survives multiple comparison correction | Holm-Bonferroni p = 0.038 |
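The ρ values above are Spearman rank correlations between per-pair syntactic distance and the judge's advantage over the baseline metric. A pure-Python version of that computation is sketched below; the data points are illustrative toy values, not the WMT25 results.

```python
def rank(values):
    """Average 1-based ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for a tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative only: greater syntactic distance, larger judge advantage.
distance  = [0.12, 0.18, 0.25, 0.31, 0.38, 0.45]
advantage = [-0.04, -0.02, 0.00, 0.03, 0.05, 0.08]
rho = spearman(distance, advantage)  # 1.0 on this perfectly monotone toy data
```

Spearman is the right choice here because it asks only whether judge advantage grows monotonically with distance, without assuming a linear relationship.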
Benchmark coverage
Source languages: English, Arabic, Spanish, French, Hindi, Portuguese, Russian, Chinese. 42 target languages including Amharic, Azerbaijani, Bengali, Cantonese, Japanese, Kazakh, Maasai, Swahili, and others. Metrics per pair: fluency v2.0, chrF, BLEU, COMET.
How it works
Double-pass consistency
Each segment scored twice independently. Disagreements flagged for review. Reduces single-pass LLM variance.
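The double-pass protocol can be sketched as follows. `score_segment` stands in for a single LLM-judge call, and the disagreement tolerance is an illustrative assumption; neither is part of any documented API.

```python
def double_pass(score_segment, source, translation, tolerance=1.0):
    """Score a segment twice with independent judge calls.

    `score_segment` is a stand-in for one LLM-judge call, and the
    tolerance threshold is an illustrative assumption.
    """
    first = score_segment(source, translation)
    second = score_segment(source, translation)
    flagged = abs(first - second) > tolerance  # disagreement -> human review
    return (first + second) / 2, flagged

# Demo with a deterministic fake judge that disagrees with itself:
passes = iter([8.0, 6.0])
avg, flagged = double_pass(lambda s, t: next(passes), "src text", "tgt text")
```

Averaging the two passes smooths single-call variance; flagging large gaps routes genuinely ambiguous segments to a human instead of silently averaging them away.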
Severity levels
Minor (noticeable, not blocking) vs major (user confusion, brand damage). QA teams prioritize by impact, not just score.
Length stratification
Short segments (button labels) and long segments (paragraphs) evaluated with adapted expectations. No one-size-fits-all rubric.
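One way to picture the stratification is a simple length-based router. The character thresholds below are illustrative assumptions; the document only states that short UI strings and long paragraphs get adapted expectations.

```python
def rubric_for(segment: str) -> str:
    """Pick an evaluation rubric by segment length.

    Thresholds are illustrative assumptions, not documented values.
    """
    n = len(segment)
    if n <= 30:
        return "short"   # button labels, menu items
    if n <= 200:
        return "medium"  # single sentences
    return "long"        # paragraphs, where discourse coherence matters most
```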
No reference required
Quality estimation mode by default. Evaluates from source text only. Practical across 299 language pairs, where references rarely exist.
Validation
Validated on WMT25 human evaluation data (Kocmi et al., 2025). ESA protocol, 0–100 scale, two independent annotators per segment. 11 EN→X pairs covering all available WMT25 data. See key results in Why not COMET alone above.
Benchmark data description
| Column | Description | Range |
|---|---|---|
| language_pair | Source-target (EN-JA) | 299 pairs |
| model | Model identifier | 21 models |
| fluency2_mean | Mean fluency score | 0–10 |
| chrF | Corpus chrF (sacrebleu) | 0–100 |
| BLEU | Corpus BLEU (sacrebleu) | 0–100 |
| COMET | wmt22-comet-da | 0–1 |
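A short stdlib-only sketch of working with data in this schema, for example finding the best-scoring model per language pair under a chosen metric. The sample rows are made up to show the column layout, not real benchmark values.

```python
import csv
import io

# Illustrative rows in the benchmark schema above (not real values).
SAMPLE = """language_pair,model,fluency2_mean,chrF,BLEU,COMET
EN-JA,model_a,8.9,28.4,12.1,0.88
EN-JA,model_b,8.1,25.0,10.3,0.85
ES-EN,model_a,9.4,75.2,61.3,0.92
"""

def best_per_pair(fh, metric="fluency2_mean"):
    """Map each language pair to its (model, score) winner under `metric`."""
    best = {}
    for row in csv.DictReader(fh):
        pair, score = row["language_pair"], float(row[metric])
        if pair not in best or score > best[pair][1]:
            best[pair] = (row["model"], score)
    return best

best = best_per_pair(io.StringIO(SAMPLE))
```

Swapping `metric` for `"COMET"` or `"chrF"` gives the same leaderboard under a different signal, which is how disagreements between the metrics surface.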
Aggregate stats (299 language pairs)
| Metric | Mean | Min (pair) | Max (pair) |
|---|---|---|---|
| fluency v2.0 | 8.83 | 5.96 (EN→OR) | 9.95 (RU→PT) |
| chrF | 41.55 | 0.09 (HI→YUE) | 75.19 (ES→EN) |
| BLEU | 13.90 | 0.00 (CJK) | 61.34 (ES→EN) |
| COMET | 0.807 | 0.203 (HI→YUE) | 0.927 (EN→RO) |
Contact: team@algebras.ai