fluency
translation quality beyond a single score
An LLM-judge metric that evaluates translations across five explainable dimensions. Validated on WMT25. 299 language pairs.
What it measures
Most translation metrics produce a single number. That number tells you that something is wrong, but not what. fluency evaluates translation quality across five dimensions that matter for creative and product localization.
Each dimension receives a separate score from 0 to 10. The final score aggregates all five. When a translation scores 9.2 overall but 4.1 on calque detection, you know exactly where the problem is: the translation follows source-language structure instead of sounding native.
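To make the per-dimension reporting concrete, here is a minimal sketch of aggregating five 0–10 dimension scores and surfacing the weakest one. The unweighted mean is an assumption for illustration; the document's own example (9.2 overall with 4.1 on calque) suggests the production aggregation weights dimensions differently.

```python
from statistics import mean

# The five subdimensions described below; each scored 0-10.
DIMENSIONS = ["calque", "idiomaticity", "collocation", "discourse", "pragmatics"]

def aggregate(scores: dict[str, float]) -> tuple[float, str]:
    """Return an overall score and the weakest dimension.

    An unweighted mean is assumed here for illustration; the
    actual fluency aggregation is not specified in this document.
    """
    overall = mean(scores[d] for d in DIMENSIONS)
    weakest = min(DIMENSIONS, key=lambda d: scores[d])
    return round(overall, 2), weakest

# A translation that reads fluently overall but is calque-heavy:
overall, weakest = aggregate(
    {"calque": 4.1, "idiomaticity": 9.5, "collocation": 9.0,
     "discourse": 9.8, "pragmatics": 9.6}
)
```

The point of keeping the per-dimension scores alongside the aggregate is exactly this: `weakest` tells a reviewer where to look, which a single scalar cannot.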
fluency is an LLM-judge metric. It uses a frontier language model to evaluate translations the way a professional linguist would: reading for naturalness, not just correctness.
Two versions
fluency v2.0 (frontier)
- Backbone: Gemini 3 Flash
- Reference: not required
- Latency: sub-second per segment
- Coverage: 299 language pairs, 8 source languages
- Best for: continuous evaluation, CI/CD pipelines, high-volume QA
fluency pro v1 (enterprise)
- Backbone: proprietary Algebras model
- Reference: optional
- Deployment: cloud API or on-premise container
- Audit trail: full scoring breakdown per segment
- Pricing: monthly subscription + per-word usage
- Best for: LocQA workflows, compliance-sensitive content, publisher sign-off
Five subdimensions
Calque detection
Identifies literal transfer of source-language structure into the target. A calque is grammatically correct but structurally foreign. Example: English "make a decision" translated to Spanish as "hacer una decisión" instead of the natural "tomar una decisión." Surface metrics like chrF cannot detect this because the words are individually correct.
Idiomatic naturalness
Measures whether a native speaker would produce this phrasing. A translation can be accurate and grammatical but still sound translated. This dimension catches the gap between "correct" and "natural" that separates machine output from human-quality text.
Collocational accuracy
Evaluates whether words combine the way they do in natural target-language usage. "Strong tea" is natural in English; "powerful tea" is not, even though "powerful" is a valid synonym of "strong." Collocational errors are invisible to reference-based metrics when the reference uses a different paraphrase.
Discourse coherence
Assesses whether sentences work together as a text, not just individually. A paragraph where each sentence is well-translated but the connections between them are broken scores high on segment-level metrics and low on discourse coherence. This dimension matters most for longer content: game dialogue, marketing copy, documentation.
Pragmatic adequacy
Checks whether the translation uses the right register, tone, and level of formality for the intended audience. A mobile game localized with formal academic language fails pragmatically even if every word is correct. This dimension is especially important for creative localization where tone carries meaning.
Why not COMET alone
COMET is a strong metric for European language pairs. We use it ourselves as one of four evaluation signals. But COMET has a structural limitation: its reliability depends on how similar the source and target languages are.
We validated fluency v2.0 against WMT25 human evaluation data across 11 English-to-X language pairs.
On typologically distant pairs (EN→JA, EN→AR, EN→Maasai, EN→CS): fluency outperforms COMET on pairwise accuracy and Spearman correlation with human judgments. The judge wins on 4 of 4 pairs with syntactic distance above 0.30.
On typologically close pairs (EN→UK, EN→RU, EN→ET, EN→SR): COMET performs comparably or better.
The correlation between syntactic distance and judge advantage is ρ = 0.77 (p = 0.005). This means metric reliability is not uniform across languages.
| Finding | Evidence |
|---|---|
| Judge outperforms COMET on distant pairs | 4/4 pairs with syntax distance > 0.30 |
| COMET outperforms judge on close pairs | 4/5 pairs with syntax distance < 0.20 |
| Syntactic distance predicts advantage | ρ = 0.77, p = 0.005 |
| Pattern holds across surface metrics | chrF ρ = 0.77, BLEU ρ = 0.76 |
| Survives multiple comparison correction | Holm-Bonferroni p = 0.038 |
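The ρ values above are Spearman rank correlations between per-pair syntactic distance and the judge's advantage over the baseline metric. A pure-Python version of that computation is sketched below; the data points are illustrative toy values, not the WMT25 results.

```python
def rank(values):
    """Average 1-based ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for a tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Illustrative only: greater syntactic distance, larger judge advantage.
distance  = [0.12, 0.18, 0.25, 0.31, 0.38, 0.45]
advantage = [-0.04, -0.02, 0.00, 0.03, 0.05, 0.08]
rho = spearman(distance, advantage)  # 1.0 on this perfectly monotone toy data
```

Spearman is the right choice here because it asks only whether judge advantage grows monotonically with distance, without assuming a linear relationship.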
Benchmark coverage
Source languages: English, Arabic, Spanish, French, Hindi, Portuguese, Russian, Chinese. 42 target languages including Amharic, Azerbaijani, Bengali, Cantonese, Japanese, Kazakh, Maasai, Swahili, and others. Metrics per pair: fluency v2.0, chrF, BLEU, COMET.
How it works
Double-pass consistency
Each segment scored twice independently. Disagreements flagged for review. Reduces single-pass LLM variance.
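The double-pass protocol can be sketched as follows. `score_segment` stands in for a single LLM-judge call, and the disagreement tolerance is an illustrative assumption; neither is part of any documented API.

```python
def double_pass(score_segment, source, translation, tolerance=1.0):
    """Score a segment twice with independent judge calls.

    `score_segment` is a stand-in for one LLM-judge call, and the
    tolerance threshold is an illustrative assumption.
    """
    first = score_segment(source, translation)
    second = score_segment(source, translation)
    flagged = abs(first - second) > tolerance  # disagreement -> human review
    return (first + second) / 2, flagged

# Demo with a deterministic fake judge that disagrees with itself:
passes = iter([8.0, 6.0])
avg, flagged = double_pass(lambda s, t: next(passes), "src text", "tgt text")
```

Averaging the two passes smooths single-call variance; flagging large gaps routes genuinely ambiguous segments to a human instead of silently averaging them away.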
Severity levels
Minor (noticeable, not blocking) vs major (user confusion, brand damage). QA teams prioritize by impact, not just score.
Length stratification
Short segments (button labels) and long segments (paragraphs) evaluated with adapted expectations. No one-size-fits-all rubric.
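One way to picture the stratification is a simple length-based router. The character thresholds below are illustrative assumptions; the document only states that short UI strings and long paragraphs get adapted expectations.

```python
def rubric_for(segment: str) -> str:
    """Pick an evaluation rubric by segment length.

    Thresholds are illustrative assumptions, not documented values.
    """
    n = len(segment)
    if n <= 30:
        return "short"   # button labels, menu items
    if n <= 200:
        return "medium"  # single sentences
    return "long"        # paragraphs, where discourse coherence matters most
```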
No reference required
Quality estimation mode by default. Evaluates from source text only. Practical across 299 language pairs, where references rarely exist.
Validation
Validated on WMT25 human evaluation data (Kocmi et al., 2025). ESA protocol, 0–100 scale, two independent annotators per segment. 11 EN→X pairs covering all available WMT25 data. See key results in Why not COMET alone above.
Benchmark data description
| Column | Description | Range |
|---|---|---|
| language_pair | Source-target (EN-JA) | 299 pairs |
| model | Model identifier | 21 models |
| fluency2_mean | Mean fluency score | 0–10 |
| chrF | Corpus chrF (sacrebleu) | 0–100 |
| BLEU | Corpus BLEU (sacrebleu) | 0–100 |
| COMET | wmt22-comet-da | 0–1 |
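A short stdlib-only sketch of working with data in this schema, for example finding the best-scoring model per language pair under a chosen metric. The sample rows are made up to show the column layout, not real benchmark values.

```python
import csv
import io

# Illustrative rows in the benchmark schema above (not real values).
SAMPLE = """language_pair,model,fluency2_mean,chrF,BLEU,COMET
EN-JA,model_a,8.9,28.4,12.1,0.88
EN-JA,model_b,8.1,25.0,10.3,0.85
ES-EN,model_a,9.4,75.2,61.3,0.92
"""

def best_per_pair(fh, metric="fluency2_mean"):
    """Map each language pair to its (model, score) winner under `metric`."""
    best = {}
    for row in csv.DictReader(fh):
        pair, score = row["language_pair"], float(row[metric])
        if pair not in best or score > best[pair][1]:
            best[pair] = (row["model"], score)
    return best

best = best_per_pair(io.StringIO(SAMPLE))
```

Swapping `metric` for `"COMET"` or `"chrF"` gives the same leaderboard under a different signal, which is how disagreements between the metrics surface.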
Aggregate stats (299 language pairs)
| Metric | Mean | Min (pair) | Max (pair) |
|---|---|---|---|
| fluency v2.0 | 8.83 | 5.96 (EN→OR) | 9.95 (RU→PT) |
| chrF | 41.55 | 0.09 (HI→YUE) | 75.19 (ES→EN) |
| BLEU | 13.90 | 0.00 (CJK) | 61.34 (ES→EN) |
| COMET | 0.807 | 0.203 (HI→YUE) | 0.927 (EN→RO) |
Contact: team@algebras.ai