fluency

translation quality beyond a single score

An LLM-judge metric that evaluates translations across five explainable dimensions. Validated on WMT25. 299 language pairs.

Explore benchmark data →


What it measures

Most translation metrics produce one number. That number tells you something is wrong but not what. fluency evaluates translation quality across five dimensions that matter for creative and product localization.

Each dimension receives a separate score from 0 to 10. The final score aggregates all five. When a translation scores 9.2 overall but 4.1 on calque detection, you know exactly where the problem is: the translation follows source-language structure instead of sounding native.
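The page does not specify the aggregation rule, so the sketch below uses an equal-weight mean purely for illustration (the real metric may weight dimensions differently, as the 9.2-overall / 4.1-calque example suggests). Dimension keys follow the section names below; everything else is an assumption.

```python
# Hypothetical sketch: combine per-dimension 0-10 scores into one
# overall score and surface the weakest dimension. The equal-weight
# mean is an illustrative assumption, not the product's formula.
DIMENSIONS = [
    "calque_detection",
    "idiomatic_naturalness",
    "collocational_accuracy",
    "discourse_coherence",
    "pragmatic_adequacy",
]

def aggregate(scores: dict[str, float]) -> tuple[float, str]:
    """Return (overall 0-10 score, name of the weakest dimension)."""
    assert set(scores) == set(DIMENSIONS)
    overall = sum(scores.values()) / len(scores)
    weakest = min(scores, key=scores.get)
    return round(overall, 2), weakest

overall, weakest = aggregate({
    "calque_detection": 4.1,
    "idiomatic_naturalness": 9.0,
    "collocational_accuracy": 9.5,
    "discourse_coherence": 9.2,
    "pragmatic_adequacy": 9.4,
})
# `weakest` pinpoints calque_detection as the actionable problem
# even though the overall score still looks strong.
```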

fluency is an LLM-judge metric. It uses a frontier language model to evaluate translations the way a professional linguist would: reading for naturalness, not just correctness.


Two versions

fluency v2.0 (frontier)

  • Backbone: Gemini 3 Flash
  • Reference: not required
  • Latency: sub-second per segment
  • Coverage: 299 language pairs, 8 source languages
  • Best for: continuous evaluation, CI/CD pipelines, high-volume QA

fluency pro v1 (enterprise)

  • Backbone: proprietary Algebras model
  • Reference: optional
  • Deployment: cloud API or on-premise container
  • Audit trail: full scoring breakdown per segment
  • Pricing: monthly subscription + per-word usage
  • Best for: LocQA workflows, compliance-sensitive content, publisher sign-off

Five subdimensions

Calque detection

Identifies literal transfer of source-language structure into the target. A calque is grammatically correct but structurally foreign. Example: English "make a decision" translated to Spanish as "hacer una decisión" instead of the natural "tomar una decisión." Surface metrics like chrF cannot detect this because the words are individually correct.
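A toy character-trigram F1 (a much-simplified stand-in for chrF) makes the limitation concrete: the calque shares most of its character n-grams with the natural reference, so surface overlap barely drops.

```python
# Toy character-trigram F1, a simplified stand-in for chrF, showing why
# surface overlap cannot flag a calque: "hacer una decisión" differs
# from the natural "tomar una decisión" in only one word.
def trigrams(s: str) -> set[str]:
    return {s[i:i + 3] for i in range(len(s) - 2)}

def char_f1(hyp: str, ref: str) -> float:
    h, r = trigrams(hyp), trigrams(ref)
    overlap = len(h & r)
    if not overlap:
        return 0.0
    p, rec = overlap / len(h), overlap / len(r)
    return 2 * p * rec / (p + rec)

ref = "tomar una decisión"     # natural Spanish
calque = "hacer una decisión"  # literal transfer from English
score = char_f1(calque, ref)   # high overlap despite the calque
```

Here the calque still scores 0.75 against the natural reference, even though a native speaker would flag it immediately.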

Idiomatic naturalness

Measures whether a native speaker would produce this phrasing. A translation can be accurate and grammatical but still sound translated. This dimension catches the gap between "correct" and "natural" that separates machine output from human-quality text.

Collocational accuracy

Evaluates whether words combine the way they do in natural target-language usage. "Strong tea" is natural in English; "powerful tea" is not, even though "powerful" is a valid synonym of "strong." Collocational errors are invisible to reference-based metrics when the reference uses a different paraphrase.

Discourse coherence

Assesses whether sentences work together as a text, not just individually. A paragraph where each sentence is well-translated but the connections between them are broken scores high on segment-level metrics and low on discourse coherence. This dimension matters most for longer content: game dialogue, marketing copy, documentation.

Pragmatic adequacy

Checks whether the translation uses the right register, tone, and level of formality for the intended audience. A mobile game localized with formal academic language fails pragmatically even if every word is correct. This dimension is especially important for creative localization where tone carries meaning.


Why not COMET alone

COMET is a strong metric for European language pairs. We use it ourselves as one of four evaluation signals. But COMET has a structural limitation: its reliability depends on how similar the source and target languages are.

We validated fluency v2.0 against WMT25 human evaluation data across 11 English-to-X language pairs.

On typologically distant pairs (EN→JA, EN→AR, EN→Maasai, EN→CS): fluency outperforms COMET on pairwise accuracy and Spearman correlation with human judgments. The judge wins on 4 of 4 pairs with syntactic distance above 0.30.

On typologically close pairs (EN→UK, EN→RU, EN→ET, EN→SR): COMET performs comparably or better.

The correlation between syntactic distance and judge advantage is ρ = 0.77 (p = 0.005). This means metric reliability is not uniform across languages.
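The shape of this validation step can be sketched as a rank correlation between per-pair syntactic distance and the judge's advantage over COMET. The data below are illustrative placeholders, not the WMT25 numbers; the Spearman implementation is from scratch (no tie handling) to keep the sketch dependency-free.

```python
# Sketch of the validation logic: Spearman rank correlation between
# syntactic distance and judge-vs-COMET advantage. Values below are
# placeholders, not the actual WMT25 data. No tie handling (fine for
# distinct values).
def rank(xs: list[float]) -> list[float]:
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for r, i in enumerate(order):
        ranks[i] = float(r)
    return ranks

def spearman(x: list[float], y: list[float]) -> float:
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

syntactic_distance = [0.12, 0.18, 0.25, 0.33, 0.41]   # placeholder
judge_advantage = [-0.05, -0.02, 0.01, 0.06, 0.09]    # placeholder
rho = spearman(syntactic_distance, judge_advantage)   # monotone here
```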

Finding | Evidence
Judge outperforms COMET on distant pairs | 4/4 pairs with syntax distance > 0.30
COMET outperforms judge on close pairs | 4/5 pairs with syntax distance < 0.20
Syntactic distance predicts advantage | ρ = 0.77, p = 0.005
Pattern holds across surface metrics | chrF ρ = 0.77, BLEU ρ = 0.76
Survives multiple comparison correction | Holm-Bonferroni p = 0.038

Benchmark coverage

299 language pairs · 8 source languages · 20 frontier LLMs · 10 segments per pair

Source languages: English, Arabic, Spanish, French, Hindi, Portuguese, Russian, Chinese. 42 target languages including Amharic, Azerbaijani, Bengali, Cantonese, Japanese, Kazakh, Maasai, Swahili, and others. Metrics per pair: fluency v2.0, chrF, BLEU, COMET.


How it works

Double-pass consistency

Each segment scored twice independently. Disagreements flagged for review. Reduces single-pass LLM variance.
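The double-pass logic reduces to a small control loop. The interface below is assumed: `judge` stands in for any scorer returning a 0–10 value, and the 1.0 disagreement threshold is an illustrative choice, not the product's.

```python
# Minimal sketch of double-pass consistency: score the segment twice,
# average the passes, and flag large disagreements for human review.
# `judge` and the threshold are assumptions for illustration.
def double_pass(judge, segment: str, threshold: float = 1.0):
    s1, s2 = judge(segment), judge(segment)
    flagged = abs(s1 - s2) > threshold
    return (s1 + s2) / 2, flagged
```

A deterministic judge never flags; a judge whose two passes diverge beyond the threshold does.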

Severity levels

Minor (noticeable, not blocking) vs major (user confusion, brand damage). QA teams prioritize by impact, not just score.

Length stratification

Short segments (button labels) and long segments (paragraphs) evaluated with adapted expectations. No one-size-fits-all rubric.
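Rubric selection by segment length can be sketched as a simple dispatch. The token cutoffs and rubric names here are assumptions for illustration, not the product's actual strata.

```python
# Illustrative length stratification: pick an evaluation rubric by
# segment size. Cutoffs and rubric names are assumed for the sketch.
def pick_rubric(segment: str) -> str:
    n_tokens = len(segment.split())
    if n_tokens <= 3:
        return "ui-label"   # button labels, menu items
    if n_tokens <= 30:
        return "sentence"   # single-sentence expectations
    return "paragraph"      # discourse-level expectations apply
```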

No reference required

Quality estimation mode by default. Evaluates from the source text alone. Practical across 299 language pairs, where references rarely exist.


Validation

Validated on WMT25 human evaluation data (Kocmi et al., 2025). ESA protocol, 0–100 scale, two independent annotators per segment. 11 EN→X pairs covering all available WMT25 data. See key results in Why not COMET alone above.


Related

  • Kocmi et al. (2025). WMT25 General MT Shared Task. Link TBD
  • Rei et al. (2022). COMET-22. Link TBD
  • Kocmi & Federmann (2023). GEMBA-MQM. Link TBD

Benchmark data description

Column | Description | Range
language_pair | Source–target (e.g. EN-JA) | 299 pairs
model | Model identifier | 21 models
fluency2_mean | Mean fluency score | 0–10
chrF | Corpus chrF (sacrebleu) | 0–100
BLEU | Corpus BLEU (sacrebleu) | 0–100
COMET | wmt22-comet-da | 0–1

Aggregate stats (299 language pairs)

Metric | Mean | Min LP | Max LP
fluency v2.0 | 8.83 | 5.96 (EN→OR) | 9.95 (RU→PT)
chrF | 41.55 | 0.09 (HI→YUE) | 75.19 (ES→EN)
BLEU | 13.90 | 0.00 (CJK) | 61.34 (ES→EN)
COMET | 0.807 | 0.203 (HI→YUE) | 0.927 (EN→RO)

Explore the full benchmark →

Contact: team@algebras.ai