DHI · jsmariegaard · Jun 10, 2026 · Jun 10, 2026 · Jun 10, 2026 · Jun 10, 2026
@@ -48,6 +48,7 @@ website:
         - user-guide/selecting-data.qmd
         - user-guide/plotting.qmd
         - user-guide/statistics.qmd
+        - user-guide/metrics.qmd
         - section: "Extensions"
           contents:
             - user-guide/network.qmd
@@ -286,6 +287,7 @@ quartodoc:
             - c_mae
             - c_mean_absolute_error
             - mape
+            - mean_absolute_percentage_error
             - cc
             - corrcoef
             - rho

@@ -8,7 +8,7 @@ format-links: false
 ModelSkill compares model results with observations. The workflow can be split in two phases:
 
 1. [Matching](matching.qmd) - making sure that observations and model results are in the same space and time
-2. Analysis - [plots](plotting.qmd) and [statistics](statistics.qmd) of the matched data
+2. Analysis - [plots](plotting.qmd) and [statistics](statistics.qmd) of the matched data (see the [metrics guide](metrics.qmd) for choosing skill metrics)
 
 If the observations and model results are already matched (i.e. are stored in the same data source), 
 the `from_matched()` function can be used to go directly to the analysis phase. 

@@ -0,0 +1,220 @@
+# Metrics
+
+ModelSkill comes with a comprehensive set of skill metrics. This page is a *when-to-use*
+guide: how the metrics differ, which one answers which question, and the common pitfalls.
+Each metric name links to its full mathematical definition in the
+[metrics API reference](../api/metrics.qmd).
+
+Metrics are passed to [](`~modelskill.Comparer.skill`) (and
+[](`~modelskill.ComparerCollection.mean_skill`), [](`~modelskill.Comparer.score`)) as
+lower-case strings, or as callables:
+
+```{python}
+#| code-fold: true
+#| code-summary: "Construct a comparer from observation and model data"
+import modelskill as ms
+o1 = ms.observation("../data/SW/HKNA_Hm0.dfs0", item=0,
+                    x=4.2420, y=52.6887, name="HKNA")
+o2 = ms.observation("../data/SW/eur_Hm0.dfs0", item=0,
+                    x=3.2760, y=51.9990, name="EPL")
+mr = ms.model_result("../data/SW/HKZN_local_2017_DutchCoast.dfsu",
+                     item="Sign. Wave Height", name="m1")
+cc = ms.match([o1, o2], mr)
+```
+
+```{python}
+cc.skill(metrics=["bias", "rmse", "kge"])
+```
+
+In the tables below the first column gives the metric string ModelSkill accepts; the
+**alias** (a longer synonym that also works) is listed underneath it. Both link to the API.
+
+When you pass no `metrics=` argument, [](`~modelskill.Comparer.skill`) reports the default
+set — the point count `n` plus `bias`, `rmse`, `urmse`, `mae`, `cc`, `si`, `r2`. You can
+change the default via `ms.options.metrics.list = [...]`.
+
+```{=html}
+<style>
+.metrics-ref table { table-layout: fixed; width: 100%; }
+.metrics-ref table th:nth-child(1), .metrics-ref table td:nth-child(1) { width: 22%; }
+.metrics-ref table th:nth-child(2), .metrics-ref table td:nth-child(2) { width: 12%; }
+.metrics-ref table th:nth-child(3), .metrics-ref table td:nth-child(3) { width: 6%;  }
+.metrics-ref table td { vertical-align: top; }
+.metrics-ref table code { overflow-wrap: anywhere; }
+</style>
+```
+
+---
+
+## Error magnitude — same units as the data (scale-dependent)
+
+These are in the data's units (m, m³/s, …), so they are **not** comparable across stations
+of different magnitude — a 5,000 m³/s river will always out-RMSE a 50 m³/s tributary.
+
+::: {.metrics-ref}
+| Metric | Range | Best | What it is |
+|---|---|---|---|
+| [](`~modelskill.metrics.bias`) | (−∞, ∞) | 0 | Mean error, `mean(model − obs)`. **Sign matters**: + = model runs high, − = runs low. |
+| [](`~modelskill.metrics.mae`)<br>[](`~modelskill.metrics.mean_absolute_error`) | [0, ∞) | 0 | Mean absolute error. Typical error magnitude, **robust to outliers**. |
+| [](`~modelskill.metrics.rmse`)<br>[](`~modelskill.metrics.root_mean_squared_error`) | [0, ∞) | 0 | Root-mean-square error. Like MAE but **penalises large misses** disproportionately. |
+| [](`~modelskill.metrics.urmse`) | [0, ∞) | 0 | Unbiased RMSE — the **random/scatter** part after removing bias. `rmse² = bias² + urmse²`. |
+| [](`~modelskill.metrics.max_error`) | [0, ∞) | 0 | Largest single absolute error — the **worst-case** miss. |
+:::
+
+**Use when:** you want the error in physical units. `bias` for systematic offset, `rmse` as
+the default, `mae` when a few outliers shouldn't dominate, `urmse` to isolate random error
+from bias, `max_error` for safety-/threshold-critical checks.
+
+---
+
+## Relative / normalised error — dimensionless
+
+Divide the error by a scale, so they *can* be compared across stations — but they divide by
+observed values, so they are **unstable when observations pass through zero** (watch out for
+water level around a datum).
+
+::: {.metrics-ref}
+| Metric | Range | Best | What it is |
+|---|---|---|---|
+| [](`~modelskill.metrics.si`)<br>[](`~modelskill.metrics.scatter_index`) | [0, ∞) | 0 | Scatter index = `urmse / mean(obs)`. Random error as a fraction of the mean. |
+| [](`~modelskill.metrics.mape`)<br>[](`~modelskill.metrics.mean_absolute_percentage_error`) | [0, ∞) | 0 | Mean absolute % error, `mean(\|model−obs\| / \|obs\|)`. Intuitive %, but blows up near obs = 0. |
+:::
+
+**Use when:** comparing error across stations of very different magnitude, **and** the
+observed values stay comfortably away from zero. For water level, prefer the efficiency
+scores below over `mape`/`si`.
+
+---
+
+## Efficiency / skill scores — dimensionless, comparable across stations
+
+The "one number for overall skill" family. All reference the observed mean or variance, so a
+score of 0 (for NSE/KGE) means "no better than predicting the mean."
+
+::: {.metrics-ref}
+| Metric | Range | Best | What it is |
+|---|---|---|---|
+| [](`~modelskill.metrics.nse`)<br>[](`~modelskill.metrics.nash_sutcliffe_efficiency`) | (−∞, 1] | 1 | Nash–Sutcliffe efficiency. 0 = no better than the obs mean; < 0 = worse. |
+| [](`~modelskill.metrics.r2`) | (−∞, 1] | 1 | Coefficient of determination. Identical to NSE (see below). |
+| [](`~modelskill.metrics.kge`)<br>[](`~modelskill.metrics.kling_gupta_efficiency`) | (−∞, 1] | 1 | Kling–Gupta — composite of **correlation + bias ratio + variability ratio**. |
+| [](`~modelskill.metrics.willmott`) | [0, 1] | 1 | Willmott's Index of Agreement — bounded [0,1], less harsh on bias than NSE. |
+| [](`~modelskill.metrics.ev`)<br>[](`~modelskill.metrics.explained_variance`) | (−∞, 1] | 1 | Proportion of variance explained (differs from NSE when the model is biased). |
+| [](`~modelskill.metrics.mef`)<br>[](`~modelskill.metrics.model_efficiency_factor`) | [0, ∞) | 0 | `RMSE / std(obs) = √(1 − NSE)`. Same information as NSE, expressed as an **error** (lower is better). |
+:::
+
+**Use when:** you need a single dimensionless score to rank models or compare stations.
+Reach for **`kge`** when you want to *diagnose why* a model fails (it separates correlation,
+bias, and variance); **`nse`** is the hydrology standard for overall predictive power;
+`willmott` if you want a strictly bounded [0,1] score.
+
+### `cc`, `ev` and `r2`: a nested hierarchy
+
+A common question is how `cc`, `ev` and `r2` relate. They answer increasingly strict
+versions of *"how much of the observed variation does the model capture?"*, each penalising
+one more kind of error:
+
+| Score | Penalises bias (offset)? | Penalises wrong amplitude? | Scored against |
+|---|---|---|---|
+| `cc²` (square of `cc`) | no | no | the best-fit line (free slope + intercept) |
+| `ev` | no | yes | obs variance, ignoring a constant offset |
+| `r2` (= `nse`) | yes | yes | the 1:1 line |
+
+For the same data this gives the ordering **`cc²` ≥ `ev` ≥ `r2`**: `cc²` forgives both a
+constant offset and a wrong amplitude, `ev` forgives only the offset, and `r2`/`nse` forgives
+nothing (it scores against the 1:1 line). The gap `ev − r2` is exactly the squared,
+normalised bias.
+
+In practice, report `cc` (timing/phase) **plus** one efficiency score (`nse` or `kge`)
+**plus** `bias` separately — that trio localises a failure to phase, offset, or amplitude,
+which no single number can. `cc²` rarely adds anything once you already report `cc`, and `ev`
+(scikit-learn's `explained_variance_score`) sits between the two and is seldom reported on its
+own in water modelling.
+
+::: {.callout-warning}
+## `r2` is the coefficient of determination, not squared correlation
+ModelSkill's `r2` equals **NSE** — they are the same number under two names — *not* squared
+Pearson correlation. The two generally differ (`r2` penalises bias, `cc²` does not). If you
+want squared correlation, compute `cc` and square it yourself.
+:::
+
+---
+
+## Correlation & amplitude — dimensionless
+
+These ignore systematic bias, so always read them **alongside** `bias`.
+
+::: {.metrics-ref}
+| Metric | Range | Best | What it is |
+|---|---|---|---|
+| [](`~modelskill.metrics.cc`)<br>[](`~modelskill.metrics.corrcoef`) | [−1, 1] | 1 | Pearson correlation — **linear co-variation / timing**. Blind to bias and amplitude. |
+| [](`~modelskill.metrics.rho`)<br>[](`~modelskill.metrics.spearmanr`) | [−1, 1] | 1 | Spearman rank correlation — monotonic, **robust** to outliers and non-linearity. |
+| [](`~modelskill.metrics.lin_slope`) | (−∞, ∞) | 1 | Slope of the model-vs-obs regression. < 1 = model **under-responds** in amplitude. |
+:::
+
+**Use when:** the question is about **phase/timing** (`cc`), a monotonic-but-non-linear
+relationship (`rho`), or **amplitude** of the response (`lin_slope`).
+
+---
+
+## Event & distribution
+
+::: {.metrics-ref}
+| Metric | Range | Best | What it is |
+|---|---|---|---|
+| [](`~modelskill.metrics.peak_ratio`)<br>[](`~modelskill.metrics.pr`) | [0, ∞) | 1 | Ratio of modelled to observed **peaks** (mean over matched peak events). < 1 = peaks under-predicted. |
+| [](`~modelskill.metrics.hit_ratio`) | [0, 1] | 1 | Fraction of points within an acceptable deviation `a` of the observation. Takes an `a=` argument. |
+:::
+
+**Use when:** `peak_ratio` for storm-surge / flood-peak capture. `hit_ratio` for
+acceptance-criterion reporting — "X % of points within ±0.1 m" (set the tolerance with `a=`).
+
+---
+
+## Directional / circular — for direction variables only
+
+For wind / wave / current **direction**, where 359° and 1° are 2° apart, not 358°. These
+require the quantity to be flagged directional
+(`Quantity(..., is_directional=True)`); they handle the 0–360° wrap-around. See the
+[directional data example](../examples/Directional_data_comparison.qmd).
+
+::: {.metrics-ref}
+| Metric | Range | Best | What it is |
+|---|---|---|---|
+| [](`~modelskill.metrics.c_bias`) | [−180, 180] | 0 | Circular bias (mean angular error). |
+| [](`~modelskill.metrics.c_mae`)<br>[](`~modelskill.metrics.c_mean_absolute_error`) | [0, 180] | 0 | Circular mean absolute error. |
+| [](`~modelskill.metrics.c_rmse`)<br>[](`~modelskill.metrics.c_root_mean_squared_error`) | [0, 180] | 0 | Circular RMSE. |
+| [](`~modelskill.metrics.c_urmse`)<br>[](`~modelskill.metrics.c_unbiased_root_mean_squared_error`) | [0, 180] | 0 | Circular unbiased RMSE. |
+| [](`~modelskill.metrics.c_max_error`) | [0, 180] | 0 | Largest circular (angular) error. |
+:::
+
+**Use when:** the variable is an angle. Don't use the scalar metrics above on direction data.
+
+---
+
+## Quick decision guide
+
+| Your question | Reach for |
+|---|---|
+| Is the model systematically high or low? | `bias` |
+| Does it under-/over-respond in amplitude? | `lin_slope` |
+| How big is a typical error, in my units? | `rmse` (penalise big misses) or `mae` (robust) |
+| Split random vs systematic error? | `urmse` vs `bias` (`rmse² = bias² + urmse²`) |
+| What's the worst single miss? | `max_error` |
+| One dimensionless score, comparable across stations? | `nse` or `kge` |
+| *Why* does the model fail (corr / bias / variance)? | `kge` |
+| Compare error across very different-sized stations? | `si` or `nse`/`kge` (avoid `rmse`) |
+| Timing / phase agreement? | `cc` |
+| Do we capture the storm peaks? | `peak_ratio` |
+| What fraction meets an acceptance tolerance? | `hit_ratio` (set `a=`) |
+| Working with direction (wind/wave/current)? | the `c_*` family |
+
+## Cross-cutting reminders
+
+- **Scale-dependent vs dimensionless.** The error-magnitude metrics are in data units — never
+  compare them across stations of different magnitude. The rest are dimensionless and comparable.
+- **Always pair correlation with bias.** `cc`/`rho` are blind to offset; a model can correlate
+  perfectly while sitting 1 m too high.
+- **Division-by-obs metrics** (`mape`, `si`) are fragile near zero observations.
+- **Custom metrics**: any `f(obs, model) -> float` callable drops straight into
+  `metrics=[...]`; the column takes the function name. See the
+  [custom metric example](../examples/Metrics_custom_metric.qmd).
@@ -483,7 +483,7 @@ def scatter_index2(obs: ArrayLike, model: ArrayLike) -> Any:
     {\sum_{i=1}^n obs_i^2}}
     $$
 
-    Range: [0, 100]; Best: 0
+    Range: $[0, \infty)$; Best: 0
     """
     assert obs.size == model.size
     if len(obs) == 0:
@@ -506,8 +506,10 @@ def explained_variance(obs: ArrayLike, model: ArrayLike) -> Any:
     r"""EV: Explained variance
 
      EV is the explained variance and measures the proportion
-     [0 - 1] to which the model accounts for the variation
-     (dispersion) of the observations.
+     to which the model accounts for the variation (dispersion)
+     of the observations. A perfect match gives 1; a model that
+     explains less variance than the observed mean gives a
+     negative value.
 
      In cases with no bias, EV is equal to r2
 
@@ -518,7 +520,7 @@ def explained_variance(obs: ArrayLike, model: ArrayLike) -> Any:
     (obs_i - \overline{obs})^2}
     $$
 
-    Range: [0, 1]; Best: 1
+    Range: $(-\infty, 1]$; Best: 1
 
     See Also
     --------
@@ -958,7 +960,7 @@ def c_max_error(obs: ArrayLike, model: ArrayLike) -> Any:
 
     Notes
     -----
-    Range: $[0, \\infty)$; Best: 0
+    Range: $[0, 180]$; Best: 0
 
     Returns
     -------