From 5bb7421a98fbf5bb65ef7d03b08271ae45e75cd4 Mon Sep 17 00:00:00 2001
From: Jesper Sandvig Mariegaard
 <34088801+jsmariegaard@users.noreply.github.com>
Date: Wed, 10 Jun 2026 12:43:04 +0200
Subject: [PATCH 1/4] Add metrics page to user guide

Adds a when-to-use guide for skill metrics: how they differ, which metric answers which question, and common pitfalls. Resolves the request in #670.

- New docs/user-guide/metrics.qmd with a live intro example and reference tables grouped by metric family, verified against src/modelskill/metrics.py
- Wire into the User Guide sidebar in docs/_quarto.yml
- Link from docs/user-guide/index.qmd

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/_quarto.yml            |   1 +
 docs/user-guide/index.qmd   |   2 +-
 docs/user-guide/metrics.qmd | 174 ++++++++++++++++++++++++++++++++++++
 3 files changed, 176 insertions(+), 1 deletion(-)
 create mode 100644 docs/user-guide/metrics.qmd

diff --git a/docs/_quarto.yml b/docs/_quarto.yml
index 4c3e92d85..3d457b785 100644
--- a/docs/_quarto.yml
+++ b/docs/_quarto.yml
@@ -48,6 +48,7 @@ website:
         - user-guide/selecting-data.qmd
         - user-guide/plotting.qmd
         - user-guide/statistics.qmd
+        - user-guide/metrics.qmd
         - section: "Extensions"
           contents:
             - user-guide/network.qmd
diff --git a/docs/user-guide/index.qmd b/docs/user-guide/index.qmd
index d8702d1e2..e5bf54fd1 100644
--- a/docs/user-guide/index.qmd
+++ b/docs/user-guide/index.qmd
@@ -8,7 +8,7 @@ format-links: false
 ModelSkill compares model results with observations. The workflow can be split in two phases:
 
 1. [Matching](matching.qmd) - making sure that observations and model results are in the same space and time
-2. Analysis - [plots](plotting.qmd) and [statistics](statistics.qmd) of the matched data
+2. Analysis - [plots](plotting.qmd) and [statistics](statistics.qmd) of the matched data (see the [metrics guide](metrics.qmd) for choosing skill metrics)
 
 If the observations and model results are already matched (i.e. are stored in the same data source), 
 the `from_matched()` function can be used to go directly to the analysis phase. 
diff --git a/docs/user-guide/metrics.qmd b/docs/user-guide/metrics.qmd
new file mode 100644
index 000000000..455e8c289
--- /dev/null
+++ b/docs/user-guide/metrics.qmd
@@ -0,0 +1,174 @@
+# Metrics
+
+ModelSkill comes with a comprehensive set of skill metrics. This page is a *when-to-use*
+guide: how the metrics differ, which one answers which question, and the common pitfalls.
+For the full mathematical definitions, see the [metrics API reference](../api/metrics.qmd).
+
+Metrics are passed to [](`~modelskill.Comparer.skill`) (and
+[](`~modelskill.ComparerCollection.mean_skill`), [](`~modelskill.Comparer.score`)) as
+lower-case strings, or as callables:
+
+```{python}
+#| code-fold: true
+#| code-summary: "Construct a comparer from observation and model data"
+import modelskill as ms
+o1 = ms.observation("../data/SW/HKNA_Hm0.dfs0", item=0,
+                    x=4.2420, y=52.6887, name="HKNA")
+o2 = ms.observation("../data/SW/eur_Hm0.dfs0", item=0,
+                    x=3.2760, y=51.9990, name="EPL")
+mr = ms.model_result("../data/SW/HKZN_local_2017_DutchCoast.dfsu",
+                     item="Sign. Wave Height", name="m1")
+cc = ms.match([o1, o2], mr)
+```
+
+```{python}
+cc.skill(metrics=["bias", "rmse", "kge"])
+```
+
+In the tables below, **name** is the exact string ModelSkill accepts and **alias** is a
+longer synonym that also works.
+
+When you pass no `metrics=` argument, [](`~modelskill.Comparer.skill`) reports the default
+set — the point count `n` plus `bias`, `rmse`, `urmse`, `mae`, `cc`, `si`, `r2`. You can
+change the default with [](`~modelskill.options`) (`ms.options.metrics.list = [...]`).
+
+---
+
+## Error magnitude — same units as the data (scale-dependent)
+
+These are in the data's units (m, m³/s, …), so they are **not** comparable across stations
+of different magnitude — a 5,000 m³/s river will always out-RMSE a 50 m³/s tributary.
+
+| Metric | Alias | Range | Best | What it is |
+|---|---|---|---|---|
+| `bias` | — | (−∞, ∞) | 0 | Mean error, `mean(model − obs)`. **Sign matters**: + = model runs high, − = runs low. |
+| `mae` | `mean_absolute_error` | [0, ∞) | 0 | Mean absolute error. Typical error magnitude, **robust to outliers**. |
+| `rmse` | `root_mean_squared_error` | [0, ∞) | 0 | Root-mean-square error. Like MAE but **penalises large misses** disproportionately. |
+| `urmse` | — | [0, ∞) | 0 | Unbiased RMSE — the **random/scatter** part after removing bias. `rmse² = bias² + urmse²`. |
+| `max_error` | — | [0, ∞) | 0 | Largest single absolute error — the **worst-case** miss. |
+
+**Use when:** you want the error in physical units. `bias` for systematic offset, `rmse` as
+the default, `mae` when a few outliers shouldn't dominate, `urmse` to isolate random error
+from bias, `max_error` for safety-/threshold-critical checks.
+
+---
+
+## Relative / normalised error — dimensionless
+
+Divide the error by a scale, so they *can* be compared across stations — but they divide by
+observed values, so they are **unstable when observations pass through zero** (watch out for
+water level around a datum).
+
+| Metric | Alias | Range | Best | What it is |
+|---|---|---|---|---|
+| `si` | `scatter_index` | [0, ∞) | 0 | Scatter index = `urmse / mean(obs)`. Random error as a fraction of the mean. |
+| `mape` | `mean_absolute_percentage_error` | [0, ∞) | 0 | Mean absolute % error, `mean(\|model−obs\| / \|obs\|)`. Intuitive %, but blows up near obs = 0. |
+| `scatter_index2` | — | [0, 100] | 0 | Alternative SI formulation (centred, normalised by `Σobs²`). Less common. |
+
+**Use when:** comparing error across stations of very different magnitude, **and** the
+observed values stay comfortably away from zero. For water level, prefer the efficiency
+scores below over `mape`/`si`.
+
+---
+
+## Efficiency / skill scores — dimensionless, comparable across stations
+
+The "one number for overall skill" family. All reference the observed mean or variance, so a
+score of 0 (for NSE/KGE) means "no better than predicting the mean."
+
+| Metric | Alias | Range | Best | What it is |
+|---|---|---|---|---|
+| `nse` | `nash_sutcliffe_efficiency` | (−∞, 1] | 1 | Nash–Sutcliffe efficiency. 0 = no better than the obs mean; < 0 = worse. |
+| `r2` | — | (−∞, 1] | 1 | Coefficient of determination. Identical to NSE (see gotcha below). |
+| `kge` | `kling_gupta_efficiency` | (−∞, 1] | 1 | Kling–Gupta — composite of **correlation + bias ratio + variability ratio**. |
+| `willmott` | — | [0, 1] | 1 | Willmott's Index of Agreement — bounded [0,1], less harsh on bias than NSE. |
+| `ev` | `explained_variance` | [0, 1] | 1 | Proportion of variance explained (differs from NSE when the model is biased). |
+| `mef` | `model_efficiency_factor` | [0, ∞) | 0 | `RMSE / std(obs) = √(1 − NSE)`. Same information as NSE, expressed as an **error** (lower is better). |
+
+**Use when:** you need a single dimensionless score to rank models or compare stations.
+Reach for **`kge`** when you want to *diagnose why* a model fails (it separates correlation,
+bias, and variance); **`nse`** is the hydrology standard for overall predictive power;
+`willmott` if you want a strictly bounded [0,1] score.
+
+::: {.callout-warning}
+## `r2` is the coefficient of determination, not squared correlation
+ModelSkill's `r2` equals **NSE**, *not* squared Pearson correlation. The two generally
+differ — `r2` accounts for bias, `cc²` does not. If you want squared correlation, compute
+`cc` and square it yourself.
+:::
+
+---
+
+## Correlation & amplitude — dimensionless
+
+These ignore systematic bias, so always read them **alongside** `bias`.
+
+| Metric | Alias | Range | Best | What it is |
+|---|---|---|---|---|
+| `cc` | `corrcoef` | [−1, 1] | 1 | Pearson correlation — **linear co-variation / timing**. Blind to bias and amplitude. |
+| `rho` | `spearmanr` | [−1, 1] | 1 | Spearman rank correlation — monotonic, **robust** to outliers and non-linearity. |
+| `lin_slope` | — | (−∞, ∞) | 1 | Slope of the model-vs-obs regression. < 1 = model **under-responds** in amplitude. |
+
+**Use when:** the question is about **phase/timing** (`cc`), a monotonic-but-non-linear
+relationship (`rho`), or **amplitude** of the response (`lin_slope`).
+
+---
+
+## Event & distribution
+
+| Metric | Alias | Range | Best | What it is |
+|---|---|---|---|---|
+| `peak_ratio` | `pr` | [0, ∞) | 1 | Ratio of modelled to observed **peaks** (mean over matched peak events). < 1 = peaks under-predicted. |
+| `hit_ratio` | — | [0, 1] | 1 | Fraction of points within an acceptable deviation `a` of the observation. Takes an `a=` argument. |
+
+**Use when:** `peak_ratio` for storm-surge / flood-peak capture. `hit_ratio` for
+acceptance-criterion reporting — "X % of points within ±0.1 m" (set the tolerance with `a=`).
+
+---
+
+## Directional / circular — for direction variables only
+
+For wind / wave / current **direction**, where 359° and 1° are 2° apart, not 358°. These
+require the quantity to be flagged directional
+(`Quantity(..., is_directional=True)`); they handle the 0–360° wrap-around. See the
+[directional data example](../examples/Directional_data_comparison.qmd).
+
+| Metric | Alias | Best | What it is |
+|---|---|---|---|
+| `c_bias` | — | 0 | Circular bias (mean angular error). |
+| `c_mae` | `c_mean_absolute_error` | 0 | Circular mean absolute error. |
+| `c_rmse` | `c_root_mean_squared_error` | 0 | Circular RMSE. |
+| `c_urmse` | `c_unbiased_root_mean_squared_error` | 0 | Circular unbiased RMSE. |
+| `c_max_error` | — | 0 | Largest circular (angular) error. |
+
+**Use when:** the variable is an angle. Don't use the scalar metrics above on direction data.
+
+---
+
+## Quick decision guide
+
+| Your question | Reach for |
+|---|---|
+| Is the model systematically high or low? | `bias` |
+| Does it under-/over-respond in amplitude? | `lin_slope` |
+| How big is a typical error, in my units? | `rmse` (penalise big misses) or `mae` (robust) |
+| Split random vs systematic error? | `urmse` vs `bias` (`rmse² = bias² + urmse²`) |
+| What's the worst single miss? | `max_error` |
+| One dimensionless score, comparable across stations? | `nse` or `kge` |
+| *Why* does the model fail (corr / bias / variance)? | `kge` |
+| Compare error across very different-sized stations? | `si` or `nse`/`kge` (avoid `rmse`) |
+| Timing / phase agreement? | `cc` |
+| Do we capture the storm peaks? | `peak_ratio` |
+| What fraction meets an acceptance tolerance? | `hit_ratio` (set `a=`) |
+| Working with direction (wind/wave/current)? | the `c_*` family |
+
+## Cross-cutting reminders
+
+- **Scale-dependent vs dimensionless.** The error-magnitude metrics are in data units — never
+  compare them across stations of different magnitude. The rest are dimensionless and comparable.
+- **Always pair correlation with bias.** `cc`/`rho` are blind to offset; a model can correlate
+  perfectly while sitting 1 m too high.
+- **Division-by-obs metrics** (`mape`, `si`) are fragile near zero observations.
+- **Custom metrics**: any `f(obs, model) -> float` callable drops straight into
+  `metrics=[...]`; the column takes the function name. See the
+  [custom metric example](../examples/Metrics_custom_metric.qmd).

From 91e482a048b363b77567c6cffd24c19849dd8fb3 Mon Sep 17 00:00:00 2001
From: Jesper Sandvig Mariegaard
 <34088801+jsmariegaard@users.noreply.github.com>
Date: Wed, 10 Jun 2026 13:54:49 +0200
Subject: [PATCH 2/4] Simplify default-metrics note to a plain code reference

Avoids an interlink to modelskill.options, which is not in the
quartodoc inventory and produced a build warning.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/user-guide/metrics.qmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/user-guide/metrics.qmd b/docs/user-guide/metrics.qmd
index 455e8c289..7444da38d 100644
--- a/docs/user-guide/metrics.qmd
+++ b/docs/user-guide/metrics.qmd
@@ -30,7 +30,7 @@ longer synonym that also works.
 
 When you pass no `metrics=` argument, [](`~modelskill.Comparer.skill`) reports the default
 set — the point count `n` plus `bias`, `rmse`, `urmse`, `mae`, `cc`, `si`, `r2`. You can
-change the default with [](`~modelskill.options`) (`ms.options.metrics.list = [...]`).
+change the default via `ms.options.metrics.list = [...]`.
 
 ---
 

From 0523cb5ab8b687f9d76ac940c414acd0226b54df Mon Sep 17 00:00:00 2001
From: Jesper Sandvig Mariegaard
 <34088801+jsmariegaard@users.noreply.github.com>
Date: Wed, 10 Jun 2026 14:11:06 +0200
Subject: [PATCH 3/4] Address review: fix metric ranges, link to API, improve
 table layout

Responds to @ecomodeller's review on #671.

Docstring range fixes in metrics.py:
- explained_variance: [0, 1] -> (-inf, 1] (can be negative)
- scatter_index2: [0, 100] -> [0, inf)
- c_max_error: [0, inf) -> [0, 180] (circular error is bounded)

User-guide metrics page:
- Combine metric + alias into one column; every name now links to
  its API entry (via interlinks) for easy navigation
- Slim the Range/Best columns with table-layout CSS so the
  'What it is' column gets the room it needs
- Apply the corrected ranges; add ranges to the circular table
- New 'cc, ev and r2: a nested hierarchy' section answering the
  core question from #670 (cc^2 >= ev >= r2, and why)

Also document scatter_index2 and mean_absolute_percentage_error in
the API reference so the new links resolve.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/_quarto.yml            |   2 +
 docs/user-guide/metrics.qmd | 129 ++++++++++++++++++++++++------------
 src/modelskill/metrics.py   |  12 ++--
 3 files changed, 97 insertions(+), 46 deletions(-)

diff --git a/docs/_quarto.yml b/docs/_quarto.yml
index 3d457b785..35007406d 100644
--- a/docs/_quarto.yml
+++ b/docs/_quarto.yml
@@ -287,6 +287,7 @@ quartodoc:
             - c_mae
             - c_mean_absolute_error
             - mape
+            - mean_absolute_percentage_error
             - cc
             - corrcoef
             - rho
@@ -303,6 +304,7 @@ quartodoc:
             - model_efficiency_factor
             - si
             - scatter_index
+            - scatter_index2
             - pr
             - peak_ratio
             - willmott
diff --git a/docs/user-guide/metrics.qmd b/docs/user-guide/metrics.qmd
index 7444da38d..8fd2a0586 100644
--- a/docs/user-guide/metrics.qmd
+++ b/docs/user-guide/metrics.qmd
@@ -2,7 +2,8 @@
 
 ModelSkill comes with a comprehensive set of skill metrics. This page is a *when-to-use*
 guide: how the metrics differ, which one answers which question, and the common pitfalls.
-For the full mathematical definitions, see the [metrics API reference](../api/metrics.qmd).
+Each metric name links to its full mathematical definition in the
+[metrics API reference](../api/metrics.qmd).
 
 Metrics are passed to [](`~modelskill.Comparer.skill`) (and
 [](`~modelskill.ComparerCollection.mean_skill`), [](`~modelskill.Comparer.score`)) as
@@ -25,13 +26,24 @@ cc = ms.match([o1, o2], mr)
 cc.skill(metrics=["bias", "rmse", "kge"])
 ```
 
-In the tables below, **name** is the exact string ModelSkill accepts and **alias** is a
-longer synonym that also works.
+In the tables below the first column gives the metric string ModelSkill accepts; the
+**alias** (a longer synonym that also works) is listed underneath it. Both link to the API.
 
 When you pass no `metrics=` argument, [](`~modelskill.Comparer.skill`) reports the default
 set — the point count `n` plus `bias`, `rmse`, `urmse`, `mae`, `cc`, `si`, `r2`. You can
 change the default via `ms.options.metrics.list = [...]`.
 
+```{=html}
+<style>
+.metrics-ref table { table-layout: fixed; width: 100%; }
+.metrics-ref table th:nth-child(1), .metrics-ref table td:nth-child(1) { width: 22%; }
+.metrics-ref table th:nth-child(2), .metrics-ref table td:nth-child(2) { width: 12%; }
+.metrics-ref table th:nth-child(3), .metrics-ref table td:nth-child(3) { width: 6%;  }
+.metrics-ref table td { vertical-align: top; }
+.metrics-ref table code { overflow-wrap: anywhere; }
+</style>
+```
+
 ---
 
 ## Error magnitude — same units as the data (scale-dependent)
@@ -39,13 +51,15 @@ change the default via `ms.options.metrics.list = [...]`.
 These are in the data's units (m, m³/s, …), so they are **not** comparable across stations
 of different magnitude — a 5,000 m³/s river will always out-RMSE a 50 m³/s tributary.
 
-| Metric | Alias | Range | Best | What it is |
-|---|---|---|---|---|
-| `bias` | — | (−∞, ∞) | 0 | Mean error, `mean(model − obs)`. **Sign matters**: + = model runs high, − = runs low. |
-| `mae` | `mean_absolute_error` | [0, ∞) | 0 | Mean absolute error. Typical error magnitude, **robust to outliers**. |
-| `rmse` | `root_mean_squared_error` | [0, ∞) | 0 | Root-mean-square error. Like MAE but **penalises large misses** disproportionately. |
-| `urmse` | — | [0, ∞) | 0 | Unbiased RMSE — the **random/scatter** part after removing bias. `rmse² = bias² + urmse²`. |
-| `max_error` | — | [0, ∞) | 0 | Largest single absolute error — the **worst-case** miss. |
+::: {.metrics-ref}
+| Metric | Range | Best | What it is |
+|---|---|---|---|
+| [](`~modelskill.metrics.bias`) | (−∞, ∞) | 0 | Mean error, `mean(model − obs)`. **Sign matters**: + = model runs high, − = runs low. |
+| [](`~modelskill.metrics.mae`)<br>[](`~modelskill.metrics.mean_absolute_error`) | [0, ∞) | 0 | Mean absolute error. Typical error magnitude, **robust to outliers**. |
+| [](`~modelskill.metrics.rmse`)<br>[](`~modelskill.metrics.root_mean_squared_error`) | [0, ∞) | 0 | Root-mean-square error. Like MAE but **penalises large misses** disproportionately. |
+| [](`~modelskill.metrics.urmse`) | [0, ∞) | 0 | Unbiased RMSE — the **random/scatter** part after removing bias. `rmse² = bias² + urmse²`. |
+| [](`~modelskill.metrics.max_error`) | [0, ∞) | 0 | Largest single absolute error — the **worst-case** miss. |
+:::
 
 **Use when:** you want the error in physical units. `bias` for systematic offset, `rmse` as
 the default, `mae` when a few outliers shouldn't dominate, `urmse` to isolate random error
@@ -59,11 +73,13 @@ Divide the error by a scale, so they *can* be compared across stations — but t
 observed values, so they are **unstable when observations pass through zero** (watch out for
 water level around a datum).
 
-| Metric | Alias | Range | Best | What it is |
-|---|---|---|---|---|
-| `si` | `scatter_index` | [0, ∞) | 0 | Scatter index = `urmse / mean(obs)`. Random error as a fraction of the mean. |
-| `mape` | `mean_absolute_percentage_error` | [0, ∞) | 0 | Mean absolute % error, `mean(\|model−obs\| / \|obs\|)`. Intuitive %, but blows up near obs = 0. |
-| `scatter_index2` | — | [0, 100] | 0 | Alternative SI formulation (centred, normalised by `Σobs²`). Less common. |
+::: {.metrics-ref}
+| Metric | Range | Best | What it is |
+|---|---|---|---|
+| [](`~modelskill.metrics.si`)<br>[](`~modelskill.metrics.scatter_index`) | [0, ∞) | 0 | Scatter index = `urmse / mean(obs)`. Random error as a fraction of the mean. |
+| [](`~modelskill.metrics.mape`)<br>[](`~modelskill.metrics.mean_absolute_percentage_error`) | [0, ∞) | 0 | Mean absolute % error, `mean(\|model−obs\| / \|obs\|)`. Intuitive %, but blows up near obs = 0. |
+| [](`~modelskill.metrics.scatter_index2`) | [0, ∞) | 0 | Alternative SI formulation (centred, normalised by `Σobs²`). Less common. |
+:::
 
 **Use when:** comparing error across stations of very different magnitude, **and** the
 observed values stay comfortably away from zero. For water level, prefer the efficiency
@@ -76,25 +92,50 @@ scores below over `mape`/`si`.
 The "one number for overall skill" family. All reference the observed mean or variance, so a
 score of 0 (for NSE/KGE) means "no better than predicting the mean."
 
-| Metric | Alias | Range | Best | What it is |
-|---|---|---|---|---|
-| `nse` | `nash_sutcliffe_efficiency` | (−∞, 1] | 1 | Nash–Sutcliffe efficiency. 0 = no better than the obs mean; < 0 = worse. |
-| `r2` | — | (−∞, 1] | 1 | Coefficient of determination. Identical to NSE (see gotcha below). |
-| `kge` | `kling_gupta_efficiency` | (−∞, 1] | 1 | Kling–Gupta — composite of **correlation + bias ratio + variability ratio**. |
-| `willmott` | — | [0, 1] | 1 | Willmott's Index of Agreement — bounded [0,1], less harsh on bias than NSE. |
-| `ev` | `explained_variance` | [0, 1] | 1 | Proportion of variance explained (differs from NSE when the model is biased). |
-| `mef` | `model_efficiency_factor` | [0, ∞) | 0 | `RMSE / std(obs) = √(1 − NSE)`. Same information as NSE, expressed as an **error** (lower is better). |
+::: {.metrics-ref}
+| Metric | Range | Best | What it is |
+|---|---|---|---|
+| [](`~modelskill.metrics.nse`)<br>[](`~modelskill.metrics.nash_sutcliffe_efficiency`) | (−∞, 1] | 1 | Nash–Sutcliffe efficiency. 0 = no better than the obs mean; < 0 = worse. |
+| [](`~modelskill.metrics.r2`) | (−∞, 1] | 1 | Coefficient of determination. Identical to NSE (see below). |
+| [](`~modelskill.metrics.kge`)<br>[](`~modelskill.metrics.kling_gupta_efficiency`) | (−∞, 1] | 1 | Kling–Gupta — composite of **correlation + bias ratio + variability ratio**. |
+| [](`~modelskill.metrics.willmott`) | [0, 1] | 1 | Willmott's Index of Agreement — bounded [0,1], less harsh on bias than NSE. |
+| [](`~modelskill.metrics.ev`)<br>[](`~modelskill.metrics.explained_variance`) | (−∞, 1] | 1 | Proportion of variance explained (differs from NSE when the model is biased). |
+| [](`~modelskill.metrics.mef`)<br>[](`~modelskill.metrics.model_efficiency_factor`) | [0, ∞) | 0 | `RMSE / std(obs) = √(1 − NSE)`. Same information as NSE, expressed as an **error** (lower is better). |
+:::
 
 **Use when:** you need a single dimensionless score to rank models or compare stations.
 Reach for **`kge`** when you want to *diagnose why* a model fails (it separates correlation,
 bias, and variance); **`nse`** is the hydrology standard for overall predictive power;
 `willmott` if you want a strictly bounded [0,1] score.
 
+### `cc`, `ev` and `r2`: a nested hierarchy
+
+A common question is how `cc`, `ev` and `r2` relate. They answer increasingly strict
+versions of *"how much of the observed variation does the model capture?"*, each penalising
+one more kind of error:
+
+| Score | Penalises bias (offset)? | Penalises wrong amplitude? | Scored against |
+|---|---|---|---|
+| `cc²` (square of `cc`) | no | no | the best-fit line (free slope + intercept) |
+| `ev` | no | yes | obs variance, ignoring a constant offset |
+| `r2` (= `nse`) | yes | yes | the 1:1 line |
+
+For the same data this gives the ordering **`cc²` ≥ `ev` ≥ `r2`**: `cc²` forgives both a
+constant offset and a wrong amplitude, `ev` forgives only the offset, and `r2`/`nse` forgives
+nothing (it scores against the 1:1 line). The gap `ev − r2` is exactly the squared,
+normalised bias.
+
+In practice, report `cc` (timing/phase) **plus** one efficiency score (`nse` or `kge`)
+**plus** `bias` separately — that trio localises a failure to phase, offset, or amplitude,
+which no single number can. `cc²` rarely adds anything once you already report `cc`, and `ev`
+(scikit-learn's `explained_variance_score`) sits between the two and is seldom reported on its
+own in water modelling.
+
 ::: {.callout-warning}
 ## `r2` is the coefficient of determination, not squared correlation
-ModelSkill's `r2` equals **NSE**, *not* squared Pearson correlation. The two generally
-differ — `r2` accounts for bias, `cc²` does not. If you want squared correlation, compute
-`cc` and square it yourself.
+ModelSkill's `r2` equals **NSE** — they are the same number under two names — *not* squared
+Pearson correlation. The two generally differ (`r2` penalises bias, `cc²` does not). If you
+want squared correlation, compute `cc` and square it yourself.
 :::
 
 ---
@@ -103,11 +144,13 @@ differ — `r2` accounts for bias, `cc²` does not. If you want squared correlat
 
 These ignore systematic bias, so always read them **alongside** `bias`.
 
-| Metric | Alias | Range | Best | What it is |
-|---|---|---|---|---|
-| `cc` | `corrcoef` | [−1, 1] | 1 | Pearson correlation — **linear co-variation / timing**. Blind to bias and amplitude. |
-| `rho` | `spearmanr` | [−1, 1] | 1 | Spearman rank correlation — monotonic, **robust** to outliers and non-linearity. |
-| `lin_slope` | — | (−∞, ∞) | 1 | Slope of the model-vs-obs regression. < 1 = model **under-responds** in amplitude. |
+::: {.metrics-ref}
+| Metric | Range | Best | What it is |
+|---|---|---|---|
+| [](`~modelskill.metrics.cc`)<br>[](`~modelskill.metrics.corrcoef`) | [−1, 1] | 1 | Pearson correlation — **linear co-variation / timing**. Blind to bias and amplitude. |
+| [](`~modelskill.metrics.rho`)<br>[](`~modelskill.metrics.spearmanr`) | [−1, 1] | 1 | Spearman rank correlation — monotonic, **robust** to outliers and non-linearity. |
+| [](`~modelskill.metrics.lin_slope`) | (−∞, ∞) | 1 | Slope of the model-vs-obs regression. < 1 = model **under-responds** in amplitude. |
+:::
 
 **Use when:** the question is about **phase/timing** (`cc`), a monotonic-but-non-linear
 relationship (`rho`), or **amplitude** of the response (`lin_slope`).
@@ -116,10 +159,12 @@ relationship (`rho`), or **amplitude** of the response (`lin_slope`).
 
 ## Event & distribution
 
-| Metric | Alias | Range | Best | What it is |
-|---|---|---|---|---|
-| `peak_ratio` | `pr` | [0, ∞) | 1 | Ratio of modelled to observed **peaks** (mean over matched peak events). < 1 = peaks under-predicted. |
-| `hit_ratio` | — | [0, 1] | 1 | Fraction of points within an acceptable deviation `a` of the observation. Takes an `a=` argument. |
+::: {.metrics-ref}
+| Metric | Range | Best | What it is |
+|---|---|---|---|
+| [](`~modelskill.metrics.peak_ratio`)<br>[](`~modelskill.metrics.pr`) | [0, ∞) | 1 | Ratio of modelled to observed **peaks** (mean over matched peak events). < 1 = peaks under-predicted. |
+| [](`~modelskill.metrics.hit_ratio`) | [0, 1] | 1 | Fraction of points within an acceptable deviation `a` of the observation. Takes an `a=` argument. |
+:::
 
 **Use when:** `peak_ratio` for storm-surge / flood-peak capture. `hit_ratio` for
 acceptance-criterion reporting — "X % of points within ±0.1 m" (set the tolerance with `a=`).
@@ -133,13 +178,15 @@ require the quantity to be flagged directional
 (`Quantity(..., is_directional=True)`); they handle the 0–360° wrap-around. See the
 [directional data example](../examples/Directional_data_comparison.qmd).
 
-| Metric | Alias | Best | What it is |
+::: {.metrics-ref}
+| Metric | Range | Best | What it is |
 |---|---|---|---|
-| `c_bias` | — | 0 | Circular bias (mean angular error). |
-| `c_mae` | `c_mean_absolute_error` | 0 | Circular mean absolute error. |
-| `c_rmse` | `c_root_mean_squared_error` | 0 | Circular RMSE. |
-| `c_urmse` | `c_unbiased_root_mean_squared_error` | 0 | Circular unbiased RMSE. |
-| `c_max_error` | — | 0 | Largest circular (angular) error. |
+| [](`~modelskill.metrics.c_bias`) | [−180, 180] | 0 | Circular bias (mean angular error). |
+| [](`~modelskill.metrics.c_mae`)<br>[](`~modelskill.metrics.c_mean_absolute_error`) | [0, 180] | 0 | Circular mean absolute error. |
+| [](`~modelskill.metrics.c_rmse`)<br>[](`~modelskill.metrics.c_root_mean_squared_error`) | [0, 180] | 0 | Circular RMSE. |
+| [](`~modelskill.metrics.c_urmse`)<br>[](`~modelskill.metrics.c_unbiased_root_mean_squared_error`) | [0, 180] | 0 | Circular unbiased RMSE. |
+| [](`~modelskill.metrics.c_max_error`) | [0, 180] | 0 | Largest circular (angular) error. |
+:::
 
 **Use when:** the variable is an angle. Don't use the scalar metrics above on direction data.
 
diff --git a/src/modelskill/metrics.py b/src/modelskill/metrics.py
index 5a6f008a0..3c1addadb 100644
--- a/src/modelskill/metrics.py
+++ b/src/modelskill/metrics.py
@@ -483,7 +483,7 @@ def scatter_index2(obs: ArrayLike, model: ArrayLike) -> Any:
     {\sum_{i=1}^n obs_i^2}}
     $$
 
-    Range: [0, 100]; Best: 0
+    Range: $[0, \infty)$; Best: 0
     """
     assert obs.size == model.size
     if len(obs) == 0:
@@ -506,8 +506,10 @@ def explained_variance(obs: ArrayLike, model: ArrayLike) -> Any:
     r"""EV: Explained variance
 
      EV is the explained variance and measures the proportion
-     [0 - 1] to which the model accounts for the variation
-     (dispersion) of the observations.
+     to which the model accounts for the variation (dispersion)
+     of the observations. A perfect match gives 1; a model that
+     explains less variance than the observed mean gives a
+     negative value.
 
      In cases with no bias, EV is equal to r2
 
@@ -518,7 +520,7 @@ def explained_variance(obs: ArrayLike, model: ArrayLike) -> Any:
     (obs_i - \overline{obs})^2}
     $$
 
-    Range: [0, 1]; Best: 1
+    Range: $(-\infty, 1]$; Best: 1
 
     See Also
     --------
@@ -958,7 +960,7 @@ def c_max_error(obs: ArrayLike, model: ArrayLike) -> Any:
 
     Notes
     -----
-    Range: $[0, \\infty)$; Best: 0
+    Range: $[0, 180]$; Best: 0
 
     Returns
     -------

From 9eed9548cde6df1b75c3f0005892b9635f6afce7 Mon Sep 17 00:00:00 2001
From: Jesper Sandvig Mariegaard
 <34088801+jsmariegaard@users.noreply.github.com>
Date: Wed, 10 Jun 2026 19:27:24 +0200
Subject: [PATCH 4/4] Drop scatter_index2 from the metrics guide

It exists only for backward compatibility and isn't recommended, so
it doesn't belong in a when-to-use guide. Also revert documenting it
in the API reference (it was only added there to resolve the page
link). The docstring range fix in metrics.py is kept for accuracy.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 docs/_quarto.yml            | 1 -
 docs/user-guide/metrics.qmd | 1 -
 2 files changed, 2 deletions(-)

diff --git a/docs/_quarto.yml b/docs/_quarto.yml
index 35007406d..7569b6bd0 100644
--- a/docs/_quarto.yml
+++ b/docs/_quarto.yml
@@ -304,7 +304,6 @@ quartodoc:
             - model_efficiency_factor
             - si
             - scatter_index
-            - scatter_index2
             - pr
             - peak_ratio
             - willmott
diff --git a/docs/user-guide/metrics.qmd b/docs/user-guide/metrics.qmd
index 8fd2a0586..6460a495d 100644
--- a/docs/user-guide/metrics.qmd
+++ b/docs/user-guide/metrics.qmd
@@ -78,7 +78,6 @@ water level around a datum).
 |---|---|---|---|
 | [](`~modelskill.metrics.si`)<br>[](`~modelskill.metrics.scatter_index`) | [0, ∞) | 0 | Scatter index = `urmse / mean(obs)`. Random error as a fraction of the mean. |
 | [](`~modelskill.metrics.mape`)<br>[](`~modelskill.metrics.mean_absolute_percentage_error`) | [0, ∞) | 0 | Mean absolute % error, `mean(\|model−obs\| / \|obs\|)`. Intuitive %, but blows up near obs = 0. |
-| [](`~modelskill.metrics.scatter_index2`) | [0, ∞) | 0 | Alternative SI formulation (centred, normalised by `Σobs²`). Less common. |
 :::
 
 **Use when:** comparing error across stations of very different magnitude, **and** the