Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions docs/_quarto.yml
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,7 @@ website:
- user-guide/selecting-data.qmd
- user-guide/plotting.qmd
- user-guide/statistics.qmd
- user-guide/metrics.qmd
- section: "Extensions"
contents:
- user-guide/network.qmd
Expand Down Expand Up @@ -286,6 +287,7 @@ quartodoc:
- c_mae
- c_mean_absolute_error
- mape
- mean_absolute_percentage_error
- cc
- corrcoef
- rho
Expand Down
2 changes: 1 addition & 1 deletion docs/user-guide/index.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ format-links: false
ModelSkill compares model results with observations. The workflow can be split in two phases:

1. [Matching](matching.qmd) - making sure that observations and model results are in the same space and time
2. Analysis - [plots](plotting.qmd) and [statistics](statistics.qmd) of the matched data
2. Analysis - [plots](plotting.qmd) and [statistics](statistics.qmd) of the matched data (see the [metrics guide](metrics.qmd) for choosing skill metrics)

If the observations and model results are already matched (i.e. are stored in the same data source),
the `from_matched()` function can be used to go directly to the analysis phase.
Expand Down
220 changes: 220 additions & 0 deletions docs/user-guide/metrics.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,220 @@
# Metrics

ModelSkill comes with a comprehensive set of skill metrics. This page is a *when-to-use*
guide: how the metrics differ, which one answers which question, and the common pitfalls.
Each metric name links to its full mathematical definition in the
[metrics API reference](../api/metrics.qmd).

Metrics are passed to [](`~modelskill.Comparer.skill`) (and
[](`~modelskill.ComparerCollection.mean_skill`), [](`~modelskill.Comparer.score`)) as
lower-case strings, or as callables:

```{python}
#| code-fold: true
#| code-summary: "Construct a comparer from observation and model data"
import modelskill as ms
o1 = ms.observation("../data/SW/HKNA_Hm0.dfs0", item=0,
x=4.2420, y=52.6887, name="HKNA")
o2 = ms.observation("../data/SW/eur_Hm0.dfs0", item=0,
x=3.2760, y=51.9990, name="EPL")
mr = ms.model_result("../data/SW/HKZN_local_2017_DutchCoast.dfsu",
item="Sign. Wave Height", name="m1")
cc = ms.match([o1, o2], mr)
```

```{python}
cc.skill(metrics=["bias", "rmse", "kge"])
```

In the tables below the first column gives the metric string ModelSkill accepts; the
**alias** (a longer synonym that also works) is listed underneath it. Both link to the API.

When you pass no `metrics=` argument, [](`~modelskill.Comparer.skill`) reports the default
set — the point count `n` plus `bias`, `rmse`, `urmse`, `mae`, `cc`, `si`, `r2`. You can
change the default via `ms.options.metrics.list = [...]`.

```{=html}
<style>
.metrics-ref table { table-layout: fixed; width: 100%; }
.metrics-ref table th:nth-child(1), .metrics-ref table td:nth-child(1) { width: 22%; }
.metrics-ref table th:nth-child(2), .metrics-ref table td:nth-child(2) { width: 12%; }
.metrics-ref table th:nth-child(3), .metrics-ref table td:nth-child(3) { width: 6%; }
.metrics-ref table td { vertical-align: top; }
.metrics-ref table code { overflow-wrap: anywhere; }
</style>
```

---

## Error magnitude — same units as the data (scale-dependent)

These are in the data's units (m, m³/s, …), so they are **not** comparable across stations
of different magnitude — a 5,000 m³/s river will always out-RMSE a 50 m³/s tributary.

::: {.metrics-ref}
| Metric | Range | Best | What it is |
|---|---|---|---|
| [](`~modelskill.metrics.bias`) | (−∞, ∞) | 0 | Mean error, `mean(model − obs)`. **Sign matters**: + = model runs high, − = runs low. |
| [](`~modelskill.metrics.mae`)<br>[](`~modelskill.metrics.mean_absolute_error`) | [0, ∞) | 0 | Mean absolute error. Typical error magnitude, **robust to outliers**. |
| [](`~modelskill.metrics.rmse`)<br>[](`~modelskill.metrics.root_mean_squared_error`) | [0, ∞) | 0 | Root-mean-square error. Like MAE but **penalises large misses** disproportionately. |
| [](`~modelskill.metrics.urmse`) | [0, ∞) | 0 | Unbiased RMSE — the **random/scatter** part after removing bias. `rmse² = bias² + urmse²`. |
| [](`~modelskill.metrics.max_error`) | [0, ∞) | 0 | Largest single absolute error — the **worst-case** miss. |
:::

**Use when:** you want the error in physical units. `bias` for systematic offset, `rmse` as
the default, `mae` when a few outliers shouldn't dominate, `urmse` to isolate random error
from bias, `max_error` for safety-/threshold-critical checks.

---

## Relative / normalised error — dimensionless

Divide the error by a scale, so they *can* be compared across stations — but they divide by
observed values, so they are **unstable when observations pass through zero** (watch out for
water level around a datum).

::: {.metrics-ref}
| Metric | Range | Best | What it is |
|---|---|---|---|
| [](`~modelskill.metrics.si`)<br>[](`~modelskill.metrics.scatter_index`) | [0, ∞) | 0 | Scatter index = `urmse / mean(obs)`. Random error as a fraction of the mean. |
| [](`~modelskill.metrics.mape`)<br>[](`~modelskill.metrics.mean_absolute_percentage_error`) | [0, ∞) | 0 | Mean absolute % error, `mean(\|model−obs\| / \|obs\|)`. Intuitive %, but blows up near obs = 0. |
:::

**Use when:** comparing error across stations of very different magnitude, **and** the
observed values stay comfortably away from zero. For water level, prefer the efficiency
scores below over `mape`/`si`.

---

## Efficiency / skill scores — dimensionless, comparable across stations

The "one number for overall skill" family. All reference the observed mean or variance, so a
score of 0 (for NSE/KGE) means "no better than predicting the mean."

::: {.metrics-ref}
| Metric | Range | Best | What it is |
|---|---|---|---|
| [](`~modelskill.metrics.nse`)<br>[](`~modelskill.metrics.nash_sutcliffe_efficiency`) | (−∞, 1] | 1 | Nash–Sutcliffe efficiency. 0 = no better than the obs mean; < 0 = worse. |
| [](`~modelskill.metrics.r2`) | (−∞, 1] | 1 | Coefficient of determination. Identical to NSE (see below). |
| [](`~modelskill.metrics.kge`)<br>[](`~modelskill.metrics.kling_gupta_efficiency`) | (−∞, 1] | 1 | Kling–Gupta — composite of **correlation + bias ratio + variability ratio**. |
| [](`~modelskill.metrics.willmott`) | [0, 1] | 1 | Willmott's Index of Agreement — bounded [0,1], less harsh on bias than NSE. |
| [](`~modelskill.metrics.ev`)<br>[](`~modelskill.metrics.explained_variance`) | (−∞, 1] | 1 | Proportion of variance explained (differs from NSE when the model is biased). |
| [](`~modelskill.metrics.mef`)<br>[](`~modelskill.metrics.model_efficiency_factor`) | [0, ∞) | 0 | `RMSE / std(obs) = √(1 − NSE)`. Same information as NSE, expressed as an **error** (lower is better). |
:::

**Use when:** you need a single dimensionless score to rank models or compare stations.
Reach for **`kge`** when you want to *diagnose why* a model fails (it separates correlation,
bias, and variance); **`nse`** is the hydrology standard for overall predictive power;
`willmott` if you want a strictly bounded [0,1] score.

### `cc`, `ev` and `r2`: a nested hierarchy

A common question is how `cc`, `ev` and `r2` relate. They answer increasingly strict
versions of *"how much of the observed variation does the model capture?"*, each penalising
one more kind of error:

| Score | Penalises bias (offset)? | Penalises wrong amplitude? | Scored against |
|---|---|---|---|
| `cc²` (square of `cc`) | no | no | the best-fit line (free slope + intercept) |
| `ev` | no | yes | obs variance, ignoring a constant offset |
| `r2` (= `nse`) | yes | yes | the 1:1 line |

For the same data this gives the ordering **`cc²` ≥ `ev` ≥ `r2`**: `cc²` forgives both a
constant offset and a wrong amplitude, `ev` forgives only the offset, and `r2`/`nse` forgives
nothing (it scores against the 1:1 line). The gap `ev − r2` is exactly the squared,
normalised bias.

In practice, report `cc` (timing/phase) **plus** one efficiency score (`nse` or `kge`)
**plus** `bias` separately — that trio localises a failure to phase, offset, or amplitude,
which no single number can. `cc²` rarely adds anything once you already report `cc`, and `ev`
(scikit-learn's `explained_variance_score`) sits between the two and is seldom reported on its
own in water modelling.

::: {.callout-warning}
## `r2` is the coefficient of determination, not squared correlation
ModelSkill's `r2` equals **NSE** — they are the same number under two names — *not* squared
Pearson correlation. The two generally differ (`r2` penalises bias, `cc²` does not). If you
want squared correlation, compute `cc` and square it yourself.
:::

---

## Correlation & amplitude — dimensionless

These ignore systematic bias, so always read them **alongside** `bias`.

::: {.metrics-ref}
| Metric | Range | Best | What it is |
|---|---|---|---|
| [](`~modelskill.metrics.cc`)<br>[](`~modelskill.metrics.corrcoef`) | [−1, 1] | 1 | Pearson correlation — **linear co-variation / timing**. Blind to bias and amplitude. |
| [](`~modelskill.metrics.rho`)<br>[](`~modelskill.metrics.spearmanr`) | [−1, 1] | 1 | Spearman rank correlation — monotonic, **robust** to outliers and non-linearity. |
| [](`~modelskill.metrics.lin_slope`) | (−∞, ∞) | 1 | Slope of the model-vs-obs regression. < 1 = model **under-responds** in amplitude. |
:::

**Use when:** the question is about **phase/timing** (`cc`), a monotonic-but-non-linear
relationship (`rho`), or **amplitude** of the response (`lin_slope`).

---

## Event & distribution

::: {.metrics-ref}
| Metric | Range | Best | What it is |
|---|---|---|---|
| [](`~modelskill.metrics.peak_ratio`)<br>[](`~modelskill.metrics.pr`) | [0, ∞) | 1 | Ratio of modelled to observed **peaks** (mean over matched peak events). < 1 = peaks under-predicted. |
| [](`~modelskill.metrics.hit_ratio`) | [0, 1] | 1 | Fraction of points within an acceptable deviation `a` of the observation. Takes an `a=` argument. |
:::

**Use when:** `peak_ratio` for storm-surge / flood-peak capture. `hit_ratio` for
acceptance-criterion reporting — "X % of points within ±0.1 m" (set the tolerance with `a=`).

---

## Directional / circular — for direction variables only

For wind / wave / current **direction**, where 359° and 1° are 2° apart, not 358°. These
require the quantity to be flagged directional
(`Quantity(..., is_directional=True)`); they handle the 0–360° wrap-around. See the
[directional data example](../examples/Directional_data_comparison.qmd).

::: {.metrics-ref}
| Metric | Range | Best | What it is |
|---|---|---|---|
| [](`~modelskill.metrics.c_bias`) | [−180, 180] | 0 | Circular bias (mean angular error). |
| [](`~modelskill.metrics.c_mae`)<br>[](`~modelskill.metrics.c_mean_absolute_error`) | [0, 180] | 0 | Circular mean absolute error. |
| [](`~modelskill.metrics.c_rmse`)<br>[](`~modelskill.metrics.c_root_mean_squared_error`) | [0, 180] | 0 | Circular RMSE. |
| [](`~modelskill.metrics.c_urmse`)<br>[](`~modelskill.metrics.c_unbiased_root_mean_squared_error`) | [0, 180] | 0 | Circular unbiased RMSE. |
| [](`~modelskill.metrics.c_max_error`) | [0, 180] | 0 | Largest circular (angular) error. |
:::

**Use when:** the variable is an angle. Don't use the scalar metrics above on direction data.

---

## Quick decision guide

| Your question | Reach for |
|---|---|
| Is the model systematically high or low? | `bias` |
| Does it under-/over-respond in amplitude? | `lin_slope` |
| How big is a typical error, in my units? | `rmse` (penalise big misses) or `mae` (robust) |
| Split random vs systematic error? | `urmse` vs `bias` (`rmse² = bias² + urmse²`) |
| What's the worst single miss? | `max_error` |
| One dimensionless score, comparable across stations? | `nse` or `kge` |
| *Why* does the model fail (corr / bias / variance)? | `kge` |
| Compare error across very different-sized stations? | `si` or `nse`/`kge` (avoid `rmse`) |
| Timing / phase agreement? | `cc` |
| Do we capture the storm peaks? | `peak_ratio` |
| What fraction meets an acceptance tolerance? | `hit_ratio` (set `a=`) |
| Working with direction (wind/wave/current)? | the `c_*` family |

## Cross-cutting reminders

- **Scale-dependent vs dimensionless.** The error-magnitude metrics are in data units — never
compare them across stations of different magnitude. The rest are dimensionless and comparable.
- **Always pair correlation with bias.** `cc`/`rho` are blind to offset; a model can correlate
perfectly while sitting 1 m too high.
- **Division-by-obs metrics** (`mape`, `si`) are fragile near zero observations.
- **Custom metrics**: any `f(obs, model) -> float` callable drops straight into
`metrics=[...]`; the column takes the function name. See the
[custom metric example](../examples/Metrics_custom_metric.qmd).
12 changes: 7 additions & 5 deletions src/modelskill/metrics.py
Original file line number Diff line number Diff line change
Expand Up @@ -483,7 +483,7 @@ def scatter_index2(obs: ArrayLike, model: ArrayLike) -> Any:
{\sum_{i=1}^n obs_i^2}}
$$

Range: [0, 100]; Best: 0
Range: $[0, \infty)$; Best: 0
"""
assert obs.size == model.size
if len(obs) == 0:
Expand All @@ -506,8 +506,10 @@ def explained_variance(obs: ArrayLike, model: ArrayLike) -> Any:
r"""EV: Explained variance

EV is the explained variance and measures the proportion
[0 - 1] to which the model accounts for the variation
(dispersion) of the observations.
to which the model accounts for the variation (dispersion)
of the observations. A perfect match gives 1; a model that
explains less variance than the observed mean gives a
negative value.

In cases with no bias, EV is equal to r2

Expand All @@ -518,7 +520,7 @@ def explained_variance(obs: ArrayLike, model: ArrayLike) -> Any:
(obs_i - \overline{obs})^2}
$$

Range: [0, 1]; Best: 1
Range: $(-\infty, 1]$; Best: 1

See Also
--------
Expand Down Expand Up @@ -958,7 +960,7 @@ def c_max_error(obs: ArrayLike, model: ArrayLike) -> Any:

Notes
-----
Range: $[0, \\infty)$; Best: 0
Range: $[0, 180]$; Best: 0

Returns
-------
Expand Down
Loading