diff --git a/docs/_quarto.yml b/docs/_quarto.yml index 4c3e92d85..7569b6bd0 100644 --- a/docs/_quarto.yml +++ b/docs/_quarto.yml @@ -48,6 +48,7 @@ website: - user-guide/selecting-data.qmd - user-guide/plotting.qmd - user-guide/statistics.qmd + - user-guide/metrics.qmd - section: "Extensions" contents: - user-guide/network.qmd @@ -286,6 +287,7 @@ quartodoc: - c_mae - c_mean_absolute_error - mape + - mean_absolute_percentage_error - cc - corrcoef - rho diff --git a/docs/user-guide/index.qmd b/docs/user-guide/index.qmd index d8702d1e2..e5bf54fd1 100644 --- a/docs/user-guide/index.qmd +++ b/docs/user-guide/index.qmd @@ -8,7 +8,7 @@ format-links: false ModelSkill compares model results with observations. The workflow can be split in two phases: 1. [Matching](matching.qmd) - making sure that observations and model results are in the same space and time -2. Analysis - [plots](plotting.qmd) and [statistics](statistics.qmd) of the matched data +2. Analysis - [plots](plotting.qmd) and [statistics](statistics.qmd) of the matched data (see the [metrics guide](metrics.qmd) for choosing skill metrics) If the observations and model results are already matched (i.e. are stored in the same data source), the `from_matched()` function can be used to go directly to the analysis phase. diff --git a/docs/user-guide/metrics.qmd b/docs/user-guide/metrics.qmd new file mode 100644 index 000000000..6460a495d --- /dev/null +++ b/docs/user-guide/metrics.qmd @@ -0,0 +1,220 @@ +# Metrics + +ModelSkill comes with a comprehensive set of skill metrics. This page is a *when-to-use* +guide: how the metrics differ, which one answers which question, and the common pitfalls. +Each metric name links to its full mathematical definition in the +[metrics API reference](../api/metrics.qmd). + +Metrics are passed to [](`~modelskill.Comparer.skill`) (and +[](`~modelskill.ComparerCollection.mean_skill`), [](`~modelskill.Comparer.score`)) as +lower-case strings, or as callables: + +```{python} +#| code-fold: true +#| code-summary: "Construct a comparer from observation and model data" +import modelskill as ms +o1 = ms.observation("../data/SW/HKNA_Hm0.dfs0", item=0, + x=4.2420, y=52.6887, name="HKNA") +o2 = ms.observation("../data/SW/eur_Hm0.dfs0", item=0, + x=3.2760, y=51.9990, name="EPL") +mr = ms.model_result("../data/SW/HKZN_local_2017_DutchCoast.dfsu", + item="Sign. Wave Height", name="m1") +cc = ms.match([o1, o2], mr) +``` + +```{python} +cc.skill(metrics=["bias", "rmse", "kge"]) +``` + +In the tables below the first column gives the metric string ModelSkill accepts; the +**alias** (a longer synonym that also works) is listed underneath it. Both link to the API. + +When you pass no `metrics=` argument, [](`~modelskill.Comparer.skill`) reports the default +set — the point count `n` plus `bias`, `rmse`, `urmse`, `mae`, `cc`, `si`, `r2`. You can +change the default via `ms.options.metrics.list = [...]`. + +```{=html} + +``` + +--- + +## Error magnitude — same units as the data (scale-dependent) + +These are in the data's units (m, m³/s, …), so they are **not** comparable across stations +of different magnitude — a 5,000 m³/s river will always out-RMSE a 50 m³/s tributary. + +::: {.metrics-ref} +| Metric | Range | Best | What it is | +|---|---|---|---| +| [](`~modelskill.metrics.bias`) | (−∞, ∞) | 0 | Mean error, `mean(model − obs)`. **Sign matters**: + = model runs high, − = runs low. | +| [](`~modelskill.metrics.mae`)
[](`~modelskill.metrics.mean_absolute_error`) | [0, ∞) | 0 | Mean absolute error. Typical error magnitude, **robust to outliers**. | +| [](`~modelskill.metrics.rmse`)
[](`~modelskill.metrics.root_mean_squared_error`) | [0, ∞) | 0 | Root-mean-square error. Like MAE but **penalises large misses** disproportionately. | +| [](`~modelskill.metrics.urmse`) | [0, ∞) | 0 | Unbiased RMSE — the **random/scatter** part after removing bias. `rmse² = bias² + urmse²`. | +| [](`~modelskill.metrics.max_error`) | [0, ∞) | 0 | Largest single absolute error — the **worst-case** miss. | +::: + +**Use when:** you want the error in physical units. `bias` for systematic offset, `rmse` as +the default, `mae` when a few outliers shouldn't dominate, `urmse` to isolate random error +from bias, `max_error` for safety-/threshold-critical checks. + +--- + +## Relative / normalised error — dimensionless + +Divide the error by a scale, so they *can* be compared across stations — but they divide by +observed values, so they are **unstable when observations pass through zero** (watch out for +water level around a datum). + +::: {.metrics-ref} +| Metric | Range | Best | What it is | +|---|---|---|---| +| [](`~modelskill.metrics.si`)
[](`~modelskill.metrics.scatter_index`) | [0, ∞) | 0 | Scatter index = `urmse / mean(obs)`. Random error as a fraction of the mean. | +| [](`~modelskill.metrics.mape`)
[](`~modelskill.metrics.mean_absolute_percentage_error`) | [0, ∞) | 0 | Mean absolute % error, `mean(\|model−obs\| / \|obs\|)`. Intuitive %, but blows up near obs = 0. | +::: + +**Use when:** comparing error across stations of very different magnitude, **and** the +observed values stay comfortably away from zero. For water level, prefer the efficiency +scores below over `mape`/`si`. + +--- + +## Efficiency / skill scores — dimensionless, comparable across stations + +The "one number for overall skill" family. All reference the observed mean or variance, so a +score of 0 (for NSE/KGE) means "no better than predicting the mean." + +::: {.metrics-ref} +| Metric | Range | Best | What it is | +|---|---|---|---| +| [](`~modelskill.metrics.nse`)
[](`~modelskill.metrics.nash_sutcliffe_efficiency`) | (−∞, 1] | 1 | Nash–Sutcliffe efficiency. 0 = no better than the obs mean; < 0 = worse. | +| [](`~modelskill.metrics.r2`) | (−∞, 1] | 1 | Coefficient of determination. Identical to NSE (see below). | +| [](`~modelskill.metrics.kge`)
[](`~modelskill.metrics.kling_gupta_efficiency`) | (−∞, 1] | 1 | Kling–Gupta — composite of **correlation + bias ratio + variability ratio**. | +| [](`~modelskill.metrics.willmott`) | [0, 1] | 1 | Willmott's Index of Agreement — bounded [0,1], less harsh on bias than NSE. | +| [](`~modelskill.metrics.ev`)
[](`~modelskill.metrics.explained_variance`) | (−∞, 1] | 1 | Proportion of variance explained (differs from NSE when the model is biased). | +| [](`~modelskill.metrics.mef`)
[](`~modelskill.metrics.model_efficiency_factor`) | [0, ∞) | 0 | `RMSE / std(obs) = √(1 − NSE)`. Same information as NSE, expressed as an **error** (lower is better). | +::: + +**Use when:** you need a single dimensionless score to rank models or compare stations. +Reach for **`kge`** when you want to *diagnose why* a model fails (it separates correlation, +bias, and variance); **`nse`** is the hydrology standard for overall predictive power; +`willmott` if you want a strictly bounded [0,1] score. + +### `cc`, `ev` and `r2`: a nested hierarchy + +A common question is how `cc`, `ev` and `r2` relate. They answer increasingly strict +versions of *"how much of the observed variation does the model capture?"*, each penalising +one more kind of error: + +| Score | Penalises bias (offset)? | Penalises wrong amplitude? | Scored against | +|---|---|---|---| +| `cc²` (square of `cc`) | no | no | the best-fit line (free slope + intercept) | +| `ev` | no | yes | obs variance, ignoring a constant offset | +| `r2` (= `nse`) | yes | yes | the 1:1 line | + +For the same data this gives the ordering **`cc²` ≥ `ev` ≥ `r2`**: `cc²` forgives both a +constant offset and a wrong amplitude, `ev` forgives only the offset, and `r2`/`nse` forgives +nothing (it scores against the 1:1 line). The gap `ev − r2` is exactly the squared, +normalised bias. + +In practice, report `cc` (timing/phase) **plus** one efficiency score (`nse` or `kge`) +**plus** `bias` separately — that trio localises a failure to phase, offset, or amplitude, +which no single number can. `cc²` rarely adds anything once you already report `cc`, and `ev` +(scikit-learn's `explained_variance_score`) sits between the two and is seldom reported on its +own in water modelling. + +::: {.callout-warning} +## `r2` is the coefficient of determination, not squared correlation +ModelSkill's `r2` equals **NSE** — they are the same number under two names — *not* squared +Pearson correlation. The two generally differ (`r2` penalises bias, `cc²` does not). If you +want squared correlation, compute `cc` and square it yourself. +::: + +--- + +## Correlation & amplitude — dimensionless + +These ignore systematic bias, so always read them **alongside** `bias`. + +::: {.metrics-ref} +| Metric | Range | Best | What it is | +|---|---|---|---| +| [](`~modelskill.metrics.cc`)
[](`~modelskill.metrics.corrcoef`) | [−1, 1] | 1 | Pearson correlation — **linear co-variation / timing**. Blind to bias and amplitude. | +| [](`~modelskill.metrics.rho`)
[](`~modelskill.metrics.spearmanr`) | [−1, 1] | 1 | Spearman rank correlation — monotonic, **robust** to outliers and non-linearity. | +| [](`~modelskill.metrics.lin_slope`) | (−∞, ∞) | 1 | Slope of the model-vs-obs regression. < 1 = model **under-responds** in amplitude. | +::: + +**Use when:** the question is about **phase/timing** (`cc`), a monotonic-but-non-linear +relationship (`rho`), or **amplitude** of the response (`lin_slope`). + +--- + +## Event & distribution + +::: {.metrics-ref} +| Metric | Range | Best | What it is | +|---|---|---|---| +| [](`~modelskill.metrics.peak_ratio`)
[](`~modelskill.metrics.pr`) | [0, ∞) | 1 | Ratio of modelled to observed **peaks** (mean over matched peak events). < 1 = peaks under-predicted. | +| [](`~modelskill.metrics.hit_ratio`) | [0, 1] | 1 | Fraction of points within an acceptable deviation `a` of the observation. Takes an `a=` argument. | +::: + +**Use when:** `peak_ratio` for storm-surge / flood-peak capture. `hit_ratio` for +acceptance-criterion reporting — "X % of points within ±0.1 m" (set the tolerance with `a=`). + +--- + +## Directional / circular — for direction variables only + +For wind / wave / current **direction**, where 359° and 1° are 2° apart, not 358°. These +require the quantity to be flagged directional +(`Quantity(..., is_directional=True)`); they handle the 0–360° wrap-around. See the +[directional data example](../examples/Directional_data_comparison.qmd). + +::: {.metrics-ref} +| Metric | Range | Best | What it is | +|---|---|---|---| +| [](`~modelskill.metrics.c_bias`) | [−180, 180] | 0 | Circular bias (mean angular error). | +| [](`~modelskill.metrics.c_mae`)
[](`~modelskill.metrics.c_mean_absolute_error`) | [0, 180] | 0 | Circular mean absolute error. | +| [](`~modelskill.metrics.c_rmse`)
[](`~modelskill.metrics.c_root_mean_squared_error`) | [0, 180] | 0 | Circular RMSE. | +| [](`~modelskill.metrics.c_urmse`)
[](`~modelskill.metrics.c_unbiased_root_mean_squared_error`) | [0, 180] | 0 | Circular unbiased RMSE. | +| [](`~modelskill.metrics.c_max_error`) | [0, 180] | 0 | Largest circular (angular) error. | +::: + +**Use when:** the variable is an angle. Don't use the scalar metrics above on direction data. + +--- + +## Quick decision guide + +| Your question | Reach for | +|---|---| +| Is the model systematically high or low? | `bias` | +| Does it under-/over-respond in amplitude? | `lin_slope` | +| How big is a typical error, in my units? | `rmse` (penalise big misses) or `mae` (robust) | +| Split random vs systematic error? | `urmse` vs `bias` (`rmse² = bias² + urmse²`) | +| What's the worst single miss? | `max_error` | +| One dimensionless score, comparable across stations? | `nse` or `kge` | +| *Why* does the model fail (corr / bias / variance)? | `kge` | +| Compare error across very different-sized stations? | `si` or `nse`/`kge` (avoid `rmse`) | +| Timing / phase agreement? | `cc` | +| Do we capture the storm peaks? | `peak_ratio` | +| What fraction meets an acceptance tolerance? | `hit_ratio` (set `a=`) | +| Working with direction (wind/wave/current)? | the `c_*` family | + +## Cross-cutting reminders + +- **Scale-dependent vs dimensionless.** The error-magnitude metrics are in data units — never + compare them across stations of different magnitude. The rest are dimensionless and comparable. +- **Always pair correlation with bias.** `cc`/`rho` are blind to offset; a model can correlate + perfectly while sitting 1 m too high. +- **Division-by-obs metrics** (`mape`, `si`) are fragile near zero observations. +- **Custom metrics**: any `f(obs, model) -> float` callable drops straight into + `metrics=[...]`; the column takes the function name. See the + [custom metric example](../examples/Metrics_custom_metric.qmd). diff --git a/src/modelskill/metrics.py b/src/modelskill/metrics.py index 5a6f008a0..3c1addadb 100644 --- a/src/modelskill/metrics.py +++ b/src/modelskill/metrics.py @@ -483,7 +483,7 @@ def scatter_index2(obs: ArrayLike, model: ArrayLike) -> Any: {\sum_{i=1}^n obs_i^2}} $$ - Range: [0, 100]; Best: 0 + Range: $[0, \infty)$; Best: 0 """ assert obs.size == model.size if len(obs) == 0: @@ -506,8 +506,10 @@ def explained_variance(obs: ArrayLike, model: ArrayLike) -> Any: r"""EV: Explained variance EV is the explained variance and measures the proportion - [0 - 1] to which the model accounts for the variation - (dispersion) of the observations. + to which the model accounts for the variation (dispersion) + of the observations. A perfect match gives 1; a model that + explains less variance than the observed mean gives a + negative value. In cases with no bias, EV is equal to r2 @@ -518,7 +520,7 @@ def explained_variance(obs: ArrayLike, model: ArrayLike) -> Any: (obs_i - \overline{obs})^2} $$ - Range: [0, 1]; Best: 1 + Range: $(-\infty, 1]$; Best: 1 See Also -------- @@ -958,7 +960,7 @@ def c_max_error(obs: ArrayLike, model: ArrayLike) -> Any: Notes ----- - Range: $[0, \\infty)$; Best: 0 + Range: $[0, 180]$; Best: 0 Returns -------