From b2f245efad51701cac07a630d755e49e26a5ce47 Mon Sep 17 00:00:00 2001
From: Simon Strandgaard
Date: Mon, 18 May 2026 16:17:21 +0200
Subject: [PATCH] docs(napkin-math): add methodology doc with assessment-slideshow content

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 experiments/napkin_math/docs/methology.md | 246 ++++++++++++++++++++++
 1 file changed, 246 insertions(+)
 create mode 100644 experiments/napkin_math/docs/methology.md

diff --git a/experiments/napkin_math/docs/methology.md b/experiments/napkin_math/docs/methology.md
new file mode 100644
index 00000000..60f8ad2e
--- /dev/null
+++ b/experiments/napkin_math/docs/methology.md
+# PlanExe Napkin-Math Monte Carlo — Methodology
+
+The PlanExe napkin-math pipeline turns a long PlanExe report into a small
+deterministic model and runs 10,000 Monte Carlo simulations against it.
+The last section is slideshow content intended for the assessment
+slideshow, placed before the plan roster.
+
+## Pipeline overview
+
+Eight stages:
+
+1. `compress_report_section` — compresses four long sections of the PlanExe
+   report (Selected Scenario, Review Plan, Premortem, Expert Criticism)
+   into inline-tagged digests.
+2. `prepare_extract_input.py` — concatenates the four compressed digests
+   with the four passthrough raw sections (Executive Summary, Project Plan,
+   Assumptions, Data Collection) into one `extract_parameters_input.md`
+   digest.
+3. `extract-parameters-from-digest` — reads the digest and emits a small
+   JSON: 4–8 key values, ≤5 missing inputs, ≤5 first calculations, ≤5
+   unmodelled existential gates. Inputs carry an inline
+   `[source_status | e=N r=N | quote: verified]` tag that records *how* the
+   value is known.
+4. `validate_parameters.py` — enforces 16 structural rules
+   (id uniqueness, dependency declaration, formula-RHS declared, comment
+   word caps, threshold-friendly naming, no dead-end variables, …) before
+   any simulation runs.
+5.
`generate-bounds` — proposes `low / base / high` triangular bounds for
+   every missing input and for every key value whose `value_type` or
+   `uncertainty` says it needs a distribution. Each bounds entry carries
+   `source: "data"` or `source: "assumption"` and a one-sentence rationale.
+6. `generate-calculations` — emits one Python function per declared
+   formula. The functions are pure: no I/O, no globals, no classes.
+7. `run_monte_carlo.py` — the sampling and threshold-evaluation runner.
+   This is the only stage that touches randomness.
+8. `summarize_assessment.py` — renders the assessment document with the
+   manifest, provenance map, gate verdicts, decision implications, failure
+   drivers, missing-input rankings, and scenario sanity-check tables.
+
+## How a single run works
+
+Given parameters.json, bounds.json, calculations.py, and
+montecarlo_settings.json, the runner does the following for each of the
+configured `n_runs` (defaulting to 10,000) trials:
+
+1. **Sample each input from its triangular distribution.** Each bounded
+   variable is drawn from `triangular(low, base, high)`. The
+   `sampling_discipline` tag on the bounds entry post-processes the draw:
+   - `continuous` — plain triangular draw clamped to `[low, high]`.
+   - `fraction` — draw clamped to `[0, 1]` and then to `[low, high]`.
+   - `integer` — continuous draw, rounded, re-clamped to `[low, high]`.
+   - `bernoulli_gate` — Bernoulli trial with `default_pass_probability`;
+     the result is `low` on failure and `high` on success.
+   - `fixed` — `low == base == high`; the variable is genuinely pinned.
+   Inputs that are not in bounds.json (key values with a single declared
+   numeric `value`) keep that value across all 10,000 runs.
+
+2. **Run the deterministic calculations.** The runner invokes each function
+   in `calculations.py` in declaration order. Inputs are pulled from the
+   current run's input pool plus any earlier-stage outputs.
There is no
+   re-sampling within a run; the same draw of every input flows through
+   every calculation.
+
+3. **Evaluate each declared threshold.** Each entry in the `thresholds`
+   block of `montecarlo_settings.json` is of the form
+   `{ "operator": ">=", "value": 0 }` and points at the id of a computed
+   output. The output is named so that *positive = pass* — the validator
+   enforces the threshold-friendly naming rule at extraction time, so
+   margins, surplus, and coverage variables read in the "positive = good"
+   direction. The runner records pass / fail per threshold per run.
+
+4. **Tally and aggregate.** After 10,000 runs the per-threshold pass rate
+   is the empirical probability that the declared gate holds under the
+   declared bounds. Bands:
+   - **ROBUST** if pass rate ≥ 80%.
+   - **MARGINAL** if 50% ≤ pass rate < 80%.
+   - **FRAGILE** if 20% ≤ pass rate < 50%.
+   - **DOOM** if pass rate < 20%.
+   The plan's overall risk band is the band of its worst declared gate.
+
+The seed defaults to `12345` so two runs against the same parameters /
+bounds / calculations produce identical output. Re-running after a bounds
+tweak is the right way to do sensitivity analysis — change the bounds,
+rerun, compare.
+
+## Where the bounds come from
+
+Two layers and one label.
+
+**Layer one — uncertain inputs are flagged by rule.** During parameter
+extraction every value gets a `value_type` and an `uncertainty` tag. A
+bounds entry is then opened for any variable that is in
+`missing_values_to_estimate`, has `value_type ∈ {inferred,
+missing_but_needed}`, has `value: null`, has `uncertainty: high`, or has
+`uncertainty: medium` paired with `modelling_priority ∈ {critical, high}`.
+Variables that meet none of those tests are treated as facts and held
+fixed across all 10,000 runs.
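
The Layer-one flagging rule can be sketched as a single predicate. This is a minimal illustration, not the pipeline's code: the function name `needs_bounds`, the flat dict shape of an entry, and the two example entries are assumptions made for this sketch. Only the field names (`id`, `value`, `value_type`, `uncertainty`, `modelling_priority`) and `missing_values_to_estimate` come from the text above.

```python
def needs_bounds(entry: dict, missing_ids: set) -> bool:
    """Return True if the entry should get a low / base / high bounds entry.

    Hypothetical helper: `entry` is assumed to be a flat dict for one
    extracted value, and `missing_ids` the ids listed under
    `missing_values_to_estimate`.
    """
    if entry["id"] in missing_ids:
        return True
    if entry.get("value_type") in {"inferred", "missing_but_needed"}:
        return True
    if entry.get("value") is None:
        return True
    uncertainty = entry.get("uncertainty")
    if uncertainty == "high":
        return True
    if uncertainty == "medium" and entry.get("modelling_priority") in {"critical", "high"}:
        return True
    # Meets none of the tests: treated as a fact and held fixed.
    return False


# Invented example entries; the ids and the "explicit" value_type are illustrative.
stated_budget = {"id": "budget_usd", "value": 1_000_000, "value_type": "explicit",
                 "uncertainty": "low", "modelling_priority": "critical"}
uptime_guess = {"id": "fleet_uptime", "value": 0.9, "value_type": "explicit",
                "uncertainty": "medium", "modelling_priority": "high"}

print(needs_bounds(stated_budget, missing_ids=set()))  # False
print(needs_bounds(uptime_guess, missing_ids=set()))   # True
```

The tests are independent, so any one of them is enough to open a bounds entry; a variable is held fixed only when it meets none of them.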
+
+**Layer two — the triangular bounds are proposed.** For each flagged
+variable, `low / base / high` is chosen based on:
+
+- *Source-report anchors* — explicit numbers the plan names directly
+  (e.g., a stated cost range, a risk-register sensitivity, an expert-review
+  sensitivity bracket). The plan's own framing comes first.
+- *Source-status tags* on the digest line — `explicit`, `derived`,
+  `inferred`, `stress_test`, or `missing`. `stress_test` figures from the
+  Premortem are the right anchor for the *upper bound* on a cost variable
+  or the *lower bound* on a coverage variable.
+- *Spread heuristics* — `±10–20%` on low-uncertainty variables,
+  `±25–50%` on medium-uncertainty variables, `≥±50%` (up to 2–5×) on
+  high-uncertainty variables.
+- *Sampling discipline* — fractions stay in `[0, 1]`, integer counts use
+  whole-number bounds, bernoulli gates declare a default-pass probability.
+
+**The label that exposes this to the reader.** Every bounds entry carries
+`"source": "data"` (anchored on a source-report number that the rationale
+must cite) or `"source": "assumption"` (extrapolated where the report is
+silent). This is rendered downstream as a `Basis` column in the
+assessment's *Missing inputs ranked by impact* table, so the reader can
+see at a glance whether a given driver is grounded in the plan or
+extrapolated by the model.
+
+The pipeline never claims the bounds *are* the truth. It claims the bounds
+make every assumption visible and editable in a single 10-line JSON
+fragment per variable.
+
+## Where the thresholds come from
+
+Each line in the `thresholds` block of `montecarlo_settings.json` points
+at one calculation output and says *"this output, evaluated by this
+operator against this value, is the gate."* The thresholds themselves are
+lifted from the plan, not invented by the model.
Every threshold carries a
+`threshold_basis` tag:
+
+- **`report_explicit`** — the plan states the threshold directly
+  ("if X exceeds 1.20 …", "fleet uptime must hold above 90%"). This is
+  the strongest provenance.
+- **`report_inferred`** — the plan implies the threshold through framing
+  but does not state it as a hard number. The proposed value carries a
+  one-line rationale citing the implicit framing.
+- **`report_derived`** — the threshold is calculable from other explicit
+  plan numbers (e.g., a break-even volume derived from a stated budget
+  and per-unit margin).
+- **`model_defined`** — the calculation produces a margin or surplus
+  whose *direction* is plan-defined but whose zero-crossing the plan does
+  not name. The runner uses `>= 0` as the default because the
+  threshold-friendly naming rule guarantees positive = pass.
+
+The assessment renders this basis as a column in the *Gate verdicts*
+table, so a skeptical reader can immediately see which gates are grounded
+in source-report numbers and which were derived by the model.
+
+## Worst-gate framing and its limits
+
+The `overall_risk_band` field in the assessment manifest is the *band of
+the plan's worst declared gate*. This is intentional and load-bearing:
+when a plan declares multiple binary commitments, the failure of any one
+of them is sufficient to fail the plan. The min-over-gates aggregation
+respects that — it does not let a 100% pass on the budget gate paper over
+a 1% pass on a coverage gate.
+
+What it is *not*:
+
+- It is not a calibrated whole-plan probability. The pass rate is
+  conditional on the declared bounds and on the unmodelled gates holding.
+- It is not a verdict on the plan as a whole. It is a verdict on whether
+  the plan's *modelled gates* are credible under the declared bounds.
+- It does not combine across gates with different units (USD vs hours vs
+  fraction). The `Aggregation warning` block in the assessment names this
+  explicitly.
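
As a toy illustration of the band cut-offs and the min-over-gates rule, the sketch below simulates two made-up gates with Python's standard `random.triangular` and reports the worst band. The gate names, bounds, threshold values, and helper names (`band`, `worst_band`, `simulate`) are all hypothetical and are not taken from `run_monte_carlo.py`; only the band boundaries, the seed `12345`, and the 10,000-run count come from this document.

```python
import random

# Bands ordered from worst to best, used to pick the worst gate.
BANDS_WORST_FIRST = ["DOOM", "FRAGILE", "MARGINAL", "ROBUST"]


def band(pass_rate: float) -> str:
    """Map a pass rate in [0, 1] to the four bands used in this document."""
    if pass_rate >= 0.80:
        return "ROBUST"
    if pass_rate >= 0.50:
        return "MARGINAL"
    if pass_rate >= 0.20:
        return "FRAGILE"
    return "DOOM"


def worst_band(bands: list) -> str:
    """Min-over-gates: the overall band is the worst individual band."""
    return min(bands, key=BANDS_WORST_FIRST.index)


def simulate(n_runs: int = 10_000, seed: int = 12345) -> dict:
    """Return the pass rate per gate for two invented gates."""
    rng = random.Random(seed)
    passes = {"budget_margin": 0, "coverage_margin": 0}
    for _ in range(n_runs):
        # random.triangular takes (low, high, mode); mode plays the role of `base`.
        cost = rng.triangular(80.0, 120.0, 100.0)
        coverage = rng.triangular(0.50, 0.90, 0.60)
        # Margins are named so that positive = pass, checked with `>= 0`.
        budget_margin = 150.0 - cost
        coverage_margin = coverage - 0.85
        passes["budget_margin"] += budget_margin >= 0
        passes["coverage_margin"] += coverage_margin >= 0
    return {gate: count / n_runs for gate, count in passes.items()}


rates = simulate()
gate_bands = {gate: band(rate) for gate, rate in rates.items()}
print(gate_bands)
print("overall:", worst_band(list(gate_bands.values())))
```

By construction the budget gate passes in every run (cost never exceeds 120 against an assumed budget of 150), while the coverage gate passes in only a small fraction of runs (about 2% analytically for these bounds), so the overall band is DOOM. Averaging the two pass rates instead would report roughly 51% and hide the failing gate.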
+
+The assessment also lists *unmodelled existential gates* — gates whose
+failure would end the plan independently of any financial or operational
+margin (regulatory approval, political signoff, supply continuity,
+counterparty acceptance). These never enter the Monte Carlo. They are
+listed so a reader sees at a glance that the modelled DOOM/FRAGILE verdict
+is conditional on those existential gates holding.
+
+---
+
+## Slideshow content
+
+The two slides below are written to be inserted into the assessment
+slideshow *before* the plan roster. They directly answer "where did the
+distributions come from?" and "who decided the thresholds?" so the
+roster slides land on a reader who already trusts the pipeline.
+
+### Slide A — How the simulation works
+
+- Each uncertain input is drawn **10,000 times** from a triangular
+  distribution over its `low / base / high` bounds. Each draw flows through
+  a deterministic Python calculation, and the result is checked against the
+  declared threshold.
+- Same seed (`12345`) reproduces the run exactly.
+- The pass rate over 10,000 runs lands the gate in one of four bands:
+  ≥80% **ROBUST**, 50–80% **MARGINAL**, 20–50% **FRAGILE**, <20% **DOOM**.
+- The plan's **overall risk band is the band of its worst declared gate**.
+  A budget gate passing 100% does not paper over a coverage gate passing
+  1%; the min-over-gates rule respects that each declared commitment must
+  hold on its own.
+- Existential gates that cannot be tested as numbers (regulatory approval,
+  political signoff, supply continuity) are listed in `unmodelled_gates`.
+  They qualify the modelled verdict but never enter the simulation.
+
+*The model produces a categorical verdict per gate.
It does not produce a
+calibrated whole-plan probability.*
+
+### Slide B — Where the numbers come from
+
+**Bounds (the input distributions):**
+
+- Uncertain inputs are flagged by rule: every entry in
+  `missing_values_to_estimate`, plus any key value tagged `inferred`,
+  `missing_but_needed`, `uncertainty: high`, or `uncertainty: medium` at
+  critical / high modelling priority.
+- Each uncertain input's `low / base / high` is anchored on **the plan's
+  own numbers** — risk-register figures, expert-review sensitivities,
+  scenario ranges — with assumption-driven spread (±10–50% by
+  uncertainty band) only where the plan is silent.
+- Every bounds entry carries a `source` label: **`data`** (anchored on a
+  source-report number that the rationale must cite) or **`assumption`**
+  (extrapolated where the report is silent). The assessment surfaces this
+  as the `Basis` column in *Missing inputs ranked by impact* so the reader
+  sees, for each driver, whether the bound came from the plan or was
+  extrapolated.
+
+**Thresholds (the pass/fail gates):**
+
+- Thresholds are **lifted from the plan, not invented**. Each threshold
+  carries a `threshold_basis` tag rendered in the *Gate verdicts* table:
+  - **`report_explicit`** — the plan states the threshold directly
+    ("if uptime < 90%", "if budget exceeds €15B").
+  - **`report_inferred`** — the plan implies the threshold; the proposed
+    value carries a cited rationale.
+  - **`report_derived`** — the threshold is computable from other
+    explicit plan numbers.
+  - **`model_defined`** — the calculation produces a margin / surplus
+    whose direction is plan-defined but whose zero-crossing the plan
+    does not name; the runner defaults to `>= 0`.
+- The threshold-friendly naming rule guarantees *positive = pass* on every
+  margin, surplus, and coverage variable, so the reader never has to
+  guess which sign is the good one.
+
+*Every distribution and every threshold in the simulation is traceable
+back to a labeled source — either the plan or an explicit modelling
+assumption. Nothing in the verdict is unattributed.*