246 changes: 246 additions & 0 deletions experiments/napkin_math/docs/methology.md
@@ -0,0 +1,246 @@
# PlanExe Napkin-Math Monte Carlo: Methodology

The PlanExe napkin-math pipeline turns a long PlanExe report into a small
deterministic model and runs 10,000 Monte Carlo simulations against it.
The last section is slideshow content intended for the assessment
slideshow, placed before the plan roster.

## Pipeline overview

Eight stages:

1. `compress_report_section`: compresses four long sections of the PlanExe
report (Selected Scenario, Review Plan, Premortem, Expert Criticism)
into inline-tagged digests.
2. `prepare_extract_input.py`: concatenates the four compressed digests
with the four passthrough raw sections (Executive Summary, Project Plan,
Assumptions, Data Collection) into one `extract_parameters_input.md`
digest.
3. `extract-parameters-from-digest`: reads the digest and emits a small
JSON: 4–8 key values, ≤5 missing inputs, ≤5 first calculations, ≤5
unmodelled existential gates. Inputs carry an inline
`[source_status | e=N r=N | quote: verified]` tag that records *how* the
value is known.
4. `validate_parameters.py`: enforces 16 structural rules
(id uniqueness, dependency declaration, formula-RHS declared, comment
word caps, threshold-friendly naming, no dead-end variables, …) before
any simulation runs.
5. `generate-bounds`: proposes `low / base / high` triangular bounds for
every missing input and for every key value whose `value_type` or
`uncertainty` says it needs a distribution. Each bounds entry carries
`source: "data"` or `source: "assumption"` and a one-sentence rationale.
6. `generate-calculations`: emits one Python function per declared
formula. The functions are pure: no I/O, no globals, no classes.
7. `run_monte_carlo.py`: the sampling and threshold-evaluation runner.
This is the only stage that touches randomness.
8. `summarize_assessment.py`: renders the assessment document with the
manifest, provenance map, gate verdicts, decision implications, failure
drivers, missing-input rankings, and scenario sanity-check tables.
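
To make stage 6 concrete, here is a minimal sketch of what one generated calculation could look like. The function name, the argument names, and the convention that arguments are matched to variable ids by name are all assumptions made for illustration; the pipeline only guarantees that each function is pure.

```python
def budget_margin_usd(total_budget_usd: float, total_cost_usd: float) -> float:
    """Hypothetical formula: budget remaining after costs.

    Pure function: the result depends only on the arguments
    (no I/O, no globals, no classes). The name follows the
    threshold-friendly convention, so a positive result means "pass".
    """
    return total_budget_usd - total_cost_usd


# Example call with made-up numbers.
margin = budget_margin_usd(total_budget_usd=120.0, total_cost_usd=100.0)
```

Because the function has no hidden state, the same inputs always give the same output, which is what lets stage 7 be the only stage that touches randomness.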

## How a single run works

Given `parameters.json`, `bounds.json`, `calculations.py`, and
`montecarlo_settings.json`, the runner does the following for each of the
configured `n_runs` trials (10,000 by default):

1. **Sample each input from its triangular distribution.** Each bounded
   variable is drawn from `triangular(low, base, high)`. The
   `sampling_discipline` tag on the bounds entry post-processes the draw:
   - `continuous`: plain triangular draw clamped to `[low, high]`.
   - `fraction`: draw clamped to `[0, 1]` and then to `[low, high]`.
   - `integer`: continuous draw, rounded, re-clamped to `[low, high]`.
   - `bernoulli_gate`: Bernoulli trial with `default_pass_probability`;
     the result is `low` on failure and `high` on success.
   - `fixed`: `low == base == high`; the variable is genuinely pinned.

   Inputs that are not in `bounds.json` (key values with a single declared
   numeric `value`) keep that value across all 10,000 runs.

2. **Run the deterministic calculations.** The runner invokes each function
   in `calculations.py` in declaration order. Inputs are pulled from the
   current run's input pool plus any earlier-stage outputs. There is no
   re-sampling within a run; the same draw of every input flows through
   every calculation.

3. **Evaluate each declared threshold.** Each entry in the `thresholds`
   block of `montecarlo_settings.json` is of the form
   `{ "operator": ">=", "value": 0 }` and points at the id of a computed
   output. The output is named so that *positive = pass*: the validator
   enforces the threshold-friendly naming rule at extraction time, so
   margins, surplus, and coverage variables read in the "positive = good"
   direction. The runner records pass / fail per threshold per run.

4. **Tally and aggregate.** After 10,000 runs the per-threshold pass rate
   is the empirical probability that the declared gate holds under the
   declared bounds. Bands:
   - **ROBUST** if pass rate ≥ 80%.
   - **MARGINAL** if 50% ≤ pass rate < 80%.
   - **FRAGILE** if 20% ≤ pass rate < 50%.
   - **DOOM** if pass rate < 20%.

   The plan's overall risk band is the band of its worst declared gate.
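
The four steps above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the code in `run_monte_carlo.py`: the function names, the `fixed_values` argument, the mapping from output id to threshold, and the convention that each calculation's name is its output id and its argument names are input ids are all hypothetical.

```python
import inspect
import operator
import random

OPERATORS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}


def clamp(x, low, high):
    return max(low, min(high, x))


def sample_input(rng, entry):
    """Step 1: draw one value according to the entry's sampling_discipline."""
    low, base, high = entry["low"], entry["base"], entry["high"]
    discipline = entry.get("sampling_discipline", "continuous")
    if discipline == "fixed":
        return base
    if discipline == "bernoulli_gate":
        return high if rng.random() < entry["default_pass_probability"] else low
    # Note the standard-library argument order: (low, high, mode).
    draw = rng.triangular(low, high, base)
    if discipline == "fraction":
        draw = clamp(draw, 0.0, 1.0)
    elif discipline == "integer":
        draw = round(draw)
    return clamp(draw, low, high)


def band(pass_rate):
    """Step 4: map a pass rate in [0, 1] to its band."""
    if pass_rate >= 0.80:
        return "ROBUST"
    if pass_rate >= 0.50:
        return "MARGINAL"
    if pass_rate >= 0.20:
        return "FRAGILE"
    return "DOOM"


def run_monte_carlo(bounds, fixed_values, calculations, thresholds,
                    n_runs=10_000, seed=12345):
    rng = random.Random(seed)
    passes = dict.fromkeys(thresholds, 0)
    for _ in range(n_runs):
        # Step 1: one draw per bounded input; fixed values are copied as-is.
        pool = dict(fixed_values)
        pool.update({vid: sample_input(rng, e) for vid, e in bounds.items()})
        # Step 2: calculations run in declaration order on the same draw.
        for fn in calculations:
            args = {name: pool[name] for name in inspect.signature(fn).parameters}
            pool[fn.__name__] = fn(**args)
        # Step 3: record pass / fail per threshold.
        for output_id, gate in thresholds.items():
            if OPERATORS[gate["operator"]](pool[output_id], gate["value"]):
                passes[output_id] += 1
    return {output_id: count / n_runs for output_id, count in passes.items()}


# Toy model with made-up numbers.
def budget_margin_usd(total_budget_usd, total_cost_usd):
    return total_budget_usd - total_cost_usd


settings = {
    "bounds": {"total_cost_usd": {"low": 80.0, "base": 100.0, "high": 150.0}},
    "fixed_values": {"total_budget_usd": 120.0},
    "calculations": [budget_margin_usd],
    "thresholds": {"budget_margin_usd": {"operator": ">=", "value": 0}},
}
rates = run_monte_carlo(**settings)
verdicts = {gate: band(rate) for gate, rate in rates.items()}
```

For this toy model the analytical pass probability is about 74%, so the gate lands in the MARGINAL band.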

The seed defaults to `12345` so two runs against the same parameters /
bounds / calculations produce identical output. Re-running after a bounds
tweak is the right way to do sensitivity analysis: change the bounds,
rerun, compare.
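
A small self-contained sketch of that workflow, assuming a single hypothetical cost input checked against a fixed budget (the helper name `pass_rate` and all numbers are made up for illustration):

```python
import random


def pass_rate(low, base, high, budget, n_runs=10_000, seed=12345):
    """Share of runs in which budget - cost >= 0, with cost ~ triangular."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_runs):
        cost = rng.triangular(low, high, base)  # stdlib order: (low, high, mode)
        if budget - cost >= 0:
            hits += 1
    return hits / n_runs


baseline = pass_rate(80, 100, 150, budget=120)
repeat = pass_rate(80, 100, 150, budget=120)   # same seed, same bounds
widened = pass_rate(80, 100, 200, budget=120)  # only the high bound changed

print(baseline == repeat)   # identical because the seed is fixed
print(baseline, widened)    # the gap is the effect of the bounds tweak
```

Because both runs consume the same random stream, any difference between `baseline` and `widened` comes from the bounds change alone rather than from sampling noise between runs.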

## Where the bounds come from

Two layers and one label.

**Layer one: uncertain inputs are flagged by rule.** During parameter
extraction every value gets a `value_type` and an `uncertainty` tag. A
bounds entry is then opened for any variable that is in
`missing_values_to_estimate`, has `value_type ∈ {inferred,
missing_but_needed}`, has `value: null`, has `uncertainty: high`, or has
`uncertainty: medium` paired with `modelling_priority ∈ {critical, high}`.
Variables that meet none of those tests are treated as facts and held
fixed across all 10,000 runs.
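
The flagging rule reads naturally as a predicate. The sketch below uses the field names quoted above; the function name, the dict layout of an entry, and the `explicit` value type used in the example are assumptions.

```python
UNCERTAIN_VALUE_TYPES = {"inferred", "missing_but_needed"}
HIGH_PRIORITIES = {"critical", "high"}


def needs_bounds(entry, missing_ids):
    """Return True if the variable should get a low / base / high entry."""
    if entry["id"] in missing_ids:            # listed in missing_values_to_estimate
        return True
    if entry.get("value_type") in UNCERTAIN_VALUE_TYPES:
        return True
    if entry.get("value") is None:            # value: null
        return True
    if entry.get("uncertainty") == "high":
        return True
    return (entry.get("uncertainty") == "medium"
            and entry.get("modelling_priority") in HIGH_PRIORITIES)


# Hypothetical entries.
stated_budget = {"id": "total_budget_usd", "value": 120.0, "value_type": "explicit",
                 "uncertainty": "low", "modelling_priority": "critical"}
guessed_cost = {"id": "total_cost_usd", "value": 100.0, "value_type": "explicit",
                "uncertainty": "medium", "modelling_priority": "high"}

flags = {e["id"]: needs_bounds(e, missing_ids=set())
         for e in (stated_budget, guessed_cost)}
```

Here `total_budget_usd` meets none of the tests and stays fixed, while `total_cost_usd` is flagged by the medium-uncertainty, high-priority clause.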

**Layer two: the triangular bounds are proposed.** For each flagged
variable, the choice of `low / base / high` is informed by:

- *Source-report anchors*: explicit numbers the plan names directly
(e.g., a stated cost range, a risk-register sensitivity, an expert-review
sensitivity bracket). The plan's own framing comes first.
- *Source-status tags* on the digest line: `explicit`, `derived`,
`inferred`, `stress_test`, or `missing`. `stress_test` figures from the
Premortem are the right anchor for the *upper bound* on a cost variable
or the *lower bound* on a coverage variable.
- *Spread heuristics*: `±10–20%` on low-uncertainty variables,
`±25–50%` on medium-uncertainty variables, `≥±50%` (up to 2–5×) on
high-uncertainty variables.
- *Sampling discipline*: fractions stay in `[0, 1]`, integer counts use
whole-number bounds, bernoulli gates declare a default-pass probability.
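
When the report offers no anchor, the spread heuristics and sampling discipline combine roughly as in the sketch below. The specific half-widths (15%, 35%, 50%) are hypothetical picks from inside the quoted ranges, and the function name and rationale wording are made up; the real `generate-bounds` stage may choose differently.

```python
import math

# Hypothetical defaults, one per uncertainty band (assumptions, not pipeline values).
DEFAULT_SPREAD = {"low": 0.15, "medium": 0.35, "high": 0.50}


def assumption_bounds(base, uncertainty, sampling_discipline="continuous"):
    """Propose a symmetric fallback low / base / high around a base value."""
    spread = DEFAULT_SPREAD[uncertainty]
    half_width = abs(base) * spread
    low, high = base - half_width, base + half_width
    if sampling_discipline == "fraction":
        low, high = max(0.0, low), min(1.0, high)      # fractions stay in [0, 1]
    elif sampling_discipline == "integer":
        low, high = math.floor(low), math.ceil(high)   # whole-number bounds
    return {
        "low": low,
        "base": base,
        "high": high,
        "sampling_discipline": sampling_discipline,
        "source": "assumption",
        "rationale": f"Report is silent; ±{spread:.0%} spread for {uncertainty} uncertainty.",
    }


# Example: a coverage fraction with medium uncertainty and no report anchor.
uptime_bounds = assumption_bounds(0.9, "medium", sampling_discipline="fraction")
```

In the example the raw upper bound would exceed 1, so the fraction discipline caps it at 1.0. Report-anchored entries would instead carry `"source": "data"` and cite the anchoring number in the rationale.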

**The label that exposes this to the reader.** Every bounds entry carries
`"source": "data"` (anchored on a source-report number that the rationale
must cite) or `"source": "assumption"` (extrapolated where the report is
silent). This is rendered downstream as a `Basis` column in the
assessment's *Missing inputs ranked by impact* table, so the reader can
see at a glance whether a given driver is grounded in the plan or
extrapolated by the model.

The pipeline never claims the bounds *are* the truth. It claims the bounds
make every assumption visible and editable in a single 10-line JSON
fragment per variable.

## Where the thresholds come from

Each line in the `thresholds` block of `montecarlo_settings.json` points
at one calculation output and says *"this output, evaluated by this
operator against this value, is the gate."* The thresholds themselves are
lifted from the plan, not invented by the model. Every threshold carries a
`threshold_basis` tag:

- **`report_explicit`**: the plan states the threshold directly
("if X exceeds 1.20 …", "fleet uptime must hold above 90%"). This is
the strongest provenance.
- **`report_inferred`**: the plan implies the threshold through framing
but does not state it as a hard number. The proposed value carries a
one-line rationale citing the implicit framing.
- **`report_derived`**: the threshold is calculable from other explicit
plan numbers (e.g., a break-even volume derived from a stated budget
and per-unit margin).
- **`model_defined`**: the calculation produces a margin or surplus
whose *direction* is plan-defined but whose zero-crossing the plan does
not name. The runner uses `>= 0` as the default because the
threshold-friendly naming rule guarantees positive = pass.

The assessment renders this basis as a column in the *Gate verdicts*
table, so a skeptical reader can immediately see which gates are grounded
in source-report numbers and which were derived by the model.
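
A minimal sketch of how a gate could be evaluated from such an entry, including the `>= 0` default for `model_defined` thresholds. The function name and the exact dict layout are assumptions; only the `operator`, `value`, and `threshold_basis` fields come from the description above.

```python
import operator

OPERATORS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}
KNOWN_BASES = {"report_explicit", "report_inferred", "report_derived", "model_defined"}


def evaluate_gate(output_value, threshold):
    """Return True if the computed output passes the declared threshold."""
    basis = threshold["threshold_basis"]
    if basis not in KNOWN_BASES:
        raise ValueError(f"unknown threshold_basis: {basis!r}")
    if basis == "model_defined":
        # The plan names no zero-crossing, so fall back to ">= 0".
        op = threshold.get("operator", ">=")
        value = threshold.get("value", 0)
    else:
        op, value = threshold["operator"], threshold["value"]
    return OPERATORS[op](output_value, value)


# Hypothetical gates.
uptime_gate = {"operator": ">=", "value": 0.90, "threshold_basis": "report_explicit"}
margin_gate = {"threshold_basis": "model_defined"}

uptime_ok = evaluate_gate(0.93, uptime_gate)
margin_ok = evaluate_gate(-5_000.0, margin_gate)
```

Rejecting unknown basis tags keeps every gate in the verdict table attributable to one of the four provenance classes.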

## Worst-gate framing and its limits

The `overall_risk_band` field in the assessment manifest is the *band of
the plan's worst declared gate*. This is intentional and load-bearing:
when a plan declares multiple binary commitments, the failure of any one
of them is sufficient to fail the plan. The min-over-gates aggregation
respects that: it does not let a 100% pass on the budget gate paper over
a 1% pass on a coverage gate.

What it is *not*:

- It is not a calibrated whole-plan probability. The pass rate is
conditional on the declared bounds and on the unmodelled gates holding.
- It is not a verdict on the plan as a whole. It is a verdict on whether
the plan's *modelled gates* are credible under the declared bounds.
- It does not combine across gates with different units (USD vs hours vs
fraction). The `Aggregation warning` block in the assessment names this
explicitly.

The assessment also lists *unmodelled existential gates*: gates whose
failure would end the plan independently of any financial or operational
margin (regulatory approval, political signoff, supply continuity,
counterparty acceptance). These never enter the Monte Carlo. They are
listed so a reader sees at a glance that the modelled DOOM/FRAGILE verdict
is conditional on those existential gates holding.
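
The worst-gate rule and its conditionality can be sketched as follows. The function names and the `conditional_on` key are hypothetical; only `overall_risk_band` and the four band names come from the text.

```python
# Bands ordered from least to most severe.
SEVERITY = ["ROBUST", "MARGINAL", "FRAGILE", "DOOM"]


def overall_risk_band(gate_bands):
    """Band of the worst declared gate (min-over-gates on pass rate)."""
    return max(gate_bands.values(), key=SEVERITY.index)


def manifest_summary(gate_bands, unmodelled_gates):
    """Pair the modelled verdict with the gates it is conditional on."""
    return {
        "overall_risk_band": overall_risk_band(gate_bands),
        "conditional_on": list(unmodelled_gates),  # never simulated, only listed
    }


# Hypothetical plan: a strong budget gate does not rescue a weak coverage gate.
summary = manifest_summary(
    gate_bands={"budget_margin_usd": "ROBUST", "coverage_margin_fraction": "DOOM"},
    unmodelled_gates=["regulatory approval", "supply continuity"],
)
```

The function deliberately works on bands rather than averaging pass rates, which mirrors the point that gates in different units are not combined.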

---

## Slideshow content

The two slides below are written to be inserted into the assessment
slideshow *before* the plan roster. They directly answer "where did the
distributions come from?" and "who decided the thresholds?" so the
roster slides land on a reader who already trusts the pipeline.

### Slide A: How the simulation works

- Each uncertain input is drawn **10,000 times** from a triangular
distribution over its `low / base / high` bounds. Each draw flows through
a deterministic Python calculation, and the result is checked against the
declared threshold.
- Same seed (`12345`) reproduces the run exactly.
- The pass rate over 10,000 runs lands the gate in one of four bands:
≥80% **ROBUST**, 50–80% **MARGINAL**, 20–50% **FRAGILE**, <20% **DOOM**.
- The plan's **overall risk band is the band of its worst declared gate**.
A budget gate passing 100% does not paper over a coverage gate passing
1%; the min-over-gates rule respects that each declared commitment must
hold on its own.
- Existential gates that cannot be tested as numbers (regulatory approval,
political signoff, supply continuity) are listed in `unmodelled_gates`.
They qualify the modelled verdict but never enter the simulation.

*The model produces a categorical verdict per gate. It does not produce a
calibrated whole-plan probability.*

### Slide B: Where the numbers come from

**Bounds (the input distributions):**

- Uncertain inputs are flagged by rule: every entry in
`missing_values_to_estimate`, plus any key value tagged `inferred`,
`missing_but_needed`, `uncertainty: high`, or `uncertainty: medium` at
critical / high modelling priority.
- Each uncertain input's `low / base / high` is anchored on **the plan's
own numbers** (risk-register figures, expert-review sensitivities,
scenario ranges), with assumption-driven spread (from ±10–20% at low
uncertainty to ≥±50% at high uncertainty) only where the plan is silent.
- Every bounds entry carries a `source` label: **`data`** (anchored on a
source-report number that the rationale must cite) or **`assumption`**
(extrapolated where the report is silent). The assessment surfaces this
as the `Basis` column in *Missing inputs ranked by impact* so the reader
sees, for each driver, whether the bound came from the plan or was
extrapolated.

**Thresholds (the pass/fail gates):**

- Thresholds are **lifted from the plan, not invented**. Each threshold
carries a `threshold_basis` tag rendered in the *Gate verdicts* table:
  - **`report_explicit`**: the plan states the threshold directly
    ("if uptime < 90%", "if budget exceeds €15B").
  - **`report_inferred`**: the plan implies the threshold; the proposed
    value carries a cited rationale.
  - **`report_derived`**: the threshold is computable from other
    explicit plan numbers.
  - **`model_defined`**: the calculation produces a margin / surplus
    whose direction is plan-defined but whose zero-crossing the plan
    does not name; the runner defaults to `>= 0`.
- The threshold-friendly naming rule guarantees *positive = pass* on every
margin, surplus, and coverage variable, so the reader never has to
guess which sign is the good one.

*Every distribution and every threshold in the simulation is traceable
back to a labeled source: either the plan or an explicit modelling
assumption. Nothing in the verdict is unattributed.*