From fcb9692bf11c01e66373f03c58bf2b03ab99f825 Mon Sep 17 00:00:00 2001
From: lmoresi <louis.moresi@anu.edu.au>
Date: Tue, 16 Jun 2026 14:12:32 +1000
Subject: [PATCH] docs(solvers): preserve SNESFAS / Vanka / grid-sequencing
 investigation (design notes + working prototypes)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Lands the parked multilevel-nonlinear-Stokes investigation as documentation so the
significant trial-and-error and the WORKING implementations are not lost. No src/ or
test changes — everything runs through petsc_options + petsc4py on stock UW3.

Design notes (docs/developer/design/):
 - snesfas-feasibility.md: SNESFAS via options; mesh-independent on scalar nonlinear;
   adaptation benign.
 - snesfas-vanka-feasibility-study.md: the Vanka deep-dive; custom-IS pressure-support
   patches work where stock PCPATCH fails; mesh-independent Vanka-MG (5/5/6); FAS-Vanka
   2-vs-10 on plasticity but fails at extreme contrast; 3-way FMG/GAMG/FAS-Vanka.
 - multilevel-nonlinear-stokes-strategy.md: capstone + decision to park (small 2D win,
   high engineering load, cleanup first) with return-conditions (3D/parallel/real
   FMG-failing benchmark) and the cheapest re-entry (Tier-1 grid_sequence helper).

Working prototypes (docs/examples/snesfas_investigation/, precedent: submesh_investigation/):
 vanka_asm.py (core custom-IS Vanka recipe), vanka_mg.py (mesh-independent Vanka MG),
 fas_vanka.py (working SNESFAS-Vanka for nonlinear, modest contrast — the 2-vs-10 case),
 benchmark_3way.py (FMG/GAMG/FAS-Vanka), notched_beam_cascade.py (grid sequencing).
 Cross-imports made self-contained; all byte-compile. README documents each + the
 recipes + known limits.

Underworld development team with AI support from Claude Code
---
 .../multilevel-nonlinear-stokes-strategy.md   | 135 +++++++
 docs/developer/design/snesfas-feasibility.md  | 377 ++++++++++++++++++
 .../design/snesfas-vanka-feasibility-study.md | 337 ++++++++++++++++
 docs/examples/snesfas_investigation/README.md |  49 +++
 .../snesfas_investigation/benchmark_3way.py   | 110 +++++
 .../snesfas_investigation/fas_vanka.py        | 101 +++++
 .../notched_beam_cascade.py                   | 102 +++++
 .../snesfas_investigation/vanka_asm.py        | 102 +++++
 .../snesfas_investigation/vanka_mg.py         | 138 +++++++
 9 files changed, 1451 insertions(+)
 create mode 100644 docs/developer/design/multilevel-nonlinear-stokes-strategy.md
 create mode 100644 docs/developer/design/snesfas-feasibility.md
 create mode 100644 docs/developer/design/snesfas-vanka-feasibility-study.md
 create mode 100644 docs/examples/snesfas_investigation/README.md
 create mode 100644 docs/examples/snesfas_investigation/benchmark_3way.py
 create mode 100644 docs/examples/snesfas_investigation/fas_vanka.py
 create mode 100644 docs/examples/snesfas_investigation/notched_beam_cascade.py
 create mode 100644 docs/examples/snesfas_investigation/vanka_asm.py
 create mode 100644 docs/examples/snesfas_investigation/vanka_mg.py

diff --git a/docs/developer/design/multilevel-nonlinear-stokes-strategy.md b/docs/developer/design/multilevel-nonlinear-stokes-strategy.md
new file mode 100644
index 00000000..d26c221d
--- /dev/null
+++ b/docs/developer/design/multilevel-nonlinear-stokes-strategy.md
@@ -0,0 +1,135 @@
+---
+title: "Multilevel solvers for nonlinear Stokes — strategy, findings, and a decision to park"
+status: PARKED (2026-06-16)
+---
+
+# Multilevel solvers for nonlinear Stokes — strategy note + decision
+
+**Decision (2026-06-16): park this. Capture the understanding, ship nothing yet.**
+The conceptual payoff of the investigation was large; the *implementation* payoff
+*right now* is small relative to the engineering load, and the codebase wants a
+cleanup before features of this size land. This note records the landscape, the
+findings, and the conditions under which it is worth coming back. Companion docs
+with the detailed experiments: `snesfas-feasibility.md` and
+`snesfas-vanka-feasibility-study.md`. **The working prototypes are preserved in-repo
+at `docs/examples/snesfas_investigation/`** (curated set + README; the full scratch
+set is in `~/+Simulations/snesfas_spike/`). No `src/` changes were made — this was a
+spike.
+
+## What we were chasing
+
+A robust, scalable solver for **nonlinear Stokes** in the regime that actually
+matters for geodynamics: **strong nonlinearity (plasticity, T-dependent viscosity)
+*and* high viscosity contrast — which co-occur, because the nonlinearity is what
+creates the contrast.** FMG (the geometric-MG-on-the-velocity-block preconditioner,
+#231) is the current robust default; the question was whether SNESFAS / Vanka /
+grid-sequencing could do better where FMG-inside-Newton struggles.
+
+## The conceptual map (the durable result)
+
+Three multilevel strategies, and the key distinctions that took a while to get
+straight:
+
+- **FMG (Galerkin).** Coarse operator `A_H = R A_h P`, built *algebraically from the
+  fine matrix* — it never re-visits the PDE. The multigrid acts on the **velocity
+  block only**, inside the Schur fieldsplit; the saddle/pressure coupling and the
+  viscosity jump are handled by the Schur, *not* by a coarse-grid correction. That is
+  why FMG is robust to high contrast: it keeps the hard, indefinite, high-contrast
+  part **out of the multigrid**. Transfers can be approximate (barycentric / general);
+  they only move a *correction* with homogeneous boundary values, and the smoother +
+  Krylov clean up any imperfection — so it tolerates UW3's non-nested (curved /
+  mover-deformed) hierarchies.
+
+- **SNESFAS (re-discretisation + τ-correction).** Nonlinear multigrid on the **full
+  coupled saddle**. Each level uses the *genuine re-discretised* coarse problem (not
+  Galerkin — `-snes_view` confirms "Not using Galerkin… function evaluation"), plus
+  the **τ-correction** that couples levels (the coarse grid solves the full problem
+  with the restricted fine residual as forcing, and prolongs only the *change*). The
+  smoother is a separate choice — **Vanka** (patch-local mini-Stokes solves) is one
+  option for the saddle, *not* part of FAS itself. The τ-correction is the extra
+  nonlinear machinery — **and it is exactly the part that is fragile to high
+  contrast**, because it is a repeated coarse-grid *correction* across a jump the
+  coarse grid cannot resolve.
+
+- **Grid sequencing / nested iteration (Brandt).** Solve the re-discretised coarse
+  problem (cheap, well-conditioned) to **find the basin**, interpolate up as the
+  initial guess, solve the next level, repeat. Coarse grids are used for
+  *initialisation*, never *correction* — so there is **no coarse-correction-across-
+  the-jump**, hence it is robust to contrast. It lacks FAS's τ-correction (so it
+  doesn't get the textbook MG convergence *rate*), but for problems that become
+  *more ill-conditioned with refinement* it is the effective, robust tool: each
+  level is only a bounded correction to the one below.
+
+The unifying insight: **nonlinearity wants a coarse-grid correction (FAS); high
+contrast forbids one (the coarse grid can't represent the jump).** Those are
+opposite demands, which is why no single monolithic cycle covers both, and why the
+robust workhorse stays "split the saddle off (FMG) + handle the nonlinearity by
+Newton/Picard + continuation (timestepping)."
+
+## What was actually demonstrated (iteration counts — timings were unreliable)
+
+(Wall-clock numbers in the companion docs are *soft* — JIT, machine noise,
+resolution/tolerance-sensitive. Trust the iteration counts.)
+
+- **SNESFAS runs through UW3 with zero `src` changes** (petsc_options only) — `copyDS`
+  + PETSc's FAS setup give coarse residuals. Mesh-independent on scalar nonlinear
+  problems; beats Newton on stiff nonlinear diffusion.
+- **Vanka works** on UW3's simplex P2–P1 once patches are built as the *support of
+  each pressure basis function* (custom-IS `PCASM`, exact local LU) — the stock
+  `PCPATCH` constructs are the wrong tool and fail. A geometric MG with that Vanka
+  smoother is **mesh-independent (5/5/6 outer iters across 16× DOFs)** — needs a GMRES
+  Krylov smoother (additive Schwarz alone diverges).
+- **FAS-Vanka wins on plasticity** — viscoplastic τ_y=1: **2 nonlinear iterations vs
+  10** for Newton+FMG and GAMG (the coarse problem is better conditioned, as
+  predicted). But with *additive* Vanka it **fails at extreme contrast (10⁶)** no
+  matter how hard you smooth — needs multiplicative Vanka + interface-aware coarsening.
+- **The cascade helps iff the per-level solve is iteration-limited.** On SolCx it does
+  nothing for FMG (already 1–3 iters → cascade is pure overhead) but rescues a
+  struggling GAMG (**8→2 iters**). Its payoff is exactly proportional to how badly the
+  cold fine solve struggles.
+- **PETSc's `-snes_grid_sequence` does not work on UW3 meshes** — the built-in
+  interpolator hard-fails (point location) on UW3's non-nested (boundary-snapped /
+  mover-deformed) hierarchies and on the BC-constrained boundary DOFs. A hand-rolled
+  cascade using `uw.function.evaluate` (robust, never hard-fails; BCs re-imposed per
+  level) sidesteps this entirely.
+
+## Why park it now
+
+- **The marginal win is small in the current regime.** FMG already handles the
+  linear / moderate / high-contrast-*linear* cases well; for nonlinear, Picard/Newton
+  + FMG + timestepping-as-continuation is robust if unspectacular. FAS-Vanka's clear
+  advantage (plasticity iteration count) does **not** yet translate to wall-clock
+  (LU/Vanka smoother cost), and it breaks on exactly the high-contrast that real
+  nonlinear problems also carry.
+- **The engineering load is high.** Production FAS-Vanka needs: a multiplicative /
+  interface-aware Vanka smoother (the extreme-contrast gap), per-level pressure
+  nullspace, parallel patch construction, and a UW3 API. Production grid-sequencing
+  needs the problem **re-bindable to coarser meshes** (re-create variables, re-bind
+  expressions/BCs, and **restrict auxiliary/material fields** — the recurring "get the
+  coefficient data onto the coarse grid" problem).
+- **The tree needs a cleanup first.** Landing a feature of this weight onto a working
+  tree with many outstanding changes is the wrong order.
+
+## When to come back (what would change the calculus)
+
+- **3-D and large-parallel.** This is where the argument flips: the direct solves
+  that FMG's coarse grid and the LU paths lean on scale ~O(N²) in 3-D and MUMPS gets
+  unreliable at high core counts, while Vanka's local solves scale ~O(N) and
+  parallelise. Re-measure there.
+- **A concrete problem where FMG genuinely fails at production resolution** — e.g. a
+  notched/localising plasticity benchmark whose cold fine solve collapses. The spike
+  could not tune a *simplified* version into the "easy-coarse / hard-fine" knife-edge
+  (UW3's Picard+FMG was robust on smooth viscoplastic across resolutions); the real
+  pathology needs the singular features (geometric notch tip, Drucker–Prager) and is
+  the right target if/when it bites in practice.
+
+## Cheapest future entry point
+
+If a real need appears, the lowest-cost first step is **Tier-1 grid sequencing**: a
+~30-line `grid_sequence_solve(build_fn, resolutions)` helper that loops coarse→fine,
+solves each level with FMG, and interpolates up via `uw.function.evaluate`. It needs
+no solver-internals changes and is robust to UW3's non-nested meshes. It only helps
+when the fine solve is iteration-limited, so it should be opt-in. Prototype:
+`~/+Simulations/snesfas_spike/notched_beam_cascade.py`. The callback-free
+`solve(grid_sequence=True)` and the FAS-Vanka path are the larger follow-ons, gated
+on the cleanup and on 3-D/parallel demand.
diff --git a/docs/developer/design/snesfas-feasibility.md b/docs/developer/design/snesfas-feasibility.md
new file mode 100644
index 00000000..2ec6ec71
--- /dev/null
+++ b/docs/developer/design/snesfas-feasibility.md
@@ -0,0 +1,377 @@
+---
+title: "SNESFAS (nonlinear multigrid) feasibility in Underworld3"
+---
+
+# SNESFAS feasibility spike
+
+**Verdict: GO for scalar nonlinear, and GO for nonlinear Stokes with adaptation
+(with two engineering follow-ons).** PETSc's `SNESFAS` (Full Approximation Scheme —
+nonlinear geometric multigrid) runs through Underworld3's existing DMPlex-FE solver
+stack **with no source changes** — it is activated entirely through `petsc_options`.
+On a strongly nonlinear scalar problem it is mesh-independent, more robust than
+Newton, and faster, in serial and parallel. On **nonlinear (power-law) Stokes** it
+holds a nonlinearity-independent 2–3 multigrid cycles where Newton climbs to 30+
+iterations, **including on an adapted (coordinate-deformed) mesh** — the actual
+production target. The two remaining pieces before it is production-ready are
+scalable saddle-point smoothers (the spike used direct LU per level) and a
+per-level pressure nullspace for enclosed/free-slip problems.
+
+This note records the spike (scalar `SNES_Scalar` / `Poisson`, static meshes) that
+established feasibility, and what remains before FAS could be offered as a
+production feature or extended to Stokes.
+
+## Background: why FAS, and the architectural question
+
+The geometric **FMG** preconditioner landed in #231 gives *linear* multigrid that
+is robust to mesh anisotropy. But nonlinear robustness still rests on Newton
+(`newtonls`) / Picard, whose convergence basin shrinks as nonlinearity sharpens
+(temperature-dependent viscosity, viscoplastic yield, Richards fronts). `SNESFAS`
+performs coarse-grid correction on the **nonlinear residual itself**, which is the
+classic lever for global robustness.
+
+The open question was whether UW3 could feed FAS a genuine nonlinear residual on
+every coarse level. The FMG note (`docs/advanced/multigrid-preconditioning.md`)
+states that UW3 "does not install residual/Jacobian callbacks on the coarse DMs,
+so the coarse operators must be formed by Galerkin projection (RAP)". That is true
+for the **linear `PCMG`** path — but it turns out **not** to block FAS.
+
+## Key finding: FAS works with zero code changes
+
+In `petsc_generic_snes_solvers.pyx` each solver `_build`:
+
+1. copies the `PetscDS` (which holds the JIT pointwise residual/Jacobian function
+   pointers) to **every** coarse DM — `self.dm.copyDS(coarse_dm)` in a loop over
+   `self.dm_hierarchy`; and
+2. installs the SNES local-FEM callback on the **fine** DM only —
+   `UW_DMPlexSetSNESLocalFEM(cdm.dm, ...)`.
+
+When `snes_type = fas`, PETSc's `SNESSetUp_FAS` walks the coarse-DM chain (already
+linked by `setCoarseDM` when the mesh is built with `refinement=N`) and
+**propagates the fine DM's `DMSNES` local-FEM callback down to each level** via
+`DMCopyDMSNES`. Combined with the already-copied DS, every level can evaluate its
+own nonlinear residual and Jacobian. `-snes_view` confirms it:
+
+```
+type: fas
+  type is FULL, levels=3, cycles=1
+  Not using Galerkin computed coarse grid function evaluation   <-- genuine re-discretisation
+  Coarse grid solver -- level 0:  newtonls + LU,   rows=445
+  Down/Up smoother on level 1:    newtonls,        rows=1857
+  Down/Up smoother on level 2:    newtonls (fine)
+```
+
+So the spike's anticipated "install callbacks on coarse DMs" change was
+**unnecessary**. FAS is a `petsc_options`-only capability today.
+
+## Results (scalar, static mesh)
+
+Two testbeds on the unit square, P2 elements, mesh built with `refinement`:
+
+- **Bratu** `-Δu = λ e^u`, `u=0` on `∂Ω` (lower branch exists for `λ < λ_c ≈ 6.808`).
+- **Exponential nonlinear diffusion** `-∇·(e^{βu}∇u) = f`, manufactured solution
+  `u = sin πx · sin πy` so a solution provably exists at every β (a stiff Newton
+  stressor as β grows).
+
+Configurations: `newton` (`newtonls`), `fas` (`snes_type=fas`, FULL cycle,
+`newtonls` smoothers, LU coarse), `ngmres+fas` (NGMRES outer with FAS as the
+nonlinear preconditioner, `-npc_snes_type fas`).
+
+### Correctness and mesh-independence (Bratu, λ=5)
+
+| refinement | levels | Newton its | FAS its (F-cycles) | ‖u_FAS − u_Newton‖ |
+|---|---|---|---|---|
+| 1 | 2 | 4 | **1** | 7e-15 |
+| 2 | 3 | 4 | **1** | 2e-15 |
+| 3 | 4 | 4 | **1** | 2e-15 |
+
+FAS converges in a single F-cycle independent of refinement, to the same solution
+as Newton. Across `λ = 3…6.8`, `fas` and `ngmres+fas` agree with Newton to
+1e-9…1e-15. Both Newton and FAS hit the **same** envelope at the fold
+(`λ ≳ 6.85`, where no solution exists) — Bratu's lower branch is reachable by
+Newton from a cold start, so Bratu shows FAS *works*, not that it is *more robust*.
+
+### Robustness advantage (exponential diffusion, MMS)
+
+| β | Newton | FAS | NGMRES+FAS | notes |
+|---|---|---|---|---|
+| 1 | 16 its ✓ | **1** ✓ | 1 ✓ | L2 ≈ 2e-7 all |
+| 2 | 26 its ✓ | **1** ✓ | 1 ✓ | |
+| 3 | **diverges** (line search) | **1 ✓** | **1 ✓** | solution exists — FAS finds it (L2 2.2e-7), Newton cannot |
+| 4 | diverges | diverges | diverges | default smoothers insufficient |
+| 5 | diverges | 80 (stalls) | 27 (fails) | |
+
+This is the headline: at β=3 a solution provably exists and **FAS converges in one
+cycle where Newton diverges from the cold start**. FAS is not a silver bullet —
+β ≥ 4 needs stronger smoothers/continuation than the spike's defaults.
+
+### Wall-clock (exponential diffusion)
+
+| β | refinement | Newton | FAS | speed-up |
+|---|---|---|---|---|
+| 1 | 2 | 1.86 s (16) | 1.10 s (1) | 1.7× |
+| 2 | 2 | 2.59 s (26) | 1.14 s (1) | 2.3× |
+| 1 | 3 | 6.45 s (16) | 5.35 s (1) | 1.2× |
+| 2 | 3 | 10.6 s (26) | 5.56 s (1) | 1.9× |
+
+FAS is faster as well as more robust when the nonlinearity is strong (Newton needs
+many steps). On mild problems where Newton converges in 3–4 steps, FAS's
+per-cycle cost makes it merely competitive, not faster.
+
+### Parallel
+
+np=2 with a parallel-safe coarse solve (`fas_coarse_pc_type=redundant`,
+`fas_coarse_redundant_pc_type=lu`) converges in 1 cycle, `L2=2.06e-7` vs np=1's
+`2.07e-7`. The scalar parallel gate is clear.
+
+### Adaptation — FAS on a coordinate-deformed mesh
+
+The headline risk in the first cut of this note was that, because the movers
+update only the fine DM's coordinates, FAS's coarse(undeformed) → fine(deformed)
+inter-grid transfers might be "too weak" to converge. **Tested and refuted for the
+scalar case.** A refined mesh (3 levels) was deformed with `follow_metric`
+(tanh-front metric) at increasing strength, then the exponential-diffusion MMS was
+solved on the deformed mesh:
+
+| follow R | q_min | coarse moved | fine moved | Newton | FAS | FAS L2 |
+|---|---|---|---|---|---|---|
+| — (static) | 0.89 | — | — | 26 ✓ | **1** ✓ | 2.1e-7 |
+| 1.5 | 0.37 | **0** | 0.21 | 24 ✓ | **2** ✓ | 4.2e-7 |
+| 2.0 | 0.32 | **0** | 0.35 | 24 ✓ | **2** ✓ | 6.3e-7 |
+| 2.5 | 0.34 | **0** | 0.41 | 25 ✓ | **2** ✓ | 5.7e-7 |
+
+The mismatch is real ("coarse moved" = 0 confirms coarse DMs keep original
+geometry while the fine DM deforms), yet FAS degrades only from 1 to **2 F-cycles**
+and never loses correctness (L2 tracks Newton exactly). Pushed harder — into the
+β=3 regime where Newton **diverges** on a uniform mesh — FAS on the *deformed* mesh
+still converges in 2 cycles and stays correct:
+
+| β | follow R | q_min | Newton | FAS | FAS L2 |
+|---|---|---|---|---|---|
+| 3 | 1.5 | 0.37 | **diverges** | **2 ✓** | 4.3e-7 |
+| 3 | 2.0 | 0.32 | **diverges** | **2 ✓** | 6.7e-7 |
+| 2 | 3.0 | 0.32 | 25 ✓ | 2 ✓ | 5.3e-7 |
+| 3 | 3.0 | 0.32 | **diverges** | **2 ✓** | 5.9e-7 |
+
+Why it works despite the mismatch: PETSc builds the transfers from the parent→child
+*refinement* relationship (reference-element shape functions), not from a match of
+physical coarse/fine geometry, and FAS only needs the coarse step to be a useful
+*correction* — the fine smoother removes whatever the geometrically-imperfect
+coarse correction leaves behind. Coarse-coordinate propagation would likely recover
+the single-cycle count, but is **not required for convergence** on the scalar case.
+
+## How to use it today (no code required)
+
+```python
+mesh = uw.meshing.UnstructuredSimplexBox(..., refinement=2)   # builds the hierarchy
+poisson = uw.systems.Poisson(mesh, u_Field=u)
+# ... constitutive model, nonlinear f, BCs ...
+
+po = poisson.petsc_options
+po["snes_type"] = "fas"
+po["snes_fas_type"] = "full"            # FMG-style F-cycle
+po["fas_levels_snes_type"] = "newtonls" # per-level nonlinear smoother
+po["fas_levels_snes_max_it"] = 4
+po["fas_levels_snes_linesearch_type"] = "basic"
+po["fas_coarse_snes_type"] = "newtonls" # coarse nonlinear solve
+po["fas_coarse_ksp_type"] = "preonly"
+po["fas_coarse_pc_type"] = "lu"         # parallel: "redundant" + redundant_pc_type "lu"
+poisson.solve(zero_init_guess=True)
+
+# Or FAS as a nonlinear preconditioner to a robust outer accelerator:
+#   po["snes_type"] = "ngmres";  po["npc_snes_type"] = "fas";  po["npc_fas_*"] = ...
+```
+
+## Stokes (saddle-point) — the production target
+
+Everything above is scalar. The real goal is **nonlinear Stokes with adaptation**.
+The spike carried the result all the way there.
+
+**Plumbing (linear Stokes).** FAS runs on the velocity–pressure saddle-point system
+with no code changes. Using a monolithic LU smoother on each level (newtonls +
+`preonly`/`lu`) on a constant-viscosity box with an **open top** (traction-free, so
+*no* constant-pressure nullspace — the LU level solves stay non-singular), FAS
+reaches the same velocity field as the default fieldsplit+FMG solve (rel.diff 4e-5,
+1 cycle). So coarse residual, inter-grid transfers, and coarse correction all work
+on a saddle system. The Stokes `solve()` reads `snes_type` from
+`self.snes.getType()` (set by `petsc_options` during `_build`) and re-applies it, so
+`snes_type=fas` survives on the standard path (`picard=0`, `zero_init_guess=True`).
+
+**Nonlinear Stokes — the win.** A *smooth* shear-thinning power-law viscosity
+`η = η₀ (ε_II/ε_ref)^(1/n − 1)` (n=1 linear, larger n more nonlinear), open top:
+
+| n | Newton its | FAS cycles | same soln |
+|---|---|---|---|
+| 1 | 1 | 1 | 4e-6 |
+| 2 | 14 | **2** | 2e-4 |
+| 3 | 22 | **2** | 3e-4 |
+| 4 | 28 | **3** | 3e-4 |
+| 5 | 30 | **3** | 2e-4 |
+
+FAS holds **2–3 cycles while Newton climbs 14 → 30** as the nonlinearity sharpens —
+the same nonlinearity-independence seen in the scalar exponential-diffusion case,
+now on the saddle-point system, converging to the same solution.
+
+**Wall-clock — is the complexity worth it?** Yes, even with the un-optimised LU
+smoother. FAS (monolithic LU per level) vs Newton (default fieldsplit + FMG):
+
+| n | refine | levels | Newton | FAS | speed-up |
+|---|---|---|---|---|---|
+| 3 | 2 | 3 | 15.0 s | 9.1 s | 1.66× |
+| 5 | 2 | 3 | 19.9 s | 13.9 s | 1.44× |
+| 3 | 3 | 4 | 72.5 s | 38.6 s | 1.88× |
+| 5 | 3 | 4 | 95.3 s | 59.9 s | 1.59× |
+
+FAS is **1.4–1.9× faster** despite paying for a direct solve on every level, and the
+gap *widens* with mesh size (1.66→1.88× from ref2→ref3 at n=3) because the cycle
+count stays flat while Newton's iteration count does not.
+
+**But the per-level smoother is the crux, and it is not a `petsc_options` swap.**
+`-snes_view` confirms the smoother on *every* level (including the fine, 17505×17505)
+is `newtonls`+`preonly`+`lu` — a full direct factorization, ~6 of them per cycle on
+the fine grid. That does not scale (2-D sparse LU is ≈ O(N^1.5)). The obvious "fix"
+— a fieldsplit-Schur smoother on the levels, LU only on the coarse grid — was tried
+and is **9–17× slower**, not faster:
+
+| n | refine | LU smoother | fieldsplit-Schur smoother |
+|---|---|---|---|
+| 3 | 2 | 9.0 s | 81 s |
+| 3 | 3 | 37.6 s | 657 s |
+
+Both converge in 2–3 cycles to the same solution, but a Schur fieldsplit is itself
+an approximate Stokes *solve*, and as a smoother it is invoked many times (≈ 2 Newton
+steps × down+up × per level × per cycle) — so each relaxation does a near-complete
+Stokes solve. At these sizes a single direct LU is cheaper. The genuine requirement
+is an **inexpensive saddle-point smoother** — Vanka (element/patch block), Braess–
+Sarazin, or a distributive/Uzawa relaxation — none of which is a stock PETSc
+`petsc_options` choice for a DMPlex FE Stokes operator. **This, not the nullspace,
+is the real research/engineering cost of production Stokes FAS.** Until it exists,
+LU-smoothed FAS is a correct and (at moderate size) faster method whose cost is
+dominated by the fine-grid direct solve.
+
+**Nonlinear Stokes ON AN ADAPTED MESH — the target.** Power-law Stokes (n=3) with
+the mesh deformed by `follow_metric` toward the buoyancy feature (coarse DMs keep
+their original geometry — the same mismatch the scalar case shrugged off):
+
+| follow R | q_min | Newton its | FAS cycles | same soln |
+|---|---|---|---|---|
+| — (static) | 0.89 | 22 | 2 | 0.0931≈0.0932 |
+| 1.5 | 0.31 | 22 | **2** | 2.8e-4 |
+| 2.0 | 0.32 | 22 | **3** | 2.7e-4 |
+| 2.5 | 0.36 | 23 | **3** | 1.8e-4 |
+
+FAS keeps its 2–3 cycle convergence on the deformed saddle-point mesh, same answer
+every time. **Feasibility for nonlinear-Stokes-with-adaptation is established.**
+
+**Decomposing the benefit: globalization vs convergence rate.** FAS delivers two
+distinct things — (a) it drops the fine grid into Newton's quadratic basin (the
+coarse→fine ramp-up / *globalization*), and (b) a genuine nonlinear-multigrid
+*convergence rate* via repeated coarse correction. The globalization piece can be
+had *cheaply*, reusing the existing Newton + linear-FMG machinery with no saddle
+smoother, by **nested iteration / grid sequencing**: solve coarse, interpolate,
+warm-start fine. Two ways to get it, and how much it buys:
+
+- PETSc `-snes_grid_sequence` **does not work** on UW3 meshes via `petsc_options`:
+  it triggers `DMPlexComputeInterpolatorGeneral` (PETSc err 56) rather than the
+  pre-chained nested hierarchy that FAS reuses — so, unlike FAS, it is *not*
+  plug-and-play.
+- **Manual nested iteration at the UW3 level** (coarse solve → `uw.function.evaluate`
+  onto the fine field → warm-started fine solve) *does* work, and drops the fine
+  Newton count from 22→14 (n=3) and 29→19 (n=5) — but only ~1.1–1.2× wall-clock.
+
+So globalization alone is a *modest* win: it fixes the initial guess but not the
+per-level convergence rate, and the coarse solve still pays the full nonlinear cost.
+FAS's larger 1.5–1.9× comes from (b), the multigrid convergence rate — which is
+exactly the part that needs the inexpensive saddle smoother. Practical reading: for
+*robustness* on hard nonlinear Stokes, Picard warm-up (already in UW3) or manual
+nested iteration is cheap and reuses FMG; for *speed*, FAS is the lever but only
+once a real saddle smoother exists.
+
+**Where Stokes FAS does *not* help: non-smooth yielding.** A von Mises viscoplastic
+viscosity `η = min(η₀, τ_y/2ε_II)` is hard for *every* solver: once the domain
+yields (τ_y ≲ 1 here) Newton, Picard **and** FAS all fail. The `min()` kink is
+non-smooth and, worse for FAS, the coarse level yields on a different pattern than
+the fine, so the coarse correction is inconsistent. This is the known
+regularization/continuation regime, not a FAS-specific failure — the same shape as
+the scalar β≥4 ceiling.
+
+**Two engineering follow-ons before production Stokes FAS:**
+
+1. **An inexpensive saddle-point smoother (the hard part — feasible, but real work).**
+   The spike's per-level smoother is a monolithic direct LU — fine for proving the
+   algorithm (the 2–3 cycle count is the real result), but it does not scale.
+   Replacements tried, all on power-law n=3 at ref2/ref3:
+
+   | smoother | result |
+   |---|---|
+   | monolithic LU (baseline) | converges, 2 cycles, 11 s / 38 s — but LU does not scale |
+   | fieldsplit-Schur, heavy (gmres 20) | converges but **9–17× slower** (a Schur solve per relaxation) |
+   | fieldsplit-Schur, light (gmres 1–2) | **zero pivot** → `DIVERGED_INNER` (−7) |
+   | PCPATCH **Vanka** | **zero pivot** on setup (singular local patches) |
+
+   Every cheap attempt dies on the same rock: **"Zero pivot in LU factorization."**
+   The pressure block has a zero diagonal, so naive relaxations (SOR / ILU / a plain
+   sub-LU) divide by zero. That is the defining saddle-point difficulty, and exactly
+   what Vanka (solve each local velocity+pressure patch as a coupled block) and
+   Braess–Sarazin (a specific approximate-Schur relaxation) are constructed to avoid.
+   So the smoother is **feasible — Vanka via PETSc `PCPATCH` is the standard tool and
+   is used in production for Stokes geometric MG (e.g. Firedrake)** — but wiring it
+   into UW3 is a real task: correct `PCPATCH` patch construction so the local saddles
+   are non-singular, plus UW3-side exposure of the DM fields/topology to the patch PC
+   (analogous to `_setup_block_fieldsplit_options`). It is **not** a `petsc_options`
+   one-liner, which is what makes it the genuine engineering cost of scalable Stokes
+   FAS.
+2. **Per-level pressure nullspace.** The spike used an open-top problem to avoid the
+   constant-pressure nullspace. `_attach_stokes_nullspace()` sets the nullspace on
+   the *outer* matrix only; FAS builds each level's matrix internally. Enclosed /
+   free-slip problems will need the nullspace registered at the DM/DS level (e.g.
+   `DMSetNullSpaceConstructor` on the pressure field) so every FAS level inherits
+   it. This is the one place a small UW3 code change is likely required.
+
+## Limitations and open questions
+
+- **Smoother/coarse tuning matters.** Default `newtonls` smoothers handle moderate
+  nonlinearity; very stiff regimes need stronger smoothers, more pre/post sweeps,
+  W-cycles, or continuation. This is normal FAS practice, not a UW3 limitation.
+- **Coarse geometry under movers — TESTED, benign for scalar.** The movers update
+  only the fine DM's coordinates; coarse DMs keep their original geometry (verified:
+  "coarse moved" = 0). This was the headline risk, but on the scalar case FAS only
+  slows from 1 to 2 F-cycles under deformation strong enough to drop q_min to 0.32,
+  stays correct, and still converges where Newton diverges (see the adaptation
+  tables above). Coarse-coordinate propagation down the hierarchy would likely
+  restore the single-cycle count but is **not a prerequisite for convergence**. Whether
+  this still holds for the much larger anisotropy of an adapted Stokes problem is
+  the next thing to confirm.
+- **Stokes saddle-point FAS — separate, harder.** FAS smoothers on the
+  velocity-pressure system need a fieldsplit smoother and a constant-pressure
+  nullspace on every level. Out of scope here; revisit only after the scalar path
+  and the coarse-geometry question are settled.
+
+## Recommendation
+
+1. **Offer FAS as an opt-in for scalar (and likely vector) nonlinear solves.** It
+   needs no new code — only a documented option bundle, or a thin
+   `nonlinear_preconditioner` / `snes_type='fas'` convenience knob mirroring the
+   FMG `preconditioner` property (default off; FMG/Newton defaults untouched).
+   The exact deliverable shape is deferred pending these results.
+2. **Validate on Richards** (`docs/beginner/tutorials/16/17`), the strongest
+   real-world scalar FAS target, before committing to an API.
+3. **Nonlinear-Stokes-with-adaptation is demonstrated** (2–3 cycles vs Newton's
+   ~22, on a deformed mesh, same solution; 1.4–1.9× faster with LU smoothers).
+   Production scalability hinges on an **inexpensive saddle-point smoother** (Vanka /
+   Braess–Sarazin / distributive) — *not* a stock fieldsplit, which is 9–17× slower
+   as a smoother. A **per-level pressure nullspace** is the smaller second piece
+   (enclosed/free-slip). Coarse-coordinate propagation stays an optimisation; the
+   non-smooth viscoplastic regime is out of scope (hard for all solvers).
+
+## Reproduction
+
+Spike scripts (not part of the repo; under `~/+Simulations/snesfas_spike/`):
+`bratu_baseline.py`, `fas_probe.py`, `fas_measure.py`, `expdiff_mms.py`,
+`adapt_fas.py` (FAS on a `follow_metric`-deformed mesh), and the Stokes set:
+`stokes_fas.py` (linear plumbing), `stokes_fas_nl.py` (viscoplastic — hard for
+all), `stokes_fas_powerlaw.py` (smooth nonlinear win), `stokes_fas_adapt.py`
+(nonlinear Stokes on an adapted mesh).
+Worktree `feature/snesfas-spike` off `origin/development` — **no source files
+changed** (`git status` on `src/` is clean); the only artifact in-repo is this
+note. Regression sanity: `tests/test_1014_stokes_multigrid.py` +
+`tests/test_1000_poissonCart.py` → 18 passed.
diff --git a/docs/developer/design/snesfas-vanka-feasibility-study.md b/docs/developer/design/snesfas-vanka-feasibility-study.md
new file mode 100644
index 00000000..37dee221
--- /dev/null
+++ b/docs/developer/design/snesfas-vanka-feasibility-study.md
@@ -0,0 +1,337 @@
+---
+title: "Feasibility study: a scalable saddle-point smoother for Stokes SNESFAS"
+---
+
+# Feasibility study — scalable saddle smoother for Stokes FAS
+
+**Bottom line: GO — demonstrated with a working, mesh-independent prototype.
+[Revised 2026-06-15 after expert input + experiments — supersedes the earlier
+NO-GO.]** Vanka works on UW3's simplex Taylor-Hood Stokes; the earlier failure was
+PETSc's *stock* `-pc_patch_construct_type vanka`/`star` heuristic mis-constructing
+patches for continuous-P1 simplices, **not** Vanka. With patches built as *the
+support of each pressure basis function* (one pressure DOF + its B-coupled velocity
+DOFs), driven by **`PCASM` with custom index sets** and exact local LU mini-Stokes
+solves, and used as the smoother of a geometric multigrid on the full saddle, the
+solver is **mesh-independent: 5 → 5 → 6 outer FGMRES iterations across a 16× growth
+in DOFs (1.2k → 19k)**. The one non-obvious ingredient: the smoother must be a
+**Krylov (GMRES) smoother** wrapping the Vanka PC — an undamped Richardson smoother
+amplifies the additive-Schwarz spectrum and diverges. So the scalable-smoother piece
+is **done in prototype**. **Caveat on the payoff:** the iteration count is
+mesh-independent, but at moderate **2-D** sizes it is *not* faster in wall-clock than
+a sparse direct solve / fieldsplit+FMG (2-D direct solves are cheap); Vanka's
+wall-clock advantage is asymptotic in 2-D (~10⁵–10⁶ DOFs) and real in **3-D and
+large-parallel** runs, where direct solves scale badly. So this is the tool for
+big/3-D/parallel Stokes, not a speed-up for typical 2-D problems. What remains is
+productionisation (wrap in FAS — already proven with an LU smoother; per-level
+pressure nullspace; parallel patch construction; a UW3 API). Until that lands, FMG +
+Newton/Picard/nested ships and LU-smoothed FAS is the moderate-size nonlinear option.
+
+> The detailed reasoning below that concluded NO-GO was **correct about stock
+> `PCPATCH` but wrong about Vanka in general**. Read it as "stock PCPATCH
+> constructions are the wrong tool", then jump to *Correction* near the end for the
+> working approach and the revised path forward.
+
+## Objective
+
+Decide — *before* a major rewrite — whether a production-scalable Stokes FAS is
+achievable, by answering whether a cheap saddle-point smoother can be made to work
+in UW3.
+
+## Method
+
+Staged, cheapest-fatal-risk-first, with the key move of **decoupling the smoother
+from FAS**: test each smoother candidate as a plain *linear* preconditioner on a
+constant-viscosity UW3 Stokes solve first (a `pc_type` swap), since FAS itself is
+already proven. Option sets were lifted from PETSc's own CI-tested example
+`src/snes/tutorials/ex62.c`.
+
+## Findings
+
+### Stage 0 — the known-good Vanka recipe exists (and my first attempt was mis-configured)
+
+`ex62.c`'s `2d_q1_p0_gmg_vanka` test gives the authoritative incantation:
+`-pc_type patch -pc_patch_partition_of_unity 0 -pc_patch_construct_codim 0
+-pc_patch_construct_type vanka -..._sub_pc_type lu -mg_coarse_pc_type svd`. Note:
+**every passing Vanka test in ex62 uses `dm_plex_simplex 0` (quads), Q1–P0
+(discontinuous pressure)** — there is no CI-tested Vanka for simplex Taylor-Hood.
+
+### Stage 1 — Vanka does not work on UW3's Stokes (the crux, R2)
+
+PCPATCH Vanka as the outer linear PC on a constant-viscosity UW3 Stokes solve, with
+the corrected recipe, fails **uniformly**:
+
+| variant | result |
+|---|---|
+| P1-continuous, sub LU | **zero pivot** (singular patch) |
+| P0/P1-discontinuous, sub LU | **zero pivot** |
+| `+ pc_use_amat` (true saddle Jac, not UW3's Schur Pmat) | **zero pivot** |
+| sub `svd` (tolerates singular patches), P1-cont | builds, but **no convergence** (400 its, no progress) |
+| sub `svd`, P0-disc | builds, **crawls** to 0.00274 vs 0.00319 ref, 427 its, no convergence |
+| **quad** mesh (ex62's Vanka element family), all of the above | **identical** failure pattern |
+
+So the failure is **not** element-type (simplex vs quad), pressure continuity, or
+sub-solver. The patches are genuinely singular (SVD removes the pivot but leaves an
+ineffective preconditioner). The obstruction is in how `PCPATCH` constructs/assembles
+the local patch operators (and their boundary conditions / local nullspaces) from
+UW3's DM/DS — the same interface Firedrake feeds correctly when it runs DMPlex Stokes
+Vanka. UW3 confirmed to build a **separate Schur-structured Pmat** (`1/μ` pressure
+mass; `petsc_generic_snes_solvers.pyx:4381`), which is part of why the default path
+hands Vanka the wrong operator, but `pc_use_amat` did not rescue it — the problem is
+deeper than operator selection.
+
+### Braess–Sarazin hedge — works as a PC, too weak as a smoother
+
+A *correct* Braess–Sarazin (diagonal velocity block `fieldsplit_velocity_pc_type=jacobi`
++ `selfp` Schur on the true operator via `pc_use_amat`) **converges as an outer PC**
+(reason 3, exact reference solution) — options-only, no patches, no zero pivot. But
+as a **FAS level smoother it fails** (`DIVERGED_INNER`): it needs ~63 iterations to
+converge as a PC, i.e. it is a *weak* relaxation, so a few sweeps per smooth don't
+reduce error. A stronger variant (SOR velocity + full Schur) converged at ref2 (5
+cycles) but was **22× slower than LU and failed at ref3** — unreliable and expensive.
+
+### The smoother-candidate matrix
+
+| smoother | cheap? | effective? | scalable? | status in UW3 |
+|---|---|---|---|---|
+| monolithic **LU** | no | yes | no | works; fastest at moderate size; O(N^1.5) wall |
+| **Vanka** (PCPATCH) | yes | yes | yes | **blocked** — singular/ineffective patches |
+| **Braess–Sarazin** (diag) | yes | no (weak) | yes | works as PC, fails as smoother |
+| BS + SOR velocity | ~ | unreliable | ? | 22× slower, fails at ref3 |
+| heavy fieldsplit-Schur | no | yes | ~ | 9–17× slower than LU |
+
+No candidate is simultaneously cheap, effective, and configurable. The only effective
+smoothers are expensive (LU, heavy fieldsplit); the only cheap one (BS) is too weak.
+
+## De-risking step 1 — DONE: simplex Vanka fails in clean PETSc too
+
+The proposed de-risk was to test Vanka *outside* UW3. Rather than re-write a Stokes
+solver in petsc4py, the cleanest reference is PETSc's own `ex62.c` (DMPlex Stokes,
+the source of the recipe). Compiled it (this build has **no 2-D simplex generator** —
+`--download-triangle` absent — so a gmsh unit-square `.msh` was loaded via
+`-dm_plex_filename ... -dm_plex_boundary_label marker`; UW3 itself only gets simplices
+through gmsh for the same reason). Results, **pure PETSc, no UW3**:
+
+| element / construction | result |
+|---|---|
+| **quad** Q1–P0, `vanka` (the CI-tested recipe) | ✅ converges, 59 KSP its |
+| simplex P2–P1, full LU **and** fieldsplit-Schur (controls) | ✅ converge — mesh/BC are correct |
+| simplex P2–P1, `vanka` patches (sub lu / svd) | ❌ KSP **stalls at the initial residual** (29.46 flat for 200 its) |
+| simplex P2–P0, `vanka` patches | ❌ diverges |
+| simplex P2–P1 / P2–P0, `star` (vertex) patches | ❌ diverges |
+
+So Vanka works on quads and the same binary solves the simplex mesh fine with
+standard solvers, but **every patch-smoother construction is ineffective on simplex
+Taylor-Hood** — the KSP makes *zero* progress. This is a PETSc/numerical-methods
+fact, independent of UW3. (The earlier UW3 zero-pivot was just the first symptom of
+the same underlying problem; "fixing UW3's DM/DS↔PCPATCH interface" would **not** have
+helped, since clean PETSc fails identically.)
+
+**Why:** stock `PCPATCH` Vanka/star patches are built for low-order, quad/structured,
+discontinuous-pressure elements. Effective patch smoothing for simplex Stokes needs
+specialist constructions from the literature — typically a patch-smoothable stable
+element (e.g. **Scott–Vogelius on barycentrically-refined/Alfeld meshes**) plus
+**macro-element patches** (the Farrell–Mitchell–Wechsung programme). That is a change
+of *element and mesh*, not a smoother option.
+
+## Correction (2026-06-15): custom-IS PCASM Vanka works — the NO-GO was wrong
+
+External expert input reframed Vanka as *"a specialised overlapping-Schwarz
+preconditioner whose patches are defined by the pressure space, not by geometry"*:
+loop over pressure DOFs, use the FE sparsity of the divergence block **B** to gather
+the coupled velocity DOFs, extract the local mixed matrix, factor it, and run
+additive/multiplicative Schwarz — i.e. **`PCASM`/`PCGASM` with custom index sets**,
+the patches being the *support of each pressure basis function*. This is standard for
+Taylor-Hood, MINI, Scott–Vogelius, variable-viscosity convection, etc. The stock
+`PCPATCH` `vanka`/`star` constructs I tested are *not* this — they apply a fixed
+topological heuristic that misfires for continuous-P1 simplices.
+
+Tested directly on UW3's true saddle Jacobian (open-top, no pressure nullspace):
+
+```
+saddle 1110×1110, fields [velocity, pressure], n_vel=968 n_pres=142
+patches = 142 (one per pressure DOF), size 11–45 (mean 32.5)
+PCASM-Vanka (RAS and additive): CONVERGED, 312 fgmres its, rel_err 2.2e-6
+```
+
+One-level iteration count vs mesh size — the smoother signature:
+
+| cellSize | ndof | n_pres | 1-level Vanka its |
+|---|---|---|---|
+| 0.15 | 555 | 75 | 147 |
+| 0.10 | 1110 | 142 | 312 |
+| 0.07 | 2479 | 303 | 688 |
+
+Iterations grow ~with 1/h (no coarse correction) — high-frequency error is removed,
+low-frequency is left for the MG coarse grid. That is precisely a multigrid smoother.
+So **the failure was the patch *construction*, not Vanka**, and the smoother exists
+for UW3's element.
+
+## The working recipe (prototype: `vanka_mg_WORKING.py`)
+
+A geometric multigrid on the **full** Stokes saddle, custom-IS Vanka smoother:
+
+1. **Hierarchy + Galerkin.** PCMG over UW3's `dm_hierarchy` (built by `refinement=N`),
+   interpolation per level from `DMCreateInterpolation`, coarse operators by Galerkin
+   `pc_mg_galerkin both` (the FMG path — UW3 has no coarse-DM callbacks). Block-diagonal
+   velocity/pressure interpolation keeps the coarse operators saddle-structured.
+2. **Per-level Vanka smoother.** For each level: from the level's field decomposition
+   + the operator's row sparsity, build one patch per pressure DOF = {pressure DOF} ∪
+   {B-coupled velocity DOFs}; install a `PCASM` (RESTRICT) with those index sets and
+   exact `sub_pc_type lu` mini-Stokes solves. (Set subdomains after a `PC.reset()` +
+   re-attach operator, since they must precede `PCSetUp`.)
+3. **Krylov smoother — the key.** Wrap the Vanka PC in **`ksp_type gmres`, ~6 its**
+   per level. Richardson (even damped) diverges because additive Schwarz has spectral
+   radius > 1; GMRES self-stabilises it. Outer solver is FGMRES (flexible, since the
+   smoother is now a Krylov method).
+4. **Coarse solve:** LU.
+
+Measured: **5 / 5 / 6 / 5 outer iterations at ndof 1.2k / 4.8k / 19k / 76k** —
+mesh-independent.
+
+### Timing — and the right competitor is FMG, not LU
+
+A direct (LU) solve is a *strawman* competitor: if a full-scale factorization is
+affordable you would just use it directly, not as a smoother — and in 3-D / parallel
+that option disappears (the sparse direct solver, MUMPS, scales ~O(N²) in 3-D and is
+unreliable at large core counts). The honest competitor is **FMG** (velocity-block
+geometric MG inside the Schur fieldsplit, #231).
+
+Linear 2-D Stokes, same problem (FMG numbers include assembly/JIT — not a clean
+linear-solve isolation; Vanka is linear-solve only):
+
+| | 1.2k | 4.8k | 19k | 76k | iters |
+|---|---|---|---|---|---|
+| **FMG** total solve | (JIT) | ~1.7s | ~1.8s | ~4.7s | outer Schur = 1 |
+| **Vanka-MG** linear solve | 0.02s | 0.13s | 1.1s | 5.7s | 5–6 |
+| LU-smoother MG (≈ direct) | 0.01s | 0.07s | 0.54s | ~5s | 1 |
+
+For **linear 2-D Stokes the three are in the same ballpark** — FMG is mature and
+slightly ahead, the LU/direct path is cheap because 2-D fill-in is modest, and
+Vanka-MG is competitive but not a winner. So Vanka does **not** earn its keep on easy
+(linear, moderate-2-D) problems.
+
+**Where Vanka-MG wins — and the reason to build it:**
+
+- **3-D and large-parallel**, where the direct solves that FMG's coarse grid and the
+  LU paths lean on scale badly / become unreliable, while Vanka's local patch solves
+  scale ~O(N) and parallelise naturally. (3-D and parallel timing is the obvious next
+  measurement.)
+- **Strongly nonlinear / plastic rheology** — the decisive case. Fieldsplit-Schur+FMG
+  assumes a good Schur (pressure) approximation, which degrades when the viscosity
+  varies wildly (viscoplastic yield, thermal runaway). A **full-saddle Vanka-FAS**
+  smooths the coupled nonlinear system directly, and — the key point — the **coarse
+  problems are better conditioned** (plasticity localises; on coarse grids the yield
+  is smeared/milder), so the nonlinear coarse correction is especially effective.
+  This is exactly the regime where Newton+FMG needs many iterations or stalls
+  (cf. the viscoplastic results in `snesfas-feasibility.md`, where Newton, Picard
+  *and* LU-FAS all struggled).
+
+**Takeaway:** the value proposition is *not* "faster on linear 2-D" — there FMG is
+fine. It is **robust, scalable nonlinear Stokes in 3-D / parallel / plasticity**,
+where the fieldsplit-Schur assumptions and direct solves break down.
+
+### First data point — FMG vs FAS on hard benchmarks (`benchmark_fmg_vs_fas.py`)
+
+FMG (Newton + fieldsplit-Schur + velocity FMG, `saddle_preconditioner=1/η`) vs FAS
+(snes_type=fas, LU smoother — *not yet Vanka*), open-top so no nullspace,
+refinement=2:
+
+*SolCx viscosity step (linear), Δη = 1 → 10⁶:* both **converge at every contrast**.
+FMG is faster (1–2 outer iterations, 7–40 s) but its time *grows* with the jump
+(12 s → 40 s from 10⁴ → 10⁶ as the velocity-block MG degrades); FAS-LU is flat
+(~62 s, 2 cycles, robust but slow). So with the right Schur preconditioner FMG
+handles the discontinuous-viscosity benchmark well; FAS's edge only shows as the
+contrast becomes extreme.
+
+*Viscoplastic yield (nonlinear), τ_y = 10 → 0.25:* the nonlinear iteration count
+favours FAS — at τ_y = 1, **FAS converges in 3 nonlinear cycles where FMG+Newton
+needs 9 iterations** (the nonlinear coarse correction at work, as predicted). Below
+τ_y ≈ 0.5 *both* fail — the hard-yielding regime is continuation/regularisation
+territory for every solver (consistent with `snesfas-feasibility.md`). FAS is slower
+in wall-clock here only because it uses the LU smoother; the Vanka smoother is what
+would make the 3-vs-9 nonlinear-iteration advantage also a wall-clock win at scale.
+
+**Reading:** FMG is the right default for linear / moderate problems; FAS's
+robustness advantage is real but currently shows as *fewer nonlinear iterations* in
+the plastic regime, not yet as wall-clock (LU smoother). Wiring in the Vanka smoother
++ pushing to 3-D / extreme contrast is where the combination should pull clearly
+ahead. The very-hard yield regime needs continuation regardless of solver.
+
+### Three turnkey choices — FMG vs GAMG vs FAS-Vanka (`benchmark_3way.py`)
+
+With the Vanka smoother actually wired into FAS (custom-IS injection, GMRES Krylov
+smoother), the three production options compared on the same problems (open top,
+res 16 / refinement 1):
+
+*SolCx viscosity step (linear):*
+
+| Δη | FMG | GAMG | FAS-Vanka |
+|---|---|---|---|
+| 1 | 1 it, 3.9s | 2 it, 4.9s | 1 it, 3.0s |
+| 10³ | 1 it, 2.8s | 3 it, 14s | 1 it (gmres-15 smoother) |
+| 10⁶ | 2 it, 5.8s | 8 it, **99s** | **FAIL** |
+
+*Viscoplastic yield (nonlinear):*
+
+| τ_y | FMG | GAMG | FAS-Vanka |
+|---|---|---|---|
+| 10 | 1 it | 2 it | 1 it |
+| 1 | 10 it | 10 it | **2 it** |
+| 0.3 | FAIL | FAIL | FAIL |
+
+**What this says — and the honest "how much tuning":**
+
+- **FMG** is the **robust all-rounder and the right default.** Its
+  `saddle_preconditioner = 1/η` makes the Schur (pressure) approximation
+  viscosity-robust, so it sails through the 10⁶ contrast (5.8 s) where the others
+  struggle, and it is fastest almost everywhere. Needs a refinement hierarchy.
+- **GAMG** is the **no-hierarchy fallback** — robust but it *cliffs in cost* at high
+  contrast (99 s at 10⁶, 8 outer iterations). Use when there is no geometric
+  hierarchy.
+- **FAS-Vanka** is the **nonlinear / plasticity specialist.** It crushes the
+  viscoplastic case (**2 nonlinear iterations vs 10** for the others — the coarse
+  problem is better conditioned, exactly as expected) and handles moderate viscosity
+  contrast (10³) once the Krylov smoother is bumped to ~15 inner iterations. But with
+  *additive* PCASM Vanka it **fails at extreme contrast (10⁶)** no matter how hard you
+  smooth — that needs a **multiplicative** Vanka (PCGASM / a coloured patch sweep)
+  and/or viscosity-aware coarsening, which is genuine development, not a flag.
+
+**Bottom line for "2–3 good choices, not much tuning":** FMG (default) and GAMG
+(fallback) are the two robust, low-tuning options today. FAS-Vanka is a strong third
+**specifically for strongly nonlinear / plastic problems** (where it beats both on
+iteration count) and for 3-D / large-parallel; making it a *robust* turnkey peer of
+FMG across all regimes needs the multiplicative-Vanka smoother (the extreme-contrast
+gap) plus the productionisation items (per-level nullspace, parallel patches, UW3
+API). The plasticity win, though, is real and in hand now.
+
+## Path forward (GO — productionise the prototype)
+
+1. **Wrap in FAS for the nonlinear case.** The smoother is the hard part and it is now
+   solved; FAS adds the nonlinear coarse correction, already proven to work with an LU
+   smoother (`snesfas-feasibility.md`). Swap LU → the Vanka GMRES smoother above.
+2. **Per-level pressure nullspace** for enclosed / free-slip problems (the prototype
+   used an open top to avoid it). Register it at the DM/DS level so each level inherits
+   it.
+3. **Parallel patch construction** (PCASM is parallel; build the pressure-support index
+   sets rank-locally) and a **UW3 API** (a `smoother="vanka"` path that builds the IS
+   per level and installs the smoother — ≈ the prototype's ~40 lines, generalised).
+4. **Open optimisation:** whether `PCPATCH` can be configured to build these exact
+   pressure-support patches (would make it options-only, no custom-IS code).
+
+The earlier "research only / different element" conclusion is fully withdrawn: a
+mesh-independent Vanka multigrid for UW3's existing simplex Taylor-Hood Stokes is
+demonstrated; what remains is engineering.
+
+## Meanwhile
+
+`FMG + Picard / nested iteration` remains the production workhorse for nonlinear
+Stokes; LU-smoothed FAS is a correct, moderate-size-faster option where nonlinear
+robustness matters more than asymptotic scaling. Neither is blocked; both ship today.
+
+## Reproduction
+
+`~/+Simulations/snesfas_spike/`: `vanka_stage1.py` (UW3 Stage 1 matrix), `/tmp`-staged
+probes `vanka_quad.py`, `braess.py`, `bs_fas.py`, `bs_sor.py`. Clean-PETSc reference:
+`petsc-custom/petsc/src/snes/tutorials/ex62.c` (compiled in the `amr-dev` env; gmsh
+mesh `/tmp/square.msh` from `/tmp/mksquare.py`; driven with the `vanka`/`star` patch
+options above). PETSc `/*TEST*/` block `*_vanka` suffixes are the recipe source.
diff --git a/docs/examples/snesfas_investigation/README.md b/docs/examples/snesfas_investigation/README.md
new file mode 100644
index 00000000..8a3ac34c
--- /dev/null
+++ b/docs/examples/snesfas_investigation/README.md
@@ -0,0 +1,49 @@
+# SNESFAS / Vanka / grid-sequencing — research prototypes
+
+> **Status: research prototypes, not production code.** These scripts back the design
+> notes in `docs/developer/design/` (`snesfas-feasibility.md`,
+> `snesfas-vanka-feasibility-study.md`, `multilevel-nonlinear-stokes-strategy.md`).
+> They are preserved so the *working* multilevel-Stokes implementations are not lost
+> when the investigation is parked. They are **not tested, not tier-classified, and
+> make no `src/` changes** — everything here runs through `petsc_options` and
+> petsc4py on top of stock UW3. Wall-clock numbers in the docs are soft (JIT / noise);
+> trust the iteration counts.
+
+Run any of them with the worktree's environment, e.g.
+`pixi run -e <env> python docs/examples/snesfas_investigation/<script>.py`.
+All build their own meshes (with `refinement=` so a `dm_hierarchy` exists).
+
+## The working pieces
+
+| script | what it demonstrates |
+|---|---|
+| `vanka_asm.py` | **The core recipe.** Vanka as overlapping Schwarz with patches = *support of each pressure basis function* (one pressure DOF + its B-coupled velocity DOFs), built as custom `PCASM` index sets with exact local LU. Works on UW3's simplex P2–P1 where stock `PCPATCH` fails. |
+| `vanka_mg.py` | **Mesh-independent Vanka multigrid.** Geometric MG on the *full* Stokes saddle over UW3's `dm_hierarchy` (Galerkin coarse operators) with the custom-IS Vanka smoother + a **GMRES Krylov smoother** (additive Schwarz alone diverges). ~5–6 outer iterations independent of refinement. Also times vs FMG. |
+| `fas_vanka.py` | **Working SNESFAS-Vanka for nonlinear Stokes** (modest contrast). Drives `snes_type=fas`, injects the custom-IS Vanka smoother into each FAS level (warm-up solve to assemble level operators → inject → re-solve). This is the piece that gives **2 nonlinear iterations vs 10** for Newton+FMG on viscoplastic τ_y=1. |
+| `benchmark_3way.py` | **FMG vs GAMG vs FAS-Vanka** on SolCx viscosity contrast (linear) and viscoplastic yield (nonlinear). The head-to-head behind the "three choices" table. |
+| `notched_beam_cascade.py` | **Grid sequencing (nested iteration)** prototype: solve coarse → interpolate up (`uw.function.evaluate`, robust to non-nested meshes) → solve fine, FMG per level. Robust where PETSc's `-snes_grid_sequence` fails on UW3 meshes. |
+
+## Key recipes (so they aren't lost in the code)
+
+**Custom-IS Vanka patch** (per pressure DOF `pg`, from the assembled saddle Jacobian `J`):
+patch = `{pg}` ∪ `{velocity DOFs in J.getRow(pg)}`, fed to `PCASM` (RESTRICT) with
+`sub_ksp_type=preonly`, `sub_pc_type=lu`.
+
+**Vanka geometric MG** (linear): `pc_type=mg`, `pc_mg_galerkin=both`, per-level
+`ksp_type=gmres` (the Krylov smoother — *essential*; Richardson diverges) wrapping the
+custom-IS `PCASM`, `mg_coarse_pc_type=lu`/`svd`.
+
+**FAS-Vanka** (nonlinear): `snes_type=fas`, `snes_fas_type=full`; level smoothers are
+`newtonls` whose KSP is `gmres` (max_it ~15 for moderate contrast) + the custom-IS
+`PCASM`; coarse solve LU. The patch *structure* is sparsity-only, so it stays valid as
+the operator values change across nonlinear iterations.
+
+## Known limits (see the design notes)
+
+- Additive Vanka **fails at extreme viscosity contrast (≳10⁶)** — needs *multiplicative*
+  Vanka + interface-aware coarsening.
+- Per-level **pressure nullspace** for enclosed problems, **parallel** patch
+  construction, and a clean UW3 API are all unimplemented (the prototypes use open-top
+  problems to avoid the nullspace).
+- The cascade only helps when the per-level solve is iteration-limited; it is overhead
+  otherwise, so it should be opt-in.
diff --git a/docs/examples/snesfas_investigation/benchmark_3way.py b/docs/examples/snesfas_investigation/benchmark_3way.py
new file mode 100644
index 00000000..1a9625e4
--- /dev/null
+++ b/docs/examples/snesfas_investigation/benchmark_3way.py
@@ -0,0 +1,110 @@
+"""
+import os
+Three good choices: FMG vs GAMG vs FAS-Vanka on hard Stokes benchmarks.
+
+  FMG       : Newton + fieldsplit-Schur + velocity-block geometric MG (preconditioner="fmg")
+  GAMG      : Newton + fieldsplit-Schur + velocity-block algebraic MG (preconditioner="gamg")
+  FAS-Vanka : nonlinear multigrid on the full saddle, custom-IS PCASM Vanka smoother
+
+(1) SolCx viscosity step, eta_B = 1 .. 1e6   (linear, discontinuous viscosity)
+(2) Viscoplastic yield, tau_y = 10 .. 0.25   (strong nonlinearity)
+
+Open top (no pressure nullspace). refinement=2 => 3-level hierarchy. The point is
+robustness with stock settings — no per-problem tuning.
+
+Run:  pixi run -e amr-dev python benchmark_3way.py
+"""
+import time
+import importlib.util
+import numpy as np
+import sympy
+import underworld3 as uw
+from underworld3.function import analytic as A
+
+fv = importlib.util.module_from_spec(
+    importlib.util.spec_from_file_location("fv", os.path.join(os.path.dirname(os.path.abspath(__file__)), "fas_vanka.py")))
+importlib.util.spec_from_file_location("fv", os.path.join(os.path.dirname(os.path.abspath(__file__)), "fas_vanka.py")).loader.exec_module(fv)
+
+
+def make(res=16, refinement=1):
+    return uw.meshing.UnstructuredSimplexBox(
+        minCoords=(0, 0), maxCoords=(1, 1), cellSize=1.0 / res, refinement=refinement, qdegree=3)
+
+
+def open_top(st):
+    st.add_dirichlet_bc((0.0, None), "Left")
+    st.add_dirichlet_bc((0.0, None), "Right")
+    st.add_dirichlet_bc((0.0, 0.0), "Bottom")
+
+
+def solcx(mesh, eta_B):
+    sol = A.SolCx(mesh, eta_A=1.0, eta_B=eta_B, x_c=0.5, n=1)
+    st = uw.systems.Stokes(mesh)
+    st.constitutive_model = uw.constitutive_models.ViscousFlowModel
+    st.constitutive_model.Parameters.shear_viscosity_0 = sol.fn_viscosity
+    st.bodyforce = sol.fn_bodyforce
+    st._sp = 1.0 / sol.fn_viscosity
+    open_top(st)
+    return st
+
+
+def visc(mesh, tau_y):
+    x, y = mesh.X
+    st = uw.systems.Stokes(mesh)
+    st.constitutive_model = uw.constitutive_models.ViscoPlasticFlowModel
+    st.constitutive_model.Parameters.shear_viscosity_0 = 1.0
+    st.constitutive_model.Parameters.yield_stress = tau_y
+    st.constitutive_model.Parameters.strainrate_inv_II_min = 1.0e-3
+    st.bodyforce = sympy.Matrix([0.0, -50.0 * sympy.exp(-100.0 * ((x - 0.5) ** 2 + (y - 0.5) ** 2))])
+    st._sp = None
+    open_top(st)
+    return st
+
+
+def run(build, difficulty, method):
+    mesh = make()
+    st = build(mesh, difficulty)
+    if method in ("fmg", "gamg"):
+        st.petsc_options["snes_rtol"] = 1e-7
+        st.petsc_options["snes_max_it"] = 25
+        st.petsc_options["ksp_max_it"] = 300              # cap inner so failures bail fast
+        st.petsc_options["snes_error_if_not_converged"] = 0
+        st.preconditioner = method
+        if st._sp is not None:
+            st.saddle_preconditioner = st._sp
+        t0 = time.perf_counter()
+        try:
+            st.solve(zero_init_guess=True, picard=0)
+            r = int(st.snes.getConvergedReason()); its = int(st.snes.getIterationNumber())
+        except Exception:
+            r, its = -99, None
+        dt = time.perf_counter() - t0
+        out = dict(reason=r, its=its, t=dt)
+    else:  # fas-vanka
+        try:
+            out = fv.solve_fas_vanka(st, st.dm_hierarchy)
+        except Exception as e:  # noqa
+            out = dict(reason=-99, its=None, t=float("nan"))
+    del st, mesh
+    return out
+
+
+def cell(r):
+    ok = "ok  " if (r["reason"] or -1) > 0 else "FAIL"
+    return f"{ok} its={str(r['its']):>3} ({str(r['reason']):>3}) {r['t']:5.1f}s"
+
+
+print(f"underworld3: {uw.__file__}\n")
+print("=== (1) SolCx viscosity step (open top) — FMG vs GAMG vs FAS-Vanka ===")
+print(f"{'eta_B':>7} | {'FMG':>20} | {'GAMG':>20} | {'FAS-Vanka':>20}")
+print("-" * 78)
+for eta_B in [1.0, 1.0e3, 1.0e6]:
+    rows = {m: run(solcx, eta_B, m) for m in ("fmg", "gamg", "fas-vanka")}
+    print(f"{eta_B:>7.0e} | {cell(rows['fmg']):>20} | {cell(rows['gamg']):>20} | {cell(rows['fas-vanka']):>20}")
+
+print("\n=== (2) Viscoplastic yield (open top) — FMG vs GAMG vs FAS-Vanka ===")
+print(f"{'tau_y':>7} | {'FMG':>20} | {'GAMG':>20} | {'FAS-Vanka':>20}")
+print("-" * 78)
+for tau_y in [10.0, 1.0, 0.3]:
+    rows = {m: run(visc, tau_y, m) for m in ("fmg", "gamg", "fas-vanka")}
+    print(f"{tau_y:>7.2f} | {cell(rows['fmg']):>20} | {cell(rows['gamg']):>20} | {cell(rows['fas-vanka']):>20}")
diff --git a/docs/examples/snesfas_investigation/fas_vanka.py b/docs/examples/snesfas_investigation/fas_vanka.py
new file mode 100644
index 00000000..a6db8a4f
--- /dev/null
+++ b/docs/examples/snesfas_investigation/fas_vanka.py
@@ -0,0 +1,101 @@
+"""
+import os
+FAS-Vanka: nonlinear multigrid (SNESFAS) with the custom-IS PCASM Vanka smoother.
+
+Turnkey-via-options is not available (PCPATCH won't build pressure-support patches
+in this build), so we inject the custom index sets into each FAS level smoother:
+  1. solve once with an LU smoother (assembles every level operator),
+  2. for each level smoother SNES: build pressure-support patches from its operator,
+     install PCASM(RESTRICT)+sub-LU wrapped in a GMRES Krylov smoother,
+  3. re-solve, driving the SNES directly.
+
+The patch *structure* depends only on sparsity, so it stays valid as the operator
+values change across nonlinear iterations.
+"""
+import time
+import importlib.util
+import numpy as np
+import sympy
+import underworld3 as uw
+from petsc4py import PETSc
+
+_spec = importlib.util.spec_from_file_location("va", os.path.join(os.path.dirname(os.path.abspath(__file__)), "vanka_mg.py"))
+va = importlib.util.module_from_spec(_spec); _spec.loader.exec_module(va)
+
+FAS_LU = {
+    "snes_type": "fas", "snes_fas_type": "full",
+    "fas_levels_snes_type": "newtonls", "fas_levels_snes_max_it": 1,
+    "fas_levels_snes_linesearch_type": "basic",
+    "fas_levels_ksp_type": "preonly", "fas_levels_pc_type": "lu",
+    "fas_coarse_snes_type": "newtonls", "fas_coarse_snes_max_it": 30,
+    "fas_coarse_ksp_type": "preonly", "fas_coarse_pc_type": "lu",
+}
+
+
+def inject_vanka(snes, hier):
+    nlev = snes.getFASLevels()
+    opts = PETSc.Options()
+    npres_per = []
+    for L in range(1, nlev):                       # 0 = coarse solve (LU); smooth on 1..nlev-1
+        sm = snes.getFASSmoother(L)                # a SNES
+        ksp = sm.getKSP()
+        Jl = ksp.getOperators()[0]
+        dmL = sm.getDM()                            # the smoother's own level DM
+        patches, npres = va.pressure_support_patches(dmL, Jl)
+        npres_per.append(npres)
+        # GMRES Krylov smoother (stabilises additive-Schwarz Vanka). max_it~15 is a
+        # more robust default — it carries moderate viscosity contrast (~1e3); extreme
+        # contrast (>=1e6) needs multiplicative Vanka, which additive PCASM can't do.
+        ksp.setType("gmres")
+        ksp.setTolerances(rtol=1e-2, max_it=15)
+        pc = ksp.getPC()
+        pc.reset()
+        pc.setOperators(Jl, Jl)
+        pc.setType("asm")
+        pc.setASMType(PETSc.PC.ASMType.RESTRICT)
+        pc.setASMLocalSubdomains(len(patches), patches)
+        pref = pc.getOptionsPrefix() or ""
+        opts[pref + "sub_ksp_type"] = "preonly"
+        opts[pref + "sub_pc_type"] = "lu"
+        pc.setFromOptions()
+    return npres_per
+
+
+def solve_fas_vanka(stokes, hier, verbose=False):
+    po = stokes.petsc_options
+    po["snes_rtol"] = 1e-7
+    po["snes_error_if_not_converged"] = 0
+    for k, v in FAS_LU.items():
+        po[k] = v
+    # (1) warm-up FAS-LU solve — only needs to ASSEMBLE level operators, not converge
+    po["snes_max_it"] = 2
+    stokes.solve(zero_init_guess=True, picard=0)
+    snes = stokes.snes
+    po["snes_max_it"] = 60                          # real solve cap
+    snes.setTolerances(max_it=60)
+    # (2) inject Vanka smoothers
+    npres = inject_vanka(snes, hier)
+    # (3) re-solve directly with the Vanka smoothers
+    stokes.mesh.update_lvec()
+    stokes.dm.setAuxiliaryVec(stokes.mesh.lvec, None)
+    stokes._update_constants()
+    gvec = stokes.dm.getGlobalVec()
+    gvec.set(0.0)
+    t0 = time.perf_counter()
+    snes.solve(None, gvec)
+    dt = time.perf_counter() - t0
+    r = int(snes.getConvergedReason())
+    its = int(snes.getIterationNumber())
+    stokes.dm.restoreGlobalVec(gvec)
+    return dict(reason=r, its=its, t=dt, npres=npres)
+
+
+if __name__ == "__main__":
+    # validate on a simple linear blob (open top, no nullspace)
+    print(f"underworld3: {uw.__file__}")
+    for R in (1, 2, 3):
+        mesh, st, v, p = va.build_stokes(R)
+        res = solve_fas_vanka(st, st.dm_hierarchy)
+        print(f"  FAS-Vanka  refinement={R}  npres/level={res['npres']}  "
+              f"->  reason={res['reason']}  fas_its={res['its']}  t={res['t']:.2f}s")
+        del st, mesh
diff --git a/docs/examples/snesfas_investigation/notched_beam_cascade.py b/docs/examples/snesfas_investigation/notched_beam_cascade.py
new file mode 100644
index 00000000..cb2dd897
--- /dev/null
+++ b/docs/examples/snesfas_investigation/notched_beam_cascade.py
@@ -0,0 +1,102 @@
+"""
+Nested iteration (grid sequencing) on the hard case: viscoplastic compression with
+a weak seed -> localizing shear band, which is well-conditioned at low resolution
+and increasingly ill-conditioned as you refine (cold Newton convergence collapses).
+
+Tests the classical Brandt "coarse-as-initialization" idea (NOT FAS correction):
+  solve coarse (robust) -> interpolate (u,p) up as initial guess -> solve next, ...
+each level using a robust saddle-point solver (FMG). No tau-correction, no Vanka,
+no coarse-grid correction across the band -> the contrast is handled at the
+resolution where it is resolved.
+
+Compares, at each resolution:
+  COLD     : Newton+FMG from a zero initial guess (the current default)
+  CASCADE  : Newton+FMG warm-started from the interpolated coarser solution
+
+Pure-shear compression (enclosed; UW3 FMG handles the pressure nullspace).
+Run:  pixi run -e amr-dev python notched_beam_cascade.py
+"""
+import time
+import numpy as np
+import sympy
+import underworld3 as uw
+
+TAU_Y = 8.0
+SR_MIN = 1.0e-2          # regularization floor -> bounds the effective contrast
+SEED_SIG = 0.04
+U = 1.0
+
+
+def build(cellSize, refinement=1):
+    mesh = uw.meshing.UnstructuredSimplexBox(
+        minCoords=(0, 0), maxCoords=(1, 1), cellSize=cellSize,
+        refinement=refinement, qdegree=3)
+    x, y = mesh.X
+    v = uw.discretisation.MeshVariable("V", mesh, mesh.dim, degree=2)
+    p = uw.discretisation.MeshVariable("P", mesh, 1, degree=1, continuous=True)
+    st = uw.systems.Stokes(mesh, velocityField=v, pressureField=p)
+    st.constitutive_model = uw.constitutive_models.ViscoPlasticFlowModel
+    st.constitutive_model.Parameters.shear_viscosity_0 = 1.0
+    # weak seed: yield stress dips to 0.1*tau_y in a small central blob -> nucleates the band
+    seed = sympy.exp(-((x - 0.5) ** 2 + (y - 0.5) ** 2) / (2 * SEED_SIG ** 2))
+    st.constitutive_model.Parameters.yield_stress = TAU_Y * (1.0 - 0.9 * seed)
+    st.constitutive_model.Parameters.strainrate_inv_II_min = SR_MIN
+    st.bodyforce = sympy.Matrix([0.0, 0.0])
+    # pure-shear compression: compress in x, extend in y; free-slip tangential
+    st.add_dirichlet_bc((+U, None), "Left")
+    st.add_dirichlet_bc((-U, None), "Right")
+    st.add_dirichlet_bc((None, +U), "Top")
+    st.add_dirichlet_bc((None, -U), "Bottom")
+    return mesh, st, v, p
+
+
+def solve_level(st, v, p, warm=None, max_it=40):
+    po = st.petsc_options
+    po["snes_rtol"] = 1.0e-4
+    po["snes_max_it"] = 120
+    po["snes_error_if_not_converged"] = 0
+    st.preconditioner = "fmg"
+    if warm is not None:
+        vc, pc = warm
+        v.data[:] = np.asarray(uw.function.evaluate(vc.sym, v.coords)).reshape(v.data.shape)
+        p.data[:, 0] = np.asarray(uw.function.evaluate(pc.sym, p.coords)).reshape(-1)
+        zig = False
+    else:
+        zig = True
+    t0 = time.perf_counter()
+    try:
+        st.solve(zero_init_guess=zig, picard=40)
+        r = int(st.snes.getConvergedReason()); its = int(st.snes.getIterationNumber())
+    except Exception:
+        r, its = -99, None
+    return dict(reason=r, its=its, t=time.perf_counter() - t0)
+
+
+CELLSIZES = [0.10, 0.05, 0.025]   # coarse -> fine
+
+if __name__ == "__main__":
+    print(f"underworld3: {uw.__file__}")
+    print(f"viscoplastic pure-shear compression, weak seed, tau_y={TAU_Y}, sr_min={SR_MIN}\n")
+
+    # ---- COLD: Newton+FMG from scratch at each resolution (the current default) ----
+    print("COLD  (zero initial guess at each resolution):")
+    for cs in CELLSIZES:
+        mesh, st, v, p = build(cs)
+        ndof = st.snes.getJacobian()[0].getSize()[0] if False else None
+        r = solve_level(st, v, p, warm=None)
+        tag = "ok  " if (r["reason"] or -1) > 0 else "FAIL"
+        print(f"  cellSize={cs:<6}  {tag} newton_its={str(r['its']):>3} ({str(r['reason']):>3})  {r['t']:6.1f}s")
+        del st, v, p, mesh
+
+    # ---- CASCADE: coarse -> fine, warm-started by interpolation ----
+    print("\nCASCADE  (coarse -> fine, warm-started from the interpolated coarser solution):")
+    warm = None
+    keep = []
+    for cs in CELLSIZES:
+        mesh, st, v, p = build(cs)
+        r = solve_level(st, v, p, warm=warm)
+        tag = "ok  " if (r["reason"] or -1) > 0 else "FAIL"
+        src = "cold" if warm is None else "warm"
+        print(f"  cellSize={cs:<6}  [{src}] {tag} newton_its={str(r['its']):>3} ({str(r['reason']):>3})  {r['t']:6.1f}s")
+        warm = (v, p)
+        keep.append((mesh, st, v, p))    # keep alive so warm refs stay valid
diff --git a/docs/examples/snesfas_investigation/vanka_asm.py b/docs/examples/snesfas_investigation/vanka_asm.py
new file mode 100644
index 00000000..882b82f3
--- /dev/null
+++ b/docs/examples/snesfas_investigation/vanka_asm.py
@@ -0,0 +1,102 @@
+"""
+Vanka, done right: overlapping Schwarz (PCASM) with patches = support of each
+pressure basis function, built from the divergence-block sparsity — per expert
+advice (the stock PCPATCH `vanka`/`star` constructs were ineffective on simplex
+Taylor-Hood; this is the hand-built-IS route that is standard for P2-P1).
+
+Procedure:
+  1. assemble UW3's true saddle Jacobian J = [A B; B^T (0)]  (Amat, not the Schur Pmat)
+  2. get the velocity / pressure global DOF index sets from the DM field decomposition
+  3. for each pressure DOF p: patch = {p} ∪ {velocity DOFs coupled to p through J's row}
+  4. PCASM with those index sets, each sub-block an exact LU mini-Stokes solve
+  5. wrap in FGMRES; measure convergence on J x = b
+
+Run:  pixi run -e amr-dev python vanka_asm.py
+"""
+import numpy as np
+import sympy
+import underworld3 as uw
+from petsc4py import PETSc
+
+
+def build_stokes(cellSize=0.1):
+    mesh = uw.meshing.UnstructuredSimplexBox(
+        minCoords=(0, 0), maxCoords=(1, 1), cellSize=cellSize, qdegree=3)
+    x, y = mesh.X
+    v = uw.discretisation.MeshVariable("V", mesh, mesh.dim, degree=2)
+    p = uw.discretisation.MeshVariable("P", mesh, 1, degree=1, continuous=True)
+    st = uw.systems.Stokes(mesh, velocityField=v, pressureField=p)
+    st.constitutive_model = uw.constitutive_models.ViscousFlowModel
+    st.constitutive_model.Parameters.shear_viscosity_0 = 1.0
+    st.bodyforce = sympy.Matrix([0.0, -sympy.exp(-100.0 * ((x - 0.5) ** 2 + (y - 0.5) ** 2))])
+    st.add_dirichlet_bc((0.0, 0.0), "Bottom")     # open top -> no pressure nullspace
+    st.add_dirichlet_bc((0.0, None), "Left")
+    st.add_dirichlet_bc((0.0, None), "Right")
+    return mesh, st, v, p
+
+
+def main():
+    print(f"underworld3: {uw.__file__}")
+    mesh, st, v, p = build_stokes()
+    st.solve(zero_init_guess=True, picard=0)        # assembles Amat
+    J = st.snes.getJacobian()[0]                    # true saddle Jacobian
+    n = J.getSize()[0]
+    print(f"saddle matrix: {n} x {n}")
+
+    # field decomposition -> velocity / pressure global DOFs
+    names, ises, dms = st.dm.createFieldDecomposition()
+    print("fields:", names)
+    vIS, pIS = ises[0], ises[1]
+    vset = set(int(i) for i in vIS.getIndices())
+    pidx = [int(i) for i in pIS.getIndices()]
+    print(f"n_vel={len(vset)}  n_pres={len(pidx)}")
+
+    # build one patch per pressure DOF: {p} + coupled velocity DOFs (J row sparsity)
+    patches = []
+    sizes = []
+    for pg in pidx:
+        cols, _ = J.getRow(pg)
+        members = {pg}
+        for c in cols:
+            ci = int(c)
+            if ci in vset:
+                members.add(ci)
+        arr = np.array(sorted(members), dtype=PETSc.IntType)
+        sizes.append(arr.size)
+        patches.append(PETSc.IS().createGeneral(arr, comm=PETSc.COMM_SELF))
+    print(f"patches: {len(patches)}  size min/mean/max = "
+          f"{min(sizes)}/{np.mean(sizes):.1f}/{max(sizes)}")
+
+    # consistent RHS: b = J x_exact
+    x_exact = J.createVecRight(); x_exact.setRandom()
+    b = J.createVecLeft(); J.mult(x_exact, b)
+
+    for asm_type, label in [(PETSc.PC.ASMType.RESTRICT, "RAS"),
+                            (PETSc.PC.ASMType.BASIC, "additive")]:
+        ksp = PETSc.KSP().create(PETSc.COMM_WORLD)
+        ksp.setOperators(J)
+        ksp.setType("fgmres")
+        ksp.setTolerances(rtol=1e-8, max_it=400)
+        ksp.setOptionsPrefix(f"vk_{label}_")
+        pc = ksp.getPC()
+        pc.setType("asm")
+        pc.setASMType(asm_type)
+        pc.setASMLocalSubdomains(len(patches), patches)
+        opts = PETSc.Options()
+        opts[f"vk_{label}_sub_ksp_type"] = "preonly"
+        opts[f"vk_{label}_sub_pc_type"] = "lu"
+        ksp.setFromOptions()
+        x = b.duplicate(); x.set(0.0)
+        try:
+            ksp.solve(b, x)
+            reason = ksp.getConvergedReason()
+            its = ksp.getIterationNumber()
+            err = (x - x_exact).norm() / x_exact.norm()
+            print(f"  PCASM-Vanka [{label:8}] reason={reason:>3}  its={its:>4}  rel_err={err:.2e}")
+        except Exception as e:  # noqa
+            print(f"  PCASM-Vanka [{label:8}] EXC {repr(e)[:70]}")
+        ksp.destroy()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/docs/examples/snesfas_investigation/vanka_mg.py b/docs/examples/snesfas_investigation/vanka_mg.py
new file mode 100644
index 00000000..ccbec5f2
--- /dev/null
+++ b/docs/examples/snesfas_investigation/vanka_mg.py
@@ -0,0 +1,138 @@
+"""
+The payoff test: custom-IS PCASM Vanka as the per-level smoother of a geometric
+multigrid on the FULL Stokes saddle, over UW3's dm_hierarchy (Galerkin coarse
+operators — the FMG path, but smoothing the coupled system with Vanka).
+
+Measure outer-KSP iterations vs refinement: flat => mesh-independent => the
+scalable Stokes-MG smoother we were after (linear Stokes; this is the FAS-
+scalability question in linear form).
+
+Run:  pixi run -e amr-dev python vanka_mg.py
+"""
+import time
+import numpy as np
+import sympy
+import underworld3 as uw
+from petsc4py import PETSc
+
+
+def build_stokes(refinement, cellSize=0.2):
+    mesh = uw.meshing.UnstructuredSimplexBox(
+        minCoords=(0, 0), maxCoords=(1, 1), cellSize=cellSize,
+        refinement=refinement, qdegree=3)
+    x, y = mesh.X
+    v = uw.discretisation.MeshVariable("V", mesh, mesh.dim, degree=2)
+    p = uw.discretisation.MeshVariable("P", mesh, 1, degree=1, continuous=True)
+    st = uw.systems.Stokes(mesh, velocityField=v, pressureField=p)
+    st.constitutive_model = uw.constitutive_models.ViscousFlowModel
+    st.constitutive_model.Parameters.shear_viscosity_0 = 1.0
+    st.bodyforce = sympy.Matrix([0.0, -sympy.exp(-100.0 * ((x - 0.5) ** 2 + (y - 0.5) ** 2))])
+    st.add_dirichlet_bc((0.0, 0.0), "Bottom")
+    st.add_dirichlet_bc((0.0, None), "Left")
+    st.add_dirichlet_bc((0.0, None), "Right")
+    return mesh, st, v, p
+
+
+def pressure_support_patches(dm, J):
+    """One patch per pressure DOF = {p} u {velocity DOFs coupled via J row}."""
+    names, ises, dms = dm.createFieldDecomposition()
+    vset = set(int(i) for i in ises[0].getIndices())
+    pidx = [int(i) for i in ises[1].getIndices()]
+    patches = []
+    for pg in pidx:
+        cols, _ = J.getRow(pg)
+        mem = {pg} | {int(c) for c in cols if int(c) in vset}
+        patches.append(PETSc.IS().createGeneral(np.array(sorted(mem), dtype=PETSc.IntType),
+                                                 comm=PETSc.COMM_SELF))
+    return patches, len(pidx)
+
+
+def run(refinement, smoother="vanka"):
+    mesh, st, v, p = build_stokes(refinement)
+    st.preconditioner = "fmg"                          # the real competitor (velocity-block MG + Schur)
+    t0 = time.perf_counter()
+    st.solve(zero_init_guess=True, picard=0)           # assemble + FMG linear solve
+    t_fmg = time.perf_counter() - t0
+    fmg_its = int(st.snes.getLinearSolveIterations())
+    Amat = st.snes.getJacobian()[0]
+    hier = st.dm_hierarchy                              # coarse -> fine
+    nlev = len(hier)
+    ndof = Amat.getSize()[0]
+
+    ksp = PETSc.KSP().create(PETSc.COMM_WORLD)
+    ksp.setOperators(Amat)
+    ksp.setType("fgmres")
+    ksp.setTolerances(rtol=1e-8, max_it=400)
+    ksp.setOptionsPrefix("vmg_")
+    pc = ksp.getPC()
+    pc.setType("mg")
+    pc.setMGLevels(nlev)
+    for L in range(1, nlev):                            # interpolation coarse(L-1) -> fine(L)
+        interp, _ = hier[L - 1].createInterpolation(hier[L])
+        pc.setMGInterpolation(L, interp)
+
+    o = PETSc.Options()
+    o["vmg_pc_mg_galerkin"] = "both"                   # coarse operators = R A P
+    o["vmg_pc_mg_type"] = "multiplicative"
+    if smoother == "lu":                               # control: exact monolithic smoother
+        o["vmg_mg_levels_ksp_type"] = "richardson"
+        o["vmg_mg_levels_ksp_max_it"] = 4
+        o["vmg_mg_levels_pc_type"] = "lu"
+    else:
+        # Krylov smoother (GMRES) self-stabilises the additive-Schwarz Vanka spectrum
+        o["vmg_mg_levels_ksp_type"] = "gmres"
+        o["vmg_mg_levels_ksp_max_it"] = 6
+        o["vmg_mg_levels_pc_type"] = "asm"
+        o["vmg_mg_levels_sub_ksp_type"] = "preonly"
+        o["vmg_mg_levels_sub_pc_type"] = "lu"
+    o["vmg_mg_coarse_ksp_type"] = "preonly"
+    o["vmg_mg_coarse_pc_type"] = "lu"
+    ksp.setFromOptions()
+    t0 = time.perf_counter()
+    ksp.setUp()
+    t_setup = time.perf_counter() - t0
+
+    # inject custom pressure-support patches into each fine-ish level smoother
+    patch_info = []
+    t_patch = 0.0
+    if smoother == "vanka":
+        for L in range(1, nlev):                        # level 0 is coarse solve (LU)
+            sm = pc.getMGSmoother(L)
+            Jl = sm.getOperators()[0]
+            dmL = hier[L]
+            tp = time.perf_counter()
+            patches, npres = pressure_support_patches(dmL, Jl)   # Python prototype cost
+            t_patch += time.perf_counter() - tp
+            pcL = sm.getPC()
+            pcL.reset()                                # clear the auto setup so we can set subdomains
+            pcL.setOperators(Jl, Jl)                    # reset() drops operators — restore
+            pcL.setType("asm")
+            pcL.setASMType(PETSc.PC.ASMType.RESTRICT)
+            pcL.setASMLocalSubdomains(len(patches), patches)
+            patch_info.append(npres)
+
+    x_exact = Amat.createVecRight(); x_exact.setRandom()
+    b = Amat.createVecLeft(); Amat.mult(x_exact, b)
+    x = b.duplicate(); x.set(0.0)
+    t0 = time.perf_counter()
+    ksp.solve(b, x)
+    t_solve = time.perf_counter() - t0
+    reason = ksp.getConvergedReason()
+    its = ksp.getIterationNumber()
+    err = (x - x_exact).norm() / x_exact.norm()
+    print(f"  ndof={ndof:>6} lvl={nlev} | FMG: its={fmg_its:>2} solve={t_fmg:6.2f}s "
+          f"| Vanka-MG: its={its:>2} solve={t_solve:6.2f}s (+setup {t_setup:.2f} +patch(py) {t_patch:.2f}) "
+          f"err={err:.0e}")
+    ksp.destroy()
+    return dict(ndof=ndof, fmg_its=fmg_its, t_fmg=t_fmg, its=its, t_solve=t_solve)
+
+
+if __name__ == "__main__":
+    print(f"underworld3: {uw.__file__}")
+    print("FMG (velocity-block MG + Schur — the production competitor) vs full-saddle Vanka-MG.")
+    print("Linear Stokes; FMG solve = assemble + linear solve; Vanka solve = linear solve on assembled J.\n")
+    for R in (1, 2, 3, 4):
+        try:
+            run(R, "vanka")
+        except Exception as e:  # noqa
+            print(f"  refinement={R}: EXC {repr(e)[:90]}")