From fcb9692bf11c01e66373f03c58bf2b03ab99f825 Mon Sep 17 00:00:00 2001 From: lmoresi Date: Tue, 16 Jun 2026 14:12:32 +1000 Subject: [PATCH] docs(solvers): preserve SNESFAS / Vanka / grid-sequencing investigation (design notes + working prototypes) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Lands the parked multilevel-nonlinear-Stokes investigation as documentation so the significant trial-and-error and the WORKING implementations are not lost. No src/ or test changes — everything runs through petsc_options + petsc4py on stock UW3. Design notes (docs/developer/design/): - snesfas-feasibility.md: SNESFAS via options; mesh-independent on scalar nonlinear; adaptation benign. - snesfas-vanka-feasibility-study.md: the Vanka deep-dive; custom-IS pressure-support patches work where stock PCPATCH fails; mesh-independent Vanka-MG (5/5/6); FAS-Vanka 2-vs-10 on plasticity but fails at extreme contrast; 3-way FMG/GAMG/FAS-Vanka. - multilevel-nonlinear-stokes-strategy.md: capstone + decision to park (small 2D win, high engineering load, cleanup first) with return-conditions (3D/parallel/real FMG-failing benchmark) and the cheapest re-entry (Tier-1 grid_sequence helper). Working prototypes (docs/examples/snesfas_investigation/, precedent: submesh_investigation/): vanka_asm.py (core custom-IS Vanka recipe), vanka_mg.py (mesh-independent Vanka MG), fas_vanka.py (working SNESFAS-Vanka for nonlinear, modest contrast — the 2-vs-10 case), benchmark_3way.py (FMG/GAMG/FAS-Vanka), notched_beam_cascade.py (grid sequencing). Cross-imports made self-contained; all byte-compile. README documents each + the recipes + known limits. Underworld development team with AI support from Claude Code --- .../multilevel-nonlinear-stokes-strategy.md | 135 +++++++ docs/developer/design/snesfas-feasibility.md | 377 ++++++++++++++++++ .../design/snesfas-vanka-feasibility-study.md | 337 ++++++++++++++++ docs/examples/snesfas_investigation/README.md | 49 +++ .../snesfas_investigation/benchmark_3way.py | 110 +++++ .../snesfas_investigation/fas_vanka.py | 101 +++++ .../notched_beam_cascade.py | 102 +++++ .../snesfas_investigation/vanka_asm.py | 102 +++++ .../snesfas_investigation/vanka_mg.py | 138 +++++++ 9 files changed, 1451 insertions(+) create mode 100644 docs/developer/design/multilevel-nonlinear-stokes-strategy.md create mode 100644 docs/developer/design/snesfas-feasibility.md create mode 100644 docs/developer/design/snesfas-vanka-feasibility-study.md create mode 100644 docs/examples/snesfas_investigation/README.md create mode 100644 docs/examples/snesfas_investigation/benchmark_3way.py create mode 100644 docs/examples/snesfas_investigation/fas_vanka.py create mode 100644 docs/examples/snesfas_investigation/notched_beam_cascade.py create mode 100644 docs/examples/snesfas_investigation/vanka_asm.py create mode 100644 docs/examples/snesfas_investigation/vanka_mg.py diff --git a/docs/developer/design/multilevel-nonlinear-stokes-strategy.md b/docs/developer/design/multilevel-nonlinear-stokes-strategy.md new file mode 100644 index 00000000..d26c221d --- /dev/null +++ b/docs/developer/design/multilevel-nonlinear-stokes-strategy.md @@ -0,0 +1,135 @@ +--- +title: "Multilevel solvers for nonlinear Stokes — strategy, findings, and a decision to park" +status: PARKED (2026-06-16) +--- + +# Multilevel solvers for nonlinear Stokes — strategy note + decision + +**Decision (2026-06-16): park this. Capture the understanding, ship nothing yet.** +The conceptual payoff of the investigation was large; the *implementation* payoff +*right now* is small relative to the engineering load, and the codebase wants a +cleanup before features of this size land. This note records the landscape, the +findings, and the conditions under which it is worth coming back. Companion docs +with the detailed experiments: `snesfas-feasibility.md` and +`snesfas-vanka-feasibility-study.md`. **The working prototypes are preserved in-repo +at `docs/examples/snesfas_investigation/`** (curated set + README; the full scratch +set is in `~/+Simulations/snesfas_spike/`). No `src/` changes were made — this was a +spike. + +## What we were chasing + +A robust, scalable solver for **nonlinear Stokes** in the regime that actually +matters for geodynamics: **strong nonlinearity (plasticity, T-dependent viscosity) +*and* high viscosity contrast — which co-occur, because the nonlinearity is what +creates the contrast.** FMG (the geometric-MG-on-the-velocity-block preconditioner, +#231) is the current robust default; the question was whether SNESFAS / Vanka / +grid-sequencing could do better where FMG-inside-Newton struggles. + +## The conceptual map (the durable result) + +Three multilevel strategies, and the key distinctions that took a while to get +straight: + +- **FMG (Galerkin).** Coarse operator `A_H = R A_h P`, built *algebraically from the + fine matrix* — it never re-visits the PDE. The multigrid acts on the **velocity + block only**, inside the Schur fieldsplit; the saddle/pressure coupling and the + viscosity jump are handled by the Schur, *not* by a coarse-grid correction. That is + why FMG is robust to high contrast: it keeps the hard, indefinite, high-contrast + part **out of the multigrid**. Transfers can be approximate (barycentric / general); + they only move a *correction* with homogeneous boundary values, and the smoother + + Krylov clean up any imperfection — so it tolerates UW3's non-nested (curved / + mover-deformed) hierarchies. + +- **SNESFAS (re-discretisation + τ-correction).** Nonlinear multigrid on the **full + coupled saddle**. Each level uses the *genuine re-discretised* coarse problem (not + Galerkin — `-snes_view` confirms "Not using Galerkin… function evaluation"), plus + the **τ-correction** that couples levels (the coarse grid solves the full problem + with the restricted fine residual as forcing, and prolongs only the *change*). The + smoother is a separate choice — **Vanka** (patch-local mini-Stokes solves) is one + option for the saddle, *not* part of FAS itself. The τ-correction is the extra + nonlinear machinery — **and it is exactly the part that is fragile to high + contrast**, because it is a repeated coarse-grid *correction* across a jump the + coarse grid cannot resolve. + +- **Grid sequencing / nested iteration (Brandt).** Solve the re-discretised coarse + problem (cheap, well-conditioned) to **find the basin**, interpolate up as the + initial guess, solve the next level, repeat. Coarse grids are used for + *initialisation*, never *correction* — so there is **no coarse-correction-across- + the-jump**, hence it is robust to contrast. It lacks FAS's τ-correction (so it + doesn't get the textbook MG convergence *rate*), but for problems that become + *more ill-conditioned with refinement* it is the effective, robust tool: each + level is only a bounded correction to the one below. + +The unifying insight: **nonlinearity wants a coarse-grid correction (FAS); high +contrast forbids one (the coarse grid can't represent the jump).** Those are +opposite demands, which is why no single monolithic cycle covers both, and why the +robust workhorse stays "split the saddle off (FMG) + handle the nonlinearity by +Newton/Picard + continuation (timestepping)." + +## What was actually demonstrated (iteration counts — timings were unreliable) + +(Wall-clock numbers in the companion docs are *soft* — JIT, machine noise, +resolution/tolerance-sensitive. Trust the iteration counts.) + +- **SNESFAS runs through UW3 with zero `src` changes** (petsc_options only) — `copyDS` + + PETSc's FAS setup give coarse residuals. Mesh-independent on scalar nonlinear + problems; beats Newton on stiff nonlinear diffusion. +- **Vanka works** on UW3's simplex P2–P1 once patches are built as the *support of + each pressure basis function* (custom-IS `PCASM`, exact local LU) — the stock + `PCPATCH` constructs are the wrong tool and fail. A geometric MG with that Vanka + smoother is **mesh-independent (5/5/6 outer iters across 16× DOFs)** — needs a GMRES + Krylov smoother (additive Schwarz alone diverges). +- **FAS-Vanka wins on plasticity** — viscoplastic τ_y=1: **2 nonlinear iterations vs + 10** for Newton+FMG and GAMG (the coarse problem is better conditioned, as + predicted). But with *additive* Vanka it **fails at extreme contrast (10⁶)** no + matter how hard you smooth — needs multiplicative Vanka + interface-aware coarsening. +- **The cascade helps iff the per-level solve is iteration-limited.** On SolCx it does + nothing for FMG (already 1–3 iters → cascade is pure overhead) but rescues a + struggling GAMG (**8→2 iters**). Its payoff is exactly proportional to how badly the + cold fine solve struggles. +- **PETSc's `-snes_grid_sequence` does not work on UW3 meshes** — the built-in + interpolator hard-fails (point location) on UW3's non-nested (boundary-snapped / + mover-deformed) hierarchies and on the BC-constrained boundary DOFs. A hand-rolled + cascade using `uw.function.evaluate` (robust, never hard-fails; BCs re-imposed per + level) sidesteps this entirely. + +## Why park it now + +- **The marginal win is small in the current regime.** FMG already handles the + linear / moderate / high-contrast-*linear* cases well; for nonlinear, Picard/Newton + + FMG + timestepping-as-continuation is robust if unspectacular. FAS-Vanka's clear + advantage (plasticity iteration count) does **not** yet translate to wall-clock + (LU/Vanka smoother cost), and it breaks on exactly the high-contrast that real + nonlinear problems also carry. +- **The engineering load is high.** Production FAS-Vanka needs: a multiplicative / + interface-aware Vanka smoother (the extreme-contrast gap), per-level pressure + nullspace, parallel patch construction, and a UW3 API. Production grid-sequencing + needs the problem **re-bindable to coarser meshes** (re-create variables, re-bind + expressions/BCs, and **restrict auxiliary/material fields** — the recurring "get the + coefficient data onto the coarse grid" problem). +- **The tree needs a cleanup first.** Landing a feature of this weight onto a working + tree with many outstanding changes is the wrong order. + +## When to come back (what would change the calculus) + +- **3-D and large-parallel.** This is where the argument flips: the direct solves + that FMG's coarse grid and the LU paths lean on scale ~O(N²) in 3-D and MUMPS gets + unreliable at high core counts, while Vanka's local solves scale ~O(N) and + parallelise. Re-measure there. +- **A concrete problem where FMG genuinely fails at production resolution** — e.g. a + notched/localising plasticity benchmark whose cold fine solve collapses. The spike + could not tune a *simplified* version into the "easy-coarse / hard-fine" knife-edge + (UW3's Picard+FMG was robust on smooth viscoplastic across resolutions); the real + pathology needs the singular features (geometric notch tip, Drucker–Prager) and is + the right target if/when it bites in practice. + +## Cheapest future entry point + +If a real need appears, the lowest-cost first step is **Tier-1 grid sequencing**: a +~30-line `grid_sequence_solve(build_fn, resolutions)` helper that loops coarse→fine, +solves each level with FMG, and interpolates up via `uw.function.evaluate`. It needs +no solver-internals changes and is robust to UW3's non-nested meshes. It only helps +when the fine solve is iteration-limited, so it should be opt-in. Prototype: +`~/+Simulations/snesfas_spike/notched_beam_cascade.py`. The callback-free +`solve(grid_sequence=True)` and the FAS-Vanka path are the larger follow-ons, gated +on the cleanup and on 3-D/parallel demand. diff --git a/docs/developer/design/snesfas-feasibility.md b/docs/developer/design/snesfas-feasibility.md new file mode 100644 index 00000000..2ec6ec71 --- /dev/null +++ b/docs/developer/design/snesfas-feasibility.md @@ -0,0 +1,377 @@ +--- +title: "SNESFAS (nonlinear multigrid) feasibility in Underworld3" +--- + +# SNESFAS feasibility spike + +**Verdict: GO for scalar nonlinear, and GO for nonlinear Stokes with adaptation +(with two engineering follow-ons).** PETSc's `SNESFAS` (Full Approximation Scheme — +nonlinear geometric multigrid) runs through Underworld3's existing DMPlex-FE solver +stack **with no source changes** — it is activated entirely through `petsc_options`. +On a strongly nonlinear scalar problem it is mesh-independent, more robust than +Newton, and faster, in serial and parallel. On **nonlinear (power-law) Stokes** it +holds a nonlinearity-independent 2–3 multigrid cycles where Newton climbs to 30+ +iterations, **including on an adapted (coordinate-deformed) mesh** — the actual +production target. The two remaining pieces before it is production-ready are +scalable saddle-point smoothers (the spike used direct LU per level) and a +per-level pressure nullspace for enclosed/free-slip problems. + +This note records the spike (scalar `SNES_Scalar` / `Poisson`, static meshes) that +established feasibility, and what remains before FAS could be offered as a +production feature or extended to Stokes. + +## Background: why FAS, and the architectural question + +The geometric **FMG** preconditioner landed in #231 gives *linear* multigrid that +is robust to mesh anisotropy. But nonlinear robustness still rests on Newton +(`newtonls`) / Picard, whose convergence basin shrinks as nonlinearity sharpens +(temperature-dependent viscosity, viscoplastic yield, Richards fronts). `SNESFAS` +performs coarse-grid correction on the **nonlinear residual itself**, which is the +classic lever for global robustness. + +The open question was whether UW3 could feed FAS a genuine nonlinear residual on +every coarse level. The FMG note (`docs/advanced/multigrid-preconditioning.md`) +states that UW3 "does not install residual/Jacobian callbacks on the coarse DMs, +so the coarse operators must be formed by Galerkin projection (RAP)". That is true +for the **linear `PCMG`** path — but it turns out **not** to block FAS. + +## Key finding: FAS works with zero code changes + +In `petsc_generic_snes_solvers.pyx` each solver `_build`: + +1. copies the `PetscDS` (which holds the JIT pointwise residual/Jacobian function + pointers) to **every** coarse DM — `self.dm.copyDS(coarse_dm)` in a loop over + `self.dm_hierarchy`; and +2. installs the SNES local-FEM callback on the **fine** DM only — + `UW_DMPlexSetSNESLocalFEM(cdm.dm, ...)`. + +When `snes_type = fas`, PETSc's `SNESSetUp_FAS` walks the coarse-DM chain (already +linked by `setCoarseDM` when the mesh is built with `refinement=N`) and +**propagates the fine DM's `DMSNES` local-FEM callback down to each level** via +`DMCopyDMSNES`. Combined with the already-copied DS, every level can evaluate its +own nonlinear residual and Jacobian. `-snes_view` confirms it: + +``` +type: fas + type is FULL, levels=3, cycles=1 + Not using Galerkin computed coarse grid function evaluation <-- genuine re-discretisation + Coarse grid solver -- level 0: newtonls + LU, rows=445 + Down/Up smoother on level 1: newtonls, rows=1857 + Down/Up smoother on level 2: newtonls (fine) +``` + +So the spike's anticipated "install callbacks on coarse DMs" change was +**unnecessary**. FAS is a `petsc_options`-only capability today. + +## Results (scalar, static mesh) + +Two testbeds on the unit square, P2 elements, mesh built with `refinement`: + +- **Bratu** `-Δu = λ e^u`, `u=0` on `∂Ω` (lower branch exists for `λ < λ_c ≈ 6.808`). +- **Exponential nonlinear diffusion** `-∇·(e^{βu}∇u) = f`, manufactured solution + `u = sin πx · sin πy` so a solution provably exists at every β (a stiff Newton + stressor as β grows). + +Configurations: `newton` (`newtonls`), `fas` (`snes_type=fas`, FULL cycle, +`newtonls` smoothers, LU coarse), `ngmres+fas` (NGMRES outer with FAS as the +nonlinear preconditioner, `-npc_snes_type fas`). + +### Correctness and mesh-independence (Bratu, λ=5) + +| refinement | levels | Newton its | FAS its (F-cycles) | ‖u_FAS − u_Newton‖ | +|---|---|---|---|---| +| 1 | 2 | 4 | **1** | 7e-15 | +| 2 | 3 | 4 | **1** | 2e-15 | +| 3 | 4 | 4 | **1** | 2e-15 | + +FAS converges in a single F-cycle independent of refinement, to the same solution +as Newton. Across `λ = 3…6.8`, `fas` and `ngmres+fas` agree with Newton to +1e-9…1e-15. Both Newton and FAS hit the **same** envelope at the fold +(`λ ≳ 6.85`, where no solution exists) — Bratu's lower branch is reachable by +Newton from a cold start, so Bratu shows FAS *works*, not that it is *more robust*. + +### Robustness advantage (exponential diffusion, MMS) + +| β | Newton | FAS | NGMRES+FAS | notes | +|---|---|---|---|---| +| 1 | 16 its ✓ | **1** ✓ | 1 ✓ | L2 ≈ 2e-7 all | +| 2 | 26 its ✓ | **1** ✓ | 1 ✓ | | +| 3 | **diverges** (line search) | **1 ✓** | **1 ✓** | solution exists — FAS finds it (L2 2.2e-7), Newton cannot | +| 4 | diverges | diverges | diverges | default smoothers insufficient | +| 5 | diverges | 80 (stalls) | 27 (fails) | | + +This is the headline: at β=3 a solution provably exists and **FAS converges in one +cycle where Newton diverges from the cold start**. FAS is not a silver bullet — +β ≥ 4 needs stronger smoothers/continuation than the spike's defaults. + +### Wall-clock (exponential diffusion) + +| β | refinement | Newton | FAS | speed-up | +|---|---|---|---|---| +| 1 | 2 | 1.86 s (16) | 1.10 s (1) | 1.7× | +| 2 | 2 | 2.59 s (26) | 1.14 s (1) | 2.3× | +| 1 | 3 | 6.45 s (16) | 5.35 s (1) | 1.2× | +| 2 | 3 | 10.6 s (26) | 5.56 s (1) | 1.9× | + +FAS is faster as well as more robust when the nonlinearity is strong (Newton needs +many steps). On mild problems where Newton converges in 3–4 steps, FAS's +per-cycle cost makes it merely competitive, not faster. + +### Parallel + +np=2 with a parallel-safe coarse solve (`fas_coarse_pc_type=redundant`, +`fas_coarse_redundant_pc_type=lu`) converges in 1 cycle, `L2=2.06e-7` vs np=1's +`2.07e-7`. The scalar parallel gate is clear. + +### Adaptation — FAS on a coordinate-deformed mesh + +The headline risk in the first cut of this note was that, because the movers +update only the fine DM's coordinates, FAS's coarse(undeformed) → fine(deformed) +inter-grid transfers might be "too weak" to converge. **Tested and refuted for the +scalar case.** A refined mesh (3 levels) was deformed with `follow_metric` +(tanh-front metric) at increasing strength, then the exponential-diffusion MMS was +solved on the deformed mesh: + +| follow R | q_min | coarse moved | fine moved | Newton | FAS | FAS L2 | +|---|---|---|---|---|---|---| +| — (static) | 0.89 | — | — | 26 ✓ | **1** ✓ | 2.1e-7 | +| 1.5 | 0.37 | **0** | 0.21 | 24 ✓ | **2** ✓ | 4.2e-7 | +| 2.0 | 0.32 | **0** | 0.35 | 24 ✓ | **2** ✓ | 6.3e-7 | +| 2.5 | 0.34 | **0** | 0.41 | 25 ✓ | **2** ✓ | 5.7e-7 | + +The mismatch is real ("coarse moved" = 0 confirms coarse DMs keep original +geometry while the fine DM deforms), yet FAS degrades only from 1 to **2 F-cycles** +and never loses correctness (L2 tracks Newton exactly). Pushed harder — into the +β=3 regime where Newton **diverges** on a uniform mesh — FAS on the *deformed* mesh +still converges in 2 cycles and stays correct: + +| β | follow R | q_min | Newton | FAS | FAS L2 | +|---|---|---|---|---|---| +| 3 | 1.5 | 0.37 | **diverges** | **2 ✓** | 4.3e-7 | +| 3 | 2.0 | 0.32 | **diverges** | **2 ✓** | 6.7e-7 | +| 2 | 3.0 | 0.32 | 25 ✓ | 2 ✓ | 5.3e-7 | +| 3 | 3.0 | 0.32 | **diverges** | **2 ✓** | 5.9e-7 | + +Why it works despite the mismatch: PETSc builds the transfers from the parent→child +*refinement* relationship (reference-element shape functions), not from a match of +physical coarse/fine geometry, and FAS only needs the coarse step to be a useful +*correction* — the fine smoother removes whatever the geometrically-imperfect +coarse correction leaves behind. Coarse-coordinate propagation would likely recover +the single-cycle count, but is **not required for convergence** on the scalar case. + +## How to use it today (no code required) + +```python +mesh = uw.meshing.UnstructuredSimplexBox(..., refinement=2) # builds the hierarchy +poisson = uw.systems.Poisson(mesh, u_Field=u) +# ... constitutive model, nonlinear f, BCs ... + +po = poisson.petsc_options +po["snes_type"] = "fas" +po["snes_fas_type"] = "full" # FMG-style F-cycle +po["fas_levels_snes_type"] = "newtonls" # per-level nonlinear smoother +po["fas_levels_snes_max_it"] = 4 +po["fas_levels_snes_linesearch_type"] = "basic" +po["fas_coarse_snes_type"] = "newtonls" # coarse nonlinear solve +po["fas_coarse_ksp_type"] = "preonly" +po["fas_coarse_pc_type"] = "lu" # parallel: "redundant" + redundant_pc_type "lu" +poisson.solve(zero_init_guess=True) + +# Or FAS as a nonlinear preconditioner to a robust outer accelerator: +# po["snes_type"] = "ngmres"; po["npc_snes_type"] = "fas"; po["npc_fas_*"] = ... +``` + +## Stokes (saddle-point) — the production target + +Everything above is scalar. The real goal is **nonlinear Stokes with adaptation**. +The spike carried the result all the way there. + +**Plumbing (linear Stokes).** FAS runs on the velocity–pressure saddle-point system +with no code changes. Using a monolithic LU smoother on each level (newtonls + +`preonly`/`lu`) on a constant-viscosity box with an **open top** (traction-free, so +*no* constant-pressure nullspace — the LU level solves stay non-singular), FAS +reaches the same velocity field as the default fieldsplit+FMG solve (rel.diff 4e-5, +1 cycle). So coarse residual, inter-grid transfers, and coarse correction all work +on a saddle system. The Stokes `solve()` reads `snes_type` from +`self.snes.getType()` (set by `petsc_options` during `_build`) and re-applies it, so +`snes_type=fas` survives on the standard path (`picard=0`, `zero_init_guess=True`). + +**Nonlinear Stokes — the win.** A *smooth* shear-thinning power-law viscosity +`η = η₀ (ε_II/ε_ref)^(1/n − 1)` (n=1 linear, larger n more nonlinear), open top: + +| n | Newton its | FAS cycles | same soln | +|---|---|---|---| +| 1 | 1 | 1 | 4e-6 | +| 2 | 14 | **2** | 2e-4 | +| 3 | 22 | **2** | 3e-4 | +| 4 | 28 | **3** | 3e-4 | +| 5 | 30 | **3** | 2e-4 | + +FAS holds **2–3 cycles while Newton climbs 14 → 30** as the nonlinearity sharpens — +the same nonlinearity-independence seen in the scalar exponential-diffusion case, +now on the saddle-point system, converging to the same solution. + +**Wall-clock — is the complexity worth it?** Yes, even with the un-optimised LU +smoother. FAS (monolithic LU per level) vs Newton (default fieldsplit + FMG): + +| n | refine | levels | Newton | FAS | speed-up | +|---|---|---|---|---|---| +| 3 | 2 | 3 | 15.0 s | 9.1 s | 1.66× | +| 5 | 2 | 3 | 19.9 s | 13.9 s | 1.44× | +| 3 | 3 | 4 | 72.5 s | 38.6 s | 1.88× | +| 5 | 3 | 4 | 95.3 s | 59.9 s | 1.59× | + +FAS is **1.4–1.9× faster** despite paying for a direct solve on every level, and the +gap *widens* with mesh size (1.66→1.88× from ref2→ref3 at n=3) because the cycle +count stays flat while Newton's iteration count does not. + +**But the per-level smoother is the crux, and it is not a `petsc_options` swap.** +`-snes_view` confirms the smoother on *every* level (including the fine, 17505×17505) +is `newtonls`+`preonly`+`lu` — a full direct factorization, ~6 of them per cycle on +the fine grid. That does not scale (2-D sparse LU is ≈ O(N^1.5)). The obvious "fix" +— a fieldsplit-Schur smoother on the levels, LU only on the coarse grid — was tried +and is **9–17× slower**, not faster: + +| n | refine | LU smoother | fieldsplit-Schur smoother | +|---|---|---|---| +| 3 | 2 | 9.0 s | 81 s | +| 3 | 3 | 37.6 s | 657 s | + +Both converge in 2–3 cycles to the same solution, but a Schur fieldsplit is itself +an approximate Stokes *solve*, and as a smoother it is invoked many times (≈ 2 Newton +steps × down+up × per level × per cycle) — so each relaxation does a near-complete +Stokes solve. At these sizes a single direct LU is cheaper. The genuine requirement +is an **inexpensive saddle-point smoother** — Vanka (element/patch block), Braess– +Sarazin, or a distributive/Uzawa relaxation — none of which is a stock PETSc +`petsc_options` choice for a DMPlex FE Stokes operator. **This, not the nullspace, +is the real research/engineering cost of production Stokes FAS.** Until it exists, +LU-smoothed FAS is a correct and (at moderate size) faster method whose cost is +dominated by the fine-grid direct solve. + +**Nonlinear Stokes ON AN ADAPTED MESH — the target.** Power-law Stokes (n=3) with +the mesh deformed by `follow_metric` toward the buoyancy feature (coarse DMs keep +their original geometry — the same mismatch the scalar case shrugged off): + +| follow R | q_min | Newton its | FAS cycles | same soln | +|---|---|---|---|---| +| — (static) | 0.89 | 22 | 2 | 0.0931≈0.0932 | +| 1.5 | 0.31 | 22 | **2** | 2.8e-4 | +| 2.0 | 0.32 | 22 | **3** | 2.7e-4 | +| 2.5 | 0.36 | 23 | **3** | 1.8e-4 | + +FAS keeps its 2–3 cycle convergence on the deformed saddle-point mesh, same answer +every time. **Feasibility for nonlinear-Stokes-with-adaptation is established.** + +**Decomposing the benefit: globalization vs convergence rate.** FAS delivers two +distinct things — (a) it drops the fine grid into Newton's quadratic basin (the +coarse→fine ramp-up / *globalization*), and (b) a genuine nonlinear-multigrid +*convergence rate* via repeated coarse correction. The globalization piece can be +had *cheaply*, reusing the existing Newton + linear-FMG machinery with no saddle +smoother, by **nested iteration / grid sequencing**: solve coarse, interpolate, +warm-start fine. Two ways to get it, and how much it buys: + +- PETSc `-snes_grid_sequence` **does not work** on UW3 meshes via `petsc_options`: + it triggers `DMPlexComputeInterpolatorGeneral` (PETSc err 56) rather than the + pre-chained nested hierarchy that FAS reuses — so, unlike FAS, it is *not* + plug-and-play. +- **Manual nested iteration at the UW3 level** (coarse solve → `uw.function.evaluate` + onto the fine field → warm-started fine solve) *does* work, and drops the fine + Newton count from 22→14 (n=3) and 29→19 (n=5) — but only ~1.1–1.2× wall-clock. + +So globalization alone is a *modest* win: it fixes the initial guess but not the +per-level convergence rate, and the coarse solve still pays the full nonlinear cost. +FAS's larger 1.5–1.9× comes from (b), the multigrid convergence rate — which is +exactly the part that needs the inexpensive saddle smoother. Practical reading: for +*robustness* on hard nonlinear Stokes, Picard warm-up (already in UW3) or manual +nested iteration is cheap and reuses FMG; for *speed*, FAS is the lever but only +once a real saddle smoother exists. + +**Where Stokes FAS does *not* help: non-smooth yielding.** A von Mises viscoplastic +viscosity `η = min(η₀, τ_y/2ε_II)` is hard for *every* solver: once the domain +yields (τ_y ≲ 1 here) Newton, Picard **and** FAS all fail. The `min()` kink is +non-smooth and, worse for FAS, the coarse level yields on a different pattern than +the fine, so the coarse correction is inconsistent. This is the known +regularization/continuation regime, not a FAS-specific failure — the same shape as +the scalar β≥4 ceiling. + +**Two engineering follow-ons before production Stokes FAS:** + +1. **An inexpensive saddle-point smoother (the hard part — feasible, but real work).** + The spike's per-level smoother is a monolithic direct LU — fine for proving the + algorithm (the 2–3 cycle count is the real result), but it does not scale. + Replacements tried, all on power-law n=3 at ref2/ref3: + + | smoother | result | + |---|---| + | monolithic LU (baseline) | converges, 2 cycles, 11 s / 38 s — but LU does not scale | + | fieldsplit-Schur, heavy (gmres 20) | converges but **9–17× slower** (a Schur solve per relaxation) | + | fieldsplit-Schur, light (gmres 1–2) | **zero pivot** → `DIVERGED_INNER` (−7) | + | PCPATCH **Vanka** | **zero pivot** on setup (singular local patches) | + + Every cheap attempt dies on the same rock: **"Zero pivot in LU factorization."** + The pressure block has a zero diagonal, so naive relaxations (SOR / ILU / a plain + sub-LU) divide by zero. That is the defining saddle-point difficulty, and exactly + what Vanka (solve each local velocity+pressure patch as a coupled block) and + Braess–Sarazin (a specific approximate-Schur relaxation) are constructed to avoid. + So the smoother is **feasible — Vanka via PETSc `PCPATCH` is the standard tool and + is used in production for Stokes geometric MG (e.g. Firedrake)** — but wiring it + into UW3 is a real task: correct `PCPATCH` patch construction so the local saddles + are non-singular, plus UW3-side exposure of the DM fields/topology to the patch PC + (analogous to `_setup_block_fieldsplit_options`). It is **not** a `petsc_options` + one-liner, which is what makes it the genuine engineering cost of scalable Stokes + FAS. +2. **Per-level pressure nullspace.** The spike used an open-top problem to avoid the + constant-pressure nullspace. `_attach_stokes_nullspace()` sets the nullspace on + the *outer* matrix only; FAS builds each level's matrix internally. Enclosed / + free-slip problems will need the nullspace registered at the DM/DS level (e.g. + `DMSetNullSpaceConstructor` on the pressure field) so every FAS level inherits + it. This is the one place a small UW3 code change is likely required. + +## Limitations and open questions + +- **Smoother/coarse tuning matters.** Default `newtonls` smoothers handle moderate + nonlinearity; very stiff regimes need stronger smoothers, more pre/post sweeps, + W-cycles, or continuation. This is normal FAS practice, not a UW3 limitation. +- **Coarse geometry under movers — TESTED, benign for scalar.** The movers update + only the fine DM's coordinates; coarse DMs keep their original geometry (verified: + "coarse moved" = 0). This was the headline risk, but on the scalar case FAS only + slows from 1 to 2 F-cycles under deformation strong enough to drop q_min to 0.32, + stays correct, and still converges where Newton diverges (see the adaptation + tables above). Coarse-coordinate propagation down the hierarchy would likely + restore the single-cycle count but is **not a prerequisite for convergence**. Whether + this still holds for the much larger anisotropy of an adapted Stokes problem is + the next thing to confirm. +- **Stokes saddle-point FAS — separate, harder.** FAS smoothers on the + velocity-pressure system need a fieldsplit smoother and a constant-pressure + nullspace on every level. Out of scope here; revisit only after the scalar path + and the coarse-geometry question are settled. + +## Recommendation + +1. **Offer FAS as an opt-in for scalar (and likely vector) nonlinear solves.** It + needs no new code — only a documented option bundle, or a thin + `nonlinear_preconditioner` / `snes_type='fas'` convenience knob mirroring the + FMG `preconditioner` property (default off; FMG/Newton defaults untouched). + The exact deliverable shape is deferred pending these results. +2. **Validate on Richards** (`docs/beginner/tutorials/16/17`), the strongest + real-world scalar FAS target, before committing to an API. +3. **Nonlinear-Stokes-with-adaptation is demonstrated** (2–3 cycles vs Newton's + ~22, on a deformed mesh, same solution; 1.4–1.9× faster with LU smoothers). + Production scalability hinges on an **inexpensive saddle-point smoother** (Vanka / + Braess–Sarazin / distributive) — *not* a stock fieldsplit, which is 9–17× slower + as a smoother. A **per-level pressure nullspace** is the smaller second piece + (enclosed/free-slip). Coarse-coordinate propagation stays an optimisation; the + non-smooth viscoplastic regime is out of scope (hard for all solvers). + +## Reproduction + +Spike scripts (not part of the repo; under `~/+Simulations/snesfas_spike/`): +`bratu_baseline.py`, `fas_probe.py`, `fas_measure.py`, `expdiff_mms.py`, +`adapt_fas.py` (FAS on a `follow_metric`-deformed mesh), and the Stokes set: +`stokes_fas.py` (linear plumbing), `stokes_fas_nl.py` (viscoplastic — hard for +all), `stokes_fas_powerlaw.py` (smooth nonlinear win), `stokes_fas_adapt.py` +(nonlinear Stokes on an adapted mesh). +Worktree `feature/snesfas-spike` off `origin/development` — **no source files +changed** (`git status` on `src/` is clean); the only artifact in-repo is this +note. Regression sanity: `tests/test_1014_stokes_multigrid.py` + +`tests/test_1000_poissonCart.py` → 18 passed. diff --git a/docs/developer/design/snesfas-vanka-feasibility-study.md b/docs/developer/design/snesfas-vanka-feasibility-study.md new file mode 100644 index 00000000..37dee221 --- /dev/null +++ b/docs/developer/design/snesfas-vanka-feasibility-study.md @@ -0,0 +1,337 @@ +--- +title: "Feasibility study: a scalable saddle-point smoother for Stokes SNESFAS" +--- + +# Feasibility study — scalable saddle smoother for Stokes FAS + +**Bottom line: GO — demonstrated with a working, mesh-independent prototype. +[Revised 2026-06-15 after expert input + experiments — supersedes the earlier +NO-GO.]** Vanka works on UW3's simplex Taylor-Hood Stokes; the earlier failure was +PETSc's *stock* `-pc_patch_construct_type vanka`/`star` heuristic mis-constructing +patches for continuous-P1 simplices, **not** Vanka. With patches built as *the +support of each pressure basis function* (one pressure DOF + its B-coupled velocity +DOFs), driven by **`PCASM` with custom index sets** and exact local LU mini-Stokes +solves, and used as the smoother of a geometric multigrid on the full saddle, the +solver is **mesh-independent: 5 → 5 → 6 outer FGMRES iterations across a 16× growth +in DOFs (1.2k → 19k)**. The one non-obvious ingredient: the smoother must be a +**Krylov (GMRES) smoother** wrapping the Vanka PC — an undamped Richardson smoother +amplifies the additive-Schwarz spectrum and diverges. So the scalable-smoother piece +is **done in prototype**. **Caveat on the payoff:** the iteration count is +mesh-independent, but at moderate **2-D** sizes it is *not* faster in wall-clock than +a sparse direct solve / fieldsplit+FMG (2-D direct solves are cheap); Vanka's +wall-clock advantage is asymptotic in 2-D (~10⁵–10⁶ DOFs) and real in **3-D and +large-parallel** runs, where direct solves scale badly. So this is the tool for +big/3-D/parallel Stokes, not a speed-up for typical 2-D problems. What remains is +productionisation (wrap in FAS — already proven with an LU smoother; per-level +pressure nullspace; parallel patch construction; a UW3 API). Until that lands, FMG + +Newton/Picard/nested ships and LU-smoothed FAS is the moderate-size nonlinear option. + +> The detailed reasoning below that concluded NO-GO was **correct about stock +> `PCPATCH` but wrong about Vanka in general**. Read it as "stock PCPATCH +> constructions are the wrong tool", then jump to *Correction* near the end for the +> working approach and the revised path forward. + +## Objective + +Decide — *before* a major rewrite — whether a production-scalable Stokes FAS is +achievable, by answering whether a cheap saddle-point smoother can be made to work +in UW3. + +## Method + +Staged, cheapest-fatal-risk-first, with the key move of **decoupling the smoother +from FAS**: test each smoother candidate as a plain *linear* preconditioner on a +constant-viscosity UW3 Stokes solve first (a `pc_type` swap), since FAS itself is +already proven. Option sets were lifted from PETSc's own CI-tested example +`src/snes/tutorials/ex62.c`. + +## Findings + +### Stage 0 — the known-good Vanka recipe exists (and my first attempt was mis-configured) + +`ex62.c`'s `2d_q1_p0_gmg_vanka` test gives the authoritative incantation: +`-pc_type patch -pc_patch_partition_of_unity 0 -pc_patch_construct_codim 0 +-pc_patch_construct_type vanka -..._sub_pc_type lu -mg_coarse_pc_type svd`. Note: +**every passing Vanka test in ex62 uses `dm_plex_simplex 0` (quads), Q1–P0 +(discontinuous pressure)** — there is no CI-tested Vanka for simplex Taylor-Hood. + +### Stage 1 — Vanka does not work on UW3's Stokes (the crux, R2) + +PCPATCH Vanka as the outer linear PC on a constant-viscosity UW3 Stokes solve, with +the corrected recipe, fails **uniformly**: + +| variant | result | +|---|---| +| P1-continuous, sub LU | **zero pivot** (singular patch) | +| P0/P1-discontinuous, sub LU | **zero pivot** | +| `+ pc_use_amat` (true saddle Jac, not UW3's Schur Pmat) | **zero pivot** | +| sub `svd` (tolerates singular patches), P1-cont | builds, but **no convergence** (400 its, no progress) | +| sub `svd`, P0-disc | builds, **crawls** to 0.00274 vs 0.00319 ref, 427 its, no convergence | +| **quad** mesh (ex62's Vanka element family), all of the above | **identical** failure pattern | + +So the failure is **not** element-type (simplex vs quad), pressure continuity, or +sub-solver. The patches are genuinely singular (SVD removes the pivot but leaves an +ineffective preconditioner). The obstruction is in how `PCPATCH` constructs/assembles +the local patch operators (and their boundary conditions / local nullspaces) from +UW3's DM/DS — the same interface Firedrake feeds correctly when it runs DMPlex Stokes +Vanka. UW3 confirmed to build a **separate Schur-structured Pmat** (`1/μ` pressure +mass; `petsc_generic_snes_solvers.pyx:4381`), which is part of why the default path +hands Vanka the wrong operator, but `pc_use_amat` did not rescue it — the problem is +deeper than operator selection. + +### Braess–Sarazin hedge — works as a PC, too weak as a smoother + +A *correct* Braess–Sarazin (diagonal velocity block `fieldsplit_velocity_pc_type=jacobi` ++ `selfp` Schur on the true operator via `pc_use_amat`) **converges as an outer PC** +(reason 3, exact reference solution) — options-only, no patches, no zero pivot. But +as a **FAS level smoother it fails** (`DIVERGED_INNER`): it needs ~63 iterations to +converge as a PC, i.e. it is a *weak* relaxation, so a few sweeps per smooth don't +reduce error. A stronger variant (SOR velocity + full Schur) converged at ref2 (5 +cycles) but was **22× slower than LU and failed at ref3** — unreliable and expensive. + +### The smoother-candidate matrix + +| smoother | cheap? | effective? | scalable? | status in UW3 | +|---|---|---|---|---| +| monolithic **LU** | no | yes | no | works; fastest at moderate size; O(N^1.5) wall | +| **Vanka** (PCPATCH) | yes | yes | yes | **blocked** — singular/ineffective patches | +| **Braess–Sarazin** (diag) | yes | no (weak) | yes | works as PC, fails as smoother | +| BS + SOR velocity | ~ | unreliable | ? | 22× slower, fails at ref3 | +| heavy fieldsplit-Schur | no | yes | ~ | 9–17× slower than LU | + +No candidate is simultaneously cheap, effective, and configurable. The only effective +smoothers are expensive (LU, heavy fieldsplit); the only cheap one (BS) is too weak. + +## De-risking step 1 — DONE: simplex Vanka fails in clean PETSc too + +The proposed de-risk was to test Vanka *outside* UW3. Rather than re-write a Stokes +solver in petsc4py, the cleanest reference is PETSc's own `ex62.c` (DMPlex Stokes, +the source of the recipe). Compiled it (this build has **no 2-D simplex generator** — +`--download-triangle` absent — so a gmsh unit-square `.msh` was loaded via +`-dm_plex_filename ... -dm_plex_boundary_label marker`; UW3 itself only gets simplices +through gmsh for the same reason). Results, **pure PETSc, no UW3**: + +| element / construction | result | +|---|---| +| **quad** Q1–P0, `vanka` (the CI-tested recipe) | ✅ converges, 59 KSP its | +| simplex P2–P1, full LU **and** fieldsplit-Schur (controls) | ✅ converge — mesh/BC are correct | +| simplex P2–P1, `vanka` patches (sub lu / svd) | ❌ KSP **stalls at the initial residual** (29.46 flat for 200 its) | +| simplex P2–P0, `vanka` patches | ❌ diverges | +| simplex P2–P1 / P2–P0, `star` (vertex) patches | ❌ diverges | + +So Vanka works on quads and the same binary solves the simplex mesh fine with +standard solvers, but **every patch-smoother construction is ineffective on simplex +Taylor-Hood** — the KSP makes *zero* progress. This is a PETSc/numerical-methods +fact, independent of UW3. (The earlier UW3 zero-pivot was just the first symptom of +the same underlying problem; "fixing UW3's DM/DS↔PCPATCH interface" would **not** have +helped, since clean PETSc fails identically.) + +**Why:** stock `PCPATCH` Vanka/star patches are built for low-order, quad/structured, +discontinuous-pressure elements. Effective patch smoothing for simplex Stokes needs +specialist constructions from the literature — typically a patch-smoothable stable +element (e.g. **Scott–Vogelius on barycentrically-refined/Alfeld meshes**) plus +**macro-element patches** (the Farrell–Mitchell–Wechsung programme). That is a change +of *element and mesh*, not a smoother option. + +## Correction (2026-06-15): custom-IS PCASM Vanka works — the NO-GO was wrong + +External expert input reframed Vanka as *"a specialised overlapping-Schwarz +preconditioner whose patches are defined by the pressure space, not by geometry"*: +loop over pressure DOFs, use the FE sparsity of the divergence block **B** to gather +the coupled velocity DOFs, extract the local mixed matrix, factor it, and run +additive/multiplicative Schwarz — i.e. **`PCASM`/`PCGASM` with custom index sets**, +the patches being the *support of each pressure basis function*. This is standard for +Taylor-Hood, MINI, Scott–Vogelius, variable-viscosity convection, etc. The stock +`PCPATCH` `vanka`/`star` constructs I tested are *not* this — they apply a fixed +topological heuristic that misfires for continuous-P1 simplices. + +Tested directly on UW3's true saddle Jacobian (open-top, no pressure nullspace): + +``` +saddle 1110×1110, fields [velocity, pressure], n_vel=968 n_pres=142 +patches = 142 (one per pressure DOF), size 11–45 (mean 32.5) +PCASM-Vanka (RAS and additive): CONVERGED, 312 fgmres its, rel_err 2.2e-6 +``` + +One-level iteration count vs mesh size — the smoother signature: + +| cellSize | ndof | n_pres | 1-level Vanka its | +|---|---|---|---| +| 0.15 | 555 | 75 | 147 | +| 0.10 | 1110 | 142 | 312 | +| 0.07 | 2479 | 303 | 688 | + +Iterations grow ~with 1/h (no coarse correction) — high-frequency error is removed, +low-frequency is left for the MG coarse grid. That is precisely a multigrid smoother. +So **the failure was the patch *construction*, not Vanka**, and the smoother exists +for UW3's element. + +## The working recipe (prototype: `vanka_mg_WORKING.py`) + +A geometric multigrid on the **full** Stokes saddle, custom-IS Vanka smoother: + +1. **Hierarchy + Galerkin.** PCMG over UW3's `dm_hierarchy` (built by `refinement=N`), + interpolation per level from `DMCreateInterpolation`, coarse operators by Galerkin + `pc_mg_galerkin both` (the FMG path — UW3 has no coarse-DM callbacks). Block-diagonal + velocity/pressure interpolation keeps the coarse operators saddle-structured. +2. **Per-level Vanka smoother.** For each level: from the level's field decomposition + + the operator's row sparsity, build one patch per pressure DOF = {pressure DOF} ∪ + {B-coupled velocity DOFs}; install a `PCASM` (RESTRICT) with those index sets and + exact `sub_pc_type lu` mini-Stokes solves. (Set subdomains after a `PC.reset()` + + re-attach operator, since they must precede `PCSetUp`.) +3. **Krylov smoother — the key.** Wrap the Vanka PC in **`ksp_type gmres`, ~6 its** + per level. Richardson (even damped) diverges because additive Schwarz has spectral + radius > 1; GMRES self-stabilises it. Outer solver is FGMRES (flexible, since the + smoother is now a Krylov method). +4. **Coarse solve:** LU. + +Measured: **5 / 5 / 6 / 5 outer iterations at ndof 1.2k / 4.8k / 19k / 76k** — +mesh-independent. + +### Timing — and the right competitor is FMG, not LU + +A direct (LU) solve is a *strawman* competitor: if a full-scale factorization is +affordable you would just use it directly, not as a smoother — and in 3-D / parallel +that option disappears (the sparse direct solver, MUMPS, scales ~O(N²) in 3-D and is +unreliable at large core counts). The honest competitor is **FMG** (velocity-block +geometric MG inside the Schur fieldsplit, #231). + +Linear 2-D Stokes, same problem (FMG numbers include assembly/JIT — not a clean +linear-solve isolation; Vanka is linear-solve only): + +| | 1.2k | 4.8k | 19k | 76k | iters | +|---|---|---|---|---|---| +| **FMG** total solve | (JIT) | ~1.7s | ~1.8s | ~4.7s | outer Schur = 1 | +| **Vanka-MG** linear solve | 0.02s | 0.13s | 1.1s | 5.7s | 5–6 | +| LU-smoother MG (≈ direct) | 0.01s | 0.07s | 0.54s | ~5s | 1 | + +For **linear 2-D Stokes the three are in the same ballpark** — FMG is mature and +slightly ahead, the LU/direct path is cheap because 2-D fill-in is modest, and +Vanka-MG is competitive but not a winner. So Vanka does **not** earn its keep on easy +(linear, moderate-2-D) problems. + +**Where Vanka-MG wins — and the reason to build it:** + +- **3-D and large-parallel**, where the direct solves that FMG's coarse grid and the + LU paths lean on scale badly / become unreliable, while Vanka's local patch solves + scale ~O(N) and parallelise naturally. (3-D and parallel timing is the obvious next + measurement.) +- **Strongly nonlinear / plastic rheology** — the decisive case. Fieldsplit-Schur+FMG + assumes a good Schur (pressure) approximation, which degrades when the viscosity + varies wildly (viscoplastic yield, thermal runaway). A **full-saddle Vanka-FAS** + smooths the coupled nonlinear system directly, and — the key point — the **coarse + problems are better conditioned** (plasticity localises; on coarse grids the yield + is smeared/milder), so the nonlinear coarse correction is especially effective. + This is exactly the regime where Newton+FMG needs many iterations or stalls + (cf. the viscoplastic results in `snesfas-feasibility.md`, where Newton, Picard + *and* LU-FAS all struggled). + +**Takeaway:** the value proposition is *not* "faster on linear 2-D" — there FMG is +fine. It is **robust, scalable nonlinear Stokes in 3-D / parallel / plasticity**, +where the fieldsplit-Schur assumptions and direct solves break down. + +### First data point — FMG vs FAS on hard benchmarks (`benchmark_fmg_vs_fas.py`) + +FMG (Newton + fieldsplit-Schur + velocity FMG, `saddle_preconditioner=1/η`) vs FAS +(snes_type=fas, LU smoother — *not yet Vanka*), open-top so no nullspace, +refinement=2: + +*SolCx viscosity step (linear), Δη = 1 → 10⁶:* both **converge at every contrast**. +FMG is faster (1–2 outer iterations, 7–40 s) but its time *grows* with the jump +(12 s → 40 s from 10⁴ → 10⁶ as the velocity-block MG degrades); FAS-LU is flat +(~62 s, 2 cycles, robust but slow). So with the right Schur preconditioner FMG +handles the discontinuous-viscosity benchmark well; FAS's edge only shows as the +contrast becomes extreme. + +*Viscoplastic yield (nonlinear), τ_y = 10 → 0.25:* the nonlinear iteration count +favours FAS — at τ_y = 1, **FAS converges in 3 nonlinear cycles where FMG+Newton +needs 9 iterations** (the nonlinear coarse correction at work, as predicted). Below +τ_y ≈ 0.5 *both* fail — the hard-yielding regime is continuation/regularisation +territory for every solver (consistent with `snesfas-feasibility.md`). FAS is slower +in wall-clock here only because it uses the LU smoother; the Vanka smoother is what +would make the 3-vs-9 nonlinear-iteration advantage also a wall-clock win at scale. + +**Reading:** FMG is the right default for linear / moderate problems; FAS's +robustness advantage is real but currently shows as *fewer nonlinear iterations* in +the plastic regime, not yet as wall-clock (LU smoother). Wiring in the Vanka smoother ++ pushing to 3-D / extreme contrast is where the combination should pull clearly +ahead. The very-hard yield regime needs continuation regardless of solver. + +### Three turnkey choices — FMG vs GAMG vs FAS-Vanka (`benchmark_3way.py`) + +With the Vanka smoother actually wired into FAS (custom-IS injection, GMRES Krylov +smoother), the three production options compared on the same problems (open top, +res 16 / refinement 1): + +*SolCx viscosity step (linear):* + +| Δη | FMG | GAMG | FAS-Vanka | +|---|---|---|---| +| 1 | 1 it, 3.9s | 2 it, 4.9s | 1 it, 3.0s | +| 10³ | 1 it, 2.8s | 3 it, 14s | 1 it (gmres-15 smoother) | +| 10⁶ | 2 it, 5.8s | 8 it, **99s** | **FAIL** | + +*Viscoplastic yield (nonlinear):* + +| τ_y | FMG | GAMG | FAS-Vanka | +|---|---|---|---| +| 10 | 1 it | 2 it | 1 it | +| 1 | 10 it | 10 it | **2 it** | +| 0.3 | FAIL | FAIL | FAIL | + +**What this says — and the honest "how much tuning":** + +- **FMG** is the **robust all-rounder and the right default.** Its + `saddle_preconditioner = 1/η` makes the Schur (pressure) approximation + viscosity-robust, so it sails through the 10⁶ contrast (5.8 s) where the others + struggle, and it is fastest almost everywhere. Needs a refinement hierarchy. +- **GAMG** is the **no-hierarchy fallback** — robust but it *cliffs in cost* at high + contrast (99 s at 10⁶, 8 outer iterations). Use when there is no geometric + hierarchy. +- **FAS-Vanka** is the **nonlinear / plasticity specialist.** It crushes the + viscoplastic case (**2 nonlinear iterations vs 10** for the others — the coarse + problem is better conditioned, exactly as expected) and handles moderate viscosity + contrast (10³) once the Krylov smoother is bumped to ~15 inner iterations. But with + *additive* PCASM Vanka it **fails at extreme contrast (10⁶)** no matter how hard you + smooth — that needs a **multiplicative** Vanka (PCGASM / a coloured patch sweep) + and/or viscosity-aware coarsening, which is genuine development, not a flag. + +**Bottom line for "2–3 good choices, not much tuning":** FMG (default) and GAMG +(fallback) are the two robust, low-tuning options today. FAS-Vanka is a strong third +**specifically for strongly nonlinear / plastic problems** (where it beats both on +iteration count) and for 3-D / large-parallel; making it a *robust* turnkey peer of +FMG across all regimes needs the multiplicative-Vanka smoother (the extreme-contrast +gap) plus the productionisation items (per-level nullspace, parallel patches, UW3 +API). The plasticity win, though, is real and in hand now. + +## Path forward (GO — productionise the prototype) + +1. **Wrap in FAS for the nonlinear case.** The smoother is the hard part and it is now + solved; FAS adds the nonlinear coarse correction, already proven to work with an LU + smoother (`snesfas-feasibility.md`). Swap LU → the Vanka GMRES smoother above. +2. **Per-level pressure nullspace** for enclosed / free-slip problems (the prototype + used an open top to avoid it). Register it at the DM/DS level so each level inherits + it. +3. **Parallel patch construction** (PCASM is parallel; build the pressure-support index + sets rank-locally) and a **UW3 API** (a `smoother="vanka"` path that builds the IS + per level and installs the smoother — ≈ the prototype's ~40 lines, generalised). +4. **Open optimisation:** whether `PCPATCH` can be configured to build these exact + pressure-support patches (would make it options-only, no custom-IS code). + +The earlier "research only / different element" conclusion is fully withdrawn: a +mesh-independent Vanka multigrid for UW3's existing simplex Taylor-Hood Stokes is +demonstrated; what remains is engineering. + +## Meanwhile + +`FMG + Picard / nested iteration` remains the production workhorse for nonlinear +Stokes; LU-smoothed FAS is a correct, moderate-size-faster option where nonlinear +robustness matters more than asymptotic scaling. Neither is blocked; both ship today. + +## Reproduction + +`~/+Simulations/snesfas_spike/`: `vanka_stage1.py` (UW3 Stage 1 matrix), `/tmp`-staged +probes `vanka_quad.py`, `braess.py`, `bs_fas.py`, `bs_sor.py`. Clean-PETSc reference: +`petsc-custom/petsc/src/snes/tutorials/ex62.c` (compiled in the `amr-dev` env; gmsh +mesh `/tmp/square.msh` from `/tmp/mksquare.py`; driven with the `vanka`/`star` patch +options above). PETSc `/*TEST*/` block `*_vanka` suffixes are the recipe source. diff --git a/docs/examples/snesfas_investigation/README.md b/docs/examples/snesfas_investigation/README.md new file mode 100644 index 00000000..8a3ac34c --- /dev/null +++ b/docs/examples/snesfas_investigation/README.md @@ -0,0 +1,49 @@ +# SNESFAS / Vanka / grid-sequencing — research prototypes + +> **Status: research prototypes, not production code.** These scripts back the design +> notes in `docs/developer/design/` (`snesfas-feasibility.md`, +> `snesfas-vanka-feasibility-study.md`, `multilevel-nonlinear-stokes-strategy.md`). +> They are preserved so the *working* multilevel-Stokes implementations are not lost +> when the investigation is parked. They are **not tested, not tier-classified, and +> make no `src/` changes** — everything here runs through `petsc_options` and +> petsc4py on top of stock UW3. Wall-clock numbers in the docs are soft (JIT / noise); +> trust the iteration counts. + +Run any of them with the worktree's environment, e.g. +`pixi run -e python docs/examples/snesfas_investigation/