Stokes_Constrained is slow in parallel despite low KSP iterations #244

## Summary

`Stokes_Constrained` works in parallel, but on 8 ranks it is about 15x slower than the equivalent Nitsche free-slip solve (`434.92s` versus `27.93s`), even though the constrained solve takes only two KSP iterations. This suggests the bottleneck is setup, assembly, or field handling for the block-constrained system, not Krylov convergence.

This came up while validating the Zhong et al. spherical-shell internal-boundary benchmark path using `SphericalShellInternalBoundary()` and exterior free-slip boundaries.

## Reproducer Context

External benchmark script:

```bash
cd /Users/tgol0006/uw_folder/uw3_git_gthyagi_latest/underworld3/.claude/worktrees/mantle-convection-benchmarks
pixi run -e amr-dev mpirun -np 8 python /Users/tgol0006/uw_folder/uw3-mantle-convection-benchmarks/benchmarks/005_internal_boundary_delta_probe.py
```

The script uses:

- `uw.meshing.SphericalShellInternalBoundary()`
- `ri = 0.55`, `rint = 0.775`, `ro = 1.0`
- `cellSize = 0.125` under 8 MPI ranks
- velocity `P2`, pressure `P1`
- internal radial natural load on the `Internal` boundary
- exterior free slip on `Upper` and `Lower`

The Nitsche comparison is:

```bash
pixi run -e amr-dev mpirun -np 8 python /Users/tgol0006/uw_folder/uw3-mantle-convection-benchmarks/benchmarks/005_internal_boundary_delta_probe.py -uw_freeslip_type nitsche
```

## Timing Evidence

Measured with `/usr/bin/time -p` on the same machine and mesh resolution:

| Method | Wall time | KSP iterations | Result |
| --- | ---: | ---: | --- |
| Nitsche free slip | `27.93s` | `1` | pass |
| `Stokes_Constrained` | `434.92s` | `2` | pass |
| constrained with temporary degree-1 multiplier test | `409.62s` | `2` | pass |

The temporary degree-1 multiplier test reduced wall time by only about 6% (`409.62s` versus `434.92s`), so multiplier polynomial degree is unlikely to be the main bottleneck.
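For reference, the ratios implied by the table can be recomputed with a few lines of Python. This is only arithmetic on the three wall times reported above; the dictionary keys are labels chosen here for illustration and do not come from the benchmark script.

```python
# Wall times copied from the timing table (seconds, 8 MPI ranks).
wall_times = {
    "nitsche": 27.93,
    "constrained": 434.92,
    "constrained_degree1": 409.62,
}

# Slowdown of each run relative to the Nitsche free-slip baseline.
baseline = wall_times["nitsche"]
slowdown = {name: t / baseline for name, t in wall_times.items()}

# Fraction of constrained wall time saved by the degree-1 multiplier test.
degree1_saving = 1.0 - wall_times["constrained_degree1"] / wall_times["constrained"]

for name, factor in slowdown.items():
    print(f"{name:>20s}: {wall_times[name]:8.2f}s  ({factor:5.1f}x Nitsche)")
print(f"degree-1 multiplier saving: {degree1_saving:.1%}")
```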

## Observed Metrics

8-rank constrained run:

```json
{
  "cellsize": 0.125,
  "ksp_iterations": 2,
  "ksp_reason": 2,
  "l": 2,
  "max_boundary_area_relative_error": 0.009919085347124204,
  "max_y_l0_norm_error": 0.009849659575170255,
  "mpi_size": 8,
  "passed": true,
  "snes_reason": 5,
  "stokes_tolerance": 1e-05,
  "upper_characteristic_velocity": 0.012565603878091148,
  "upper_normal_velocity_rms": 1.631821482362003e-05
}
```

8-rank Nitsche comparison:

```json
{
  "cellsize": 0.125,
  "ksp_iterations": 1,
  "ksp_reason": 2,
  "l": 2,
  "max_boundary_area_relative_error": 0.009919085347124204,
  "max_y_l0_norm_error": 0.009849659575170255,
  "mpi_size": 8,
  "passed": true,
  "snes_reason": 5,
  "stokes_tolerance": 1e-05,
  "upper_characteristic_velocity": 0.010110860387364697,
  "upper_normal_velocity_rms": 2.2416675430165106e-05
}
```
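A quick way to see where the two runs differ is to diff the metric dictionaries. The sketch below copies a subset of the values from the two JSON blocks above; the variable names are chosen here for illustration, and the comparison is plain key-by-key equality, not any tolerance check used by the benchmark itself.

```python
# Subset of the metrics reported above for each 8-rank run.
constrained = {
    "ksp_iterations": 2,
    "max_boundary_area_relative_error": 0.009919085347124204,
    "max_y_l0_norm_error": 0.009849659575170255,
    "upper_characteristic_velocity": 0.012565603878091148,
    "upper_normal_velocity_rms": 1.631821482362003e-05,
}
nitsche = {
    "ksp_iterations": 1,
    "max_boundary_area_relative_error": 0.009919085347124204,
    "max_y_l0_norm_error": 0.009849659575170255,
    "upper_characteristic_velocity": 0.010110860387364697,
    "upper_normal_velocity_rms": 2.2416675430165106e-05,
}

# Keys whose values are not identical between the two runs.
differing = sorted(k for k in constrained if constrained[k] != nitsche[k])

for key in differing:
    print(f"{key}: constrained={constrained[key]!r}  nitsche={nitsche[key]!r}")
```

With these values, only the iteration count and the two `upper_*` velocity metrics differ; the two error metrics are reported as identical for both methods.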

## Current Diagnosis

Krylov convergence is not the issue: constrained free slip reports only `2` KSP iterations, against `1` for Nitsche. The expensive part is likely one or more of:

- setup/assembly of the extra Lagrange-multiplier fields for `Upper` and `Lower`;
- boundary residual/Jacobian registration for the multiplier coupling;
- grouping pressure plus multipliers into the Schur block;
- `_constrain_interior_multipliers_in_section()` section work in parallel;
- creation or handling of full-domain multiplier fields when only the boundary trace is physical;
- repeated DM/section/fieldsplit setup that could be cached when mesh and constraints are unchanged.
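One low-effort way to separate these candidates is to bracket each phase with a wall-clock timer and rank the totals. The following is a minimal sketch using only the Python standard library; `timed_phase`, `phase_totals`, and the phase labels are hypothetical names introduced here, and the loops are stand-in workloads, not Underworld3 calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall time per phase label (seconds, this process only).
phase_totals = defaultdict(float)


@contextmanager
def timed_phase(name):
    """Add the wall time spent inside the `with` block to phase_totals[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_totals[name] += time.perf_counter() - start


def report(totals):
    """Return (phase, seconds) pairs sorted from most to least expensive."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Stand-in workloads; in a real run these blocks would wrap the suspected
# setup, assembly, and solve calls instead.
with timed_phase("section_constraints"):
    sum(i * i for i in range(200_000))
with timed_phase("ksp_solve"):
    sum(range(1_000))

for phase, seconds in report(phase_totals):
    print(f"{phase:>22s}: {seconds:.6f}s")
```

Under MPI each rank accumulates its own totals, so the per-phase maximum across ranks is usually the number to compare against the measured wall time.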

## Candidate Fix Directions

- Profile `Stokes_Constrained` setup and assembly with PETSc/UW timing to locate the exact hotspot.
- Optimize `_constrain_interior_multipliers_in_section()` if Python-side set/section operations dominate.
- Consider boundary-only or submesh multiplier fields instead of full-domain multiplier fields with interior DOFs constrained out.
- Cache constrained section/fieldsplit setup when the mesh, fields, and constraint boundaries are unchanged.
- Check whether grouped `[p, lambda]` Schur setup rebuilds too much state each solve.
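The caching direction above can be sketched as a small memoising wrapper keyed on whatever defines the constrained setup. Everything in this example is hypothetical: `SetupKey`, `SetupCache`, and `fake_builder` are illustrative names, and the key fields are an assumption about what would need to be unchanged for a cached section/fieldsplit setup to remain valid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupKey:
    """Hashable description of the inputs a constrained setup depends on (assumed)."""
    mesh_id: int
    field_names: tuple
    constraint_boundaries: tuple


class SetupCache:
    """Build a setup object once per distinct key and reuse it afterwards."""

    def __init__(self, builder):
        self._builder = builder
        self._store = {}
        self.builds = 0  # number of times the expensive builder actually ran

    def get(self, key):
        if key not in self._store:
            self._store[key] = self._builder(key)
            self.builds += 1
        return self._store[key]


def fake_builder(key):
    # Placeholder for the expensive section/fieldsplit construction.
    return {"boundaries": key.constraint_boundaries}


cache = SetupCache(fake_builder)
key = SetupKey(mesh_id=1, field_names=("u", "p"), constraint_boundaries=("Upper", "Lower"))

first = cache.get(key)
second = cache.get(key)  # same key: reused, builder not called again
other = cache.get(SetupKey(2, ("u", "p"), ("Upper", "Lower")))  # new mesh: rebuilt

print(cache.builds)
```

The hard part in practice is the invalidation rule: any change that alters DOF layout (remeshing, adaptivity, a different constraint set) has to produce a different key.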

## Related Work

PR #242 fixes `SphericalShellInternalBoundary()` boundary labels so the internal-boundary benchmark can use the built-in mesh path directly. This issue is separate: after that fix, the constrained solve is correct but slow in parallel.

