Summary
Stokes_Constrained works in parallel, but the 8-rank run is roughly 15.6x slower than the equivalent Nitsche free-slip solve (434.92s vs 27.93s wall time), even though the constrained solve takes only two KSP iterations. This suggests the bottleneck is setup, assembly, or field handling for the block-constrained system rather than Krylov convergence.
This came up while validating the Zhong et al. spherical-shell internal-boundary benchmark path using SphericalShellInternalBoundary() and exterior free-slip boundaries.
Reproducer Context
External benchmark script:
cd /Users/tgol0006/uw_folder/uw3_git_gthyagi_latest/underworld3/.claude/worktrees/mantle-convection-benchmarks
pixi run -e amr-dev mpirun -np 8 python /Users/tgol0006/uw_folder/uw3-mantle-convection-benchmarks/benchmarks/005_internal_boundary_delta_probe.py
The script uses:
- uw.meshing.SphericalShellInternalBoundary()
- ri = 0.55, rint = 0.775, ro = 1.0
- cellSize = 0.125 under 8 MPI ranks
- velocity P2, pressure P1
- internal radial natural load on the Internal boundary
- exterior free slip on Upper and Lower
The Nitsche comparison is:
pixi run -e amr-dev mpirun -np 8 python /Users/tgol0006/uw_folder/uw3-mantle-convection-benchmarks/benchmarks/005_internal_boundary_delta_probe.py -uw_freeslip_type nitsche
Timing Evidence
Measured with /usr/bin/time -p on the same machine and mesh resolution:
| Method | Wall time | KSP iterations | Result |
| --- | --- | --- | --- |
| Nitsche free slip | 27.93s | 1 | pass |
| Stokes_Constrained | 434.92s | 2 | pass |
| constrained with temporary degree-1 multiplier test | 409.62s | 2 | pass |
The temporary degree-1 multiplier test reduced wall time by only about 6% (434.92s to 409.62s), so multiplier polynomial degree is unlikely to be the main bottleneck.
Observed Metrics
8-rank constrained run:
{
"cellsize": 0.125,
"ksp_iterations": 2,
"ksp_reason": 2,
"l": 2,
"max_boundary_area_relative_error": 0.009919085347124204,
"max_y_l0_norm_error": 0.009849659575170255,
"mpi_size": 8,
"passed": true,
"snes_reason": 5,
"stokes_tolerance": 1e-05,
"upper_characteristic_velocity": 0.012565603878091148,
"upper_normal_velocity_rms": 1.631821482362003e-05
}
8-rank Nitsche comparison:
{
"cellsize": 0.125,
"ksp_iterations": 1,
"ksp_reason": 2,
"l": 2,
"max_boundary_area_relative_error": 0.009919085347124204,
"max_y_l0_norm_error": 0.009849659575170255,
"mpi_size": 8,
"passed": true,
"snes_reason": 5,
"stokes_tolerance": 1e-05,
"upper_characteristic_velocity": 0.010110860387364697,
"upper_normal_velocity_rms": 2.2416675430165106e-05
}
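The two metric dumps are easier to compare when only the differing fields are shown. The following is a minimal sketch that diffs the values reported above; `relative_leakage` is a derived quantity introduced here for illustration (it is not reported by the benchmark script) and assumes the RMS normal velocity and the characteristic velocity share the same units.

```python
# Metrics copied from the two 8-rank runs reported above.
constrained = {
    "cellsize": 0.125, "ksp_iterations": 2, "ksp_reason": 2, "l": 2,
    "max_boundary_area_relative_error": 0.009919085347124204,
    "max_y_l0_norm_error": 0.009849659575170255,
    "mpi_size": 8, "passed": True, "snes_reason": 5,
    "stokes_tolerance": 1e-05,
    "upper_characteristic_velocity": 0.012565603878091148,
    "upper_normal_velocity_rms": 1.631821482362003e-05,
}
nitsche = {
    "cellsize": 0.125, "ksp_iterations": 1, "ksp_reason": 2, "l": 2,
    "max_boundary_area_relative_error": 0.009919085347124204,
    "max_y_l0_norm_error": 0.009849659575170255,
    "mpi_size": 8, "passed": True, "snes_reason": 5,
    "stokes_tolerance": 1e-05,
    "upper_characteristic_velocity": 0.010110860387364697,
    "upper_normal_velocity_rms": 2.2416675430165106e-05,
}

# Keep only the keys whose values differ between the two runs.
differing = {
    key: (constrained[key], nitsche[key])
    for key in constrained
    if constrained[key] != nitsche.get(key)
}


def relative_leakage(metrics):
    """Illustrative ratio: RMS normal velocity on Upper over the
    characteristic velocity on Upper (assumed to be in the same units)."""
    return metrics["upper_normal_velocity_rms"] / metrics["upper_characteristic_velocity"]


for key, (c_val, n_val) in sorted(differing.items()):
    print(f"{key}: constrained={c_val} nitsche={n_val}")
print(f"relative leakage: constrained={relative_leakage(constrained):.2e} "
      f"nitsche={relative_leakage(nitsche):.2e}")
```

Only the iteration count and the two Upper velocity metrics differ; the mesh-dependent error measures are identical, which is consistent with both runs using the same mesh.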
Current Diagnosis
The linear solve is not the issue: constrained free slip reports only 2 KSP iterations. The expensive part is likely one or more of:
- setup/assembly of the extra Lagrange-multiplier fields for Upper and Lower;
- boundary residual/Jacobian registration for the multiplier coupling;
- grouping pressure plus multipliers into the Schur block;
- _constrain_interior_multipliers_in_section() section work in parallel;
- creation or handling of full-domain multiplier fields when only the boundary trace is physical;
- repeated DM/section/fieldsplit setup that could be cached when mesh and constraints are unchanged.
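One way to separate these candidates is to accumulate wall time per named stage around each suspected step. The sketch below uses only the Python standard library; the stage names and the placeholder workloads are hypothetical stand-ins, not Underworld3 calls. Under MPI each rank would accumulate its own totals, so the slowest rank is the one to inspect. PETSc's `-log_view` option could complement this for PETSc-side events.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall time per named stage, in seconds (per process).
stage_seconds = defaultdict(float)


@contextmanager
def stage(name):
    """Add the wall time spent inside the `with` block to `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[name] += time.perf_counter() - start


def report():
    """Return (name, seconds) pairs, most expensive stage first."""
    return sorted(stage_seconds.items(), key=lambda item: item[1], reverse=True)


# Hypothetical usage: replace the placeholder bodies with the real
# setup, assembly, and solve calls being investigated.
with stage("multiplier_section_setup"):
    sum(range(100_000))  # stand-in for section/constraint setup work
with stage("solve"):
    sum(range(1_000))  # stand-in for the solver call

for name, seconds in report():
    print(f"{name:32s} {seconds:.6f}s")
```

Because totals accumulate across repeated entries, wrapping a step that runs on every solve also shows whether it is being repeated unnecessarily.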
Candidate Fix Directions
- Profile Stokes_Constrained setup and assembly with PETSc/UW timing to locate the exact hotspot.
- Optimize _constrain_interior_multipliers_in_section() if Python-side set/section operations dominate.
- Consider boundary-only or submesh multiplier fields instead of full-domain multiplier fields with interior DOFs constrained out.
- Cache constrained section/fieldsplit setup when the mesh, fields, and constraint boundaries are unchanged.
- Check whether grouped [p, lambda] Schur setup rebuilds too much state each solve.
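The caching idea above can be sketched as a single-entry cache keyed on a signature of everything the constrained setup depends on. This is an illustration only, not the Underworld3 API: `SetupKey`, `SetupCache`, `mesh_version`, and the field names are all hypothetical, and the sketch assumes the solver can cheaply tell when the mesh or constraint set has changed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetupKey:
    """Hypothetical signature of the inputs to the constrained setup."""
    mesh_version: int             # assumed counter bumped on any mesh change
    field_names: tuple            # e.g. ("u", "p", "lambda_Upper", "lambda_Lower")
    constraint_boundaries: tuple  # e.g. ("Lower", "Upper")


class SetupCache:
    """Rebuild the expensive setup only when its signature changes."""

    def __init__(self, build):
        self._build = build   # callable: SetupKey -> setup object
        self._key = None
        self._value = None
        self.builds = 0       # number of rebuilds, useful for diagnostics

    def get(self, key):
        if key != self._key:
            self._value = self._build(key)
            self._key = key
            self.builds += 1
        return self._value


def expensive_setup(key):
    # Stand-in for the section/fieldsplit construction.
    return {"fields": key.field_names, "boundaries": key.constraint_boundaries}


cache = SetupCache(expensive_setup)
fields = ("u", "p", "lambda_Upper", "lambda_Lower")
key = SetupKey(mesh_version=0, field_names=fields,
               constraint_boundaries=("Lower", "Upper"))

cache.get(key)  # first call builds
cache.get(key)  # unchanged signature: reuses the cached setup
cache.get(SetupKey(1, fields, ("Lower", "Upper")))  # mesh changed: rebuilds
print(cache.builds)
```

A single-entry cache is enough here because a solver normally has one current mesh and constraint set; the `builds` counter makes it easy to confirm whether setup is actually being repeated across solves.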
Related Work
PR #242 fixes SphericalShellInternalBoundary() boundary labels so the internal-boundary benchmark can use the built-in mesh path directly. This issue is separate: after that fix, the constrained solve is correct but slow in parallel.