Local Aware IBM #1378

Open
danieljvickers wants to merge 82 commits into MFlowCode:master from danieljvickers:local-aware-ibm

Conversation

@danieljvickers
Member

Description

The current code has trouble scaling much past 10k particles due to limitations of the MPI all-reduce sum in the IB force computation. This PR alleviates that by limiting the set of IBs any given rank is aware of to those of its neighbors, which turns the all-reduce into an MPI neighbor computation and removes the communication bottleneck. Supporting this required a substantial overhaul of IB ownership between ranks.
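As a toy illustration of this change (not the MFC code; the rank layout, force arrays, and function names below are invented), restricting the reduction to neighboring ranks recovers the same per-IB force sum whenever only neighboring ranks contribute, while the cost per rank drops from O(num_ranks) to O(num_neighbors):

```python
# Toy sketch: global all-reduce vs. neighbor-only reduction for per-IB
# force sums. Assumption: only ranks whose subdomain touches an IB
# contribute a nonzero partial force, and those ranks are grid neighbors.

def allreduce_forces(partials):
    """Global all-reduce: every rank sums over ALL ranks."""
    total = [0.0] * len(partials[0])
    for rank_partial in partials:
        for i, f in enumerate(rank_partial):
            total[i] += f
    return total

def neighbor_reduce_forces(partials, neighbors_of, rank):
    """Neighbor reduce: a rank sums only itself plus its neighbors."""
    total = [0.0] * len(partials[rank])
    for r in [rank] + neighbors_of[rank]:
        for i, f in enumerate(partials[r]):
            total[i] += f
    return total

# 4 ranks in a row; IB 0 straddles ranks 0-1, IB 1 straddles ranks 2-3.
partials = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [0.0, 4.0]]
neighbors_of = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

global_sum = allreduce_forces(partials)                      # [3.0, 7.0]
local_sum = neighbor_reduce_forces(partials, neighbors_of, 0)
# Rank 0 recovers the full force on its local IB (index 0) from its
# neighbor alone, without ever communicating with ranks 2 and 3.
```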

Type of change

  • Refactor

Testing

TBD

Checklist

  • I added or updated tests for new behavior

AI code reviews

Reviews are not triggered automatically. To request a review, comment on the PR:

  • @coderabbitai review — incremental review (new changes only)
  • @coderabbitai full review — full review from scratch
  • /review — Qodo review
  • /improve — Qodo code suggestions
  • @claude full review — Claude full review (also triggers on PR open/reopen/ready)
  • Add label claude-full-review — Claude full review via label

@github-actions

github-actions Bot commented Apr 23, 2026

Claude Code Review

Head SHA: 9879df3

Files changed (21), including:

  • src/simulation/m_ibm.fpp
  • src/simulation/m_start_up.fpp
  • src/simulation/m_collisions.fpp
  • src/simulation/m_global_parameters.fpp
  • src/common/m_constants.fpp
  • src/common/m_derived_types.fpp
  • src/common/m_mpi_common.fpp
  • src/post_process/m_data_output.fpp
  • src/simulation/m_ib_patches.fpp
  • src/simulation/m_time_steppers.fpp

Findings:

Critical: Stack overflow in s_reduce_ib_patch_array (src/simulation/m_start_up.fpp)

subroutine s_reduce_ib_patch_array()
    type(ib_patch_parameters), dimension(num_ib_patches_max) :: patch_ib_gbl

patch_ib_gbl is a local (stack-allocated) variable dimensioned by num_ib_patches_max, which was just raised from 50,000 to 2,050,000 in src/common/m_constants.fpp. ib_patch_parameters contains a character(LEN=pathlen_max) field (pathlen_max=400) plus ~500 bytes of numeric fields — roughly 900 bytes per element. At 2,050,000 elements this is ~1.8 GB on the call stack, which will segfault on every platform (typical stack limits are 8–64 MB). This variable must be heap-allocated (allocate/deallocate).
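The back-of-envelope arithmetic behind that figure can be checked directly (the per-element byte counts are the review's estimates, not measured sizes of the actual derived type):

```python
# Rough size check for the stack-allocated patch_ib_gbl array.
# Assumed sizes, per the review text: a character(LEN=400) field plus
# ~500 bytes of numeric fields per ib_patch_parameters element.
pathlen_max = 400             # character(LEN=pathlen_max) field
numeric_bytes = 500           # estimated numeric payload per element
num_ib_patches_max = 2_050_000

bytes_per_element = pathlen_max + numeric_bytes     # ~900 B
total_gib = num_ib_patches_max * bytes_per_element / 2**30
# total_gib is about 1.7 GiB on the call stack, versus typical stack
# limits of 8-64 MiB -- hence the guaranteed segfault.
```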


High: Debug print left in production code (src/simulation/m_ibm.fpp)

print *, proc_rank, " New Owner ", patch_ib(k)%gbl_patch_id  ! TODO :: REMOVE THIS DEBUG PRINT

This fires on every ownership transfer for every locally-tracked IB, generating unbounded output at scale. The ! TODO :: REMOVE THIS DEBUG PRINT marker confirms it is unintentional.


Medium: Commented-out code in m_collisions.fpp leaves pid2 lookup absent (src/simulation/m_collisions.fpp)

! call s_get_neighborhood_idx(pid1, pid1) ! global patch ID -> local index call s_get_neighborhood_idx(pid2, pid2)
if (pid1 <= 0 .or. pid2 <= 0) cycle

The comment text contains call s_get_neighborhood_idx(pid2, pid2) — what appears to be a second intended lookup that was accidentally folded into the comment instead of being left as executable code. Neither lookup actually executes: pid1 and pid2 are the raw decoded global IDs from s_decode_patch_periodicity, not local indices. The guard if (pid1 <= 0 .or. pid2 <= 0) cycle would never trigger for valid global IDs (which are ≥ 1), meaning the subsequent patch_ib(pid1) and patch_ib(pid2) accesses use global IDs as local array indices, which is out-of-bounds when num_ibs < num_gbl_ibs. If this is intentional (global IDs equal local indices in the no-MPI / single-rank path), that should be documented; otherwise both lookups need to be uncommented.
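A toy sketch of what the intended lookup-plus-guard buys (hypothetical helper and table; the real code is the Fortran s_get_neighborhood_idx, which overwrites its argument in place rather than returning a value):

```python
# Toy sketch: map a decoded global patch ID to a local index, with a
# non-positive sentinel meaning "not resident on this rank", so the
# `if (pid1 <= 0 .or. pid2 <= 0) cycle` guard can actually fire.

def get_neighborhood_idx(gbl_id, lookup):
    """Return the local index for a global patch ID, or -1 if the
    patch is not locally tracked (stand-in for s_get_neighborhood_idx)."""
    return lookup.get(gbl_id, -1)

lookup = {7: 1, 12: 2}                    # global ID -> local index
pid1 = get_neighborhood_idx(7, lookup)    # resident: local index 1
pid2 = get_neighborhood_idx(99, lookup)   # not resident: -1, guard cycles
# Without the lookup, pid1/pid2 stay as raw global IDs (always >= 1),
# the guard never triggers, and patch_ib(pid) indexes out of bounds.
```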


Low: GPU_UPDATE(device=[patch_ib(ib_idx)%moment]) removed without replacement (src/simulation/m_ibm.fpp)

-            patch_ib(ib_marker)%moment = moment*patch_ib(ib_marker)%mass/(count*cell_volume)
-            $:GPU_UPDATE(device='[patch_ib(ib_marker)%moment]')
+            patch_ib(ib_idx)%moment = moment*patch_ib(ib_idx)%mass/(count*cell_volume)

The per-field device update was removed. If patch_ib(i)%moment is consumed on the GPU before the next full GPU_UPDATE(device=[patch_ib(1:num_ibs)]) (e.g., in a subsequent call to s_ibm_correct_state within the same time step), the device will use a stale value. Verify that a covering full-struct update always precedes any GPU use of %moment after s_compute_moment_of_inertia is called from s_propagate_immersed_boundaries.

@danieljvickers danieljvickers marked this pull request as ready for review May 11, 2026 18:30
@qodo-code-review
Contributor

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

@codecov

codecov Bot commented May 11, 2026

Codecov Report

❌ Patch coverage is 46.40152% with 283 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.23%. Comparing base (28863b1) to head (2c3099e).
⚠️ Report is 1 commit behind head on master.

Files with missing lines Patch % Lines
src/simulation/m_ibm.fpp 30.89% 121 Missing and 11 partials ⚠️
src/simulation/m_start_up.fpp 50.34% 61 Missing and 10 partials ⚠️
src/post_process/m_data_output.fpp 0.00% 32 Missing ⚠️
src/simulation/m_particle_bed.fpp 74.11% 15 Missing and 7 partials ⚠️
src/simulation/m_collisions.fpp 27.27% 13 Missing and 3 partials ⚠️
src/simulation/m_mpi_proxy.fpp 41.66% 6 Missing and 1 partial ⚠️
src/simulation/m_time_steppers.fpp 33.33% 2 Missing ⚠️
src/simulation/m_data_output.fpp 88.88% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1378      +/-   ##
==========================================
- Coverage   64.95%   64.23%   -0.72%     
==========================================
  Files          72       73       +1     
  Lines       18879    19268     +389     
  Branches     1571     1628      +57     
==========================================
+ Hits        12263    12377     +114     
- Misses       5640     5888     +248     
- Partials      976     1003      +27     

☔ View full report in Codecov by Sentry.

danieljvickers and others added 11 commits May 12, 2026 11:16
Re-run ffmt with the corrected 0.4.0 build that includes the
breaks.retain filter preventing line splits immediately before `%`
(member accessor). Fixes three occurrences of `& %sf` in m_ibm.fpp,
one `& %geometry` in m_ibm.fpp, and one `& %beg` in m_time_steppers.fpp.

num_ib_patches_max = 2050000 caused a ~2.25 GB unconditional heap
allocation + full init loop in s_assign_default_values_to_user_inputs
on every startup, crashing the CI debug build even for cases with no IBM.
Also, s_reduce_ib_patch_array had a 2.25 GB stack-allocated local array
that caused SIGSEGV for IBM cases.

Introduce num_ib_patches_max_namelist = 50000 (restoring the pre-particle-bed
budget) for the initial allocation. s_generate_particle_beds grows patch_ib
to num_ib_patches_max via MOVE_ALLOC only when particle beds are actually
being generated. s_reduce_ib_patch_array now uses a heap-allocated local
array sized to num_ibs instead of the full num_ib_patches_max.
When particle beds are used, rank 0 grows patch_ib to num_ib_patches_max
inside s_generate_particle_beds. Non-rank-0 ranks only have the
namelist-sized allocation (num_ib_patches_max_namelist). The num_ibs
scalar is broadcast first, so we can check and grow before the per-patch
MPI_BCAST loop accesses patch_ib(i) for i > num_ib_patches_max_namelist.
@sbryngelson
Member

You had an issue that I think I fixed.

Root cause: num_ib_patches_max = 2,050,000 caused a ~2.25 GB unconditional allocation + full init loop on every startup (even for non-IBM cases), crashing CI debug builds.

Fix — 5 files, 2 commits:

  1. m_constants.fpp: Added num_ib_patches_max_namelist = 50000 — the small initial budget (same as before particle beds were added).
  2. m_global_parameters.fpp + m_start_up.fpp: Initial patch_ib allocation uses num_ib_patches_max_namelist on all ranks (~55 MB instead of ~2.25 GB).
  3. m_particle_bed.fpp: Grows patch_ib to num_ib_patches_max via MOVE_ALLOC only when particle beds are actually being generated (rank 0 only, before the MPI broadcast).
  4. m_mpi_proxy.fpp: Non-rank-0 processes grow patch_ib to match if num_ibs > size(patch_ib) — handles the MPI > 1 case where rank 0 grew for particle beds but other ranks still have the small allocation.
  5. m_start_up.fpp s_reduce_ib_patch_array: Changed the 2.25 GB stack-allocated patch_ib_gbl to a heap-allocated array sized to exactly num_ibs — eliminates the SIGSEGV for IBM cases.

… limit

NIB and the case_validator both now reference num_ib_patches_max_namelist
(50000) instead of num_ib_patches_max (2050000). This constant is the
actual namelist limit; particle beds grow patch_ib beyond it at runtime
but those entries are never specified in the namelist. The fallback values
match the constant, ensuring Homebrew installs (which lack m_constants.fpp)
use the correct limit.

patch_ib is reallocated to num_aware_ibs slots (e.g. 54000 for 3D with
the default ib_neighborhood_radius=1) before the 1-rank and no-MPI copy.
Using patch_ib(1:num_gbl_ibs) crashed when num_gbl_ibs > num_aware_ibs
(e.g. large particle beds on a single MPI rank). Use min() to clamp
the copy, matching the original truncation behavior while avoiding the
out-of-bounds write on the newly allocatable patch_ib_gbl.

For single-rank and no-MPI cases every IB patch is local, so shrinking
patch_ib to num_aware_ibs (e.g. 54k for 3D) and then keeping num_ibs at
num_gbl_ibs was a latent out-of-bounds: IBM loops over num_ibs would
access patch_ib beyond its capacity.

MOVE_ALLOC transfers patch_ib_gbl (sized exactly num_gbl_ibs) directly
into patch_ib — no copy, no truncation, patch_ib size matches num_ibs.
The num_aware_ibs resize is still done for multi-rank cases where each
rank genuinely only needs its local neighbourhood subset.

patch_ib has GPU_DECLARE(create=...) in m_global_parameters, which means
OpenACC/OpenMP tracks it via plain allocate/deallocate automatically.
move_alloc is not reliably intercepted by all GPU runtimes for declare-
create variables, so replace all three grow-on-demand sites (m_particle_bed,
m_mpi_proxy, s_reduce_ib_patch_array 1-rank/no-MPI) with the established
pattern: allocate tmp, copy, deallocate patch_ib, allocate patch_ib.
@sbryngelson
Member

I generated and pushed the missing golden file for BA186DDF (2D -> Example -> mibm_particle_bed) directly to your branch — that should fix the CI failure.

While reviewing the diff I found several other issues worth addressing:


Critical (wrong results)

f_neighborhood_ranks_own_location guard is num_procs > 2 instead of num_procs > 1 (m_collisions.fpp)
On a 2-rank run the condition is false, so both ranks think they own every collision and forces get double-applied. Should match f_local_rank_owns_location's guard.

recv_forces_snap not zeroed before y/z accumulation passes (m_ibm.fpp, s_communicate_ib_forces)
recv_forces_snap = 0._wp lives only inside the x-dimension block. For y and z it accumulates stale values from the previous dimension, making the double-count subtraction remove too much force.
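The staleness can be seen in a toy accumulation (invented shapes and numbers, not the actual force arrays): when the snapshot buffer is zeroed only before the first pass, every later pass inherits the previous dimension's values.

```python
# Toy sketch: snapshot buffer zeroed once (buggy) vs. per pass (fixed).

def accumulate(passes, reset_each_pass):
    """Accumulate per-dimension contributions into a snapshot buffer
    and record each pass's total."""
    snap = [0.0, 0.0]                   # zeroed once, before pass 1
    totals = []
    for contrib in passes:
        if reset_each_pass:
            snap = [0.0, 0.0]           # the fix: clear per dimension
        for i, c in enumerate(contrib):
            snap[i] += c
        totals.append(sum(snap))
    return totals

passes = [[1.0, 1.0], [2.0, 2.0]]       # x contributions, then y
buggy = accumulate(passes, reset_each_pass=False)   # y total inflated
fixed = accumulate(passes, reset_each_pass=True)    # y total correct
```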

s_handoff_ib_ownership can silently drop newly received patches (m_ibm.fpp)
After the compact-and-delete loop, ib_gbl_idx_lookup still reflects the old layout (it is rebuilt after unpacking). s_get_neighborhood_idx is therefore testing the pre-handoff table — a just-evicted patch may still appear as resident, causing the newly received authoritative state to be discarded.


High (MPI / GPU correctness)

recv_forces_snap, recv_torques_snap, recv_ids, recv_ft allocated with plain allocate, not @:ALLOCATE (m_ibm.fpp)
These arrays are used inside GPU_PARALLEL_LOOP via copyin clauses, but plain allocate skips GPU_ENTER_DATA. On OpenACC builds the GPU sees uninitialized device memory.

patch_ib re-allocated with plain allocate/deallocate while GPU_DECLARE is active (m_global_parameters.fpp + m_start_up.fpp)
patch_ib was changed from a static array to allocatable, but the device pointer registered by GPU_DECLARE at startup goes stale every time s_reduce_ib_patch_array or the MPI proxy resizes it. Each reallocation should use @:ALLOCATE/@:DEALLOCATE.

neighbor_domain_x/y/z computed on host but never GPU_UPDATE'd to device (m_start_up.fpp)
These are GPU_DECLARE'd and used inside f_neighborhood_ranks_own_location (a GPU_ROUTINE). After get_neighbor_bounds fills them on the CPU, a GPU_UPDATE(device=[neighbor_domain_x, neighbor_domain_y, neighbor_domain_z]) is needed.

s_compute_ib_neighbor_ranks: nreqs not reset between outer loop iterations (m_start_up.fpp)
After MPI_WAITALL, nreqs retains its end value going into the next k iteration. With ib_neighborhood_radius > 2 this overflows the dimension(52) requests array.
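A minimal sketch of that overflow (hypothetical loop shape; the real code posts MPI_ISEND/MPI_IRECV handles into the fixed-size requests array):

```python
# Toy sketch of the nreqs reset bug in a fixed-capacity request array.
MAX_REQS = 52          # dimension(52) requests array

def post_requests_buggy(shells):
    """nreqs is never reset, so it keeps growing across outer k
    iterations -- the final value indexes past requests(52)."""
    nreqs = 0
    for shell in shells:              # outer k loop over shells
        for _ in range(shell):        # stand-in for posting isend/irecv
            nreqs += 1                #   requests(nreqs) = ...
        # MPI_WAITALL completes the handles but never resets nreqs.
    return nreqs                      # peak index written into requests

def post_requests_fixed(shells):
    """Reset nreqs at the top of every outer iteration."""
    peak = 0
    for shell in shells:
        nreqs = 0                     # the one-line fix
        for _ in range(shell):
            nreqs += 1
        peak = max(peak, nreqs)
    return peak

# Three shells of 26 neighbors each: the buggy count blows past 52,
# while the fixed version never exceeds one shell's worth of requests.
shells = [26, 26, 26]
```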

Post-process SILO writes called from all ranks (post_process/m_data_output.fpp)
DBPUTMVAR to dbroot is now called from all ranks, not just rank 0. In file_per_process=F mode this produces duplicate or corrupted multi-mesh entries.


Medium (robustness / performance)

num_local_ibs_max = 2000 hard cap with no overflow check — local_ib_patch_ids(num_local_ibs) = num_ibs writes out of bounds silently if more than 2000 particles land on one rank. Needs an @:PROHIBIT.

ib_force_send_buf/recv_buf heap-allocated on every RK stage call (m_ibm.fpp) — these should be module-level buffers sized once in s_ibm_setup rather than allocated and freed on every timestep.

moving_immersed_boundary_flag not recomputed after s_handoff_ib_ownership — set once at startup, then stale after each ownership handoff cycle when moving IBs migrate between ranks.


Low (code quality)

  • s_get_neighborhood_idx has an unused local variable i (compiler warning)
  • f_xorshift can return exactly 1.0 (violates the [0, 1) contract in the docstring — particle placed at xmax)
  • particle_bed_parameters type defined in simulation-only m_global_parameters.fpp; convention is to put derived types in m_derived_types.fpp
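The f_xorshift range defect in the list above can be sketched as follows (hypothetical constants and scaling; the actual generator lives in the PR's Fortran source). Scaling a 64-bit state by 1/(2^64 − 1) returns exactly 1.0 at the all-ones state; taking the top 53 bits and scaling by 2^-53 keeps the double-precision result strictly inside [0, 1):

```python
# Toy sketch of the [0, 1) contract violation and one common fix.
MASK64 = (1 << 64) - 1

def xorshift64(state):
    """Standard Marsaglia xorshift64 step; state must be nonzero."""
    state ^= (state << 13) & MASK64
    state ^= state >> 7
    state ^= (state << 17) & MASK64
    return state

def to_unit_closed(state):
    """Buggy scaling: state/(2^64 - 1) hits exactly 1.0 at all-ones,
    so a particle can be placed exactly at xmax."""
    return state / MASK64

def to_unit_halfopen(state):
    """Fixed scaling: use the top 53 bits (a double's mantissa width),
    so the largest result is (2^53 - 1) * 2^-53 < 1.0."""
    return (state >> 11) * 2.0**-53
```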

The two highest-priority fixes are the num_procs > 2 off-by-one (one-character fix, guaranteed wrong on 2-rank jobs) and recv_forces_snap not being zeroed per-dimension (wrong forces in any 2D/3D multi-rank run).
