
Rift o4d junior calmarg in loop: AFTER main 'distance' merge #139

Open
oshaughn wants to merge 247 commits into oshaughn:rift_O4d from oshaughnessy-junior:rift_O4d_junior_calmarg_in_loop

oshaughn commented Jun 1, 2026

Owner

Calibration marginalization (calmarg) done in the ILE loop, including

  • 'fused': new kernel, specifically a fast GPU implementation of the code
  • 'loop': brute-force backtest, looping over cal realizations

as well as tooling to

  • cal-pilot: adaptively sample in cal parameters, to enable sane results at high SNR

Richard O'Shaughnessy and others added 30 commits May 9, 2026 07:18
Third pass of the hyperpipeline-format work.  The first two commits
standardised the ILE -> CIP shard chain (commit #1) and the CIP -> ILE
grid handoff inside create_event_parameter_pipeline_BasicIteration plus
the puff / fetch / dag_utils plumbing (commit #2).  This commit makes
util_RIFT_pseudo_pipe.py -- the standard wrapper that builds args files
and then invokes BasicIteration -- respect the same env-var flag, so
end-users can flip the entire wrapper-driven workflow over to .dat
format with one environment variable.

Also includes the test/test_hyperpipeline_io.py file, which the prior
two commits referenced ("12 tests"/"17 tests") but did not actually
include in the staged file set.

Design constraint
-----------------

Per the project policy of "operate cohesively in one mode or the
other -- no internal conversion": pseudo_pipe is a thin
suffix-substituting wrapper.  It does NOT convert XML to .dat (or
vice versa) for any input.  Upstream inputs (manual seed grids,
template-bank-derived grids, etc.) must be staged in the format
matching the active mode; pseudo_pipe refuses with a clear message
when an XML-only auto-generation path would otherwise produce a file
the rest of the workflow can't consume.

Files
-----

* bin/util_RIFT_pseudo_pipe.py
  Five surgical patches, all gated on _use_hpip_pp derived from the
  same RIFT_HYPERPIPELINE_FORMAT env var commits #1/#2 use:

  - Three new variables (_use_hpip_pp, grid_suffix_pp, sim_grid_flag_pp)
    defined once near the top, immediately after the RIFT_LOWLATENCY
    block.  Mirrors the BasicIteration placement so the two scripts
    have parallel structure.

  - target_params writer (~line 639): in hyperpipeline mode, writes
    target_params.dat via hyperpipeline_io.write_grid_from_P_list with
    a column set auto-derived from whether P.eccentricity / P.meanPerAno
    are nonzero.  Otherwise legacy ChooseWaveformParams_array_to_xml
    emits target_params.xml.gz.  No behavioural change in legacy mode.

  - command-single --sim-xml line (~813): swapped to
    "{sim_grid_flag_pp} target_params.{grid_suffix_pp}", so the
    sanity-check ILE invocation routes through ILE's --sim-grid path
    in hyperpipeline mode.  This is the path the --sim-grid reader
    patch from commit #2 was designed for.

  - --manual-initial-grid copy site (~line 1399): copies to
    proposed-grid.{grid_suffix_pp} regardless of mode.  shutil.copyfile
    is format-agnostic; the source file's format must match the
    active mode (per the design constraint above).  The
    --manual-initial-grid-supplements branch (which uses ligolw_add,
    XML-only) raises SystemExit in hyperpipeline mode with a clear
    message pointing the user at pre-merging supplements upstream.

  - --input-grid argument to create_event_parameter_pipeline_BasicIteration
    (~line 1418): now passes proposed-grid.{grid_suffix_pp}, threading
    the suffix through to BasicIteration so the two scripts agree on
    the seed-grid filename.

  - AMR / template-bank seed-grid auto-generation guard (~line 1506):
    in hyperpipeline mode without --manual-initial-grid, raises
    SystemExit with a message asking the user to stage the initial
    grid as .dat and pass via --manual-initial-grid.  The XML-emitting
    util_AMRGrid.py and util_GridSubsetOfTemplateBank.py are
    intentionally untouched -- per the design constraint, no internal
    conversion.

  - --manual-initial-grid argparse help text updated to advertise both
    suffixes and note that the source format must match the active
    mode.

* test/test_hyperpipeline_io.py
  Recovered from the prior two commits, which referenced this file in
  their commit messages ("12 tests" in commit #1, extended to "17
  tests" in commit #2) but did not include it in the staged file set.
  The file is otherwise byte-identical to the version exercised
  end-to-end during the prior commits' development.  17 tests:

  - default_roundtrip
  - eccentricity_columns
  - tides_with_eos_index
  - to_legacy_dat_default
  - legacy_column_indices_consistency
  - sniff_distinguishes_legacy
  - sniff_recognizes_new_format
  - env_flag
  - concatenated_shards
  - read_many_skips_empties_and_mismatches
  - consolidate_weighted_average
  - consolidate_drops_high_sigma
  - grid_write_read_roundtrip_with_units
  - grid_distance_unit_conversion
  - grid_auto_suffix_append
  - column_alias_bridge
  - grid_no_lal_module_passthrough

  The file uses an importlib direct-load shim so it runs in stripped-
  down environments (no lalsuite / scipy required), making it usable
  for CI on minimal containers.
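
The suffix gating described above for util_RIFT_pseudo_pipe.py can be sketched as follows. This is a minimal illustration, not the script's code: the function name `hyperpipeline_naming` is hypothetical, and the set of values that switch the mode on is an assumption, since the commit message does not say how RIFT_HYPERPIPELINE_FORMAT is parsed.

```python
import os

def hyperpipeline_naming(env=None):
    """Return (use_hpip, grid_suffix, sim_grid_flag) for the active mode.

    Assumption: "1", "true", "yes" or "on" (case-insensitive) enables
    hyperpipeline mode; the real script may parse the variable differently.
    """
    env = os.environ if env is None else env
    raw = str(env.get("RIFT_HYPERPIPELINE_FORMAT", "")).strip().lower()
    use_hpip = raw in {"1", "true", "yes", "on"}
    grid_suffix = "dat" if use_hpip else "xml.gz"
    sim_grid_flag = "--sim-grid" if use_hpip else "--sim-xml"
    return use_hpip, grid_suffix, sim_grid_flag

# Build the sanity-check ILE argument the same way in both modes.
use_hpip, suffix, flag = hyperpipeline_naming({"RIFT_HYPERPIPELINE_FORMAT": "1"})
command_fragment = f"{flag} target_params.{suffix}"
```

Deriving all three values once keeps every later string substitution consistent, which is the point of the "operate cohesively in one mode" constraint.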

Audit
-----

A full pass over util_RIFT_pseudo_pipe.py confirmed every remaining
xml.gz / --sim-xml string is one of:

  * a comment / variable definition / argparse help mentioning the
    pair of supported suffixes (lines 34, 35, 47, 48, 321, ...);
  * a defaults string for an external file (PSD, coinc, ini) that is
    legitimately external and stays XML;
  * inside a code path I gated to refuse-and-exit in hyperpipeline
    mode (the AMR seed-grid block at ~line 1556 -- unreachable when
    _use_hpip_pp is True).

No live XML I/O paths reachable in hyperpipeline mode remain.

Tests
-----

All 17 tests in test/test_hyperpipeline_io.py pass.  Every patched
file in this commit and its dependents (commits #1, #2) compile via
py_compile.

Followups
---------

This commit covers util_RIFT_pseudo_pipe.py only.  Sibling drivers
that still need the same treatment:

  * bin/cepp_basic_htcondor (htcondor-only twin of BasicIteration)
  * bin/util_RIFT_pseudo_pipe_lowlatency.py
  * bin/util_RIFT_hyperpipe.py

To round out the seed-grid auto-generation paths so hyperpipeline mode
no longer requires --manual-initial-grid, the underlying generators
need parallel hyperpipeline output support:

  * bin/util_AMRGrid.py
  * bin/util_GridSubsetOfTemplateBank.py
  * bin/helper_LDG_Events.py

The EXTR_out -> LI posterior_samples convert path (called by
batchConvertExtr_job and friends in BasicIteration) is the last large
XML-resident consumer in the intrinsic-pipeline domain; addressing
that closes out the workstream.
…kelihood

Move calibration marginalization from postprocessing (calibration_reweighting.py)
into the ILE inner loop. Calibration is applied to the data (d -> C(f)d), so the
template-template U,V cross terms (rho_sq) stay calibration-independent and are
computed once; only the data term kappa changes per realization.

DiscreteFactoredLogLikelihoodViaArrayVectorNoLoop gains an n_cal argument. With
n_cal==1 the path is byte-identical to before. With n_cal>1 it caches the per-detector
Q-product inputs, recomputes kappa per realization via the existing Q_inner_product
kernel using a block-offset window (ifirst + c*N_window), and reduces with a streaming
log-sum-exp over realizations (memory-neutral, reuses the validated kernel). The driver
threads --calibration-n-realizations into the three production call sites.
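
A streaming log-sum-exp over realizations, as described above, might look like the sketch below. It is a numpy-only illustration with a hypothetical function name; the per-realization lnL arrays are synthetic stand-ins for what the Q_inner_product kernel would produce for each block.

```python
import numpy as np

def streaming_logsumexp_mean(lnL_blocks):
    """log((1/n_cal) * sum_c exp(lnL_c)), accumulated one realization at a time.

    lnL_blocks yields one array of shape (n_samples,) per realization, so no
    (n_cal, n_samples) intermediate is ever held in memory.
    """
    running = None
    n_cal = 0
    for lnL_c in lnL_blocks:
        lnL_c = np.asarray(lnL_c, dtype=float)
        # logaddexp is the numerically stable pairwise log(exp(a) + exp(b)).
        running = lnL_c.copy() if running is None else np.logaddexp(running, lnL_c)
        n_cal += 1
    return running - np.log(n_cal)

rng = np.random.default_rng(0)
lnL = rng.normal(100.0, 30.0, size=(5, 8))   # synthetic (n_cal, n_samples)
streamed = streaming_logsumexp_mean(iter(lnL))

# Direct (non-streaming) reference for comparison.
m = lnL.max(axis=0)
direct = m + np.log(np.exp(lnL - m).sum(axis=0)) - np.log(lnL.shape[0])
```

With a single realization the accumulator is returned unchanged, which mirrors the requirement that n_cal==1 reproduce the previous behaviour.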

Also fixes a bug in ComputeModeIPTimeSeries: the calibration branch took the inner
product against the original data instead of the calibration-modified data_now, so the
calibration factor was never applied.

Validated CPU and GPU paths against a brute-force per-realization reference to machine
precision (RIFT/calmarg/test_calmarg_reduction.py). Design rationale, the apply-to-data
vs apply-to-template convention, and remaining work (Option C fused kernel, seeding,
param export) in RIFT/calmarg/DESIGN_calmarg_in_loop.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sics scaffold)

RIFT/calmarg/backtest_calmarg.py compares calmarg likelihood implementations on
controlled synthetic inputs that exercise the per-realization block structure.
METHODS registry: reference (brute-force per-block + logsumexp), in_loop_B (the
n_cal>1 call), in_loop_C (stub raising NotImplementedError until the fused kernel
lands). Reports max|lnL - reference| and best-of-N timing on CPU or GPU; wire the
fused kernel into method_in_loop_C and it validates automatically.

in_loop_B reproduces reference to ~1e-15 on CPU and GPU, with and without phase
marginalization; ~3-4x faster than the brute-force reference on GPU.

run_physics_backtest() scaffolds the heavier real-data comparison vs bilby
calibration_reweighting.py (needs frames/PSDs/data_dump; runs on the stable host).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…default helper)

Adds a single fused kernel (cuda_Q_fused_calmarg.cu + Q_fused_calmarg.py) that, per
extrinsic sample, loops realizations x time x detectors x modes, forms kappa, applies
the default factored-likelihood helper, and does a streaming Simpson-weighted
log-sum-exp over (c,t) on-board -- returning lnL[j] in one launch, with no
(batch, n_cal, npts) intermediate and no per-realization Python launches.

Selected via cal_method='fused' in DiscreteFactoredLogLikelihoodViaArrayVectorNoLoop
(default remains 'loop' = Option B). Time integration matches Option B exactly by
passing the composite-Simpson weight vector w_t = simps(I, dx=deltaT) into the kernel.
rho_sq is calibration-independent and passed pre-summed over detectors.
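
The idea of passing a quadrature weight vector into the reduction can be illustrated as below. This sketch writes the composite-Simpson weights explicitly for an odd number of points instead of calling simps on an identity matrix as the commit does; for even point counts the library's end correction differs, so that case is deliberately rejected here. Function names are hypothetical.

```python
import numpy as np

def simpson_weights(npts, dx):
    """Composite Simpson weights (dx/3) * [1, 4, 2, ..., 4, 1] for odd npts."""
    if npts < 3 or npts % 2 == 0:
        raise ValueError("this sketch handles odd npts >= 3 only")
    w = np.ones(npts)
    w[1:-1:2] = 4.0   # odd interior points
    w[2:-1:2] = 2.0   # even interior points
    return w * dx / 3.0

def log_time_integral(lnL_t, w_t):
    """log(sum_t w_t * exp(lnL_t)), shifted by the maximum for stability."""
    m = np.max(lnL_t)
    return m + np.log(np.sum(w_t * np.exp(lnL_t - m)))

t = np.linspace(0.0, 1.0, 5)
w_t = simpson_weights(t.size, dx=t[1] - t[0])
quadratic_integral = np.sum(w_t * t**2)              # integral of t^2 on [0, 1]
log_integral = log_time_integral(np.log(1.0 + t**2), w_t)
```

Because the weights are a plain vector, the same values can be handed to any backend, so the loop and fused paths integrate with identical coefficients.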

Validated in backtest_calmarg.py (method in_loop_C) vs the brute-force reference and
Option B to ~1e-15 on GPU. ~8-9x faster than Option B and ~25-32x faster than brute
force (e.g. n_cal=200 x 8192 samples: 279 ms vs 2422 ms vs 7080 ms on sm_30).

Scope (raises NotImplementedError otherwise): GPU only; phase_marginalization=False;
default distance-unmarginalized helper only. Stage 2 = port the distmarg loglikelihood
(table interp) into the kernel for the dominant distance-marginalized path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
make_synthetic_case now builds per-detector rholms/U/V (dets=("H1","L1") default,
--dets CLI), so the fused kernel's detector loop and the function's per-detector
ifirst stacking are genuinely exercised (each detector gets a distinct ifirst from
its real location). Validated Option C vs reference and Option B to ~4e-15 with
H1,L1,V1; Option C ~9x faster than Option B (50 ms vs 445 ms, n_cal=100 x 1024 x 3 det).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… loop fallback)

Adds a SEPARATE fused kernel (cuda_Q_fused_calmarg_distmarg.cu + wrapper
Q_fused_calmarg_distmarg_cupy) that reproduces the distance-marginalization
loglikelihood on-board: x0=kappa/rho_sq, the asinh-based s/t transforms, the
EvenBivariateLinearInterpolator bilinear table lookup with its in-bounds mask, and
exponent_max -- then the streaming Simpson-weighted cal+time log-sum-exp.

Kept separate from the default-helper kernel on purpose: smaller review surface per
kernel, the simpler kernel stays as a baseline, and cal_method='loop' (Option B)
remains a full fallback for distmarg on CPU and GPU. Selected via cal_method='fused'
plus a cal_distmarg table dict (cal_distmarg=None -> default-helper kernel).

Harness gains --loglikelihood distmarg: builds a self-consistent table + a mirror
Python closure (reference/Option B) so the fused kernel is validated against the same
transform. Agreement ~1e-14 vs brute-force reference, single- and multi-detector;
~6-7x faster than Option B (e.g. n_cal=200 x 2048 x 3det distmarg: 333 ms vs 2358 ms).
Default-helper path unchanged and still matches to ~3e-15.

Scope (raises NotImplementedError otherwise): GPU only, phase_marginalization=False.
Remaining: phase-marg support; wire driver distmarg sites to a cal_distmarg dict
behind an opt-in flag (Option B stays default).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ernel (opt-in)

Adds opt-in --calibration-fused-kernel (off by default). When set, on GPU, with
calibration marginalization active, the driver packages the distance-marginalization
lookup_table (s_array, t_array, lnI_array, bmax, bref) + xmin/xmax into a cal_distmarg
dict and passes cal_method='fused' at the non-phase-marg distmarg call site (Option C).
The phase-marg distmarg site and everything else stay on cal_method='loop' (Option B),
which remains the default and the fallback for all cases. On CPU the flag is ignored
with a warning (the fused kernels are GPU-only).

Driver py_compile clean. Not yet exercised end-to-end on a real run; kernel/reduction
correctness is covered by RIFT/calmarg/backtest_calmarg.py.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…marg table

End-to-end driver test surfaced a real bug: in the distance-marginalization path
P.dist is fixed at the fiducial, so invDistMpc is a SCALAR, but the fused kernel
wrappers require one value per extrinsic sample (assert invDist.shape==(n_ext,)).
The loop path tolerated the scalar via broadcasting; the fused path raised
AssertionError (swallowed by the driver's generic handler). Fix: broadcast
invDistMpc to (npts_extrinsic,) in the fused branch (works for scalar fiducial
distance and for sampled-distance vectors alike).
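
The scalar-versus-vector fix can be sketched with numpy broadcasting. The helper name is hypothetical; only the shape contract (one value per extrinsic sample) comes from the commit message.

```python
import numpy as np

def per_sample_inverse_distance(inv_dist_mpc, npts_extrinsic):
    """Return a contiguous (npts_extrinsic,) array from a scalar or a vector.

    A scalar (fixed fiducial distance) is repeated; a vector that already has
    the right length passes through; any other shape raises ValueError.
    """
    arr = np.asarray(inv_dist_mpc, dtype=float)
    # broadcast_to returns a read-only view; copy it into real storage.
    return np.ascontiguousarray(np.broadcast_to(arr, (npts_extrinsic,)))

from_scalar = per_sample_inverse_distance(1.0 / 400.0, 4)
from_vector = per_sample_inverse_distance([0.1, 0.2, 0.3, 0.4], 4)
```

Materializing the broadcast matters for a GPU kernel wrapper, which typically needs one stored value per sample, not a zero-stride view.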

Also add backtest_calmarg --real-table to validate the fused distmarg kernel against
a production util_InitMargTable .npz (real s/t ranges). Result: fused == reference
== loop to ~2e-14 on the real table (default helper and synthetic distmarg unchanged,
~1e-14).

Note: full-sampler end-to-end numerical comparison on the local 2GB NVS 510 is
unreliable (OOM / nan under load); use a larger GPU for that. Wiring is confirmed:
the flag builds the cal_distmarg dict, reaches the fused distmarg kernel, and runs to
completion.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rginalization)

Pipeline access: util_RIFT_pseudo_pipe.py gains --calmarg-envelope-directory,
--calmarg-n-realizations, --calmarg-spline-count, --calmarg-fused-kernel, which append
the corresponding ILE flags (--calibration-envelope-directory / -n-realizations /
-spline-count / --calibration-fused-kernel) to args_ile.txt. Setting the envelope
directory enables in-loop calmarg on the distance-marginalization code path; the fused
kernel additionally needs GPU and otherwise falls back to the loop method.

Demo: demo/rift/calmarg exercises baseline vs loop (Option B) vs fused (Option C) on the
zero-spin synthetic CI data in 3 detectors (H1,L1,V1) by running ILE directly (no
condor). Makefile targets: inputs, verify-exact (deterministic loop==fused==reference to
~1e-14 on the demo's real distmarg table), run-baseline/run-loop/run-fused, compare.
Includes tools/make_cal_envelopes.py and tools/compare_lnL.py, and a README explaining
the physics and that full-sampler runs agree only within Monte-Carlo noise (the GPU
integrator is not bit-reproducible even with --seed) -- verify-exact is the rigorous
equivalence check.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…o AV sampler

Both fused kernels now skip any (c,t) sample whose window offset (ifirst + c*N_window
+ t) falls outside [0, npts_full) -- pathological/NaN extrinsic draws from the sampler
can no longer cause CUDA_ERROR_ILLEGAL_ADDRESS; such samples simply do not contribute.
Verified no change to the validated numerics (backtest still ~1e-15, default + distmarg).

demo/rift/calmarg now uses the adaptive-volume sampler (SAMPLER=AV) by default instead
of GMM. AV (mcsamplerAdaptiveVolume) is the mature/stable GPU code path and sets
sampler.xpy=cupy under --gpu, so it works with the fused kernel; GMM (mcsamplerEnsemble)
is newer and heavier on the GPU and was the likely source of the illegal-address crash
in the full sampler run. SAMPLER is overridable (e.g. SAMPLER=GMM).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sets

Isolated on a CIT GPU (RTX 2080 Ti): baseline and the fused calmarg path run clean,
but the loop calmarg path (Option B) hits CUDA_ERROR_ILLEGAL_ADDRESS at scale. Cause:
for some sky positions the integration window extends one sample past the precomputed
rholm buffer (ifirst+t >= N_window). In baseline (buffer length N_window) the tiny
over-read lands in mapped pool memory and is silently tolerated; with calibration
marginalization the buffer is n_cal blocks long, so the over-read in the LAST block is
past the whole allocation and faults.

Fix: the shared Q_inner kernel now skips time offsets with (i_first_time+i_time) >=
num_time_points, contributing zero instead of reading out of bounds (a negative int
index wraps to a large size_t and is caught too). This makes the loop path robust and
also removes the latent silent over-read for ALL GPU runs (incl. non-calmarg). Valid
indices are unaffected; backtest numerics unchanged (~1e-15, default + distmarg).

NOTE: shared kernel used by all GPU ILE runs (slightly broader scope than the rest of
this branch). The underlying window-sizing edge (window can exceed the storage buffer
for extreme sky positions) is pre-existing; the guard makes it safe rather than masking
a calmarg-specific bug.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… / OOB)

A forced-overflow backtest (small N_window) exposed a correctness bug: when the
integration window over-runs a calibration block (ifirst+t >= N_window), the previous
guard (index < npts_full) only caught the LAST block (the crash). For earlier blocks
the read silently bled into block c+1 -- wrong values, not a fault -- in BOTH the loop
and fused paths (they agreed with each other but disagreed with the per-block
n_cal==1 reference).

Fix: guard the WITHIN-block offset against [0, N_window) in both fused kernels, and in
the loop path slice Q to the current block and pass the within-block offset (so the
shared Q_inner_product kernel's guard fires at the block boundary). The CPU loop branch
likewise zeros out-of-range rows. An over-running window now contributes zero from that
detector at that (c,t), matching the n_cal==1 reference exactly.
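
The within-block guard can be illustrated on a toy concatenated buffer. This is a numpy sketch with a hypothetical function name, not the kernel code; it shows why testing the within-block offset against [0, N_window) differs from testing the global index against the full allocation.

```python
import numpy as np

def windowed_block_read(rho_concat, n_window, c, ifirst, npts):
    """Read npts samples of block c starting at within-block offset ifirst.

    Offsets outside [0, n_window) contribute zero instead of reading the
    neighbouring block or running past the end of the allocation.
    """
    offsets = ifirst + np.arange(npts)
    valid = (offsets >= 0) & (offsets < n_window)
    out = np.zeros(npts, dtype=rho_concat.dtype)
    out[valid] = rho_concat[c * n_window + offsets[valid]]
    return out

n_window, n_cal = 6, 3
rho = np.arange(n_cal * n_window, dtype=float)   # block c holds 6c .. 6c+5

first_block = windowed_block_read(rho, n_window, c=0, ifirst=4, npts=4)
last_block = windowed_block_read(rho, n_window, c=2, ifirst=4, npts=4)
negative_start = windowed_block_read(rho, n_window, c=1, ifirst=-1, npts=4)
```

In the first call an unguarded read would have returned 6 and 7 from block 1, which is the silent bleed described above; a guard on the global index alone only protects the last block.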

Validated: with forced overflow (N_window=140, 3 IFOs) loop == fused == reference to
~1e-14 (default and distmarg); the no-overflow case and the CPU regression test are
unchanged (~1e-15). Note: this fix is independent of the loud-source loop-vs-fused gap
seen in the full sampler run. Both paths behave identically here, so that gap is
attributed to Monte-Carlo scatter, pending the reproducibility check.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… sampling)

Generalize the cal marginalization from the (1/n_cal) average to a weighted sum
  Z_cal = sum_c exp(log_w[c]) * Z_c / sum_c exp(log_w[c]),
where log_w[c] are per-realization importance log-weights (w_c = prior/proposal). This
is the enabling hook for adaptive / importance cal sampling at high SNR, where prior
draws become inefficient as the cal posterior departs from the prior.

New optional cal_log_weights (length n_cal) threads through
DiscreteFactoredLogLikelihoodViaArrayVectorNoLoop into the loop reduction and both
fused kernels (which now take log_w[] + log_w_norm=logsumexp(log_w); lnL_t += log_w[c],
final term -log_w_norm). Default None = uniform = the plain (1/n_cal) average, so all
current behavior and verify-exact are byte-identical.
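
In log space the weighted sum above reads lnZ_cal = logsumexp(log_w + lnZ_c) - logsumexp(log_w). A minimal numpy sketch follows; the function name is hypothetical and the inputs are synthetic.

```python
import numpy as np

def cal_marginalized_lnZ(lnZ_c, log_w=None):
    """Weighted calibration marginalization in log space.

    lnZ_c has shape (n_cal,) or (n_cal, n_samples); log_w has shape (n_cal,)
    and need not be normalized. log_w=None means uniform weights.
    """
    lnZ_c = np.asarray(lnZ_c, dtype=float)
    n_cal = lnZ_c.shape[0]
    log_w = np.zeros(n_cal) if log_w is None else np.asarray(log_w, dtype=float)
    log_w_norm = np.logaddexp.reduce(log_w)
    shaped = log_w.reshape((n_cal,) + (1,) * (lnZ_c.ndim - 1))
    return np.logaddexp.reduce(lnZ_c + shaped, axis=0) - log_w_norm

lnZ = np.array([1.0, 2.0, 3.0])
uniform = cal_marginalized_lnZ(lnZ)

p = np.array([0.2, 0.3, 0.5])
# An arbitrary constant added to log_w cancels against the normalization.
weighted = cal_marginalized_lnZ(lnZ, np.log(p) + 7.0)
```

Subtracting logsumexp(log_w) is what makes the result independent of the overall weight scale, so uniform weights recover the plain average.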

Validated (backtest_calmarg --random-cal-weights): loop == fused == reference to ~1e-14
with non-uniform weights, for default helper and distmarg, on CPU and GPU, with and
without window overflow; uniform path unchanged (~1e-15). The learning loop that
produces non-uniform weights (fit a cal proposal from per-realization responsibilities,
redraw, iterate) is the planned follow-on.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…d low-SNR variant

compare_lnL.py was reading the LAST .dat column, which is neff (effective sample
count), not the marginalized lnL. The ILE row ends with [... lnL, sqrt_var, ntotal,
neff], so lnL is column [-4]. This explains the spurious "loud source / large
loop-vs-fused gap" seen earlier: those numbers were neff (~1800-1900 of NMAX=20000),
which legitimately scatter run-to-run and are meaningless to compare. The bundled
injection is actually network SNR ~17.5 (verified with util_FrameZeroNoiseSNR), so the
true marginalized lnL is ~150, not ~1880. compare_lnL.py now reports lnL +- sampling
error and neff separately.
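
Reading the trailing columns by name avoids the off-by-position mistake described above. The sketch assumes only what the commit states, that a row ends with [lnL, sqrt_var, ntotal, neff]; the leading columns in the example row are made-up placeholders and the function name is hypothetical.

```python
import numpy as np

def parse_ile_row(row):
    """Split the last four columns of an ILE output row into named fields."""
    row = np.asarray(row, dtype=float)
    lnL, sqrt_var, ntotal, neff = row[-4:]
    return {"lnL": lnL, "sqrt_var": sqrt_var, "ntotal": ntotal, "neff": neff}

# Placeholder leading columns, followed by lnL, sqrt_var, ntotal, neff.
example_row = [0.0, 35.0, 30.0, 150.2, 0.4, 20000.0, 1850.0]
fields = parse_ile_row(example_row)

wrong_value = example_row[-1]   # what the old reader picked up: neff, not lnL
```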

Add a quiet-source variant: make lowsnr-inputs generates a fainter copy of the same
injection (m1=35,m2=30 at larger distance, ~SNR 9) on the fly -- no committed binaries,
same path as the CI data (util_WriteInjectionFile.py + util_WriteFrameAndCacheFromXML.sh)
-- and make low-snr runs the full comparison on it. INJ_DIST tunes the loudness.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The demo was analyzing overlap-grid.xml.gz event 0 (m1=m2=26.4) against an injection at
m1=35,m2=30 -- a far-off template, giving lnL ~ -2.7 ("no signal") and hiding the
calibration-marginalization effect. Point --sim-xml at the injection itself
(mdc.xml.gz, a single matched point) so the signal is present: lnL jumps to ~78 even
heavily undersampled on a 2GB card (-> ~rho^2/2 ~ 150 when converged). The low-snr
target uses the matched mdc_lowsnr.xml.gz. SIM_XML is overridable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e-point runs

The single-point matched-template full-sampler run has a narrow extrinsic peak that
RIFT's adaptive sampler can fail to lock robustly (sensitive to NCHUNK/SNR/GPU env):
neff~1 (one draw dominates) or neff large but lnL~0 (spread off-peak, missed signal).
Sanity-check sqrt(2*lnLmax) ~ injected SNR. Use low-snr + modest NCHUNK for a robust
full run; verify-exact (deterministic ~1e-14) is the rigorous loop-vs-fused check that
does not depend on sampler convergence.
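
The sanity check sqrt(2*lnLmax) ~ injected SNR is simple arithmetic, sketched here with hypothetical helper names. For a matched template the peak log-likelihood is roughly rho^2/2, so the relation can be inverted either way.

```python
import math

def implied_snr(lnL_max):
    """SNR implied by a peak log-likelihood, using lnL_max ~ rho^2 / 2."""
    return math.sqrt(2.0 * max(lnL_max, 0.0))

def expected_lnL(snr):
    """Peak log-likelihood expected for a given network SNR."""
    return 0.5 * snr * snr

lnL_for_demo = expected_lnL(17.5)     # network SNR quoted for the bundled injection
snr_from_lnL = implied_snr(150.0)     # roughly the converged lnL quoted earlier
snr_if_neff_misread = implied_snr(1880.0)
```

Misreading neff (~1880) as lnL would imply an SNR above 60, which is the kind of mismatch this check is meant to expose.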

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cal_method='fused' now works with xpy=np: cupy is imported lazily in
Q_fused_calmarg.py (module imports fine on a machine without CUDA), and pure-numpy
implementations Q_fused_calmarg_numpy / _distmarg_lnL_numpy mirror the CUDA kernels
exactly (within-block guard, importance weights, Simpson weights, distmarg table
transform). factored_likelihood routes the fused branch to numpy on CPU and the cupy
kernels on GPU.

This lets the fused path run on a laptop and gives an INDEPENDENT cross-check of the
kernel math. Validated (backtest --backend cpu): fused-numpy == loop == reference to
~1e-15 for default helper and distmarg, with non-uniform importance weights and with
window overflow; GPU fused-cupy unchanged (~1e-15); module imports with cupy blocked.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Driver:
- use_fused_calmarg flag; wire the NON-distance-marginalization ILE call site to the
  fused default-helper kernel when --calibration-fused-kernel is set (was only wired
  at the distmarg site).
- drop the GPU gate on the fused path -- it now has a numpy backend, so it works on
  CPU too (the loop method is still the default/fallback).
- fix a pre-existing bug exposed by running the non-distmarg GPU path with the AV
  sampler: that likelihood_function used the passthrough xpy_asarray_already, but AV
  hands back numpy arrays, so cupy ufuncs raised "Unsupported type numpy.ndarray".
  Use xpy_default.asarray (a no-op for on-device arrays), matching the distmarg path.

Demo: BACKEND={gpu,cpu} and DMARG={1,0} toggles. verify-exact honours both. The full
matrix passes (loop == fused == reference ~1e-14): GPU/CPU x distmarg/default; and the
non-distmarg end-to-end run now works (loop ~ fused within sampling error). Distmarg
path unchanged (regression check passes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…gnal lost)

CRITICAL bug: in ComputeModeIPTimeSeries the concatenated n_cal-block rholm series was
created with epoch=data.epoch, but the per-block series (after the roll/cut) has epoch
data.epoch - hlms.epoch (rolled).  Only the per-block *data* was copied in, never the
epoch, so the concatenated series carried the wrong time reference.  Downstream this put
ifirst at ~within_block + (n_cal-1)*N_window -- i.e. in the LAST block instead of within
block 0 -- so the integration window read past/into the wrong block, the within-block
guard zeroed it, and the calmarg likelihood returned NaN/collapsed (lnL ~ -3) while
baseline was fine (lnL ~ 115).  This affected BOTH loop and fused, on CPU and GPU, and
reproduced everywhere; verify-exact missed it because it feeds synthetic rholms with a
manually-set epoch.

Fix: set rholms_so_far.epoch = rholms_here.epoch (block 0's actual rolled/cut epoch),
matching the non-calibration branch.  Verified end-to-end: ifirst now lands in
[0, N_window-npts] and baseline ~ loop ~ fused (lnL ~ 115-123 at the matched template,
within the undersampled MC scatter) instead of collapsing.  All synthetic regressions
(verify-exact gpu/cpu x distmarg/default, CPU reduction test) still pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…poch bug)

test_precompute_alignment.py exercises the REAL PrecomputeLikelihoodTerms /
ComputeModeIPTimeSeries path (3 IFOs, identity calibration) and asserts that the
calibration-marginalized rholm series matches the non-calibration series block-by-block
in BOTH data and epoch.  The epoch assertion fails on the alignment bug just fixed
(|delta epoch| ~ (n_cal-1)*N_window*deltaT ~ 0.5 s), which verify-exact /
test_calmarg_reduction cannot catch because they feed synthetic rholms with a hand-set
epoch.  CPU-only, no GPU needed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
return_lnLt with n_cal>1 now returns the calibration-marginalized lnL at each time bin
  lnL_marg(t) = log( sum_c exp(log_w[c]) exp(lnL_t,c(t)) ) - logsumexp(log_w)
(the weighted average of the per-realization likelihood series), instead of raising.
It is produced by the loop reduction (the fused scalar kernel is bypassed when a
timeseries is requested). Verified: integrating exp(lnL_marg(t)) over time reproduces
the time-integrated scalar lnL to ~1e-15.
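
The consistency claim above (integrating exp(lnL_marg(t)) over time reproduces the time-integrated scalar) can be checked on synthetic data. This is a numpy sketch with a hypothetical function name and flat quadrature weights chosen for brevity; any fixed weight vector works the same way because the time integral is linear.

```python
import numpy as np

def lnL_marg_timeseries(lnL_tc, log_w=None):
    """Cal-marginalized lnL at each time bin; lnL_tc has shape (n_cal, npts)."""
    lnL_tc = np.asarray(lnL_tc, dtype=float)
    n_cal = lnL_tc.shape[0]
    log_w = np.zeros(n_cal) if log_w is None else np.asarray(log_w, dtype=float)
    numerator = np.logaddexp.reduce(lnL_tc + log_w[:, None], axis=0)
    return numerator - np.logaddexp.reduce(log_w)

rng = np.random.default_rng(1)
n_cal, npts, dt = 4, 9, 0.5
lnL_tc = rng.normal(0.0, 2.0, size=(n_cal, npts))   # synthetic per-realization series
log_w = rng.normal(0.0, 1.0, size=n_cal)            # synthetic importance log-weights
w_t = np.full(npts, dt)                             # flat time weights for this sketch

# Route 1: marginalize over cal per time bin, then integrate over time.
series_integral = np.sum(w_t * np.exp(lnL_marg_timeseries(lnL_tc, log_w)))

# Route 2: integrate each realization over time, then take the weighted average.
W = np.exp(log_w - np.logaddexp.reduce(log_w))
scalar_integral = np.sum(W * np.sum(w_t * np.exp(lnL_tc), axis=1))
```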

Driver: resample_samples() threads n_cal through, so the time-resampling export uses the
cal-marginalized timeseries and all downstream time-sampling paths work unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…, matrix toggle)

Both fused kernels (default-helper and distmarg) and the numpy backend now support
analytic phase marginalization: use |kappa| instead of Re(kappa) (the (2,-2)-mode
conjugation is already baked into Q/A by the caller, exactly as in the loop path).
factored_likelihood passes phase_marginalization through; the NotImplementedError is
gone. Driver: the phase-marg distmarg ILE call site now uses the fused distmarg kernel
when --calibration-fused-kernel is set.

Validated against reference + loop to ~1e-14 across the full matrix:
gpu/cpu x default/distmarg x phase 0/1 (8 cells PASS), plus precompute-alignment and
CPU-reduction regressions. Demo gains a PHASE=0|1 toggle for verify-exact.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… node space)

RIFT/calmarg/adaptive.py learns a unimodal Gaussian proposal over the calibration
spline-node parameters so in-loop calmarg stays efficient at high SNR (where prior
draws collapse to ~1 effective cal sample).  Uses importance weighting
w_c = prior/proposal (so the marginalized result is unbiased) and a TEMPERED proposal
fit -- weights softmax(beta*log_resp), beta ramped 0.3->1.0, covariance inflated while
tempering is on -- to avoid a single sample dominating given the very large lnL dynamic
range, then sharpening as it learns.

Pieces: envelope_node_prior, nodes_to_cal_factors (spline, matches
generate_realizations), fit_proposal (tempered weighted Gaussian), neff_from_logweights,
adaptive_cal (the loop, taking an `evaluate(nodes)->log integral L` callback and
returning the final nodes + importance log-weights + neff history).  The fit uses
weights prior*L/proposal, which target the cal posterior; the neff of those weights
approaches n_real when the proposal matches the posterior.
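
The tempering idea can be sketched in a few lines of numpy. Everything here is illustrative and none of it is RIFT/calmarg/adaptive.py: the function names are hypothetical, the Kish estimator (sum w)^2 / sum w^2 is an assumed definition of neff, and the covariance inflation factor is an arbitrary example value.

```python
import numpy as np

def kish_neff(log_w):
    """Effective sample size (sum w)^2 / sum(w^2) from log-weights."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - log_w.max())
    return w.sum() ** 2 / np.sum(w ** 2)

def tempered_weights(log_resp, beta):
    """softmax(beta * log_resp); beta < 1 flattens a steep weight distribution."""
    z = beta * np.asarray(log_resp, dtype=float)
    w = np.exp(z - z.max())
    return w / w.sum()

def fit_weighted_gaussian(nodes, weights, inflate=1.0):
    """Weighted mean and (optionally inflated) covariance of the node samples."""
    nodes = np.asarray(nodes, dtype=float)          # (n_real, n_nodes)
    mean = weights @ nodes
    d = nodes - mean
    cov = inflate * (d * weights[:, None]).T @ d
    return mean, cov

log_resp = np.array([0.0, -10.0, -20.0, -30.0])     # steep, high-SNR-like spread
nodes = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

w_cold = tempered_weights(log_resp, beta=1.0)
w_warm = tempered_weights(log_resp, beta=0.3)
mean, cov = fit_weighted_gaussian(nodes, w_warm, inflate=2.0)
```

With beta=1 a single sample carries almost all the weight; the tempered weights spread it over more samples, so the early proposal fit is less degenerate.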

Self-contained convergence demo (python -m RIFT.calmarg.adaptive, no GPU/lal): a
2-sigma-off, narrow (high-SNR) cal -- prior-only neff ~1/300; adaptive recovers
neff -> ~246 and the proposal converges onto the cal posterior to ~0.04 sigma.

Driver integration (outer pilot/refine pass that calls ILE per realization to get
log integral L) is the next step.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
backtest_calmarg.py --scan-ncal: per-likelihood-evaluation wall-time vs n_cal for
reference/loop/fused, to quantify the cost of cal marginalization for planning.
Data (GPU, 3 IFO, distmarg, 4096 extrinsic): marginal cost per extra realization ~57 ms
(brute), ~23 ms (loop), ~3.3 ms (fused); at n_cal=200 reference is ~11 s/eval (hours for
a full integration -> reference only), fused ~0.7 s/eval (production-feasible).

DESIGN_adaptive_driver.md: planning doc for learning the cal proposal in the driver.
Weighs (A) brute-force reference, (B) portable extrinsic+cal distribution / normalizing
flow breadcrumbs, (C) lazy pilot. Recommends: production path must be fused; learn cal
ONCE from a cheap pilot of high-likelihood points (cal is boring / extrinsic-independent)
+ Phase 0 importance weights; brute force is the validation reference; define a
save/load breadcrumb interface (Gaussian now, NF later). No multi-stage loop.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
claude and others added 30 commits June 10, 2026 08:54
…GETENV)

The fan-out .sub used 'getenv = True', which CIT (and pools with
SUBMIT_ALLOW_GETENV=false) reject -> submit fails. Set the environment explicitly
instead (HOME + PYTHONPATH + JAX/thread caps); the absolute conda-python executable +
PYTHONPATH are all the job needs. request_cpus=2 + OMP/OPENBLAS/MKL=2 +
xla_cpu_multi_thread_eigen=false mirror the ulimit -u thread-spawn fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Merge JAX-GP export-at-scale into calmarg
…kelihood

Merge JAX ILE likelihood prototype
Counterpart to RIFT_BOOLEAN_LIST (which adds (<x> =?= TRUE) requirements to the REMOTE worker
jobs ILE/CIP, gated on use_osg). The LOCAL/flock_local non-worker jobs (convert, test,
consolidate, puff, join, unify, psd, resample, ...) run with absolute /home paths and NO file
transfer, so on a pool whose execute points may lack /home (e.g. EPNFS=undefined) they fail
with "Failed to open .../*.out: No such file or directory".

New helper _nonworker_extra_requirements() reads $RIFT_REQUIRE_NONWORKER (comma-separated
ClassAd attrs) and returns ['<attr> =?= TRUE', ...]; appended to the requirements list before
each non-worker requirements emission (workers CIP/ILE excluded -> they keep RIFT_BOOLEAN_LIST).
Read at DAG-build time, so it is durable through `asimov manage submit` / the asimov daemon.

Usage: export RIFT_REQUIRE_NONWORKER='EPNFS'  -> pins local jobs to NFS-/home nodes.
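
A helper of this kind could be as small as the sketch below. It follows the behaviour described above (comma-separated attributes in, '<attr> =?= TRUE' clauses out), but the function body, the example base requirement, and the way clauses are joined with && are assumptions made for illustration.

```python
import os

def nonworker_extra_requirements(env=None):
    """Turn $RIFT_REQUIRE_NONWORKER into a list of ClassAd requirement clauses."""
    env = os.environ if env is None else env
    raw = env.get("RIFT_REQUIRE_NONWORKER", "")
    attrs = [a.strip() for a in raw.split(",") if a.strip()]
    return [f"{a} =?= TRUE" for a in attrs]

# Hypothetical existing requirement for a local (non-worker) job.
requirements = ["(TARGET.Memory >= 1024)"]
requirements += nonworker_extra_requirements({"RIFT_REQUIRE_NONWORKER": "EPNFS"})
requirements_line = "requirements = " + " && ".join(f"({r})" for r in requirements)
```

Using =?= instead of == means a slot that does not advertise the attribute evaluates to FALSE, not UNDEFINED, so such slots are excluded cleanly.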
…opt-in)

ALTERNATIVE to the runtime-wrapper approach (branch
rift_O4d_osg_runtime_container_select): use HTCondor's container universe with
container_image = $$([...]) instead of MY.SingularityImage = ifThenElse(...).

Why it works on OSG: MY.SingularityImage=ifThenElse(...) is an execute-side
ClassAd expression that OSPool glidein pilots read as a LITERAL string and hold
the job on. container_image with a $$() token is resolved by HTCondor via
match-time machine-ad substitution (in the schedd, against the matched machine
ad) BEFORE the job reaches the EP, so the pilot only ever sees a literal image
URL. $$ in container_image is HTCondor's *documented* mechanism for selecting a
container image by GPU CUDA capability, and container universe is the current
OSPool-standard (it deprecated +SingularityImage); osdf:// container images are
supported and OSDF-cached; GPU access is automatic under request_gpus (no --nv
needed). The same path also works on the CIT-local pool, so this unifies both
pools (vs the ifThenElse path which is CIT-local-only).

- container_manifest.build_container_image_select(manifest): returns the
  $$([ ifThenElse(attr =?= undefined, <fallback img>, <ifThenElse selector>) ])
  value. Image branches are the manifest images VERBATIM (osdf URL fetched by
  container universe, or cvmfs/local path in place) -- not a ./basename rewrite.
  The =?= undefined guard makes a CPU-only / non-advertising slot fall to the
  fallback image instead of an undefined $$() that would hold the job.

- write_ILE_sub_simple: when RIFT_CONTAINER_UNIVERSE is set (and a family
  manifest + use_singularity), set universe=container, emit container_image =
  the $$() selector, and drop MY.SingularityImage / MY.SingularityBindCVMFS /
  the $$() transfer token (container universe transfers the image itself). The
  require_gpus floor is still applied. Default (env unset) behavior is unchanged:
  the existing ifThenElse MY.SingularityImage path for CIT-local runs.
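
A rough sketch of how such a guarded selector string could be assembled, for orientation only.
The manifest layout (a dict with `capability_attr`, `fallback`, and `images` as
(minimum capability, image) pairs), the quoting of image strings, and the image URLs are all
assumptions made for this example; the real container_manifest module may differ.

```python
def build_container_image_select(manifest):
    """Illustrative only: build a $$([ ... ]) selector guarded against an undefined attr."""
    attr = manifest["capability_attr"]
    fallback = manifest["fallback"]
    # Start from the fallback and wrap outward in ascending threshold order,
    # so the highest capability threshold ends up tested first.
    expr = '"{}"'.format(fallback)
    for min_cap, image in sorted(manifest["images"]):
        expr = 'ifThenElse({} >= {}, "{}", {})'.format(attr, min_cap, image, expr)
    # The =?= undefined guard sends non-advertising slots to the fallback image.
    guarded = 'ifThenElse({} =?= undefined, "{}", {})'.format(attr, fallback, expr)
    return "$$([ {} ])".format(guarded)


# Placeholder manifest; the URLs are invented for the example.
example_manifest = {
    "capability_attr": "GPUs_Capability",
    "fallback": "osdf:///example/rift-cpu.sif",
    "images": [(9.0, "osdf:///example/rift-cc90.sif"),
               (12.0, "osdf:///example/rift-cc120.sif")],
}
container_image = build_container_image_select(example_manifest)
```

Image strings are passed through verbatim, matching the "not a ./basename rewrite" note above.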

Tests: container_image select expression (undefined-safe, verbatim osdf URLs,
fallback) and integration (universe=container, container_image=$$([...]),
no MY.SingularityImage / no transfer token, floor present). Existing CIT-local
and single-sif tests unchanged.

Trade-off vs the wrapper branch: this is much smaller and uses native/documented
HTCondor machinery, but relies on the matched slot advertising the capability
attribute at match time; the wrapper detects the real GPU at job start instead.
ILE-only for now (CIP/PSD/calibration still use the ifThenElse path). Open item
to confirm on a real OSG GPU job: cvmfs bind + capability advertisement coverage
across OSPool sites.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…_universe

container family: OSG-safe per-machine image via container universe (opt-in)
…ainer universe)

write_calpilot_sub still handed the raw SINGULARITY_RIFT_IMAGE value to
MY.SingularityImage, so a .yaml/.yml family MANIFEST reached condor as the image
path and the job failed (a manifest is not a .sif).  The container-universe work
fixed write_ILE_sub_simple but never touched the CALPILOT writer, even though the
CALPILOT job runs ILE internally (GPU) and needs the same per-machine selection.

Mirror write_ILE_sub_simple exactly:
  * detect a container manifest (is_container_manifest) and expand it;
  * legacy (default): universe=vanilla, MY.SingularityImage = ifThenElse(...),
    plus the selective $$() osdf transfer token and a require_gpus floor;
  * container universe (opt-in RIFT_CONTAINER_UNIVERSE): universe=container,
    container_image = $$([...]) (match-time, OSG-safe), no MY.SingularityImage /
    SingularityBindCVMFS, image delivered via container_image (no transfer token).
A plain .sif / osdf:// value keeps the legacy single-image behavior unchanged.

Validated offline (pilot DAG build, OSG=1, family manifest) in both modes: the
generated CALPILOT.sub container_image is byte-identical to ILE.sub, and the
require_gpus floor is applied.  test_container_manifest.py: 15/15 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…_universe_calpilot

container family: extend CALPILOT to per-machine image (legacy + cont…
When SINGULARITY_RIFT_IMAGE is a container-family MANIFEST (.yaml/.yml), the
osdf:// image URLs live INSIDE the manifest, so the existing
`'osdf:' in singularity_image` auto-detect (which force-sets
use_oauth_files='scitokens' for single-image osdf runs) misses it.  Result:
no `use_oauth_services = scitokens` in the subs -> the execute point has no
credential to fetch the selected container -> every ILE/CIP/CALPILOT job is
held with "credential is required for osdf://...sif but was not discovered".

Add a manifest-aware branch: if singularity_image is a container manifest,
inspect its image URLs and pick the same credential the single-image path
would (igwn+osdf -> 'igwn', osdf -> 'scitokens').  Pipeline-writer only (bin/),
no container rebuild.  Validated offline: a family-manifest pilot build now
emits `use_oauth_services = scitokens` on ILE/ILE_extr/ILE_puff/CALPILOT/CIP/
CIP_0/CIP_worker0, matching the old working single-image subs.
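
The mapping in the parenthetical can be sketched as below. The helper name and the assumption
that the manifest's image URLs are available as a flat list of strings are hypothetical; only
the igwn+osdf -> 'igwn' and osdf -> 'scitokens' mapping comes from the description above.

```python
def oauth_service_for_images(image_urls):
    """Illustrative only: pick the credential the single-image path would pick."""
    # Check the more specific scheme first, since 'igwn+osdf:' also contains 'osdf:'.
    if any(url.startswith("igwn+osdf:") for url in image_urls):
        return "igwn"
    if any("osdf:" in url for url in image_urls):
        return "scitokens"
    # cvmfs or local paths need no credential.
    return None


# Placeholder URLs for the example.
service = oauth_service_for_images(["osdf:///example/a.sif", "/cvmfs/example/b.sif"])
```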

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…the GPU family

Two holds surfaced running the family-container pp-run/pp-run-pilot on CIT:
  * request_disk=4G held jobs that landed on Blackwell ("no space left on device"
    mid osdf transfer): the cc90-120 CUDA-12.8 image is 6.35 GB.  Bump PP_DISK
    default 4G->16G (OSG branch) to cover the largest image + unpack headroom.
  * request_memory=4096 (pseudo_pipe default; the demo never overrode it) held ILE
    on "memory usage exceeded request_memory": the FUSED calmarg precompute holds N
    cal realizations and the adaptive draw count doubles (NCAL 100->800), spiking
    RSS past 4 GB.  Add PP_MEM_ILE (default 8192, the historical standard, ~2.4x the
    observed ~3.4 GB peak) -> --internal-ile-request-memory; flows to ILE, ILE_puff,
    ILE_extr (auto 2x=16384) and CALPILOT (request_memory_ILE).
Both overridable per-run (PP_DISK=, PP_MEM_ILE=).  Demo Makefile only.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…CIP fix)

A CPU-only job (CIP) requests no GPU, so it matches a slot that advertises NO
GPU capability attribute. The per-machine container_image = $$([ ... capability
... ]) then has nothing to resolve against: the $$() substitution fails to
expand and HTCondor HOLDS the job -> all CIPs lock up.

Fix: when a job requests no GPU, do not emit a $$() capability selection at all;
use a SINGLE fixed container (the manifest fallback, i.e. the CPU-safe image).

- build_container_image_select(manifest, request_gpu=True): with request_gpu=
  False it returns the plain fallback image literal (no $$(), no ifThenElse).
- write_ILE_sub_simple passes request_gpu through (GPU jobs keep the $$ selector;
  a no-GPU ILE would also collapse).
- write_CIP_sub: wire container universe for CIP too (universe=container,
  container_image = fallback literal, no MY.SingularityImage / BindCVMFS / $$()
  transfer token). CIP is CPU-only so it always collapses to the single image;
  no require_gpus floor (unchanged).

Also corrects the stale CIP comment that claimed an undefined capability
"collapses to the fallback image" -- true-ish for the native ifThenElse, but
false for $$(), which holds the job.

Tests: build_container_image_select(request_gpu=False) -> bare fallback image;
CIP integration (universe=container, container_image = single fallback literal,
no MY.SingularityImage / no $$() token / no require_gpus). 17/17 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…o-hold)

Folds in the undefined-safe guard (orig 8b9a0c5d, fix/manifest-cpu-fallback) and
unifies it with the container-universe collapse already in this branch.

build_singularity_image_expr and build_transfer_input_expr emitted a bare
ifThenElse/ternary over TARGET.GPUs_Capability with no guard for that attr being
undefined.  A job that matches a slot with no capability attribute -- a CPU-only
CIP slot, OR an OSPool GPU site that doesn't advertise it -- makes every
`TARGET.attr >= N` undefined, so the whole $$([...]) token "cannot expand" and
HTCondor HOLDS the job ("Cannot expand $$ expression").

Add an `undefined_safe` option to _build_selector that wraps the selector in
`TARGET.attr =?= undefined ? fallback : <selector>` (ternary for the comma-free
transfer token; ifThenElse otherwise).  Apply it to both legacy builders, and
refactor build_container_image_select to reuse it (DRY) instead of its own inline
guard.  An undefined-capability match now yields the fallback (smallest, CPU-safe)
image on every path instead of an unresolvable $$().

This is the central no-hold guard for the LEGACY (non-container-universe) path,
complementing the deterministic build-time collapse this branch already does for
CPU-only jobs under container universe (CIP -> single fallback container).
NB: the osdf scitokens credential for manifest images is a separate fix already
on dev (3e18793; re-proposed in PR #11) -- not duplicated here.

Tests: legacy builders are undefined-safe; updated the two exact-string
expression tests to the guarded form. 18/18 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…r_universe

container universe: collapse to a single container for non-GPU jobs (…
…r_universe_calpilot

scitokens with manifests
calmarg_ci.ini set ile-runtime-max-minutes=60 under the stale assumption that ILE
jobs are "~1-3 min each".  With the FUSED calmarg kernel + adaptive cal-draw doubling
(NCAL 100->800) + distance marg, a 50-point ILE job routinely exceeds 60 min on the
slower 1050 Ti slots (the bulk of the CIT pool), so the 60-min periodic_remove wall
killed ~half the jobs -> retried -> churn -> the iteration never converged (no all.net
after 5 h).  Raise to 120 min so slow-GPU jobs finish.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…r_calmarg_in_loop

# Conflicts:
#	CHANGES.rst
… spikes)

8192 still held the wide ILE jobs ("over cgroup memory limit of 8192").  The cal draws
do NOT double here (they stay at 100); the memory driver is the AV extrinsic sampler
spinning toward --n-max 4e6 on pathological low-cal-n_eff points, accumulating sample
arrays past 8 GB.  Completers peak ~7.3 GB; 16384 (2.2x) covers the hard-point spikes
and still matches most GPU nodes (median ~27 GB RAM).  ILE_extr auto-gets 2x.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
sniff() opened --inj-file in text mode and iterated lines to auto-detect
the hyperpipeline ASCII format. The pipeline routinely hands it gzipped
XML grids (overlap-grid-N.xml.gz): util_ParameterPuffball.py always calls
_hpio.sniff(opts.inj_file) before falling back to xml_to_ChooseWaveformParams_array.
The text-mode read raises UnicodeDecodeError on the gzip magic (1f 8b) ->
  "'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte"
and the except clause only caught (OSError, IOError) (UnicodeDecodeError is a
ValueError), so it propagated and the PUFF node crashed -> the DAG stalled.

Fix: peek the first bytes in binary and bail early for gzip (1f 8b) or a
leading '<' (XML) -- neither is ever a hyperpipeline ASCII grid -- and also
catch UnicodeDecodeError in the text-sniff fallback so any other binary input
returns False instead of raising. True/False detection of real hyperpipeline
files (magic line or lnL/sigma_lnL header) is unchanged.
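
The binary-peek-then-text pattern described above might look like the following sketch. It is
not the patched sniff(): the header test (a leading '#' line naming lnL and sigma_lnL) is an
assumed stand-in for the real magic-line check, and the file names are examples.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"


def sniff(path):
    """Illustrative only: True for a plain-text grid with an lnL/sigma_lnL header."""
    try:
        with open(path, "rb") as fh:
            head = fh.read(64)  # peek in binary so no decoding can fail here
    except OSError:
        return False
    # Gzip streams and XML documents are never hyperpipeline ASCII grids.
    if head.startswith(GZIP_MAGIC) or head.lstrip().startswith(b"<"):
        return False
    try:
        with open(path, "r", encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue
                # Decide on the first non-blank line (assumed header convention).
                return line.startswith("#") and "lnL" in line and "sigma_lnL" in line
    except (OSError, UnicodeDecodeError):
        # Any other binary input is simply "not a grid", never an exception.
        return False
    return False


with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "overlap-grid-0.xml.gz")
    with gzip.open(gz_path, "wb") as fh:
        fh.write(b"<?xml version='1.0'?>")
    dat_path = os.path.join(tmp, "grid.dat")
    with open(dat_path, "w", encoding="utf-8") as fh:
        fh.write("# lnL sigma_lnL m1 m2\n-1.0 0.1 30 25\n")
    bin_path = os.path.join(tmp, "blob.bin")
    with open(bin_path, "wb") as fh:
        fh.write(b"\xff\xfe\x00\x8b")
    results = (sniff(gz_path), sniff(dat_path), sniff(bin_path))
```

Catching UnicodeDecodeError explicitly matters because it is a ValueError subclass and is not
covered by an (OSError, IOError) clause.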

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…) + CIP single image

Fixes the CIT-LOCAL hold wave (CITLOCAL_BREADCRUMB_gpus_capability_undefined_holds.md):
~45% of CIT GPU slots satisfy the per-GPU require_gpus floor (per-GPU `Capability`
inside AvailableGPUs) yet do NOT advertise the machine-level rollup attr
`GPUs_Capability` that the family `$$()`/`ifThenElse` selection reads. On those
slots the selection "cannot expand" and the job HOLDS (presents as stuck /
MachineAttrMachine0=undefined). Measured 621 undefined / 741 defined, spanning
node*/aframe/mly (not one bad host).

Correct fix = do NOT match undefined-capability slots (don't guess their image):

- container_manifest.build_capability_defined_requirement(manifest) ->
  "TARGET.<attr> =!= undefined" (generic on capability_attr; no-op where every
  GPU slot advertises it). GPU family jobs (ILE, CALPILOT) append it to
  Requirements. The defined set still includes the cc12.0 Blackwell nodes, so the
  family's purpose (Blackwell vs older) is preserved.

- REVERT the undefined-safe `=?= undefined -> fallback` guard added in the prior
  PR (now on dev). It is UNSAFE for GPU jobs: an undefined-capability slot could be
  a Blackwell that hard-fails on the cuda-11.8 fallback -- the exact failure the
  family exists to avoid. We must not match it, not guess an image. _build_selector
  / build_singularity_image_expr / build_transfer_input_expr / build_container_image_select
  are back to a bare selector (fail-loud: an unexcluded undefined slot HOLDS rather
  than silently running the wrong image).

- CIP (CPU, no GPU) holds the same way -- there is no GPU capability at all. CIP
  needs no GPU/arch-specific image, so it now uses a SINGLE fixed container = the
  manifest fallback on BOTH paths: legacy MY.SingularityImage = "./<fallback>"
  (QUOTED; a bare path is a ClassAd parse error) + transfer just that image;
  container universe container_image = the fallback URL. New helper
  build_fallback_single_image(manifest) -> (runtime_path, transfer_url).
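
As a sketch of the first bullet, the exclusion clause could be generated like this. The dict
manifest, its `capability_attr` key with a `GPUs_Capability` default, and the base requirement
string are assumptions for illustration; only the emitted "TARGET.<attr> =!= undefined" form is
taken from the description.

```python
def build_capability_defined_requirement(manifest):
    """Illustrative only: require that the matched slot advertises the capability attr."""
    attr = manifest.get("capability_attr", "GPUs_Capability")
    # =!= is ClassAd's "is not identical to"; it is never itself undefined.
    return "TARGET.{} =!= undefined".format(attr)


# GPU family jobs append the clause; the base requirement is a placeholder.
requirements = ["(TARGET.CUDACapability >= 6.0)"]
requirements.append("({})".format(build_capability_defined_requirement({})))
requirements_line = "requirements = " + " && ".join(requirements)
```

Because the clause filters at match time, the bare selector never sees an undefined attribute on
a slot the job can actually land on.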

NOTE: the osdf scitokens credential for manifest images (3e18793) is already on
dev. The getenv True->* default (dag_utils_generic vs dag_utils) is a separate,
related item the breadcrumb flags -- not addressed here.

Tests: capability-defined requirement (+ attr override); fallback single image
(cvmfs in place vs osdf transferred); selectors are NOT undefined-guarded; ILE
(legacy + container universe) emit the Requirements exclusion; CIP legacy emits a
single quoted fallback with no exclusion. 22/22 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…TENV=false)

dag_utils_generic.py defaulted default_getenv_value / default_getenv_osg_value to
'True', emitting `getenv = True`, which schedds with SUBMIT_ALLOW_GETENV=false
(e.g. CIT) reject -> the DAG aborts.  The newer dag_utils.py already defaults '*'
(all-env, the modern form); bring generic in line (value-only change, file's own
formatting preserved to minimize a later oshaughn/rift_O4d->rift merge conflict).
Still overridable via RIFT_GETENV / RIFT_GETENV_OSG.
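
A small sketch of the default-plus-override behavior, under the assumption that the two
environment variables simply replace the default string. The function name and `env` argument
are hypothetical.

```python
import os


def default_getenv(env=None, osg=False):
    """Illustrative only: '*' unless RIFT_GETENV / RIFT_GETENV_OSG overrides it."""
    env = os.environ if env is None else env
    key = "RIFT_GETENV_OSG" if osg else "RIFT_GETENV"
    return env.get(key, "*")


# Line written into the submit file when nothing is overridden.
sub_line = "getenv = {}".format(default_getenv({}))
```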

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…cap_undefined

container family: exclude undefined-capability GPU slots (CIT-local holds) + CIP single image
…slice n_eff

The .dslice export split K into an importance-reweight "core" (reweight the main
sampler's stored Omega RVs at posterior-d quantiles) + fresh fixed-d "wings". At
low main-loop n_eff (the right regime for d-slices: the honesty comes from the
per-slice integration, not a big main loop) the reweight core is STARVED -- it
carries the same MC noise as a fair-draw histogram, reintroducing exactly the
resolution problem the slices are meant to remove.

Add --distance-slice-all-fresh (ILE): emit ALL K slices as FRESH fixed-d Omega
integrations, no core. Placement = posterior-d quantiles (rough placement is fine
even at low n_eff; precision comes from each fresh integral). The old core/wing
split remains the default; the flag overrides it. Threaded through
create_event_parameter_pipeline_BasicIteration
(--last-iteration-export-distance-slices-all-fresh) and util_RIFT_pseudo_pipe.py
(--export-distance-slices-all-fresh).

Also expose the per-slice precision knobs end to end (they were ILE-only):
--export-distance-slices-wing-neff / -wing-nmax -> ... -> ILE
--distance-slice-wing-neff / --distance-slice-wing-nmax. With all-fresh these set
the n_eff/n_max of EVERY slice, i.e. the precision of each L(d) row.

ILE: guard the reweight core + its GMM-low-neff warning behind n_core>0 (so 0
core is real, not clamped to >=1), empty-core arrays flow through the existing
concat unchanged (method column -> all FRESH). Validated locally on the CI
zero-noise data: --export-distance-slices 6 --distance-slice-all-fresh emits a
6-row .dslice, all method=1, one honest fixed-d integral each.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>