Rift o4d junior calmarg in loop: AFTER main 'distance' merge #139
Open
oshaughn wants to merge 247 commits into
Third pass of the hyperpipeline-format work. The first two commits standardised the ILE -> CIP shard chain (commit #1) and the CIP -> ILE grid handoff inside create_event_parameter_pipeline_BasicIteration plus the puff / fetch / dag_utils plumbing (commit #2). This commit makes util_RIFT_pseudo_pipe.py -- the standard wrapper that builds args files and then invokes BasicIteration -- respect the same env-var flag, so end-users can flip the entire wrapper-driven workflow over to .dat format with one environment variable.

Also includes the test/test_hyperpipeline_io.py file, which the prior two commits referenced ("12 tests"/"17 tests") but did not actually include in the staged file set.

Design constraint
-----------------
Per the project policy of "operate cohesively in one mode or the other -- no internal conversion": pseudo_pipe is a thin suffix-substituting wrapper. It does NOT convert XML to .dat (or vice versa) for any input. Upstream inputs (manual seed grids, template-bank-derived grids, etc.) must be staged in the format matching the active mode; pseudo_pipe refuses with a clear message when an XML-only auto-generation path would otherwise produce a file the rest of the workflow can't consume.

Files
-----
* bin/util_RIFT_pseudo_pipe.py
  Five surgical patches, all gated on _use_hpip_pp derived from the same RIFT_HYPERPIPELINE_FORMAT env var commits #1/#2 use:
  - Three new variables (_use_hpip_pp, grid_suffix_pp, sim_grid_flag_pp) defined once near the top, immediately after the RIFT_LOWLATENCY block. Mirrors the BasicIteration placement so the two scripts have parallel structure.
  - target_params writer (~line 639): in hyperpipeline mode, writes target_params.dat via hyperpipeline_io.write_grid_from_P_list with a column set auto-derived from whether P.eccentricity / P.meanPerAno are nonzero. Otherwise legacy ChooseWaveformParams_array_to_xml emits target_params.xml.gz. No behavioural change in legacy mode.
  - command-single --sim-xml line (~813): swapped to "{sim_grid_flag_pp} target_params.{grid_suffix_pp}", so the sanity-check ILE invocation routes through ILE's --sim-grid path in hyperpipeline mode. This is the path the --sim-grid reader patch from commit #2 was designed for.
  - --manual-initial-grid copy site (~line 1399): copies to proposed-grid.{grid_suffix_pp} regardless of mode. shutil.copyfile is format-agnostic; the source file's format must match the active mode (per the design constraint above). The --manual-initial-grid-supplements branch (which uses ligolw_add, XML-only) raises SystemExit in hyperpipeline mode with a clear message pointing the user at pre-merging supplements upstream.
  - --input-grid argument to create_event_parameter_pipeline_BasicIteration (~line 1418): now passes proposed-grid.{grid_suffix_pp}, threading the suffix through to BasicIteration so the two scripts agree on the seed-grid filename.
  - AMR / template-bank seed-grid auto-generation guard (~line 1506): in hyperpipeline mode without --manual-initial-grid, raises SystemExit with a message asking the user to stage the initial grid as .dat and pass via --manual-initial-grid. The XML-emitting util_AMRGrid.py and util_GridSubsetOfTemplateBank.py are intentionally untouched -- per the design constraint, no internal conversion.
  - --manual-initial-grid argparse help text updated to advertise both suffixes and note that the source format must match the active mode.

* test/test_hyperpipeline_io.py
  Recovered from the prior two commits, which referenced this file in their commit messages ("12 tests" in commit #1, extended to "17 tests" in commit #2) but did not include it in the staged file set. The file is otherwise byte-identical to the version exercised end-to-end during the prior commits' development.

  17 tests:
  - default_roundtrip
  - eccentricity_columns
  - tides_with_eos_index
  - to_legacy_dat_default
  - legacy_column_indices_consistency
  - sniff_distinguishes_legacy
  - sniff_recognizes_new_format
  - env_flag
  - concatenated_shards
  - read_many_skips_empties_and_mismatches
  - consolidate_weighted_average
  - consolidate_drops_high_sigma
  - grid_write_read_roundtrip_with_units
  - grid_distance_unit_conversion
  - grid_auto_suffix_append
  - column_alias_bridge
  - grid_no_lal_module_passthrough

  The file uses an importlib direct-load shim so it runs in stripped-down environments (no lalsuite / scipy required), making it usable for CI on minimal containers.

Audit
-----
A full pass over util_RIFT_pseudo_pipe.py confirmed every remaining xml.gz / --sim-xml string is one of:
* a comment / variable definition / argparse help mentioning the pair of supported suffixes (lines 34, 35, 47, 48, 321, ...);
* a defaults string for an external file (PSD, coinc, ini) that is legitimately external and stays XML;
* inside a code path I gated to refuse-and-exit in hyperpipeline mode (the AMR seed-grid block at ~line 1556 -- unreachable when _use_hpip_pp is True).

No live XML I/O paths reachable in hyperpipeline mode remain.

Tests
-----
All 17 tests in test/test_hyperpipeline_io.py pass. Every patched file in this commit and its dependents (commits #1, #2) compile via py_compile.

Followups
---------
This commit covers util_RIFT_pseudo_pipe.py only. Sibling drivers that still need the same treatment:
* bin/cepp_basic_htcondor (htcondor-only twin of BasicIteration)
* bin/util_RIFT_pseudo_pipe_lowlatency.py
* bin/util_RIFT_hyperpipe.py

To round out the seed-grid auto-generation paths so hyperpipeline mode no longer requires --manual-initial-grid, the underlying generators need parallel hyperpipeline output support:
* bin/util_AMRGrid.py
* bin/util_GridSubsetOfTemplateBank.py
* bin/helper_LDG_Events.py

The EXTR_out -> LI posterior_samples convert path (called by batchConvertExtr_job and friends in BasicIteration) is the last large XML-resident consumer in the intrinsic-pipeline domain; addressing that closes out the workstream.
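The suffix-substitution gating described in this commit can be sketched roughly as below. This is an illustration only, not the code in util_RIFT_pseudo_pipe.py: the helper names `resolve_grid_mode`, `sanity_check_args`, and `refuse_if_xml_only`, the `environ` argument, and the set of accepted truthy spellings for RIFT_HYPERPIPELINE_FORMAT are all assumptions.

```python
import os


def resolve_grid_mode(environ=None):
    """Return (use_hpip, grid_suffix, sim_grid_flag) for the active mode.

    Hypothetical helper: the real script defines _use_hpip_pp,
    grid_suffix_pp and sim_grid_flag_pp near the top of the file.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("RIFT_HYPERPIPELINE_FORMAT", "")
    # Assumption: any of these spellings switches hyperpipeline mode on.
    use_hpip = raw.strip().lower() in {"1", "true", "yes", "on"}
    grid_suffix = "dat" if use_hpip else "xml.gz"
    sim_grid_flag = "--sim-grid" if use_hpip else "--sim-xml"
    return use_hpip, grid_suffix, sim_grid_flag


def sanity_check_args(environ=None):
    """Build the single-point argument fragment for the active mode."""
    _, suffix, flag = resolve_grid_mode(environ)
    return f"{flag} target_params.{suffix}"


def refuse_if_xml_only(use_hpip, what):
    """Exit instead of converting formats, per the design constraint."""
    if use_hpip:
        raise SystemExit(
            f"{what} is XML-only; stage the input as .dat and pass it "
            "via --manual-initial-grid"
        )
```

Passing a plain dict as `environ` keeps the sketch testable without touching the process environment.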
…kelihood Move calibration marginalization from postprocessing (calibration_reweighting.py) into the ILE inner loop. Calibration is applied to the data (d -> C(f)d), so the template-template U,V cross terms (rho_sq) stay calibration-independent and are computed once; only the data term kappa changes per realization. DiscreteFactoredLogLikelihoodViaArrayVectorNoLoop gains an n_cal argument. With n_cal==1 the path is byte-identical to before. With n_cal>1 it caches the per-detector Q-product inputs, recomputes kappa per realization via the existing Q_inner_product kernel using a block-offset window (ifirst + c*N_window), and reduces with a streaming log-sum-exp over realizations (memory-neutral, reuses the validated kernel). The driver threads --calibration-n-realizations into the three production call sites. Also fixes a bug in ComputeModeIPTimeSeries: the calibration branch took the inner product against the original data instead of the calibration-modified data_now, so the calibration factor was never applied. Validated CPU and GPU paths against a brute-force per-realization reference to machine precision (RIFT/calmarg/test_calmarg_reduction.py). Design rationale, the apply-to-data vs apply-to-template convention, and remaining work (Option C fused kernel, seeding, param export) in RIFT/calmarg/DESIGN_calmarg_in_loop.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
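A minimal sketch of the streaming log-sum-exp reduction mentioned above, assuming the per-realization log-likelihoods arrive one at a time and that realizations are weighted uniformly. The function names are hypothetical and this is pure Python, not the array code in DiscreteFactoredLogLikelihoodViaArrayVectorNoLoop.

```python
import math


def streaming_logsumexp(values):
    """Accumulate log(sum(exp(v))) one value at a time.

    Only a running maximum and a rescaled running sum are kept, so no
    per-realization intermediate needs to be stored.
    """
    running_max = -math.inf
    running_sum = 0.0
    for v in values:
        if v == -math.inf:
            continue  # contributes exactly zero
        if v > running_max:
            # Rescale the old sum to the new maximum, then add exp(0) = 1.
            running_sum = running_sum * math.exp(running_max - v) + 1.0
            running_max = v
        else:
            running_sum += math.exp(v - running_max)
    if running_sum == 0.0:
        return -math.inf
    return running_max + math.log(running_sum)


def marginalize_over_realizations(lnL_per_realization):
    """Uniform (1/n_cal) average of the likelihood, computed in log space."""
    lnL_per_realization = list(lnL_per_realization)
    n_cal = len(lnL_per_realization)
    return streaming_logsumexp(lnL_per_realization) - math.log(n_cal)
```

With a single realization the average reduces to the input value, which is the n_cal==1 limit described in the commit.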
…sics scaffold) RIFT/calmarg/backtest_calmarg.py compares calmarg likelihood implementations on controlled synthetic inputs that exercise the per-realization block structure. METHODS registry: reference (brute-force per-block + logsumexp), in_loop_B (the n_cal>1 call), in_loop_C (stub raising NotImplementedError until the fused kernel lands). Reports max|lnL - reference| and best-of-N timing on CPU or GPU; wire the fused kernel into method_in_loop_C and it validates automatically. in_loop_B reproduces reference to ~1e-15 on CPU and GPU, with and without phase marginalization; ~3-4x faster than the brute-force reference on GPU. run_physics_backtest() scaffolds the heavier real-data comparison vs bilby calibration_reweighting.py (needs frames/PSDs/data_dump; runs on the stable host). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…default helper) Adds a single fused kernel (cuda_Q_fused_calmarg.cu + Q_fused_calmarg.py) that, per extrinsic sample, loops realizations x time x detectors x modes, forms kappa, applies the default factored-likelihood helper, and does a streaming Simpson-weighted log-sum-exp over (c,t) on-board -- returning lnL[j] in one launch, with no (batch, n_cal, npts) intermediate and no per-realization Python launches. Selected via cal_method='fused' in DiscreteFactoredLogLikelihoodViaArrayVectorNoLoop (default remains 'loop' = Option B). Time integration matches Option B exactly by passing the composite-Simpson weight vector w_t = simps(I, dx=deltaT) into the kernel. rho_sq is calibration-independent and passed pre-summed over detectors. Validated in backtest_calmarg.py (method in_loop_C) vs the brute-force reference and Option B to ~1e-15 on GPU. ~8-9x faster than Option B and ~25-32x faster than brute force (e.g. n_cal=200 x 8192 samples: 279 ms vs 2422 ms vs 7080 ms on sm_30). Scope (raises NotImplementedError otherwise): GPU only; phase_marginalization=False; default distance-unmarginalized helper only. Stage 2 = port the distmarg loglikelihood (table interp) into the kernel for the dominant distance-marginalized path. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
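The Simpson-weighted time reduction can be illustrated with a small pure-Python sketch. It assumes an odd number of time samples (an even count needs an end correction that is not reproduced here), and the function names are hypothetical rather than taken from Q_fused_calmarg.py.

```python
import math


def simpson_weights(npts, dt):
    """Composite-Simpson weights for an odd number of equally spaced samples."""
    if npts < 3 or npts % 2 == 0:
        raise ValueError("this sketch needs an odd number of samples >= 3")
    weights = [2.0] * npts
    weights[0] = weights[-1] = 1.0
    for i in range(1, npts - 1, 2):
        weights[i] = 4.0
    return [dt / 3.0 * w for w in weights]


def log_time_integral(lnL_t, dt):
    """log of the Simpson integral of exp(lnL_t), evaluated stably."""
    weights = simpson_weights(len(lnL_t), dt)
    peak = max(lnL_t)
    total = sum(w * math.exp(x - peak) for w, x in zip(weights, lnL_t))
    return peak + math.log(total)
```

Passing the weight vector into the reduction, rather than integrating afterwards, is what lets a single pass accumulate over calibration realization and time together.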
make_synthetic_case now builds per-detector rholms/U/V (dets=("H1","L1") default,
--dets CLI), so the fused kernel's detector loop and the function's per-detector
ifirst stacking are genuinely exercised (each detector gets a distinct ifirst from
its real location). Validated Option C vs reference and Option B to ~4e-15 with
H1,L1,V1; Option C ~9x faster than Option B (50 ms vs 445 ms, n_cal=100 x 1024 x 3 det).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… loop fallback) Adds a SEPARATE fused kernel (cuda_Q_fused_calmarg_distmarg.cu + wrapper Q_fused_calmarg_distmarg_cupy) that reproduces the distance-marginalization loglikelihood on-board: x0=kappa/rho_sq, the asinh-based s/t transforms, the EvenBivariateLinearInterpolator bilinear table lookup with its in-bounds mask, and exponent_max -- then the streaming Simpson-weighted cal+time log-sum-exp. Kept separate from the default-helper kernel on purpose: smaller review surface per kernel, the simpler kernel stays as a baseline, and cal_method='loop' (Option B) remains a full fallback for distmarg on CPU and GPU. Selected via cal_method='fused' plus a cal_distmarg table dict (cal_distmarg=None -> default-helper kernel). Harness gains --loglikelihood distmarg: builds a self-consistent table + a mirror Python closure (reference/Option B) so the fused kernel is validated against the same transform. Agreement ~1e-14 vs brute-force reference, single- and multi-detector; ~6-7x faster than Option B (e.g. n_cal=200 x 2048 x 3det distmarg: 333 ms vs 2358 ms). Default-helper path unchanged and still matches to ~3e-15. Scope (raises NotImplementedError otherwise): GPU only, phase_marginalization=False. Remaining: phase-marg support; wire driver distmarg sites to a cal_distmarg dict behind an opt-in flag (Option B stays default). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ernel (opt-in) Adds opt-in --calibration-fused-kernel (off by default). When set, on GPU, with calibration marginalization active, the driver packages the distance-marginalization lookup_table (s_array, t_array, lnI_array, bmax, bref) + xmin/xmax into a cal_distmarg dict and passes cal_method='fused' at the non-phase-marg distmarg call site (Option C). The phase-marg distmarg site and everything else stay on cal_method='loop' (Option B), which remains the default and the fallback for all cases. On CPU the flag is ignored with a warning (the fused kernels are GPU-only). Driver py_compile clean. Not yet exercised end-to-end on a real run; kernel/reduction correctness is covered by RIFT/calmarg/backtest_calmarg.py. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…marg table End-to-end driver test surfaced a real bug: in the distance-marginalization path P.dist is fixed at the fiducial, so invDistMpc is a SCALAR, but the fused kernel wrappers require one value per extrinsic sample (assert invDist.shape==(n_ext,)). The loop path tolerated the scalar via broadcasting; the fused path raised AssertionError (swallowed by the driver's generic handler). Fix: broadcast invDistMpc to (npts_extrinsic,) in the fused branch (works for scalar fiducial distance and for sampled-distance vectors alike). Also add backtest_calmarg --real-table to validate the fused distmarg kernel against a production util_InitMargTable .npz (real s/t ranges). Result: fused == reference == loop to ~2e-14 on the real table (default helper and synthetic distmarg unchanged, ~1e-14). Note: full-sampler end-to-end numerical comparison on the local 2GB NVS 510 is unreliable (OOM / nan under load); use a larger GPU for that. Wiring is confirmed: the flag builds the cal_distmarg dict, reaches the fused distmarg kernel, and runs to completion. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rginalization) Pipeline access: util_RIFT_pseudo_pipe.py gains --calmarg-envelope-directory, --calmarg-n-realizations, --calmarg-spline-count, --calmarg-fused-kernel, which append the corresponding ILE flags (--calibration-envelope-directory / -n-realizations / -spline-count / --calibration-fused-kernel) to args_ile.txt. Setting the envelope directory enables in-loop calmarg on the distance-marginalization code path; the fused kernel additionally needs GPU and otherwise falls back to the loop method. Demo: demo/rift/calmarg exercises baseline vs loop (Option B) vs fused (Option C) on the zero-spin synthetic CI data in 3 detectors (H1,L1,V1) by running ILE directly (no condor). Makefile targets: inputs, verify-exact (deterministic loop==fused==reference to ~1e-14 on the demo's real distmarg table), run-baseline/run-loop/run-fused, compare. Includes tools/make_cal_envelopes.py and tools/compare_lnL.py, and a README explaining the physics and that full-sampler runs agree only within Monte-Carlo noise (the GPU integrator is not bit-reproducible even with --seed) -- verify-exact is the rigorous equivalence check. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…o AV sampler Both fused kernels now skip any (c,t) sample whose window offset (ifirst + c*N_window + t) falls outside [0, npts_full) -- pathological/NaN extrinsic draws from the sampler can no longer cause CUDA_ERROR_ILLEGAL_ADDRESS; such samples simply do not contribute. Verified no change to the validated numerics (backtest still ~1e-15, default + distmarg). demo/rift/calmarg now uses the adaptive-volume sampler (SAMPLER=AV) by default instead of GMM. AV (mcsamplerAdaptiveVolume) is the mature/stable GPU code path and sets sampler.xpy=cupy under --gpu, so it works with the fused kernel; GMM (mcsamplerEnsemble) is newer and heavier on the GPU and was the likely source of the illegal-address crash in the full sampler run. SAMPLER is overridable (e.g. SAMPLER=GMM). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…sets Isolated on a CIT GPU (RTX 2080 Ti): baseline and the fused calmarg path run clean, but the loop calmarg path (Option B) hits CUDA_ERROR_ILLEGAL_ADDRESS at scale. Cause: for some sky positions the integration window extends one sample past the precomputed rholm buffer (ifirst+t >= N_window). In baseline (buffer length N_window) the tiny over-read lands in mapped pool memory and is silently tolerated; with calibration marginalization the buffer is n_cal blocks long, so the over-read in the LAST block is past the whole allocation and faults. Fix: the shared Q_inner kernel now skips time offsets with (i_first_time+i_time) >= num_time_points, contributing zero instead of reading out of bounds (a negative int index wraps to a large size_t and is caught too). This makes the loop path robust and also removes the latent silent over-read for ALL GPU runs (incl. non-calmarg). Valid indices are unaffected; backtest numerics unchanged (~1e-15, default + distmarg). NOTE: shared kernel used by all GPU ILE runs (slightly broader scope than the rest of this branch). The underlying window-sizing edge (window can exceed the storage buffer for extreme sky positions) is pre-existing; the guard makes it safe rather than masking a calmarg-specific bug. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… / OOB) A forced-overflow backtest (small N_window) exposed a correctness bug: when the integration window over-runs a calibration block (ifirst+t >= N_window), the previous guard (index < npts_full) only caught the LAST block (the crash). For earlier blocks the read silently bled into block c+1 -- wrong values, not a fault -- in BOTH the loop and fused paths (they agreed with each other but disagreed with the per-block n_cal==1 reference). Fix: guard the WITHIN-block offset against [0, N_window) in both fused kernels, and in the loop path slice Q to the current block and pass the within-block offset (so the shared Q_inner_product kernel's guard fires at the block boundary). The CPU loop branch likewise zeros out-of-range rows. An over-running window now contributes zero from that detector at that (c,t), matching the n_cal==1 reference exactly. Validated: with forced overflow (N_window=140, 3 IFOs) loop == fused == reference to ~1e-14 (default and distmarg); the no-overflow case and the CPU regression test are unchanged (~1e-15). Note: this is independent of the loud-source loop-vs-fused gap seen in the full sampler run (that is Monte-Carlo scatter -- both paths share identical behavior here -- pending the reproducibility check). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
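The difference between the earlier flat-index guard and the within-block guard can be shown on a toy buffer. This is a hedged Python sketch with hypothetical function names; the actual guards live in the CUDA kernels and the loop path.

```python
def read_flat_guard(buffer, block, offset, n_window):
    """Earlier guard: only the flat index is checked against the allocation."""
    index = block * n_window + offset
    if index < 0 or index >= len(buffer):
        return 0.0
    return buffer[index]


def read_within_block(buffer, block, offset, n_window):
    """Fixed guard: the offset must stay inside [0, n_window) of its block."""
    if offset < 0 or offset >= n_window:
        return 0.0
    return buffer[block * n_window + offset]


# Two calibration blocks of three samples each, stored back to back.
toy_buffer = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
```

Reading one sample past block 0 with the flat guard returns a value from block 1 (a silent wrong answer); only the last block trips the flat guard. The within-block guard returns zero in both cases.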
… sampling) Generalize the cal marginalization from the (1/n_cal) average to a weighted sum Z_cal = sum_c exp(log_w[c]) * Z_c / sum_c exp(log_w[c]), where log_w[c] are per-realization importance log-weights (w_c = prior/proposal). This is the enabling hook for adaptive / importance cal sampling at high SNR, where prior draws become inefficient as the cal posterior departs from the prior. New optional cal_log_weights (length n_cal) threads through DiscreteFactoredLogLikelihoodViaArrayVectorNoLoop into the loop reduction and both fused kernels (which now take log_w[] + log_w_norm=logsumexp(log_w); lnL_t += log_w[c], final term -log_w_norm). Default None = uniform = the plain (1/n_cal) average, so all current behavior and verify-exact are byte-identical. Validated (backtest_calmarg --random-cal-weights): loop == fused == reference to ~1e-14 with non-uniform weights, for default helper and distmarg, on CPU and GPU, with and without window overflow; uniform path unchanged (~1e-15). The learning loop that produces non-uniform weights (fit a cal proposal from per-realization responsibilities, redraw, iterate) is the planned follow-on. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
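The weighted sum Z_cal = sum_c exp(log_w[c]) * Z_c / sum_c exp(log_w[c]) can be written in log space as follows. This is a minimal pure-Python sketch with hypothetical function names, not the reduction inside the kernels.

```python
import math


def _logsumexp(xs):
    """Stable log(sum(exp(x))) for a non-empty list."""
    peak = max(xs)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(x - peak) for x in xs))


def weighted_cal_lnZ(lnZ_c, log_w=None):
    """log Z_cal for per-realization lnZ_c and importance log-weights.

    log_w=None means uniform weights, i.e. the plain (1/n_cal) average.
    """
    if log_w is None:
        log_w = [0.0] * len(lnZ_c)
    if len(log_w) != len(lnZ_c):
        raise ValueError("log_w must have one entry per realization")
    log_w_norm = _logsumexp(log_w)
    terms = [lw + lz for lw, lz in zip(log_w, lnZ_c)]
    return _logsumexp(terms) - log_w_norm
```

Because the result is normalized by logsumexp(log_w), adding a constant to every log-weight leaves it unchanged, so the weights only need to be known up to an overall factor.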
…d low-SNR variant compare_lnL.py was reading the LAST .dat column, which is neff (effective sample count), not the marginalized lnL. The ILE row ends with [... lnL, sqrt_var, ntotal, neff], so lnL is column [-4]. This explains the spurious "loud source / large loop-vs-fused gap" seen earlier: those numbers were neff (~1800-1900 of NMAX=20000), which legitimately scatter run-to-run and are meaningless to compare. The bundled injection is actually network SNR ~17.5 (verified with util_FrameZeroNoiseSNR), so the true marginalized lnL is ~150, not ~1880. compare_lnL.py now reports lnL +- sampling error and neff separately. Add a quiet-source variant: make lowsnr-inputs generates a fainter copy of the same injection (m1=35,m2=30 at larger distance, ~SNR 9) on the fly -- no committed binaries, same path as the CI data (util_WriteInjectionFile.py + util_WriteFrameAndCacheFromXML.sh) -- and make low-snr runs the full comparison on it. INJ_DIST tunes the loudness. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
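The column fix can be sketched as a small row parser. Only the trailing layout [... lnL, sqrt_var, ntotal, neff] comes from the commit message; the function name and the leading columns in the example row are made up for illustration.

```python
def parse_ile_row(line):
    """Pull the trailing summary columns out of one whitespace-separated row."""
    fields = [float(tok) for tok in line.split()]
    if len(fields) < 4:
        raise ValueError("expected at least lnL, sqrt_var, ntotal, neff")
    lnL, sqrt_var, ntotal, neff = fields[-4:]
    return {"lnL": lnL, "sqrt_var": sqrt_var, "ntotal": ntotal, "neff": neff}


# Made-up row: the leading columns are placeholders, not real ILE output.
example_row = "0 35.0 30.0 150.2 0.03 20000 1850"
summary = parse_ile_row(example_row)
```

Indexing from the end (`fields[-4]`) rather than taking the last column is the whole fix: the last column is neff.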
The demo was analyzing overlap-grid.xml.gz event 0 (m1=m2=26.4) against an injection at
m1=35,m2=30 -- a far-off template, giving lnL ~ -2.7 ("no signal") and hiding the
calibration-marginalization effect. Point --sim-xml at the injection itself
(mdc.xml.gz, a single matched point) so the signal is present: lnL jumps to ~78 even
heavily undersampled on a 2GB card (-> ~rho^2/2 ~ 150 when converged). The low-snr
target uses the matched mdc_lowsnr.xml.gz. SIM_XML is overridable.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e-point runs The single-point matched-template full-sampler run has a narrow extrinsic peak that RIFT's adaptive sampler can fail to lock robustly (sensitive to NCHUNK/SNR/GPU env): neff~1 (one draw dominates) or neff large but lnL~0 (spread off-peak, missed signal). Sanity-check sqrt(2*lnLmax) ~ injected SNR. Use low-snr + modest NCHUNK for a robust full run; verify-exact (deterministic ~1e-14) is the rigorous loop-vs-fused check that does not depend on sampler convergence. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
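The sanity check sqrt(2*lnLmax) ~ injected SNR is easy to script. In this sketch the function names and the 20% tolerance are assumptions chosen for illustration, not values from the demo.

```python
import math


def implied_snr(lnL_max):
    """SNR implied by a peak log-likelihood, using lnL_max ~ rho^2 / 2."""
    if lnL_max <= 0.0:
        return 0.0
    return math.sqrt(2.0 * lnL_max)


def looks_locked(lnL_max, injected_snr, rel_tol=0.2):
    """True if the implied SNR is within rel_tol of the injected SNR.

    rel_tol is an arbitrary illustrative threshold.
    """
    return abs(implied_snr(lnL_max) - injected_snr) <= rel_tol * injected_snr
```

For the bundled injection (network SNR ~17.5), a converged run near lnL ~ 150 passes this check, while a result near zero or negative fails it.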
cal_method='fused' now works with xpy=np: cupy is imported lazily in Q_fused_calmarg.py (module imports fine on a machine without CUDA), and pure-numpy implementations Q_fused_calmarg_numpy / _distmarg_lnL_numpy mirror the CUDA kernels exactly (within-block guard, importance weights, Simpson weights, distmarg table transform). factored_likelihood routes the fused branch to numpy on CPU and the cupy kernels on GPU. This lets the fused path run on a laptop and gives an INDEPENDENT cross-check of the kernel math. Validated (backtest --backend cpu): fused-numpy == loop == reference to ~1e-15 for default helper and distmarg, with non-uniform importance weights and with window overflow; GPU fused-cupy unchanged (~1e-15); module imports with cupy blocked. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Driver:
- use_fused_calmarg flag; wire the NON-distance-marginalization ILE call site to the
fused default-helper kernel when --calibration-fused-kernel is set (was only wired
at the distmarg site).
- drop the GPU gate on the fused path -- it now has a numpy backend, so it works on
CPU too (the loop method is still the default/fallback).
- fix a pre-existing bug exposed by running the non-distmarg GPU path with the AV
sampler: that likelihood_function used the passthrough xpy_asarray_already, but AV
hands back numpy arrays, so cupy ufuncs raised "Unsupported type numpy.ndarray".
Use xpy_default.asarray (a no-op for on-device arrays), matching the distmarg path.
Demo: BACKEND={gpu,cpu} and DMARG={1,0} toggles. verify-exact honours both. The full
matrix passes (loop == fused == reference ~1e-14): GPU/CPU x distmarg/default; and the
non-distmarg end-to-end run now works (loop ~ fused within sampling error). Distmarg
path unchanged (regression check passes).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…gnal lost) CRITICAL bug: in ComputeModeIPTimeSeries the concatenated n_cal-block rholm series was created with epoch=data.epoch, but the per-block series (after the roll/cut) has epoch data.epoch - hlms.epoch (rolled). Only the per-block *data* was copied in, never the epoch, so the concatenated series carried the wrong time reference. Downstream this put ifirst at ~within_block + (n_cal-1)*N_window -- i.e. in the LAST block instead of within block 0 -- so the integration window read past/into the wrong block, the within-block guard zeroed it, and the calmarg likelihood returned NaN/collapsed (lnL ~ -3) while baseline was fine (lnL ~ 115). This affected BOTH loop and fused, on CPU and GPU, and reproduced everywhere; verify-exact missed it because it feeds synthetic rholms with a manually-set epoch. Fix: set rholms_so_far.epoch = rholms_here.epoch (block 0's actual rolled/cut epoch), matching the non-calibration branch. Verified end-to-end: ifirst now lands in [0, N_window-npts] and baseline ~ loop ~ fused (lnL ~ 115-123 at the matched template, within the undersampled MC scatter) instead of collapsing. All synthetic regressions (verify-exact gpu/cpu x distmarg/default, CPU reduction test) still pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…poch bug) test_precompute_alignment.py exercises the REAL PrecomputeLikelihoodTerms / ComputeModeIPTimeSeries path (3 IFOs, identity calibration) and asserts that the calibration-marginalized rholm series matches the non-calibration series block-by-block in BOTH data and epoch. The epoch assertion fails on the alignment bug just fixed (|delta epoch| ~ (n_cal-1)*N_window*deltaT ~ 0.5 s), which verify-exact / test_calmarg_reduction cannot catch because they feed synthetic rholms with a hand-set epoch. CPU-only, no GPU needed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
return_lnLt with n_cal>1 now returns the calibration-marginalized lnL at each time bin lnL_marg(t) = log( sum_c exp(log_w[c]) exp(lnL_t,c(t)) ) - logsumexp(log_w) (the weighted average of the per-realization likelihood series), instead of raising. It is produced by the loop reduction (the fused scalar kernel is bypassed when a timeseries is requested). Verified: integrating exp(lnL_marg(t)) over time reproduces the time-integrated scalar lnL to ~1e-15. Driver: resample_samples() threads n_cal through, so the time-resampling export uses the cal-marginalized timeseries and all downstream time-sampling paths work unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
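A sketch of the per-time-bin reduction lnL_marg(t), assuming the per-realization series is available as a nested list indexed [c][t]. Function names are hypothetical.

```python
import math


def _logsumexp(xs):
    """Stable log(sum(exp(x))) for a non-empty list."""
    peak = max(xs)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(x - peak) for x in xs))


def marginalized_timeseries(lnL_ct, log_w=None):
    """Reduce lnL_ct[c][t] over realizations c, keeping the time axis."""
    n_cal = len(lnL_ct)
    if log_w is None:
        log_w = [0.0] * n_cal  # uniform weights
    norm = _logsumexp(log_w)
    n_t = len(lnL_ct[0])
    return [
        _logsumexp([log_w[c] + lnL_ct[c][t] for c in range(n_cal)]) - norm
        for t in range(n_t)
    ]
```

Since the reduction over c is linear in the likelihood, summing exp(lnL_marg(t)) over time gives the same number as averaging the per-realization time sums, which is the consistency check quoted in the commit.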
…, matrix toggle) Both fused kernels (default-helper and distmarg) and the numpy backend now support analytic phase marginalization: use |kappa| instead of Re(kappa) (the (2,-2)-mode conjugation is already baked into Q/A by the caller, exactly as in the loop path). factored_likelihood passes phase_marginalization through; the NotImplementedError is gone. Driver: the phase-marg distmarg ILE call site now uses the fused distmarg kernel when --calibration-fused-kernel is set. Validated against reference + loop to ~1e-14 across the full matrix: gpu/cpu x default/distmarg x phase 0/1 (8 cells PASS), plus precompute-alignment and CPU-reduction regressions. Demo gains a PHASE=0|1 toggle for verify-exact. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… node space) RIFT/calmarg/adaptive.py learns a unimodal Gaussian proposal over the calibration spline-node parameters so in-loop calmarg stays efficient at high SNR (where prior draws collapse to ~1 effective cal sample). Uses importance weighting w_c = prior/proposal (so the marginalized result is unbiased) and a TEMPERED proposal fit -- weights softmax(beta*log_resp), beta ramped 0.3->1.0, covariance inflated while tempering is on -- to avoid a single sample dominating given the very large lnL dynamic range, then sharpening as it learns. Pieces: envelope_node_prior, nodes_to_cal_factors (spline, matches generate_realizations), fit_proposal (tempered weighted Gaussian), neff_from_logweights, adaptive_cal (the loop, taking an `evaluate(nodes)->log integral L` callback and returning the final nodes + importance log-weights + neff history). The fit targets the cal posterior prior*L/proposal; neff of those -> n_real when the proposal matches. Self-contained convergence demo (python -m RIFT.calmarg.adaptive, no GPU/lal): a 2-sigma-off, narrow (high-SNR) cal -- prior-only neff ~1/300; adaptive recovers neff -> ~246 and the proposal converges onto the cal posterior to ~0.04 sigma. Driver integration (outer pilot/refine pass that calls ILE per realization to get log integral L) is the next step. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
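The effect of tempering on the effective sample count can be illustrated with a short sketch. It assumes the Kish definition neff = (sum w)^2 / sum w^2, which may differ from what neff_from_logweights in adaptive.py computes; the function names below are hypothetical.

```python
import math


def kish_neff_from_logweights(log_w):
    """Effective sample count (sum w)^2 / sum w^2 from log-weights."""
    peak = max(log_w)
    w = [math.exp(x - peak) for x in log_w]
    s1 = sum(w)
    s2 = sum(x * x for x in w)
    return s1 * s1 / s2


def tempered_weights(log_resp, beta):
    """Normalized softmax(beta * log_resp); beta < 1 flattens the weights."""
    scaled = [beta * x for x in log_resp]
    peak = max(scaled)
    unnorm = [math.exp(x - peak) for x in scaled]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

With a large dynamic range in log_resp, beta=1 lets one sample dominate the fit; a smaller beta spreads the weight over more samples, at the cost of a broader fitted proposal.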
backtest_calmarg.py --scan-ncal: per-likelihood-evaluation wall-time vs n_cal for reference/loop/fused, to quantify the cost of cal marginalization for planning. Data (GPU, 3 IFO, distmarg, 4096 extrinsic): marginal cost per extra realization ~57 ms (brute), ~23 ms (loop), ~3.3 ms (fused); at n_cal=200 reference is ~11 s/eval (hours for a full integration -> reference only), fused ~0.7 s/eval (production-feasible). DESIGN_adaptive_driver.md: planning doc for learning the cal proposal in the driver. Weighs (A) brute-force reference, (B) portable extrinsic+cal distribution / normalizing flow breadcrumbs, (C) lazy pilot. Recommends: production path must be fused; learn cal ONCE from a cheap pilot of high-likelihood points (cal is boring / extrinsic-independent) + Phase 0 importance weights; brute force is the validation reference; define a save/load breadcrumb interface (Gaussian now, NF later). No multi-stage loop. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…GETENV) The fan-out .sub used 'getenv = True', which CIT (and pools with SUBMIT_ALLOW_GETENV=false) reject -> submit fails. Set the environment explicitly instead (HOME + PYTHONPATH + JAX/thread caps); the absolute conda-python executable + PYTHONPATH are all the job needs. request_cpus=2 + OMP/OPENBLAS/MKL=2 + xla_cpu_multi_thread_eigen=false mirror the ulimit -u thread-spawn fix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
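One way to emit an explicit environment instead of getenv is sketched below. The helper names and the example paths are hypothetical, and mapping the thread caps to OMP_NUM_THREADS, OPENBLAS_NUM_THREADS, MKL_NUM_THREADS and XLA_FLAGS is an assumption about what the .sub sets; the commit only names them loosely.

```python
def explicit_environment_line(home, pythonpath, n_threads=2):
    """Build an HTCondor 'environment = "..."' line without getenv."""
    pairs = [
        ("HOME", home),
        ("PYTHONPATH", pythonpath),
        ("OMP_NUM_THREADS", str(n_threads)),
        ("OPENBLAS_NUM_THREADS", str(n_threads)),
        ("MKL_NUM_THREADS", str(n_threads)),
        # Assumption: the XLA setting is passed through XLA_FLAGS.
        ("XLA_FLAGS", "--xla_cpu_multi_thread_eigen=false"),
    ]
    for name, value in pairs:
        # Values with spaces or quotes need extra quoting; refuse them here.
        if any(ch.isspace() for ch in value) or '"' in value or "'" in value:
            raise ValueError(f"{name} needs quoting this sketch does not do")
    body = " ".join(f"{name}={value}" for name, value in pairs)
    return f'environment = "{body}"'


def submit_fragment(home, pythonpath, n_threads=2):
    """Environment line plus a matching request_cpus line."""
    return "\n".join([
        explicit_environment_line(home, pythonpath, n_threads),
        f"request_cpus = {n_threads}",
    ])
```

Keeping request_cpus and the thread caps tied to one parameter avoids the two drifting apart.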
Merge JAX-GP export-at-scale into calmarg
…kelihood Merge JAX ILE likelihood prototype
Counterpart to RIFT_BOOLEAN_LIST (which adds (<x> =?= TRUE) requirements to the REMOTE worker jobs ILE/CIP, gated on use_osg). The LOCAL/flock_local non-worker jobs (convert, test, consolidate, puff, join, unify, psd, resample, ...) run with absolute /home paths and NO file transfer, so on a pool whose execute points may lack /home (e.g. EPNFS=undefined) they fail "Failed to open .../*.out: No such file or directory". New helper _nonworker_extra_requirements() reads $RIFT_REQUIRE_NONWORKER (comma-separated ClassAd attrs) and returns ['<attr> =?= TRUE', ...]; appended to the requirements list before each non-worker requirements emission (workers CIP/ILE excluded -> they keep RIFT_BOOLEAN_LIST). Read at DAG-build time, so it is durable through `asimov manage submit` / the asimov daemon. Usage: export RIFT_REQUIRE_NONWORKER='EPNFS' -> pins local jobs to NFS-/home nodes.
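The helper described above might look roughly like this. It is a sketch of the stated behaviour, not the repository code: the `environ` argument and the `join_requirements` helper (including the ` && ` joining) are assumptions added for illustration.

```python
import os


def nonworker_extra_requirements(environ=None):
    """Turn $RIFT_REQUIRE_NONWORKER into '<attr> =?= TRUE' clauses."""
    environ = os.environ if environ is None else environ
    raw = environ.get("RIFT_REQUIRE_NONWORKER", "")
    attrs = [a.strip() for a in raw.split(",") if a.strip()]
    return [f"{attr} =?= TRUE" for attr in attrs]


def join_requirements(base, extra):
    """Hypothetical: AND the clauses together for a requirements line."""
    return " && ".join(f"({clause})" for clause in list(base) + list(extra))
```

Because the variable is read when the clauses are built, an unset or empty value yields an empty list and leaves the requirements untouched.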
…oshaughnessy-junior/research-projects-RIT into rift_O4d_junior_calmarg_in_loop
…opt-in)
ALTERNATIVE to the runtime-wrapper approach (branch rift_O4d_osg_runtime_container_select): use HTCondor's container universe with container_image = $$([...]) instead of MY.SingularityImage = ifThenElse(...).
Why it works on OSG: MY.SingularityImage=ifThenElse(...) is an execute-side ClassAd expression that OSPool glidein pilots read as a LITERAL string and hold the job on. container_image with a $$() token is resolved by HTCondor via match-time machine-ad substitution (in the schedd, against the matched machine ad) BEFORE the job reaches the EP, so the pilot only ever sees a literal image URL. $$ in container_image is HTCondor's *documented* mechanism for selecting a container image by GPU CUDA capability, and container universe is the current OSPool-standard (it deprecated +SingularityImage); osdf:// container images are supported and OSDF-cached; GPU access is automatic under request_gpus (no --nv needed). The same path also works on the CIT-local pool, so this unifies both pools (vs the ifThenElse path which is CIT-local-only).
- container_manifest.build_container_image_select(manifest): returns the $$([ ifThenElse(attr =?= undefined, <fallback img>, <ifThenElse selector>) ]) value. Image branches are the manifest images VERBATIM (osdf URL fetched by container universe, or cvmfs/local path in place) -- not a ./basename rewrite. The =?= undefined guard makes a CPU-only / non-advertising slot fall to the fallback image instead of an undefined $$() that would hold the job.
- write_ILE_sub_simple: when RIFT_CONTAINER_UNIVERSE is set (and a family manifest + use_singularity), set universe=container, emit container_image = the $$() selector, and drop MY.SingularityImage / MY.SingularityBindCVMFS / the $$() transfer token (container universe transfers the image itself). The require_gpus floor is still applied.
Default (env unset) behavior is unchanged: the existing ifThenElse MY.SingularityImage path for CIT-local runs.
Tests: container_image select expression (undefined-safe, verbatim osdf URLs, fallback) and integration (universe=container, container_image=$$([...]), no MY.SingularityImage / no transfer token, floor present). Existing CIT-local and single-sif tests unchanged.
Trade-off vs the wrapper branch: this is much smaller and uses native/documented HTCondor machinery, but relies on the matched slot advertising the capability attribute at match time; the wrapper detects the real GPU at job start instead.
ILE-only for now (CIP/PSD/calibration still use the ifThenElse path).
Open item to confirm on a real OSG GPU job: cvmfs bind + capability advertisement coverage across OSPool sites.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
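For readers unfamiliar with the $$([...]) form, here is a minimal sketch of how such a guarded selector string could be assembled. It is not the code in container_manifest: the function name, the (threshold, image) branch list, the attribute default and the example URLs are all assumptions made for illustration.

```python
def sketch_container_image_select(branches, fallback, attr="GPUs_Capability"):
    """Build a match-time container_image value (illustrative only).

    branches: list of (min_capability, image) pairs (hypothetical layout).
    fallback: image used when no branch matches or the attribute is undefined.
    """
    # Start from the fallback and wrap it in one ifThenElse per branch,
    # lowest threshold first, so the highest threshold ends up outermost.
    selector = f'"{fallback}"'
    for min_cap, image in sorted(branches):
        selector = f'ifThenElse({attr} >= {min_cap}, "{image}", {selector})'
    # Guard: a slot that does not advertise the attribute gets the fallback
    # instead of an expression that cannot be evaluated.
    guarded = f'ifThenElse({attr} =?= undefined, "{fallback}", {selector})'
    return f"$$([ {guarded} ])"


example = sketch_container_image_select(
    [(9.0, "osdf:///example/new.sif")],  # hypothetical image URL
    "osdf:///example/base.sif",          # hypothetical fallback URL
)
```

The image strings are passed through unchanged, matching the "VERBATIM" behaviour the commit message describes.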
…_universe container family: OSG-safe per-machine image via container universe (opt-in)
…ainer universe)
write_calpilot_sub still handed the raw SINGULARITY_RIFT_IMAGE value to
MY.SingularityImage, so a .yaml/.yml family MANIFEST reached condor as the image
path and the job failed (a manifest is not a .sif). The container-universe work
fixed write_ILE_sub_simple but never touched the CALPILOT writer, even though the
CALPILOT job runs ILE internally (GPU) and needs the same per-machine selection.
Mirror write_ILE_sub_simple exactly:
* detect a container manifest (is_container_manifest) and expand it;
* legacy (default): universe=vanilla, MY.SingularityImage = ifThenElse(...),
plus the selective $$() osdf transfer token and a require_gpus floor;
* container universe (opt-in RIFT_CONTAINER_UNIVERSE): universe=container,
container_image = $$([...]) (match-time, OSG-safe), no MY.SingularityImage /
SingularityBindCVMFS, image delivered via container_image (no transfer token).
A plain .sif / osdf:// value keeps the legacy single-image behavior unchanged.
Validated offline (pilot DAG build, OSG=1, family manifest) in both modes: the
generated CALPILOT.sub container_image is byte-identical to ILE.sub, and the
require_gpus floor is applied. test_container_manifest.py: 15/15 pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…_universe_calpilot container family: extend CALPILOT to per-machine image (legacy + cont…
…oshaughnessy-junior/research-projects-RIT into rift_O4d_junior_calmarg_in_loop
When SINGULARITY_RIFT_IMAGE is a container-family MANIFEST (.yaml/.yml), the osdf:// image URLs live INSIDE the manifest, so the existing `'osdf:' in singularity_image` auto-detect (which force-sets use_oauth_files='scitokens' for single-image osdf runs) misses it.
Result: no `use_oauth_services = scitokens` in the subs -> the execute point has no credential to fetch the selected container -> every ILE/CIP/CALPILOT job is held with "credential is required for osdf://...sif but was not discovered".
Add a manifest-aware branch: if singularity_image is a container manifest, inspect its image URLs and pick the same credential the single-image path would (igwn+osdf -> 'igwn', osdf -> 'scitokens'). Pipeline-writer only (bin/), no container rebuild.
Validated offline: a family-manifest pilot build now emits `use_oauth_services = scitokens` on ILE/ILE_extr/ILE_puff/CALPILOT/CIP/CIP_0/CIP_worker0, matching the old working single-image subs.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
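The credential rule stated above (igwn+osdf -> 'igwn', osdf -> 'scitokens') can be sketched as a small helper. The function name, the flat list of URLs and the precedence given to igwn+osdf when both schemes appear are assumptions; the actual pipeline writer may structure this differently.

```python
def pick_oauth_service(image_urls):
    """Return the credential name implied by a manifest's image URLs.

    image_urls: iterable of image locations taken from a manifest
    (hypothetical input shape). Returns None when no OSDF URL is present.
    """
    urls = list(image_urls)
    # Check the more specific scheme first: "igwn+osdf:" also contains "osdf:".
    if any(u.startswith("igwn+osdf:") for u in urls):
        return "igwn"
    if any(u.startswith("osdf:") for u in urls):
        return "scitokens"
    return None  # e.g. cvmfs or local paths need no transfer credential
```

Checking the image URLs themselves, rather than the manifest filename, is what lets the manifest case reach the same answer as the single-image case.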
…the GPU family
Two holds surfaced running the family-container pp-run/pp-run-pilot on CIT:
* request_disk=4G held jobs that landed on Blackwell ("no space left on device"
mid osdf transfer): the cc90-120 CUDA-12.8 image is 6.35 GB. Bump PP_DISK
default 4G->16G (OSG branch) to cover the largest image + unpack headroom.
* request_memory=4096 (pseudo_pipe default; the demo never overrode it) held ILE
on "memory usage exceeded request_memory": the FUSED calmarg precompute holds N
cal realizations and the adaptive draw count doubles (NCAL 100->800), spiking
RSS past 4 GB. Add PP_MEM_ILE (default 8192, the historical standard, ~2.4x the
observed ~3.4 GB peak) -> --internal-ile-request-memory; flows to ILE, ILE_puff,
ILE_extr (auto 2x=16384) and CALPILOT (request_memory_ILE).
Both overridable per-run (PP_DISK=, PP_MEM_ILE=). Demo Makefile only.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…CIP fix)
A CPU-only job (CIP) requests no GPU, so it matches a slot that advertises NO GPU capability attribute. The per-machine container_image = $$([ ... capability ... ]) then has nothing to resolve against: the $$() substitution fails to expand and HTCondor HOLDS the job -> all CIPs lock up.
Fix: when a job requests no GPU, do not emit a $$() capability selection at all; use a SINGLE fixed container (the manifest fallback, i.e. the CPU-safe image).
- build_container_image_select(manifest, request_gpu=True): with request_gpu=False it returns the plain fallback image literal (no $$(), no ifThenElse).
- write_ILE_sub_simple passes request_gpu through (GPU jobs keep the $$ selector; a no-GPU ILE would also collapse).
- write_CIP_sub: wire container universe for CIP too (universe=container, container_image = fallback literal, no MY.SingularityImage / BindCVMFS / $$() transfer token). CIP is CPU-only so it always collapses to the single image; no require_gpus floor (unchanged).
Also corrects the stale CIP comment that claimed an undefined capability "collapses to the fallback image" -- true-ish for the native ifThenElse, but false for $$(), which holds the job.
Tests: build_container_image_select(request_gpu=False) -> bare fallback image; CIP integration (universe=container, container_image = single fallback literal, no MY.SingularityImage / no $$() token / no require_gpus). 17/17 pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…o-hold)
Folds in the undefined-safe guard (orig 8b9a0c5d, fix/manifest-cpu-fallback) and
unifies it with the container-universe collapse already in this branch.
build_singularity_image_expr and build_transfer_input_expr emitted a bare
ifThenElse/ternary over TARGET.GPUs_Capability with no guard for that attr being
undefined. A job that matches a slot with no capability attribute -- a CPU-only
CIP slot, OR an OSPool GPU site that doesn't advertise it -- makes every
`TARGET.attr >= N` undefined, so the whole $$([...]) token "cannot expand" and
HTCondor HOLDS the job ("Cannot expand $$ expression").
Add an `undefined_safe` option to _build_selector that wraps the selector in
`TARGET.attr =?= undefined ? fallback : <selector>` (ternary for the comma-free
transfer token; ifThenElse otherwise). Apply it to both legacy builders, and
refactor build_container_image_select to reuse it (DRY) instead of its own inline
guard. An undefined-capability match now yields the fallback (smallest, CPU-safe)
image on every path instead of an unresolvable $$().
This is the central no-hold guard for the LEGACY (non-container-universe) path,
complementing the deterministic build-time collapse this branch already does for
CPU-only jobs under container universe (CIP -> single fallback container).
NB: the osdf scitokens credential for manifest images is a separate fix already
on dev (3e18793; re-proposed in PR #11) -- not duplicated here.
Tests: legacy builders are undefined-safe; updated the two exact-string
expression tests to the guarded form. 18/18 pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…r_universe container universe: collapse to a single container for non-GPU jobs (…
…r_universe_calpilot scitokens with manifests
calmarg_ci.ini set ile-runtime-max-minutes=60 under the stale assumption that ILE jobs are "~1-3 min each". With the FUSED calmarg kernel + adaptive cal-draw doubling (NCAL 100->800) + distance marg, a 50-point ILE job routinely exceeds 60 min on the slower 1050 Ti slots (the bulk of the CIT pool), so the 60-min periodic_remove wall killed ~half the jobs -> retried -> churn -> the iteration never converged (no all.net after 5 h). Raise to 120 min so slow-GPU jobs finish. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Merge Ralph/LISA work into calmarg
…r_calmarg_in_loop # Conflicts: # CHANGES.rst
… spikes)
8192 still held the wide ILE jobs ("over cgroup memory limit of 8192"). The cal draws
do NOT double here (they stay at 100); the memory driver is the AV extrinsic sampler
spinning toward --n-max 4e6 on pathological low-cal-n_eff points, accumulating sample
arrays past 8 GB. Completers peak ~7.3 GB; 16384 (2.2x) covers the hard-point spikes
and still matches most GPU nodes (median ~27 GB RAM). ILE_extr auto-gets 2x.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
sniff() opened --inj-file in text mode and iterated lines to auto-detect the hyperpipeline ASCII format. The pipeline routinely hands it gzipped XML grids (overlap-grid-N.xml.gz): util_ParameterPuffball.py always calls _hpio.sniff(opts.inj_file) before falling back to xml_to_ChooseWaveformParams_array.
The text-mode read raises UnicodeDecodeError on the gzip magic (1f 8b) -> "'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte" and the except clause only caught (OSError, IOError) (UnicodeDecodeError is a ValueError), so it propagated and the PUFF node crashed -> the DAG stalled.
Fix: peek the first bytes in binary and bail early for gzip (1f 8b) or a leading '<' (XML) -- neither is ever a hyperpipeline ASCII grid -- and also catch UnicodeDecodeError in the text-sniff fallback so any other binary input returns False instead of raising. True/False detection of real hyperpipeline files (magic line or lnL/sigma_lnL header) is unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
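The binary-peek pattern described in this fix can be illustrated with a standalone sketch. It is not the hyperpipeline_io implementation: the function name is hypothetical, and the positive test (a '#' header line containing lnL and sigma_lnL) is a simplified stand-in for the real magic-line/header detection, which is not shown on this page.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"


def looks_like_ascii_grid(path, header_tokens=("lnL", "sigma_lnL")):
    """Return True only for a text file whose first non-blank line is a
    '#' header naming the expected columns (assumed header convention)."""
    try:
        with open(path, "rb") as fh:
            head = fh.read(64)  # peek in binary; never decodes anything
    except OSError:
        return False
    # gzip streams and XML documents are never ASCII grids: bail early.
    if head[:2] == GZIP_MAGIC or head.lstrip()[:1] == b"<":
        return False
    try:
        with open(path, "r", encoding="utf-8") as fh:
            for line in fh:
                stripped = line.strip()
                if not stripped:
                    continue
                return stripped.startswith("#") and all(
                    tok in stripped for tok in header_tokens
                )
    except (OSError, UnicodeDecodeError):
        # Any other binary input: report "not a grid" instead of raising.
        return False
    return False


def demo_sniff():
    """Exercise the sketch on four small temporary files."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        gz = os.path.join(tmp, "grid.xml.gz")
        with gzip.open(gz, "wb") as fh:
            fh.write(b"<?xml version='1.0'?><LIGO_LW/>")
        xml = os.path.join(tmp, "grid.xml")
        with open(xml, "wb") as fh:
            fh.write(b"<?xml version='1.0'?><LIGO_LW/>")
        binary = os.path.join(tmp, "blob.bin")
        with open(binary, "wb") as fh:
            fh.write(b"ab\xff\xfe\n")
        ascii_grid = os.path.join(tmp, "grid.dat")
        with open(ascii_grid, "w", encoding="utf-8") as fh:
            fh.write("# lnL sigma_lnL m1 m2\n-1.0 0.1 30.0 25.0\n")
        for name, p in [("gzip", gz), ("xml", xml), ("binary", binary), ("ascii", ascii_grid)]:
            results[name] = looks_like_ascii_grid(p)
    return results
```

Catching UnicodeDecodeError explicitly matters because it subclasses ValueError, not OSError, which is exactly why the original except clause missed it.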
…) + CIP single image
Fixes the CIT-LOCAL hold wave (CITLOCAL_BREADCRUMB_gpus_capability_undefined_holds.md): ~45% of CIT GPU slots satisfy the per-GPU require_gpus floor (per-GPU `Capability` inside AvailableGPUs) yet do NOT advertise the machine-level rollup attr `GPUs_Capability` that the family `$$()`/`ifThenElse` selection reads. On those slots the selection "cannot expand" and the job HOLDS (presents as stuck / MachineAttrMachine0=undefined). Measured 621 undefined / 741 defined, spanning node*/aframe/mly (not one bad host).
Correct fix = do NOT match undefined-capability slots (don't guess their image):
- container_manifest.build_capability_defined_requirement(manifest) -> "TARGET.<attr> =!= undefined" (generic on capability_attr; no-op where every GPU slot advertises it). GPU family jobs (ILE, CALPILOT) append it to Requirements. The defined set still includes the cc12.0 Blackwell nodes, so the family's purpose (Blackwell vs older) is preserved.
- REVERT the undefined-safe `=?= undefined -> fallback` guard added in the prior PR (now on dev). It is UNSAFE for GPU jobs: an undefined-capability slot could be a Blackwell that hard-fails on the cuda-11.8 fallback -- the exact failure the family exists to avoid. We must not match it, not guess an image. _build_selector / build_singularity_image_expr / build_transfer_input_expr / build_container_image_select are back to a bare selector (fail-loud: an unexcluded undefined slot HOLDS rather than silently running the wrong image).
- CIP (CPU, no GPU) holds the same way -- there is no GPU capability at all. CIP needs no GPU/arch-specific image, so it now uses a SINGLE fixed container = the manifest fallback on BOTH paths: legacy MY.SingularityImage = "./<fallback>" (QUOTED; a bare path is a ClassAd parse error) + transfer just that image; container universe container_image = the fallback URL. New helper build_fallback_single_image(manifest) -> (runtime_path, transfer_url).
NOTE: the osdf scitokens credential for manifest images (3e18793) is already on dev. The getenv True->* default (dag_utils_generic vs dag_utils) is a separate, related item the breadcrumb flags -- not addressed here.
Tests: capability-defined requirement (+ attr override); fallback single image (cvmfs in place vs osdf transferred); selectors are NOT undefined-guarded; ILE (legacy + container universe) emit the Requirements exclusion; CIP legacy emits a single quoted fallback with no exclusion. 22/22 pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
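A minimal sketch of the exclusion idea, assuming the requirement is simply AND-ed onto whatever Requirements string the job already has. Both function names and the example base requirement are hypothetical; only the "TARGET.<attr> =!= undefined" form is taken from the commit message.

```python
def capability_defined_requirement(attr="GPUs_Capability"):
    """ClassAd clause that only matches slots advertising the attribute."""
    return f"TARGET.{attr} =!= undefined"


def append_requirement(existing, extra):
    """AND an extra clause onto an existing Requirements expression.

    Parentheses keep operator precedence of the original expression intact.
    """
    if not existing:
        return f"({extra})"
    return f"({existing}) && ({extra})"


# Hypothetical base requirement for a GPU job.
requirements = append_requirement(
    "HAS_SINGULARITY =?= TRUE", capability_defined_requirement()
)
```

Excluding such slots at match time is the fail-loud choice described above: the job waits for a slot whose capability is known rather than running a guessed image.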
…TENV=false)
dag_utils_generic.py defaulted default_getenv_value / default_getenv_osg_value to 'True', emitting `getenv = True`, which schedds with SUBMIT_ALLOW_GETENV=false (e.g. CIT) reject -> the DAG aborts. The newer dag_utils.py already defaults '*' (all-env, the modern form); bring generic in line (value-only change, file's own formatting preserved to minimize a later oshaughn/rift_O4d->rift merge conflict). Still overridable via RIFT_GETENV / RIFT_GETENV_OSG.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…cap_undefined container family: exclude undefined-capability GPU slots (CIT-local holds) + CIP single image
…slice n_eff
The .dslice export split K into an importance-reweight "core" (reweight the main sampler's stored Omega RVs at posterior-d quantiles) + fresh fixed-d "wings". At low main-loop n_eff (the right regime for d-slices: the honesty comes from the per-slice integration, not a big main loop) the reweight core is STARVED -- it carries the same MC noise as a fair-draw histogram, reintroducing exactly the resolution problem the slices are meant to remove.
Add --distance-slice-all-fresh (ILE): emit ALL K slices as FRESH fixed-d Omega integrations, no core. Placement = posterior-d quantiles (rough placement is fine even at low n_eff; precision comes from each fresh integral). The old core/wing split remains the default; the flag overrides it. Threaded through create_event_parameter_pipeline_BasicIteration (--last-iteration-export-distance-slices-all-fresh) and util_RIFT_pseudo_pipe.py (--export-distance-slices-all-fresh).
Also expose the per-slice precision knobs end to end (they were ILE-only): --export-distance-slices-wing-neff / -wing-nmax -> ... -> ILE --distance-slice-wing-neff / --distance-slice-wing-nmax. With all-fresh these set the n_eff/n_max of EVERY slice, i.e. the precision of each L(d) row.
ILE: guard the reweight core + its GMM-low-neff warning behind n_core>0 (so 0 core is real, not clamped to >=1), empty-core arrays flow through the existing concat unchanged (method column -> all FRESH).
Validated locally on the CI zero-noise data: --export-distance-slices 6 --distance-slice-all-fresh emits a 6-row .dslice, all method=1, one honest fixed-d integral each.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
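To make "placement = posterior-d quantiles" concrete, here is a small sketch that picks K slice distances from a set of distance samples. The choice of evenly spaced mid-point quantiles (k + 0.5)/K, the function name and the unweighted samples are assumptions; the commit message does not say which quantile levels ILE actually uses.

```python
import numpy as np


def slice_distances(d_samples, k):
    """Return k fixed distances at evenly spaced posterior quantiles.

    d_samples: 1-D array of distance samples (assumed unweighted fair draws).
    k: number of slices to place.
    """
    d = np.asarray(d_samples, dtype=float)
    # Mid-point quantile levels avoid the extreme tails (0 and 1).
    probs = (np.arange(k) + 0.5) / k
    return np.quantile(d, probs)


# Toy samples on a uniform grid, used only to show the placement.
placed = slice_distances(np.linspace(100.0, 200.0, 101), 4)
```

Each returned distance would then receive its own fresh fixed-d integration, so coarse placement from a low-n_eff posterior does not limit the precision of the individual L(d) rows.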
calmarg done in the ILE loop, including
as well as fancy tools to