
HAD Phase 4: trends_lin (Eq 17/18) + R-package end-to-end parity#392

Merged
igerber merged 10 commits into main from had-phase-4-r-parity on Apr 26, 2026

Conversation

@igerber (Owner) commented Apr 26, 2026

Summary

  • Adds a trends_lin: bool = False keyword-only kwarg to HeterogeneousAdoptionDiD.fit(aggregate="event_study"), joint_pretrends_test, and joint_homogeneity_test. Mirrors R DIDHAD::did_had(..., trends_lin=TRUE) (paper Eq 17 / Eq 18 / page 32 joint-Stute-with-trends); a usage sketch follows this list.
  • Adds an end-to-end R-package parity suite against DIDHAD v2.0.0 (Credible-Answers/did_had, SHA edc09197): 24 parity tests across 3 paper-derived synthetic DGPs (Uniform, Beta(2,2), Beta(0.5,1)) × 5 method combinations (overall, event-study, placebo, yatchew, trends_lin).
  • Pierce-Schott published-number replication (the original Phase 4 plan) is deferred indefinitely: the paper's primary analysis panel is LBD-restricted (Census FSRDC), and the public-deposit proxy panel has filtering ambiguity that prevents exact published-number parity. End-to-end R parity is a strictly stronger correctness signal; the R package is the methodology source of truth (the paper authors wrote it).
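A minimal usage sketch of the new surfaces. The class/function names and the keyword-only trends_lin flag come from this PR; the import paths, the `panel` object, and the joint wrappers' argument layout are assumptions:

```python
# Sketch only: import paths and the `panel` argument layout are assumptions;
# the class/function names and the trends_lin kwarg are from this PR.
from diff_diff import HeterogeneousAdoptionDiD
from diff_diff.had_pretests import joint_homogeneity_test, joint_pretrends_test

def run_trends_lin_surfaces(panel, model: HeterogeneousAdoptionDiD):
    es = model.fit(panel, aggregate="event_study", trends_lin=True)
    pre = joint_pretrends_test(panel, trends_lin=True)
    hom = joint_homogeneity_test(panel, trends_lin=True)
    # trends_lin=False (the default) is documented as bit-exact backcompat.
    return es, pre, hom
```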

Tolerances achieved

  • Point estimate / SE / CI bounds: atol=1e-8 (full pipeline through nprobust numerical paths).
  • Closed-form Yatchew T-stat: atol=1e-10 after a documented × G/(G-1) finite-sample convention shift.
  • 24/24 parity cells pass.

Two intentional convention deviations from R (documented in REGISTRY.md)

(a) Point estimate convention: we report the bias-corrected point estimate att = (mean(ΔY) - tau.bc) / mean(D) (modern CCF 2018 convention); R's Estimate column reports the conventional (mean(ΔY) - tau.us) / mean(D) with the bias-corrected CI separately. Our att matches R's CI midpoint; our se / conf_int_low / conf_int_high match R's se / ci_lo / ci_hi directly.

(b) Yatchew variance denominator: our yatchew_hr_test follows paper Appendix E literally with the (1/G) and (1/(2G)) population-variance denominators. R's YatchewTest::yatchew_test uses base-R var()'s (1/(N-1)) sample-variance convention. Ratio is exactly N/(N-1); both converge to the same asymptotic null distribution.
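A toy numeric sketch of both conventions; all inputs are stand-ins, with `tau_us` / `tau_bc` playing the role of nprobust's conventional and bias-corrected intercepts:

```python
import numpy as np

rng = np.random.default_rng(0)
G = 200
delta_y = rng.normal(1.0, 0.5, G)   # per-group outcome first differences
dose = rng.uniform(0.1, 1.0, G)     # per-group dose D
tau_us, tau_bc = 0.12, 0.10         # stand-ins for the two nprobust intercepts

# (a) Point-estimate convention.
att_ours = (delta_y.mean() - tau_bc) / dose.mean()  # bias-corrected (CCF 2018)
att_r = (delta_y.mean() - tau_us) / dose.mean()     # R's `Estimate` column

# (b) Yatchew variance denominator: 1/G (ours) vs 1/(G-1) (base-R var()).
y = rng.normal(size=G)
ratio = y.var(ddof=1) / y.var(ddof=0)
assert np.isclose(ratio, G / (G - 1))
```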

Feature gap (TODO follow-up)

yatchew_hr_test always fits Y ~ D (linearity null). R's YatchewTest(order=0) fits Y ~ 1 (intercept-only, mean-independence null), which is what DIDHAD uses on placebo rows ("non-parametric pre-trends test" per the README). Adding a null="mean_independence" mode to yatchew_hr_test would close the placebo Yatchew parity gap; tracked in TODO.md.

Test plan

  • 16 new direct trends_lin unit tests (tests/test_had_pretests.py::TestJointPretestsTrendsLin, ::TestHADFitTrendsLin): F<3 guard, naive-Python detrending baseline at atol=1e-12, survey_design × trends_lin NotImplementedError, get/set_params idempotence, consumed-placebo (e=-2) drop invariant.
  • 24 new R-parity tests (tests/test_did_had_parity.py::TestPointSEParity, ::TestYatchewParity, ::TestFixtureMetadata).
  • Bit-exact backcompat: trends_lin=False (default) preserves all 489 baseline test outputs unchanged.
  • Total: 529 tests pass, 0 regressions.
  • R generator runs end-to-end: Rscript benchmarks/R/generate_did_had_golden.R writes benchmarks/data/did_had_golden.json (117 KB committed).
  • black / ruff clean on edited files.

🤖 Generated with Claude Code

@github-actions

Overall Assessment

Blocker

Review note: this was a static review only. The container here does not have pytest or numpy, so I could not execute the added tests.

Executive Summary

  • joint_pretrends_test(trends_lin=True) does not implement the documented “consumed placebo” drop. When base_period-1 is the only placebo horizon, the code deterministically turns that horizon into all zeros and the joint Stute core returns a mechanical non-rejection.
  • The new trends_lin kwarg was not propagated to did_had_pretest_workflow(...), even though that workflow is the in-file wrapper over the updated joint-pretest surfaces.
  • The new R parity suite does not actually exercise Python’s aggregate="overall" path, despite the changelog/registry claiming overall-path parity coverage.
  • The survey-weighted trends_lin limitation and the placebo-Yatchew null mismatch are both documented in REGISTRY.md and tracked in TODO.md, so I did not count those as defects.

Methodology

  • P0 diff_diff/had_pretests.py L3575-L3625, diff_diff/had_pretests.py L3054-L3089, docs/methodology/REGISTRY.md L2501-L2502: joint_pretrends_test(trends_lin=True) keeps base_period-1 in pre_periods instead of dropping the consumed placebo that the registry/changelog say must be removed. Algebraically, for t = base_period-1, the new code sets dy_t = (Y_t - Y_base) - (t-base)*(Y_base - Y_{base-1}) = 0 for every unit, then centers that zero vector and feeds it to stute_joint_pretest, whose exact-linear short-circuit returns p_value=1.0. Impact: with only one earlier pre-period, the function silently reports a conclusive pass when there is actually no valid placebo horizon left after detrending; with multiple earlier pre-periods, the returned horizon metadata still includes a placebo the method says should have been consumed. Concrete fix: explicitly remove base_period-1 from the tested placebo set when trends_lin=True; if nothing remains, raise/skip explicitly instead of calling stute_joint_pretest.
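A standalone numpy check of the algebra above, showing the detrended residual at t = base_period-1 is identically zero (toy arrays; no library code involved):

```python
import numpy as np

rng = np.random.default_rng(1)
G = 50
y_base = rng.normal(size=G)   # Y at the base period (F-1)
y_prev = rng.normal(size=G)   # Y at base_period - 1 (F-2)
slope = y_base - y_prev       # per-group linear-trend slope

# dy_t = (Y_t - Y_base) - (t - base) * slope, evaluated at t = base - 1:
dy = (y_prev - y_base) - (-1) * slope
assert np.allclose(dy, 0.0)   # all-zero vector -> exact-linear short-circuit, p = 1.0
```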

Code Quality

  • P1 diff_diff/had_pretests.py L4145-L4160, diff_diff/had_pretests.py L4454-L4486: the new public trends_lin parameter was added to joint_pretrends_test and joint_homogeneity_test, but the wrapper did_had_pretest_workflow(...) neither accepts it nor forwards it. Impact: the library’s end-to-end event-study pretest workflow cannot run the new detrended methodology path at all, so the wrapper surface diverges from the underlying public methods. Concrete fix: add trends_lin: bool = False to did_had_pretest_workflow(...), forward it to both joint wrappers, and mirror the same guard/skip behavior there.

Performance

  • No findings in the modified code.

Maintainability

  • No additional findings beyond the wrapper-propagation issue above.

Tech Debt

  • P3 TODO.md L105-L106, docs/methodology/REGISTRY.md L2501-L2502: the unsupported trends_lin × survey_design combination and the placebo-Yatchew null mismatch are explicitly documented/tracked. Impact: no blocker for this PR under the project’s deferred-work policy. Concrete fix: none required in this PR.

Security

  • No findings.

Documentation/Tests

  • P2 tests/test_did_had_parity.py L129-L141, benchmarks/R/generate_did_had_golden.R L183-L193, CHANGELOG.md L12: the fixture/docs describe an “overall” parity combo, but the Python side routes overall_e1 through aggregate="event_study" and only compares the e=0 row. Impact: the new parity harness does not actually anchor the aggregate="overall" implementation, so the changelog/registry currently overstate the strength of the claimed end-to-end parity signal. Concrete fix: either make overall_e1 use the true two-period aggregate="overall" path, or rename the combo/docs so they no longer claim overall-path coverage.

Path to Approval

  1. In joint_pretrends_test(..., trends_lin=True), drop base_period-1 from the placebo set before residual construction; if the remaining set is empty, do not run stute_joint_pretest.
  2. Add regression tests for the all-consumed case (pre_periods=[base_period-1]) and for mixed cases where base_period-1 appears alongside earlier placebos, asserting that the consumed horizon is absent from the returned labels/stats.
  3. Add trends_lin to did_had_pretest_workflow(...) and forward it to both joint wrappers, with tests showing the workflow matches the direct surfaces on the detrended path.

igerber added a commit that referenced this pull request Apr 26, 2026
P0 (joint_pretrends_test consumed-placebo drop):
The detrending math collapses dy_t to mechanical zero at t = base-1
(the "consumed" placebo whose F-2 → F-1 evolution is used by the
slope estimator). Feeding that all-zero residual into
stute_joint_pretest tripped the exact-linear short-circuit and
returned a confidently-non-rejecting p_value=1.0 — a placebo that
was actually no placebo at all. Fix: compute base_minus_1_period
early (right after period_rank/base_rank), filter it out of
pre_periods before _aggregate_for_joint_test, emit UserWarning
when the filter fires, raise ValueError when nothing testable
remains. Mirrors HAD.fit's `e=-2` drop and R's "max placebo lag
reduces by 1 with the same effect" convention.

Regression tests (3):
  - consumed placebo dropped + warning emitted when pre_periods
    includes base_period - 1
  - all-consumed case (pre_periods=[base-1]) raises ValueError
  - no-consumed case (pre_periods excludes base-1) emits no warning

P1 (did_had_pretest_workflow trends_lin propagation):
Added trends_lin: bool = False kwarg to did_had_pretest_workflow,
forwarded to both joint_pretrends_test and joint_homogeneity_test
on the event-study path. Front-door rejects trends_lin=True with
aggregate="overall" via NotImplementedError (the overall path's
qug + stute + yatchew block has no joint-pretest surface).

Regression tests (2):
  - workflow(aggregate=event_study, trends_lin=True) bit-exactly
    matches direct joint_pretrends_test + joint_homogeneity_test
    calls
  - workflow(aggregate=overall, trends_lin=True) raises
    NotImplementedError

P2 (parity test exercises true overall path):
The R `overall_e1` combo was being routed through Python's
aggregate="event_study" with a single horizon. Now slices the
panel to two periods (F-1, F) and routes through
aggregate="overall" — exercising the actual two-period overall
code path that the changelog and registry claim parity for.
Result-extraction logic branches on combo_name to handle the
scalar HeterogeneousAdoptionDiDResults (overall) vs the array
HeterogeneousAdoptionDiDEventStudyResults (event_study).

Stats: 534 tests pass (529 prior + 5 new R1 regressions), 0
regressions. Parity tolerances unchanged: atol=1e-8 (point/SE/CI),
atol=1e-10 (Yatchew T-stat after N/(N-1) shift).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 26, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 29399f1241fab6f12e7ec2605c0ab26cd528bd88


Overall Assessment

⚠️ Needs changes

Static review only: this container does not have pytest or numpy, so I could not execute the added tests.

Executive Summary

  • The prior re-review items look addressed: the direct consumed-placebo bug is fixed in joint_pretrends_test, trends_lin is now forwarded through did_had_pretest_workflow, and the parity suite now exercises the true aggregate="overall" path.
  • [Newly identified] The composite workflow still mishandles the minimal trends_lin=True event-study case: when the only earlier placebo is the consumed one, it raises instead of skipping step 2.
  • The documented trends_lin × survey_design exclusion and the placebo-Yatchew null mismatch are properly recorded in docs/methodology/REGISTRY.md / TODO.md, so I did not count them as defects.
  • I did not find new NaN/inference anti-patterns in the touched estimator paths.
  • The new public trends_lin kwarg is not yet documented in the affected function docstrings.

Methodology

Affected methods: HeterogeneousAdoptionDiD.fit(..., aggregate="event_study"), joint_pretrends_test, joint_homogeneity_test, and did_had_pretest_workflow.

  • P1 [Newly identified] diff_diff/had_pretests.py:L3502-L3550, diff_diff/had_pretests.py:L4475-L4519, diff_diff/had_pretests.py:L4155-L4156, docs/methodology/REGISTRY.md:L2501-L2502: joint_pretrends_test(trends_lin=True) now correctly drops base_period-1 as the consumed placebo, but did_had_pretest_workflow(..., aggregate="event_study", trends_lin=True) still decides whether to run step 2 from earlier_pre = t_pre_list[:-1] before accounting for that drop. On a valid minimal panel with exactly two pre-periods total (for example periods 1, 2, 3, 4 with F=3, so base_period=2 and earlier_pre=[1]), the workflow calls joint_pretrends_test, which then removes period 1 and raises “no testable placebo horizons remain.”
    Impact: valid trends_lin workflow runs fail outright instead of producing pretrends_joint=None with the existing “joint pre-trends skipped” verdict path. This is the exact empty-result-set edge case the review checklist calls out.
    Concrete fix: in did_had_pretest_workflow, compute the effective earlier-placebo set after removing base_period-1 when trends_lin=True; if nothing remains, set pretrends_joint=None and continue to the homogeneity step instead of calling joint_pretrends_test.
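A sketch of the proposed guard, assuming `t_pre_list` is the validated observed pre-period list ending at the F-1 base (names follow the review text; placement inside the workflow is hypothetical):

```python
def effective_earlier_placebos(t_pre_list, trends_lin):
    earlier = list(t_pre_list[:-1])        # placebos before the F-1 base
    if trends_lin and earlier:
        consumed = t_pre_list[-2]          # base_period - 1, eaten by the slope
        earlier = [t for t in earlier if t != consumed]
    return earlier  # empty -> set pretrends_joint=None and continue to step 3
```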

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings beyond the workflow edge case above.

Tech Debt

  • P3 TODO.md:L104-L107, docs/methodology/REGISTRY.md:L2501-L2502: the weighted trends_lin gap, placebo-Yatchew mean-independence gap, and external Stute parity gap are explicitly tracked/documented.
    Impact: none for merge under the project’s mitigation policy.
    Concrete fix: none required in this PR.

Security

  • No findings.

Documentation/Tests

  • P3 diff_diff/had.py:L2830-L2906, diff_diff/had_pretests.py:L3388-L3423, diff_diff/had_pretests.py:L3778-L3803, diff_diff/had_pretests.py:L4234-L4260: the new public trends_lin kwarg is in the signatures, changelog, and registry, but not in the parameter docs for HeterogeneousAdoptionDiD.fit, joint_pretrends_test, joint_homogeneity_test, or did_had_pretest_workflow.
    Impact: generated API docs / IDE help will miss the new contract, especially the consumed-placebo behavior.
    Concrete fix: add trends_lin parameter docs to those four docstrings, including the consumed-placebo drop and the workflow’s effective “need one surviving earlier placebo” rule.

Path to Approval

  1. In did_had_pretest_workflow(..., aggregate="event_study", trends_lin=True), remove the consumed placebo from earlier_pre before deciding whether to call joint_pretrends_test; if the set becomes empty, return pretrends_joint=None instead of raising.
  2. Add a regression test on a minimal valid panel (for example periods [1, 2, 3, 4] with F=3) asserting that did_had_pretest_workflow(..., aggregate="event_study", trends_lin=True) succeeds, sets pretrends_joint=None, and still runs homogeneity_joint.

igerber added a commit that referenced this pull request Apr 26, 2026
P1 (workflow consumed-placebo handling):
did_had_pretest_workflow(aggregate="event_study", trends_lin=True)
computed `earlier_pre = t_pre_list[:-1]` and dispatched to
joint_pretrends_test BEFORE accounting for the consumed-placebo drop
that joint_pretrends_test applies internally. On a minimal valid
panel (4 periods, F=3, base=2, earlier_pre=[1]=base-1=consumed),
the workflow propagated joint_pretrends_test's "no testable placebo
horizons remain" ValueError instead of taking the existing "no
earlier placebo → pretrends_joint=None, skip step 2" verdict path.
Fix: pre-filter t_pre_list[-2] (= base_period-1) out of earlier_pre
when trends_lin=True. If empty, the existing `len(earlier_pre) >= 1`
guard sets pretrends_joint=None and step 3 (homogeneity) still runs.

Regression test: minimal 4-period F=3 panel, trends_lin=True, asserts
workflow succeeds, pretrends_joint is None, homogeneity_joint is
populated with finite p_value.

P3 (parameter docstrings):
Added trends_lin parameter documentation to the four affected
public surfaces:
  - HeterogeneousAdoptionDiD.fit
  - joint_pretrends_test
  - joint_homogeneity_test
  - did_had_pretest_workflow
Each docstring covers: the formula, R-package mirror, requirements
(F>=3 / base-1 in panel), the consumed-placebo behavior (auto-drop
+ UserWarning on direct surfaces; skip-step-2 escape on workflow),
the survey-weighting NotImplementedError contract, and bit-exact
default backcompat.

Stats: 535 tests pass (534 prior + 1 new R2 P1 regression), 0
regressions. Parity tolerances unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 26, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 1abd5b5dce154102888f9c1d595b22b1e8933875


Overall Assessment

⚠️ Needs changes

Static review only: this container lacks numpy, pandas, and pytest, so I could not execute the added tests/parity suite.

Executive Summary

  • Prior re-review items look addressed: the workflow now removes the consumed placebo before deciding whether step 2 runs, and there is a regression for the minimal F=3 case in diff_diff/had_pretests.py:L4534-L4581 and tests/test_had_pretests.py:L4667-L4717.
  • [Newly identified] The new direct trends_lin=True joint-wrapper surfaces still accept nonterminal base_periods, which silently changes the detrending away from the documented Eq. 17/18 F-1/F-2 construction.
  • [Newly identified] On ordered-categorical time axes with unused categories, the workflow/joint-wrapper trends_lin path can resolve base_period-1 to an unobserved level and then fail at the slope pivot.
  • I did not find new safe_inference / NaN-propagation anti-patterns in the changed estimator paths.
  • The survey-weighted trends_lin gap, placebo-Yatchew null mismatch, and Stute-family parity gap are explicitly tracked in TODO.md / REGISTRY.md, so I did not count them as blockers.

Methodology

  • P1 [Newly identified] diff_diff/had_pretests.py:L3598-L3612, diff_diff/had_pretests.py:L3644-L3673, diff_diff/had_pretests.py:L3934-L3948, diff_diff/had_pretests.py:L3984-L4025, docs/methodology/REGISTRY.md:L2501-L2502: when trends_lin=True, joint_pretrends_test and joint_homogeneity_test only require base_period to be some validated pre-period, but the documented method is the F-1/F-2 detrending used by Eq. 17/18 / DIDHAD::did_had(..., trends_lin=TRUE). A direct caller can pass base_period=F-2 (or earlier) and get a different slope/detrending rule with no warning. Impact: silent methodological deviation on the new public direct-wrapper surfaces; the workflow and HeterogeneousAdoptionDiD.fit(..., aggregate="event_study") are safe because they always use the terminal pre-period. Concrete fix: when trends_lin=True, require base_period == t_pre_list[-1] after validation in both wrappers, and add regressions showing a nonterminal pre-period base raises.
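A sketch of the requested anchor guard; the function name and placement are hypothetical, the rule itself is the one proposed above:

```python
def check_trends_lin_anchor(base_period, t_pre_list, trends_lin):
    # Eq 17/18 anchor the detrending at the last validated pre-period (F-1).
    if trends_lin and base_period != t_pre_list[-1]:
        raise ValueError(
            f"trends_lin=True requires the canonical F-1 anchor "
            f"({t_pre_list[-1]!r}); got base_period={base_period!r}"
        )
```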

Code Quality

  • P1 [Newly identified] diff_diff/had_pretests.py:L2441-L2461, diff_diff/had_pretests.py:L3496-L3559, diff_diff/had_pretests.py:L3990-L4010, diff_diff/had_pretests.py:L4534-L4549: the new trends_lin path resolves base_period-1 from the full ordered-category list, while the workflow’s consumed-placebo logic uses the observed validated pre-period list. With an ordered categorical time_col that has unused categories between observed periods, the workflow can drop one “consumed” placebo but the joint wrappers then look for a different, unobserved predecessor and fail when indexing wide_y[...]. Impact: valid ordered-categorical panels can error in production instead of cleanly using the previous observed period or raising the documented ValueError. Concrete fix: derive the predecessor from the validated observed pre-period sequence (t_pre_list) on both workflow and direct-wrapper paths, or at minimum verify the chosen predecessor exists in the pivot before indexing; add a regression with an unused ordered category between observed periods.
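A minimal pandas reproduction of the failure mode, using hypothetical period labels:

```python
import pandas as pd

levels = ["t1", "t2", "t_unused", "t3", "t4"]   # dtype carries an unused level
observed = pd.Categorical(["t1", "t2", "t3", "t4"],
                          categories=levels, ordered=True)  # observed panel periods
full_rank = {lvl: i for i, lvl in enumerate(levels)}

t_pre_list = ["t1", "t2", "t3"]                 # validated observed pre-periods
base = t_pre_list[-1]                           # "t3"

bad_prev = levels[full_rank[base] - 1]          # "t_unused": absent from the data,
                                                # so the slope pivot would KeyError
good_prev = t_pre_list[-2]                      # "t2": previous observed period
assert (bad_prev, good_prev) == ("t_unused", "t2")
```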

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • P3 (tracked in TODO/REGISTRY) TODO.md:L105-L107, docs/methodology/REGISTRY.md:L2501-L2502: the weighted trends_lin gap, placebo-Yatchew mean-independence gap, and Stute-family external parity gap are explicitly tracked/documented. Impact: none for merge under the repo’s mitigation policy. Concrete fix: none required in this PR.

Security

  • No findings.

Documentation/Tests

  • P3 benchmarks/R/generate_did_had_golden.R:L22-L29, tests/test_did_had_parity.py:L430-L436, CHANGELOG.md:L12-L12: the parity docs claim a benchmark anchored to DIDHAD v2.0.0 / the cited SHA, but the generator and pytest only enforce >= 2.0.0. Future regeneration can silently re-anchor the goldens to newer R code while still passing. Impact: reproducibility drift in the parity baseline, not current numerical correctness (the committed fixture metadata is still 2.0.0). Concrete fix: pin exact package versions in the generator/requirements and assert exact metadata in pytest, or install from the cited commit SHA.
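A sketch of the pytest-side exact-pin assertion; the fixture-metadata key layout is an assumption, the version and SHA are the ones cited in this PR:

```python
import json

def test_fixture_is_pinned(path="benchmarks/data/did_had_golden.json"):
    with open(path) as fh:
        meta = json.load(fh)["metadata"]   # key layout assumed
    # Exact equality, not >=; repeat for every upstream package whose
    # version is part of the parity contract.
    assert meta["DIDHAD"] == "2.0.0"       # cited release / SHA edc09197
```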

Path to Approval

  1. In joint_pretrends_test and joint_homogeneity_test, add a trends_lin=True guard that requires base_period to equal the last validated pre-period t_pre_list[-1], and add direct-wrapper regressions showing a nonterminal pre-period base raises.
  2. Make base_period-1 resolution use the previous observed period from the validated panel rather than the full ordered-category list, and add an ordered-categorical regression with an unused intermediate category proving the workflow/joint wrappers no longer fail on that input.

igerber added a commit that referenced this pull request Apr 26, 2026
P1 (non-terminal-base methodology guard):
joint_pretrends_test(trends_lin=True) and joint_homogeneity_test(
trends_lin=True) silently accepted base_period values other than the
last validated pre-period (= F-1). Paper Eq 17 / Eq 18 specifically
anchor the detrending at F-1 with slope Y[F-1] - Y[F-2]; non-
terminal anchors compute a different slope at a different anchor —
silently changing the methodology away from the documented Eq
17/18 / R DIDHAD::did_had(trends_lin=TRUE) construction.

Fix: under trends_lin=True, require base_period == t_pre_list[-1]
inside the validator block. The workflow and HAD.fit always pass
F-1, so they're unaffected. Direct callers passing a non-terminal
base get a clear ValueError pointing at t_pre_list[-1].

Regression tests (2):
  - joint_pretrends_test(base=F-2, trends_lin=True) raises
  - joint_homogeneity_test(base=F-2, trends_lin=True) raises

P1 (ordered-categorical observed-period base-1 lookup):
The previous base_period - 1 lookup walked period_rank (= the full
ordered-categorical level list) decrementing by rank. On panels
with unused intermediate categorical levels (e.g. dtype levels
[t1, t2, t_unused, t3, t4] with only t1..t4 observed), base=t3
under trends_lin would resolve base-1 to t_unused (rank 2), and
the slope-pivot lookup would KeyError because t_unused is not in
the data.

Fix: derive base_minus_1 from the validator's t_pre_list[-2] —
the validator builds t_pre_list from observed contiguous pre-
periods only, so it correctly skips unused categorical levels.
Both joint wrappers and the workflow now use the same observed-
period predecessor, eliminating the failure mode.

Restructure: moved the trends_lin consumed-placebo drop from
before the validator block into it (so t_pre_list is in scope).
Added an early `n_periods < 3` gate for trends_lin so 2-period
panels get a clean error instead of silently failing downstream.

Regression test:
  - joint_pretrends_test on an ordered-categorical panel with an
    unused intermediate level succeeds (no KeyError on slope pivot)

Stats: 538 tests pass (535 prior + 3 new R3 P1 regressions), 0
regressions. Parity tolerances unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 26, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: acbf699523593cf2f2bbb62c14eb63d56ba8cdb9


Overall Assessment

Blocker

Static review only: this container does not have numpy/pandas, so I could not execute the added tests.

Executive Summary

  • Prior re-review issue on nonterminal base_period looks fixed in both direct joint-wrapper surfaces.
  • [Newly identified] The ordered-categorical fix is incomplete: trends_lin=True now picks F-2 from the observed panel, but the detrending multiplier still uses the full categorical rank, so unused levels can silently change the joint-test statistic and p-value.
  • The event-study workflow inherits that same bug because it forwards into the same wrappers.
  • I did not find new inline inference / safe_inference anti-patterns in the changed estimator paths.
  • The trends_lin × survey_design, placebo-Yatchew null gap, and missing Stute-family parity gaps are properly tracked in TODO.md / REGISTRY.md.
  • The new R-parity fixture is still not exactly pinned to the cited DIDHAD release/commit, so future regeneration can drift.

Methodology

  • Severity: P0 [Newly identified]
    Impact: Affected methods: joint_pretrends_test, joint_homogeneity_test, and did_had_pretest_workflow(..., aggregate="event_study", trends_lin=True). After the fix, the slope uses the previous observed pre-period (t_pre_list[-2]), but the detrending multiplier still uses _build_period_rank() over the full categorical dtype. On an ordered categorical with unused intermediate levels, the same observed panel can therefore produce different detrending, cvm_stat_joint, and p_value depending only on whether an unused category is present. That is silent wrong statistical output, and it also breaks parity with the event-study fit path, which detrends on observed horizons only. References: diff_diff/had_pretests.py:L3588-L3617, diff_diff/had_pretests.py:L3672-L3701, diff_diff/had_pretests.py:L3985-L4008, diff_diff/had_pretests.py:L4044-L4078, diff_diff/had.py:L1793-L1901, diff_diff/had.py:L4006-L4054, docs/methodology/REGISTRY.md:L2501-L2502.
    Concrete fix: Build the detrending rank from the validated observed periods on the filtered panel (same chronology HAD.fit uses), not from the full categorical dtype; alternatively, reject unused intermediate levels and document that limitation. Add a regression that compares the same observed panel with and without unused levels and asserts identical joint statistics.
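The residual bug in miniature: the (t - base) multiplier must come from the observed-period order, not the full categorical dtype (toy ranks; labels hypothetical):

```python
levels = ["t1", "t2", "t_unused", "t3"]        # full categorical dtype
observed = ["t1", "t2", "t3"]                  # validated observed periods

dtype_rank = {lvl: i for i, lvl in enumerate(levels)}
observed_rank = {t: i for i, t in enumerate(observed)}

base, horizon = "t3", "t1"
delta_dtype = dtype_rank[horizon] - dtype_rank[base]           # -3 (inflated)
delta_observed = observed_rank[horizon] - observed_rank[base]  # -2 (matches HAD.fit)
assert (delta_dtype, delta_observed) == (-3, -2)
```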

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

  • Severity: P3 (tracked in TODO/REGISTRY)
    Impact: The trends_lin × survey_design gap, placebo yatchew_hr_test(null="mean_independence") gap, and missing Stute-family parity are explicitly tracked and are non-blocking under the repo policy. References: TODO.md:L105-L107, docs/methodology/REGISTRY.md:L2501-L2502.
    Concrete fix: None required in this PR.

Security

No findings.

Documentation/Tests

  • Severity: P3
    Impact: The parity harness still allows any DIDHAD >= 2.0.0; requirements.R installs whatever CRAN serves, the generator accepts >=, and pytest only checks >=. Future regeneration can therefore silently re-anchor benchmarks/data/did_had_golden.json to a newer upstream release while the docs/changelog still cite v2.0.0 / edc09197. References: CHANGELOG.md:L11-L12, benchmarks/R/generate_did_had_golden.R:L22-L29, benchmarks/R/requirements.R:L30-L45, tests/test_did_had_parity.py:L430-L436.
    Concrete fix: Pin DIDHAD to the cited SHA (or at minimum assert exact version equality in both the generator and pytest); do the same for any other upstream package whose version is part of the parity contract.
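A sketch of what "assert exact version equality" could look like on the Python side (fixture key names assumed; versions from the PR text, where DIDHAD == 2.0.0 and YatchewTest == 1.1.1 are later pinned):

```python
import json

def assert_exact_upstream_pins(path="benchmarks/data/did_had_golden.json"):
    meta = json.load(open(path))["metadata"]   # key layout assumed
    pins = {"DIDHAD": "2.0.0", "YatchewTest": "1.1.1"}
    for pkg, version in pins.items():
        assert meta[pkg] == version, f"{pkg}: {meta[pkg]} != pinned {version}"
```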

Path to Approval

  1. In both joint wrappers, compute the trends_lin detrending delta from the validated observed-period order on data_filtered, not from _build_period_rank(data, time_col).
  2. Add regression coverage for both directions of the bug:
    • pretrends: unused level between an earlier placebo and base_period;
    • homogeneity/workflow: unused level between base_period and a post period;
      and assert the statistics match the same panel with unused categories removed.

igerber added a commit that referenced this pull request Apr 26, 2026
P0 (observed-period detrending delta):
The R3 P1 fix made the slope LOOKUP use observed periods, but the
detrending DELTA (`t_rank - base_rank`) still pulled ranks from the
full categorical dtype via _build_period_rank. On panels with
unused intermediate categorical levels, the same observed data
produced different (t - base) multipliers and corrupted the
joint test statistic — silent wrong statistical output.

Fix: under trends_lin=True, build a fresh observed_rank dict from
sorted(set(data_filtered[time_col].unique())) and use it for both
the base and horizon ranks in the delta computation. Mirrors
HAD.fit's `_aggregate_multi_period_first_differences` convention
(`sorted(t_pre_list + t_post_list, ...)` for the event-time rank).
Both joint wrappers fixed; workflow inherits the fix automatically.

Regression tests (2):
  - joint_pretrends_test on (categorical with 2 unused levels)
    produces identical cvm_stat_joint and p_value to (categorical
    without unused levels) on the same observed data
  - joint_homogeneity_test twin invariant (unused level between
    base and post)

P3 (exact upstream-version pin):
The parity contract cited DIDHAD v2.0.0 / SHA edc09197 in
CHANGELOG, REGISTRY, and the parity test docstrings, but the
generator and test only enforced `>= 2.0.0`. Future regeneration
could silently re-anchor goldens to a newer release while docs
still cited the old version.

Fix: pin exactly DIDHAD == 2.0.0 and YatchewTest == 1.1.1 in both
the generator's stopifnot guards and the parity test's metadata
assertion. Document the bump procedure in the comments.

Stats: 540 tests pass (538 prior + 2 new R4 P0 regressions), 0
regressions. All 24 R-parity cells still green at atol=1e-8 /
1e-10. Note: existing categorical-level invariance tests added in
R3 still pass — they exercised correctness on simple unused-level
shifts; the R4 invariants are stricter, asserting bit-exact
identity of the joint statistics across categorical re-ordering of
the same observed panel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 26, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b7b7eb3f249f0e0929e9d03cbccf20a5e9ae7caa


Overall Assessment
✅ Looks good

Executive Summary

  • Previous blocker on ordered-categorical unused levels appears resolved: the joint wrappers now derive both the predecessor period and the detrending multiplier from validated observed periods, matching HAD.fit (diff_diff/had_pretests.py:L3598-L3617, diff_diff/had_pretests.py:L3696-L3718, diff_diff/had_pretests.py:L4002-L4025, diff_diff/had_pretests.py:L4091-L4107, diff_diff/had.py:L1793-L1901, diff_diff/had.py:L4006-L4054).
  • I did not find an unmitigated methodology deviation in the shipped trends_lin estimator/pretest paths.
  • The remaining methodology gaps are properly documented/tracked and are non-blocking under repo policy: trends_lin × survey_design, placebo Yatchew mean-independence mode, and missing Stute-family parity (docs/methodology/REGISTRY.md:L2501-L2502, TODO.md:L104-L107).
  • Non-blocking issues remain in the parity/repro harness: nprobust is still not exact-pinned even though the parity claim depends on it, and the Python parity test does not assert exact horizon shape/order (benchmarks/R/generate_did_had_golden.R:L22-L34, benchmarks/R/requirements.R:L16-L18, tests/test_did_had_parity.py:L113-L156, tests/test_did_had_parity.py:L168-L188, tests/test_did_had_parity.py:L224-L310, tests/test_did_had_parity.py:L430-L447).
  • Static review only: this container lacks pytest, numpy, and pandas, so I could not execute the new tests.

Methodology

  • Severity: P3 (tracked in REGISTRY.md / TODO.md). Impact: The remaining differences from R are documented rather than silent: bias-corrected point-estimate convention, Yatchew denominator convention, survey-weighted trends_lin deferral, placebo-Yatchew null gap, and missing Stute-family parity. Concrete fix: None required in this PR. References: docs/methodology/REGISTRY.md:L2501-L2504, TODO.md:L104-L107.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P3. Impact: The repro harness is not fully self-pinned. generate_did_had_golden.R/pytest now exact-pin DIDHAD and YatchewTest, but not nprobust, and benchmarks/R/requirements.R still installs whatever CRAN currently serves. Because the parity contract explicitly runs through nprobust numerical paths, a fresh regeneration can still drift or fail depending on upstream CRAN state. Concrete fix: exact-pin nprobust alongside the other upstreams and make benchmarks/R/requirements.R install the archived pinned versions. References: benchmarks/R/generate_did_had_golden.R:L22-L34, benchmarks/R/requirements.R:L16-L18, tests/test_did_had_parity.py:L430-L447.

Tech Debt

  • Severity: P3 (tracked in TODO.md). Impact: The PR correctly records deferred work for survey-weighted trends_lin, placebo Yatchew parity, and Stute-family parity, so these are non-blocking under the project policy. Concrete fix: None required in this PR. References: TODO.md:L105-L107.

Security

  • No findings.

Documentation/Tests

  • Severity: P3. Impact: did_had_pretest_workflow’s top-level docstring still says Eq. 18 linear-trend detrending is deferred, which now contradicts the shipped trends_lin behavior and the updated registry/changelog. Concrete fix: update diff_diff/had_pretests.py:L4325-L4327 to describe the shipped forwarding behavior.
  • Severity: P3. Impact: tests/test_did_had_parity.py compares mapped R rows to Python rows but does not assert exact event_times length/order, and _python_fit() ignores effects/placebo on the event-study path. That weakens the claimed “end-to-end parity” coverage for future horizon-shape regressions. Concrete fix: assert the exact mapped event_times sequence/length (and optionally QUG/bandwidth metadata) before value comparisons, or slice Python output explicitly to the requested effects/placebos. References: tests/test_did_had_parity.py:L113-L156, tests/test_did_had_parity.py:L168-L188, tests/test_did_had_parity.py:L224-L310.

igerber added a commit that referenced this pull request Apr 26, 2026
R5 was ✅ Looks good — only P3 polish remained. All addressed:

P3 #1 — exact-pin nprobust:
The parity contract runs through nprobust numerical paths
(DIDHAD's local-linear bandwidth + bias-correction calls), so a
fresh regeneration could drift if CRAN serves a newer nprobust.
Pin nprobust == 0.5.0 in both the R generator's stopifnot guard
and the parity test's metadata assertion alongside DIDHAD and
YatchewTest.

P3 #2 — workflow docstring:
did_had_pretest_workflow's top-level docstring still said "Eq 18
linear-trend detrending is a Phase 4 follow-up" which contradicts
the shipped trends_lin behavior. Updated to describe the
forwarding contract (trends_lin → joint_pretrends_test +
joint_homogeneity_test, consumed-placebo skip path on minimal
panels). Same fix on the StuteJointResult class docstring.

P3 #3 — parity test horizon-shape assertions:
Added an explicit "missing in Python" assertion in _zip_r_python:
every R-mapped event time must be present in Python's event_times
(catches future horizon-shape regressions where Python silently
drops a horizon R requested). Added an effects+placebo row-count
sanity check in test_yatchew_t_stat_parity (uses the previously-
unused effects/placebo parametrize values to catch fixture drift).

Stats: 540 tests pass, 0 regressions. No estimator/methodology
changes — all P3 polish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 26, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: eb27bf5914e9c747eeedeca2a68e595e3705d265


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • The new trends_lin paths follow the shipped methodology contract: event-study only, F >= 3, canonical F-1 anchor on the joint wrappers, consumed-placebo drop, and survey-weight rejection are all enforced in code (diff_diff/had.py:L2939-L3041, diff_diff/had.py:L4006-L4052, diff_diff/had_pretests.py:L3581-L3712, diff_diff/had_pretests.py:L4004-L4108, docs/methodology/REGISTRY.md:L2501-L2502).
  • The earlier ordered-categorical / unused-level risk is addressed. The joint wrappers now derive both the predecessor period and detrending rank from validated observed periods, and regression tests lock that invariant (diff_diff/had_pretests.py:L3600-L3619, diff_diff/had_pretests.py:L3698-L3712, diff_diff/had_pretests.py:L4025-L4108, tests/test_had_pretests.py:L4757-L4912).
  • The workflow surface/docs are now consistent with shipped behavior, including the minimal-panel “skip step 2, still run step 3” case (diff_diff/had_pretests.py:L4327-L4399, diff_diff/had_pretests.py:L4626-L4690, tests/test_had_pretests.py:L4667-L4717).
  • Severity P3: the repro installer is still not version-aware. benchmarks/R/requirements.R installs current CRAN heads and does nothing when the wrong version is already installed, while the generator/test now require exact versions; clean regeneration can still fail immediately (benchmarks/R/requirements.R:L30-L53, benchmarks/R/generate_did_had_golden.R:L28-L37, tests/test_did_had_parity.py:L460-L487).
  • Severity P3: the parity harness is stronger than before, but _python_fit() still ignores effects/placebo on the event-study path and _zip_r_python() only enforces subset inclusion, so horizon-selection regressions could still slip through while mapped-row comparisons pass (tests/test_did_had_parity.py:L113-L156, tests/test_did_had_parity.py:L168-L204).
  • Static review only: pytest is not installed in this container, so I could not execute the targeted test modules.

Methodology

  • No findings.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity P3. Impact: benchmarks/R/requirements.R does not install or correct to the exact DIDHAD / YatchewTest / nprobust versions that the generator and parity test now hard-pin, so the repro path is still brittle on a fresh or drifted R environment. Concrete fix: make benchmarks/R/requirements.R version-aware and install archived exact versions when the installed version is missing or mismatched. References: benchmarks/R/requirements.R:L30-L53, benchmarks/R/generate_did_had_golden.R:L28-L37, tests/test_did_had_parity.py:L460-L487.

Tech Debt

  • Severity P3 (informational, tracked). Impact: the remaining gaps — trends_lin × survey_design, placebo Yatchew mean-independence mode, and Stute-family parity — are already tracked and therefore non-blocking under repo policy. Concrete fix: None required in this PR. References: TODO.md:L105-L107, docs/methodology/REGISTRY.md:L2501-L2502.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the parity harness still does not constrain the full Python event-study horizon set to the requested R effects / placebo surface; it only checks that the R-mapped horizons exist in Python output. That weakens coverage for future horizon-selection regressions. Concrete fix: derive the expected horizon list from the R combo, slice Python arrays to that ordered set, and assert exact equality of the compared horizon sequence before value comparisons (a short sketch follows this section). References: tests/test_did_had_parity.py:L113-L156, tests/test_did_had_parity.py:L168-L204.
  • Execution note: pytest is unavailable in this container, so this review is based on static inspection only.
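A sketch of the exact-horizon check proposed above (function and variable names assumed):

```python
def assert_horizons_match(r_event_times, py_event_times):
    missing = [t for t in r_event_times if t not in py_event_times]
    extra = [t for t in py_event_times if t not in r_event_times]
    assert not missing, f"missing in Python: {missing}"
    assert not extra, f"extra in Python: {extra}"
    # Only after shape agreement should the atol=1e-8 value comparisons run.
```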

igerber added a commit that referenced this pull request Apr 26, 2026
R6 was ✅ Looks good — 2 P3 polish items.

P3 #1 — version-aware repro installer:
benchmarks/R/requirements.R installed whatever CRAN currently
served via install.packages, while the generator and parity test
hard-pin DIDHAD == 2.0.0 / YatchewTest == 1.1.1 / nprobust ==
0.5.0. A fresh R environment regenerating the goldens would have
the generator's stopifnot(packageVersion == "X.Y.Z") immediately
abort.

Fix: add `install_pinned_version()` helper using
remotes::install_version with `upgrade = "never"`, run it after
the bulk CRAN install for DIDHAD/YatchewTest/nprobust. Idempotent
when the correct version is already installed. Bump procedure
documented in lockstep with the generator + parity-test pins.

P3 #2 — exact-set parity event_times:
_zip_r_python() previously asserted only that R-mapped horizons
were a SUBSET of Python's event_times (missing-in-python check).
Tighten to FULL SET EQUALITY: also reject horizons present in
Python but absent from R's requested set ("extra_in_python"). This
catches future event_study horizon-selection regressions in both
directions — e.g. if our effects/placebo cap drifts and Python
emits an extra row R didn't request.

Stats: 540 tests pass, 0 regressions. Still no estimator changes
— all P3 polish on the parity / repro infrastructure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 26, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: aae3531915ad7d1ab2517bfb140e6788f1865d3d


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • Previous re-review items are addressed. The R installer now enforces exact upstream pins, and the parity harness now exercises the true two-period overall path and checks exact horizon-set equality instead of subset inclusion. References: benchmarks/R/requirements.R:L39-L98, tests/test_did_had_parity.py:L113-L156, tests/test_did_had_parity.py:L168-L211.
  • The new trends_lin implementation matches the shipped registry contract across all three methodology surfaces: event-study-only gate, F >= 3 / F-1 anchor enforcement, consumed-placebo drop, and survey-weight rejection are all implemented and covered by regression tests. References: diff_diff/had.py:L2939-L2968, diff_diff/had.py:L4006-L4054, diff_diff/had_pretests.py:L3521-L3642, diff_diff/had_pretests.py:L4004-L4108, docs/methodology/REGISTRY.md:L2501-L2502, tests/test_had_pretests.py:L4364-L4867, tests/test_had_pretests.py:L5013-L5039.
  • The earlier ordered-categorical / unused-level risk is fixed. The joint wrappers now derive both the predecessor period and the detrending delta from validated observed periods, with invariant tests on both pretrends and homogeneity. References: diff_diff/had_pretests.py:L3600-L3618, diff_diff/had_pretests.py:L3698-L3717, diff_diff/had_pretests.py:L4004-L4027, diff_diff/had_pretests.py:L4093-L4108, tests/test_had_pretests.py:L4757-L4867.
  • The remaining placebo-Yatchew parity gap and trends_lin × survey_design limitation are explicitly tracked in TODO.md, so they are non-blocking under repo policy. References: TODO.md:L105-L107, tests/test_did_had_parity.py:L360-L370, tests/test_did_had_parity.py:L438-L443.
  • Static review only. pytest is unavailable in this container, and a direct Python smoke test also failed because pandas is not installed here.

Methodology

  • Severity P3 (informational). Impact: the only R deviations surfaced by this PR are the bias-corrected point-estimate convention and the Yatchew G/(G-1) finite-sample convention shift; both are explicitly documented in the Methodology Registry and mirrored by the parity assertions, so they are non-defects under repo policy. Concrete fix: None required. References: docs/methodology/REGISTRY.md:L2501-L2502, tests/test_did_had_parity.py:L293-L333, tests/test_did_had_parity.py:L344-L460.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3 (informational, tracked). Impact: placebo Yatchew parity still intentionally skips R’s order=0 mean-independence null, and trends_lin × survey_design plus Stute-family external parity remain deferred, but all are now explicitly tracked in TODO.md, so they are non-blocking. Concrete fix: None required in this PR. References: TODO.md:L105-L107, docs/methodology/REGISTRY.md:L2501-L2502.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the new trends_lin parameter docs for the joint wrappers say the panel must contain base_period - 1, but the actual public contract is stricter: direct callers must also pass base_period == F-1 (the last validated pre-period). The implementation and tests are correct; the docstring just understates the contract. Concrete fix: update the trends_lin parameter docs on joint_pretrends_test and joint_homogeneity_test to state explicitly that base_period must be the canonical F-1 anchor. References: diff_diff/had_pretests.py:L3424-L3440, diff_diff/had_pretests.py:L3867-L3878, diff_diff/had_pretests.py:L3590-L3599, diff_diff/had_pretests.py:L4008-L4016.
  • Execution note: static review only; I could not run the targeted tests in this container.

igerber added a commit that referenced this pull request Apr 26, 2026
R7 was ✅ Looks good — only 1 P3 docstring nit.

The trends_lin parameter docstrings on joint_pretrends_test and
joint_homogeneity_test only documented the "base_period - 1 must
exist in panel" requirement, but the actual public contract is
stricter: direct callers must also pass `base_period == F-1` (the
last validated pre-period). The PR #392 R3 P1 guard enforces this
in code; the docstring just understated it.

Updated both docstrings to state explicitly that base_period must
be the canonical F-1 anchor, not just any pre-period — and that
non-terminal anchors raise ValueError per Eq 17 / R semantics.

Stats: 540 tests pass, 0 regressions. Doc-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 26, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: bbf75e7359df894b059fd2a5ac61ef0091a0e046


Overall Assessment

✅ Looks good - no unmitigated P0/P1 findings.

Executive Summary

  • Previous re-review items appear addressed: the R parity path now hard-pins upstream package versions, the fixture/test metadata enforce those pins, the parity harness now checks exact horizon-set equality, and the joint-wrapper docstrings now state the canonical F-1 anchor requirement. References: benchmarks/R/requirements.R:L39-L98, benchmarks/R/generate_did_had_golden.R:L28-L37, tests/test_did_had_parity.py:L168-L211, diff_diff/had_pretests.py:L3424-L3446, diff_diff/had_pretests.py:L3873-L3888.
  • The affected methodology surfaces are HeterogeneousAdoptionDiD.fit(..., aggregate="event_study"), joint_pretrends_test, and joint_homogeneity_test. The new trends_lin path matches the documented Eq. 17 / Eq. 18 contract on the key guards: event-study-only scope, F >= 3, canonical F-1 anchoring, consumed-placebo drop, observed-period predecessor lookup, and workflow forwarding. References: diff_diff/had.py:L2939-L2968, diff_diff/had.py:L4006-L4054, diff_diff/had_pretests.py:L3473-L3648, diff_diff/had_pretests.py:L3915-L4120, diff_diff/had_pretests.py:L4636-L4700, tests/test_had_pretests.py:L4418-L4717.
  • I did not find any undocumented estimator/math/variance deviations. The only numerical differences from R surfaced by this PR are the two deviations already documented in the Methodology Registry and reflected in the parity assertions. References: docs/methodology/REGISTRY.md:L2501-L2502, tests/test_did_had_parity.py:L293-L333, tests/test_did_had_parity.py:L344-L461.
  • One non-blocking note remains: the changelog/registry describe general HAD-to-R end-to-end parity, but the harness explicitly forces HeterogeneousAdoptionDiD(design="continuous_at_zero"), so it validates the Design 1' parity path rather than the library's default design="auto" surface. References: CHANGELOG.md:L12-L12, docs/methodology/REGISTRY.md:L2502-L2502, tests/test_did_had_parity.py:L119-L156, diff_diff/had.py:L2641-L2653.
  • Static review only. The changed Python files compile, but I could not run pytest here because the container is missing numpy/pandas.

Methodology

  • Severity P3 (informational). Impact: the PR's two intentional R deviations - bias-corrected point estimate vs. R's conventional Estimate, and the Yatchew G/(G-1) finite-sample convention shift - are explicitly documented in the registry and implemented in the parity assertions, so they are non-defects under the repo policy. Concrete fix: None. References: docs/methodology/REGISTRY.md:L2502-L2502, tests/test_did_had_parity.py:L293-L333, tests/test_did_had_parity.py:L344-L461.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3 (informational, tracked). Impact: placebo-row Yatchew parity remains intentionally out of scope because R uses YatchewTest::yatchew_test(order=0) (Y ~ 1) on placebo rows while Python only exposes the linearity-null Y ~ D surface. This limitation is already tracked in TODO.md, so it is non-blocking. Concrete fix: None required in this PR; the tracked follow-up is yatchew_hr_test(null="mean_independence"). References: TODO.md:L106-L106, tests/test_did_had_parity.py:L360-L370, tests/test_did_had_parity.py:L438-L443.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: the new changelog/registry wording overstates the scope of the R-parity claim. The actual harness explicitly forces HeterogeneousAdoptionDiD(design="continuous_at_zero") because design="auto" can legitimately choose a different surface, so the PR does not validate parity for the default public estimator configuration. Concrete fix: either narrow the changelog/registry wording to the design="continuous_at_zero" parity contract, or add separate coverage plus an explicit note for the expected design="auto" divergence. References: CHANGELOG.md:L12-L12, docs/methodology/REGISTRY.md:L2502-L2502, tests/test_did_had_parity.py:L119-L156, diff_diff/had.py:L2641-L2653.

Execution note: static review only; I could not run the targeted tests in this container, though compile(...) succeeded for the changed Python files.

igerber added a commit that referenced this pull request Apr 26, 2026
R8 was ✅ Looks good — only 1 P3 doc nit.

The CHANGELOG and REGISTRY parity claim said "HAD R-package end-
to-end parity" without qualifying that the harness explicitly
forces `HeterogeneousAdoptionDiD(design="continuous_at_zero")`. R
`did_had` always evaluates the local-linear at d=0 regardless of
dose distribution; our default `design="auto"` may legitimately
resolve to `continuous_near_d_lower` or `mass_point` on dose
distributions with boundary density bounded away from zero (e.g.
Beta(2,2) at G=200), in which case the WAS estimand evaluates at
a different point and diverges from R numerically. That
divergence is methodologically defensible — our auto-detect uses
more information when boundary mass is sparse — but it means the
parity test does NOT validate the default `design="auto"`
surface, only the Design 1' surface that R also uses.

Updated both wording surfaces to qualify "on the
`design='continuous_at_zero'` (Design 1') surface" and explain
the auto-detect divergence as out-of-scope-for-this-test (not a
defect).

Stats: 540 tests pass, 0 regressions. Doc-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber (Owner, Author) commented Apr 26, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 476cc3233065c513e1336377f8917a1a919e05f6


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • Only P3 informational items remain; both are tracked in TODO.md / docs/methodology/REGISTRY.md (see Tech Debt below).

Methodology

  • No findings.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity P3 (informational, tracked). Impact: trends_lin=True remains intentionally unsupported with survey weighting, but the PR fails fast with NotImplementedError on all three affected surfaces and tracks the follow-up in TODO.md, so there is no silent statistical error. Concrete fix: none required in this PR; only add support after deriving and validating a weighted slope estimator. References: diff_diff/had.py L2945-L2968, diff_diff/had_pretests.py L3473-L3486, diff_diff/had_pretests.py L3915-L3926, TODO.md L105-L105.
  • Severity P3 (informational, tracked). Impact: placebo-row Yatchew parity is still intentionally skipped because R’s placebo path uses YatchewTest(order=0) / Y ~ 1, while Python only exposes the linearity-null Y ~ D test. Concrete fix: none required in this PR; add a null="mean_independence" mode if placebo parity becomes a goal. References: tests/test_did_had_parity.py L360-L370, TODO.md L106-L106, docs/methodology/REGISTRY.md L2502-L2502.

Security

  • No findings.

Documentation/Tests

  • No findings.

igerber added the ready-for-ci label (Triggers CI test workflows) on Apr 26, 2026
igerber and others added 4 commits April 26, 2026 11:24
Adds `trends_lin: bool = False` keyword-only kwarg to
`HeterogeneousAdoptionDiD.fit(aggregate="event_study")`,
`joint_pretrends_test`, and `joint_homogeneity_test`. Mirrors R
`DIDHAD::did_had(..., trends_lin=TRUE)` (paper Eq 17 / Eq 18 / page 32
joint-Stute homogeneity-with-trends). Per-group linear-trend slope
estimated as `Y[g, F-1] - Y[g, F-2]`; applied uniformly as
`(t - base) × slope` adjustment that absorbs both the effect-side
detrending and the placebo-side anchor swap. Requires F ≥ 3 (panel
must contain F-2). The "consumed" placebo at our event-time `e=-2` is
auto-dropped (R reduces max placebo lag by 1 with the same effect).
Mutually exclusive with survey weighting; raises NotImplementedError.
Bit-exact backcompat for trends_lin=False (default) across all
existing surfaces.

Adds end-to-end R-package parity test vs `DIDHAD` v2.0.0
(Credible-Answers/did_had, SHA `edc09197`). New generator
`benchmarks/R/generate_did_had_golden.R` produces
`benchmarks/data/did_had_golden.json` covering 3 paper-derived
synthetic DGPs (Uniform, Beta(2,2), Beta(0.5,1)) × 5 method
combinations (overall, event-study, placebo, yatchew, trends_lin).
`tests/test_did_had_parity.py` asserts:
  - point estimate / SE / CI bounds at atol=1e-8
  - closed-form Yatchew T-stat at atol=1e-10 after a documented
    `× G/(G-1)` finite-sample convention shift
on all 24 (DGP × combo) cells.

Two intentional convention deviations from R, documented in
REGISTRY.md and the parity test docstring:
  (a) we report the bias-corrected point estimate (modern CCF 2018
      convention); R's `Estimate` reports the conventional estimate
      with the bias-corrected CI separately. Our `att` matches R's
      CI midpoint.
  (b) Yatchew uses paper Appendix E's literal (1/G) variance
      denominator; R uses base-R `var()`'s (1/(N-1)) sample-variance
      convention. Ratio is exactly N/(N-1); both converge to the same
      asymptotic null.
Yatchew on placebos with R's mean-independence null (`order=0`) is
not exposed in our `yatchew_hr_test` and is skipped in the parity
test; tracked as TODO follow-up.

Stats:
- 16 new direct trends_lin unit tests in test_had_pretests.py
- 24 new R-parity tests in test_did_had_parity.py
- 489 baseline + 16 + 24 = 529 tests pass, 0 regressions
- Generator script: ~280 LoC; fixture: 117 KB

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P0 (joint_pretrends_test consumed-placebo drop):
The detrending math collapses dy_t to mechanical zero at t = base-1
(the "consumed" placebo whose F-2 → F-1 evolution is used by the
slope estimator). Feeding that all-zero residual into
stute_joint_pretest tripped the exact-linear short-circuit and
returned a confidently-non-rejecting p_value=1.0 — a placebo that
was actually no placebo at all. Fix: compute base_minus_1_period
early (right after period_rank/base_rank), filter it out of
pre_periods before _aggregate_for_joint_test, emit UserWarning
when the filter fires, raise ValueError when nothing testable
remains. Mirrors HAD.fit's `e=-2` drop and R's "max placebo lag
reduces by 1 with the same effect" convention.

Regression tests (3):
  - consumed placebo dropped + warning emitted when pre_periods
    includes base_period - 1
  - all-consumed case (pre_periods=[base-1]) raises ValueError
  - no-consumed case (pre_periods excludes base-1) emits no warning

P1 (did_had_pretest_workflow trends_lin propagation):
Added trends_lin: bool = False kwarg to did_had_pretest_workflow,
forwarded to both joint_pretrends_test and joint_homogeneity_test
on the event-study path. Front-door rejects trends_lin=True with
aggregate="overall" via NotImplementedError (the overall path's
qug + stute + yatchew block has no joint-pretest surface).

Regression tests (2):
  - workflow(aggregate=event_study, trends_lin=True) bit-exactly
    matches direct joint_pretrends_test + joint_homogeneity_test
    calls
  - workflow(aggregate=overall, trends_lin=True) raises
    NotImplementedError

P2 (parity test exercises true overall path):
The R `overall_e1` combo was being routed through Python's
aggregate="event_study" with a single horizon. The parity test now
slices the panel to two periods (F-1, F) and routes through
aggregate="overall", exercising the actual two-period overall
code path that the changelog and registry claim parity for.
Result-extraction logic branches on combo_name to handle the
scalar HeterogeneousAdoptionDiDResults (overall) vs the array
HeterogeneousAdoptionDiDEventStudyResults (event_study).
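
The extraction branch, sketched with assumed result attributes
(`att` / `se` scalars on the overall results object, arrays on the
event-study one):

```python
import numpy as np

def extract_cell(res, combo_name: str, idx: int = 0):
    """Pull (att, se) for one parity cell; illustrative only."""
    if combo_name == "overall_e1":
        return float(res.att), float(res.se)          # scalar results object
    return float(np.asarray(res.att)[idx]), float(np.asarray(res.se)[idx])
```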

Stats: 534 tests pass (529 prior + 5 new R1 regressions), 0
regressions. Parity tolerances unchanged: atol=1e-8 (point/SE/CI),
atol=1e-10 (Yatchew T-stat after N/(N-1) shift).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P1 (workflow consumed-placebo handling):
did_had_pretest_workflow(aggregate="event_study", trends_lin=True)
computed `earlier_pre = t_pre_list[:-1]` and dispatched to
joint_pretrends_test BEFORE accounting for the consumed-placebo drop
that joint_pretrends_test applies internally. On a minimal valid
panel (4 periods, F=3, base=2, earlier_pre=[1]=base-1=consumed),
the workflow propagated joint_pretrends_test's "no testable placebo
horizons remain" ValueError instead of taking the existing "no
earlier placebo → pretrends_joint=None, skip step 2" verdict path.
Fix: pre-filter t_pre_list[-2] (= base_period-1) out of earlier_pre
when trends_lin=True. If empty, the existing `len(earlier_pre) >= 1`
guard sets pretrends_joint=None and step 3 (homogeneity) still runs.
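
A runnable sketch of the pre-filter on the minimal panel from the bug
report (names illustrative):

```python
def filter_earlier_pre(t_pre_list, trends_lin: bool):
    """Drop the consumed placebo before dispatching step 2 (illustrative)."""
    earlier_pre = t_pre_list[:-1]
    if trends_lin:
        earlier_pre = [t for t in earlier_pre if t != t_pre_list[-2]]
    return earlier_pre                    # empty -> pretrends_joint = None

assert filter_earlier_pre([1, 2], trends_lin=True) == []   # 4-period F=3 case
```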

Regression test: minimal 4-period F=3 panel, trends_lin=True, asserts
workflow succeeds, pretrends_joint is None, homogeneity_joint is
populated with finite p_value.

P3 (parameter docstrings):
Added trends_lin parameter documentation to the four affected
public surfaces:
  - HeterogeneousAdoptionDiD.fit
  - joint_pretrends_test
  - joint_homogeneity_test
  - did_had_pretest_workflow
Each docstring covers: the formula, R-package mirror, requirements
(F>=3 / base-1 in panel), the consumed-placebo behavior (auto-drop
+ UserWarning on direct surfaces; skip-step-2 escape on workflow),
the survey-weighting NotImplementedError contract, and bit-exact
default backcompat.

Stats: 535 tests pass (534 prior + 1 new R2 P1 regression), 0
regressions. Parity tolerances unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P1 (non-terminal-base methodology guard):
joint_pretrends_test(trends_lin=True) and joint_homogeneity_test(
trends_lin=True) silently accepted base_period values other than the
last validated pre-period (= F-1). Paper Eq 17 / Eq 18 specifically
anchor the detrending at F-1 with slope Y[F-1] - Y[F-2]; non-
terminal anchors compute a different slope at a different anchor —
silently changing the methodology away from the documented Eq
17/18 / R DIDHAD::did_had(trends_lin=TRUE) construction.

Fix: under trends_lin=True, require base_period == t_pre_list[-1]
inside the validator block. The workflow and HAD.fit always pass
F-1, so they're unaffected. Direct callers passing a non-terminal
base get a clear ValueError pointing at t_pre_list[-1].
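
Sketch of the guard, assuming the validator has already produced
`t_pre_list`:

```python
def check_terminal_base(base_period, t_pre_list, trends_lin: bool):
    """Reject non-terminal detrending anchors under trends_lin (illustrative)."""
    if trends_lin and base_period != t_pre_list[-1]:
        raise ValueError(
            f"trends_lin=True requires base_period == {t_pre_list[-1]!r} "
            f"(the last validated pre-period); got {base_period!r}"
        )
```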

Regression tests (2):
  - joint_pretrends_test(base=F-2, trends_lin=True) raises
  - joint_homogeneity_test(base=F-2, trends_lin=True) raises

P1 (ordered-categorical observed-period base-1 lookup):
The previous base_period - 1 lookup walked period_rank (= the full
ordered-categorical level list) decrementing by rank. On panels
with unused intermediate categorical levels (e.g. dtype levels
[t1, t2, t_unused, t3, t4] with only t1..t4 observed), base=t3
under trends_lin would resolve base-1 to t_unused (rank 2), and
the slope-pivot lookup would KeyError because t_unused is not in
the data.

Fix: derive base_minus_1 from the validator's t_pre_list[-2] —
the validator builds t_pre_list from observed contiguous pre-
periods only, so it correctly skips unused categorical levels.
Both joint wrappers and the workflow now use the same observed-
period predecessor, eliminating the failure mode.
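
The failure mode and the fix in miniature, using a pandas ordered
categorical with an unused intermediate level:

```python
import pandas as pd

levels = ["t1", "t2", "t_unused", "t3", "t4"]
observed_vals = pd.Categorical(
    ["t1", "t2", "t3", "t4"], categories=levels, ordered=True
)
rank = {lvl: i for i, lvl in enumerate(levels)}
assert levels[rank["t3"] - 1] == "t_unused"        # buggy dtype-rank lookup
observed = [lvl for lvl in levels if lvl in set(observed_vals)]
assert observed[observed.index("t3") - 1] == "t2"  # observed-period predecessor
```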

Restructure: moved the trends_lin consumed-placebo drop from
before the validator block into it (so t_pre_list is in scope).
Added an early `n_periods < 3` gate for trends_lin so 2-period
panels get a clean error instead of silently failing downstream.

Regression test:
  - joint_pretrends_test on an ordered-categorical panel with an
    unused intermediate level succeeds (no KeyError on slope pivot)

Stats: 538 tests pass (535 prior + 3 new R3 P1 regressions), 0
regressions. Parity tolerances unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
igerber and others added 5 commits April 26, 2026 11:24
P0 (observed-period detrending delta):
The R3 P1 fix made the slope LOOKUP use observed periods, but the
detrending DELTA (`t_rank - base_rank`) still pulled ranks from the
full categorical dtype via _build_period_rank. On panels with
unused intermediate categorical levels, the same observed data
produced different (t - base) multipliers and corrupted the
joint test statistic — silent wrong statistical output.

Fix: under trends_lin=True, build a fresh observed_rank dict from
sorted(set(data_filtered[time_col].unique())) and use it for both
the base and horizon ranks in the delta computation. Mirrors
HAD.fit's `_aggregate_multi_period_first_differences` convention
(`sorted(t_pre_list + t_post_list, ...)` for the event-time rank).
Both joint wrappers fixed; workflow inherits the fix automatically.
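
Sketch of the observed-rank delta (illustrative helper; the real code
builds the dict inline):

```python
import pandas as pd

def observed_delta(times: pd.Series, t, base):
    """(t - base) in observed-period ranks, immune to unused dtype levels."""
    rank = {p: i for i, p in enumerate(sorted(set(times.unique())))}
    return rank[t] - rank[base]
```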

Regression tests (2):
  - joint_pretrends_test on (categorical with 2 unused levels)
    produces identical cvm_stat_joint and p_value to (categorical
    without unused levels) on the same observed data
  - joint_homogeneity_test twin invariant (unused level between
    base and post)

P3 (exact upstream-version pin):
The parity contract cited DIDHAD v2.0.0 / SHA edc09197 in
CHANGELOG, REGISTRY, and the parity test docstrings, but the
generator and test only enforced `>= 2.0.0`. Future regeneration
could silently re-anchor goldens to a newer release while docs
still cited the old version.

Fix: pin exactly DIDHAD == 2.0.0 and YatchewTest == 1.1.1 in both
the generator's stopifnot guards and the parity test's metadata
assertion. Document the bump procedure in the comments.
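
On the Python side the metadata assertion might look like this; the
key names are assumptions, not the fixture's actual schema:

```python
import json

with open("benchmarks/data/did_had_golden.json") as f:
    meta = json.load(f)["metadata"]
assert meta["didhad_version"] == "2.0.0"        # hypothetical key names
assert meta["yatchewtest_version"] == "1.1.1"
```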

Stats: 540 tests pass (538 prior + 2 new R4 P0 regressions), 0
regressions. All 24 R-parity cells still green at atol=1e-8 /
1e-10. Note: existing categorical-level invariance tests added in
R3 still pass — they exercised correctness on simple unused-level
shifts; the R4 invariants are stricter, asserting bit-exact
identity of the joint statistics across categorical re-ordering of
the same observed panel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R5 was ✅ Looks good — only P3 polish remained. All addressed:

P3 #1 — exact-pin nprobust:
The parity contract runs through nprobust numerical paths
(DIDHAD's local-linear bandwidth + bias-correction calls), so a
fresh regeneration could drift if CRAN serves a newer nprobust.
Pin nprobust == 0.5.0 in both the R generator's stopifnot guard
and the parity test's metadata assertion alongside DIDHAD and
YatchewTest.

P3 #2 — workflow docstring:
did_had_pretest_workflow's top-level docstring still said "Eq 18
linear-trend detrending is a Phase 4 follow-up" which contradicts
the shipped trends_lin behavior. Updated to describe the
forwarding contract (trends_lin → joint_pretrends_test +
joint_homogeneity_test, consumed-placebo skip path on minimal
panels). Same fix on the StuteJointResult class docstring.

P3 #3 — parity test horizon-shape assertions:
Added an explicit "missing in Python" assertion in _zip_r_python:
every R-mapped event time must be present in Python's event_times
(catches future horizon-shape regressions where Python silently
drops a horizon R requested). Added an effects+placebo row-count
sanity check in test_yatchew_t_stat_parity (uses the previously-
unused effects/placebo parametrize values to catch fixture drift).

Stats: 540 tests pass, 0 regressions. No estimator/methodology
changes — all P3 polish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R6 was ✅ Looks good — 2 P3 polish items.

P3 #1 — version-aware repro installer:
benchmarks/R/requirements.R installed whatever CRAN currently
served via install.packages, while the generator and parity test
hard-pin DIDHAD == 2.0.0 / YatchewTest == 1.1.1 / nprobust ==
0.5.0. A fresh R environment regenerating the goldens would trip
the generator's stopifnot(packageVersion == "X.Y.Z") guard and
abort immediately.

Fix: add `install_pinned_version()` helper using
remotes::install_version with `upgrade = "never"`, run it after
the bulk CRAN install for DIDHAD/YatchewTest/nprobust. Idempotent
when the correct version is already installed. Bump procedure
documented in lockstep with the generator + parity-test pins.

P3 #2 — exact-set parity event_times:
_zip_r_python() previously asserted only that R-mapped horizons
were a SUBSET of Python's event_times (missing-in-python check).
Tighten to FULL SET EQUALITY: also reject horizons present in
Python but absent from R's requested set ("extra_in_python"). This
catches future event_study horizon-selection regressions in both
directions — e.g. if our effects/placebo cap drifts and Python
emits an extra row R didn't request.
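
A sketch of the tightened check (illustrative; mirrors
`_zip_r_python`'s contract, not its code):

```python
def check_event_times(r_times, py_times):
    """Exact-set horizon parity in both directions."""
    missing_in_python = set(r_times) - set(py_times)
    extra_in_python = set(py_times) - set(r_times)
    if missing_in_python or extra_in_python:
        raise AssertionError(
            f"missing_in_python={missing_in_python}, "
            f"extra_in_python={extra_in_python}"
        )
```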

Stats: 540 tests pass, 0 regressions. Still no estimator changes
— all P3 polish on the parity / repro infrastructure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R7 was ✅ Looks good — only 1 P3 docstring nit.

The trends_lin parameter docstrings on joint_pretrends_test and
joint_homogeneity_test only documented the "base_period - 1 must
exist in panel" requirement, but the actual public contract is
stricter: direct callers must also pass `base_period == F-1` (the
last validated pre-period). The PR #392 R3 P1 guard enforces this
in code; the docstring just understated it.

Updated both docstrings to state explicitly that base_period must
be the canonical F-1 anchor, not just any pre-period — and that
non-terminal anchors raise ValueError per Eq 17 / R semantics.

Stats: 540 tests pass, 0 regressions. Doc-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R8 was ✅ Looks good — only 1 P3 doc nit.

The CHANGELOG and REGISTRY parity claim said "HAD R-package end-
to-end parity" without qualifying that the harness explicitly
forces `HeterogeneousAdoptionDiD(design="continuous_at_zero")`. R
`did_had` always evaluates the local-linear at d=0 regardless of
dose distribution; our default `design="auto"` may legitimately
resolve to `continuous_near_d_lower` or `mass_point` on dose
distributions with boundary density bounded away from zero (e.g.
Beta(2,2) at G=200), in which case the WAS estimand evaluates at
a different point and diverges from R numerically. That
divergence is methodologically defensible — our auto-detect uses
more information when boundary mass is sparse — but it means the
parity test does NOT validate the default `design="auto"`
surface, only the Design 1' surface that R also uses.

Updated both wording surfaces to qualify "on the
`design='continuous_at_zero'` (Design 1') surface" and explain
the auto-detect divergence as out-of-scope-for-this-test (not a
defect).

Stats: 540 tests pass, 0 regressions. Doc-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the had-phase-4-r-parity branch from 476cc32 to fc436a5 Compare April 26, 2026 15:24
@igerber
Owner Author

igerber commented Apr 26, 2026

/ai-review

@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: fc436a57fe26cadd3689ac1228643daa4c470a07


Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

  • Previous re-review items appear addressed: parity scope is now explicitly limited to the design="continuous_at_zero" surface, and the R generator / installer / metadata test all exact-pin DIDHAD, YatchewTest, and nprobust (CHANGELOG.md:L11-L12, benchmarks/R/generate_did_had_golden.R:L28-L37, benchmarks/R/requirements.R:L39-L70, tests/test_did_had_parity.py:L467-L494).
  • The affected methodology surfaces are HeterogeneousAdoptionDiD.fit(..., aggregate="event_study"), joint_pretrends_test, and joint_homogeneity_test. DIDHAD’s official docs define trends_lin as using the F-2 to F-1 evolution to estimate a group-specific linear trend and reducing feasible placebos by 1; the implementation matches that contract. diff_diff/had.py:L2906-L2968, diff_diff/had.py:L4006-L4054, diff_diff/had_pretests.py:L3587-L3726, diff_diff/had_pretests.py:L4014-L4119, docs/methodology/REGISTRY.md:L2501-L2502. (search.r-project.org)
  • I did not find an undocumented estimator / math / variance deviation in the changed code.
  • Parameter propagation is complete on the visible wrappers: fit() forwards trends_lin into _fit_event_study, and did_had_pretest_workflow() forwards it into both joint wrappers while preserving the “skip step 2” path when the consumed placebo leaves no earlier placebo. diff_diff/had.py:L3030-L3042, diff_diff/had_pretests.py:L4483-L4700.
  • Remaining gaps are documented and tracked rather than silent bugs: trends_lin × survey_design is fail-fast, and placebo-row Yatchew parity is intentionally out of scope because the upstream placebo null differs from Python’s current yatchew_hr_test surface. TODO.md:L105-L107, tests/test_did_had_parity.py:L360-L460. (github.com)
  • Review was static-only: this container lacks numpy, pandas, and pytest, so I could not execute the suite; I only verified the changed Python files compile with compile().

Methodology

  • No findings. The new trends_lin path is consistent with the documented DIDHAD / registry contract on the load-bearing pieces: F-1 anchoring, F >= 3 / presence of F-2, consumed-placebo drop, non-terminal-base rejection on direct joint-test calls, and explicit rejection of the unsupported weighted variant. diff_diff/had.py:L2906-L2968, diff_diff/had.py:L4006-L4054, diff_diff/had_pretests.py:L3587-L3726, diff_diff/had_pretests.py:L4014-L4119, docs/methodology/REGISTRY.md:L2501-L2502. (search.r-project.org)

Code Quality

  • No findings. No new inline inference anti-patterns were introduced; the event-study path still routes pointwise inference through safe_inference, and the new kwarg is forwarded instead of being silently ignored. diff_diff/had.py:L4441-L4449, diff_diff/had_pretests.py:L4666-L4700.

Performance

  • No findings. The estimator path reuses dy_dict[-2] as the slope source and avoids an extra pivot; the joint-wrapper pivots are small relative to the existing joint-Stute bootstrap cost. diff_diff/had.py:L4020-L4049, diff_diff/had_pretests.py:L3684-L3726, diff_diff/had_pretests.py:L4079-L4119.

Maintainability

  • No findings. The prior parity-drift risk is mitigated by exact version pinning in the R generator, the R requirements installer, and the fixture metadata test. benchmarks/R/generate_did_had_golden.R:L28-L37, benchmarks/R/requirements.R:L39-L70, tests/test_did_had_parity.py:L467-L494.

Tech Debt

  • Severity P3 (informational, tracked). Impact: trends_lin=True remains intentionally unsupported with survey weighting, but every affected surface fails fast with NotImplementedError, so the PR does not introduce a silent weighted-methodology deviation. Concrete fix: none required in this PR; derive and validate a weighted slope estimator before enabling the combination. diff_diff/had.py:L2945-L2968, diff_diff/had_pretests.py:L3473-L3486, diff_diff/had_pretests.py:L3915-L3926, TODO.md:L105.
  • Severity P3 (informational, tracked). Impact: placebo-row Yatchew parity is still intentionally skipped because upstream DIDHAD defines a different placebo null (mean-independence) than Python’s current yatchew_hr_test (linearity). Concrete fix: none required in this PR; add null="mean_independence" if placebo parity becomes a goal. tests/test_did_had_parity.py:L360-L460, TODO.md:L106. (github.com)

Security

  • No findings.

Documentation/Tests

  • No findings. The changelog, registry, and TODO entries are aligned on the new trends_lin contract and on the limited parity scope, and the new tests cover consumed-placebo handling, non-terminal-base rejection, ordered-categorical invariance, workflow skip behavior, and version-pin drift. CHANGELOG.md:L11-L12, docs/methodology/REGISTRY.md:L2501-L2502, TODO.md:L104-L107, tests/test_had_pretests.py:L4248-L5103, tests/test_did_had_parity.py:L113-L156, tests/test_did_had_parity.py:L360-L494.

The HeterogeneousAdoptionDiD example snippet in choosing_estimator.rst
failed the Pure Python Fallback CI job (test_doc_snippets.py
::test_doc_snippet[choosing_estimator:block7]) due to three latent
drift bugs from PR #389 (docs-rtd-audit):

1. Missing aggregate='event_study' on both did_had_pretest_workflow
   and HAD.fit calls — default aggregate='overall' requires exactly 2
   periods, but the doc-snippet test framework's namespace `data`
   (built via generate_staggered_data) has 10 periods.

2. Used the namespace's generic `data` variable, which has nonzero
   dose in every period (rng.choice from {0.0, 0.5, 1.0, 2.0}). HAD
   requires D=0 for all units in at least one pre-period.

3. `print(f"Estimate: {results.att:.3f}")` formatted att as a scalar,
   but under aggregate='event_study' results.att is a numpy array.

Fix: rewrite the snippet to construct its own HAD-shape panel inline
(mirrors how block6 handles ContinuousDiD with its own data
generator); thread aggregate='event_study' through both calls;
iterate the per-horizon att array for output.
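
Bug 3 in isolation (made-up numbers; `results.att` /
`results.event_times` stand behind `att` / `event_times` here):

```python
import numpy as np

att = np.array([0.41, 0.38, 0.52])       # per-horizon estimates (made up)
event_times = [0, 1, 2]
# print(f"Estimate: {att:.3f}")          # TypeError: no scalar format for an array
for e, a in zip(event_times, att):       # the fixed per-horizon iteration
    print(f"e={e}: {a:.3f}")
```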

Pre-existing on origin/main; surfaced on this PR's CI re-run after
the rebase. Other failing snippets (troubleshooting:block18, :block20,
r_comparison:block6, :block7) are also pre-existing on main but are
out of scope for this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber merged commit b560c80 into main Apr 26, 2026
21 of 22 checks passed
@igerber igerber deleted the had-phase-4-r-parity branch April 26, 2026 18:14
igerber added a commit that referenced this pull request Apr 26, 2026
Mirrors R YatchewTest::yatchew_test(order=0). Closes the placebo
Yatchew R-parity gap from PR #392.

- New keyword-only `null: Literal["linearity", "mean_independence"]`
  on `yatchew_hr_test` (default `"linearity"` is bit-exact backcompat).
- `"mean_independence"` fits intercept-only OLS (residuals = dy - mean(dy));
  the downstream sigma2_diff / sigma2_W / sort-by-d machinery is shared.
- Wired through both unweighted and survey-weighted code paths
  (4-arm dispatch on (null × weighted)).
- `YatchewTestResults` gained `null_form: str = "linearity"` field;
  `summary()` renders the correct null-hypothesis title; `__repr__`
  and `to_dict()` updated.
- `tests/test_did_had_parity.py::TestYatchewParity` removed the
  placebo skip; routes effect rows through `null="linearity"` (R order=1)
  and placebo rows through `null="mean_independence"` (R order=0); both
  modes share the documented `× G/(G-1)` finite-sample convention shift
  and parity holds at `atol=1e-10`.
- New `TestYatchewHRTestMeanIndependence` class (15 tests) covering
  happy path, naive Python baseline at `atol=1e-12`, population-variance
  closed form, invalid value, default backcompat, mode-agnostic
  tie/constant-d rejection, NaN handling, weighted reduction at
  w=ones(G) at `atol=1e-14`, weighted non-uniform baseline,
  default-under-weights, survey×null orthogonality, the (linearity,
  weighted) baseline (4-arm coverage), zero/replicate-weight rejection,
  and G<3 mode-agnostic. One additive backcompat case in each of
  `TestYatchewHRTest` and `TestYatchewHRTestSurvey`.
- REGISTRY.md HAD § Yatchew note: TODO marker replaced with shipped
  description. CHANGELOG.md and TODO.md updated.
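
A minimal sketch of the residual step the two nulls differ on, with
the downstream machinery shared as the bullets above describe; this is
not the package's implementation:

```python
import numpy as np

def residuals(dy: np.ndarray, d: np.ndarray, null: str = "linearity") -> np.ndarray:
    """First-stage residuals feeding the shared sigma2 machinery (sketch)."""
    if null == "mean_independence":
        return dy - dy.mean()                        # intercept-only OLS (R order=0)
    if null == "linearity":
        X = np.column_stack([np.ones_like(d), d])    # Y ~ D (R order=1)
        beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
        return dy - X @ beta
    raise ValueError(f"unknown null: {null!r}")
```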

Patch-level (additive keyword-only kwarg + additive dataclass field
with default).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>