
Fix SDPR initialization and seed handling#484

Closed
hsun3163 wants to merge 3 commits into StatFunGen:main from hsun3163:fix/sdpr-legacy-init-seed

Conversation

Contributor

@hsun3163 hsun3163 commented Apr 25, 2026

Summary

This PR fixes the SDPR initialization and reproducibility issue in pecotmr by exposing the SDPR initialization mode and passing deterministic seed handling through the R API into the C++ sampler.

The default is now init = "legacy_random", which restores original SDPR-style random cluster initialization. The current/null initialization path is not removed; it remains available explicitly as init = "null".

What is reverted and why

The current pecotmr SDPR path initialized all SNPs in the null cluster:

cls_assgn.assign(num_snp, 0);

That behavior was introduced as a defensive implementation choice for the rewritten sampler, because original SDPR-style random initialization can make the first sample_beta() step allocate a large dense matrix when many SNPs start as non-null. However, the OTTERS regression experiments showed that this current/null initialization is not equivalent to the original SDPR behavior and can be unstable as a prediction model when run without controlled seed handling.

This PR therefore reverts the current/null initialization as the default OTTERS-compatible behavior. It does not delete that implementation; it keeps it as an explicit opt-in mode with init = "null" for debugging or intentional use.

The option name legacy_random is new in pecotmr. It means original SDPR-style random initialization, not an upstream SDPR API name.
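The two modes can be sketched as follows. This is an illustrative Python mirror of the behavior described above, not the pecotmr C++ implementation; the function name and signature are assumptions for exposition.

```python
import random

def init_cluster_assignments(num_snp, num_cluster, mode="legacy_random", seed=None):
    """Sketch of the two SDPR initialization modes.

    'null' mirrors cls_assgn.assign(num_snp, 0): every SNP starts in the
    null cluster.  'legacy_random' mirrors original SDPR-style behavior:
    each SNP is assigned to a uniformly random cluster, deterministically
    when a seed is supplied.
    """
    if mode == "null":
        return [0] * num_snp
    if mode == "legacy_random":
        rng = random.Random(seed)  # local RNG so the result depends only on seed
        return [rng.randrange(num_cluster) for _ in range(num_snp)]
    raise ValueError(f"unknown init mode: {mode}")
```

With a fixed seed, `legacy_random` is fully reproducible; without one, each run starts from a different cluster assignment, which is where the seed handling in this PR matters.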

Prediction-scale validation rationale

For each comparison, beta1 and beta2 are two SDPR weight vectors being compared, for example two repeated pecotmr runs or Old OTTERS weights versus new pecotmr weights. I used the LD-weighted prediction correlation:

beta1' R beta2 / sqrt((beta1' R beta1) * (beta2' R beta2))

where R is the same LD matrix used by SDPR. This measures whether the two weight vectors produce the same genetically predicted expression under the reference LD.
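As a minimal sketch (an illustrative helper, not part of the pecotmr API), the metric can be computed directly from the formula above:

```python
import numpy as np

def ld_prediction_correlation(beta1, beta2, R):
    """beta1' R beta2 / sqrt((beta1' R beta1) * (beta2' R beta2)).

    R is the reference LD matrix; beta1 and beta2 are 1-D weight vectors
    aligned to the rows/columns of R.
    """
    num = beta1 @ R @ beta2
    den = np.sqrt((beta1 @ R @ beta1) * (beta2 @ R @ beta2))
    return num / den
```

With R equal to the identity, this reduces to the cosine similarity of the raw weight vectors; with a real LD matrix it compares the genetically predicted expressions the two weight vectors induce.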

Before the fix, the current/null SDPR path was not reproducible as a prediction model:

  • pecotmr current/null init, unseeded run A vs run B:
    • raw beta Pearson = 0.035
    • LD prediction correlation = 0.318

More precisely, the current/null path was not a stable prediction model across random states. It was deterministic when the exact same seed was reused, but changing the seed changed the fitted prediction substantially:

  • pecotmr current/null init, seed0 vs seed1:
    • raw beta Pearson = 0.026
    • LD prediction correlation = 0.308
    • opposite signs = 871 / 3404

The current/null initialization made SDPR highly sensitive to the random seed, to the point that two valid seeded runs on the same input produced different predicted expressions.

After switching to deterministic original-SDPR-style initialization, the SDPR output becomes stable on the prediction scale:

  • pecotmr original-SDPR-style init, seed0 vs seed1:
    • raw beta Pearson = 0.774
    • LD prediction correlation = 0.988

Exact per-SNP beta matching, in contrast, is not achievable. Even after increasing the old SDPR run to 10,000 iterations, the beta vectors did not fully converge:

  • Old SDPR seed0 vs seed1, 10k iterations:
    • raw beta Pearson = 0.714
    • LD prediction correlation = 0.977

This likely reflects the fact that, under LD, correlated SNPs can exchange weights. Therefore beta Pearson can remain modest even when the predicted expression is nearly identical.
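A toy illustration of this weight-exchange effect, with hypothetical numbers: when two SNPs are in near-perfect LD, either one can carry the weight, so the raw beta Pearson is low (here negative) while the LD-weighted prediction correlation stays near 1.

```python
import numpy as np

# Two SNPs in near-perfect LD (r = 0.99) plus two independent null SNPs.
# beta1 puts all the weight on SNP 1, beta2 puts it on SNP 2.
R = np.eye(4)
R[0, 1] = R[1, 0] = 0.99
beta1 = np.array([1.0, 0.0, 0.0, 0.0])
beta2 = np.array([0.0, 1.0, 0.0, 0.0])

raw_pearson = np.corrcoef(beta1, beta2)[0, 1]  # -1/3: per-SNP weights disagree
pred_corr = (beta1 @ R @ beta2) / np.sqrt(
    (beta1 @ R @ beta1) * (beta2 @ R @ beta2))  # 0.99: predictions nearly identical
```

The same per-SNP disagreement that drives raw beta Pearson down is invisible on the prediction scale, because the LD matrix maps both weight vectors to almost the same predicted expression.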

After the fix, pecotmr also agrees well with Old OTTERS on the prediction scale:

  • Old OTTERS vs pecotmr original-SDPR-style init, seed0:
    • raw beta Pearson = 0.773
    • LD prediction correlation = 0.993
  • Old OTTERS vs pecotmr original-SDPR-style init, seed1:
    • raw beta Pearson = 0.783
    • LD prediction correlation = 0.984

What this PR changes

  • Adds init = c("legacy_random", "null") to sdpr().
  • Makes init = "legacy_random" the default to restore original SDPR-style random initialization.
  • Keeps init = "null" as an explicit option instead of silently using it by default.
  • Passes seed and initialization mode into the C++ sampler.
  • Regenerates cpp11 registration and man/sdpr.Rd.
  • Adds tests for invalid init, fixed-seed reproducibility with n_threads = 1, explicit null initialization, and sdpr_weights() argument forwarding.

Interpretation

  1. The current/null pecotmr SDPR path had a real prediction-level reproducibility problem.
  2. Restoring original SDPR-style random initialization and deterministic seed handling fixes the prediction-level instability.
  3. Exact per-SNP beta matching is not achievable across seeds, but prediction-scale agreement is high.
  4. Therefore the validation target should be LD-weighted prediction correlation, not exact per-SNP beta equality.

@hsun3163
Contributor Author

PRS-CS follow-up finding

As a follow-up, I ran the same seed-controlled prediction-scale check for PRS-CS on the two OTTERS regression fixtures used for lassosum/SDPR evidence: fixture 161 and fixture 206.

The PRS-CS experiment is useful because it separates three targets that can otherwise be confused:

  1. deterministic replay within the same method and same seed,
  2. seed sensitivity within the same method,
  3. old Python versus pecotmr implementation compatibility.

For PRS-CS, exact beta identity is achievable when the method path and seed are identical:

  • same-method, same-seed A vs B reruns:
    • beta Pearson = 1
    • LD prediction correlation = 1
    • exact/allclose match = true

However, changing only the seed within the same PRS-CS method changes the per-SNP beta allocation:

  • same-method seed0 vs seed1:
    • beta Pearson = 0.326-0.507
    • LD prediction correlation = 0.9499-0.9647

Comparing old Python PRS-CS to pecotmr PRS-CS with the same frozen inputs and the same seed also gives low raw beta correlation but high prediction-scale agreement:

  • old Python vs pecotmr, same seed:
    • beta Pearson = 0.3899-0.4114
    • LD prediction correlation = 0.9560-0.9644

This supports the same validation principle used for SDPR: raw per-SNP beta Pearson can be too strict for stochastic shrinkage methods under LD. Exact beta equality is a valid
determinism test for repeated same-seed replay, but it is not the right cross-seed or cross-implementation validation target. For PRS-CS, the useful target is fixed-seed determinism
within each implementation plus LD-weighted prediction correlation between implementations.

This PR does not change PRS-CS behavior. The PRS-CS follow-up is included only to clarify the validation standard: prediction-scale agreement is the practical model-level metric,
while exact per-SNP beta parity should be reserved for deterministic replay tests.
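The two-part validation standard described above can be sketched as a single check. The function name and the 0.95 threshold are illustrative assumptions, not values fixed by this PR.

```python
import numpy as np

def passes_validation(beta_rerun_a, beta_rerun_b, beta_old, beta_new, R,
                      pred_threshold=0.95):
    """Sketch of the two-part validation standard:

    (1) exact equality for same-seed, same-implementation reruns
        (deterministic replay), and
    (2) LD-weighted prediction correlation across implementations,
        compared against a threshold rather than exact beta parity.
    """
    same_seed_deterministic = np.allclose(beta_rerun_a, beta_rerun_b)
    pred_corr = (beta_old @ R @ beta_new) / np.sqrt(
        (beta_old @ R @ beta_old) * (beta_new @ R @ beta_new))
    return same_seed_deterministic and pred_corr >= pred_threshold
```

This separates the replay target (which must be exact) from the cross-implementation target (which only needs prediction-scale agreement), matching the distinction drawn for both SDPR and PRS-CS.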

@hsun3163
Contributor Author

This PR addressed a testing artifact from an earlier version and is therefore no longer needed.

@hsun3163 hsun3163 closed this Apr 30, 2026