
Fix SDPR initialization and seed handling#484

Closed
hsun3163 wants to merge 3 commits into StatFunGen:main from hsun3163:fix/sdpr-legacy-init-seed

Conversation

Contributor

@hsun3163 hsun3163 commented Apr 25, 2026

Summary

This PR fixes the SDPR initialization and reproducibility issue in pecotmr by exposing the SDPR initialization mode and passing deterministic seed handling through the R API into the C++ sampler.

The default is now init = "legacy_random", which restores original SDPR-style random cluster initialization. The current/null initialization path is not removed; it remains available explicitly as init = "null".

What is reverted and why

The current pecotmr SDPR path initialized all SNPs in the null cluster:

cls_assgn.assign(num_snp, 0);

That behavior was introduced as a defensive implementation choice for the rewritten sampler, because original SDPR-style random initialization can make the first sample_beta() step allocate a large dense matrix when many SNPs start as non-null. However, the OTTERS regression experiments showed that this current/null initialization is not equivalent to the original SDPR behavior and can be unstable as a prediction model when run without controlled seed handling.

This PR therefore reverts the current/null initialization as the default OTTERS-compatible behavior. It does not delete that implementation; it keeps it as an explicit opt-in mode with init = "null" for debugging or intentional use.

The option name legacy_random is new in pecotmr. It means original SDPR-style random initialization, not an upstream SDPR API name.
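The two modes can be sketched as follows. This is an illustrative Python mirror of the behavior described above, not the pecotmr C++ implementation; the function name and signature are assumptions for exposition.

```python
import random

def init_cluster_assignments(num_snp, num_cluster, mode="legacy_random", seed=None):
    """Sketch of the two SDPR initialization modes.

    'null' mirrors cls_assgn.assign(num_snp, 0): every SNP starts in the
    null cluster.  'legacy_random' mirrors original SDPR-style behavior:
    each SNP is assigned to a uniformly random cluster, deterministically
    when a seed is supplied.
    """
    if mode == "null":
        return [0] * num_snp
    if mode == "legacy_random":
        rng = random.Random(seed)  # local RNG so the result depends only on seed
        return [rng.randrange(num_cluster) for _ in range(num_snp)]
    raise ValueError(f"unknown init mode: {mode}")
```

With a fixed seed, `legacy_random` is fully reproducible; without one, each run starts from a different cluster assignment, which is where the seed handling in this PR matters.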

Prediction-scale validation rationale

For each comparison, beta1 and beta2 are two SDPR weight vectors being compared, for example two repeated pecotmr runs or Old OTTERS weights versus new pecotmr weights. I used the LD-weighted prediction correlation:

beta1' R beta2 / sqrt((beta1' R beta1) * (beta2' R beta2))

where R is the same LD matrix used by SDPR. This measures whether the two weight vectors produce the same genetically predicted expression under the reference LD.
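As a minimal sketch (an illustrative helper, not part of the pecotmr API), the metric can be computed directly from the formula above:

```python
import numpy as np

def ld_prediction_correlation(beta1, beta2, R):
    """beta1' R beta2 / sqrt((beta1' R beta1) * (beta2' R beta2)).

    R is the reference LD matrix; beta1 and beta2 are 1-D weight vectors
    aligned to the rows/columns of R.
    """
    num = beta1 @ R @ beta2
    den = np.sqrt((beta1 @ R @ beta1) * (beta2 @ R @ beta2))
    return num / den
```

With R equal to the identity, this reduces to the cosine similarity of the raw weight vectors; with a real LD matrix it compares the genetically predicted expressions the two weight vectors induce.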

Before the fix, the current/null SDPR path was not reproducible as a prediction model:

  • pecotmr current/null init, unseeded run A vs run B:
    • raw beta Pearson = 0.035
    • LD prediction correlation = 0.318

More precisely, the current/null path was not a stable prediction model across random states. It was deterministic when the exact same seed was reused, but changing the seed changed the fitted prediction substantially:

  • pecotmr current/null init, seed0 vs seed1:
    • raw beta Pearson = 0.026
    • LD prediction correlation = 0.308
    • opposite signs = 871 / 3404

The current/null initialization made SDPR highly sensitive to the random seed, to the point that two valid seeded runs on the same input produced different predicted expressions.

After switching to deterministic original-SDPR-style initialization, the SDPR output becomes stable on the prediction scale:

  • pecotmr original-SDPR-style init, seed0 vs seed1:
    • raw beta Pearson = 0.774
    • LD prediction correlation = 0.988

Exact per-SNP beta matching, in contrast, is not achievable. Even after increasing the old SDPR run to 10,000 iterations, the beta vectors did not fully converge:

  • Old SDPR seed0 vs seed1, 10k iterations:
    • raw beta Pearson = 0.714
    • LD prediction correlation = 0.977

This likely reflects the fact that, under LD, correlated SNPs can exchange weights. Therefore beta Pearson can remain modest even when the predicted expression is nearly identical.
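A toy illustration of this weight-exchange effect, with hypothetical numbers: when two SNPs are in near-perfect LD, either one can carry the weight, so the raw beta Pearson is low (here negative) while the LD-weighted prediction correlation stays near 1.

```python
import numpy as np

# Two SNPs in near-perfect LD (r = 0.99) plus two independent null SNPs.
# beta1 puts all the weight on SNP 1, beta2 puts it on SNP 2.
R = np.eye(4)
R[0, 1] = R[1, 0] = 0.99
beta1 = np.array([1.0, 0.0, 0.0, 0.0])
beta2 = np.array([0.0, 1.0, 0.0, 0.0])

raw_pearson = np.corrcoef(beta1, beta2)[0, 1]  # -1/3: per-SNP weights disagree
pred_corr = (beta1 @ R @ beta2) / np.sqrt(
    (beta1 @ R @ beta1) * (beta2 @ R @ beta2))  # 0.99: predictions nearly identical
```

The same per-SNP disagreement that drives raw beta Pearson down is invisible on the prediction scale, because the LD matrix maps both weight vectors to almost the same predicted expression.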

After the fix, pecotmr also agrees well with Old OTTERS on the prediction scale:

  • Old OTTERS vs pecotmr original-SDPR-style init, seed0:
    • raw beta Pearson = 0.773
    • LD prediction correlation = 0.993
  • Old OTTERS vs pecotmr original-SDPR-style init, seed1:
    • raw beta Pearson = 0.783
    • LD prediction correlation = 0.984

What this PR changes

  • Adds init = c("legacy_random", "null") to sdpr().
  • Makes init = "legacy_random" the default to restore original SDPR-style random initialization.
  • Keeps init = "null" as an explicit option instead of silently using it by default.
  • Passes seed and initialization mode into the C++ sampler.
  • Regenerates cpp11 registration and man/sdpr.Rd.
  • Adds tests for invalid init, fixed-seed reproducibility with n_threads = 1, explicit null initialization, and sdpr_weights() argument forwarding.

Interpretation

  1. The current/null pecotmr SDPR path had a real prediction-level reproducibility problem.
  2. Restoring original SDPR-style random initialization and deterministic seed handling fixes the prediction-level instability.
  3. Exact per-SNP beta matching is not achievable across seeds, but prediction-scale agreement is high.
  4. Therefore the validation target should be LD-weighted prediction correlation, not exact per-SNP beta equality.

@hsun3163
Contributor Author

PRS-CS follow-up finding

As a follow-up, I ran the same seed-controlled prediction-scale check for PRS-CS on the two OTTERS regression fixtures used for lassosum/SDPR evidence: fixture 161 and fixture 206.

The PRS-CS experiment is useful because it separates three targets that can otherwise be confused:

  1. deterministic replay within the same method and same seed,
  2. seed sensitivity within the same method,
  3. old Python versus pecotmr implementation compatibility.

For PRS-CS, exact beta identity is achievable when the method path and seed are identical:

  • same-method, same-seed A vs B reruns:
    • beta Pearson = 1
    • LD prediction correlation = 1
    • exact/allclose match = true

However, changing only the seed within the same PRS-CS method changes the per-SNP beta allocation:

  • same-method seed0 vs seed1:
    • beta Pearson = 0.326-0.507
    • LD prediction correlation = 0.9499-0.9647

Comparing old Python PRS-CS to pecotmr PRS-CS with the same frozen inputs and the same seed also gives low raw beta correlation but high prediction-scale agreement:

  • old Python vs pecotmr, same seed:
    • beta Pearson = 0.3899-0.4114
    • LD prediction correlation = 0.9560-0.9644

This supports the same validation principle used for SDPR: raw per-SNP beta Pearson can be too strict for stochastic shrinkage methods under LD. Exact beta equality is a valid
determinism test for repeated same-seed replay, but it is not the right cross-seed or cross-implementation validation target. For PRS-CS, the useful target is fixed-seed determinism
within each implementation plus LD-weighted prediction correlation between implementations.

This PR does not change PRS-CS behavior. The PRS-CS follow-up is included only to clarify the validation standard: prediction-scale agreement is the practical model-level metric,
while exact per-SNP beta parity should be reserved for deterministic replay tests.
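The two-part validation standard described above can be sketched as a single check. The function name and the 0.95 threshold are illustrative assumptions, not values fixed by this PR.

```python
import numpy as np

def passes_validation(beta_rerun_a, beta_rerun_b, beta_old, beta_new, R,
                      pred_threshold=0.95):
    """Sketch of the two-part validation standard:

    (1) exact equality for same-seed, same-implementation reruns
        (deterministic replay), and
    (2) LD-weighted prediction correlation across implementations,
        compared against a threshold rather than exact beta parity.
    """
    same_seed_deterministic = np.allclose(beta_rerun_a, beta_rerun_b)
    pred_corr = (beta_old @ R @ beta_new) / np.sqrt(
        (beta_old @ R @ beta_old) * (beta_new @ R @ beta_new))
    return same_seed_deterministic and pred_corr >= pred_threshold
```

This separates the replay target (which must be exact) from the cross-implementation target (which only needs prediction-scale agreement), matching the distinction drawn for both SDPR and PRS-CS.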

@hsun3163
Contributor Author

This PR addressed a testing artifact from an earlier version and is therefore no longer needed.

@hsun3163 hsun3163 closed this Apr 30, 2026