Skip to content

learnslowly/bsBSLMM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

bsBSLMM — Block-Sparse BSLMM for cis-expression prediction

A hierarchical Bayesian sparse linear mixed model for predicting gene expression from cis-window genotype. bsBSLMM extends Zhou et al.'s BSLMM (2013) with two architectural additions matched to the structure of cis-regulatory genetic variation:

  1. Block-level spike-and-slab gate $z_b\sim\mathrm{Bernoulli}(\pi_b)$ on top of the per-SNP gate $\gamma_j$ — when $z_b=0$ the entire LD-coherent block is deterministically zeroed. This is block sparsity in the sparse-group-lasso sense (Yuan & Lin 2006), made Bayesian.
  2. Annotation-informed per-SNP prior $\pi_{s,j}=\sigma(\alpha+\beta,a_j)$ where $a_j$ is the TSS-distance feature. The hyper-pair $(\alpha,\beta)$ is learned by Metropolis-within-Gibbs at every sweep — no precomputed annotation regression.

Highlights

  • Genome-wide cis-prediction yield + per-gene-quality lift: bsBSLMM passes more genes than BSLMM at matched MCMC budgets and beats BSLMM on paired Wilcoxon head-to-head — significant on 21 of 22 autosomes individually (chr18 borderline at p = 0.068), with overall one-sided paired Wilcoxon p = 1.1×10⁻¹⁸¹ over the 5,584-gene shared set (mean Δ test r = +0.024).
  • Calibrated posterior: per-SNP credible intervals (95% CI excludes 0) deliver an active-SNP set that is 3.04× enriched in ENCODE GM12878 DNase peaks (vs BSLMM 1.40×; 95% CIs disjoint).
  • Downstream TWAS application: a SPrediXcan-style TWAS against the de Lange 2017 IBD GWAS finds canonical IBD susceptibility loci (IL23R, PTPN2, ATG16L1, CARD9, the chr17q21 ORMDL3-GSDMB-IKZF3 cluster) with signal distributed across 15 chromosomes.

Installation

git clone https://github.com/learnslowly/bsBSLMM.git
cd bsBSLMM

# Create the modelling env
conda create -n bsbslmm_env -c conda-forge -c bioconda --yes \
    python=3.10 numpy=1.24.3 scipy scikit-learn numba joblib \
    pandas matplotlib seaborn statsmodels cyvcf2 tabix bcftools

# Optional: BSLMM baseline (GEMMA)
conda create -n gemma_env -c bioconda --yes gemma=0.98.5 python=3.10 numpy

# Optional: lbatch — recommended Slurm queue throttle for parallel submission
conda activate bsbslmm_env
pip install lbatch

See PIPELINE.md §0 for full environment + data setup (GEUVADIS expression, 1000G EUR genotype, GENCODE GTF, annotation downloads).

Optional: native C++ kernel

The per-iteration block-Gibbs sweep has an optional C++ implementation (pybind11 extension) alongside the default Numba-JIT path. Both implement identical statistical semantics, consume identical RNG draws per iteration, and produce posteriors that agree to one ULP of float64 at the production schedule. On chr20 the C++ kernel runs ~1.9× faster per gene (30.8 vs 55.6 s under single-thread BLAS). Single-threaded SIMD; per-gene parallelism is via the SLURM array jobs.

Build

CMake + vendored pybind11. Requirements: C++17 compiler (g++ ≥ 7 or clang ≥ 5), CMake ≥ 3.15, the project conda env (for Python headers).

bash scripts/build_native.sh           # wraps cmake in a one-shot sbatch
bash scripts/build_native.sh --local   # builds in-place

The first invocation clones pybind11 (~5 MB) into extern/pybind11/. After the build, bsbslmm/_native_mcmc.cpython-<py>-<arch>.so is in place next to the Python files. Both the .so and extern/pybind11/ are gitignored — each developer builds locally on their target machine.

>>> from bsbslmm._native import HAS_NATIVE
>>> HAS_NATIVE
True

Enable at runtime

bsbslmm_gibbs(..., use_native=True)

or set BSBSLMM_USE_NATIVE=1 in the environment so existing call sites pick it up unchanged:

export BSBSLMM_USE_NATIVE=1
sbatch job/bsbslmm_array.batch

If the extension is not built, use_native=True raises RuntimeError (no silent fallback). Equivalence between paths is tested by tests/test_native_smoke.py and tests/test_native_sweep_equivalence.py.

Configuration (one-time)

Everything cluster- and environment-specific lives in config/cluster.env: Slurm partition/account, filesystem locations (PROJECT_ROOT, DATA_DIR, CONDA_ROOT, TIGAR_ROOT, METAXCAN_DIR), and conda env names (CONDA_ENV, CONDA_ENV_GEMMA, CONDA_ENV_METAXCAN). Edit it once on a fresh clone:

$EDITOR config/cluster.env       # set partition, account, paths, env names
source config/cluster.env        # or rely on bin/sbatch_with_creds.sh

See PIPELINE.md §0 for the full list and per-var descriptions.

Quick start

Once data prep has run (see PIPELINE.md §1), fit bsBSLMM on chr20:

GENE_LIST=data/real_G/gene_list_chr20.txt
lbatch --array=1-100 --export=ALL,GENE_LIST=$GENE_LIST,GENES_PER_TASK=30 \
    job/bsbslmm_array.batch

# After it finishes, aggregate + finalize:
lbatch --export=ALL,CHR=20,PREFIX=bsbslmm \
    job/bsbslmm_finalize_chr.batch

For the full reproduction recipe (panel benchmark, ablation, biological insights, TWAS), follow PIPELINE.md section by section.

Repository layout

bsbslmm/                     Python package — model + Numba kernel
scripts/                     Per-step runner scripts
job/                         Slurm batch scripts
config/                      Cluster credentials (edit cluster.env once)
data/                        Per-gene cache (gitignored; populated by prepare_real_G.py)
results/                     Per-method fits + analysis outputs (gitignored)
PIPELINE.md                  End-to-end reproduction guide
LICENSE                      MIT

The Python package exposes:

from bsbslmm.bsbslmm_mcmc            import bsbslmm_gibbs,            predict_bsbslmm
from bsbslmm.bsbslmm_block_only_mcmc import bsbslmm_block_only_gibbs, predict_bsbslmm_block_only
from bsbslmm.bslmm_mcmc              import bslmm_gibbs,              predict_bslmm
from bsbslmm.blocks                  import compute_ld_matrix, partition_blocks

Citation

If you use bsBSLMM in your work, please cite:

@misc{huang2026annotationinformedblocksparsebayesianmodeling,
      title={Annotation-Informed Block-Sparse Bayesian Modeling for cis-Expression Prediction}, 
      author={Lei Huang and Hui Shen and Kuan-Jui Su and Chuan Qiu and Martha Isabel Gonzalez-Ramirez and Anqi Liu and Zhe Luo and Yun Gong and Yipu Zhang and Dawei Li and Chaoyang Zhang and Hong-Wen Deng},
      year={2026},
      eprint={2606.00483},
      archivePrefix={arXiv},
      primaryClass={q-bio.GN},
      url={https://arxiv.org/abs/2606.00483}, 
}

License

MIT. See LICENSE.

About

bsBSLMM: a block-sparse extension of BSLMM with annotation-informed per-SNP priors for cis-expression prediction

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors