bsBSLMM — Block-Sparse BSLMM for cis-expression prediction

A hierarchical Bayesian sparse linear mixed model for predicting gene expression from cis-window genotype. bsBSLMM extends Zhou et al.'s BSLMM (2013) with two architectural additions matched to the structure of cis-regulatory genetic variation:

Block-level spike-and-slab gate $z_b\sim\mathrm{Bernoulli}(\pi_b)$ on top of the per-SNP gate $\gamma_j$ — when $z_b=0$ the entire LD-coherent block is deterministically zeroed. This is block sparsity in the sparse-group-lasso sense (Yuan & Lin 2006), made Bayesian.
Annotation-informed per-SNP prior $\pi_{s,j}=\sigma(\alpha+\beta,a_j)$ where $a_j$ is the TSS-distance feature. The hyper-pair $(\alpha,\beta)$ is learned by Metropolis-within-Gibbs at every sweep — no precomputed annotation regression.

Highlights

Genome-wide cis-prediction yield + per-gene-quality lift: bsBSLMM passes more genes than BSLMM at matched MCMC budgets and beats BSLMM on paired Wilcoxon head-to-head — significant on 21 of 22 autosomes individually (chr18 borderline at p = 0.068), with overall one-sided paired Wilcoxon p = 1.1×10⁻¹⁸¹ over the 5,584-gene shared set (mean Δ test r = +0.024).
Calibrated posterior: per-SNP credible intervals (95% CI excludes 0) deliver an active-SNP set that is 3.04× enriched in ENCODE GM12878 DNase peaks (vs BSLMM 1.40×; 95% CIs disjoint).
Downstream TWAS application: a SPrediXcan-style TWAS against the de Lange 2017 IBD GWAS finds canonical IBD susceptibility loci (IL23R, PTPN2, ATG16L1, CARD9, the chr17q21 ORMDL3-GSDMB-IKZF3 cluster) with signal distributed across 15 chromosomes.

Installation

git clone https://github.com/learnslowly/bsBSLMM.git
cd bsBSLMM

# Create the modelling env
conda create -n bsbslmm_env -c conda-forge -c bioconda --yes \
    python=3.10 numpy=1.24.3 scipy scikit-learn numba joblib \
    pandas matplotlib seaborn statsmodels cyvcf2 tabix bcftools

# Optional: BSLMM baseline (GEMMA)
conda create -n gemma_env -c bioconda --yes gemma=0.98.5 python=3.10 numpy

# Optional: lbatch — recommended Slurm queue throttle for parallel submission
conda activate bsbslmm_env
pip install lbatch

See PIPELINE.md §0 for full environment + data setup (GEUVADIS expression, 1000G EUR genotype, GENCODE GTF, annotation downloads).

Optional: native C++ kernel

The per-iteration block-Gibbs sweep has an optional C++ implementation (pybind11 extension) alongside the default Numba-JIT path. Both implement identical statistical semantics, consume identical RNG draws per iteration, and produce posteriors that agree to one ULP of float64 at the production schedule. On chr20 the C++ kernel runs ~1.9× faster per gene (30.8 vs 55.6 s under single-thread BLAS). Single-threaded SIMD; per-gene parallelism is via the SLURM array jobs.

Build

CMake + vendored pybind11. Requirements: C++17 compiler (g++ ≥ 7 or clang ≥ 5), CMake ≥ 3.15, the project conda env (for Python headers).

bash scripts/build_native.sh           # wraps cmake in a one-shot sbatch
bash scripts/build_native.sh --local   # builds in-place

The first invocation clones pybind11 (~5 MB) into extern/pybind11/. After the build, bsbslmm/_native_mcmc.cpython-<py>-<arch>.so is in place next to the Python files. Both the .so and extern/pybind11/ are gitignored — each developer builds locally on their target machine.

>>> from bsbslmm._native import HAS_NATIVE
>>> HAS_NATIVE
True

Enable at runtime

bsbslmm_gibbs(..., use_native=True)

or set BSBSLMM_USE_NATIVE=1 in the environment so existing call sites pick it up unchanged:

export BSBSLMM_USE_NATIVE=1
sbatch job/bsbslmm_array.batch

If the extension is not built, use_native=True raises RuntimeError (no silent fallback). Equivalence between paths is tested by tests/test_native_smoke.py and tests/test_native_sweep_equivalence.py.

Configuration (one-time)

Everything cluster- and environment-specific lives in config/cluster.env: Slurm partition/account, filesystem locations (PROJECT_ROOT, DATA_DIR, CONDA_ROOT, TIGAR_ROOT, METAXCAN_DIR), and conda env names (CONDA_ENV, CONDA_ENV_GEMMA, CONDA_ENV_METAXCAN). Edit it once on a fresh clone:

$EDITOR config/cluster.env       # set partition, account, paths, env names
source config/cluster.env        # or rely on bin/sbatch_with_creds.sh

See PIPELINE.md §0 for the full list and per-var descriptions.

Quick start

Once data prep has run (see PIPELINE.md §1), fit bsBSLMM on chr20:

GENE_LIST=data/real_G/gene_list_chr20.txt
lbatch --array=1-100 --export=ALL,GENE_LIST=$GENE_LIST,GENES_PER_TASK=30 \
    job/bsbslmm_array.batch

# After it finishes, aggregate + finalize:
lbatch --export=ALL,CHR=20,PREFIX=bsbslmm \
    job/bsbslmm_finalize_chr.batch

For the full reproduction recipe (panel benchmark, ablation, biological insights, TWAS), follow PIPELINE.md section by section.

Repository layout

bsbslmm/                     Python package — model + Numba kernel
scripts/                     Per-step runner scripts
job/                         Slurm batch scripts
config/                      Cluster credentials (edit cluster.env once)
data/                        Per-gene cache (gitignored; populated by prepare_real_G.py)
results/                     Per-method fits + analysis outputs (gitignored)
PIPELINE.md                  End-to-end reproduction guide
LICENSE                      MIT

The Python package exposes:

from bsbslmm.bsbslmm_mcmc            import bsbslmm_gibbs,            predict_bsbslmm
from bsbslmm.bsbslmm_block_only_mcmc import bsbslmm_block_only_gibbs, predict_bsbslmm_block_only
from bsbslmm.bslmm_mcmc              import bslmm_gibbs,              predict_bslmm
from bsbslmm.blocks                  import compute_ld_matrix, partition_blocks

Citation

If you use bsBSLMM in your work, please cite:

@misc{huang2026annotationinformedblocksparsebayesianmodeling,
      title={Annotation-Informed Block-Sparse Bayesian Modeling for cis-Expression Prediction}, 
      author={Lei Huang and Hui Shen and Kuan-Jui Su and Chuan Qiu and Martha Isabel Gonzalez-Ramirez and Anqi Liu and Zhe Luo and Yun Gong and Yipu Zhang and Dawei Li and Chaoyang Zhang and Hong-Wen Deng},
      year={2026},
      eprint={2606.00483},
      archivePrefix={arXiv},
      primaryClass={q-bio.GN},
      url={https://arxiv.org/abs/2606.00483}, 
}

License

MIT. See LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

bsBSLMM — Block-Sparse BSLMM for cis-expression prediction

Highlights

Installation

Optional: native C++ kernel

Build

Enable at runtime

Configuration (one-time)

Quick start

Repository layout

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
bin		bin
bsbslmm		bsbslmm
config		config
job		job
los		los
scripts		scripts
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
PIPELINE.md		PIPELINE.md
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

bsBSLMM — Block-Sparse BSLMM for cis-expression prediction

Highlights

Installation

Optional: native C++ kernel

Build

Enable at runtime

Configuration (one-time)

Quick start

Repository layout

Citation

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages