A hierarchical Bayesian sparse linear mixed model for predicting gene expression from cis-window genotype. bsBSLMM extends Zhou et al.'s BSLMM (2013) with two architectural additions matched to the structure of cis-regulatory genetic variation:
-
Block-level spike-and-slab gate
$z_b\sim\mathrm{Bernoulli}(\pi_b)$ on top of the per-SNP gate$\gamma_j$ — when$z_b=0$ the entire LD-coherent block is deterministically zeroed. This is block sparsity in the sparse-group-lasso sense (Yuan & Lin 2006), made Bayesian. -
Annotation-informed per-SNP prior
$\pi_{s,j}=\sigma(\alpha+\beta,a_j)$ where$a_j$ is the TSS-distance feature. The hyper-pair$(\alpha,\beta)$ is learned by Metropolis-within-Gibbs at every sweep — no precomputed annotation regression.
- Genome-wide cis-prediction yield + per-gene-quality lift: bsBSLMM passes more genes than BSLMM at matched MCMC budgets and beats BSLMM on paired Wilcoxon head-to-head — significant on 21 of 22 autosomes individually (chr18 borderline at p = 0.068), with overall one-sided paired Wilcoxon p = 1.1×10⁻¹⁸¹ over the 5,584-gene shared set (mean Δ test r = +0.024).
- Calibrated posterior: per-SNP credible intervals (95% CI excludes 0) deliver an active-SNP set that is 3.04× enriched in ENCODE GM12878 DNase peaks (vs BSLMM 1.40×; 95% CIs disjoint).
- Downstream TWAS application: a SPrediXcan-style TWAS against the de Lange 2017 IBD GWAS finds canonical IBD susceptibility loci (IL23R, PTPN2, ATG16L1, CARD9, the chr17q21 ORMDL3-GSDMB-IKZF3 cluster) with signal distributed across 15 chromosomes.
git clone https://github.com/learnslowly/bsBSLMM.git
cd bsBSLMM
# Create the modelling env
conda create -n bsbslmm_env -c conda-forge -c bioconda --yes \
python=3.10 numpy=1.24.3 scipy scikit-learn numba joblib \
pandas matplotlib seaborn statsmodels cyvcf2 tabix bcftools
# Optional: BSLMM baseline (GEMMA)
conda create -n gemma_env -c bioconda --yes gemma=0.98.5 python=3.10 numpy
# Optional: lbatch — recommended Slurm queue throttle for parallel submission
conda activate bsbslmm_env
pip install lbatchSee PIPELINE.md §0 for full environment + data setup
(GEUVADIS expression, 1000G EUR genotype, GENCODE GTF, annotation
downloads).
The per-iteration block-Gibbs sweep has an optional C++ implementation (pybind11 extension) alongside the default Numba-JIT path. Both implement identical statistical semantics, consume identical RNG draws per iteration, and produce posteriors that agree to one ULP of float64 at the production schedule. On chr20 the C++ kernel runs ~1.9× faster per gene (30.8 vs 55.6 s under single-thread BLAS). Single-threaded SIMD; per-gene parallelism is via the SLURM array jobs.
CMake + vendored pybind11. Requirements: C++17 compiler (g++ ≥ 7 or clang ≥ 5), CMake ≥ 3.15, the project conda env (for Python headers).
bash scripts/build_native.sh # wraps cmake in a one-shot sbatch
bash scripts/build_native.sh --local # builds in-placeThe first invocation clones pybind11 (~5 MB) into extern/pybind11/.
After the build, bsbslmm/_native_mcmc.cpython-<py>-<arch>.so is in
place next to the Python files. Both the .so and extern/pybind11/
are gitignored — each developer builds locally on their target machine.
>>> from bsbslmm._native import HAS_NATIVE
>>> HAS_NATIVE
Truebsbslmm_gibbs(..., use_native=True)or set BSBSLMM_USE_NATIVE=1 in the environment so existing call
sites pick it up unchanged:
export BSBSLMM_USE_NATIVE=1
sbatch job/bsbslmm_array.batchIf the extension is not built, use_native=True raises RuntimeError
(no silent fallback). Equivalence between paths is tested by
tests/test_native_smoke.py and tests/test_native_sweep_equivalence.py.
Everything cluster- and environment-specific lives in config/cluster.env:
Slurm partition/account, filesystem locations (PROJECT_ROOT,
DATA_DIR, CONDA_ROOT, TIGAR_ROOT, METAXCAN_DIR), and conda env
names (CONDA_ENV, CONDA_ENV_GEMMA, CONDA_ENV_METAXCAN). Edit it
once on a fresh clone:
$EDITOR config/cluster.env # set partition, account, paths, env names
source config/cluster.env # or rely on bin/sbatch_with_creds.shSee PIPELINE.md §0 for the full list and per-var
descriptions.
Once data prep has run (see PIPELINE.md §1), fit bsBSLMM on chr20:
GENE_LIST=data/real_G/gene_list_chr20.txt
lbatch --array=1-100 --export=ALL,GENE_LIST=$GENE_LIST,GENES_PER_TASK=30 \
job/bsbslmm_array.batch
# After it finishes, aggregate + finalize:
lbatch --export=ALL,CHR=20,PREFIX=bsbslmm \
job/bsbslmm_finalize_chr.batchFor the full reproduction recipe (panel benchmark, ablation, biological
insights, TWAS), follow PIPELINE.md section by section.
bsbslmm/ Python package — model + Numba kernel
scripts/ Per-step runner scripts
job/ Slurm batch scripts
config/ Cluster credentials (edit cluster.env once)
data/ Per-gene cache (gitignored; populated by prepare_real_G.py)
results/ Per-method fits + analysis outputs (gitignored)
PIPELINE.md End-to-end reproduction guide
LICENSE MIT
The Python package exposes:
from bsbslmm.bsbslmm_mcmc import bsbslmm_gibbs, predict_bsbslmm
from bsbslmm.bsbslmm_block_only_mcmc import bsbslmm_block_only_gibbs, predict_bsbslmm_block_only
from bsbslmm.bslmm_mcmc import bslmm_gibbs, predict_bslmm
from bsbslmm.blocks import compute_ld_matrix, partition_blocksIf you use bsBSLMM in your work, please cite:
@misc{huang2026annotationinformedblocksparsebayesianmodeling,
title={Annotation-Informed Block-Sparse Bayesian Modeling for cis-Expression Prediction},
author={Lei Huang and Hui Shen and Kuan-Jui Su and Chuan Qiu and Martha Isabel Gonzalez-Ramirez and Anqi Liu and Zhe Luo and Yun Gong and Yipu Zhang and Dawei Li and Chaoyang Zhang and Hong-Wen Deng},
year={2026},
eprint={2606.00483},
archivePrefix={arXiv},
primaryClass={q-bio.GN},
url={https://arxiv.org/abs/2606.00483},
}
MIT. See LICENSE.