fix(_utils): handle Categorical strand in bed_to_regions (numba TypingError repro) by bschilder · Pull Request #156 · mcvickerlab/GenVarLoader

bschilder · 2026-05-03T04:33:42Z

TL;DR

bed_to_regions only normalises the strand column when strand is exactly pl.Utf8. BEDs that round-trip a strand category (e.g. through polars.bed.sort, which is what Dataset.open calls internally) end up with strand: pl.Categorical and fall into the else branch:

if bed.schema.get("strand", None) == pl.Utf8:
    cols.append(pl.col("strand").replace_strict({"+": 1, "-": -1}, return_dtype=pl.Int32))
elif "strand" not in bed.schema:
    cols.append(pl.lit(1).cast(pl.Int32).alias("strand"))
else:
    cols.append(pl.col("strand"))   # ← Categorical survives unchanged

bed.select(...).to_numpy() on a frame mixing Int32 and Categorical collapses to dtype=object, which then propagates into Dataset._full_regions (typed as NDArray[np.int32]). Downstream the numba kernel _dataset._genotypes.get_diffs_sparse rejects the resulting q_starts / q_ends slices:

numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type array(pyobject, 1d, A)

The break is invisible at write time and only surfaces on the first ds[i, :].to_ak() (or any haplotype query that hits _haplotype_ilens). Smaller BEDs sometimes happen to work because their dtype path through polars stays Utf8 — multi-region BEDs from realistic panels reliably reproduce this.
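Because the failure only appears at query time, one option is to validate the regions array as soon as it is built. The sketch below is not part of this PR: `require_int32` is a hypothetical helper name, and it assumes only that the regions array is a NumPy array that should already be `int32`.

```python
import numpy as np


def require_int32(regions: np.ndarray, name: str = "regions") -> np.ndarray:
    """Hypothetical guard: fail early if an array is not int32.

    An object-dtype array here would otherwise only be rejected later,
    when a numba nopython kernel tries to type it.
    """
    if regions.dtype != np.int32:
        raise TypeError(
            f"{name} must have dtype int32, got {regions.dtype}; "
            "check that every BED column was normalised before to_numpy()"
        )
    return regions


# A well-formed regions array passes through unchanged.
good = np.array([[100, 150, 1], [200, 250, -1]], dtype=np.int32)
checked = require_int32(good)

# A mixed int/str array (what an unconverted strand column produces)
# is rejected immediately instead of at the first haplotype query.
bad = np.array([[100, 150, "+"], [200, 250, "-"]], dtype=object)
try:
    require_int32(bad)
    rejected = False
except TypeError:
    rejected = True
```

A check like this would turn the late numba `TypingError` into an immediate, readable error at the point where the bad dtype is introduced.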

Reproducer

This is the minimal trigger — no real data needed:

import polars as pl
from genvarloader._dataset._utils import bed_to_regions
from genvarloader._utils import ContigNormalizer

bed = pl.DataFrame({
    "chrom": ["chr19", "chr19"],
    "chromStart": [100, 200],
    "chromEnd": [150, 250],
    "strand": ["+", "-"],
}).with_columns(pl.col("strand").cast(pl.Categorical))

regions = bed_to_regions(bed, ContigNormalizer(["chr19"]))
print(regions.dtype)  # before: object,  after: int32

End-to-end, this surfaced as the smb-protopheno#302 chr19 ADNI cohort extraction failure: every Dataset.open(...).with_seqs("haplotypes")[i, :].to_ak() call on the chr19 panel (1,381 genes / 11,044 CDS regions / 808 samples) hit the numba TypingError because _full_regions was dtype=object.

Fix

Branch on Categorical-or-Utf8 and .cast(pl.Utf8) before replace_strict, since replace_strict({"+": 1, "-": -1}, ...) doesn't reliably accept Categorical keys across all supported polars versions. Any other strand dtype now falls through to an explicit Int32 cast, so callers who pre-encoded strand as 1/-1 still work. Previously the else branch silently passed any non-Utf8 column through unconverted.

strand_dtype = bed.schema.get("strand", None)
if strand_dtype is None:
    cols.append(pl.lit(1).cast(pl.Int32).alias("strand"))
elif strand_dtype == pl.Utf8 or strand_dtype == pl.Categorical:
    cols.append(
        pl.col("strand").cast(pl.Utf8).replace_strict({"+": 1, "-": -1}, return_dtype=pl.Int32)
    )
else:
    cols.append(pl.col("strand").cast(pl.Int32))

Tests

Three regression tests in tests/test_utils.py, one per branch:

test_bed_to_regions_categorical_strand_returns_int32 — pins the fix.
test_bed_to_regions_utf8_strand_still_works — guards against the previously-correct path.
test_bed_to_regions_no_strand_defaults_to_plus — the default strand=1 branch.

Verified end-to-end

After applying this patch in-place on the v15-7 ADNI pod's _utils.py:

_full_regions.dtype flips from object → int32.
ds[0, :].to_ak() on the 11,044-region chr19 dataset returns the expected 808 * 2 * bytes Ragged array (no numba error).
The downstream chr19 haplotype-extraction pipeline progresses past the previously-fatal first call.

Test plan

pytest tests/test_utils.py -v (locally, on the patched copy)
End-to-end smoke on a 1,381-gene chr19 panel with 808 samples (smb-protopheno extractor), previously failing
Reviewer to run full pytest tests/ -q -k "not pgen" if desired

🤖 Generated with Claude Code

`bed_to_regions` only normalised the strand column when its dtype was exactly `pl.Utf8`. BEDs that round-trip a strand category (e.g. through `polars.bed.sort` -- per the seqpro and polars internals path) end up with `strand: pl.Categorical`, which falls into the `else` branch and keeps the column as Categorical:

```python
elif "strand" not in bed.schema:
    cols.append(pl.lit(1).cast(pl.Int32).alias("strand"))
else:
    cols.append(pl.col("strand"))  # <-- Categorical, never replaced
```

`bed.select(...).to_numpy()` on a frame whose columns mix Int32 and Categorical collapses to `dtype=object`, which then propagates through `Dataset._full_regions` (declared as `NDArray[np.int32]`). Downstream numba kernels in `genvarloader._dataset._genotypes.get_diffs_sparse` reject the resulting `q_starts` / `q_ends` slices with:

    numba.core.errors.TypingError: Failed in nopython mode pipeline
    non-precise type array(pyobject, 1d, A)

This silently breaks any haplotype query against a multi-region dataset whose BED was sorted via the standard polars path -- the failure mode that surfaced as the chr19 ADNI 808-cohort extraction in [smb-protopheno#302](standardmodelbio/smb-protopheno#302).

Repro (mirrors what `Dataset.open` does internally):

```python
import polars as pl
from genvarloader._dataset._utils import bed_to_regions
from genvarloader._utils import ContigNormalizer

bed = pl.DataFrame({
    "chrom": ["chr19"],
    "chromStart": [100],
    "chromEnd": [200],
    "strand": ["+"],
}).with_columns(pl.col("strand").cast(pl.Categorical))

regions = bed_to_regions(bed, ContigNormalizer(["chr19"]))
print(regions.dtype)  # before: object, after: int32
```

Fix: branch on Categorical-or-Utf8 and `.cast(pl.Utf8)` before `replace_strict`, since `replace_strict({"+": 1, "-": -1}, ...)` won't reliably accept Categorical keys across all supported polars versions. A non-string strand dtype now falls through to an `Int32` cast (callers who pre-encoded as 1/-1 still work).

Tests in `tests/test_utils.py` pin all three branches: Categorical input, Utf8 input, and missing-strand default.

The CI 'lock-file not up-to-date with the workspace' error fires for every PR because the 0.22.1 -> 0.22.2 bump on main updated pyproject.toml's version field but did not re-run pixi lock. This commit regenerates the lockfile from the current pixi.toml + pyproject.toml so the CI's pixi install --locked check passes again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

d-laub

LGTM, merging!

bschilder force-pushed the bschilder/fix-categorical-strand branch from e9d2f58 to 47c6409 Compare May 3, 2026 04:38

bschilder force-pushed the bschilder/fix-categorical-strand branch from 28c7f28 to 8cd676a Compare May 3, 2026 04:39

bschilder force-pushed the bschilder/fix-categorical-strand branch from ca90b60 to 671e12c Compare May 3, 2026 04:49

d-laub approved these changes May 4, 2026

d-laub merged commit 0c425f9 into mcvickerlab:main May 4, 2026
5 checks passed

d-laub mentioned this pull request May 4, 2026

bed_to_regions silently drops strand when polars Categorical #152

Closed

d-laub merged 2 commits into mcvickerlab:main from bschilder:bschilder/fix-categorical-strand
