fix(_utils): handle Categorical strand in bed_to_regions (numba TypingError repro) #156
Merged
d-laub merged 2 commits into mcvickerlab:main from May 4, 2026
`bed_to_regions` only normalised the strand column when its dtype was
exactly `pl.Utf8`. BEDs that round-trip a strand category (e.g. through
`polars.bed.sort` -- per the seqpro and polars internals path) end up
with `strand: pl.Categorical`, which falls into the `else` branch and
keeps the column as Categorical:
```python
elif "strand" not in bed.schema:
    cols.append(pl.lit(1).cast(pl.Int32).alias("strand"))
else:
    cols.append(pl.col("strand"))  # <-- Categorical, never replaced
```
`bed.select(...).to_numpy()` on a frame whose columns mix Int32 and
Categorical collapses to `dtype=object`, which then propagates through
`Dataset._full_regions` (declared as `NDArray[np.int32]`). Downstream
numba kernels in `genvarloader._dataset._genotypes.get_diffs_sparse`
reject the resulting `q_starts` / `q_ends` slices with:

    numba.core.errors.TypingError: Failed in nopython mode pipeline
    non-precise type array(pyobject, 1d, A)

This silently breaks any haplotype query against a multi-region dataset
whose BED was sorted via the standard polars path -- the failure mode
that surfaced as the chr19 ADNI 808-cohort extraction in
[smb-protopheno#302](standardmodelbio/smb-protopheno#302).
Repro (mirrors what `Dataset.open` does internally):
```python
import polars as pl

from genvarloader._dataset._utils import bed_to_regions
from genvarloader._utils import ContigNormalizer

# Single-row BED whose strand column is Categorical rather than Utf8.
bed = pl.DataFrame({
    "chrom": ["chr19"], "chromStart": [100], "chromEnd": [200],
    "strand": ["+"],
}).with_columns(pl.col("strand").cast(pl.Categorical))

regions = bed_to_regions(bed, ContigNormalizer(["chr19"]))
print(regions.dtype)  # before: object, after: int32
```
Fix: branch on Categorical-or-Utf8 and `.cast(pl.Utf8)` before
`replace_strict`, since `replace_strict({"+": 1, "-": -1}, ...)` won't
reliably accept Categorical keys across all supported polars versions.
A non-string strand dtype now falls through to an `Int32` cast (callers
who pre-encoded as 1/-1 still work).
Tests in `tests/test_utils.py` pin all three branches: Categorical input,
Utf8 input, and missing-strand default.
The CI 'lock-file not up-to-date with the workspace' error fires for every PR because the 0.22.1 -> 0.22.2 bump on main updated `pyproject.toml`'s version field but did not re-run `pixi lock`. This commit regenerates the lockfile from the current `pixi.toml` + `pyproject.toml` so the CI's `pixi install --locked` check passes again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TL;DR

`bed_to_regions` only normalises the strand column when `strand` is exactly `pl.Utf8`. BEDs that round-trip a strand category (e.g. through `polars.bed.sort`, which is what `Dataset.open` calls internally) end up with `strand: pl.Categorical` and fall into the `else` branch, which keeps the column as Categorical.

`bed.select(...).to_numpy()` on a frame mixing `Int32` and `Categorical` collapses to `dtype=object`, which then propagates into `Dataset._full_regions` (typed as `NDArray[np.int32]`). Downstream, the numba kernel `_dataset._genotypes.get_diffs_sparse` rejects the resulting `q_starts` / `q_ends` slices with a `TypingError` (`non-precise type array(pyobject, 1d, A)`).

The break is invisible at write time and only surfaces on the first `ds[i, :].to_ak()` (or any haplotype query that hits `_haplotype_ilens`). Smaller BEDs sometimes happen to work because their dtype path through polars stays Utf8; multi-region BEDs from realistic panels reliably reproduce this.

Reproducer

The minimal trigger needs no real data: a single-row BED whose `strand` column is cast to `pl.Categorical` and passed to `bed_to_regions` comes back as `dtype=object` instead of `int32`.

End-to-end, this surfaced as the smb-protopheno#302 chr19 ADNI cohort extraction failure: every `Dataset.open(...).with_seqs("haplotypes")[i, :].to_ak()` call on the chr19 panel (1,381 genes / 11,044 CDS regions / 808 samples) hit the numba TypingError because `_full_regions` was `dtype=object`.

Fix

Branch on Categorical-or-Utf8 and `.cast(pl.Utf8)` before `replace_strict`, since `replace_strict({"+": 1, "-": -1}, ...)` doesn't reliably accept Categorical keys across all supported polars versions. A non-string strand dtype now falls through to an explicit `Int32` cast (callers who pre-encoded as 1/-1 still work; previously the `else` branch silently let any non-Utf8 type through unconverted).

Tests

Three regression tests in `tests/test_utils.py`, one per branch:

- `test_bed_to_regions_categorical_strand_returns_int32`: pins the fix.
- `test_bed_to_regions_utf8_strand_still_works`: guards against regressions in the previously-correct path.
- `test_bed_to_regions_no_strand_defaults_to_plus`: covers the default `strand=1` branch.

Verified end-to-end

After applying this patch in-place on the v15-7 ADNI pod's `_utils.py`:

- `_full_regions.dtype` flips from `object` → `int32`.
- `ds[0, :].to_ak()` on the 11,044-region chr19 dataset returns the expected `808 * 2 * bytes` Ragged array (no numba error).

Test plan

- `pytest tests/test_utils.py -v` (locally, on the patched copy)
- `pytest tests/ -q -k "not pgen"` if desired

🤖 Generated with Claude Code