fix(_utils): handle Categorical strand in bed_to_regions (numba TypingError repro) #156

Merged
d-laub merged 2 commits into mcvickerlab:main from bschilder:bschilder/fix-categorical-strand
May 4, 2026

@bschilder
Collaborator

TL;DR

bed_to_regions only normalises the strand column when strand is exactly pl.Utf8. BEDs that round-trip a strand category (e.g. through polars.bed.sort, which is what Dataset.open calls internally) end up with strand: pl.Categorical and fall into the else branch:

if bed.schema.get("strand", None) == pl.Utf8:
    cols.append(pl.col("strand").replace_strict({"+": 1, "-": -1}, return_dtype=pl.Int32))
elif "strand" not in bed.schema:
    cols.append(pl.lit(1).cast(pl.Int32).alias("strand"))
else:
    cols.append(pl.col("strand"))   # ← Categorical survives unchanged

bed.select(...).to_numpy() on a frame mixing Int32 and Categorical collapses to dtype=object, which then propagates into Dataset._full_regions (typed as NDArray[np.int32]). Downstream the numba kernel _dataset._genotypes.get_diffs_sparse rejects the resulting q_starts / q_ends slices:

numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
non-precise type array(pyobject, 1d, A)

The break is invisible at write time and only surfaces on the first ds[i, :].to_ak() (or any haplotype query that hits _haplotype_ilens). Smaller BEDs sometimes happen to work because their dtype path through polars stays Utf8 — multi-region BEDs from realistic panels reliably reproduce this.

Reproducer

This is the minimal trigger — no real data needed:

import polars as pl
from genvarloader._dataset._utils import bed_to_regions
from genvarloader._utils import ContigNormalizer

bed = pl.DataFrame({
    "chrom": ["chr19", "chr19"],
    "chromStart": [100, 200],
    "chromEnd": [150, 250],
    "strand": ["+", "-"],
}).with_columns(pl.col("strand").cast(pl.Categorical))

regions = bed_to_regions(bed, ContigNormalizer(["chr19"]))
print(regions.dtype)  # before: object,  after: int32

End-to-end, this surfaced as the smb-protopheno#302 chr19 ADNI cohort extraction failure: every Dataset.open(...).with_seqs("haplotypes")[i, :].to_ak() call on the chr19 panel (1,381 genes / 11,044 CDS regions / 808 samples) hit the numba TypingError because _full_regions was dtype=object.

Fix

Branch on Categorical-or-Utf8 and .cast(pl.Utf8) before replace_strict, since replace_strict({"+": 1, "-": -1}, ...) doesn't reliably accept Categorical keys across all supported polars versions. A non-string strand dtype now falls through to an explicit Int32 cast (callers who pre-encoded as 1/-1 still work; previously the else branch silently let any non-Utf8 type through unconverted).

strand_dtype = bed.schema.get("strand", None)
if strand_dtype is None:
    cols.append(pl.lit(1).cast(pl.Int32).alias("strand"))
elif strand_dtype == pl.Utf8 or strand_dtype == pl.Categorical:
    cols.append(
        pl.col("strand").cast(pl.Utf8).replace_strict({"+": 1, "-": -1}, return_dtype=pl.Int32)
    )
else:
    cols.append(pl.col("strand").cast(pl.Int32))

Tests

Three regression tests in tests/test_utils.py, one per branch:

  • test_bed_to_regions_categorical_strand_returns_int32: pins the fix.
  • test_bed_to_regions_utf8_strand_still_works: guards the previously-correct Utf8 path against regression.
  • test_bed_to_regions_no_strand_defaults_to_plus: covers the default strand=1 branch.

Verified end-to-end

After applying this patch in-place on the v15-7 ADNI pod's _utils.py:

  • _full_regions.dtype flips from object to int32.
  • ds[0, :].to_ak() on the 11,044-region chr19 dataset returns the expected 808 * 2 * bytes Ragged array (no numba error).
  • The downstream chr19 haplotype-extraction pipeline progresses past the previously-fatal first call.
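
The dtype check in the first bullet could also be expressed as a fail-fast guard, so that an object-dtype array is rejected before it reaches a numba kernel. The sketch below is not part of this PR: `assert_int32_regions` is a hypothetical helper, and the four-column layout of the example arrays (contig index, start, end, strand) is an assumption made for illustration.

```python
import numpy as np


def assert_int32_regions(regions: np.ndarray) -> np.ndarray:
    """Hypothetical guard: raise early if a regions array is not int32."""
    if regions.dtype != np.int32:
        raise TypeError(f"expected int32 regions, got dtype={regions.dtype}")
    return regions


# Assumed layout for illustration: (contig index, start, end, strand).
good = np.array([[0, 100, 150, 1], [0, 200, 250, -1]], dtype=np.int32)
bad = np.array([[0, 100, 150, "+"], [0, 200, 250, "-"]], dtype=object)

assert_int32_regions(good)  # passes silently

try:
    assert_int32_regions(bad)
except TypeError as err:
    print(err)
```

A check like this turns the late numba TypingError into an immediate, readable error at the point where the regions are built.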

Test plan

  • pytest tests/test_utils.py -v (locally, on the patched copy)
  • End-to-end smoke on a 1,381-gene chr19 panel with 808 samples (smb-protopheno extractor), previously failing
  • Reviewer to run full pytest tests/ -q -k "not pgen" if desired

🤖 Generated with Claude Code

@bschilder bschilder force-pushed the bschilder/fix-categorical-strand branch from e9d2f58 to 47c6409 May 3, 2026 04:38
`bed_to_regions` only normalised the strand column when its dtype was
exactly `pl.Utf8`. BEDs that round-trip a strand category (e.g. through
`polars.bed.sort` -- per the seqpro and polars internals path) end up
with `strand: pl.Categorical`, which falls into the `else` branch and
keeps the column as Categorical:

```python
elif "strand" not in bed.schema:
    cols.append(pl.lit(1).cast(pl.Int32).alias("strand"))
else:
    cols.append(pl.col("strand"))   # <-- Categorical, never replaced
```

`bed.select(...).to_numpy()` on a frame whose columns mix Int32 and
Categorical collapses to `dtype=object`, which then propagates through
`Dataset._full_regions` (declared as `NDArray[np.int32]`). Downstream
numba kernels in `genvarloader._dataset._genotypes.get_diffs_sparse`
reject the resulting `q_starts` / `q_ends` slices with:

    numba.core.errors.TypingError: Failed in nopython mode pipeline
    non-precise type array(pyobject, 1d, A)

This silently breaks any haplotype query against a multi-region dataset
whose BED was sorted via the standard polars path -- the failure mode
that surfaced as the chr19 ADNI 808-cohort extraction in
[smb-protopheno#302](standardmodelbio/smb-protopheno#302).

Repro (mirrors what `Dataset.open` does internally):

```python
import polars as pl
from genvarloader._dataset._utils import bed_to_regions
from genvarloader._utils import ContigNormalizer

bed = pl.DataFrame({
    "chrom": ["chr19"], "chromStart": [100], "chromEnd": [200],
    "strand": ["+"],
}).with_columns(pl.col("strand").cast(pl.Categorical))

regions = bed_to_regions(bed, ContigNormalizer(["chr19"]))
print(regions.dtype)  # before: object,  after: int32
```

Fix: branch on Categorical-or-Utf8 and `.cast(pl.Utf8)` before
`replace_strict`, since `replace_strict({"+": 1, "-": -1}, ...)` won't
reliably accept Categorical keys across all supported polars versions.
A non-string strand dtype now falls through to an `Int32` cast (callers
who pre-encoded as 1/-1 still work).

Tests in `tests/test_utils.py` pin all three branches: Categorical input,
Utf8 input, and missing-strand default.
@bschilder bschilder force-pushed the bschilder/fix-categorical-strand branch from 28c7f28 to 8cd676a May 3, 2026 04:39
The CI 'lock-file not up-to-date with the workspace' error fires for
every PR because the 0.22.1 -> 0.22.2 bump on main updated pyproject.toml's
version field but did not re-run pixi lock. This commit regenerates the
lockfile from the current pixi.toml + pyproject.toml so the CI's
pixi install --locked check passes again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bschilder bschilder force-pushed the bschilder/fix-categorical-strand branch from ca90b60 to 671e12c May 3, 2026 04:49
Collaborator

@d-laub d-laub left a comment

LGTM, merging!

@d-laub d-laub merged commit 0c425f9 into mcvickerlab:main May 4, 2026
5 checks passed
2 participants