fix: upcast float16/bfloat16 features to float32 in aggregate_sample_features#123
Conversation
…features Same raw h5 read pattern as merge_annotation_features (line 142). After np.array(h5["features"]), cast bfloat16 (|V2 opaque void) and float16 to float32 before passing to downstream aggregation. - bfloat16 stored as |V2 via ml_dtypes: view + astype(float32) - float16: direct astype(float32) Tests added: - test_float16_features_upcast_to_float32 - test_bfloat16_features_upcast_to_float32 All 17 tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes dtype incompatibilities when aggregating per-slide H5 feature embeddings into per-sample outputs by upcasting float16 and HDF5-stored bfloat16 (|V2 opaque void) feature arrays to float32 immediately after reading. This aligns aggregate_sample_features with downstream consumer expectations (e.g., scikit-learn, parquet).
Changes:
- Upcast
float16and HDF5|V2(bfloat16) feature arrays tofloat32during H5 ingestion inaggregate_sample_features. - Extend the aggregate-sample CLI test helper to support writing synthetic features in configurable dtypes.
- Add round-trip tests asserting float32 output dtype and value preservation for
float16andbfloat16inputs.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| mussel/cli/aggregate_sample_features.py | Casts float16/HDF5-encoded bfloat16 features to float32 right after reading from H5 to prevent downstream dtype issues. |
| tests/mussel/cli/test_aggregate_sample_features.py | Adds dtype-parametrized H5 writer and new tests for float16/bfloat16 upcasting behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
raylim
left a comment
There was a problem hiding this comment.
The fix in aggregate_sample_features.py is correct — the bfloat16 void detection, .view(ml_dtypes.bfloat16).astype(np.float32) reinterpretation, and float16 upcasting are all sound. Tests are meaningful and cover the right round-trips.
One real gap: load_features_from_h5 in mussel/utils/file.py:345 has the identical bug and is not addressed here. See inline comment for details.
Apply the same bfloat16/float16 → float32 upcasting fix to load_features_from_h5 in mussel/utils/file.py. Previously, float16 would return a float16 torch.Tensor and bfloat16 (stored as |V2 opaque void by h5py/ml_dtypes) would raise a TypeError in torch.from_numpy. Also add three tests in tests/mussel/utils/test_file.py covering float32 pass-through, float16 upcast, and bfloat16 upcast. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…_sample_features.py ml_dtypes is a required dependency (pyproject.toml), so the import belongs at the top of the file rather than inline at the call site. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d-trip mismatch Both upcast tests previously computed 'expected' from an independently generated float32 array. The float64→lowprec cast in _write_fake_h5 can produce different values than float32→lowprec, making the assertions potentially flaky. Fix by reading back the exact bytes written to the H5 file and using those as the expected values. Also move 'import ml_dtypes' to the top of the test file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem
Follow-up to #122.
aggregate_sample_features.pyhas the same raw h5 read pattern (line 142):|V2(2-byte opaque void) viaml_dtypes— downstream consumers (sklearn, parquet) cannot handle this dtype.float32.Fix
Cast bfloat16 and float16 features to
float32immediately after reading from h5:Tests
Two new round-trip tests added to
test_aggregate_sample_features.py:test_float16_features_upcast_to_float32— verifies output H5 dtype isfloat32and values match the float16 round-triptest_bfloat16_features_upcast_to_float32— same for bfloat16 (|V2opaque void)All 17 tests pass.