fix: upcast float16/bfloat16 features to float32 in aggregate_sample_features by raylim · Pull Request #123 · pathology-data-mining/Mussel

raylim · 2026-05-07T17:50:50Z

Problem

Follow-up to #122. aggregate_sample_features.py has the same raw h5 read pattern (line 142):

features_list.append(np.array(h5["features"]))

bfloat16 is stored as |V2 (2-byte opaque void) via ml_dtypes — downstream consumers (sklearn, parquet) cannot handle this dtype.
float16 causes dtype errors in sklearn estimators that require float32.

Fix

Cast bfloat16 and float16 features to float32 immediately after reading from h5:

features_arr = np.array(h5["features"])
if features_arr.dtype.kind == "V" and features_arr.dtype.itemsize == 2:
    import ml_dtypes
    features_arr = features_arr.view(ml_dtypes.bfloat16).astype(np.float32)
elif features_arr.dtype == np.float16:
    features_arr = features_arr.astype(np.float32)
features_list.append(features_arr)

Tests

Two new round-trip tests added to test_aggregate_sample_features.py:

test_float16_features_upcast_to_float32 — verifies output H5 dtype is float32 and values match the float16 round-trip
test_bfloat16_features_upcast_to_float32 — same for bfloat16 (|V2 opaque void)

All 17 tests pass.

…features Same raw h5 read pattern as merge_annotation_features (line 142). After np.array(h5["features"]), cast bfloat16 (|V2 opaque void) and float16 to float32 before passing to downstream aggregation. - bfloat16 stored as |V2 via ml_dtypes: view + astype(float32) - float16: direct astype(float32) Tests added: - test_float16_features_upcast_to_float32 - test_bfloat16_features_upcast_to_float32 All 17 tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR fixes dtype incompatibilities when aggregating per-slide H5 feature embeddings into per-sample outputs by upcasting float16 and HDF5-stored bfloat16 (|V2 opaque void) feature arrays to float32 immediately after reading. This aligns aggregate_sample_features with downstream consumer expectations (e.g., scikit-learn, parquet).

Changes:

Upcast float16 and HDF5 |V2 (bfloat16) feature arrays to float32 during H5 ingestion in aggregate_sample_features.
Extend the aggregate-sample CLI test helper to support writing synthetic features in configurable dtypes.
Add round-trip tests asserting float32 output dtype and value preservation for float16 and bfloat16 inputs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
mussel/cli/aggregate_sample_features.py	Casts `float16`/HDF5-encoded `bfloat16` features to `float32` right after reading from H5 to prevent downstream dtype issues.
tests/mussel/cli/test_aggregate_sample_features.py	Adds dtype-parametrized H5 writer and new tests for float16/bfloat16 upcasting behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

raylim

The fix in aggregate_sample_features.py is correct — the bfloat16 void detection, .view(ml_dtypes.bfloat16).astype(np.float32) reinterpretation, and float16 upcasting are all sound. Tests are meaningful and cover the right round-trips.

One real gap: load_features_from_h5 in mussel/utils/file.py:345 has the identical bug and is not addressed here. See inline comment for details.

Apply the same bfloat16/float16 → float32 upcasting fix to load_features_from_h5 in mussel/utils/file.py. Previously, float16 would return a float16 torch.Tensor and bfloat16 (stored as |V2 opaque void by h5py/ml_dtypes) would raise a TypeError in torch.from_numpy. Also add three tests in tests/mussel/utils/test_file.py covering float32 pass-through, float16 upcast, and bfloat16 upcast. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…_sample_features.py ml_dtypes is a required dependency (pyproject.toml), so the import belongs at the top of the file rather than inline at the call site. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…d-trip mismatch Both upcast tests previously computed 'expected' from an independently generated float32 array. The float64→lowprec cast in _write_fake_h5 can produce different values than float32→lowprec, making the assertions potentially flaky. Fix by reading back the exact bytes written to the H5 file and using those as the expected values. Also move 'import ml_dtypes' to the top of the test file. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 7, 2026 17:50

Copilot started reviewing on behalf of raylim May 7, 2026 17:51 View session

Copilot AI reviewed May 7, 2026

View reviewed changes

Comment thread tests/mussel/cli/test_aggregate_sample_features.py Outdated

Comment thread tests/mussel/cli/test_aggregate_sample_features.py Outdated

raylim commented Jun 16, 2026

View reviewed changes

Comment thread mussel/cli/aggregate_sample_features.py

raylim and others added 3 commits June 16, 2026 14:54

raylim merged commit 782d747 into main Jun 17, 2026
3 checks passed

raylim deleted the fix/aggregate-sample-features-precision branch June 17, 2026 15:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: upcast float16/bfloat16 features to float32 in aggregate_sample_features#123

fix: upcast float16/bfloat16 features to float32 in aggregate_sample_features#123
raylim merged 4 commits into
mainfrom
fix/aggregate-sample-features-precision

raylim commented May 7, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

raylim left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

raylim commented May 7, 2026

Problem

Fix

Tests

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

raylim left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants