Skip to content

fix: upcast float16/bfloat16 features to float32 in aggregate_sample_features#123

Merged
raylim merged 4 commits into
mainfrom
fix/aggregate-sample-features-precision
Jun 17, 2026
Merged

fix: upcast float16/bfloat16 features to float32 in aggregate_sample_features#123
raylim merged 4 commits into
mainfrom
fix/aggregate-sample-features-precision

Conversation

@raylim

@raylim raylim commented May 7, 2026

Copy link
Copy Markdown
Collaborator

Problem

Follow-up to #122. aggregate_sample_features.py has the same raw h5 read pattern (line 142):

features_list.append(np.array(h5["features"]))
  • bfloat16 is stored as |V2 (2-byte opaque void) via ml_dtypes — downstream consumers (sklearn, parquet) cannot handle this dtype.
  • float16 causes dtype errors in sklearn estimators that require float32.

Fix

Cast bfloat16 and float16 features to float32 immediately after reading from h5:

features_arr = np.array(h5["features"])
if features_arr.dtype.kind == "V" and features_arr.dtype.itemsize == 2:
    import ml_dtypes
    features_arr = features_arr.view(ml_dtypes.bfloat16).astype(np.float32)
elif features_arr.dtype == np.float16:
    features_arr = features_arr.astype(np.float32)
features_list.append(features_arr)

Tests

Two new round-trip tests added to test_aggregate_sample_features.py:

  • test_float16_features_upcast_to_float32 — verifies output H5 dtype is float32 and values match the float16 round-trip
  • test_bfloat16_features_upcast_to_float32 — same for bfloat16 (|V2 opaque void)

All 17 tests pass.

…features

Same raw h5 read pattern as merge_annotation_features (line 142).
After np.array(h5["features"]), cast bfloat16 (|V2 opaque void) and
float16 to float32 before passing to downstream aggregation.

- bfloat16 stored as |V2 via ml_dtypes: view + astype(float32)
- float16: direct astype(float32)

Tests added:
- test_float16_features_upcast_to_float32
- test_bfloat16_features_upcast_to_float32

All 17 tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 7, 2026 17:50

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes dtype incompatibilities when aggregating per-slide H5 feature embeddings into per-sample outputs by upcasting float16 and HDF5-stored bfloat16 (|V2 opaque void) feature arrays to float32 immediately after reading. This aligns aggregate_sample_features with downstream consumer expectations (e.g., scikit-learn, parquet).

Changes:

  • Upcast float16 and HDF5 |V2 (bfloat16) feature arrays to float32 during H5 ingestion in aggregate_sample_features.
  • Extend the aggregate-sample CLI test helper to support writing synthetic features in configurable dtypes.
  • Add round-trip tests asserting float32 output dtype and value preservation for float16 and bfloat16 inputs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
mussel/cli/aggregate_sample_features.py Casts float16/HDF5-encoded bfloat16 features to float32 right after reading from H5 to prevent downstream dtype issues.
tests/mussel/cli/test_aggregate_sample_features.py Adds dtype-parametrized H5 writer and new tests for float16/bfloat16 upcasting behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/mussel/cli/test_aggregate_sample_features.py Outdated
Comment thread tests/mussel/cli/test_aggregate_sample_features.py Outdated

@raylim raylim left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix in aggregate_sample_features.py is correct — the bfloat16 void detection, .view(ml_dtypes.bfloat16).astype(np.float32) reinterpretation, and float16 upcasting are all sound. Tests are meaningful and cover the right round-trips.

One real gap: load_features_from_h5 in mussel/utils/file.py:345 has the identical bug and is not addressed here. See inline comment for details.

Comment thread mussel/cli/aggregate_sample_features.py
raylim and others added 3 commits June 16, 2026 14:54
Apply the same bfloat16/float16 → float32 upcasting fix to
load_features_from_h5 in mussel/utils/file.py. Previously, float16
would return a float16 torch.Tensor and bfloat16 (stored as |V2 opaque
void by h5py/ml_dtypes) would raise a TypeError in torch.from_numpy.

Also add three tests in tests/mussel/utils/test_file.py covering
float32 pass-through, float16 upcast, and bfloat16 upcast.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…_sample_features.py

ml_dtypes is a required dependency (pyproject.toml), so the import
belongs at the top of the file rather than inline at the call site.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d-trip mismatch

Both upcast tests previously computed 'expected' from an independently
generated float32 array. The float64→lowprec cast in _write_fake_h5 can
produce different values than float32→lowprec, making the assertions
potentially flaky. Fix by reading back the exact bytes written to the H5
file and using those as the expected values.

Also move 'import ml_dtypes' to the top of the test file.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@raylim raylim merged commit 782d747 into main Jun 17, 2026
3 checks passed
@raylim raylim deleted the fix/aggregate-sample-features-precision branch June 17, 2026 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants