Fix ISTP validation FILLVAL errors #3191
Conversation
…local builds" This reverts commit 19148a0.
SWE L2 writes acq_duration as CDF_UINT4, but the L2 variable
attribute template still used the signed int64 fill sentinel. When the
CDF is written, cdflib coerces FILLVAL to the variable's actual CDF
type, so the negative sentinel wraps to 0 for the emitted UINT4 field.
That produces a non-standard L2 CDF and causes SPDF to report:
acq_duration
FILLVAL value of '0' is non-standard.
The recommended value is '4294967295'.
Fix the L2 metadata template to use the correct UINT4 fill value
(4294967295) for acq_duration.
Also extend the existing SWE L2 end-to-end test to inspect the written
CDF directly with cdflib and verify that acq_duration is emitted as
CDF_UINT4 with the expected FILLVAL. This gives us a regression test on
the actual serialized CDF metadata, not just the in-memory dataset.
This change is intentionally scoped to SWE L2. L1 metadata is unchanged.
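The wrap described above is easy to reproduce with NumPy alone. This is an illustrative sketch, not project code:

```python
import numpy as np

# The signed int64 fill sentinel from the old template.
signed_sentinel = np.array([-9223372036854775808], dtype=np.int64)

# Coercing it to the variable's actual CDF type (UINT4 ~ uint32)
# keeps only the low 32 bits, which are all zero.
wrapped = signed_sentinel.astype(np.uint32)
print(wrapped[0])  # 0 -- the non-standard FILLVAL SPDF reports

# The ISTP-recommended UINT4 fill value is the type's maximum.
print(np.iinfo(np.uint32).max)  # 4294967295
```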
Several products were writing variables with CDF types that did not match the
configured FILLVAL and FORMAT metadata. The science values were otherwise fine,
but the emitted CDFs failed SPDF/ISTP checks because the metadata still
described the pre-coercion or intended type rather than the serialized type.
Update the staged products so their attrs match what is actually written:
- CoDICE lo-direct-events
  - change data_quality to use the UINT2 fill value
  - change spin_sector to use floating-point fill/format because the written
    variable is CDF_DOUBLE after invalid positions are represented as NaN
- CoDICE hi-direct-events
  - change data_quality to use the UINT2 fill value
- CoDICE lo-sw-species / lo-nsw-species
  - restore nso_esa_step and nso_spin_sector to uint8 before write-out
  - replace non-finite values with the integer fill sentinel and recast to
    CDF_UINT1 semantics instead of leaving these support vars widened to float
- HIT L2
  - correct dynamic_threshold_state FILLVAL from the signed sentinel to the
    UINT1 sentinel expected by the emitted CDF
- IDEX
  - treat spin_phase as floating-point metadata in the attr template so its
    FILLVAL and FORMAT match the serialized CDF_DOUBLE variable
- MAG L2
  - correct vector fill metadata to use the standard floating-point fill value
    used by the emitted magnetic field vectors
- SWAPI
  - correct esa_energy FILLVAL and FORMAT to the floating-point form used by
    the written support variable
Also add CDF-level regression coverage for the products exercised in this
change:
- CoDICE tests now inspect the written CDF metadata for nso_esa_step,
nso_spin_sector, lo-direct-events spin_sector, and direct-events data_quality
- HIT tests now verify dynamic_threshold_state is written as CDF_UINT1 with
FILLVAL 255
- IDEX tests now verify spin_phase is written as CDF_DOUBLE with the expected
floating-point fill value
This change is intentionally focused on the staged fill-value/type mismatches.
It does not include the separate SWE fix or the remaining CoDICE epoch_delta_*
cleanup.
GLOWS L2 re-expands daily lightcurve arrays back to STANDARD_BIN_COUNT
before writing the final CDF. That padding step was using
`np.full(..., fillval)` without controlling the dtype, which let NumPy
choose a default signed integer type based on the fill value.
That caused two metadata/type mismatches in the emitted CDF:
- `histogram_flag_array` is logically an 8-bit flag array, but padding
with the fill value widened it away from `uint8`
- `number_of_bins` was written from a plain Python integer, so it did
not reliably preserve the intended unsigned-16 representation
SPDF then flagged the resulting files because the serialized CDF types
and FILLVALs no longer matched the intended product schema.
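The dtype-inference footgun can be seen directly. The variable names here are illustrative, not the project's:

```python
import numpy as np

flags = np.zeros(4, dtype=np.uint8)  # stand-in for histogram_flag_array

# np.full with no dtype infers one from the Python fill value,
# producing the default signed integer instead of uint8.
padded_bad = np.full(8, 255)
padded_bad[:4] = flags
print(padded_bad.dtype)  # a signed integer type, not uint8

# Passing the source dtype explicitly preserves the intended type.
padded_ok = np.full(8, 255, dtype=flags.dtype)
padded_ok[:4] = flags
print(padded_ok.dtype)  # uint8
```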
Fix this in two parts:
1. Make the GLOWS L2 schema explicit
- declare `histogram_flag_array` as `CDF_UINT1`
- declare `number_of_bins` as `CDF_UINT2`
- update `number_of_bins` to use the `65535` UINT2 fill sentinel
2. Preserve array dtypes when rebuilding padded bin arrays
- write `number_of_bins` explicitly as `np.uint16`
- when padding bin-dimensioned daily lightcurve arrays, start from the
existing array dtype instead of letting `np.full` infer one from the
fill value
- if the configured fill value fits exactly in the current integer
dtype, keep that dtype
- only widen when the fill value cannot be represented without
changing value
This keeps integer support arrays like `histogram_flag_array` in their
intended unsigned type while still allowing float arrays such as
`photon_flux` and `ecliptic_lon` to remain floating-point.
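A minimal sketch of that padding rule. The helper name and the bin-count constant are hypothetical, not the actual GLOWS implementation:

```python
import numpy as np

STANDARD_BIN_COUNT = 3600  # assumed value, for illustration only

def pad_bins(value, fill_value):
    """Pad a daily-lightcurve array to STANDARD_BIN_COUNT, keeping its dtype when possible."""
    arr = np.asarray(value)
    dtype = arr.dtype
    if np.issubdtype(dtype, np.integer):
        info = np.iinfo(dtype)
        if not (info.min <= fill_value <= info.max):
            # Widen only when the fill value cannot be represented exactly.
            dtype = np.promote_types(dtype, np.min_scalar_type(fill_value))
    padded = np.full(STANDARD_BIN_COUNT, fill_value, dtype=dtype)
    padded[: arr.size] = arr
    return padded

flags = pad_bins(np.zeros(100, dtype=np.uint8), 255)        # stays uint8
flux = pad_bins(np.zeros(100, dtype=np.float64), -1.0e31)   # stays float64
```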
Add regression coverage for both the in-memory and written-CDF paths:
- verify `create_l2_dataset()` keeps `photon_flux` and `ecliptic_lon`
as floating arrays after padding
- verify the emitted CDF writes
- `histogram_flag_array` as `CDF_UINT1` with `FILLVAL=255`
- `number_of_bins` as `CDF_UINT2` with `FILLVAL=65535`
- `photon_flux` and `ecliptic_lon` as floating-point variables with
the expected real fill value
CoDICE hi-omni and hi-sectored were still carrying
epoch_delta_minus/epoch_delta_plus through L1B as integer-valued support
variables even though the current schema intent for those fields is
real-valued temporal support data. That mismatch then propagated into the
serialized L2 products and conflicted with the float-style ISTP metadata
needed for the written CDFs.
Update the Hi L1B path to normalize epoch_delta_minus and
epoch_delta_plus to float64 before CDF write-out for:
- imap_codice_l1b_hi-omni
- imap_codice_l1b_hi-sectored
Also update the Hi L2 variable-attribute templates so the final products
describe those variables consistently with the written type:
- FILLVAL: real fill sentinel (-1.0e31)
- FORMAT: floating-point format (F19.1)
This keeps the fix split cleanly by responsibility:
- L1B now emits the support variables with the intended dtype
- L2 no longer advertises them with integer fill/format metadata
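For reference, the L1B cast amounts to something like the following. This is a sketch over a plain dict of arrays standing in for the actual xarray Dataset; the helper name matches the diff, the rest is illustrative. (Note that later in this thread the Hi-specific cast is reverted pending a product-wide epoch_delta decision.)

```python
import numpy as np

def _cast_epoch_delta_vars_to_float64(dataset):
    """Cast the epoch-delta support variables to float64 before CDF write-out."""
    for name in ("epoch_delta_minus", "epoch_delta_plus"):
        if name in dataset:
            dataset[name] = dataset[name].astype(np.float64)
    return dataset

l1b = {
    "epoch_delta_minus": np.array([1, 2], dtype=np.int64),
    "epoch_delta_plus": np.array([3, 4], dtype=np.int64),
}
l1b = _cast_epoch_delta_vars_to_float64(l1b)
print(l1b["epoch_delta_minus"].dtype)  # float64
```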
Test coverage is updated in two layers:
1. L1B metadata regression
- add a parameterized CDF-level test for hi-omni and hi-sectored
- assert epoch_delta_minus and epoch_delta_plus are written as
CDF_DOUBLE with FILLVAL -1.0e31
2. L2 regression and metadata verification
- keep the existing validation-based L2 science regression tests
- add a separate parameterized write-path test that generates fresh
Hi L1B inputs, runs L2, writes the output CDF, and verifies the
emitted epoch_delta variables are CDF_DOUBLE with the expected
float fill value
The generated-L1B L2 metadata test is intentional: the checked-in Hi L1B
validation artifacts still predate this dtype fix, so they cannot by
themselves verify the regenerated L1B -> L2 path. Science regression
coverage stays anchored to the historical validation files, while the new
metadata test proves the current pipeline output is correct.
This change is limited to CoDICE Hi epoch_delta support-variable typing
and metadata. It does not refresh the checked-in validation artifacts.
```yaml
histogram_flag_array:
  <<: *lightcurve_defaults
  CATDESC: Bad-angle flags for histogram bins
  CDF_DATA_TYPE: CDF_UINT1
```
Should this be uint8? I don't know enough about this CDF_DATA_TYPE key though; it is a new one for me. Similar comment for below.
Yes, CDF_UINT1 is the CDF-side spelling for an unsigned 8-bit integer, so this is the explicit uint8 form for histogram_flag_array. The same idea applies to number_of_bins below: that one is CDF_UINT2 because it needs a uint16 range and the ISTP-recommended fill sentinel is 65535.
Based on the code, I think uint8 is the appropriate data type as well:
- histogram_flag_array is already produced as np.uint8 in L1B
- L2 preserves that and explicitly casts the OR-reduced result back to np.uint8
So does adding CDF_DATA_TYPE cause cdflib to cast the variable to this specified type when writing the CDF? I believe that typically, the variable data type informs cdflib what to set the CDF_DATA_TYPE attribute to.
Yes, cdflib.xarray_to_cdf treats CDF_DATA_TYPE as a write-time override when building the CDF variable spec. In this case the data is also preserved as uint8, so we are not relying only on the attr. The regression test opens the written CDF and verifies histogram_flag_array is actually written as CDF_UINT1 with a uint8(255) fill.
```python
l1b_dataset = l1a_dataset.copy(deep=True)

if descriptor in ["hi-omni", "hi-sectored"]:
    l1b_dataset = _cast_epoch_delta_vars_to_float64(l1b_dataset)
```
If I remember correctly, @jtniehof and @andrkoval want epoch delta to be an int type rather than float.
The main reason I made epoch_delta_plus / epoch_delta_minus float in the hi-omni and hi-sectored path is consistency with how CoDICE is already handling these variables elsewhere.
These two support variables were already effectively real-valued in other CoDICE products:
- the CoDICE L1A attr template already defines them with real-valued metadata (`FILLVAL = -1.0E31`, a float `FORMAT`)
- the `lo-direct-events` and `hi-direct-events` products were already emitting them as `CDF_DOUBLE`
hi-omni and hi-sectored were the inconsistent pair still carrying them through L1B as integer arrays, which then conflicted with the float-style metadata needed for the written L2 CDFs.
So the change here was not meant to redefine the variable semantics from scratch. It was meant to make hi-omni / hi-sectored line up with the existing CoDICE behavior for the same epoch_delta_* support vars and eliminate the mixed integer/float treatment across products.
Maybe I went the wrong way, though. If the CoDICE decision is that epoch_delta_* should remain integer everywhere, I think the cleaner fix would be a broader product-wide change that updates the direct-event paths and upstream attrs consistently, rather than leaving hi-omni / hi-sectored as the only products behaving differently.
Ah, interesting... I see this comment. No need to respond to my individual comments regarding this.
Yeah, I can make them integer valued on all the CoDICE products if that is better?
I will revert these changes and address epoch_delta inconsistency in another PR.
```python
for species in species_list:
    dataset[species].data[half_spin_boundary] = np.nan

for var in ["nso_esa_step", "nso_spin_sector"]:
```
@lacoak21 , can you check this one since CoDICE was picking up using specific fillval for their data.
For context, with these two variables, the intent is to preserve them as integer support variables in the final L2 product. They are indices / enumerated support values, not continuous measured quantities.
What was happening before is that older packet versions do not include these fields, so missing values were represented with NaN upstream. Once NaN is introduced, NumPy promotes the array to float, and that widened dtype propagated all the way to the written CDF. That is what led to the non-standard FILLVAL / FORMAT issues here.
So the change in this PR is not redefining the variable type. It is restoring the intended type before write-out:

- replace non-finite values with the integer fill sentinel
- cast back to `uint8`
- write them out as integer support vars again
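The float promotion and its reversal look roughly like this. The names are made up for illustration:

```python
import numpy as np

steps = np.arange(6, dtype=np.uint8)  # integer support variable

# Marking missing entries with NaN silently promotes the array to float64.
widened = np.where(steps == 3, np.nan, steps)
print(widened.dtype)  # float64

# Restore intent: non-finite -> integer fill sentinel, then cast back.
restored = np.where(np.isfinite(widened), widened, 255).astype(np.uint8)
print(restored.dtype, restored[3])  # uint8 255
```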
```python
fillval = var_attrs["FILLVAL"]
padded = np.full(GlowsConstants.STANDARD_BIN_COUNT, fillval)
padded[:n_bins] = value
value_array = np.asarray(value)
```
@maxinelasp or @mstrumik , can you guys check if this update looks ok?
For context, the GLOWS change is limited to preserving the existing dtype during the daily-lightcurve padding step. histogram_flag_array now stays uint8 instead of widening based on the fill value, while float arrays like photon_flux and ecliptic_lon still remain floating-point. I also added regression checks on the written CDF output for both cases.
Thanks @davidt0x. This PR is looking good in general. I pinged a couple of people on the code changes.
Pull request overview
This PR aligns several emitted CDF variable dtypes, FILLVALs, and related metadata with ISTP expectations across L2 products, with added regression tests that inspect written CDF files.
Changes:
- Updates CDF metadata templates for SWE, SWAPI, IDEX, HIT, MAG, GLOWS, and CoDICE products.
- Restores/normalizes selected runtime dtypes before CDF write-out for GLOWS and CoDICE.
- Adds CDF-level metadata regression coverage and ignores the generated `_version.py`.
Reviewed changes
Copilot reviewed 23 out of 24 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| `.gitignore` | Ignores generated dynamic-versioning file. |
| `imap_processing/cdf/config/imap_codice_l2-hi-direct-events_variable_attrs.yaml` | Updates direct-event data_quality fill metadata. |
| `imap_processing/cdf/config/imap_codice_l2-hi-omni_variable_attrs.yaml` | Updates HI omni epoch delta fill/format metadata. |
| `imap_processing/cdf/config/imap_codice_l2-hi-sectored_variable_attrs.yaml` | Updates HI sectored epoch delta fill/format metadata. |
| `imap_processing/cdf/config/imap_codice_l2-lo-direct-events_variable_attrs.yaml` | Updates LO direct-event data_quality and spin_sector metadata. |
| `imap_processing/cdf/config/imap_glows_l2_variable_attrs.yaml` | Adds explicit unsigned CDF types/fills for GLOWS support variables. |
| `imap_processing/cdf/config/imap_hit_l2_variable_attrs.yaml` | Updates HIT dynamic threshold fill metadata. |
| `imap_processing/cdf/config/imap_idex_l1b_variable_attrs.yaml` | Updates IDEX spin phase metadata for floating output. |
| `imap_processing/cdf/config/imap_mag_l2_variable_attrs.yaml` | Updates MAG vector fill metadata. |
| `imap_processing/cdf/config/imap_swapi_variable_attrs.yaml` | Updates SWAPI ESA energy fill/format metadata. |
| `imap_processing/cdf/config/imap_swe_l2_variable_attrs.yaml` | Updates SWE acquisition duration fill metadata. |
| `imap_processing/codice/codice_l1b.py` | Casts CoDICE HI epoch deltas to float64. |
| `imap_processing/codice/codice_l2.py` | Restores CoDICE LO NSO support variables to uint8. |
| `imap_processing/glows/l2/glows_l2.py` | Preserves integer/floating dtypes during GLOWS padding. |
| `imap_processing/tests/codice/test_codice_hi_l2.py` | Adds regenerated HI L1B/L2 metadata regression tests. |
| `imap_processing/tests/codice/test_codice_l1b.py` | Adds CoDICE L1B epoch delta metadata checks. |
| `imap_processing/tests/codice/test_codice_l2.py` | Adds CoDICE L2 CDF dtype/fill checks. |
| `imap_processing/tests/glows/test_glows_l2.py` | Adds GLOWS dtype/fill regression coverage. |
| `imap_processing/tests/hit/test_hit_l2.py` | Adds HIT written-CDF metadata assertions. |
| `imap_processing/tests/idex/conftest.py` | Makes mocked spin phase floating-point. |
| `imap_processing/tests/idex/test_idex_l2a.py` | Adds IDEX spin phase CDF metadata assertions. |
| `imap_processing/tests/mag/test_mag_l2.py` | Adds MAG vector CDF metadata assertions. |
| `imap_processing/tests/swapi/test_swapi_l2.py` | Adds SWAPI ESA energy CDF metadata assertions. |
| `imap_processing/tests/swe/test_swe_l2.py` | Adds SWE acquisition duration CDF metadata assertions. |
Clean up the files touched by the ISTP fill-value work so they pass the current pre-commit hooks without changing the intended product behavior.

- simplify the GLOWS L2 refactor to keep only the shared daily-lightcurve bin padding helper extracted
- add complete numpydoc sections for the new CoDICE L1B helper
- shorten test setup in the CoDICE L1B metadata regression to satisfy line-length checks

This is a style and maintainability cleanup only. It keeps the underlying fill-value and dtype fixes intact while making the updated files pass ruff, numpydoc-validation, and mypy.
…local builds" This reverts commit c2d1cc7.
```diff
  DICT_KEY: SPASE>Support>SupportQuantity:DataQuality
  FIELDNAM: Data Quality
- FILLVAL: *uint8_fillval
+ FILLVAL: *uint16_fillval
```
VALIDMIN/MAX for this variable are for a uint8 data type, so are you sure that it has been changed to a uint16? If so, update VALIDMIN/MAX
I think so. The written data_quality variable is currently CDF_UINT2, so the FILLVAL has to use the uint16 sentinel 65535. I left VALIDMIN/MAX as 0/255 intentionally because those describe the valid flag values, not the full storage range. I thought the fill value should remain outside that valid range.
I looked into the direct-events data_quality metadata mismatch as a possible follow-up, but I'm not including it in this PR as it is already getting a bit complex.

What is happening now:

- in the direct-events L1A builder, `data_quality` is initialized together with `num_events` as a `(epoch, priority)` `uint16` array filled with `65535`
- later, the actual `data_quality` values assigned into that array are just the unpacked `suspect` bit, so the real data values are only `0` or `1`
- because the backing array is `uint16`, the written CDF type becomes `CDF_UINT2`, and the matching L2 metadata/tests on this branch were updated to use `FILLVAL = 65535`

Why this looks odd:

- in the rest of CoDICE, `data_quality` is treated as a simple `0`/`1` flag with `uint8`-style metadata (`FILLVAL = 255`, `VALIDMAX = 1`)
- so direct-events are currently an outlier, most likely because `data_quality` was grouped into the same shared 2D initialization path as `num_events`, which actually does need a larger integer range

What the cleaner fix would be:

- keep `num_events` as `uint16`
- split `data_quality` out so it is initialized as `uint8` with fill `255`
- update the direct-events L2 metadata/tests so `data_quality` writes as `CDF_UINT1` again

I verified that the source suspect values are already uint8, so this would be a small structural cleanup rather than a science-algorithm change. I'm leaving it for a separate PR to keep the current branch focused. Also, maybe it's better left as uint16 for something downstream? I think I will leave these YAML config changes as-is, though, since we can be confident the VALIDMAX won't exceed 255.
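The "cleaner fix" above would amount to splitting the shared initialization, roughly as follows. This is a hypothetical sketch, not the current L1A builder code, and the shapes are placeholders:

```python
import numpy as np

n_epochs, n_priorities = 3, 4  # placeholder dimensions

# num_events genuinely needs a uint16 range, so keep its uint16 fill.
num_events = np.full((n_epochs, n_priorities), 65535, dtype=np.uint16)

# data_quality is a 0/1 flag, so initialize it separately as uint8
# with the matching uint8 fill sentinel.
data_quality = np.full((n_epochs, n_priorities), 255, dtype=np.uint8)
```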
```yaml
FILLVAL: *real_fillval
FORMAT: F8.1
```
I would expect a variable that is an index to be an integer type.
Agreed. This one is an index semantically, and representing missing/invalid sectors with NaN is what widened it to float. I’ll change this path to use the integer fill sentinel for invalid spin sectors and keep the written variable as an integer type instead of making the metadata follow the float promotion.
```yaml
histogram_flag_array:
  <<: *lightcurve_defaults
  CATDESC: Bad-angle flags for histogram bins
  CDF_DATA_TYPE: CDF_UINT1
```
It looks like these should also be updated.
Preserve spin_sector as an unsigned integer index in the LO direct-event L2 output by replacing invalid sectors with the uint8 fill sentinel instead of NaN. This prevents NumPy from widening the index variable to float and keeps the written CDF type/FILLVAL aligned with the variable semantics. Update the regression assertion to verify spin_sector is written as CDF_UINT1 with FILLVAL 255.
Update esa_energy VALIDMIN/VALIDMAX to floating-point values so the metadata is consistent with the CDF_DOUBLE variable type, real-valued FILLVAL, and float FORMAT. This preserves the existing 65535 upper bound for now. That value is probably a legacy/raw-storage bound rather than the most scientifically meaningful maximum for esa_energy, so a future update should use an instrument-approved value from the SWAPI scientists.
Remove the CoDICE Hi-only `epoch_delta_minus` / `epoch_delta_plus` float-casting changes from the ISTP fill-value branch.

This branch had picked up an additional CoDICE-specific workaround that:

- cast `epoch_delta_*` to `float64` in the Hi L1B path
- changed the Hi omni and Hi sectored L2 attr templates to float-style `FILLVAL` / `FORMAT`
- added float-specific regression tests around that behavior

That work is no longer intended to ship as part of the fill-value PR. The final CoDICE `epoch_delta_*` datatype decision will be handled in a separate follow-up PR.

Revert the branch-local workaround by:

- removing `_cast_epoch_delta_vars_to_float64()` from `codice_l1b.py`
- removing the `hi-omni` / `hi-sectored` call site in `process_codice_l1b()`
- restoring the Hi omni and Hi sectored `epoch_delta_*` L2 metadata to the current `upstream/dev` state
- deleting the float-specific CoDICE epoch-delta regression tests from `test_codice_l1b.py` and `test_codice_hi_l2.py`

No new epoch-delta behavior is introduced here. This change only removes the extra branch complexity so `fix/istp_fillval` goes back to its intended scope.

Validated locally with:

- `pytest -q imap_processing/tests/codice/test_codice_l1b.py`
- `pytest -q imap_processing/tests/codice/test_codice_hi_l2.py`
- `pytest -q imap_processing/tests/codice/test_codice_l2.py -k 'lo_de or hi_de or sw_species'`
- `pre-commit` on the touched CoDICE files
Margaret says that 21000 is a better VALIDMAX for esa_energy on SWAPI.
```diff
  UNITS: eV / q
- VALIDMAX: 65535
  VALIDMIN: 0
+ VALIDMAX: 21000.0
```
@tmplummer, I think you had some concerns the VALIDMAX on this one being 65535 (int16 max). I checked in with @mmshaw and she said 21000 would be a better value.
Summary
This PR fixes the remaining known L2 ISTP `FILLVAL` findings by aligning emitted CDF variable types, `FILLVAL`s, and `FORMAT` metadata across several instruments and products. This PR is in service of the general metadata ISTP cleanup issue #2577. In most cases the underlying science values were already correct, but the written CDF metadata did not match the serialized variable type. In a few cases, the better fix was to restore the intended dtype before write-out instead of only changing attrs.
Why
Running the ISTP/SPDF CDF command line checker against all L2 files produced a set of non-standard `FILLVAL` findings caused by a few recurring patterns: metadata that still described the pre-coercion or intended type rather than the serialized type, and integer support variables that were widened to float when `NaN` is introduced for missing values. This PR fixes those mismatches so the emitted L2 products are ISTP-compliant at the file level.
What Changed
SWE

- Fix `acq_duration` to use the correct `CDF_UINT4` fill sentinel: `4294967295`.
- Extend the L2 end-to-end test to verify the written `FILLVAL`.

SWAPI

- Align `esa_energy` `FILLVAL` and `FORMAT` with the emitted floating-point support variable. Set `VALIDMAX` to `21000`.

IDEX

- Align `spin_phase` attrs with the serialized `CDF_DOUBLE` output in L2A.

HIT

- Update `dynamic_threshold_state` to use `CDF_UINT1` semantics with `FILLVAL = 255`.

MAG

- Correct vector fill metadata to the standard floating-point fill value for `b_dsrf`, `b_gse`, `b_gsm`, `b_rtn`, `b_srf`, `quality_flags`, and `quality_bitmask`.

CoDICE

- Restore `nso_esa_step` and `nso_spin_sector` to `uint8` before write-out.
- Correct direct-events `data_quality` fill metadata.
- Correct `spin_sector` fill/format to match the emitted type.

GLOWS

- Write `number_of_bins` explicitly as `uint16` with `FILLVAL = 65535`.
- Keep `histogram_flag_array` as `uint8` during L2 padding/re-expansion instead of silently widening based on the fill value.
- Verify `photon_flux` and `ecliptic_lon` remain floating-point through the same shared padding path.

Testing

- CDF-level regression tests inspect the written files for each affected product.

Notes

- Resolves the remaining known L2 `FILLVAL` findings.