Check for silent lexicographic comparison against string-typed `value` by thodson-usgs · Pull Request #240 · DOI-USGS/dataretrieval-python

thodson-usgs · 2026-04-24T13:56:16Z

Summary (draft — for discussion)

Every queryable field on every OGC collection in the Water Data API is type=string server-side. Any unquoted numeric comparison in a CQL-text filter (value >= 1000, parameter_code = 60, district_code = 1) is therefore not a valid numeric comparison: empirically, the server returns HTTP 500 Internal Server Error. Even if the user quotes the literal, the comparison is lexicographic (e.g. value > '10' returns rows where value='34.52' because the first character '3' sorts after '1'), and an unpadded code like parameter_code = '60' silently matches nothing because the real values are zero-padded ('00060'-shaped).

Either way, the user's intent was numeric, they get either an HTTP 500 or silently wrong rows, and the failure is opaque. This PR catches the unquoted pattern client-side and raises a clear ValueError before the request fires.
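The quoted-literal pitfalls can be reproduced locally with ordinary string comparison. This is a minimal illustration that does not call the API; it assumes the server orders strings character by character, the way Python does for ASCII text.

```python
# Strings compare character by character, not by numeric magnitude.
# Assumption: the server's string ordering matches Python's for ASCII digits.

# '3' sorts after '1', so '34.52' is "greater than" '10'.
lex_greater = "34.52" > "10"

# In the second position '2' sorts after '0', so '12' lands above '1000'.
twelve_above_thousand = "12" > "1000"

# Equality is exact: the unpadded code never equals the padded one.
padded_match = "60" == "00060"

# Converting to numbers first restores the intended ordering.
numeric_twelve_above_thousand = float("12") > float("1000")

print(lex_greater, twelve_above_thousand, padded_match, numeric_twelve_above_thousand)
# True True False False
```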

Motivation

Flagged during review of DOI-USGS/dataRetrieval#880 (R side), where ldecicco-USGS pushed back on exposing a generic filter kwarg:

If I saw a generic filter argument, my first thought would be SWEET, I want to answer all these interesting questions about the data. So I'll set filter=value >1000. The problem with that is it works, but since value is a character, it's filtering all the values that are alphabetically above "1000" (like "12"). … My gut says there would be way more people trying stuff like that and then either get the wrong results unknowingly, or complain that dataRetrieval is broken…

This is the smallest change that addresses her concern without removing the filter feature.

What the check does

Runs once per cql-text filter, inside _plan_filter_chunks, before any HTTP traffic:

>>> waterdata.get_continuous(
...     monitoring_location_id="USGS-02238500",
...     parameter_code="00060",
...     filter="value >= 1000",
...     filter_lang="cql-text",
... )
ValueError: Filter compares 'value' to unquoted numeric 1000. Every queryable
on the Water Data API is typed as a string, so ``value >= 1000`` is not a
valid numeric comparison — empirically the server rejects unquoted numeric
literals with HTTP 500. Even if you quote the literal (``value >= '1000'``)
the comparison is lexicographic, which silently misses zero-padded codes
(e.g. ``parameter_code = '60'`` matches nothing because the real codes are
``'00060'``-shaped) and sorts ``value='12'`` above ``value='1000'``. For a
numeric filter, fetch a wider result and reduce in pandas after the call.
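The workaround named in the error message might look like the sketch below. It assumes the result is a pandas DataFrame with a string-typed `value` column, as described above. The sample rows are made up for illustration and stand in for whatever frame the query returns.

```python
import pandas as pd

# Hypothetical rows standing in for an API result; `value` arrives as strings.
df = pd.DataFrame(
    {
        "time": ["2026-04-01", "2026-04-02", "2026-04-03", "2026-04-04"],
        "value": ["34.52", "63160", "980", "1000"],
    }
)

# Convert to numbers; anything unparseable becomes NaN and fails the comparison.
numeric_value = pd.to_numeric(df["value"], errors="coerce")

# Apply the numeric filter the caller originally intended (value >= 1000).
filtered = df[numeric_value >= 1000]

print(filtered["value"].tolist())
# ['63160', '1000']
```

Using `errors="coerce"` is a design choice for the sketch: non-numeric entries are dropped by the comparison rather than raising.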

Scope: universal, not a watchlist

Every queryable property across every OGC endpoint is type=string — confirmed empirically:

| endpoint | numeric-looking string fields |
| --- | --- |
| `continuous` | `value`, `parameter_code`, `statistic_id` |
| `daily` | `value`, `parameter_code`, `statistic_id` |
| `field-measurements` | `value`, `parameter_code` |
| `latest-continuous` | `value`, `parameter_code`, `statistic_id` |
| `latest-daily` | `value`, `parameter_code`, `statistic_id` |
| `time-series-metadata` | `parameter_code`, `statistic_id`, `hydrologic_unit_code` |
| `monitoring-locations` | `monitoring_location_number`, `district_code`, `state_code`, `county_code` |
| `channel-measurements` | `measurement_number`, `channel_flow`, `channel_width`, `channel_area`, `channel_velocity` |

Since there's no such thing as a legitimate numeric comparison on this API, the regex flags any <identifier> <op> <unquoted numeric literal> (or the reverse), regardless of field. Quoted literals (value >= '1000') are not flagged — the caller has signalled they want sort-order semantics.
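A check of this shape could be sketched as follows. The patterns and the `find_unquoted_numeric` helper are hypothetical and written for illustration only; they are not the PR's actual regex. The sketch blanks out quoted literals first, so that text inside a string can never trigger a match, and then looks for `ident OP number` in either order.

```python
import re

# Hypothetical patterns for illustration; not the PR's actual regex.
_QUOTED = re.compile(r"'(?:[^']|'')*'")  # CQL string literal; '' escapes a quote
_IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
_OP = r"(?:>=|<=|<>|!=|=|>|<)"
_NUM = r"-?\d+(?:\.\d+)?"

_FORWARD = re.compile(rf"\b{_IDENT}\s*{_OP}\s*{_NUM}\b")           # x OP N
_REVERSE = re.compile(rf"(?<![\w.']){_NUM}\s*{_OP}\s*{_IDENT}\b")  # N OP x


def find_unquoted_numeric(filter_text):
    """Return the first `ident OP number` (or reversed) fragment, else None."""
    # Blank out quoted literals so their contents cannot match.
    stripped = _QUOTED.sub("''", filter_text)
    match = _FORWARD.search(stripped) or _REVERSE.search(stripped)
    return match.group(0) if match else None


print(find_unquoted_numeric("value >= 1000"))    # value >= 1000
print(find_unquoted_numeric("value >= '1000'"))  # None
print(find_unquoted_numeric("name = 'see district_code = 1 in docs'"))  # None
```

A caller would raise ValueError when the helper returns a fragment. Because the left-hand side must be a bare identifier, an expression such as CAST(value AS FLOAT) > 1000 is not matched by this sketch.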

Live evidence

filter="parameter_code = '00060'"   → 200, 5 rows (correct)
filter="parameter_code = 60"        → 500 Internal Server Error  ← we catch this
filter="value > '10'"               → 200, 31 rows of '34.52', '63160', …  (lex)
filter="value > 10"                 → 500 Internal Server Error  ← we catch this

Test plan

- ruff check / ruff format --check pass.
- pytest tests/waterdata_utils_test.py: 66/66 pass (32 prior + 34 new). The new tests cover:
  - 21 raise cases: every op (>=, >, <=, <, =, !=) × both orderings (x OP N and N OP x), floats, negatives, multiple real-world fields (value, parameter_code, statistic_id, district_code, county_code, hydrologic_unit_code, channel_flow, channel_velocity), and nested in AND/OR expressions.
  - 14 allow cases: quoted literals for every watchlist-replaced field, pure string comparisons, IN lists, and false-positive guards (identifiers appearing only inside quoted string literals like name = 'see district_code = 1 in docs').
  - End-to-end: the error surfaces through get_continuous before _construct_api_requests is ever called (mock-verified).
- Full non-live suite: 136/136 pass.
- Live-probed actual server behavior against USGS-02238500 / continuous: unquoted numeric RHS consistently returns 500; quoted literal returns 200 with lex-sorted results.
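The mock-verified end-to-end case follows a common pattern, sketched below with stand-in functions. `validate_filter`, `fetch`, and `build_request` are hypothetical names for illustration and are not the library's internals. The point is only that validation runs before the request builder, which a `Mock` can confirm.

```python
from unittest.mock import Mock


def validate_filter(filter_text):
    # Stand-in validator (hypothetical): rejects one known-bad filter.
    if filter_text == "value >= 1000":
        raise ValueError("Filter compares 'value' to unquoted numeric 1000.")


def fetch(filter_text, build_request):
    # Validation runs first, so a bad filter never reaches the request builder.
    validate_filter(filter_text)
    return build_request(filter_text)


build_request = Mock(return_value="request")

try:
    fetch("value >= 1000", build_request)
    raised = False
except ValueError:
    raised = True

# The mock records every call; none should have happened yet.
build_request.assert_not_called()

# A quoted literal passes validation and reaches the builder exactly once.
ok = fetch("value >= '1000'", build_request)
print(raised, ok, build_request.call_count)
# True request 1
```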

Open for discussion

- Warn vs. raise? I went with raise because, empirically, the alternative is an unexplained 500; a warning would be easy to miss, and "broken dataRetrieval" bug reports would follow.
- Surface location? The check currently lives in _plan_filter_chunks alongside the other filter validation (chunkability, URL-budget). It could be hoisted to the get_ogc_data entry point.
- False positives on CAST / function-call syntax (CAST(value AS FLOAT) > 1000): the regex is scoped to simple \b<ident>\b \s* <op> \s* <num> patterns and, empirically, the server doesn't support CAST or CQL2 functions on these endpoints anyway, so the false-positive rate should be near zero in practice.

Marked draft so it can ride along with the R-side discussion before landing.

🤖 Generated with Claude Code

Move all CQL filter and chunking logic out of api.py / utils.py into a dedicated dataretrieval/waterdata/filters.py module (with chunked as a decorator on the per-request fetch), and extract get_nearest_continuous into a sibling nearest.py, so that the entire filter feature can be removed by deleting two source files, two test files, and two re-export lines.

Adds a pre-flight check that raises on unquoted-numeric comparisons (value > 1000, parameter_code IN (60, 61), value BETWEEN 5 AND 10), since every Water Data API queryable is string-typed and the server either returns HTTP 500 or silently produces lexicographically-sorted wrong rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
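The commit message extends the check to IN lists and BETWEEN ranges. One way such detection could work is sketched below. The patterns and the `has_numeric_in_or_between` helper are hypothetical; the real implementation in filters.py may differ.

```python
import re

# Hypothetical patterns for illustration; the real implementation may differ.
_QUOTED = re.compile(r"'(?:[^']|'')*'")  # CQL string literal; '' escapes a quote
_NUM = r"-?\d+(?:\.\d+)?"

# IN list containing at least one bare number, e.g. parameter_code IN (60, 61)
_IN_NUMERIC = re.compile(
    rf"\bIN\s*\([^)]*(?<![\w.']){_NUM}\b[^)]*\)", re.IGNORECASE
)
# BETWEEN with a bare number as either bound, e.g. value BETWEEN 5 AND 10
_BETWEEN_NUMERIC = re.compile(
    rf"\bBETWEEN\s+(?:{_NUM}\b\s+AND\b|\S+\s+AND\s+{_NUM}\b)", re.IGNORECASE
)


def has_numeric_in_or_between(filter_text):
    """True if an IN list or BETWEEN range contains an unquoted number."""
    # Blank out quoted literals so digits inside strings are ignored.
    stripped = _QUOTED.sub("''", filter_text)
    return bool(_IN_NUMERIC.search(stripped) or _BETWEEN_NUMERIC.search(stripped))


print(has_numeric_in_or_between("parameter_code IN (60, 61)"))            # True
print(has_numeric_in_or_between("parameter_code IN ('00060', '00065')"))  # False
print(has_numeric_in_or_between("value BETWEEN 5 AND 10"))                # True
```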

thodson-usgs force-pushed the add-filter-pitfall-check branch from 559b466 to 68d1a81 (April 24, 2026 19:51)

thodson-usgs force-pushed the add-filter-pitfall-check branch from fbd0f51 to d81dd33 (April 27, 2026 15:06)

thodson-usgs wants to merge 1 commit into DOI-USGS:main from thodson-usgs:add-filter-pitfall-check