Bootstrap idempotency check (find_by_uid) reads only first page — same single-page pattern repeated in find_datastream and _discover_system_ds; limit=1000 is a fragile workaround #4

@Sam-Bolling

Summary

Three sites in this repo perform a "look up a resource by some identifier" operation by issuing a single GET against a CSAPI list endpoint and iterating the returned page client-side. None of them follow the next HATEOAS link that the OGC API — Connected Systems pagination contract requires. This is a latent correctness bug in the bootstrap idempotency layer that becomes a real correctness bug whenever the server (a) does not honor the query filter (?uid=, ?outputName=, etc.) and (b) holds more items than fit on the first page.

The current mitigation — &limit=1000 added in commit 92f584b5 — papers over the symptom for fleets with ≤1000 items per collection. It is documented in the commit message itself as a Go-server-pagination workaround. It does not fix the underlying issue and silently breaks at scale.

This issue captures the bug, the failure modes, the affected sites, and a recommended direction. It does not prescribe an implementation — that's the maintainer's call.

Background — why this matters in a publisher context

OSHConnect-Python is, primarily, a publisher fleet: long-running services that POST observations into a CSAPI server, fronted by an idempotent bootstrap phase that ensures procedures, systems, datastreams, and deployments exist before publishers start. The bootstrap is meant to be safely re-runnable on every deploy / docker compose up. That safety hinges on find_by_uid (and its siblings) correctly answering "does this resource already exist?".

When find_by_uid returns a false negative ("no, the resource doesn't exist") for a resource that in fact exists, the ensure_* family attempts to recreate it. On a strict server this returns HTTP 409 and api_post raises RuntimeError, aborting bootstrap. On a tolerant server it silently creates a duplicate UID and corrupts the deployment. Either outcome breaks the idempotency contract that the publisher fleet's deploy automation depends on.

This is therefore a deploy-time correctness bug in publisher infrastructure, not a read-side display bug. That distinction matters: the consequence of getting it wrong is duplicated systems / orphaned datastreams / non-deterministic re-deploys, not a missing row in some UI.

Affected sites — three places, one shape

| # | File | Function | Endpoint pattern | Single page? | Filter relied on |
| --- | --- | --- | --- | --- | --- |
| 1 | publishers/bootstrap_helpers.py | find_by_uid(base_url, auth, collection, uid) | {collection}?uid={uid}&limit=1000 | Yes: single GET, client-side filter loop | ?uid= |
| 2 | publishers/bootstrap_helpers.py | find_datastream(system_id, output_name) | systems/{id}/datastreams | Yes: single GET, iterates result["items"] | ?outputName= (not used) |
| 3 | src/oshconnect/base.py | _discover_system_ds(...) | retrieve_resource(APIResourceTypes.SYSTEM, ...) items | Yes: walks raw_res.json().get("items", []) once | none |

All three sites:

  • Issue exactly one HTTP GET.
  • Iterate the returned page client-side to find the matching item.
  • Return None / raise not-found if the item isn't in that page.
  • Do not read or follow links[?(@.rel=='next')].

Repo-wide grep for next / rel="next" / rel='next' / paginate / pagination-link traversal: zero matches in any code path. The codebase has no concept of pagination today.

find_by_uid — verbatim current implementation

publishers/bootstrap_helpers.py:

_uid_cache: dict[str, str] = {}


def find_by_uid(base_url: str, auth: str, collection: str, uid: str) -> str | None:
    """Find a resource by UID in a collection. Returns server ID or None."""
    cache_key = f"{collection}:{uid}"
    if cache_key in _uid_cache:
        return _uid_cache[cache_key]

    result = api_get(base_url, f"{collection}?uid={uid}&limit=1000", auth)
    if result:
        # Support both GeoJSON (features) and flat JSON (items) collections
        items = result.get("items", []) or result.get("features", [])
        for item in items:
            props = item.get("properties", item)
            if props.get("uid") == uid:
                item_id = item.get("id") or props.get("id")
                if item_id:
                    _uid_cache[cache_key] = str(item_id)
                    return str(item_id)
    return None

The ?uid={uid} filter is intended to make the server return at most one match (in which case pagination is irrelevant). The &limit=1000 is the safety net for when the server ignores ?uid=. Both assumptions can fail simultaneously.

Failure-mode matrix

| Server honors ?uid= filter? | Collection size | Result |
| --- | --- | --- |
| Yes | any | ✅ Works correctly. Filter narrows to 0/1 items; pagination is moot. |
| No | ≤ 1000 items | ✅ Works because of the workaround. Current state of fleets running against the Go CSAPI server. |
| No | > 1000 items | ❌ Silent false-negative. find_by_uid returns None for resources that exist on the server. ensure_* then tries to recreate the resource, leading to either HTTP 409 → RuntimeError (strict server) or silent duplicate-UID creation (tolerant server). Bootstrap idempotency contract broken. |
| Yes, but server ignores it under load / for nested collections | any | ❌ Same as above. |

The third row is dormant in current production deployments because no fleet has crossed 1000 items per collection. It is not absent — the publisher fleet pattern is designed to scale (Fort Huachuca v2.3 scenarios, multi-tenant deployments, bigger sensor manifests). The bug fires the moment a collection grows past the magic number.

The same matrix applies, mutatis mutandis, to find_datastream (collection: per-system datastreams; failure when a system has many outputs) and _discover_system_ds (collection: top-level systems; failure on busy multi-tenant servers).
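
The third row can be reproduced without a server. The following is a self-contained simulation, not code from the repo: PAGES, fake_api_get, and single_page_lookup are hypothetical stand-ins, and the page size of 2 and the offset parameter are assumptions chosen only to keep the example small.

```python
# Hypothetical two-page collection on a server that ignores ?uid=.
PAGES = {
    "systems?uid=urn:x:sys:3&limit=2": {
        "items": [
            {"id": "a1", "properties": {"uid": "urn:x:sys:1"}},
            {"id": "a2", "properties": {"uid": "urn:x:sys:2"}},
        ],
        "links": [{"rel": "next", "href": "systems?offset=2&limit=2"}],
    },
    "systems?offset=2&limit=2": {
        "items": [{"id": "a3", "properties": {"uid": "urn:x:sys:3"}}],
        "links": [],
    },
}


def fake_api_get(path):
    """Stand-in for api_get: returns the canned page for a path, or None."""
    return PAGES.get(path)


def single_page_lookup(collection, uid, limit):
    """Same shape as the current find_by_uid: one GET, client-side filter."""
    result = fake_api_get(f"{collection}?uid={uid}&limit={limit}")
    for item in (result or {}).get("items", []):
        props = item.get("properties", item)
        if props.get("uid") == uid:
            return item["id"]
    return None  # the `next` link on the first page is never read


found = single_page_lookup("systems", "urn:x:sys:3", limit=2)
print(found)  # None, although the system exists on the second page
```

The lookup reports "not found" for a resource the fake server holds, which is the condition under which ensure_* would attempt a re-create.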

How the workaround was introduced

Commit 92f584b5, "fix: add limit=1000 to find_by_uid for Go server pagination", 2026-04-17. Diff: +1 / -1, single line. The commit message is candid that the change is a workaround for a server-pagination behavior, not a correctness fix. This issue exists to record that fact and propose closing the gap properly.

Defense-in-depth — independent of connected-systems-go#5

A related server-side issue (connected-systems-go#5, "Go server ignores ?uid=") covers the immediate trigger of the find_by_uid failure on the new Go CSAPI server. If/when that lands, find_by_uid becomes correct again for collections of any size on that one server, because the filter narrows to 0/1 items.

The right fix on the Python side is still to walk next links, for two reasons:

  1. Filter quirks are a per-server reality. Some other CSAPI server tomorrow will have its own filter coverage gap, throttling, partial filter-honoring under load, or simply different parsing of ?uid=. Without server-side filtering, pagination is the spec-defined path.
  2. The OGC pagination contract is the same regardless of filtering. OGC 23-001 §7.6 defines limit as optional with a server-defined default and next HATEOAS links as the conformance-required mechanism for retrieving subsequent pages. A correct OGC client walks links; it does not assume a single page.

So this fix is not contingent on the Go server fix. They're complementary; both should land, and either one alone is insufficient for full correctness.

Recommended direction

The shape of the fix is the maintainer's call. What follows is one direction that fits the existing module structure with minimal surface change.

Add a small page-iteration helper to publishers/bootstrap_helpers.py. If the library should not depend on the publisher module, add a sibling to src/oshconnect/base.py as well: the two currently share no HTTP layer (bootstrap_helpers.py uses stdlib urllib; src/oshconnect/api_helpers.py uses requests).

Sketch — urllib-side, for bootstrap_helpers.py:

def _iter_pages(base_url: str, path: str, auth: str, *, max_pages: int = 100):
    """
    Yield items from a CSAPI list endpoint, walking `next` HATEOAS links.

    Yields items one at a time across all pages. Caller is responsible for
    early termination once the desired item is found.

    Args:
        base_url: Server base URL.
        path:     Collection path (e.g. 'systems?uid=foo').
        auth:     Basic-auth header value.
        max_pages: Safety cap against pathological circular link chains.

    Raises:
        RuntimeError: If max_pages is exceeded or a `next` link repeats a URL
            that was already fetched.
    """
    url = path  # api_get composes with base_url
    pages_seen = 0
    seen_urls: set[str] = set()
    while url:
        if pages_seen >= max_pages:
            raise RuntimeError(
                f"_iter_pages exceeded {max_pages} pages for {path}; "
                "possible circular `next` chain"
            )
        if url in seen_urls:
            raise RuntimeError(f"_iter_pages saw a circular `next` link at {url}")
        seen_urls.add(url)
        result = api_get(base_url, url, auth)
        if not result:
            return
        items = result.get("items", []) or result.get("features", [])
        for item in items:
            yield item
        pages_seen += 1
        # Find the `next` link.
        next_link = next(
            (link for link in (result.get("links") or []) if link.get("rel") == "next"),
            None,
        )
        if not next_link or not next_link.get("href"):
            return
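        # `next` href may be absolute or path-relative; normalize to a path the
        # existing api_get can consume. `_normalize_next_url` does not exist in
        # the repo yet; it stands for a small helper that does this mapping.
        url = _normalize_next_url(base_url, next_link["href"])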

Then find_by_uid collapses to:

def find_by_uid(base_url: str, auth: str, collection: str, uid: str) -> str | None:
    cache_key = f"{collection}:{uid}"
    if cache_key in _uid_cache:
        return _uid_cache[cache_key]

    # Keep `?uid={uid}` so a filter-aware server can short-circuit;
    # walk pages so a filter-ignoring server still works.
    for item in _iter_pages(base_url, f"{collection}?uid={uid}", auth):
        props = item.get("properties", item)
        if props.get("uid") == uid:
            item_id = item.get("id") or props.get("id")
            if item_id:
                _uid_cache[cache_key] = str(item_id)
                return str(item_id)
    return None

And find_datastream similarly switches to _iter_pages. The library-side _discover_system_ds either uses a requests-based sibling helper or is refactored to share a thin wrapper.
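
For the "thin wrapper" option, one way to avoid two copies of the loop is to keep the page walk free of any HTTP code and pass the transport in as a callable. This is a sketch only: iter_items and JsonFetcher are hypothetical names, and the fetcher is assumed to resolve whatever href form the server returns.

```python
from typing import Callable, Iterator, Optional

# Any callable that maps a URL or path to a decoded JSON body (or None).
JsonFetcher = Callable[[str], Optional[dict]]


def iter_items(fetch_json: JsonFetcher, first_url: str, *, max_pages: int = 100) -> Iterator[dict]:
    """Yield items across pages; the transport lives entirely in fetch_json."""
    url: Optional[str] = first_url
    seen: set = set()
    while url:
        if url in seen:
            raise RuntimeError(f"circular `next` link at {url}")
        if len(seen) >= max_pages:
            raise RuntimeError(f"exceeded {max_pages} pages from {first_url}")
        seen.add(url)
        page = fetch_json(url)
        if not page:
            return
        # Support both flat JSON (items) and GeoJSON (features) collections.
        yield from page.get("items") or page.get("features") or []
        url = next(
            (link.get("href") for link in page.get("links") or [] if link.get("rel") == "next"),
            None,
        )


# Hypothetical wiring, one line per HTTP layer:
#   urllib side:   iter_items(lambda u: api_get(base_url, u, auth), "systems")
#   requests side: iter_items(lambda u: session.get(u).json(), first_url)
```

With this split, the publisher module and the library would each supply only a one-line fetcher, and the loop-detection logic would exist once.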

Notes on the sketch:

  • Drops the magic limit=1000 entirely. The server's default page size is fine; iteration handles whatever it returns.
  • max_pages and seen_urls are defense against pathological servers (circular next chains have been observed in non-OGC paginated APIs; cheap insurance).
  • Caller iterates lazily; can break out as soon as the target item is found, so for a filter-honoring server the cost is one HTTP request.
  • Negative-result caching: _uid_cache currently caches only successful lookups. Worth a comment that this is intentional — caching None would be incorrect across redeploys where the resource is created out-of-band.

Other things worth touching while we're here (optional)

  • find_datastream: same fix shape, same module, costs almost nothing extra to do in the same PR. Recommend doing it together so all three sites are consistent.
  • _discover_system_ds: library-side sibling; uses requests not urllib. Either (a) live with two _iter_pages implementations (one per HTTP layer) or (b) take this opportunity to move the publisher fleet onto requests (modest dependency change, library already takes a requests dep). Either is defensible; (a) is the smaller diff.
  • Comment on _uid_cache semantics explaining why only positive results are cached, so future contributors don't "fix" it.
  • Bootstrap-test fixture: consider adding a fixture / fake server (or a recorded HTTP cassette via vcrpy/responses) that returns multi-page responses so the iteration logic is exercised in unit tests. Without this, the bug is invisible to CI on a small fixture corpus — exactly how it slipped past in the first place.
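
A multi-page fixture does not need a real server. The sketch below builds a fake with the same call shape as api_get(base_url, path, auth); the factory name make_paged_api_get and the offset query parameter are assumptions for illustration, not the Go server's actual paging scheme.

```python
import unittest


def make_paged_api_get(items, page_size):
    """Build a fake api_get(base_url, path, auth) that serves `items` in pages."""
    def fake_api_get(base_url, path, auth):
        collection, _, query = path.partition("?")
        # Hypothetical paging scheme: an `offset=N` query parameter.
        offset = 0
        for part in query.split("&"):
            if part.startswith("offset="):
                offset = int(part[len("offset="):])
        page = {"items": items[offset:offset + page_size], "links": []}
        if offset + page_size < len(items):
            next_href = f"{collection}?offset={offset + page_size}"
            page["links"].append({"rel": "next", "href": next_href})
        return page
    return fake_api_get


class MultiPageFixtureTest(unittest.TestCase):
    def test_target_on_last_page_is_reachable(self):
        items = [{"id": str(i), "properties": {"uid": f"urn:x:sys:{i}"}} for i in range(25)]
        api_get = make_paged_api_get(items, page_size=10)
        path, seen = "systems?uid=urn:x:sys:24", []
        while path:
            page = api_get("http://example.invalid/api", path, "unused")
            seen.extend(item["id"] for item in page["items"])
            nxt = [link["href"] for link in page["links"] if link["rel"] == "next"]
            path = nxt[0] if nxt else None
        self.assertEqual(len(seen), 25)
        self.assertIn("24", seen)
```

In the real suite the fake would presumably be patched over the module's api_get (for example with unittest.mock.patch) so that find_by_uid itself is exercised against a target on the last page.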

What's intentionally NOT in scope for this issue

  • ❌ Adopting a new HTTP client (e.g. httpx) — out of scope; orthogonal architectural choice.
  • ❌ Adding async support to the publisher fleet — out of scope.
  • ❌ Auto-retry / exponential backoff at the page-walk level — api_get already retries via _with_retry; pagination is a separate concern.
  • ❌ A general-purpose CSAPI Python client library — the existing src/oshconnect is what it is; this issue only fixes the three concrete bugs.
  • ❌ Changes to the publishers themselves (iss_publisher.py, etc.) — they consume the bootstrap output; once bootstrap is correct, they're unaffected.

Reproduction / how to confirm

  1. Stand up (or point at) a CSAPI server that does not honor ?uid= filtering. The current Go CSAPI server fits — see connected-systems-go#5.
  2. Pre-populate the systems collection with > 1000 systems (or temporarily set the server's default limit to a small value, e.g. 10, and pre-populate > 10).
  3. Run python -m publishers.iss.bootstrap_iss against it. Observe that find_by_uid returns None for systems that exist beyond the first page, and ensure_system then either fails with HTTP 409 or silently creates a duplicate.

Once the fix lands, the same scenario should bootstrap idempotently with no duplicates and no 409s.

Severity / risk

Medium. The bug is currently latent, because the workaround holds for current fleet sizes, but:

  • Three sites of the same shape, suggesting a missing concept rather than a one-off bug.
  • The workaround is documented as a workaround in the commit message itself.
  • The failure mode (silent duplicates / failed redeploys) is in deploy automation, where silent failures are especially scary.
  • Trivial to fix relative to consequence at scale.

References

| # | Source | What it provides |
| --- | --- | --- |
| 1 | publishers/bootstrap_helpers.py: find_by_uid | Site #1 |
| 2 | publishers/bootstrap_helpers.py: find_datastream | Site #2 |
| 3 | src/oshconnect/base.py: _discover_system_ds | Site #3 |
| 4 | Commit 92f584b5 | Origin of the limit=1000 workaround |
| 5 | connected-systems-go#5 | Server-side complement: Go CSAPI server ignores ?uid= |
| 6 | OGC 23-001 §7.6 | OGC API - Connected Systems pagination contract: limit is server-default; next link is the conformance-required mechanism |
| 7 | OS4CSAPI/ogc-client-CSAPI_2#167 | TypeScript client companion: list methods will document pagination contract in JSDoc |
| 8 | OS4CSAPI/ogc-client-CSAPI_2#170 | TypeScript client deferred enhancement: async-iterator helper that walks next links; analog of the _iter_pages sketch above, in TypeScript |
