Bootstrap idempotency check (`find_by_uid`) reads only first page — same single-page pattern repeated in `find_datastream` and `_discover_system_ds`; `limit=1000` is a fragile workaround

## Summary

Three sites in this repo perform a "look up a resource by some identifier" operation by issuing a **single GET against a CSAPI list endpoint** and iterating the returned page client-side. None of them follow the `next` HATEOAS link that the OGC API — Connected Systems pagination contract requires. This is a latent correctness bug in the bootstrap idempotency layer that becomes a real correctness bug whenever the server (a) does not honor the query filter (`?uid=`, `?outputName=`, etc.) and (b) holds more items than fit on the first page.

The current mitigation, `&limit=1000` added in commit [`92f584b5`](https://github.com/OS4CSAPI/OSHConnect-Python/commit/92f584b5), papers over the symptom for fleets with ≤1000 items per collection, and only as long as the server accepts a `limit` that large instead of capping it at a smaller server-side maximum. The commit message itself documents it as a Go-server-pagination workaround. It does not fix the underlying issue and silently breaks at scale.

This issue captures the bug, the failure modes, the affected sites, and a recommended direction. It does not prescribe an implementation — that's the maintainer's call.

## Background — why this matters in a publisher context

OSHConnect-Python is, primarily, a **publisher fleet**: long-running services that POST observations into a CSAPI server, fronted by an idempotent bootstrap phase that ensures procedures, systems, datastreams, and deployments exist before publishers start. The bootstrap is meant to be safely re-runnable on every deploy / `docker compose up`. That safety hinges on `find_by_uid` (and its siblings) correctly answering *"does this resource already exist?"*.

When `find_by_uid` returns a **false negative** — *"no, the resource doesn't exist"* — for a resource that in fact exists, the `ensure_*` family attempts to recreate it. On a strict server this returns HTTP 409 and `api_post` raises `RuntimeError`, aborting bootstrap. On a tolerant server it silently creates a duplicate UID and corrupts the deployment. Either outcome breaks the idempotency contract that the publisher fleet's deploy automation depends on.

This is therefore a **deploy-time correctness bug** in publisher infrastructure, not a read-side display bug. That distinction matters: the consequence of getting it wrong is duplicated systems / orphaned datastreams / non-deterministic re-deploys, not a missing row in some UI.

## Affected sites — three places, one shape

| # | File | Function | Endpoint pattern | Single page? | Filter relied on |
|---|---|---|---|---|---|
| 1 | `publishers/bootstrap_helpers.py` | `find_by_uid(base_url, auth, collection, uid)` | `{collection}?uid={uid}&limit=1000` | Yes — single GET, client-side filter loop | `?uid=` |
| 2 | `publishers/bootstrap_helpers.py` | `find_datastream(system_id, output_name)` | `systems/{id}/datastreams` | Yes — single GET, iterates `result["items"]` | `?outputName=` (not used) |
| 3 | `src/oshconnect/base.py` | `_discover_system_ds(...)` | `retrieve_resource(APIResourceTypes.SYSTEM, ...)` items | Yes — walks `raw_res.json().get("items", [])` once | none |

All three sites:
- Issue exactly **one** HTTP GET.
- Iterate the returned page client-side to find the matching item.
- Return `None` / raise *not-found* if the item isn't in that page.
- Do **not** read or follow `links[?(@.rel=='next')]`.

Repo-wide grep for `next` / `rel="next"` / `rel='next'` / `paginate` / pagination-link traversal: **zero matches** in any code path. The codebase has no concept of pagination today.

## `find_by_uid` — verbatim current implementation

`publishers/bootstrap_helpers.py`:

```python
_uid_cache: dict[str, str] = {}


def find_by_uid(base_url: str, auth: str, collection: str, uid: str) -> str | None:
    """Find a resource by UID in a collection. Returns server ID or None."""
    cache_key = f"{collection}:{uid}"
    if cache_key in _uid_cache:
        return _uid_cache[cache_key]

    result = api_get(base_url, f"{collection}?uid={uid}&limit=1000", auth)
    if result:
        # Support both GeoJSON (features) and flat JSON (items) collections
        items = result.get("items", []) or result.get("features", [])
        for item in items:
            props = item.get("properties", item)
            if props.get("uid") == uid:
                item_id = item.get("id") or props.get("id")
                if item_id:
                    _uid_cache[cache_key] = str(item_id)
                    return str(item_id)
    return None
```

The `?uid={uid}` filter is intended to make the server return at most one match (in which case pagination is irrelevant). The `&limit=1000` is the safety net for when the server ignores `?uid=`. Both assumptions can fail simultaneously.

## Failure-mode matrix

| Server honors `?uid=` filter? | Collection size | Result |
|---|---|---|
| Yes | any | ✅ Works correctly. Filter narrows to 0/1 items; pagination is moot. |
| No | ≤ 1000 items | ✅ Works because of the workaround. Current state of fleets running against the Go CSAPI server. |
| No | > 1000 items | ❌ **Silent false-negative.** `find_by_uid` returns `None` for resources that exist on the server. `ensure_*` then tries to recreate the resource, leading to either HTTP 409 → `RuntimeError` (strict server) or silent duplicate-UID creation (tolerant server). Bootstrap idempotency contract broken. |
| Yes, but server ignores it under load / for nested collections | > 1000 items | ❌ Same as above: whenever the filter is skipped, the lookup sees only the first page. |

The third row is dormant in current production deployments because no fleet has crossed 1000 items per collection. It is **not absent** — the publisher fleet pattern is designed to scale (Fort Huachuca v2.3 scenarios, multi-tenant deployments, bigger sensor manifests). The bug fires the moment a collection grows past the magic number.

The same matrix applies, mutatis mutandis, to `find_datastream` (collection: per-system datastreams; failure when a system has many outputs) and `_discover_system_ds` (collection: top-level systems; failure on busy multi-tenant servers).
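
The third row of the matrix can be reproduced without any server at all. The following is a minimal simulation, not code from this repo: `serve_page`, `first_page_lookup`, and `all_pages_lookup` are hypothetical names, and the "server" is a plain list that ignores the UID filter and pages by offset.

```python
from typing import Optional

PAGE_LIMIT = 1000  # mirrors the `limit=1000` workaround


def serve_page(collection: list, offset: int, limit: int) -> dict:
    """Hypothetical filter-ignoring server: one page plus the next offset, if any."""
    page = collection[offset:offset + limit]
    next_offset = offset + limit if offset + limit < len(collection) else None
    return {"items": page, "next_offset": next_offset}


def first_page_lookup(collection: list, uid: str) -> Optional[str]:
    """Single request, client-side filter: the shape used at all three sites today."""
    page = serve_page(collection, 0, PAGE_LIMIT)
    for item in page["items"]:
        if item["uid"] == uid:
            return item["id"]
    return None


def all_pages_lookup(collection: list, uid: str) -> Optional[str]:
    """Same client-side filter, but keeps requesting pages until none remain."""
    offset = 0
    while offset is not None:
        page = serve_page(collection, offset, PAGE_LIMIT)
        for item in page["items"]:
            if item["uid"] == uid:
                return item["id"]
        offset = page["next_offset"]
    return None


# 1500 systems; the target lives on the second page.
systems = [{"id": str(i), "uid": f"urn:example:system:{i}"} for i in range(1500)]
target = "urn:example:system:1200"

print(first_page_lookup(systems, target))  # None  (false negative)
print(all_pages_lookup(systems, target))   # 1200
```

The resource exists in both cases; only the lookup strategy differs. That is the false negative that sends `ensure_*` down the recreate path.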

## How the workaround was introduced

Commit [`92f584b5`](https://github.com/OS4CSAPI/OSHConnect-Python/commit/92f584b5) — *"fix: add limit=1000 to find_by_uid for Go server pagination"* — 2026-04-17. Diff: `+1 / -1`, single line. The commit message is candid that the change is a workaround for a server-pagination behavior, not a correctness fix. This issue exists to record that fact and propose closing the gap properly.

## Defense-in-depth — independent of [`connected-systems-go#5`](https://github.com/OS4CSAPI/connected-systems-go/issues/5)

A related server-side issue ([`connected-systems-go#5`](https://github.com/OS4CSAPI/connected-systems-go/issues/5) — *"Go server ignores `?uid=`"*) covers the immediate trigger of the `find_by_uid` failure on the new Go CSAPI server. If/when that lands, `find_by_uid` becomes correct again **for collections of any size on that one server**, because the filter narrows to 0/1 items.

The right fix on the Python side is still to walk `next` links, for two reasons:

1. **Filter quirks are a per-server reality.** Some other CSAPI server tomorrow will have its own filter coverage gap, throttling, partial filter-honoring under load, or simply different parsing of `?uid=`. Without server-side filtering, pagination is the spec-defined path.
2. **The OGC pagination contract is the same regardless of filtering.** [OGC 23-001 §7.6](https://docs.ogc.org/is/23-001/23-001.html) defines `limit` as optional with a server-defined default and `next` HATEOAS links as the conformance-required mechanism for retrieving subsequent pages. A correct OGC client walks links; it does not assume a single page.

So this fix is not contingent on the Go server fix. They're complementary; both should land, and either one alone is insufficient for full correctness.

## Recommended direction

> The shape of the fix is the maintainer's call. What follows is one direction that fits the existing module structure with minimal surface change.

Add a small page-iteration helper to `publishers/bootstrap_helpers.py`. If the library should not depend on the publisher module, add a sibling helper next to `src/oshconnect/base.py` as well: the two currently do not share an HTTP layer (`bootstrap_helpers.py` uses stdlib `urllib`, while `src/oshconnect/api_helpers.py` uses `requests`).

Sketch — `urllib`-side, for `bootstrap_helpers.py`:

```python
def _iter_pages(base_url: str, path: str, auth: str, *, max_pages: int = 100):
    """
    Yield items from a CSAPI list endpoint, walking `next` HATEOAS links.

    Yields items one at a time across all pages. Caller is responsible for
    early termination once the desired item is found.

    Args:
        base_url: Server base URL.
        path:     Collection path (e.g. 'systems?uid=foo').
        auth:     Basic-auth header value.
        max_pages: Safety cap against pathological circular link chains.

    Raises:
        RuntimeError: If max_pages is exceeded.
    """
    url = path  # api_get composes with base_url
    pages_seen = 0
    seen_urls: set[str] = set()
    while url:
        if pages_seen >= max_pages:
            raise RuntimeError(
                f"_iter_pages exceeded {max_pages} pages for {path}; "
                "possible circular `next` chain"
            )
        if url in seen_urls:
            raise RuntimeError(f"_iter_pages saw a circular `next` link at {url}")
        seen_urls.add(url)
        result = api_get(base_url, url, auth)
        if not result:
            return
        items = result.get("items", []) or result.get("features", [])
        for item in items:
            yield item
        pages_seen += 1
        # Find the `next` link.
        next_link = next(
            (link for link in (result.get("links") or []) if link.get("rel") == "next"),
            None,
        )
        if not next_link or not next_link.get("href"):
            return
        # `next` href may be absolute or path-relative; normalize to a path the
        # existing api_get can consume. `_normalize_next_url` does not exist yet
        # and would need to be written alongside this helper.
        url = _normalize_next_url(base_url, next_link["href"])
```

Then `find_by_uid` collapses to:

```python
def find_by_uid(base_url: str, auth: str, collection: str, uid: str) -> str | None:
    cache_key = f"{collection}:{uid}"
    if cache_key in _uid_cache:
        return _uid_cache[cache_key]

    # Keep `?uid={uid}` so a filter-aware server can short-circuit;
    # walk pages so a filter-ignoring server still works.
    for item in _iter_pages(base_url, f"{collection}?uid={uid}", auth):
        props = item.get("properties", item)
        if props.get("uid") == uid:
            item_id = item.get("id") or props.get("id")
            if item_id:
                _uid_cache[cache_key] = str(item_id)
                return str(item_id)
    return None
```

And `find_datastream` similarly switches to `_iter_pages`. The library-side `_discover_system_ds` either uses a `requests`-based sibling helper or is refactored to share a thin wrapper.
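
For the library side, a `requests`-flavoured sibling could look roughly like the sketch below. The name `iter_pages_requests` is hypothetical, and the sketch assumes the library's list responses carry the same `items` / `features` / `links` shape as the publisher-side ones. It is written against any object that behaves like `requests.Session` (a `get(url)` method returning a response with `raise_for_status()` and `json()`), so it can be exercised with a fake in tests.

```python
from typing import Any, Iterator
from urllib.parse import urljoin


def iter_pages_requests(session: Any, url: str, *, max_pages: int = 100) -> Iterator[dict]:
    """Yield items across all pages, following `next` links.

    `session` is expected to behave like requests.Session. Relative `next`
    hrefs are resolved against the URL of the page that carried them.
    """
    seen: set = set()
    while url:
        if url in seen:
            raise RuntimeError(f"circular `next` link at {url}")
        if len(seen) >= max_pages:
            raise RuntimeError(f"exceeded {max_pages} pages at {url}")
        seen.add(url)

        resp = session.get(url)
        resp.raise_for_status()  # fail loudly instead of treating an error as "last page"
        body = resp.json()

        yield from body.get("items") or body.get("features") or []

        href = next(
            (link.get("href") for link in body.get("links") or [] if link.get("rel") == "next"),
            None,
        )
        url = urljoin(url, href) if href else None
```

Raising on a failed page fetch is a deliberate choice in this sketch: silently ending the walk on an error would reintroduce the false negative this issue is about.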

Notes on the sketch:
- Drops the magic `limit=1000` entirely. The server's default page size is fine; iteration handles whatever it returns.
- `max_pages` and `seen_urls` are defense against pathological servers (circular `next` chains have been observed in non-OGC paginated APIs; cheap insurance).
- Caller iterates lazily; can break out as soon as the target item is found, so for a filter-honoring server the cost is one HTTP request.
- Negative-result caching: `_uid_cache` currently caches only successful lookups. Worth a comment that this is intentional — caching `None` would be incorrect across redeploys where the resource is created out-of-band.

## Other things worth touching while we're here (optional)

- **`find_datastream`:** same fix shape, same module, costs almost nothing extra to do in the same PR. Recommend doing it together so all three sites are consistent.
- **`_discover_system_ds`:** library-side sibling; uses `requests` not `urllib`. Either (a) live with two `_iter_pages` implementations (one per HTTP layer) or (b) take this opportunity to move the publisher fleet onto `requests` (modest dependency change, library already takes a `requests` dep). Either is defensible; (a) is the smaller diff.
- **Comment on `_uid_cache` semantics** explaining why only positive results are cached, so future contributors don't "fix" it.
- **Bootstrap-test fixture:** consider adding a fixture / fake server (or a recorded HTTP cassette via `vcrpy`/`responses`) that returns multi-page responses so the iteration logic is exercised in unit tests. Without this, the bug is invisible to CI on a small fixture corpus — exactly how it slipped past in the first place.
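
As one low-cost option for the fixture suggested above, an in-memory stand-in for `api_get` can serve multi-page responses without any network. The sketch below is illustrative only: `make_fake_api_get` is a hypothetical name, the `(base_url, path, auth)` signature is copied from the calls quoted in this issue, and the offset-based `next` href is an assumption about what a server might emit.

```python
from urllib.parse import parse_qs


def make_fake_api_get(items: list, page_size: int):
    """Build a stand-in for api_get that ignores filters and pages by offset.

    Returns the fake plus a list that records every path requested, so a test
    can assert on how many pages were fetched.
    """
    calls: list = []

    def fake_api_get(base_url: str, path: str, auth: str) -> dict:
        calls.append(path)
        route, _, query = path.partition("?")
        offset = int(parse_qs(query).get("offset", ["0"])[0])
        links = []
        if offset + page_size < len(items):
            # Advertise a `next` page only while items remain.
            links.append({"rel": "next", "href": f"{route}?offset={offset + page_size}"})
        return {"items": items[offset:offset + page_size], "links": links}

    return fake_api_get, calls


# 25 systems served 10 per page: the target sits on the third page.
systems = [{"id": str(i), "uid": f"urn:example:system:{i}"} for i in range(25)]
fake_api_get, calls = make_fake_api_get(systems, page_size=10)

wanted = "urn:example:system:22"
path, found = f"systems?uid={wanted}", None
while path and found is None:
    page = fake_api_get("http://fake", path, "auth")
    found = next((i["id"] for i in page["items"] if i["uid"] == wanted), None)
    path = next((l["href"] for l in page["links"] if l["rel"] == "next"), None)

print(found, len(calls))  # 22 3
```

A test built on this could patch the module's `api_get` with the fake and assert both that the lookup succeeds and that it stops fetching once the item is found.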

## What's intentionally NOT in scope for this issue

- ❌ Adopting a new HTTP client (e.g. `httpx`) — out of scope; orthogonal architectural choice.
- ❌ Adding async support to the publisher fleet — out of scope.
- ❌ Auto-retry / exponential backoff at the page-walk level — `api_get` already retries via `_with_retry`; pagination is a separate concern.
- ❌ A general-purpose CSAPI Python client library — the existing `src/oshconnect` is what it is; this issue only fixes the three concrete bugs.
- ❌ Changes to the publishers themselves (`iss_publisher.py`, etc.) — they consume the bootstrap output; once bootstrap is correct, they're unaffected.

## Reproduction / how to confirm

1. Stand up (or point at) a CSAPI server that does not honor `?uid=` filtering. The current Go CSAPI server fits — see [`connected-systems-go#5`](https://github.com/OS4CSAPI/connected-systems-go/issues/5).
2. Pre-populate the `systems` collection with > 1000 systems (or temporarily set the server's default `limit` to a small value, e.g. 10, and pre-populate > 10).
3. Run `python -m publishers.iss.bootstrap_iss` against it. Observe that `find_by_uid` returns `None` for systems that exist beyond the first page, and `ensure_system` then either fails with HTTP 409 or silently creates a duplicate.

Once the fix lands, the same scenario should bootstrap idempotently with no duplicates and no 409s.

## Severity / risk

Medium. The bug is currently latent, because the workaround holds for current fleet sizes, but:

- Three sites of the same shape, suggesting a missing concept rather than a one-off bug.
- The workaround is documented as a workaround in the commit message itself.
- The failure mode (silent duplicates / failed redeploys) is in deploy automation, where silent failures are especially scary.
- Trivial to fix relative to consequence at scale.

## References

| # | Source | What it provides |
|---|---|---|
| 1 | `publishers/bootstrap_helpers.py` — `find_by_uid` | Site #1 |
| 2 | `publishers/bootstrap_helpers.py` — `find_datastream` | Site #2 |
| 3 | `src/oshconnect/base.py` — `_discover_system_ds` | Site #3 |
| 4 | Commit [`92f584b5`](https://github.com/OS4CSAPI/OSHConnect-Python/commit/92f584b5) | Origin of the `limit=1000` workaround |
| 5 | [`connected-systems-go#5`](https://github.com/OS4CSAPI/connected-systems-go/issues/5) | Server-side complement: Go CSAPI server ignores `?uid=` |
| 6 | [OGC 23-001 §7.6](https://docs.ogc.org/is/23-001/23-001.html) | OGC API — Connected Systems pagination contract: `limit` is server-default; `next` link is the conformance-required mechanism |
| 7 | [`OS4CSAPI/ogc-client-CSAPI_2#167`](https://github.com/OS4CSAPI/ogc-client-CSAPI_2/issues/167) | TypeScript client companion: list methods will document pagination contract in JSDoc |
| 8 | [`OS4CSAPI/ogc-client-CSAPI_2#170`](https://github.com/OS4CSAPI/ogc-client-CSAPI_2/issues/170) | TypeScript client deferred enhancement: async-iterator helper that walks `next` links — analog of the `_iter_pages` sketch above, in TypeScript |
