Skip to content

SDK Conformance Audit - Cases.json, evalCondition operator audit and partial feature-eval / experiment-assignment#122

Merged
madhuchavva merged 12 commits into
mainfrom
fix/multi-rule-shared-prereq-regression-test
Jun 18, 2026
Merged

SDK Conformance Audit - Cases.json, evalCondition operator audit and partial feature-eval / experiment-assignment#122
madhuchavva merged 12 commits into
mainfrom
fix/multi-rule-shared-prereq-regression-test

Conversation

@madhuchavva

@madhuchavva madhuchavva commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

  1. NaN handling in compare() - Add an early math.isnan guard that raises ValueError
  2. Strict comparison for$eq / $ne for number-vs-string and bool-vs-int cases, resulting in eval behavior change.
  3. Additions to cases.json to reinforce the behavior.
Section Count Theme
evalCondition 24 case-insensitive operator granularity ($ini/$nini/$alli) plus case-sensitive failure variants for $in/$nin/$regex/$all
feature 2 hashVersion 2 force-rule inclusion/exclusion
stickyBucket 4 minBucketVersion / bucketVersion boundary semantics

compare(NaN, NaN) returned 0 because both > and < evaluate to false on
NaN, wrongly satisfying $eq. Raise instead — the existing exception
handlers in evalOperatorCondition give the correct truth values
($eq=False, $ne=True, $lt/$lte/$gt/$gte=False), matching JS and
the IEEE 754 spec. JSON can't represent NaN so the regression case
lives in test_growthbook.py rather than cases.json.
JS sdk-js/test/cases.json carries 30 cases that Python's copy was
missing: 24 evalCondition cases (case-sensitive failure + granular
case-insensitive variants), 2 hashVersion 2 feature cases, and 4
sticky-bucket minBucketVersion/bucketVersion boundary cases. All 30
pass against current Python — pure defensive coverage, no behavioral
change. Source: docs/sdk_parity/tier1.md (B2 catch-up section).
JS is source of truth for the conformance corpus. specVersion is a
label, not a contract — Python's copy fell 30 cases behind under 0.7.1
without any signal. The new check fetches the JS cases.json and diffs
case names per top-level key:

  - 'missing' (JS has, Python doesn't): fails CI unless listed in
    tests/scripts/corpus_skiplist.json
  - 'extra' (Python has, JS doesn't): reported but never fails;
    Python carries intentional extensions ($notRegex regression
    cases, locally-authored regressions)

Wired as a separate job in the build workflow so it surfaces clearly
in the PR UI and doesn't run 5x across the python-version matrix.

Source URL is configurable via --js-source or GB_JS_CASES_URL; default
is the JS SDK main-branch raw URL.
… ASCII-only case-fold

Two canonical behaviors per the tier-1 cross-SDK audit:

D6a (strict equality). JS's `$eq` is `===` (strict); only `<`/`>` coerce.
Python previously used compare() for $eq too, which coerced string→float
making `{$eq: 5}` against "5" return True. Real production divergence:
a customer with int feature rules and string frontend attributes saw
different variations between JS and Python clients. Routed $eq/$ne
through a new _js_strict_equal helper that requires getType() match.
Comparison operators ($lt/$lte/$gt/$gte) still call compare() and keep
JS coercion semantics (null→0, string→float).

D10 (ASCII-only case fold). Python's str.lower() is Unicode-aware
("İ".lower() → "i̇", "ß".lower() → "ß"), Rust's to_lowercase() is
Unicode-aware (differently), Go's strings.EqualFold is Unicode-aware
(differently still), JS's String.prototype.toLowerCase is naive ASCII.
Three SDKs producing three different non-ASCII behaviors is worse than
all four being naive. Locked the canonical to ASCII-only via a new
_ascii_lower translate table. ASCII still folds (US matches us);
non-ASCII chars pass through unchanged (Turkish İ no longer matches i).

tests/cases.json adds 16 cases:
- 5 for D6 (strict equality across number/string, bool/int, no-op equality)
- 2 for D10 (Turkish İ, German ß)
- 7 defensive pins for D1-D5, D7, D8 (Python already passed these but
  no case-locked them; pinning so JS-aligned behavior can't silently
  drift on future refactors)
- 2 for D14/D15 (parentCondition gate semantics — Python passes today,
  Rust will need fixes before it can land these upstream)
…-aware

Empirical verification via Node showed JS's String.prototype.toLowerCase()
is Unicode-aware ("Σ" → "σ", "А" → "а"), byte-identical to Python's
str.lower() on the test inputs. The earlier "ASCII-only" canonical
decision was based on three Explore-agent reports that mischaracterized
JS as "naive ASCII" — a common misconception that survived to the
deliverable.

Effect of the wrong fix: Python had been switched to a translate-table
ASCII-only lower, which made $ini/$nini/$alli return false for
Cyrillic/Greek pairs ("А" vs "а", "Σ" vs "σ") that JS treats as
matches. So the "fix" actually broke JS-alignment for those characters.

Restored str.lower() in core.py. Removed the misleading "ASCII case-fold
only" framing in the cases.json case names; the case bodies ("Turkish
İ vs i → false", "German ß vs SS → false") are still correct
regression cases because both JS toLowerCase and Python str.lower
produce "i̇" (with combining mark) for "İ" and don't fold ß. Added two
positive cases pinning the Unicode-aware path: "Σ" vs "σ" → true and
Cyrillic "А" vs "а" → true.
Empirical verification against JS via Node and against Rust via a
cargo-built test harness:

  new RegExp("^5$").test(5) → true     (JS .test() coerces non-string)
  Rust regex_comparison.rs:20   → true   (it.to_string() coerces)
  Python re.search(pat, 5)      → false  (TypeError caught)

So Python is the diverger; the case I had committed claimed Rust was
wrong (matching the agent's misread of regex_comparison.rs:28). The
case is removed. The real Python divergence is tracked in tier1.md
D8 (revised) but the fix is non-trivial (Python str(None) ≠ JS
String(null), so coercion would need a custom table) — deferred.
… reference identity)

JS mongrule.ts:218 uses `actual === expected` for $eq/$ne. For arrays
and objects, JS === is reference identity — two distinct literals never
match (verified empirically: `[1,2] === [1,2]` and `{a:1} === {a:1}`
both return false). Python's previous fallthrough was `a == b` which
does deep value equality, so distinct-but-equal containers DID match
— diverging from JS.

The simple (non-operator) equality path at mongrule.ts:100 uses
JSON.stringify deep comparison; Python's `==` on parsed-JSON dicts and
lists generally agrees with that, so it's left untouched.

Within feature evaluation we can't observe Python object identity in
any meaningful way — condition and attribute come from independent
JSON parses, never the same object. So container $eq is effectively
always-false in practice, matching JS. The fix encodes that directly.

Three new cases.json entries pin the behavior under tag [D6c].
Name-only diff missed real drift: if JS changed an expected outcome
under an existing case name (no rename), the old script exited 0 with
no signal. Verified the gap by flipping one expected boolean in a temp
JS corpus copy — old script said OK, new script flags it as body-drift.

Changes:
- _case_signatures(): canonical-JSON (sort_keys, compact separators)
  hash of each case body (everything after the name). SHA1, truncated
  to 16 hex chars — stable across dict insertion order.
- _diff() returns 5 categories now: actionable_missing, skipped_missing,
  extras, actionable_drift, skipped_drift. Same-name + same-hash → pass.
  Different hash → drift, fails CI.
- corpus_skiplist.json gains a "drift" bucket for the rare case where
  Python deliberately keeps a different body from JS.
- Report adds a "Body-drift" section; JSON output adds drift_actionable
  and drift_skipped keys.

Exit codes unchanged: 0 = clean, 1 = drift, 2 = infra error.
Previous version stored one body-hash per name in a dict, so duplicates
collapsed to the last occurrence. cases.json carries duplicate names
(e.g. $not - pass twice, $not - fail twice) and pytest's
pytest_generate_tests parametrizes the full list — so all duplicates
DO run. Drift on any earlier duplicate was silently missed.

Reproduced empirically: flipped an expected value on the FIRST
'$not - pass' case in a temp JS corpus copy; old checker exited 0.

Fix:
- _case_signatures_grouped() returns {name: [hash, hash, ...]} keeping
  every occurrence (renamed from _case_signatures).
- Drift compares Counter(js_hashes) != Counter(py_hashes) per name,
  catching:
    (a) single-occurrence body change,
    (b) one-of-many duplicates' body change,
    (c) duplicate-count mismatch (JS has 3, Python has 2).
- Multiset comparison so duplicate order within a name doesn't matter.

Verified across four scenarios — clean, drift-of-first-duplicate,
single-occurrence drift, duplicate count mismatch. All produce correct
exit codes and report sections.
@madhuchavva madhuchavva merged commit 8cc14e5 into main Jun 18, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant