SDK Conformance Audit - Cases.json, evalCondition operator audit and partial feature-eval / experiment-assignment#122
Merged
Conversation
compare(NaN, NaN) returned 0 because both > and < evaluate to false on NaN, wrongly satisfying $eq. Raise instead — the existing exception handlers in evalOperatorCondition give the correct truth values ($eq=False, $ne=True, $lt/$lte/$gt/$gte=False), matching JS and the IEEE 754 spec. JSON can't represent NaN so the regression case lives in test_growthbook.py rather than cases.json.
JS sdk-js/test/cases.json carries 30 cases that Python's copy was missing: 24 evalCondition cases (case-sensitive failure + granular case-insensitive variants), 2 hashVersion 2 feature cases, and 4 sticky-bucket minBucketVersion/bucketVersion boundary cases. All 30 pass against current Python — pure defensive coverage, no behavioral change. Source: docs/sdk_parity/tier1.md (B2 catch-up section).
JS is source of truth for the conformance corpus. specVersion is a
label, not a contract — Python's copy fell 30 cases behind under 0.7.1
without any signal. The new check fetches the JS cases.json and diffs
case names per top-level key:
- 'missing' (JS has, Python doesn't): fails CI unless listed in
tests/scripts/corpus_skiplist.json
- 'extra' (Python has, JS doesn't): reported but never fails;
Python carries intentional extensions ($notRegex regression
cases, locally-authored regressions)
Wired as a separate job in the build workflow so it surfaces clearly
in the PR UI and doesn't run 5x across the python-version matrix.
Source URL is configurable via --js-source or GB_JS_CASES_URL; default
is the JS SDK main-branch raw URL.
… ASCII-only case-fold
Two canonical behaviors per the tier-1 cross-SDK audit:
D6a (strict equality). JS's `$eq` is `===` (strict); only `<`/`>` coerce.
Python previously used compare() for $eq too, which coerced string→float
making `{$eq: 5}` against "5" return True. Real production divergence:
a customer with int feature rules and string frontend attributes saw
different variations between JS and Python clients. Routed $eq/$ne
through a new _js_strict_equal helper that requires getType() match.
Comparison operators ($lt/$lte/$gt/$gte) still call compare() and keep
JS coercion semantics (null→0, string→float).
D10 (ASCII-only case fold). Python's str.lower() is Unicode-aware
("İ".lower() → "i̇", "ß".lower() → "ß"), Rust's to_lowercase() is
Unicode-aware (differently), Go's strings.EqualFold is Unicode-aware
(differently still), JS's String.prototype.toLowerCase is naive ASCII.
Three SDKs producing three different non-ASCII behaviors is worse than
all four being naive. Locked the canonical to ASCII-only via a new
_ascii_lower translate table. ASCII still folds (US matches us);
non-ASCII chars pass through unchanged (Turkish İ no longer matches i).
tests/cases.json adds 16 cases:
- 5 for D6 (strict equality across number/string, bool/int, no-op equality)
- 2 for D10 (Turkish İ, German ß)
- 7 defensive pins for D1-D5, D7, D8 (Python already passed these but
no case-locked them; pinning so JS-aligned behavior can't silently
drift on future refactors)
- 2 for D14/D15 (parentCondition gate semantics — Python passes today,
Rust will need fixes before it can land these upstream)
…-aware
Empirical verification via Node showed JS's String.prototype.toLowerCase()
is Unicode-aware ("Σ" → "σ", "А" → "а"), byte-identical to Python's
str.lower() on the test inputs. The earlier "ASCII-only" canonical
decision was based on three Explore-agent reports that mischaracterized
JS as "naive ASCII" — a common misconception that survived to the
deliverable.
Effect of the wrong fix: Python had been switched to a translate-table
ASCII-only lower, which made $ini/$nini/$alli return false for
Cyrillic/Greek pairs ("А" vs "а", "Σ" vs "σ") that JS treats as
matches. So the "fix" actually broke JS-alignment for those characters.
Restored str.lower() in core.py. Removed the misleading "ASCII case-fold
only" framing in the cases.json case names; the case bodies ("Turkish
İ vs i → false", "German ß vs SS → false") are still correct
regression cases because both JS toLowerCase and Python str.lower
produce "i̇" (with combining mark) for "İ" and don't fold ß. Added two
positive cases pinning the Unicode-aware path: "Σ" vs "σ" → true and
Cyrillic "А" vs "а" → true.
Empirical verification against JS via Node and against Rust via a
cargo-built test harness:
new RegExp("^5$").test(5) → true (JS .test() coerces non-string)
Rust regex_comparison.rs:20 → true (it.to_string() coerces)
Python re.search(pat, 5) → false (TypeError caught)
So Python is the diverger; the case I had committed claimed Rust was
wrong (matching the agent's misread of regex_comparison.rs:28). The
case is removed. The real Python divergence is tracked in tier1.md
D8 (revised) but the fix is non-trivial (Python str(None) ≠ JS
String(null), so coercion would need a custom table) — deferred.
… reference identity)
JS mongrule.ts:218 uses `actual === expected` for $eq/$ne. For arrays
and objects, JS === is reference identity — two distinct literals never
match (verified empirically: `[1,2] === [1,2]` and `{a:1} === {a:1}`
both return false). Python's previous fallthrough was `a == b` which
does deep value equality, so distinct-but-equal containers DID match
— diverging from JS.
The simple (non-operator) equality path at mongrule.ts:100 uses
JSON.stringify deep comparison; Python's `==` on parsed-JSON dicts and
lists generally agrees with that, so it's left untouched.
Within feature evaluation we can't observe Python object identity in
any meaningful way — condition and attribute come from independent
JSON parses, never the same object. So container $eq is effectively
always-false in practice, matching JS. The fix encodes that directly.
Three new cases.json entries pin the behavior under tag [D6c].
Name-only diff missed real drift: if JS changed an expected outcome under an existing case name (no rename), the old script exited 0 with no signal. Verified the gap by flipping one expected boolean in a temp JS corpus copy — old script said OK, new script flags it as body-drift. Changes: - _case_signatures(): canonical-JSON (sort_keys, compact separators) hash of each case body (everything after the name). SHA1, truncated to 16 hex chars — stable across dict insertion order. - _diff() returns 5 categories now: actionable_missing, skipped_missing, extras, actionable_drift, skipped_drift. Same-name + same-hash → pass. Different hash → drift, fails CI. - corpus_skiplist.json gains a "drift" bucket for the rare case where Python deliberately keeps a different body from JS. - Report adds a "Body-drift" section; JSON output adds drift_actionable and drift_skipped keys. Exit codes unchanged: 0 = clean, 1 = drift, 2 = infra error.
Previous version stored one body-hash per name in a dict, so duplicates
collapsed to the last occurrence. cases.json carries duplicate names
(e.g. $not - pass twice, $not - fail twice) and pytest's
pytest_generate_tests parametrizes the full list — so all duplicates
DO run. Drift on any earlier duplicate was silently missed.
Reproduced empirically: flipped an expected value on the FIRST
'$not - pass' case in a temp JS corpus copy; old checker exited 0.
Fix:
- _case_signatures_grouped() returns {name: [hash, hash, ...]} keeping
every occurrence (renamed from _case_signatures).
- Drift compares Counter(js_hashes) != Counter(py_hashes) per name,
catching:
(a) single-occurrence body change,
(b) one-of-many duplicates' body change,
(c) duplicate-count mismatch (JS has 3, Python has 2).
- Multiset comparison so duplicate order within a name doesn't matter.
Verified across four scenarios — clean, drift-of-first-duplicate,
single-occurrence drift, duplicate count mismatch. All produce correct
exit codes and report sections.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
NaNhandling incompare()- Add an earlymath.isnanguard that raisesValueError$eq/$nefor number-vs-string and bool-vs-int cases, resulting in eval behavior change.evalCondition$ini/$nini/$alli) plus case-sensitive failure variants for$in/$nin/$regex/$allfeaturehashVersion 2force-rule inclusion/exclusionstickyBucketminBucketVersion/bucketVersionboundary semantics