perf: use hash() as sort key in stable repr and canonicalize JSON keys#12
Closed
jolovicdev wants to merge 1 commit into
Closed
perf: use hash() as sort key in stable repr and canonicalize JSON keys#12jolovicdev wants to merge 1 commit into
jolovicdev wants to merge 1 commit into
Conversation
Replaces repr()-based sorting with hash()-based sorting for sets, frozensets and dict keys in _stable_repr_to. hash() returns a pre-computed integer for built-in types, avoiding the string allocation overhead of repr() while preserving a total order within the process lifetime. Also adds sort_keys=True to all json.dumps calls in _put_commit_row so that tags, dep_versions and input_refs are stored in a canonical order on disk. New tests lock in the guarantee that insertion order does not affect the stable hash for dicts and sets.
There was a problem hiding this comment.
PR Review
PR: This PR replaces repr()-based sorting with hash()-based sorting in stable repr routine and canonicalizes JSON output in the SQLite persistence layer.
| Severity | Issue |
|---|---|
P0 |
Sorting by hash() breaks cross-process determinism — different processes may produce different orderings for identical collections due to per-process hash seed (PYTHONHASHSEED). This undermines content-based fingerprint guarantees. — src/cashet/hashing.py:347 |
P1 |
When keys hash to the same value, Python's stable sort preserves insertion order among them, making canonical representation insertion-order-dependent even within a single process. Need a fallback tie-breaking comparator (e.g., repr()) for total order. — src/cashet/hashing.py:347 |
P2 |
The change from repr() to hash() alters the contract: canonical representation is now only deterministic per process. This must be documented in the docstring and README to avoid surprising users with multi-process setups (e.g., Redis, multi-worker). — src/cashet/hashing.py:347 |
P3 |
sort_keys=True in json.dumps() on a list is a no-op; it only affects dictionary keys. Remove it to avoid confusion. — src/cashet/store.py:315 |
Notes
- The switch to
hash()breaks cross-process cache sharing and content-based fingerprint guarantees. Use a deterministic total order (e.g.,repr()or a combination with fallback) to maintain stability across interpreter invocations. - The
sort_keys=Trueon list serialization is harmless but misleading; consider removing for clarity.
Verdict
Request changes — The P0 cross-process determinism issue must be resolved before merging; the PR's performance gains do not outweigh the loss of content-addressability across processes.
Owner
Author
|
trash — testing reviewer, disregard |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR replaces repr()-based sorting with hash()-based sorting in the stable repr routine and canonicalizes JSON output in the SQLite persistence layer. Both changes reduce allocation overhead and improve determinism guarantees.
Changes
1. _stable_repr_to: repr() -> hash() for sorting keys/elements
repr() allocates a new string for every key/element just so we can compare them. hash() returns a pre-computed integer for all built-in types (strings, ints, floats, bools, None) and is therefore faster and allocation-free. Because Python's sorted() is stable and hash() produces a total order within a single process lifetime, the canonical representation remains deterministic.
Affected collections:
2. _put_commit_row: canonical JSON via sort_keys=True
We already sort dicts before hashing, but the SQLite row still stored the raw insertion order. Adding sort_keys=True to every json.dumps call in _put_commit_row makes the on-disk representation canonical as well. This is a no-op for correctly-behaved callers, but it removes a source of non-determinism for anyone inspecting the DB directly or migrating data between Python versions.
3. New test coverage
Added test_dict_insertion_order_invariant and test_set_insertion_order_invariant to lock in the guarantee that equivalent containers hash to the same value regardless of insertion order.
Benchmarks
No dedicated benchmark yet, but removing the repr() calls avoids a non-trivial amount of string formatting for large nested structures (e.g. dicts with thousands of string keys).