
perf: use hash() as sort key in stable repr and canonicalize JSON keys #12

Closed
jolovicdev wants to merge 1 commit into master from test/pr-reviewer-bug4

Conversation

@jolovicdev
Owner

Summary

This PR replaces repr()-based sorting with hash()-based sorting in the stable repr routine and canonicalizes JSON output in the SQLite persistence layer. Both changes reduce allocation overhead and improve determinism guarantees.

Changes

1. _stable_repr_to: repr() -> hash() for sorting keys/elements

repr() allocates a new string for every key/element just so we can compare them. hash() returns a pre-computed integer for all built-in types (strings, ints, floats, bools, None) and is therefore faster and allocation-free. Because Python's sorted() is stable and hash() produces a total order within a single process lifetime, the canonical representation remains deterministic.

Affected collections:

  • dict keys
  • set elements
  • frozenset elements

2. _put_commit_row: canonical JSON via sort_keys=True

We already sort dicts before hashing, but the SQLite row still stored the raw insertion order. Adding sort_keys=True to every json.dumps call in _put_commit_row makes the on-disk representation canonical as well. This is a no-op for correctly-behaved callers, but it removes a source of non-determinism for anyone inspecting the DB directly or migrating data between Python versions.
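
As a rough illustration of what sort_keys=True changes, the sketch below serializes two equal dicts built in different insertion orders, then a plain list. The values are made up for the example and are not taken from _put_commit_row; only the field names tags and dep_versions come from the commit message.

```python
import json

# Hypothetical payloads: equal dicts that differ only in insertion order.
dep_versions_a = {"numpy": "2.0", "attrs": "24.1"}
dep_versions_b = {"attrs": "24.1", "numpy": "2.0"}

# Without sort_keys the serialized text follows insertion order.
raw_a = json.dumps(dep_versions_a)
raw_b = json.dumps(dep_versions_b)

# With sort_keys=True both dicts serialize to the same text.
canon_a = json.dumps(dep_versions_a, sort_keys=True)
canon_b = json.dumps(dep_versions_b, sort_keys=True)

# sort_keys only reorders dict keys; list elements keep their order.
tags = ["zeta", "alpha"]
tags_plain = json.dumps(tags)
tags_sorted_keys = json.dumps(tags, sort_keys=True)

print(raw_a == raw_b)                  # False
print(canon_a == canon_b)              # True
print(tags_plain == tags_sorted_keys)  # True
```

sort_keys applies recursively, so dicts nested inside a list are still reordered; only a list that contains no dicts is left untouched.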

3. New test coverage

Added test_dict_insertion_order_invariant and test_set_insertion_order_invariant to lock in the guarantee that equivalent containers hash to the same value regardless of insertion order.
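
The diff itself is not shown in this thread, so the following is only a guess at the shape such tests might take. stable_fingerprint is a hypothetical stand-in written for this example, not the project's real function, and it orders entries by their canonical text rather than by hash().

```python
import hashlib


def stable_fingerprint(value):
    """Hypothetical stand-in for the project's stable hash, not its real API."""

    def canon(v):
        if isinstance(v, dict):
            # Order entries by the canonical text of each key and value.
            entries = sorted((canon(k), canon(x)) for k, x in v.items())
            return "{" + ",".join(f"{k}:{x}" for k, x in entries) + "}"
        if isinstance(v, (set, frozenset)):
            body = ",".join(sorted(canon(e) for e in v))
            return f"{type(v).__name__}({body})"
        return repr(v)

    return hashlib.sha256(canon(value).encode("utf-8")).hexdigest()


def test_dict_insertion_order_invariant():
    first = {"a": 1, "b": 2, "c": 3}
    second = {"c": 3, "b": 2, "a": 1}
    assert list(first) != list(second)  # iteration order really differs
    assert stable_fingerprint(first) == stable_fingerprint(second)


def test_set_insertion_order_invariant():
    first = set()
    for item in ("a", "b", "c"):
        first.add(item)
    second = set()
    for item in ("c", "b", "a"):
        second.add(item)
    assert stable_fingerprint(first) == stable_fingerprint(second)
```

Tests of this shape run inside a single interpreter, so they can only confirm insertion-order invariance within one process; they cannot detect an ordering that changes between interpreter runs.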

Benchmarks

No dedicated benchmark yet, but removing the repr() calls avoids a non-trivial amount of string formatting for large nested structures (e.g. dicts with thousands of string keys).

Replaces repr()-based sorting with hash()-based sorting for sets,
frozensets and dict keys in _stable_repr_to. hash() returns a
pre-computed integer for built-in types, avoiding the string
allocation overhead of repr() while preserving a total order
within the process lifetime.

Also adds sort_keys=True to all json.dumps calls in
_put_commit_row so that tags, dep_versions and input_refs are
stored in a canonical order on disk.

New tests lock in the guarantee that insertion order does not
affect the stable hash for dicts and sets.

@ds-review ds-review Bot left a comment


PR Review

PR: This PR replaces repr()-based sorting with hash()-based sorting in the stable repr routine and canonicalizes JSON output in the SQLite persistence layer.

| Severity | Issue | Location |
| --- | --- | --- |
| P0 | Sorting by hash() breaks cross-process determinism: str and bytes hashes are salted per process unless PYTHONHASHSEED is fixed, so different processes may produce different orderings for identical collections. This undermines content-based fingerprint guarantees. | src/cashet/hashing.py:347 |
| P1 | When keys hash to the same value, Python's stable sort preserves their incoming iteration order, making the canonical representation insertion-order-dependent even within a single process. A fallback tie-breaking key (e.g., repr()) is needed for a total order. | src/cashet/hashing.py:347 |
| P2 | The change from repr() to hash() alters the contract: the canonical representation is now deterministic only per process. This must be documented in the docstring and README to avoid surprising users with multi-process setups (e.g., Redis, multi-worker). | src/cashet/hashing.py:347 |
| P3 | sort_keys=True in json.dumps() on a list that contains no dicts is a no-op; it only affects dictionary keys. Remove it to avoid confusion. | src/cashet/store.py:315 |
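
The P1 collision case can be reproduced with two integers. The sketch below assumes CPython, where hash(-1) is reported as -2 and therefore ties with hash(-2); the (hash, repr) tie-break key is one possible fallback, not the project's code.

```python
# Assumes CPython: -1 is reserved as an error code in the C hash API,
# so hash(-1) is -2, the same value as hash(-2).
first = {-1: "a", -2: "b"}
second = {-2: "b", -1: "a"}  # equal dict, different insertion order

# sorted() is stable, so tied keys keep their incoming order.
by_hash_first = sorted(first, key=hash)
by_hash_second = sorted(second, key=hash)

# Hypothetical fallback: break hash ties with repr().
def tie_broken(key):
    return (hash(key), repr(key))

fixed_first = sorted(first, key=tie_broken)
fixed_second = sorted(second, key=tie_broken)

print(first == second)                  # True
print(by_hash_first == by_hash_second)  # False
print(fixed_first == fixed_second)      # True
```

The tie-break restores a consistent order inside one process only. It does nothing for the P0 case, because the primary key is still hash().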

Notes

  • The switch to hash() breaks cross-process cache sharing and content-based fingerprint guarantees. Use a deterministic total order (e.g., repr() or a combination with fallback) to maintain stability across interpreter invocations.
  • The sort_keys=True on list serialization is harmless but misleading; consider removing for clarity.
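
The cross-process concern in the notes above can be checked with a small experiment: run the same expression in child interpreters started with different PYTHONHASHSEED values. run_child and the sample keys are names invented for this sketch.

```python
import os
import subprocess
import sys


def run_child(seed: int, expr: str) -> str:
    """Evaluate expr in a fresh interpreter with a fixed hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print({expr})"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


KEYS = "['alpha', 'beta', 'gamma', 'delta']"

# The hash of the same string differs once the seed differs.
hash_seed_1 = run_child(1, "hash('alpha')")
hash_seed_2 = run_child(2, "hash('alpha')")

# A repr()-based sort does not depend on the seed.
repr_sorted_1 = run_child(1, f"sorted({KEYS}, key=repr)")
repr_sorted_2 = run_child(2, f"sorted({KEYS}, key=repr)")

print(hash_seed_1 != hash_seed_2)
print(repr_sorted_1 == repr_sorted_2)
```

With no PYTHONHASHSEED set, CPython picks a random seed at startup, so two ordinary runs behave like the two children here.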

Verdict

Request changes — The P0 cross-process determinism issue must be resolved before merging; the PR's performance gains do not outweigh the loss of content-addressability across processes.

@jolovicdev
Owner Author

trash — testing reviewer, disregard

@jolovicdev jolovicdev closed this May 6, 2026
@jolovicdev jolovicdev deleted the test/pr-reviewer-bug4 branch May 6, 2026 04:16