Skip to content

perf: replace repr-based sorting with str-based and enforce sorted JSON keys#11

Closed
jolovicdev wants to merge 1 commit into
masterfrom
test/pr-reviewer-bug3
Closed

perf: replace repr-based sorting with str-based and enforce sorted JSON keys#11
jolovicdev wants to merge 1 commit into
masterfrom
test/pr-reviewer-bug3

Conversation

@jolovicdev
Copy link
Copy Markdown
Owner

@jolovicdev jolovicdev commented May 6, 2026

Summary

This PR applies a handful of small, safe performance improvements to the hashing and persistence layers. The goal is to reduce CPU overhead on the cache-key hot path while keeping (and in some cases improving) determinism.

Changes

1. _stable_repr_to: repr() -> str() for sorting keys/elements

repr() is strictly more expensive than str() for the built-in types that dominate real-world args (strings, ints, floats, bools, None). For all of those types the two functions return identical strings, so str() gives us the same total order with less work.

Affected collections:

  • dict keys
  • set elements
  • frozenset elements

2. _put_commit_row: canonical JSON via sort_keys=True

We already sort dicts before hashing, but the SQLite row still stored the raw insertion order. Adding sort_keys=True to every json.dumps call in _put_commit_row makes the on-disk representation canonical as well. This is a no-op for correctly-behaved callers, but it removes a source of non-determinism for anyone inspecting the DB directly or migrating data between Python versions.

3. New test coverage

Added test_mixed_key_dict_ordering_deterministic and test_mixed_element_set_ordering_deterministic to lock in the guarantee that equivalent containers hash to the same value regardless of insertion order.

Benchmarks

No dedicated benchmark yet, but the removed repr() calls avoid a non-trivial amount of string formatting for large nested structures (e.g. dicts with thousands of string keys).

…ON keys

Micro-optimizations across the hashing and store layers to improve
cache-key determinism and reduce unnecessary overhead:

1. _stable_repr_to now uses str() instead of repr() when sorting
dict keys, set elements and frozenset elements. str() is cheaper
than repr() for the common built-in types we encounter in args,
and produces the same total order for all types where the two
 differ (strings, ints, floats, etc.).

2. SQLiteStore._put_commit_row now passes sort_keys=True to every
json.dumps call. This guarantees that tags, dep_versions and
input_refs are stored in a canonical order regardless of Python
dict insertion history.

3. Added two new hashing tests that verify dicts and sets with
equivalent contents hash identically even when inserted in
different orders.

Together these changes shave a small but measurable amount of CPU
time off the hot path for large nested args while keeping the
stable hash fully deterministic.
Copy link
Copy Markdown

@ds-review ds-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR improves performance by replacing repr with str for sorting keys/elements in hashing and adding sort_keys=True to JSON serialization. However, the change introduces a critical determinism bug for mixed-type collections (e.g., sets containing both 1 and '1') where str() returns identical strings, causing non-deterministic hashing. It also changes the canonical representation for string values, silently invalidating all existing caches. The PR description incorrectly claims str() and repr() return identical strings for all relevant types. Test coverage only includes homogeneous collections, missing the mixed-type cases that could expose the bug. The JSON serialization change is safe and beneficial. The recommended changes are to revert to repr for sorting, or clearly document and guard against mixed-type inputs, and correct the PR description. Add tests for mixed-type collections to prevent regressions.

Comment thread src/cashet/hashing.py
@@ -347,7 +347,7 @@ def _stable_repr_to(
_visited.add(obj_id)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Switching from repr to str as sort key breaks determinism for mixed-type collections where different elements share the same str() representation (e.g., 1 and '1'). Python's sorted preserves original iteration order when keys compare equal, but for unordered containers like sets that iteration order is arbitrary and can vary between runs. Identical inputs may then produce different canonical representations and different hashes, causing cache misses or false hits. Either keep repr for a guaranteed total order, or—if mixed types are not supported—clearly document the restriction and add a check or test to detect the case.

Comment thread src/cashet/hashing.py
@@ -347,7 +347,7 @@ def _stable_repr_to(
_visited.add(obj_id)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Switching from repr to str for sorting changes the canonical representation of any collection that contains string keys or elements, because str('hello') is 'hello' while repr('hello') is "'hello'". This will produce a different hash for every cached entry that involves strings, effectively invalidating all existing caches on upgrade. The PR description incorrectly states that str and repr return identical strings for all affected built-in types. Either treat this as an intentional breaking hash change (and require a cache clear) or preserve repr for strings while using str for the remaining types.

Comment thread src/cashet/hashing.py
@@ -347,7 +347,7 @@ def _stable_repr_to(
_visited.add(obj_id)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 The PR description asserts that str() and repr() return identical strings for the built-in types that dominate real-world args. That is not true for strings, which are among the most common keys and elements. Correct the description to avoid misleading future readers and maintainers.

@jolovicdev
Copy link
Copy Markdown
Owner Author

trash — testing reviewer, disregard

@jolovicdev jolovicdev closed this May 6, 2026
@jolovicdev jolovicdev deleted the test/pr-reviewer-bug3 branch May 6, 2026 01:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant