perf: replace repr-based sorting with str-based and enforce sorted JSON keys by jolovicdev · Pull Request #11 · jolovicdev/cashet

jolovicdev · 2026-05-06T01:13:08Z

Summary

This PR applies a handful of small, safe performance improvements to the hashing and persistence layers. The goal is to reduce CPU overhead on the cache-key hot path while keeping (and in some cases improving) determinism.

Changes

1. _stable_repr_to: repr() -> str() for sorting keys/elements

repr() is strictly more expensive than str() for the built-in types that dominate real-world args (strings, ints, floats, bools, None). For all of those types the two functions return identical strings, so str() gives us the same total order with less work.

Affected collections:

dict keys
set elements
frozenset elements

2. _put_commit_row: canonical JSON via sort_keys=True

We already sort dicts before hashing, but the SQLite row still stored the raw insertion order. Adding sort_keys=True to every json.dumps call in _put_commit_row makes the on-disk representation canonical as well. This is a no-op for correctly-behaved callers, but it removes a source of non-determinism for anyone inspecting the DB directly or migrating data between Python versions.

3. New test coverage

Added test_mixed_key_dict_ordering_deterministic and test_mixed_element_set_ordering_deterministic to lock in the guarantee that equivalent containers hash to the same value regardless of insertion order.

Benchmarks

No dedicated benchmark yet, but the removed repr() calls avoid a non-trivial amount of string formatting for large nested structures (e.g. dicts with thousands of string keys).

…ON keys Micro-optimizations across the hashing and store layers to improve cache-key determinism and reduce unnecessary overhead: 1. _stable_repr_to now uses str() instead of repr() when sorting dict keys, set elements and frozenset elements. str() is cheaper than repr() for the common built-in types we encounter in args, and produces the same total order for all types where the two differ (strings, ints, floats, etc.). 2. SQLiteStore._put_commit_row now passes sort_keys=True to every json.dumps call. This guarantees that tags, dep_versions and input_refs are stored in a canonical order regardless of Python dict insertion history. 3. Added two new hashing tests that verify dicts and sets with equivalent contents hash identically even when inserted in different orders. Together these changes shave a small but measurable amount of CPU time off the hot path for large nested args while keeping the stable hash fully deterministic.

ds-review

This PR improves performance by replacing repr with str for sorting keys/elements in hashing and adding sort_keys=True to JSON serialization. However, the change introduces a critical determinism bug for mixed-type collections (e.g., sets containing both 1 and '1') where str() returns identical strings, causing non-deterministic hashing. It also changes the canonical representation for string values, silently invalidating all existing caches. The PR description incorrectly claims str() and repr() return identical strings for all relevant types. Test coverage only includes homogeneous collections, missing the mixed-type cases that could expose the bug. The JSON serialization change is safe and beneficial. The recommended changes are to revert to repr for sorting, or clearly document and guard against mixed-type inputs, and correct the PR description. Add tests for mixed-type collections to prevent regressions.

ds-review · 2026-05-06T01:21:12Z

@@ -347,7 +347,7 @@ def _stable_repr_to(
        _visited.add(obj_id)


P0 Switching from repr to str as sort key breaks determinism for mixed-type collections where different elements share the same str() representation (e.g., 1 and '1'). Python's sorted preserves original iteration order when keys compare equal, but for unordered containers like sets that iteration order is arbitrary and can vary between runs. Identical inputs may then produce different canonical representations and different hashes, causing cache misses or false hits. Either keep repr for a guaranteed total order, or—if mixed types are not supported—clearly document the restriction and add a check or test to detect the case.

ds-review · 2026-05-06T01:21:13Z

@@ -347,7 +347,7 @@ def _stable_repr_to(
        _visited.add(obj_id)


P1 Switching from repr to str for sorting changes the canonical representation of any collection that contains string keys or elements, because str('hello') is 'hello' while repr('hello') is "'hello'". This will produce a different hash for every cached entry that involves strings, effectively invalidating all existing caches on upgrade. The PR description incorrectly states that str and repr return identical strings for all affected built-in types. Either treat this as an intentional breaking hash change (and require a cache clear) or preserve repr for strings while using str for the remaining types.

ds-review · 2026-05-06T01:21:15Z

@@ -347,7 +347,7 @@ def _stable_repr_to(
        _visited.add(obj_id)


P3 The PR description asserts that str() and repr() return identical strings for the built-in types that dominate real-world args. That is not true for strings, which are among the most common keys and elements. Correct the description to avoid misleading future readers and maintainers.

jolovicdev · 2026-05-06T01:28:35Z

trash — testing reviewer, disregard

ds-review Bot reviewed May 6, 2026

View reviewed changes

jolovicdev closed this May 6, 2026

jolovicdev deleted the test/pr-reviewer-bug3 branch May 6, 2026 01:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: replace repr-based sorting with str-based and enforce sorted JSON keys#11

perf: replace repr-based sorting with str-based and enforce sorted JSON keys#11
jolovicdev wants to merge 1 commit into
masterfrom
test/pr-reviewer-bug3

jolovicdev commented May 6, 2026 •

edited

Loading

Uh oh!

ds-review Bot left a comment

Uh oh!

ds-review Bot May 6, 2026

Uh oh!

ds-review Bot May 6, 2026

Uh oh!

ds-review Bot May 6, 2026

Uh oh!

jolovicdev commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		@@ -347,7 +347,7 @@ def _stable_repr_to(
		_visited.add(obj_id)

Conversation

jolovicdev commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

1. _stable_repr_to: repr() -> str() for sorting keys/elements

2. _put_commit_row: canonical JSON via sort_keys=True

3. New test coverage

Benchmarks

Uh oh!

ds-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

ds-review Bot May 6, 2026

Choose a reason for hiding this comment

Uh oh!

ds-review Bot May 6, 2026

Choose a reason for hiding this comment

Uh oh!

ds-review Bot May 6, 2026

Choose a reason for hiding this comment

Uh oh!

jolovicdev commented May 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jolovicdev commented May 6, 2026 •

edited

Loading