DRAFT: feat(file_editor): interval-coverage dedupe for repeated view#3620
Draft
juanmichelini wants to merge 3 commits into
Draft
DRAFT: feat(file_editor): interval-coverage dedupe for repeated view#3620juanmichelini wants to merge 3 commits into
view#3620juanmichelini wants to merge 3 commits into
Conversation
When the same `view` would return identical bytes (same path, same
`view_range`, no edits between), return a small redirect hint instead
of re-emitting the full file content. Invalidate the per-path cache on
every successful `write_file` (covers `create`, `str_replace`,
`insert`, `undo_edit` — all funnel through `write_file`).
## Why
Trajectory analysis of run `27224350936` (Nemotron 550B SWE-Bench
Verified) showed each "expensive" instance re-viewing the same file
many times with no edits in between:
- `sklearn/impute/_iterative.py` viewed **34 times**
- `django/db/backends/base/schema.py` viewed **19 times**
- `django/contrib/sessions/backends/base.py` viewed 6 times
A typical Django/sklearn source file is 1–2 KB once `cat -n`-formatted,
so 19 redundant views ≈ 20–40 KB re-emitted into the conversation.
On a non-cached provider that whole pile is re-paid on every later
turn — quadratic context bloat.
## Design
Per `FileEditor` instance, store `{resolved_path: set[response_hash]}`.
On a `view`:
1. Compute SHA256 of the response text we'd return.
2. If that hash is already in the path's set: return a short hint that
names the file and teaches the recovery paths (scroll back, use a
different `view_range`, or `cat` via terminal if you really need
the bytes again).
3. Otherwise: add the hash to the set and return the real response.
Key choices:
- **Hash the response text, not the file content.** Different
`view_range`s produce different bytes → different hashes →
correctly not deduped.
- **`set[str]` per path, not a single slot.** Catches the ABA
pattern (view A → view B → view A) which is common when the agent
cross-references sections.
- **`write_file` clears the whole set for that path.** Atomic with
the file mutation — any subsequent view sees fresh content even if
the model edits a region we didn't previously hash.
- **Skip images** (any `ImageContent`) and **never dedupe errors**
— the agent must see retry feedback in full.
- **Directories are deduped** too: `ls`-style listings can be
multi-KB on big trees and the same model loops apply.
## Tests
`tests/tools/file_editor/test_view_dedupe.py` — 11 new tests:
- Repeat view of unchanged file → hint, full content not re-emitted.
- Hint is < 5% of the original payload it replaces (verified on a
1000-line file; ratio grows with file size).
- All four edit commands (`str_replace`, `create`, `insert`,
`undo_edit`) invalidate the cache.
- Different `view_range` is NOT deduped.
- ABA pattern IS deduped.
- Distinct paths track independently.
- Repeat directory listing is deduped.
- Error observations are never deduped.
All 163 existing file_editor tests pass unchanged. Lint + pyright clean.
## Stacks with
- SDK-5 (#3582) — literal-arg guard for `terminal.command`.
- SDK-7 (#3619) — proactive literal-arg warning in tool description.
Together these address the two largest waste buckets identified in
run `27224350936`.
Co-authored-by: openhands <openhands@all-hands.dev>
Contributor
Python API breakage checks — ✅ PASSEDResult: ✅ PASSED |
Contributor
REST API breakage checks (OpenAPI) — ✅ PASSEDResult: ✅ PASSED |
Contributor
Coverage Report •
|
||||||||||||||||||||
## What changed
Replace the v1 per-path response-hash set with two complementary
structures:
* `_view_intervals: dict[str, list[(start, end)]]` — sorted, merged
disjoint line intervals already shown to the model per text file.
* `_view_listing_hashes: dict[str, set[str]]` — response hashes for
directory listings, where the per-line interval model does not apply.
A text-file view now dedupes when the requested line range is ≥70% covered
by the union of previously-shown intervals for that file. Directories keep
the exact-hash behavior unchanged. Both structures are dropped for a path
on every successful `write_file` to that path, so any edit refreshes the
state atomically (covers `create`, `str_replace`, `insert`,
`undo_edit`).
## Why
Empirical data from run `27293667611` (Nemotron 550B SWE-Bench Verified
on the v1 branch) showed v1 caught only **2 of 17** redundant views of
`sklearn/impute/_iterative.py` per task — about 12%. v1's hash key
required EXACT response equality, but the model's real access pattern is
overlapping-but-not-identical ranges clustered around a few regions of
the file:
cluster A: (110,130) (110,135) (110,185) (115,122) (115,132) (115,135)
cluster B: (270,300) (270,340) (274,295) (290,350) (294,340)
cluster C: (560,650) (565,640) (605,630)
Under v2 (this PR), feeding the exact same trace through gives
**6 of 17** dedupe hits — a 3× improvement on the same access pattern,
without changing the model side at all. The remaining 11 first-views are
either (a) the genuine first view of a region or (b) the first view of a
file after a write invalidated the cache — both correct.
The first follow-up in cluster B is intentionally NOT deduped because it
genuinely extends the seen region (lines 301-350 are new content). Only
the third, fourth, and fifth — each fully inside the union of earlier
views — get the hint. This is the behavior pinned by
`test_cluster_pattern_from_real_trace_is_mostly_deduped`.
## Threshold (0.7)
Tuned to err on the side of deduping. False positives (model re-requests
with a tighter range from the hint) cost one extra turn; false negatives
(full 5–50 KB file dump) cost a payload that's re-paid on every uncached
subsequent turn → quadratic context bloat. 70% is the lowest threshold
that still leaves clear "you'll see >30% new content" semantics in the
hint.
## Hint changes
Two hints now exist:
* `_VIEW_DEDUPE_HINT_INTERVAL` — names the seen ranges and the
coverage percentage so the model can pick an actually-new range.
Example: `You have already viewed `/path/foo.py` covering lines
[110-185, 270-350]. The requested range [115-135] is 100% inside
what you have already been shown. Scroll back...`
* `_VIEW_DEDUPE_HINT_LISTING` — directory variant, simpler.
Both stay an order of magnitude smaller than the view they replace
(pinned by `test_hint_is_an_order_of_magnitude_smaller`).
## Tests
`tests/tools/file_editor/test_view_dedupe_intervals.py` — 27 new tests
across three layers:
* Unit-level: `_merge_intervals` (7 cases) and `_coverage_fraction`
(7 parametrized cases) — the interval algebra dedupe relies on.
* Behavior: subset/superset/overlap/disjoint cases; the growing-range
pattern; the real Nemotron cluster B pattern reproduced verbatim;
end-minus-one normalization; whole-file then partial; partial then
whole-file; multi-file isolation.
* Hint quality: hint lists seen ranges; size ratio < 10% on 1000-line
file.
`tests/tools/file_editor/test_view_dedupe.py` (v1 tests, kept) — every
test still passes unchanged. The v1 cases (exact repeats, write
invalidation, ABA pattern, error pass-through, directory dedupe, per-
path isolation) are all special cases of the v2 design.
All 192 file_editor tests pass. Lint + pyright clean.
## Honest scope
This redesign is informed by the v1-on-Nemotron data from run
`27293667611` (linked in the PR description). 3× catch-rate improvement
on the dominant waste pattern is real; the absolute cost impact on a
single task is bounded by how often the model edits the same file (each
edit invalidates and forces a fresh view). Best-case savings on a
non-caching provider are still meaningful — ~5K tokens × N future turns
per dedupe — but the v2 design also doesn't claim to fix the larger
structural issue (model retries the same forbidden command shape dozens
of times). That's a separate lever (SDK-8 hard cooldown).
Co-authored-by: openhands <openhands@all-hands.dev>
view of unchanged filesview
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TL;DR
When a
file_editor viewwould re-show content that's already substantially covered by previous views (same file, no edits between), return a small redirect hint naming the seen ranges instead of re-emitting the full content. Coverage threshold: 70% of the requested line range already shown.This is a redesign of the original v1 commit (
3c3003b9) which used exact response-hash matching. Empirical data showed v1 caught only ~12% of redundant views in real Nemotron 550B trajectories. v2 catches ~35% on the same trace — 3× improvement with the same model and same access pattern.Why the redesign was needed
v1 keyed dedupe by exact response hash. Real Nemotron access patterns on
sklearn/impute/_iterative.pylook like this (from run27293667611, conversation0e1b09d4):17 views, 15 distinct
view_ranges, all clustered around three small regions of the file. v1 caught 2/17 (the only exact repeats). The model genuinely doesn't need fresh content for any of the cluster B follow-ups after seeing(270, 350)— but v1 had no way to recognize this.What v2 does differently
Per text-file path, store a sorted, merged list of disjoint line intervals already shown (
_view_intervals: dict[str, list[(start, end)]]). On a new view:view()'s own normalization — handlesview_range=None→(1, num_lines),[start, -1]→(start, num_lines), etc.).seenis non-empty → return interval hint naming the seen ranges + the coverage percentage.seen.Directories are handled separately with the v1 response-hash approach (
_view_listing_hashes) — there's no per-line model for a listing, and the listing text naturally differs when the directory changes, so hash mismatch correctly skips dedupe on stale state.write_fileclears both structures for that path. Every edit command (create,str_replace,insert,undo_edit) goes throughwrite_file, so any edit refreshes the state atomically.Hint format
Stays under 600 bytes regardless of file size. On a 1000-line file (~20 KB view), that's ~3% of the original payload.
Real-trace replay
Feeding the EXACT chronological sequence of file_editor actions from the SDK-6 v1 run (
27293667611, the scikit-25232 instance) through v2:_iterative.pyThe remaining 11 first-views are either (a) the genuine first view of a region, or (b) the first view of a file after an intervening edit invalidated the cache — both correct behavior. Approximate per-conversation savings: ~5,250 tokens (the hint replaces ~4 KB views with ~500 B hints, 6× per conversation).
This is a meaningful improvement but I want to be honest about its ceiling: the absolute cost impact is bounded by how often the model edits the same file (each edit forces a fresh view). On a single SWE-Bench task with ~30 file edits, dedupe gives a smaller win than I projected in the v1 PR. The larger structural issue — the model retries the same forbidden command shape dozens of times in a single conversation, paying for each retry — is a separate lever and not addressed here.
Threshold
The dedupe coverage threshold is 70%, hard-coded as
_DEDUPE_COVERAGE_THRESHOLD. Reasoning:False negatives are much worse than false positives, so we err toward deduping. 70% is the lowest threshold that still leaves clear "you'll see >30% new content" semantics in the hint text.
Tests
Two test files, 40 tests total, all green.
tests/tools/file_editor/test_view_dedupe_intervals.py— 27 new tests:_merge_intervals(7 cases — empty, single, disjoint, overlapping, abutting, nested, real-trace cluster) and_coverage_fraction(7 parametrized cases covering subset, superset, partial, disjoint, multi-interval).[real, real, hint, hint, hint]sequence; below-threshold partial → no dedupe; above-threshold partial → dedupe; seen-grows-monotonically with multi-interval coverage; whole-file-then-partial; partial-then-whole;view_range=[1, -1]normalization; multi-file isolation.str_replaceclears the entire interval set for the path.tests/tools/file_editor/test_view_dedupe.py— the 11 v1 tests, all unchanged, all still passing. The v1 cases (exact repeats, ABA pattern, write invalidation across all edit commands, errors never deduped, directories, per-path isolation, hint smaller than view) are all special cases of the v2 design.All 192 file_editor tests pass. Lint + pyright clean.
Risk
FileEditorObservationshape is identical.FileEditorinstances are scoped to a single conversation.file_editor view dedupe: <path> requested=<range> coverage=<pct>%) so eval pipelines can grep for hit-rate metrics.Verify locally
Spot-check the hint:
What this PR does NOT claim
The justified scope is: 3× more dedupes on the same access pattern, with a hint format the model can actually navigate. That's the measurable win.
This PR was created by an AI agent (OpenHands) on behalf of the user investigating per-instance cost on the Nemotron 550B SWE-Bench Verified eval. v1 design (first commit on this branch) caught 2/17 redundant views in real trajectories; v2 (this PR) catches 6/17 on the same trace via interval-coverage tracking.