DRAFT: feat(file_editor): interval-coverage dedupe for repeated `view` by juanmichelini · Pull Request #3620 · OpenHands/software-agent-sdk

juanmichelini · 2026-06-10T03:24:26Z

TL;DR

When a file_editor view would re-show content that's already substantially covered by previous views (same file, no edits between), return a small redirect hint naming the seen ranges instead of re-emitting the full content. Coverage threshold: 70% of the requested line range already shown.

This is a redesign of the original v1 commit (3c3003b9) which used exact response-hash matching. Empirical data showed v1 caught only ~12% of redundant views in real Nemotron 550B trajectories. v2 catches ~35% on the same trace — 3× improvement with the same model and same access pattern.

Why the redesign was needed

v1 keyed dedupe by exact response hash. Real Nemotron access patterns on sklearn/impute/_iterative.py look like this (from run 27293667611, conversation 0e1b09d4):

cluster A: (110,130) (110,135) (110,185) (115,122) (115,132) (115,135)
cluster B: (270,300) (270,340) (274,295) (290,350) (294,340)
cluster C: (560,650) (565,640) (605,630)

17 views, 15 distinct view_ranges, all clustered around three small regions of the file. v1 caught 2/17 (the only exact repeats). The model genuinely doesn't need fresh content for any of the cluster B follow-ups after seeing (270, 350) — but v1 had no way to recognize this.

What v2 does differently

Per text-file path, store a sorted, merged list of disjoint line intervals already shown (_view_intervals: dict[str, list[(start, end)]]). On a new view:

1. Compute the requested interval (mirroring view()'s own normalization: handles view_range=None → (1, num_lines), [start, -1] → (start, num_lines), etc.).
2. Compute coverage = (lines in requested ∩ union(seen)) / len(requested).
3. If coverage ≥ 0.7 and seen is non-empty → return interval hint naming the seen ranges + the coverage percentage.
4. Otherwise → return real content, merge the new interval into seen.
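
The steps above can be sketched in a few lines of interval arithmetic. This is a minimal illustration, not the code in the PR: the function names, the inclusive 1-based `(start, end)` convention, and the choice to fuse abutting intervals are assumptions made for the example.

```python
from __future__ import annotations

# Assumption: intervals are inclusive, 1-based (start, end) line ranges.
Interval = tuple[int, int]

THRESHOLD = 0.7  # the coverage threshold named in this PR


def normalize_range(view_range: list[int] | None, num_lines: int) -> Interval:
    """Map view_range=None and [start, -1] onto a concrete interval."""
    if view_range is None:
        return (1, num_lines)
    start, end = view_range
    return (start, num_lines if end == -1 else end)


def merge_intervals(intervals: list[Interval]) -> list[Interval]:
    """Return sorted, disjoint intervals; overlapping or abutting ones are fused."""
    merged: list[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def coverage_fraction(requested: Interval, seen: list[Interval]) -> float:
    """Fraction of the requested lines that fall inside the union of seen."""
    req_start, req_end = requested
    total = req_end - req_start + 1
    if total <= 0:
        return 0.0
    covered = 0
    for start, end in merge_intervals(seen):
        lo, hi = max(start, req_start), min(end, req_end)
        if lo <= hi:
            covered += hi - lo + 1
    return covered / total


def should_dedupe(requested: Interval, seen: list[Interval]) -> bool:
    """Dedupe only when something was seen and coverage meets the threshold."""
    return bool(seen) and coverage_fraction(requested, seen) >= THRESHOLD
```

For instance, with `seen = [(270, 300)]` a request for `(270, 340)` covers 31 of 71 lines (about 44%), so it would be served as real content, while `(274, 295)` against `[(270, 350)]` is fully covered and would get the hint.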

Directories are handled separately with the v1 response-hash approach (_view_listing_hashes) — there's no per-line model for a listing, and the listing text naturally differs when the directory changes, so hash mismatch correctly skips dedupe on stale state.

write_file clears both structures for that path. Every edit command (create, str_replace, insert, undo_edit) goes through write_file, so any edit refreshes the state atomically.
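
A hedged sketch of that invalidation rule follows. The class name `ViewDedupeState` and the method `on_write` are hypothetical; only the two dictionary shapes come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ViewDedupeState:
    """Hypothetical container for the two per-path structures."""

    # Merged line intervals already shown, keyed by text-file path.
    view_intervals: dict[str, list[tuple[int, int]]] = field(default_factory=dict)
    # Response hashes of directory listings, keyed by directory path.
    view_listing_hashes: dict[str, set[str]] = field(default_factory=dict)

    def on_write(self, path: str) -> None:
        """Drop everything remembered about one path; other paths are untouched."""
        self.view_intervals.pop(path, None)
        self.view_listing_hashes.pop(path, None)


state = ViewDedupeState()
state.view_intervals["/repo/a.py"] = [(1, 50)]
state.view_intervals["/repo/b.py"] = [(10, 20)]
state.on_write("/repo/a.py")  # an edit to a.py forgets a.py only
```

Calling the hook from the single write path, rather than from each edit command, is what keeps the reset in step with the file mutation.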

Hint format

[file_editor view dedupe] You have already viewed `/path/to/_iterative.py`
covering lines [110-185, 270-350]. The requested range [115-135] is 100%
inside what you have already been shown. Scroll back to your earlier
`view` observation rather than re-loading. To see truly new content,
request a `view_range` outside [110-185, 270-350]. If the file has been
changed externally (which would not be reflected in your prior view),
run `cat /path/to/_iterative.py` via the terminal tool.

Stays under 600 bytes regardless of file size. On a 1000-line file (~20 KB view), that's ~3% of the original payload.
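
As a rough illustration of how such a hint could be assembled, the sketch below fills the wording shown above from a path, the seen intervals, the requested range, and the coverage. The helper names `format_ranges` and `build_hint` are hypothetical, and the real template constant may differ in wording or layout.

```python
def format_ranges(intervals: list[tuple[int, int]]) -> str:
    """Render [(110, 185), (270, 350)] as '[110-185, 270-350]'."""
    return "[" + ", ".join(f"{start}-{end}" for start, end in intervals) + "]"


def build_hint(
    path: str,
    seen: list[tuple[int, int]],
    requested: tuple[int, int],
    coverage: float,
) -> str:
    """Assemble a redirect hint in the style of the example above."""
    seen_text = format_ranges(seen)
    requested_text = format_ranges([requested])
    return (
        f"[file_editor view dedupe] You have already viewed `{path}` "
        f"covering lines {seen_text}. The requested range {requested_text} "
        f"is {coverage:.0%} inside what you have already been shown. "
        "Scroll back to your earlier `view` observation rather than re-loading. "
        f"To see truly new content, request a `view_range` outside {seen_text}. "
        "If the file has been changed externally (which would not be reflected "
        f"in your prior view), run `cat {path}` via the terminal tool."
    )


hint = build_hint(
    "/path/to/_iterative.py", [(110, 185), (270, 350)], (115, 135), 1.0
)
```

The hint length depends only on the path and the number of merged intervals, not on how many lines the file has.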

Real-trace replay

Feeding the EXACT chronological sequence of file_editor actions from the SDK-6 v1 run (27293667611, the scikit-25232 instance) through v2:

Mechanism	Hints on `_iterative.py`	Catch rate
v1 (response hash, shipped previously)	2 / 17	12%
v2 (this PR)	6 / 17	35%

The remaining 11 first-views are either (a) the genuine first view of a region, or (b) the first view of a file after an intervening edit invalidated the cache — both correct behavior. Approximate per-conversation savings: ~5,250 tokens (the hint replaces ~4 KB views with ~500 B hints, 6× per conversation).
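
The figures above can be reproduced with a few lines of arithmetic. The conversion of roughly 4 bytes per token is an assumption chosen because it matches the ~5,250 estimate; it is not stated in the PR, and "4 KB" is taken as 4,000 bytes.

```python
# Numbers quoted in this section.
TOTAL_VIEWS = 17
V1_HINTS = 2
V2_HINTS = 6
VIEW_BYTES = 4_000   # "~4 KB views", taken as 4,000 bytes
HINT_BYTES = 500     # "~500 B hints"

# Assumption: about 4 bytes of text per token.
BYTES_PER_TOKEN = 4

v1_rate = V1_HINTS / TOTAL_VIEWS          # about 0.12
v2_rate = V2_HINTS / TOTAL_VIEWS          # about 0.35
improvement = v2_rate / v1_rate           # 3x

saved_bytes = V2_HINTS * (VIEW_BYTES - HINT_BYTES)
saved_tokens = saved_bytes // BYTES_PER_TOKEN

print(f"v1 {v1_rate:.0%}, v2 {v2_rate:.0%}, {improvement:.0f}x")
print(f"saved about {saved_tokens} tokens per conversation")
```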

This is a meaningful improvement but I want to be honest about its ceiling: the absolute cost impact is bounded by how often the model edits the same file (each edit forces a fresh view). On a single SWE-Bench task with ~30 file edits, dedupe gives a smaller win than I projected in the v1 PR. The larger structural issue — the model retries the same forbidden command shape dozens of times in a single conversation, paying for each retry — is a separate lever and not addressed here.

Threshold

The dedupe coverage threshold is 70%, hard-coded as _DEDUPE_COVERAGE_THRESHOLD. Reasoning:

False positive (we dedupe but model wanted new content): one extra turn — model re-requests with a tighter range from the hint. Cheap.
False negative (we let through a redundant view): a 5–50 KB payload re-paid on every uncached subsequent turn. Quadratic in conversation length.

False negatives are much worse than false positives, so we err toward deduping. 70% is the lowest threshold that still leaves clear "you'll see >30% new content" semantics in the hint text.
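
To make the asymmetry concrete, here is a toy cost model. It assumes a provider with no prompt caching, so a payload emitted at one turn is re-sent as input on every later turn; the token counts are illustrative and not measured.

```python
def carried_tokens(payload_tokens: int, emitted_at_turn: int, total_turns: int) -> int:
    """Tokens re-sent because a payload emitted at one turn is re-read on every later turn."""
    return payload_tokens * max(total_turns - 1 - emitted_at_turn, 0)


def total_carried(payload_tokens: int, total_turns: int) -> int:
    """Worst case: one payload of this size is emitted on every turn."""
    return sum(
        carried_tokens(payload_tokens, turn, total_turns)
        for turn in range(total_turns)
    )


# Illustrative sizes: a 1,000-token view versus a 125-token hint.
full_view_10 = total_carried(1_000, 10)
full_view_20 = total_carried(1_000, 20)
hint_10 = total_carried(125, 10)
```

Doubling the number of turns from 10 to 20 more than quadruples the carried tokens in this model, which is the quadratic growth the false-negative case refers to. A false positive, by contrast, costs a single extra turn.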

Tests

Two test files, 40 tests total, all green.

tests/tools/file_editor/test_view_dedupe_intervals.py — 27 new tests:

Unit-level: _merge_intervals (7 cases — empty, single, disjoint, overlapping, abutting, nested, real-trace cluster) and _coverage_fraction (7 parametrized cases covering subset, superset, partial, disjoint, multi-interval).
Overlap-aware behavior: subset-of-seen → dedupe; growing-range within seen → dedupe; cluster-B pattern reproduced verbatim with expected [real, real, hint, hint, hint] sequence; below-threshold partial → no dedupe; above-threshold partial → dedupe; seen-grows-monotonically with multi-interval coverage; whole-file-then-partial; partial-then-whole; view_range=[1, -1] normalization; multi-file isolation.
Hint quality: hint lists seen ranges in compact form; size ratio < 10% on 1000-line file.
Edit invalidation: str_replace clears the entire interval set for the path.

tests/tools/file_editor/test_view_dedupe.py — the 11 v1 tests, all unchanged, all still passing. The v1 cases (exact repeats, ABA pattern, write invalidation across all edit commands, errors never deduped, directories, per-path isolation, hint smaller than view) are all special cases of the v2 design.

All 192 file_editor tests pass. Lint + pyright clean.

Risk

No schema change. FileEditorObservation shape is identical.
In-process state only — FileEditor instances are scoped to a single conversation.
Memory bound — merged intervals stay small even after many views (the Nemotron pattern collapses to ≤3 intervals per file). Cleared on every write.
No silent action loss — every dedupe is logged at INFO level (file_editor view dedupe: <path> requested=<range> coverage=<pct>%) so eval pipelines can grep for hit-rate metrics.
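
An eval pipeline could count dedupe hits per file along these lines. This is a sketch under assumptions: the sample log lines are invented, and the exact rendering of `<range>` in the real log message is not shown in this PR, so the pattern accepts any text there.

```python
import re
from collections import Counter

# Pattern follows the documented shape:
#   file_editor view dedupe: <path> requested=<range> coverage=<pct>%
DEDUPE_LINE = re.compile(
    r"file_editor view dedupe: (?P<path>.+?) "
    r"requested=(?P<requested>.+?) coverage=(?P<pct>\d+(?:\.\d+)?)%"
)


def dedupe_hits_by_path(log_lines: list[str]) -> Counter:
    """Count dedupe log lines per file path."""
    hits: Counter = Counter()
    for line in log_lines:
        match = DEDUPE_LINE.search(line)
        if match:
            hits[match.group("path")] += 1
    return hits


# Hypothetical log lines, made up for this example.
sample_log = [
    "INFO file_editor view dedupe: /repo/a.py requested=(115, 135) coverage=100%",
    "INFO some unrelated message",
    "INFO file_editor view dedupe: /repo/a.py requested=(274, 295) coverage=100%",
    "INFO file_editor view dedupe: /repo/b.py requested=(1, 40) coverage=75%",
]
hits = dedupe_hits_by_path(sample_log)
```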

Verify locally

uv run pytest tests/tools/file_editor/test_view_dedupe.py tests/tools/file_editor/test_view_dedupe_intervals.py -v
# 40 passed
uv run pytest tests/tools/file_editor/ -q
# 192 passed

Spot-check the hint:

uv run python -c "
from openhands.tools.file_editor.editor import FileEditor
import tempfile, pathlib
with tempfile.TemporaryDirectory() as d:
    p = pathlib.Path(d) / 'demo.py'
    # 400 lines, so both view ranges below (up to line 350) stay in bounds
    p.write_text('\n'.join(f'line_{i}' for i in range(1, 401)) + '\n')
    e = FileEditor()
    e(command='view', path=str(p), view_range=[110, 185])
    e(command='view', path=str(p), view_range=[270, 350])
    print(e(command='view', path=str(p), view_range=[115, 135]).text)
"

What this PR does NOT claim

It does not claim a specific $/instance reduction. The previous PR description's "+$11" projection turned out to be optimistic; the cost win is bounded by the edit-then-view cadence and is roughly proportional to the conversation length on non-caching providers.
It does not address the larger Nemotron failure mode where the model retries the same forbidden command shape dozens of times.
It does not affect models with prompt caching (where redundant views are nearly free).

The justified scope is: 3× more dedupes on the same access pattern, with a hint format the model can actually navigate. That's the measurable win.

This PR was created by an AI agent (OpenHands) on behalf of the user investigating per-instance cost on the Nemotron 550B SWE-Bench Verified eval. v1 design (first commit on this branch) caught 2/17 redundant views in real trajectories; v2 (this PR) catches 6/17 on the same trace via interval-coverage tracking.

When the same `view` would return identical bytes (same path, same `view_range`, no edits between), return a small redirect hint instead of re-emitting the full file content. Invalidate the per-path cache on every successful `write_file` (covers `create`, `str_replace`, `insert`, `undo_edit`; all funnel through `write_file`).

## Why

Trajectory analysis of run `27224350936` (Nemotron 550B SWE-Bench Verified) showed each "expensive" instance re-viewing the same file many times with no edits in between:

- `sklearn/impute/_iterative.py` viewed **34 times**
- `django/db/backends/base/schema.py` viewed **19 times**
- `django/contrib/sessions/backends/base.py` viewed 6 times

A typical Django/sklearn source file is 1–2 KB once `cat -n`-formatted, so 19 redundant views ≈ 20–40 KB re-emitted into the conversation. On a non-cached provider that whole pile is re-paid on every later turn: quadratic context bloat.

## Design

Per `FileEditor` instance, store `{resolved_path: set[response_hash]}`. On a `view`:

1. Compute SHA256 of the response text we'd return.
2. If that hash is already in the path's set: return a short hint that names the file and teaches the recovery paths (scroll back, use a different `view_range`, or `cat` via terminal if you really need the bytes again).
3. Otherwise: add the hash to the set and return the real response.

Key choices:

- **Hash the response text, not the file content.** Different `view_range`s produce different bytes → different hashes → correctly not deduped.
- **`set[str]` per path, not a single slot.** Catches the ABA pattern (view A → view B → view A) which is common when the agent cross-references sections.
- **`write_file` clears the whole set for that path.** Atomic with the file mutation: any subsequent view sees fresh content even if the model edits a region we didn't previously hash.
- **Skip images** (any `ImageContent`) and **never dedupe errors**: the agent must see retry feedback in full.
- **Directories are deduped** too: `ls`-style listings can be multi-KB on big trees and the same model loops apply.

## Tests

`tests/tools/file_editor/test_view_dedupe.py`, 11 new tests:

- Repeat view of unchanged file → hint, full content not re-emitted.
- Hint is < 5% of the original payload it replaces (verified on a 1000-line file; ratio grows with file size).
- All four edit commands (`str_replace`, `create`, `insert`, `undo_edit`) invalidate the cache.
- Different `view_range` is NOT deduped.
- ABA pattern IS deduped.
- Distinct paths track independently.
- Repeat directory listing is deduped.
- Error observations are never deduped.

All 163 existing file_editor tests pass unchanged. Lint + pyright clean.

## Stacks with

- SDK-5 (#3582): literal-arg guard for `terminal.command`.
- SDK-7 (#3619): proactive literal-arg warning in tool description.

Together these address the two largest waste buckets identified in run `27224350936`.

Co-authored-by: openhands <openhands@all-hands.dev>
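
For readers comparing the two designs, the response-hash scheme described in this commit message can be sketched as follows. The class and method names are hypothetical, and the sketch leaves out the image and error handling mentioned above.

```python
from __future__ import annotations

import hashlib


class ResponseHashDedupe:
    """Hypothetical sketch of per-path response-hash dedupe."""

    def __init__(self) -> None:
        self._seen: dict[str, set[str]] = {}

    def is_repeat(self, path: str, response_text: str) -> bool:
        """Return True if this exact response was already returned for the path."""
        digest = hashlib.sha256(response_text.encode("utf-8")).hexdigest()
        hashes = self._seen.setdefault(path, set())
        if digest in hashes:
            return True
        hashes.add(digest)
        return False

    def invalidate(self, path: str) -> None:
        """Forget all hashes for a path, for example after a write."""
        self._seen.pop(path, None)


dedupe = ResponseHashDedupe()
first_a = dedupe.is_repeat("/repo/f.py", "view A")
first_b = dedupe.is_repeat("/repo/f.py", "view B")
second_a = dedupe.is_repeat("/repo/f.py", "view A")  # ABA pattern
dedupe.invalidate("/repo/f.py")
after_write = dedupe.is_repeat("/repo/f.py", "view A")
```

Because the key is the exact response text, two views whose ranges overlap but differ by a single line hash differently and are never matched.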

github-actions · 2026-06-10T03:24:53Z

Python API breakage checks — ✅ PASSED


github-actions · 2026-06-10T03:25:06Z

REST API breakage checks (OpenAPI) — ✅ PASSED


github-actions · 2026-06-10T03:31:38Z

Coverage Report •

File	Stmts	Miss	Cover	Missing
openhands-tools/openhands/tools/file_editor
editor.py	367	273	25%	129, 132, 168–175, 177–178, 188–198, 203–205, 217–234, 245–246, 250–251, 253, 258–259, 262–268, 276, 287–288, 297–305, 310, 316–317, 333, 338, 348–354, 360–368, 370, 388–389, 409–410, 414, 418–419, 428, 432–435, 443–444, 448–450, 456, 459, 464, 467, 470–471, 474, 477–478, 482, 486, 501–503, 510–514, 522, 526–527, 531, 536, 544–545, 547–550, 552–555, 558–559, 568–569, 572–576, 581–584, 586, 593–594, 600–602, 610–614, 618, 620–621, 628, 631, 636–637, 639, 647–649, 651–654, 657, 660–667, 672–674, 693–694, 722–723, 725–726, 732, 735, 739–745, 748–749, 752–757, 760, 763–764, 768, 771–772, 775, 777–778, 784, 788, 807, 810–813, 815, 823, 830, 837, 848–851, 853, 855, 882–885, 894–896, 925–927, 929–938, 943–946, 959–960, 965, 970, 976, 982
TOTAL	30700	15436	49%

## What changed

Replace the v1 per-path response-hash set with two complementary structures:

* `_view_intervals: dict[str, list[(start, end)]]`: sorted, merged disjoint line intervals already shown to the model per text file.
* `_view_listing_hashes: dict[str, set[str]]`: response hashes for directory listings, where the per-line interval model does not apply.

A text-file view now dedupes when the requested line range is ≥70% covered by the union of previously-shown intervals for that file. Directories keep the exact-hash behavior unchanged.

Both structures are dropped for a path on every successful `write_file` to that path, so any edit refreshes the state atomically (covers `create`, `str_replace`, `insert`, `undo_edit`).

## Why

Empirical data from run `27293667611` (Nemotron 550B SWE-Bench Verified on the v1 branch) showed v1 caught only **2 of 17** redundant views of `sklearn/impute/_iterative.py` per task, about 12%. v1's hash key required EXACT response equality, but the model's real access pattern is overlapping-but-not-identical ranges clustered around a few regions of the file:

    cluster A: (110,130) (110,135) (110,185) (115,122) (115,132) (115,135)
    cluster B: (270,300) (270,340) (274,295) (290,350) (294,340)
    cluster C: (560,650) (565,640) (605,630)

Under v2 (this PR), feeding the exact same trace through gives **6 of 17** dedupe hits: a 3× improvement on the same access pattern, without changing the model side at all. The remaining 11 first-views are either (a) the genuine first view of a region or (b) the first view of a file after a write invalidated the cache. Both are correct.

The first follow-up in cluster B is intentionally NOT deduped because it genuinely extends the seen region (lines 301-350 are new content). Only the third, fourth, and fifth, each fully inside the union of earlier views, get the hint. This is the behavior pinned by `test_cluster_pattern_from_real_trace_is_mostly_deduped`.

## Threshold (0.7)

Tuned to err on the side of deduping. False positives (model re-requests with a tighter range from the hint) cost one extra turn; false negatives (full 5–50 KB file dump) cost a payload that's re-paid on every uncached subsequent turn → quadratic context bloat. 70% is the lowest threshold that still leaves clear "you'll see >30% new content" semantics in the hint.

## Hint changes

Two hints now exist:

* `_VIEW_DEDUPE_HINT_INTERVAL`: names the seen ranges and the coverage percentage so the model can pick an actually-new range. Example:

      You have already viewed `/path/foo.py` covering lines [110-185, 270-350]. The requested range [115-135] is 100% inside what you have already been shown. Scroll back...

* `_VIEW_DEDUPE_HINT_LISTING`: directory variant, simpler.

Both stay an order of magnitude smaller than the view they replace (pinned by `test_hint_is_an_order_of_magnitude_smaller`).

## Tests

`tests/tools/file_editor/test_view_dedupe_intervals.py`, 27 new tests across three layers:

* Unit-level: `_merge_intervals` (7 cases) and `_coverage_fraction` (7 parametrized cases), the interval algebra dedupe relies on.
* Behavior: subset/superset/overlap/disjoint cases; the growing-range pattern; the real Nemotron cluster B pattern reproduced verbatim; end-minus-one normalization; whole-file then partial; partial then whole-file; multi-file isolation.
* Hint quality: hint lists seen ranges; size ratio < 10% on 1000-line file.

`tests/tools/file_editor/test_view_dedupe.py` (v1 tests, kept): every test still passes unchanged. The v1 cases (exact repeats, write invalidation, ABA pattern, error pass-through, directory dedupe, per-path isolation) are all special cases of the v2 design.

All 192 file_editor tests pass. Lint + pyright clean.

## Honest scope

This redesign is informed by the v1-on-Nemotron data from run `27293667611` (linked in the PR description). 3× catch-rate improvement on the dominant waste pattern is real; the absolute cost impact on a single task is bounded by how often the model edits the same file (each edit invalidates and forces a fresh view). Best-case savings on a non-caching provider are still meaningful (~5K tokens × N future turns per dedupe), but the v2 design also doesn't claim to fix the larger structural issue (model retries the same forbidden command shape dozens of times). That's a separate lever (SDK-8 hard cooldown).

Co-authored-by: openhands <openhands@all-hands.dev>

juanmichelini added the enhancement New feature or request label Jun 10, 2026 — with OpenHands AI

juanmichelini and others added 2 commits June 10, 2026 14:13

Merge branch 'main' into openhands/sdk-6-file-editor-view-dedupe

3336683

juanmichelini changed the title ~~DRAFT: feat(file_editor): dedupe repeated view of unchanged files~~ DRAFT: feat(file_editor): interval-coverage dedupe for repeated view Jun 11, 2026

juanmichelini wants to merge 3 commits into main from openhands/sdk-6-file-editor-view-dedupe


2 participants
