Skip to content

DRAFT: feat(file_editor): interval-coverage dedupe for repeated view#3620

Draft
juanmichelini wants to merge 3 commits into
mainfrom
openhands/sdk-6-file-editor-view-dedupe
Draft

DRAFT: feat(file_editor): interval-coverage dedupe for repeated view#3620
juanmichelini wants to merge 3 commits into
mainfrom
openhands/sdk-6-file-editor-view-dedupe

Conversation

@juanmichelini

@juanmichelini juanmichelini commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

TL;DR

When a file_editor view would re-show content that's already substantially covered by previous views (same file, no edits between), return a small redirect hint naming the seen ranges instead of re-emitting the full content. Coverage threshold: 70% of the requested line range already shown.

This is a redesign of the original v1 commit (3c3003b9) which used exact response-hash matching. Empirical data showed v1 caught only ~12% of redundant views in real Nemotron 550B trajectories. v2 catches ~35% on the same trace — 3× improvement with the same model and same access pattern.

Why the redesign was needed

v1 keyed dedupe by exact response hash. Real Nemotron access patterns on sklearn/impute/_iterative.py look like this (from run 27293667611, conversation 0e1b09d4):

cluster A: (110,130) (110,135) (110,185) (115,122) (115,132) (115,135)
cluster B: (270,300) (270,340) (274,295) (290,350) (294,340)
cluster C: (560,650) (565,640) (605,630)

17 views, 15 distinct view_ranges, all clustered around three small regions of the file. v1 caught 2/17 (the only exact repeats). The model genuinely doesn't need fresh content for any of the cluster B follow-ups after seeing (270, 350) — but v1 had no way to recognize this.

What v2 does differently

Per text-file path, store a sorted, merged list of disjoint line intervals already shown (_view_intervals: dict[str, list[(start, end)]]). On a new view:

  1. Compute the requested interval (mirroring view()'s own normalization — handles view_range=None(1, num_lines), [start, -1](start, num_lines), etc.).
  2. Compute coverage = (lines in requested ∩ union(seen)) / len(requested).
  3. If coverage ≥ 0.7 and seen is non-empty → return interval hint naming the seen ranges + the coverage percentage.
  4. Otherwise → return real content, merge the new interval into seen.

Directories are handled separately with the v1 response-hash approach (_view_listing_hashes) — there's no per-line model for a listing, and the listing text naturally differs when the directory changes, so hash mismatch correctly skips dedupe on stale state.

write_file clears both structures for that path. Every edit command (create, str_replace, insert, undo_edit) goes through write_file, so any edit refreshes the state atomically.

Hint format

[file_editor view dedupe] You have already viewed `/path/to/_iterative.py`
covering lines [110-185, 270-350]. The requested range [115-135] is 100%
inside what you have already been shown. Scroll back to your earlier
`view` observation rather than re-loading. To see truly new content,
request a `view_range` outside [110-185, 270-350]. If the file has been
changed externally (which would not be reflected in your prior view),
run `cat /path/to/_iterative.py` via the terminal tool.

Stays under 600 bytes regardless of file size. On a 1000-line file (~20 KB view), that's ~3% of the original payload.

Real-trace replay

Feeding the EXACT chronological sequence of file_editor actions from the SDK-6 v1 run (27293667611, the scikit-25232 instance) through v2:

Mechanism Hints on _iterative.py Catch rate
v1 (response hash, shipped previously) 2 / 17 12%
v2 (this PR) 6 / 17 35%

The remaining 11 first-views are either (a) the genuine first view of a region, or (b) the first view of a file after an intervening edit invalidated the cache — both correct behavior. Approximate per-conversation savings: ~5,250 tokens (the hint replaces ~4 KB views with ~500 B hints, 6× per conversation).

This is a meaningful improvement but I want to be honest about its ceiling: the absolute cost impact is bounded by how often the model edits the same file (each edit forces a fresh view). On a single SWE-Bench task with ~30 file edits, dedupe gives a smaller win than I projected in the v1 PR. The larger structural issue — the model retries the same forbidden command shape dozens of times in a single conversation, paying for each retry — is a separate lever and not addressed here.

Threshold

The dedupe coverage threshold is 70%, hard-coded as _DEDUPE_COVERAGE_THRESHOLD. Reasoning:

  • False positive (we dedupe but model wanted new content): one extra turn — model re-requests with a tighter range from the hint. Cheap.
  • False negative (we let through a redundant view): a 5–50 KB payload re-paid on every uncached subsequent turn. Quadratic in conversation length.

False negatives are much worse than false positives, so we err toward deduping. 70% is the lowest threshold that still leaves clear "you'll see >30% new content" semantics in the hint text.

Tests

Two test files, 40 tests total, all green.

tests/tools/file_editor/test_view_dedupe_intervals.py — 27 new tests:

  • Unit-level: _merge_intervals (7 cases — empty, single, disjoint, overlapping, abutting, nested, real-trace cluster) and _coverage_fraction (7 parametrized cases covering subset, superset, partial, disjoint, multi-interval).
  • Overlap-aware behavior: subset-of-seen → dedupe; growing-range within seen → dedupe; cluster-B pattern reproduced verbatim with expected [real, real, hint, hint, hint] sequence; below-threshold partial → no dedupe; above-threshold partial → dedupe; seen-grows-monotonically with multi-interval coverage; whole-file-then-partial; partial-then-whole; view_range=[1, -1] normalization; multi-file isolation.
  • Hint quality: hint lists seen ranges in compact form; size ratio < 10% on 1000-line file.
  • Edit invalidation: str_replace clears the entire interval set for the path.

tests/tools/file_editor/test_view_dedupe.py — the 11 v1 tests, all unchanged, all still passing. The v1 cases (exact repeats, ABA pattern, write invalidation across all edit commands, errors never deduped, directories, per-path isolation, hint smaller than view) are all special cases of the v2 design.

All 192 file_editor tests pass. Lint + pyright clean.

Risk

  • No schema change. FileEditorObservation shape is identical.
  • In-process state onlyFileEditor instances are scoped to a single conversation.
  • Memory bound — merged intervals stay small even after many views (the Nemotron pattern collapses to ≤3 intervals per file). Cleared on every write.
  • No silent action loss — every dedupe is logged at INFO level (file_editor view dedupe: <path> requested=<range> coverage=<pct>%) so eval pipelines can grep for hit-rate metrics.

Verify locally

uv run pytest tests/tools/file_editor/test_view_dedupe.py tests/tools/file_editor/test_view_dedupe_intervals.py -v
# 40 passed
uv run pytest tests/tools/file_editor/ -q
# 192 passed

Spot-check the hint:

uv run python -c "
from openhands.tools.file_editor.editor import FileEditor
import tempfile, pathlib
with tempfile.TemporaryDirectory() as d:
    p = pathlib.Path(d) / 'demo.py'
    p.write_text('\n'.join(f'line_{i}' for i in range(1, 301)) + '\n')
    e = FileEditor()
    e(command='view', path=str(p), view_range=[110, 185])
    e(command='view', path=str(p), view_range=[270, 350])
    print(e(command='view', path=str(p), view_range=[115, 135]).text)
"

What this PR does NOT claim

  • It does not claim a specific $/instance reduction. The previous PR description's "+$11" projection turned out to be optimistic; the cost win is bounded by the edit-then-view cadence and is roughly proportional to the conversation length on non-caching providers.
  • It does not address the larger Nemotron failure mode where the model retries the same forbidden command shape dozens of times.
  • It does not affect models with prompt caching (where redundant views are nearly free).

The justified scope is: 3× more dedupes on the same access pattern, with a hint format the model can actually navigate. That's the measurable win.


This PR was created by an AI agent (OpenHands) on behalf of the user investigating per-instance cost on the Nemotron 550B SWE-Bench Verified eval. v1 design (first commit on this branch) caught 2/17 redundant views in real trajectories; v2 (this PR) catches 6/17 on the same trace via interval-coverage tracking.

When the same `view` would return identical bytes (same path, same
`view_range`, no edits between), return a small redirect hint instead
of re-emitting the full file content. Invalidate the per-path cache on
every successful `write_file` (covers `create`, `str_replace`,
`insert`, `undo_edit` — all funnel through `write_file`).

## Why

Trajectory analysis of run `27224350936` (Nemotron 550B SWE-Bench
Verified) showed each "expensive" instance re-viewing the same file
many times with no edits in between:

  - `sklearn/impute/_iterative.py` viewed **34 times**
  - `django/db/backends/base/schema.py` viewed **19 times**
  - `django/contrib/sessions/backends/base.py` viewed 6 times

A typical Django/sklearn source file is 1–2 KB once `cat -n`-formatted,
so 19 redundant views ≈ 20–40 KB re-emitted into the conversation.
On a non-cached provider that whole pile is re-paid on every later
turn — quadratic context bloat.

## Design

Per `FileEditor` instance, store `{resolved_path: set[response_hash]}`.
On a `view`:

  1. Compute SHA256 of the response text we'd return.
  2. If that hash is already in the path's set: return a short hint that
     names the file and teaches the recovery paths (scroll back, use a
     different `view_range`, or `cat` via terminal if you really need
     the bytes again).
  3. Otherwise: add the hash to the set and return the real response.

Key choices:

  - **Hash the response text, not the file content.** Different
    `view_range`s produce different bytes → different hashes →
    correctly not deduped.
  - **`set[str]` per path, not a single slot.** Catches the ABA
    pattern (view A → view B → view A) which is common when the agent
    cross-references sections.
  - **`write_file` clears the whole set for that path.** Atomic with
    the file mutation — any subsequent view sees fresh content even if
    the model edits a region we didn't previously hash.
  - **Skip images** (any `ImageContent`) and **never dedupe errors**
    — the agent must see retry feedback in full.
  - **Directories are deduped** too: `ls`-style listings can be
    multi-KB on big trees and the same model loops apply.

## Tests

`tests/tools/file_editor/test_view_dedupe.py` — 11 new tests:

  - Repeat view of unchanged file → hint, full content not re-emitted.
  - Hint is < 5% of the original payload it replaces (verified on a
    1000-line file; ratio grows with file size).
  - All four edit commands (`str_replace`, `create`, `insert`,
    `undo_edit`) invalidate the cache.
  - Different `view_range` is NOT deduped.
  - ABA pattern IS deduped.
  - Distinct paths track independently.
  - Repeat directory listing is deduped.
  - Error observations are never deduped.

All 163 existing file_editor tests pass unchanged. Lint + pyright clean.

## Stacks with

  - SDK-5 (#3582) — literal-arg guard for `terminal.command`.
  - SDK-7 (#3619) — proactive literal-arg warning in tool description.
    Together these address the two largest waste buckets identified in
    run `27224350936`.

Co-authored-by: openhands <openhands@all-hands.dev>
@juanmichelini juanmichelini added the enhancement New feature or request label Jun 10, 2026 — with OpenHands AI
@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Python API breakage checks — ✅ PASSED

Result:PASSED

Action log

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

REST API breakage checks (OpenAPI) — ✅ PASSED

Result:PASSED

Action log

@github-actions

github-actions Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-tools/openhands/tools/file_editor
   editor.py36727325%129, 132, 168–175, 177–178, 188–198, 203–205, 217–234, 245–246, 250–251, 253, 258–259, 262–268, 276, 287–288, 297–305, 310, 316–317, 333, 338, 348–354, 360–368, 370, 388–389, 409–410, 414, 418–419, 428, 432–435, 443–444, 448–450, 456, 459, 464, 467, 470–471, 474, 477–478, 482, 486, 501–503, 510–514, 522, 526–527, 531, 536, 544–545, 547–550, 552–555, 558–559, 568–569, 572–576, 581–584, 586, 593–594, 600–602, 610–614, 618, 620–621, 628, 631, 636–637, 639, 647–649, 651–654, 657, 660–667, 672–674, 693–694, 722–723, 725–726, 732, 735, 739–745, 748–749, 752–757, 760, 763–764, 768, 771–772, 775, 777–778, 784, 788, 807, 810–813, 815, 823, 830, 837, 848–851, 853, 855, 882–885, 894–896, 925–927, 929–938, 943–946, 959–960, 965, 970, 976, 982
TOTAL307001543649% 

juanmichelini and others added 2 commits June 10, 2026 14:13
## What changed

Replace the v1 per-path response-hash set with two complementary
structures:

  * `_view_intervals: dict[str, list[(start, end)]]` — sorted, merged
    disjoint line intervals already shown to the model per text file.
  * `_view_listing_hashes: dict[str, set[str]]` — response hashes for
    directory listings, where the per-line interval model does not apply.

A text-file view now dedupes when the requested line range is ≥70% covered
by the union of previously-shown intervals for that file. Directories keep
the exact-hash behavior unchanged. Both structures are dropped for a path
on every successful `write_file` to that path, so any edit refreshes the
state atomically (covers `create`, `str_replace`, `insert`,
`undo_edit`).

## Why

Empirical data from run `27293667611` (Nemotron 550B SWE-Bench Verified
on the v1 branch) showed v1 caught only **2 of 17** redundant views of
`sklearn/impute/_iterative.py` per task — about 12%. v1's hash key
required EXACT response equality, but the model's real access pattern is
overlapping-but-not-identical ranges clustered around a few regions of
the file:

    cluster A: (110,130) (110,135) (110,185) (115,122) (115,132) (115,135)
    cluster B: (270,300) (270,340) (274,295) (290,350) (294,340)
    cluster C: (560,650) (565,640) (605,630)

Under v2 (this PR), feeding the exact same trace through gives
**6 of 17** dedupe hits — a 3× improvement on the same access pattern,
without changing the model side at all. The remaining 11 first-views are
either (a) the genuine first view of a region or (b) the first view of a
file after a write invalidated the cache — both correct.

The first follow-up in cluster B is intentionally NOT deduped because it
genuinely extends the seen region (lines 301-350 are new content). Only
the third, fourth, and fifth — each fully inside the union of earlier
views — get the hint. This is the behavior pinned by
`test_cluster_pattern_from_real_trace_is_mostly_deduped`.

## Threshold (0.7)

Tuned to err on the side of deduping. False positives (model re-requests
with a tighter range from the hint) cost one extra turn; false negatives
(full 5–50 KB file dump) cost a payload that's re-paid on every uncached
subsequent turn → quadratic context bloat. 70% is the lowest threshold
that still leaves clear "you'll see >30% new content" semantics in the
hint.

## Hint changes

Two hints now exist:

  * `_VIEW_DEDUPE_HINT_INTERVAL` — names the seen ranges and the
    coverage percentage so the model can pick an actually-new range.
    Example: `You have already viewed `/path/foo.py` covering lines
    [110-185, 270-350]. The requested range [115-135] is 100% inside
    what you have already been shown. Scroll back...`
  * `_VIEW_DEDUPE_HINT_LISTING` — directory variant, simpler.

Both stay an order of magnitude smaller than the view they replace
(pinned by `test_hint_is_an_order_of_magnitude_smaller`).

## Tests

`tests/tools/file_editor/test_view_dedupe_intervals.py` — 27 new tests
across three layers:

  * Unit-level: `_merge_intervals` (7 cases) and `_coverage_fraction`
    (7 parametrized cases) — the interval algebra dedupe relies on.
  * Behavior: subset/superset/overlap/disjoint cases; the growing-range
    pattern; the real Nemotron cluster B pattern reproduced verbatim;
    end-minus-one normalization; whole-file then partial; partial then
    whole-file; multi-file isolation.
  * Hint quality: hint lists seen ranges; size ratio < 10% on 1000-line
    file.

`tests/tools/file_editor/test_view_dedupe.py` (v1 tests, kept) — every
test still passes unchanged. The v1 cases (exact repeats, write
invalidation, ABA pattern, error pass-through, directory dedupe, per-
path isolation) are all special cases of the v2 design.

All 192 file_editor tests pass. Lint + pyright clean.

## Honest scope

This redesign is informed by the v1-on-Nemotron data from run
`27293667611` (linked in the PR description). 3× catch-rate improvement
on the dominant waste pattern is real; the absolute cost impact on a
single task is bounded by how often the model edits the same file (each
edit invalidates and forces a fresh view). Best-case savings on a
non-caching provider are still meaningful — ~5K tokens × N future turns
per dedupe — but the v2 design also doesn't claim to fix the larger
structural issue (model retries the same forbidden command shape dozens
of times). That's a separate lever (SDK-8 hard cooldown).

Co-authored-by: openhands <openhands@all-hands.dev>
@juanmichelini juanmichelini changed the title DRAFT: feat(file_editor): dedupe repeated view of unchanged files DRAFT: feat(file_editor): interval-coverage dedupe for repeated view Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants