Skip to content

Function-call edits (structured replacements, not full rewrite)#31

Closed
bddap-bot wants to merge 4 commits into
bddap:mainfrom
bddap-bot:tool-edits
Closed

Function-call edits (structured replacements, not full rewrite)#31
bddap-bot wants to merge 4 commits into
bddap:mainfrom
bddap-bot:tool-edits

Conversation

@bddap-bot

Copy link
Copy Markdown
Contributor

Makes refac express changes as function-call edits instead of re-emitting the whole input, as requested.

Stacks on #30 (Anthropic backend) — merge that first, or merge this to get both.

How it works

  • The model is given the selected text + transform and must call an apply_edits tool returning a list of exact-substring replacements ({old, new}).
  • refac applies them in order to the original text (apply_edits in main.rs), erroring if an old isn't found verbatim — so a stale/hallucinated edit fails loudly instead of silently no-op'ing.
  • New default on Anthropic: edit_mode = "tool". Set edit_mode = "rewrite" (or REFAC_EDIT_MODE=rewrite) for the previous full-rewrite behavior. The OpenAI path always rewrites.

Details

  • anthropic.rs: request_edits() adds the tool + forces tool_choice, parses the tool_use input into Vec<Edit>; HTTP factored into a shared send().
  • prompt.rs: an edit-focused system prompt (keeps refac's personality; no rewrite few-shot).
  • Edit-mode skips the full-text few-shot, so requests are much smaller.

Testing (live, with the provided key)

  • refac tor "Me like toast." "Correct grammar."I like toast. (one substring edit)
  • refac tor "<def add…>" "add a docstring" → inserts a docstring, rest untouched:
    def add(a, b):
        """Return the sum of a and b. Groundbreaking stuff, truly. --refac"""
        return a + b
    
  • cargo test → 6 passing (edit application order, insert-via-anchor, delete-via-empty, missing-target error, request shaping).

🤖 Generated with Claude Code

bddap-bot and others added 4 commits May 29, 2026 10:44
refac now defaults to the Claude Messages API (model claude-opus-4-8); OpenAI
remains available via `provider = "openai"`. New `src/anthropic.rs` talks to the
REST API with reqwest (no official Rust SDK): x-api-key + anthropic-version
headers, top-level `system`, and required `max_tokens`. It adapts refac's flat
message list to Anthropic's shape — lifts the system prompt out of messages and
merges the consecutive user turns to satisfy user/assistant alternation — and
marks the static system prompt + few-shot examples `cache_control: ephemeral`
so repeat calls only pay for the varying input. Config gains `provider`, optional
`model` (defaulted per provider), and `max_tokens`; secrets hold either/both keys.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Some few-shot samples have an empty `selected`; Anthropic rejects empty text
content blocks (OpenAI tolerated them), so skip empty messages when building the
request. Add a REFAC_DEBUG env that dumps the request JSON. Verified end-to-end:
'Me like toast.' / 'Correct grammar.' -> 'I like toast.'
Adds an `apply_edits` tool the model must call, returning a list of exact
substring replacements that refac applies to the input — so it no longer
re-emits the entire selected text. New default on the Anthropic provider
(edit_mode = "tool"); set edit_mode = "rewrite" (or REFAC_EDIT_MODE=rewrite)
for the old behavior. OpenAI always rewrites.

- anthropic.rs: `request_edits` forces `tool_choice` to `apply_edits`, parses
  the tool_use input into `Vec<Edit>`. HTTP factored into a shared `send`.
- prompt.rs: edit-mode system prompt (no rewrite few-shot).
- main.rs: `apply_edits` applies edits in order, erroring if a target isn't found.

Live-tested against the API: "Me like toast."/"Correct grammar." -> "I like
toast."; "add a docstring" inserts one without touching the rest. cargo test
green (6 tests incl edit application + missing-target error).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two rounds of concurrent reviewers (taste/Niko, correctness/API, Rust idiom);
findings triaged and the real ones fixed:

- apply_edits: resolve every `old` against the ORIGINAL text and reject empty,
  missing, ambiguous (>1 occurrence), or overlapping edits; apply right-to-left.
  Edits are now order-independent — closes a silent buffer-corruption hazard.
- Check `stop_reason`: error on `max_tokens` truncation in complete + request_edits
  (a truncated rewrite or half-finished tool call was silently returned/mis-parsed).
- login: `--provider` flag + state which provider's key is being saved (avoids an
  upgrade footgun where an OpenAI key gets stored as the Anthropic one).
- Unknown REFAC_PROVIDER / REFAC_EDIT_MODE now error instead of silently defaulting.
- Unify the tool name to APPLY_EDITS; rewrite the tool/system prompt to match the
  new unique-and-independent edit contract. Delete dead OpenAI edits structs
  (clears all build warnings). Extract DEFAULT_ANTHROPIC_MODEL.
- Tests: ambiguous/empty/overlapping edits + resolve-against-original. 10 passing.

Re-tested live against the API after the changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@bddap

bddap commented Jun 2, 2026

Copy link
Copy Markdown
Owner

Time for a rewrite atop the merged anthropic pr. Consider including a wide-scope refactor before this pr to pave the way for this change. Common interface to both providers might be needed, put your software design hat on.

@bddap-bot

Copy link
Copy Markdown
Contributor Author

Plan, software-design hat on. Splitting into two stacked PRs as you suggested:

PR A — refactor: a common Backend interface (no behavior change).
Today main.rs::refactor() matches on Provider, re-derives the key, and calls a free anthropic::complete / openai::complete. That match-and-duplicate is exactly what grows ugly when a second call shape (tool edits) lands. So first:

  • trait Backend { fn complete(&self, &[Message]) -> Result<String>; } in a new backend.rs.
  • Each provider module owns its struct (Anthropic{key,model} / Openai{key,model}) and impl Backend.
  • backend::resolve(provider, model, &secrets) -> Result<Box<dyn Backend>> holds the key-extraction + "no key" errors in one place. refactor() collapses to resolve(...)?.complete(&messages).
  • Pure refactor: existing tests stay green, wire formats unchanged.

PR B — rewrite #31 atop A. Function-call edits become the natural second capability on the interface (Anthropic-only request_editsVec<Edit>; the substring-apply logic is provider-agnostic and moves to its own module). OpenAI keeps rewriting. This drops the #31-vs-#30 overlap since anthropic.rs already lives on main now.

Starting on A now. Will run my reviewer loop before flagging either for your review.

@bddap

bddap commented Jun 2, 2026

Copy link
Copy Markdown
Owner

I don't think you'll want to write the funtion-call loop twice. Consider a common abstraction that takes function-calling into account.

@bddap-bot

Copy link
Copy Markdown
Contributor Author

PR A is up: #32 (the paving refactor, no behavior change). PR B = this branch rewritten atop it.

@bddap-bot

Copy link
Copy Markdown
Contributor Author

Right, and it corrects a wrong assumption I baked into A. Function-calling isn't Anthropic-only — OpenAI has tools too. So the abstraction shouldn't be an "Anthropic-only edits" capability; it should make a single tool-call round-trip a first-class Backend operation that both providers implement, and the function-call loop lives once on top of it.

Revised design for B:

trait Backend {
    fn complete(&self, &[Message]) -> Result<String>;
    /// One forced tool-call round-trip: hand the model a tool schema, get back
    /// the parsed call arguments. Each provider does this in its own wire format
    /// (Anthropic tools + tool_choice; OpenAI tools + tool_choice "required").
    fn call_tool(&self, &[Message], tool: &Tool) -> Result<serde_json::Value>;
}
  • The edit flow — build the apply_edits tool schema, call it, parse Vec<Edit>, apply to the original (erroring on a missing old) — is provider-agnostic and written once, atop call_tool. No per-provider loop.
  • Both providers get edit mode; edit_mode = tool|rewrite stops being provider-gated.
  • A's backend.rs doc comment currently justifies Box<dyn> over an enum by calling edits "Anthropic-only." That's now wrong — both implement call_tool, so it's a uniform trait. Fixing that comment on A (the dyn-vs-enum call stands on plain extensibility, not capability split).

Noodle away on A. This is the shape I'll build B in once you're settled.

@bddap-bot

Copy link
Copy Markdown
Contributor Author

Ran the debate club — three advocates (sans-io step-enum / minimal two-method / protocol-transport split). Synthesis + recommendation.

Verdict: protocol/transport split (true sans-io) + an Outcome { ToolCall | Done } sum type, scoped one-shot.

Your two words settled it:

  • "sans-io" is a term of art (h11, quinn): no IO in the protocol layer. Only the split delivers it — render(conv, tools) -> Request and parse(resp) -> Outcome are pure; a generic Transport does the HTTP. The rival step(conv,tools)->Step returns the right type but still opens the socket inside the method — sans-io in name only, and still un-unit-testable.
  • "tool-call or done as output" = enum Outcome { ToolCall { id, name, input }, Done(String) }. Illegal states gone. Rewrite = Done; edits = ToolCall. One path.
pub enum Outcome { ToolCall { id: String, input: ApplyEdits }, Done(String) }

pub trait Provider {
    fn render(&self, conv: &[Turn], tools: &[Tool]) -> HttpRequest; // pure
    fn parse(&self, status: u16, body: &[u8]) -> Result<Outcome>;   // pure
}
pub trait Transport { fn execute(&self, req: HttpRequest) -> Result<(u16, Vec<u8>)>; }
// one generic driver: render -> execute -> parse. function-call handling written ONCE.

Payoff: two providers with divergent, fiddly tool wire-formats (Anthropic tool_use blocks + x-api-key/version; OpenAI tool_calls + bearer) become testable with zero network — assert exact request JSON, feed canned response bytes to parse. And it's mostly finishing a pattern the repo already half-has: openai.rs already goes through Endpoint/Req (pure render) + Client (transport); anthropic.rs hand-rolls its own reqwest. The split generalizes ReqHttpRequest, promotes the response side to a typed parse, and deletes anthropic.rs's duplicated transport. Net less code, one consistent path.

Where I overrule the advocates

  • vs two-method (complete+call_tool): its YAGNI instinct is right but mis-aimed — it saves a ~15-line driver by burying IO in 2 methods × 2 providers = four copies of client/auth/timeout boilerplate, and it drops your "or done" output type (forced tool, no Done arm). Rejected on your own stated output shape.
  • But I keep its core discipline: no multi-turn agent loop. refac is one-shot. The driver does render→execute→parse→(ToolCall: apply edits, terminal / Done: return text). The function-call round-trip is written once, provider-agnostic ("don't write it twice" ✓) — without a turn-feedback engine wired to nothing.

Typed tool

refac has exactly one tool (apply_edits, known schema). Parse straight into typed ApplyEdits { edits: Vec<Edit { old, new }> } at the wire seam; keep serde_json::Value only as the transient decode buffer, never in the loop. Malformed call = typed error at the seam, not a panic in apply.

One decision for you — apply-error policy

When an edit's old isn't found verbatim (model paraphrased / whitespace drift):

The split supports either; (a) keeps refac's one-shot determinism, (b) needs the bounded loop. I lean (a) for a fast editor binding, option (b) held open. Your call.

Knock-on for PR A (#32)

The complete()->String trait there becomes a stepping stone — B reshapes Backend into this sans-io Provider. A still merges cleanly as the dispatch/key-resolution cleanup, or I fold it into B. Say which.

@bddap-bot

Copy link
Copy Markdown
Contributor Author

Round 2 — open ideation (3 agents: question-the-premise / type-leverage / orchestration). It found real design-space we'd skipped. The headline: round 1 argued about the Backend trait; round 2 says the higher-leverage axes are the edit mechanism, the apply robustness, and making the seam data. The trait shape is downstream of those.

Three ideas worth more than the trait debate

1. Maybe don't use function-calling at all. Two alternatives to "model calls an apply_edits tool":

  • SEARCH/REPLACE blocks (aider-style) in the normal text channel: model emits <<<<<<< SEARCH / ======= / >>>>>>> REPLACE fenced pairs; refac parses + applies. Zero Backend/wire changes, works on every model including no-tool ones, and leans on the few-shot mechanism that already drives refac (the model imitates Sample.result verbatim — far more reliable than a JSON-schema description). Smallest possible diff to the actual goal ("stop re-emitting the blob").
  • Locate-then-edit (offsets). Call 1: model returns char/line ranges relevant to the transform. refac slices them (refac holds the authoritative offsets). Call 2: rewrite only those slices, splice back. This eliminates the entire "old substring didn't match" failure class — the model never reproduces existing text verbatim. Most robust, biggest token win; cost is 2 round-trips.

Both dodge the thing that makes tool-edits fragile: the model has to echo old byte-exact, and whitespace drift / paraphrase breaks it.

2. Correctness lives in the apply step, not the trait — and we all ignored it. Whatever produces {old,new}, the bug surface is: old not found / found multiple times / two edits overlap. Push that into a type:

struct Plan<'src> { src: &'src str, spans: Vec<(Range<usize>, String)> }
impl<'src> Plan<'src> {
    fn resolve(src: &'src str, edits: Vec<RawEdit>) -> Result<Plan, EditError>; // only constructor; fails on NotFound/Ambiguous/Overlap
    fn apply(self) -> String;   // total, no Result — illegal states already gone
}

Highest correctness-per-line, orthogonal to every transport decision. Adopt regardless.

3. The seam should be data, not a behavior — and the cleanest sans-io is a resumable state machine. Instead of render/parse/Transport (3 concepts) or a complete() method, a value you step:

enum Step { Call(HttpRequest), Done(String) }
impl Refac {
    fn start(sel, instr) -> (Self, Step);
    fn resume(self, resp: HttpResponse) -> Result<(Self, Step)>;
}
// driver: while let Step::Call(req) = step { step = m.resume(send(req)?)?.1 }

Single-shot today = one Call then Done. But "apply edits, and if an old missed, re-ask" becomes expressible without committing to a loop now, and record/replay testing falls out for free (feed canned HttpResponses, assert the HttpRequest sequence + final text — zero network). This generalizes round-1's split to N steps and supersedes it.

Two sharper consequences

  • Typed tools force enum-over-dyn. If we keep function-calling and make it typed (call<T: Tool> with schema derived from the Rust type via schemars — kills schema-vs-parse drift), generic methods aren't object-safe, so Box<dyn Backend> can't work — you need enum Backend { Anthropic, Openai }. That directly revises PR A's dyn choice.
  • Several existing few-shot samples (roast, advice, what does this do?) emit whole-file commentary, not localized edits. Any structured-edit mode MUST keep a whole-rewrite fallback or those core behaviors regress. (Outcome::Done / a WholeText format already covers this — but it means rewrite and edit must coexist, not replace.)
  • refac already logs every (selected, transform, output) as JSONL (main.rs). That's a ready-made eval corpus to A/B edit formats and capture req/resp cassettes — the empirical answer to "which edit language do models actually apply reliably."

Where this lands the recommendation

The round-1 verdict (sans-io + Outcome{ToolCall|Done}) survives but gets sharpened and de-prioritized: the trait/transport shape matters less than (a) picking the edit mechanism and (b) the validated-Plan apply step. My updated lean:

  • Make the seam data (resumable Step machine, idea 3) — keeps record/replay + future-retry without lock-in.
  • Adopt the validated Plan (idea 2) no matter what.
  • Prototype the mechanism question empirically rather than decide it in the abstract: SEARCH/REPLACE blocks vs typed tool-call, scored on the existing JSONL log corpus for apply-success rate. The few-shot-friendliness of SEARCH/REPLACE makes me suspect it beats tool-calls on reliability and is a smaller diff — but that's a measurement, not an opinion.

Want me to (i) build a quick apply-success bake-off (SEARCH/REPLACE vs tool-call) over the logs, or (ii) just commit to one mechanism and build it? And the apply-error (a)/(b) + PR-A-fold questions from the last comment still stand.

@bddap-bot

Copy link
Copy Markdown
Contributor Author

Edit-tool design (stolen from opencode + claude-code)

Researched both. The split is clean: claude-code owns the tool contract (exact match, unique-or-error, replace_all), opencode owns the apply robustness (a fallback chain of progressively fuzzier matchers so the model's old doesn't have to be byte-perfect). refac wants both.

The contract (from claude-code's str_replace / Edit)

  • old must match the source exactly as written (whitespace + indentation included) — before fuzzy fallback.
  • old must resolve to exactly one location. Multiple → error, unless replace_all. The model disambiguates by including more surrounding context, exactly as claude-code trains it to.
  • new must differ from old (no-op = error). Empty new = deletion. Insertion = anchor on an existing substring and echo it back inside new (no separate insert op).
  • A hallucinated/edited old that doesn't match fails loudly — never silently mis-applies.

The apply algorithm (from opencode's replace())

opencode tries a chain of "replacers," each a generator that yields candidate old strings to look for; the first candidate that lands a unique hit wins. Ordered exact → fuzzy so a precise match always beats a loose one:

  1. Simpleold verbatim.
  2. LineTrimmed — match line-by-line ignoring each line's leading/trailing whitespace; splice the original lines.
  3. BlockAnchor — for 3+ line blocks, anchor on first & last line, accept the middle by Levenshtein similarity (≥0.3 when ambiguous). Survives a reworded interior.
  4. WhitespaceNormalized — collapse all runs of whitespace to one space, then match.
  5. IndentationFlexible — strip common leading indent from both sides, match dedented.
  6. (opencode also has EscapeNormalized / TrimmedBoundary / ContextAware / MultiOccurrence — marginal for us; MultiOccurrence is just the replace_all path.)

Resolution per claude-code semantics: within the first replacer that yields any hit, require index == lastIndexOf (unique) else skip → next replacer; if a replacer yields a unique hit, apply and stop. Exhaust all → "could not find old" error.

For refac I'd ship the core 5 (1–5); they cover the real failure modes (indent drift, reflowed whitespace, reworded block) and skip the long tail.

Rust shape

/// A replacer yields candidate substrings of `src` to try matching `old` against,
/// fuzzy-to-exact. First candidate with a UNIQUE occurrence wins.
type Replacer = fn(src: &str, old: &str) -> Vec<String>;

const CHAIN: &[Replacer] = &[simple, line_trimmed, block_anchor, ws_normalized, indent_flexible];

fn find_unique<'s>(src: &'s str, old: &str) -> Result<Range<usize>, EditError> {
    for replacer in CHAIN {
        for cand in replacer(src, old) {
            let mut it = src.match_indices(&cand);
            if let Some((i, _)) = it.next() {
                if it.next().is_none() {                  // unique
                    return Ok(i..i + cand.len());
                }                                          // else ambiguous → keep trying
            }
        }
    }
    Err(EditError::NotFound(old.into()))
}

This is exactly the round-2 Plan idea with opencode's matchers as the resolver. replace_all swaps the unique check for "replace every occurrence of the matched candidate."

refac-specific calls (need your read)

  • Batch vs single. claude-code's Edit is one {old,new} per call; refac does one model round-trip, so I'd take a batch: apply_edits({ edits: [{old, new, replace_all?}] }) — all changes in one tool call.
  • Sequential vs offset-Plan. opencode applies edits sequentially (each sees the prior result) — simplest, matches model expectations, and lets a later edit target text an earlier one produced. The round-2 Plan (resolve all vs the original, require non-overlapping, apply by descending offset) instead catches conflicting/overlapping edits up front but forbids dependent edits. I lean sequential (opencode) for refac; it's what the models are tuned for. Overlap just surfaces as a later not-found.
  • Error policy = the open (a)/(b). After the chain exhausts: (a) abort loudly [orig Function-call edits (structured replacements, not full rewrite) #31] or (b) feed EditError back as a tool_result and let the model retry once. The fuzzy chain shrinks how often (b) is even needed.

Tool description (the prompt the model sees) — steal claude-code's precision

Replace exact substrings in the selected text. For each edit, old must appear in the text; copy it verbatim including whitespace and indentation. Make old long enough to be unique, or set replace_all. new is what replaces it — empty to delete; to insert, include surrounding text in both old and new. Edits apply in order.

Sources: opencode edit.ts, claude-code / Anthropic text-editor tool.

This composes straight onto the round-2 plan: the apply_edits tool's args parse into the typed Edits, find_unique is the validated-Plan resolver, and the whole thing is mechanism-agnostic — it works the same whether the {old,new} list arrives via a tool-call or SEARCH/REPLACE blocks. So we can lock the edit semantics now and still defer the tool-vs-blocks mechanism bake-off.

@bddap-bot

Copy link
Copy Markdown
Contributor Author

PR B is up — #33 (tool/function-call edit mode), built on the design from this thread and the opencode/claude-code edit semantics. Stacks on A (#32). 38 tests, clippy clean, live-tested on Anthropic; two reviewer-loop rounds applied. Details in the PR.

@bddap

bddap commented Jun 19, 2026

Copy link
Copy Markdown
Owner

Remind me why we have two prs who's descriptions seem to indicate they do the same thing. #33 and #31

@bddap-bot

Copy link
Copy Markdown
Contributor Author

#33 supersedes this — they're a prototype and its finished version, not two of the same thing.

The lineage:

So #33 fully subsumes #31. Closing this in favor of #33 — reopen if you'd rather land the simpler batch-edit version instead.

@bddap-bot bddap-bot closed this Jun 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants