feat(retrieval): PageIndex-style page-based agentic strategy (PR-B) #25
Add a new retrieval Strategy modelled on PageIndex's 3-tool
reasoning protocol (get_document_structure, get_pages, done). The
model navigates by inclusive page range rather than by section ID
— a tighter interface for paginated documents (SEC filings,
academic PDFs) where the prior "pick a section ID from a 500-node
outline" surface was too noisy.
The loop:
- get_document_structure() returns the document's TOC as JSON
(titles + page ranges, no body text). Wires to a TOCProvider
that reads documents.toc_tree when present; falls back to a
synthesised view derived from the section tree when not, so
the strategy works even before the TOC-builder PR lands.
- get_pages(start_page, end_page) returns concatenated content
of every section whose [PageStart, PageEnd] overlaps the
requested range, clipped to PageContentLimit chars.
- done(answer, cited_pages, reasoning) terminates with the
final answer + the page ranges the answer relies on.
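As a rough illustration of the `get_pages` overlap rule in the loop above, here is a minimal sketch. The `Section` type, its field names, and the `pagesContent` helper are hypothetical stand-ins, not the code in this PR. It clips by bytes, which equals characters only for ASCII content.

```go
package main

import (
	"fmt"
	"strings"
)

// Section is a hypothetical stand-in for a node of the section tree.
type Section struct {
	ID        string
	PageStart int
	PageEnd   int
	Content   string
}

// overlaps reports whether the inclusive range [s.PageStart, s.PageEnd]
// intersects the inclusive request range [start, end].
func overlaps(s Section, start, end int) bool {
	return s.PageStart <= end && start <= s.PageEnd
}

// pagesContent concatenates the content of every overlapping section, in
// input order, and clips the result to limit bytes (limit <= 0 disables
// clipping). It also returns the IDs of the sections it touched.
func pagesContent(sections []Section, start, end, limit int) (string, []string) {
	if start > end {
		start, end = end, start // tolerate a reversed range
	}
	var b strings.Builder
	var ids []string
	for _, s := range sections {
		if !overlaps(s, start, end) {
			continue
		}
		b.WriteString(s.Content)
		ids = append(ids, s.ID)
	}
	out := b.String()
	if limit > 0 && len(out) > limit {
		out = out[:limit]
	}
	return out, ids
}

func main() {
	secs := []Section{
		{ID: "s1", PageStart: 1, PageEnd: 3, Content: "intro;"},
		{ID: "s2", PageStart: 3, PageEnd: 6, Content: "debt;"},
		{ID: "s3", PageStart: 7, PageEnd: 9, Content: "notes;"},
	}
	text, ids := pagesContent(secs, 3, 5, 16000)
	fmt.Println(text, ids)
}
```

Because both ranges are inclusive, a section ending on page 3 and a request starting on page 3 count as overlapping; the same test would also serve to map cited ranges back to section IDs.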
SelectWithCost surfaces both the agent's literal answer string
(via Result.Reasoning) and the set of section IDs whose page
range overlaps any cited range (via Result.SelectedIDs), so the
existing /v1/query + /v1/answer callers can consume the strategy
without changes. A new PagesRead field on Result captures every
get_pages call (start/end/section IDs/char count) for cost
debugging and the reasoning-trace surface.
Protocol uses the same JSON-action text shape AgenticStrategy
proved (llmgate v0.2.0's Tools field is still scaffolding-only);
when llmgate wires native tool calling the surface here is
unchanged. The parser tolerates "tool" vs "action" keys, a
"5-7"-string Pages alternative, and string-shaped cited_pages.
Trace-token reuses ComputeTraceToken but folds the strategy name
into the model position so page-based and section-based runs on
the same doc/model don't collide, and tags the page ranges with
"p:" so they share namespace with section IDs without colliding.
15 unit tests cover: the happy 3-tool sequence, multi-range
citations, MaxHops force-done (both with and without recovery),
TOC fallback, persisted-TOC precedence, persistent bad JSON,
out-of-range and partial-overlap page clamping, empty tree,
loader-less degradation, content clipping, empty-citations
refusal, trace-token stability + order invariance, and parser
tolerance for every documented input shape.
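The parser tolerance mentioned above ("tool" vs "action" keys, a "5-7" string as an alternative to numeric page bounds) could look roughly like the sketch below. The `rawAction` and `Action` types, the lowercase `pages` key, and the helper names are assumptions made for illustration; the PR's actual parser may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// rawAction mirrors the loose shapes a model may emit. The "pages" key
// name is an assumption; the PR only says a "5-7" string form is accepted.
type rawAction struct {
	Tool      string `json:"tool"`
	Action    string `json:"action"`
	StartPage *int   `json:"start_page"`
	EndPage   *int   `json:"end_page"`
	Pages     string `json:"pages"`
}

// Action is the normalised form the loop would dispatch on.
type Action struct {
	Name  string
	Start int
	End   int
}

// parseRange accepts "5-7" or a single page "5".
func parseRange(s string) (int, int, error) {
	lo, hi, found := strings.Cut(strings.TrimSpace(s), "-")
	start, err := strconv.Atoi(strings.TrimSpace(lo))
	if err != nil {
		return 0, 0, fmt.Errorf("bad page range %q: %w", s, err)
	}
	if !found {
		return start, start, nil
	}
	end, err := strconv.Atoi(strings.TrimSpace(hi))
	if err != nil {
		return 0, 0, fmt.Errorf("bad page range %q: %w", s, err)
	}
	return start, end, nil
}

// parseAction normalises either key spelling and either page shape.
func parseAction(text string) (Action, error) {
	var raw rawAction
	if err := json.Unmarshal([]byte(text), &raw); err != nil {
		return Action{}, fmt.Errorf("not a JSON action: %w", err)
	}
	name := raw.Tool
	if name == "" {
		name = raw.Action
	}
	if name == "" {
		return Action{}, fmt.Errorf("missing \"tool\" or \"action\" key")
	}
	a := Action{Name: name}
	switch {
	case raw.StartPage != nil && raw.EndPage != nil:
		a.Start, a.End = *raw.StartPage, *raw.EndPage
	case raw.Pages != "":
		start, end, err := parseRange(raw.Pages)
		if err != nil {
			return Action{}, err
		}
		a.Start, a.End = start, end
	}
	return a, nil
}

func main() {
	for _, in := range []string{
		`{"tool":"get_pages","start_page":5,"end_page":7}`,
		`{"action":"get_pages","pages":"5-7"}`,
	} {
		a, err := parseAction(in)
		fmt.Println(a, err)
	}
}
```

Normalising at the parse boundary keeps the loop itself free of shape checks: both inputs in `main` yield the same `Action` value.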
Wire the PageIndex strategy through a dedicated answer endpoint
on the existing /v1 router. The endpoint:
- Owns the full RAG round-trip in one request: retrieval +
answer + citations come back from a single agentic loop.
No separate synthesis call — the model emits its answer
inside the done action and we surface it as `answer` on
the response.
- Emits page-grounded citations. One citation per page range
the agent fetched (deduplicated), each carrying
start_page / end_page / section_ids plus an answer-span
quote pulled via the existing SpanExtractor over the cited
content. Falls back gracefully when the LLM declines a
quote.
- Persists every successful response to the existing replay
store under the strategy's deterministic trace_token. The
token's input set is sorted cited page ranges (not section
IDs), and the strategy name is folded into the hash so
page-based and section-based tokens for the same doc/model
never collide.
- Supports an opt-in reasoning trace (body field
`reasoning:true` or query param `?reasoning=true`) that
surfaces per-hop tool calls + args + tool-result chars +
sections touched, captured via a new OnEvent hook on
PageIndexStrategy.
- Streams via Server-Sent Events when `stream:true` is set
on the body. One event per tool call (get_document_structure,
get_pages, done) so callers can watch the navigation in real
time, terminated by an `answer` event carrying the full
JSON response payload.
- Honors per-request overrides for max_hops and
max_pages_per_fetch without mutating shared Deps. Disabled
deployments (retrieval.pageindex.enabled=false or no LLM
client) return 501; missing documents 404; bad bodies 400.
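To make the trace-token bullet above concrete, here is a minimal sketch of an order-invariant token over cited page ranges. It does not reproduce `ComputeTraceToken`, whose signature is not shown here: the SHA-256 choice, the NUL separator, the tag format, and the function name `pageTraceToken` are all assumptions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// pageTraceToken hashes the document ID, the model with the strategy name
// folded in, and the cited page ranges. Ranges are tagged "p:" and sorted,
// so the order in which the agent cited them does not change the token.
func pageTraceToken(docID, model string, cited [][2]int) string {
	tags := make([]string, 0, len(cited))
	for _, r := range cited {
		tags = append(tags, fmt.Sprintf("p:%d-%d", r[0], r[1]))
	}
	// Lexicographic order is enough here: any fixed order gives invariance.
	sort.Strings(tags)

	h := sha256.New()
	for _, part := range []string{docID, "pageindex:" + model, strings.Join(tags, ",")} {
		h.Write([]byte(part))
		h.Write([]byte{0}) // separator so adjacent fields cannot run together
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := pageTraceToken("doc-1", "model-x", [][2]int{{5, 7}, {12, 12}})
	b := pageTraceToken("doc-1", "model-x", [][2]int{{12, 12}, {5, 7}})
	fmt.Println(a == b)
}
```

Prefixing the model field with the strategy name means a section-based token for the same document and model hashes different bytes, which is the collision-avoidance property the bullet describes.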
Adds `RetrievalConfig.PageIndex` (PageIndexBlock) with defaults
(Enabled=true, MaxHops=8, PageContentLimit=16000) and matching
VLE_RETRIEVAL_PAGEINDEX_* env overrides. Validation rejects
negative knobs and accepts "pageindex" as a retrieval strategy.
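The defaults and validation described above might be shaped like the following sketch. The struct layout, the helper names, and in particular the exact environment variable name are assumptions; only the default values and the "reject negative knobs" rule come from the text.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// PageIndexBlock mirrors the knobs named in the text; the field layout
// here is an assumption.
type PageIndexBlock struct {
	Enabled          bool
	MaxHops          int
	PageContentLimit int
	Model            string // empty means "inherit the default model"
}

// defaultPageIndex returns the defaults stated in the commit message.
func defaultPageIndex() PageIndexBlock {
	return PageIndexBlock{Enabled: true, MaxHops: 8, PageContentLimit: 16000}
}

// applyEnv overlays one integer knob from the environment. The variable
// name is a guess at the VLE_RETRIEVAL_PAGEINDEX_* pattern.
func applyEnv(b *PageIndexBlock, lookup func(string) (string, bool)) error {
	const key = "VLE_RETRIEVAL_PAGEINDEX_MAX_HOPS" // hypothetical name
	v, ok := lookup(key)
	if !ok {
		return nil
	}
	n, err := strconv.Atoi(v)
	if err != nil {
		return fmt.Errorf("%s: %w", key, err)
	}
	b.MaxHops = n
	return nil
}

// Validate rejects negative knobs.
func (b PageIndexBlock) Validate() error {
	if b.MaxHops < 0 {
		return fmt.Errorf("retrieval.pageindex.max_hops must be >= 0, got %d", b.MaxHops)
	}
	if b.PageContentLimit < 0 {
		return fmt.Errorf("retrieval.pageindex.page_content_limit must be >= 0, got %d", b.PageContentLimit)
	}
	return nil
}

func main() {
	cfg := defaultPageIndex()
	if err := applyEnv(&cfg, os.LookupEnv); err != nil {
		fmt.Println("env error:", err)
		return
	}
	fmt.Println(cfg.MaxHops, cfg.PageContentLimit, cfg.Validate())
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the override logic testable without touching the process environment.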
cmd/engine/main.go registers the strategy via buildStrategy
when retrieval.strategy=pageindex, AND wires a standalone
PageIndexStrategy instance into the api.Deps used by the
answer endpoint — so the endpoint is available regardless of
which selection strategy the deployment runs by default.
Test coverage: 12 end-to-end handler tests (happy path,
reasoning trace via body field + query param, bad request,
not found, disabled in two modes, no LLM, replay persistence
verifying byte-equal response bytes, SSE event stream shape,
per-request override caps the loop, TOC fallback). Plus 5
config tests for defaults + env overrides + validation.
A PageIndexTreeLoader function field on Deps acts as a test
seam so handler tests can run end-to-end via httptest with
an in-memory tree, without a real Postgres backend.
OpenAPI 3.1 spec for the new endpoint:
- POST /v1/answer/pageindex documented with the
PageIndexAnswerRequest body shape (document_id, query,
optional model, max_hops, max_pages_per_fetch, stream,
reasoning) and PageIndexAnswerResponse (answer,
citations, hops_taken, usage, trace_token, pages_read,
reasoning_trace).
- PageIndexCitation, PageReadEntry, and
PageIndexTraceEntry component schemas describe the
page-grounded citation shape, the per-call navigation
footprint, and per-hop reasoning trace entries.
- The 200 response carries content for BOTH
application/json (non-streaming) and text/event-stream
(when stream:true) with documentation of the SSE event
types: `started`, one event per tool call
(get_document_structure / get_pages / done), and a
terminal `answer` event carrying the full payload.
- 501 covers both "no LLM client" and
"retrieval.pageindex.enabled=false" so operators
looking at the spec see the toggle that disables the
endpoint.
- QueryResponse's strategy enum gains "pageindex" so
/v1/query responses returned by a pageindex-default
deployment validate against the schema.
- ?reasoning=true query parameter is documented as an
alternative to the body's reasoning field.
config.example.yaml:
- retrieval.strategy comment lists every available
strategy with a one-line description of each, so an
operator picking a strategy can see what they're
choosing between without reading code.
- New retrieval.pageindex block with enabled / max_hops /
page_content_limit / model knobs, default values
matching the engine defaults, and a comment block
explaining the three-tool loop, the trace_token /
reasoning_trace / streaming differentiators, and the
graceful-degradation behaviour when no TOC tree is
persisted yet (the synthesised view fallback).
Pull request overview
This PR adds an opt-in PageIndex-style page-range retrieval/answering path alongside the existing section-based retrieval APIs. It introduces a new page-based agentic strategy, a dedicated /v1/answer/pageindex endpoint, config/wiring, tests, and OpenAPI documentation.
Changes:
- Added `PageIndexStrategy` with JSON tool-call loop, page reads, trace token support, TOC fallback, and tests.
- Added `/v1/answer/pageindex` handler with JSON/SSE responses, reasoning trace, citations, and replay integration.
- Added PageIndex config, engine wiring, OpenAPI schemas, and example configuration.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `pkg/retrieval/strategy.go` | Adds PagesRead metadata to retrieval results. |
| `pkg/retrieval/pageindex_strategy.go` | Implements the new PageIndex page-based strategy. |
| `pkg/retrieval/pageindex_strategy_test.go` | Adds unit coverage for strategy behavior and parsing. |
| `pkg/config/config.go` | Adds PageIndex config defaults, env overrides, and validation. |
| `pkg/config/config_test.go` | Adds config tests for PageIndex settings. |
| `openapi.yaml` | Documents the new endpoint and schemas. |
| `internal/api/server.go` | Wires the new route and API dependencies. |
| `internal/api/pageindex.go` | Implements the PageIndex answer endpoint and SSE path. |
| `internal/api/pageindex_test.go` | Adds handler tests for JSON/SSE/replay/error paths. |
| `config.example.yaml` | Documents PageIndex configuration. |
| `cmd/engine/main.go` | Wires PageIndex as a selectable strategy and dedicated endpoint strategy. |

    // Build a citation per UNIQUE page range present in PagesRead.
    // The set of pages the model "read" is a superset of what it
    // cited (some get_pages calls don't end up in the final
    // cited_pages list), but the union is the right cone of trust
    // to surface as evidence. The trace token is computed over
    // only the strictly-cited ranges, which the strategy already
    // has, so citation drift doesn't break replay.
    seen := make(map[[2]int]struct{}, len(res.PagesRead))

    citations := d.buildPageIndexCitations(r.Context(), t, res, body.Query, body.Model)
    final := map[string]any{
        "document_id": body.DocumentID,
        "query":       body.Query,
        "answer":      res.Reasoning,
        "citations":   citations,
        "strategy":    strat.Name(),
        "model":       budget.ModelName,
        "hops_taken":  res.HopsTaken,
        "usage": map[string]any{
            "input_tokens":  res.Usage.InputTokens,
            "output_tokens": res.Usage.OutputTokens,
            "total_tokens":  res.Usage.TotalTokens,
            "cost_usd":      res.Usage.CostUSD,
            "llm_calls":     res.Usage.LLMCalls,
        },
        "elapsed_ms":  time.Since(started).Milliseconds(),
        "trace_token": res.TraceToken,
        "pages_read":  res.PagesRead,
    }
    emitSSE("answer", final)

    resp := map[string]any{
        "document_id": body.DocumentID,
        "query":       body.Query,
        "answer":      res.Reasoning, // strategy stores the agent's answer here
        "citations":   citations,
        "strategy":    perReq.Name(),
        "model":       budget.ModelName,
        "hops_taken":  res.HopsTaken,
        "usage": map[string]any{
            "input_tokens":  res.Usage.InputTokens,
            "output_tokens": res.Usage.OutputTokens,
            "total_tokens":  res.Usage.TotalTokens,
            "cost_usd":      res.Usage.CostUSD,
            "llm_calls":     res.Usage.LLMCalls,
        },
        "elapsed_ms":  time.Since(started).Milliseconds(),
        "trace_token": res.TraceToken,
        "pages_read":  res.PagesRead,

    if s.TOC != nil {
        raw, err := s.TOC.GetTOC(ctx, t.DocumentID)
        if err == nil && len(raw) > 0 {
            return string(raw)
        }
        // Log and degrade: the strategy must keep going.
        if err != nil {
            log.Printf("retrieval: pageindex TOC fetch failed (degrading to synthesised view): %v", err)
        }
Why
FinanceBench's debt-registration question scores 0/1 on our current section-based retrieval against a 508-node 10-K outline — the "pick a section_id" surface is too noisy. PageIndex hits 98.7% on the same benchmark with a smaller interface: 3 tools, page-range navigation, no embeddings.
This PR ports that interface to vectorless-engine as a new strategy + dedicated answer endpoint. The existing endpoints are unchanged; PageIndex is an opt-in, additive surface.
What ships
1. `PageIndexStrategy` (`pkg/retrieval/pageindex_strategy.go`)

A new `Strategy` + `CostStrategy` implementing a faithful port of PageIndex's three-tool reasoning loop:

- `get_document_structure()`: returns the TOC tree as JSON (titles + page ranges, no body text).
- `get_pages(start_page, end_page)`: returns the concatenated content of every section whose `[PageStart, PageEnd]` overlaps the requested range, clipped at `PageContentLimit`.
- `done(answer, cited_pages, reasoning)`: terminates with the natural-language answer and the inclusive page ranges the answer relies on.

The system prompt is a port of the reference PageIndex demo (`PageIndex/examples/agentic_vectorless_rag_demo.py:44-52`) adapted to vle's JSON-action protocol (llmgate v0.2.0's `Tools` field is still scaffolding-only). When llmgate wires native tool calling, the action surface is unchanged.

Graceful degradation: the strategy uses a `TOCProvider` interface for `get_document_structure` observations. When the persisted `documents.toc_tree` column is `NULL` (pre-PR-A state), the provider's `ErrNoTOC` signal triggers a synthesised view derived from the section tree. Until PR-A merges, every request degrades through this path, and that is fine: the strategy works without a persisted TOC.

`Result.Reasoning` carries the agent's final answer (`/v1/answer/pageindex` reads it directly). `Result.SelectedIDs` is the union of every section whose page range overlaps any cited range, so the existing `/v1/query` callers still get a section list. A new `Result.PagesRead []PageReadEntry` records every `get_pages` call (start/end/section_ids/char_count) for cost debugging and the reasoning trace.

2. `POST /v1/answer/pageindex` (`internal/api/pageindex.go`)

- One request owns the full RAG round-trip with no separate synthesis call: the model emits its answer inside the `done` action.
- Deterministic trace token: `computePageIndexTraceToken` hashes `doc_id || "pageindex:" model || sorted cited page ranges`, folding the strategy name into the model position so page-based and section-based tokens never collide. Stored in the existing replay store; `/v1/replay` returns byte-identical responses.
- Page-grounded citations, each with an answer-span quote pulled via the existing `SpanExtractor` over the concatenated cited content (offsets point back into that content).
- `reasoning_trace` (opt-in via body `reasoning:true` or `?reasoning=true`) lists every tool call with hop/tool/args/result_chars/sections_touched. Captured via a new `OnEvent` hook on `PageIndexStrategy`.
- Streaming (`stream:true`) via Server-Sent Events. One event per tool call so callers watch the navigation in real time, terminated by an `answer` event carrying the full payload.
- Per-request overrides for `max_hops` and `max_pages_per_fetch` without mutating shared `Deps`.

3. Config (`pkg/config/config.go`)

- `RetrievalConfig.PageIndex` block: `enabled` (default true), `max_hops` (8), `page_content_limit` (16000), `model` (inherit).
- `VLE_RETRIEVAL_PAGEINDEX_*` env overrides (Enabled/MaxHops/PageContentLimit/Model).
- `Validate()` accepts `pageindex` as a strategy name and rejects negative knobs.

4. Wiring (`cmd/engine/main.go`)

- `buildStrategy` registers `pageindex` as a selection strategy choice.
- A standalone `PageIndexStrategy` instance is always wired into `api.Deps.PageIndexStrategy` (gated by `retrieval.pageindex.enabled`) regardless of which strategy is selected as default. So a deployment running `chunked-tree` for `/v1/query` still gets `/v1/answer/pageindex`.

5. OpenAPI + config.example.yaml

Full spec for the new endpoint: `PageIndexAnswerRequest`, `PageIndexAnswerResponse`, `PageIndexCitation`, `PageReadEntry`, `PageIndexTraceEntry`. Both `application/json` and `text/event-stream` content types under 200, with SSE event type documentation. Example config block with operator-readable comments.

Test plan

- `pkg/retrieval/pageindex_strategy_test.go`: 15 unit tests covering the canonical 3-tool sequence, multi-range citations, MaxHops force-done (with and without recovery), TOC fallback (and persisted-TOC precedence), persistent bad JSON, out-of-range + partial-overlap page clamping, empty tree, loader-less degradation, content clipping, empty-citations refusal, trace-token stability + order invariance, parser tolerance.
- `internal/api/pageindex_test.go`: 12 end-to-end handler tests via `httptest` with a mock LLM, mock storage, and a `PageIndexTreeLoader` test seam: happy path, reasoning trace (body + query param), bad request, document not found, disabled (config + nil strategy), no LLM, replay persistence verifying byte-equal response bytes, SSE event stream shape, per-request override caps the loop, TOC fallback.
- `pkg/config/config_test.go`: 5 config tests covering defaults, env overrides (all four knobs), enable-toggle from disabled, garbage env rejection, validation negatives.
- `go test ./...` and `go build ./...` clean.
- `config.example.yaml` parses cleanly via `config.Load`.

Risk envelope

- Existing endpoints (`/v1/query`, `/v1/answer`, `/v1/replay`) are unchanged. The new `/v1/answer/pageindex` is purely additive.
- The strategy degrades to the synthesised view when `documents.toc_tree` is `NULL`. Even if PR-A is never merged, this PR delivers value.

Out of scope (NOT in this PR)

- The TOC builder (`pkg/tree/tree.go` `TOCNode` + ingest stage). PR-A owns that. The `TOCProvider` interface is the integration point: when PR-A lands, the engine wires a DB-backed implementation reading `documents.toc_tree`.