docs: capture/perf week report, PR plan + 1.0.0 readiness assessment #86
jbachorik wants to merge 76 commits into
Eliminates B10/B15 FallbackPatternDetector predicates and partially
eliminates B16 by routing the affected DFA_*_WITH_GROUPS patterns to
PIKEVM_CAPTURE before the DFA state-count ladder:
- B10: optional prefix before capturing group (e.g. -?(-?.{3}).)
- B15: capturing group in quantified alternation (e.g. (a|b){2,})
- B16 (partial): nullable outer quantifier on capturing group with
non-nullable content (e.g. (a)?); patterns where both the outer
quantifier and group content are nullable (e.g. (0*-?){0,}) still
fall back to JDK via the new hasNullableGroupContentWithNullableQuantifier
predicate.
Both the capture-ambiguous TDFA path and the non-ambiguous DFA-with-groups
path now have the three gates before the DFA strategy ladder. Fuzz gate:
findings=0 (9530 patterns, 76240 inputs).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
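The three pattern shapes named in this commit can be checked against `java.util.regex`, which the fuzz gate treats as the reference. The snippet below is only a sketch of such a check; the class and the `span` helper are hypothetical names, not part of Reggie.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CaptureOracleDemo {
    /** Returns "start,end" of the group on the first find(), "-1,-1" if the group did not participate, null if no match. */
    static String span(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) return null;
        return m.start(group) + "," + m.end(group);
    }

    public static void main(String[] args) {
        System.out.println(span("-?(-?.{3}).", "-abcd", 1)); // B10 shape: optional prefix before the group
        System.out.println(span("(a|b){2,}", "ab", 1));      // B15 shape: group keeps its last iteration
        System.out.println(span("(a)?", "b", 1));            // B16 shape: skipped group stays unset
    }
}
```

Any engine that takes over these patterns from the JDK has to reproduce exactly these spans, including the unset `-1,-1` case.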
Add PIKEVM gate inside the capturing TDFA isAnchorConditionDiluted() block: patterns where both branches share a leading character but one branch carries a start-anchor guard (e.g. ^x|x(y)) now route to PIKEVM_CAPTURE instead of the JDK fallback. PikeVM evaluates ^/\A correctly against the search-region origin since commit 0acfc66. Patterns with optional quantifiers, nullable branches, or leading end-anchors still fall through to the anchorConditionDiluted JDK path. Fuzz gate confirms zero divergences with the new routing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- NFABytecodeGenerator: add zero-length early-accept before bounds/regionMatches in generateBackreferenceCheck; groupLen==0 trivially succeeds (vacuous match)
- FallbackPatternDetector: replace broad hasNullableBackrefGroup B7 guard with narrowed hasAmbiguouslyNullableBackrefGroup that only falls back when the group body can capture strings of length > 1 (unbounded contamination risk); groups with max capture length <= 1 (e.g. a?, [x]?) are safe with the early-accept
- BackrefEngineGapsTest: enable b7_nullableBackrefGroupInOptimizedNfa
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
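The ordering described in the first bullet (zero-length accept, then bounds, then region comparison) can be illustrated in plain Java. This is a hand-written sketch, not the generated bytecode; the method name, its parameters, and the treatment of an unset group are assumptions.

```java
final class BackrefCheckSketch {
    /**
     * Hypothetical helper: does the text captured in [groupStart, groupEnd)
     * reappear at position pos of the same input?
     */
    static boolean backrefMatches(String input, int pos, int groupStart, int groupEnd) {
        if (groupStart < 0) return false;                  // assumption: an unset group fails, as in the JDK
        int groupLen = groupEnd - groupStart;
        if (groupLen == 0) return true;                    // early accept: an empty capture matches vacuously
        if (pos + groupLen > input.length()) return false; // bounds check
        return input.regionMatches(pos, input, groupStart, groupLen);
    }

    public static void main(String[] args) {
        System.out.println(backrefMatches("b", 1, 0, 0));    // empty capture, checked at end of input
        System.out.println(backrefMatches("abab", 2, 0, 2)); // "ab" repeated at position 2
    }
}
```

Placing the `groupLen == 0` test first means the empty case never reaches the bounds check, even when `pos` is already at the end of the input.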
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- RuntimeCompiler: replace CapturePolicy import with ReggieOption/UnsupportedPatternException
- Add cacheKeyFor() helper (flag-aware cache key) and fallbackOrThrow() helper
- Gate all 6 JavaRegexFallbackMatcher construction sites behind ALLOW_JDK_FALLBACK flag
- compileHybrid() receives ReggieOptions to propagate fallback policy
- UnsupportedPatternException propagates through catch(Exception) via explicit re-throw
- 34 test files updated: add allowJdkFallback() for patterns requiring JDK fallback
- New FallbackPolicyTest: throwsByDefault, delegatesWhenFallbackEnabled, nativePatternUnaffected
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
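The policy in these bullets (throw by default, delegate only when the flag is set, keep the flag in the cache key) could look roughly like the sketch below. Only the helper names `fallbackOrThrow` and `cacheKeyFor` and the flag name come from the commit message; the signatures, the `Option` enum, and the exception class are hypothetical stand-ins and do not reflect Reggie's real API.

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.function.Supplier;

final class FallbackPolicySketch {
    enum Option { ALLOW_JDK_FALLBACK }          // stand-in for ReggieOption

    static final class UnsupportedPattern extends RuntimeException { // stand-in for UnsupportedPatternException
        UnsupportedPattern(String pattern) { super("no native strategy for: " + pattern); }
    }

    /** Builds the JDK-backed matcher only when the caller opted in; otherwise throws. */
    static <T> T fallbackOrThrow(String pattern, Set<Option> options, Supplier<T> jdkFallback) {
        if (options.contains(Option.ALLOW_JDK_FALLBACK)) return jdkFallback.get();
        throw new UnsupportedPattern(pattern);
    }

    /** Flag-aware key, so a matcher compiled with fallback is never served to a caller without it. */
    static String cacheKeyFor(String pattern, Set<Option> options) {
        return pattern + '\u0000' + EnumSet.copyOf(options.isEmpty() ? EnumSet.noneOf(Option.class) : options);
    }

    public static void main(String[] args) {
        Set<Option> on = EnumSet.of(Option.ALLOW_JDK_FALLBACK);
        System.out.println(fallbackOrThrow("(a+)\\1", on, () -> "jdk-backed matcher"));
    }
}
```

Keeping the options in the cache key matters because the same pattern string yields a different outcome (a matcher versus an exception) depending on the flag.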
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…otations ReggieOption moved from reggie-runtime to reggie-annotations so the annotation type can reference it without a circular dependency.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…onPriorityConflict
…suffix char Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…fier SWAR/first-char optimizer was advancing tryPos past 0 after the start-anchor check passed, causing matchesAtStart to be invoked at non-zero offsets. The DFA transition entryGuard (STRING_START) is not checked at runtime, so the DFA accepted matches at positions where \A cannot hold. Fix: suppress SWAR and first-char optimizations when requiresStartAnchor or hasMultilineStart is true — the anchor check already pins the loop to position 0 (or after-newline), making position-skipping via first-char filter incorrect. Also corrects a stale comment at the anchor optimization site that referred to an unused hasStringStartAnchor field. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
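The failure mode can be reproduced with a toy model of the pattern `\Ab`. The sketch below is not Reggie's generated code; `bodyMatchesAt`, `findBuggy`, and `findFixed` are hypothetical names chosen to mirror the description: a body matcher that does not re-check the anchor, a scan that skips ahead by first character, and a scan that stays pinned to position 0.

```java
final class AnchoredScanSketch {
    /** Toy body matcher for the 'b' in \Ab; like the DFA in the commit, it does not re-check the anchor. */
    static boolean bodyMatchesAt(String input, int pos) {
        return pos < input.length() && input.charAt(pos) == 'b';
    }

    /** Buggy shape: the first-char filter moves tryPos away from 0 after the anchor check passed. */
    static int findBuggy(String input) {
        int tryPos = input.indexOf('b'); // position skipping
        return (tryPos >= 0 && bodyMatchesAt(input, tryPos)) ? tryPos : -1;
    }

    /** Fixed shape: a start-anchored pattern is only ever tried at position 0. */
    static int findFixed(String input) {
        return bodyMatchesAt(input, 0) ? 0 : -1;
    }

    public static void main(String[] args) {
        System.out.println(findBuggy("ab")); // reports a match where \A cannot hold
        System.out.println(findFixed("ab")); // agrees with java.util.regex: no match
    }
}
```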
…rough Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixed 32 findings across 7 generators. Remaining 18: 3 Task-3 (OPTIMIZED_NFA_WITH_BACKREFS, pending design decision), 15 new/unmasked patterns for follow-up investigation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
T1.2 skips find() start positions whose char cannot begin a match (SQL_ANSI no-match ~38x, now faster than JDK); T1.5 makes the per-position accept check O(1). Zero new fuzz divergences. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
T1.4: find() for anchor/assertion/backref-free patterns uses a lazy DFA whose step re-injects the start closure each char (implicit .*? prefix), giving O(n) single-pass unanchored search. QOBF find ~29x->~7.7x faster than JDK; spans/findMatch stay on the thread sim. Zero new fuzz divergences. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
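The idea of re-injecting the start state on every character can be shown with a much simpler technique than a lazy DFA: the classic Shift-And bit-parallel search for a literal. The sketch below swaps in that technique for illustration only; it assumes an ASCII literal of at most 64 characters and is not how Reggie builds its DFA.

```java
final class ShiftAndSketch {
    /** Returns the exclusive end index of the first occurrence of literal in input, or -1. */
    static int findEnd(String literal, String input) {
        int m = literal.length();
        if (m == 0 || m > 64) throw new IllegalArgumentException("literal length must be 1..64");
        long[] mask = new long[128];                 // mask[c] has bit i set when literal.charAt(i) == c
        for (int i = 0; i < m; i++) {
            char c = literal.charAt(i);
            if (c >= 128) throw new IllegalArgumentException("ASCII literal assumed");
            mask[c] |= 1L << i;
        }
        long accept = 1L << (m - 1);
        long state = 0L;                             // bit i set: literal[0..i] ends at the current position
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            long step = c < 128 ? mask[c] : 0L;
            state = ((state << 1) | 1L) & step;      // "| 1L" re-injects the start state (implicit .*? prefix)
            if ((state & accept) != 0) return i + 1; // single left-to-right pass, no restart per position
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(findEnd("abc", "xxabcx"));
    }
}
```

Because a fresh start state enters on every step, no outer loop over candidate start positions is needed, which is what makes the search a single O(n) pass.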
Baseline for the Phase-2 go/no-go: PikeVM span extraction is 10-68x slower than JDK on the capture-bearing IAST patterns. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Hot path in PikeVM/closure transition scans (20-29% of capture-path samples). Branchless bitmap test for ASCII; ranges binary search retained for non-ASCII. equals/hashCode unchanged (range-based) so the structural cache is unaffected. Zero new fuzz divergences. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
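A two-tier membership test of this kind might be structured as follows. This is a sketch under assumptions: the class name, the flat `[start, end]` range layout, and the two-word bitmap are illustrative choices, not the actual Reggie charset class.

```java
final class CharSetSketch {
    private final long[] ascii = new long[2]; // bit c of the bitmap is set when ASCII char c is a member
    private final int[] ranges;               // sorted, non-overlapping, inclusive pairs: s0,e0,s1,e1,...

    CharSetSketch(int... ranges) {
        this.ranges = ranges.clone();
        for (int i = 0; i < ranges.length; i += 2)
            for (int c = ranges[i]; c <= ranges[i + 1] && c < 128; c++)
                ascii[c >>> 6] |= 1L << c;    // long shifts use only the low 6 bits of the count
    }

    boolean contains(char c) {
        if (c < 128)                          // ASCII: one load, one shift, one mask
            return ((ascii[c >>> 6] >>> c) & 1L) != 0;
        int lo = 0, hi = ranges.length / 2 - 1;
        while (lo <= hi) {                    // non-ASCII: binary search over the ranges
            int mid = (lo + hi) >>> 1;
            if (c < ranges[2 * mid]) hi = mid - 1;
            else if (c > ranges[2 * mid + 1]) lo = mid + 1;
            else return true;
        }
        return false;
    }

    public static void main(String[] args) {
        CharSetSketch set = new CharSetSketch('0', '9', 'a', 'z', 0x3B1, 0x3C9);
        System.out.println(set.contains('q') + " " + set.contains('A') + " " + set.contains('\u03b2'));
    }
}
```

Since the bitmap is derived from the ranges, equality and hashing can keep using the ranges alone, which matches the note that the structural cache is unaffected.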
matches() is whole-input (priority-independent), so a strict lazy DFA's yes/no equals the thread sim. Many-optional-group matches() ~323->38000 ops/ms (now beats re2j and JDK). Zero new fuzz divergences. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
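The priority-independence claim can be observed with `java.util.regex` alone: reordering alternation branches changes what `find()` returns but not the whole-input answer. The class and helper names below are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchesPriorityDemo {
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    static String firstFind(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Branch order decides which prefix find() reports...
        System.out.println(firstFind("a|ab", "ab") + " vs " + firstFind("ab|a", "ab"));
        // ...but the whole-input yes/no is the same either way.
        System.out.println(matches("a|ab", "ab") + " vs " + matches("ab|a", "ab"));
    }
}
```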
Extends the boolean DFA fast path to patterns whose only anchors are leading START/STRING_START, via an initial closure (pos 0, anchor crossed) vs re-inject/step closure (pos>0, anchor blocked). Eligibility declines non-leading ^ (reachable after a consume, e.g. inside a loop) where the pos-0-only model is unsound. URL find() now 13-20x faster than re2j (was 0.27-0.35x). Zero new fuzz divergences. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
IastTokenizerDrainBenchmark mirrors dd-trace SensitiveTokenizerBenchmark (512-2048B malformed payloads, full match-drain) vs reggie/re2j/jdk — the workload that actually decides the re2j replacement (our tiny-input IastRegexpBenchmark was JDK's best case). Reveals Reggie's O(n^2) drain (repeated findMatchFrom re-scan) and lazy-quantifier compile failures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- findMatchResultFrom: one continuing left-to-right pass with lowest-priority re-seeding (kills the per-start O(n^2) retry); anchor origin pinned to absolute 0, fixing ^/\A findAll re-anchoring.
- Fast-reject: when the boolean find DFA (exact, or an over-approximating reject DFA that crosses anchors as epsilon) proves no match at/after pos, skip the thread sim. Adversarial IAST drains now beat re2j ~10-31x.
- MultiGroupGreedy: ^ anchors to absolute 0, not the scan start.
- RegexFuzzOracle: add findAll group-span differential (first oracle to check group spans on the find path); budget 18->78 for the pre-existing find-path capture debt it now surfaces (tracked, to ratchet to 0).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
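The re-anchoring problem in the first and third bullets is easy to reproduce with `java.util.regex`. The sketch below contrasts a continuing drain, where `^` stays tied to absolute position 0, with a naive drain that restarts on a substring and so treats every scan start as a new origin. Both method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DrainAnchorDemo {
    /** One matcher, continuing find(): ^ can only hold at absolute position 0. */
    static int drainContinuing(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    /** Naive drain: re-matching on a substring makes each scan start look like position 0. */
    static int drainBySubstring(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        int n = 0, pos = 0;
        while (pos <= input.length()) {
            Matcher m = p.matcher(input.substring(pos));
            if (!m.find()) break;
            n++;
            pos += Math.max(m.end(), 1); // always make progress
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(drainContinuing("^a", "aaa") + " vs " + drainBySubstring("^a", "aaa"));
    }
}
```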
SPECIALIZED_MULTI_GROUP_GREEDY has no inter-group backtracking, so a greedy capturing group whose charset overlaps what follows (e.g. (\w+)0 on "ab00", (\d+)5 on "1235") returned NO_MATCH where JDK matches. Decline such patterns (requiresBacktrackingForGroups) so they fall through to the backtracking-capable routing (the requiresBacktrackingForGroups -> RECURSIVE_DESCENT guard), which produces correct spans. GREEDY_BACKTRACK already handles the (.*)literal shape. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
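Both examples from this commit can be confirmed against the JDK, where the greedy group gives back one character so the trailing literal can match. A minimal check, with a hypothetical helper name:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyGiveBackDemo {
    /** Returns group 1 of the first find(), or null when there is no match. */
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(group1("(\\w+)0", "ab00")); // group must shrink to leave the final 0
        System.out.println(group1("(\\d+)5", "1235")); // group must shrink to leave the final 5
    }
}
```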
A nullable capturing group in an alternation branch (e.g. 1|()b, ()b|x) is committed as a zero-width span by the TDFA/group-action capture path even when the priority-winning branch bypasses it (binds g1=[0,0); JDK leaves it -1). Detect it (hasNullableCapturingGroupInAlternationBranch) and route to PIKEVM_CAPTURE, which gives correct, O(n)/ReDoS-safe spans. A non-nullable group like (a) in (a)|b never leaks and stays on the fast DFA. Fuzz budget 78->69. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
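The JDK behaviour that the new routing has to match can be shown directly: the empty group is set only when its own branch wins. The helper below is a hypothetical sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NullableGroupBranchDemo {
    /** Returns "start,end" of group 1 on the first find(); "-1,-1" means the group is unset. */
    static String span1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) return null;
        return m.start(1) + "," + m.end(1);
    }

    public static void main(String[] args) {
        System.out.println(span1("1|()b", "1")); // first branch wins, group bypassed
        System.out.println(span1("1|()b", "b")); // second branch wins, zero-width capture
        System.out.println(span1("()b|x", "x")); // second branch wins, group bypassed
    }
}
```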
…VM (Class E) Two capturing alternations in sequence where the first has shared-prefix branches (e.g. (a|ab)(c|bcd)) leave the first group's span ambiguous until the second resolves it — which the single-register TDFA mis-tracks ((a|ab)(c|bcd) on "abcd" gave g1=[0,2) vs JDK [0,1)). Detect the pair (hasInteractingCapturingAlternations) and route to PIKEVM_CAPTURE for correct, O(n)/ReDoS-safe spans. A lone capturing alternation or one followed by a fixed element (e.g. (a|ab)\d, (a|b)(c|d)) stays on the fast DFA. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
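The reference spans quoted above come from the JDK and can be reproduced as follows. The class and helper are hypothetical names used only for this check.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InteractingAlternationsDemo {
    /** Returns "g1Start,g1End g2Start,g2End" for the first find(), or null when there is no match. */
    static String spans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) return null;
        return m.start(1) + "," + m.end(1) + " " + m.start(2) + "," + m.end(2);
    }

    public static void main(String[] args) {
        // The first group settles on "a" because the second group can then take "bcd".
        System.out.println(spans("(a|ab)(c|bcd)", "abcd"));
    }
}
```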
…esign docs Adds the 2026-06-11..18 achievement report (capture-correctness + O(n) drain work), the PR decomposition plan for that work (PRs #81-#85), the Reggie 1.0.0 public-release readiness assessment, and the design/plan docs from the design->adversarial-review->plan loops. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The lookbehind+lookahead sandwich (?<=\[)[^\]]+(?=\]) compiles native and silently returns no-match (verified HEAD + origin/main; pre-existing). Downgrade "boolean correctness" to a gap with a P1 punch-list item to fix or decline it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
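For reference, the JDK does find a match for this pattern, which is the behaviour a fix would need to reproduce. A minimal check, with a hypothetical helper name:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundSandwichDemo {
    static final Pattern BRACKETED = Pattern.compile("(?<=\\[)[^\\]]+(?=\\])");

    /** Returns the first bracketed run without its brackets, or null when there is none. */
    static String firstBracketed(String input) {
        Matcher m = BRACKETED.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstBracketed("x[abc]y"));
    }
}
```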
ae2a58c to a3da96a
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## pr/5-on-drain-capture-routing #86 +/- ##
===============================================================
Coverage ? 82.2%
Complexity ? 1
===============================================================
Files ? 127
Lines ? 36872
Branches ? 4716
===============================================================
Hits ? 30327
Misses ? 5050
Partials ? 1495
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a3da96ab9c
@@ -0,0 +1,91 @@
# PR decomposition plan — `feat/pikevm-capture-cost` → 5 stacked PRs

Status: PLAN (awaiting approval before any branch/PR is created). Base: `origin/main` = `a924e431`
Keep temporary plans out of committed docs
This file is explicitly a Status: PLAN PR/branch logistics document, but AGENTS.md lines 269-286 require all temporary task-oriented documents, including implementation plans, performance notes, temporary research notes, and action checklists, to live under gitignored doc/temp/ and never be committed unless promoted to permanent reference material. Committing this plan, along with the other added dated plan/report files under doc/ and docs/superpowers/plans/, makes draft workflow artifacts part of the permanent documentation set; move them to doc/temp/ or extract only durable reference content.
What does this PR do?
Adds documentation for the 2026-06-11→18 capture-correctness + performance effort: the week research/achievement report, the 5-PR decomposition plan (#81–#85), the Reggie 1.0.0 public-release readiness assessment, and the design/plan docs from the design→adversarial-review→plan loops.
Motivation
Capture what was achieved, how the work is being landed (the stacked PR series), and an evidence-backed go/no-go for a public 1.0.0.
Related Issue(s)
Documents the work in #81, #82, #83, #84, #85.
Change Type
Checklist
./gradlew build)
Performance Impact
None (documentation only).
Additional Notes
Docs-only, based on main (independent of the #81–#85 code stack). The 1.0.0 readiness assessment concludes that Reggie is not clean for a public 1.0 yet; see its punch list.
🤖 Generated with Claude Code