feat: improve search quality with new semantic+keyword search (#343) #346

Open

kiyotis wants to merge 185 commits into main from 343-improve-search-quality

Contributor

kiyotis commented May 19, 2026

Closes #343

Approach

2-stage search architecture replacing the legacy full-text-search approach:

Semantic search: 2-stage (Stage1: page selection from index.md, Stage2: section selection from knowledge JSON) using LLM judgment
Keyword search: deterministic term→section_id lookup via RBKC-generated terms.json
QA workflow: hearing → semantic search → read sections → answer generation → hallucination verify
Code analysis workflow: keyword search → read sections → report
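
The keyword-search step listed above is essentially a dictionary lookup. A minimal sketch, assuming terms.json maps each term to a list of section IDs; the real RBKC-generated schema is not shown in this PR, and `lookup_sections` plus the sample terms and IDs are hypothetical:

```python
import json

# Hypothetical terms.json content: term -> list of section_ids.
# The actual RBKC-generated schema may differ.
SAMPLE_TERMS_JSON = """
{
  "term_db": ["sec-db-01", "sec-db-03"],
  "term_batch": ["sec-batch-02", "sec-db-01"]
}
"""


def lookup_sections(terms_index: dict[str, list[str]], query_terms: list[str]) -> list[str]:
    """Return section_ids for the given terms, de-duplicated, in first-seen order."""
    seen: dict[str, None] = {}
    for term in query_terms:
        # Unknown terms contribute nothing; the lookup never guesses.
        for section_id in terms_index.get(term, []):
            seen.setdefault(section_id, None)
    return list(seen)


terms_index = json.loads(SAMPLE_TERMS_JSON)
result = lookup_sections(terms_index, ["term_batch", "term_db", "term_unknown"])
print(result)
```

Because the lookup involves no LLM call, the same query terms always yield the same sections, which is what makes this path deterministic.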

Phase A built and benchmarked each component independently. Phase B deployed the components to the skill and ran E2E benchmarks.

Tasks

Expert Review

AI-driven expert reviews conducted before PR creation (see .claude/rules/expert-review.md):

Software Engineer (Phase A) - 0 Findings
QA Engineer (Phase A) - 0 Findings
Prompt Engineer (Answer) - 0 Findings (after fixes)
Software Engineer (Answer) - 0 Findings

Success Criteria Check

Criterion	Status	Evidence
E2E benchmark runs without errors (all 30 scenarios)	❌ Not Met	13 errors in run-1 (B-4-1 in progress)
New search accuracy ≥ baseline-current (83.7%)	❌ Not Met	Pending error fix + 3 runs
Hallucination PASS ≥ baseline-current (14.4%)	❌ Not Met	Pending error fix + 3 runs

🤖 Generated with Claude Code

kiyohome and others added 30 commits

May 15, 2026 13:21


          feat: search quality improvement — Phase A infrastructure (#343) (#344)

8d95af5

Co-authored-by: kiyobot <kiyohito.itoh+bot@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>


          docs: update tasks.md — mark A-6 merge as done

296da3e

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          feat: add E2E benchmark runner (run_e2e.py) with 35 tests

a4d4403

Runs qa.md workflow end-to-end via claude -p, extracting structured
benchmark markers (BENCHMARK_HEARING/SEARCH/ANSWER) from the response.
Saves results in evaluate.py-compatible format (hearing.json, search.json,
answer.md, metrics.json). Enables current vs new search comparison.
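
The marker extraction described above might look roughly like the sketch below. The marker names come from this commit message, but the line format (`NAME: <single-line payload>`) and the function name `extract_markers` are assumptions; run_e2e.py may parse the response differently.

```python
import json
import re

# Assumed format: each marker sits on its own line as "NAME: <payload>".
MARKER_RE = re.compile(
    r"^(BENCHMARK_(?:HEARING|SEARCH|ANSWER)):\s*(.+)$", re.MULTILINE
)


def extract_markers(response: str) -> dict:
    """Collect marker payloads, decoding JSON where possible."""
    markers = {}
    for name, payload in MARKER_RE.findall(response):
        try:
            markers[name] = json.loads(payload)
        except json.JSONDecodeError:
            # Fall back to the raw text when the payload is not JSON.
            markers[name] = payload.strip()
    return markers


sample_response = (
    "Some preamble from the model\n"
    'BENCHMARK_HEARING: {"processing_type": "batch"}\n'
    'BENCHMARK_SEARCH: ["sec-01", "sec-02"]\n'
    "BENCHMARK_ANSWER: plain text answer\n"
)
markers = extract_markers(sample_response)
```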

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — B-1 run_e2e.py implementation complete

568c86c

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          fix: pass prompt via stdin to avoid claude CLI arg limit

78b3ed8

Claude CLI rejects long prompts passed as command-line args.
Use input= (stdin) instead of appending prompt to argv.
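
A minimal sketch of the stdin approach using `subprocess.run(..., input=...)`. The helper name `run_with_stdin` is hypothetical, and a small Python echo process stands in for the claude CLI so the example runs without it; the real command line used by run_e2e.py is not shown in this commit.

```python
import subprocess
import sys


def run_with_stdin(cmd: list[str], prompt: str, timeout: float = 180.0) -> str:
    """Run cmd, feeding the prompt on stdin instead of appending it to argv."""
    completed = subprocess.run(
        cmd,
        input=prompt,          # sent to the child's stdin
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,            # raise CalledProcessError on non-zero exit
    )
    return completed.stdout


# Stand-in for the CLI: a child process that echoes stdin back.
echo_cmd = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]

# A prompt far longer than typical per-argument limits still goes through.
long_prompt = "x" * 500_000
echoed = run_with_stdin(echo_cmd, long_prompt)
print(len(echoed))
```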

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — add acceptance criteria and concrete steps to…

c214399

… all B tasks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — fix 4 QA findings, add intent and acceptance …

aef224c

…criteria

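- answer.md check: non-empty → contains a "参照:" (References) section (confirms a normal answer)
- B-3 manual check: subjective judgment → mechanical pass/fail via run_e2e.py and exit code 0
- B-4 new-search check: identical to B-1 → added checks for differences specific to the new search
- B-1 full-run check: 4 files exist → added a consistency check that model_usage is non-empty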

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — session end state

359b522

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: add verification step rules to tasks.md — intent and machine-ch…

398a317

…eckable criteria

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: translate tasks.md rules to English, consistent with other rule…

b2f1d0d

…s files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          fix: use correct camelCase key modelUsage in run_e2e.py

7e30bc8

claude -p --output-format json returns modelUsage (camelCase), not
model_usage (snake_case). The old key caused model_usage to always be {}
in metrics.json, breaking cost analysis.
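
The key mismatch can be illustrated as follows. Only the top-level `modelUsage` key is taken from this commit message; the nested fields in the sample payload and the helper name `extract_model_usage` are hypothetical.

```python
import json


def extract_model_usage(cli_stdout: str) -> dict:
    """Read the camelCase modelUsage key from the CLI's JSON output."""
    payload = json.loads(cli_stdout)
    return payload.get("modelUsage", {})


# Hypothetical sample output; the nested structure is an assumption.
sample_stdout = json.dumps(
    {"result": "answer text", "modelUsage": {"example-model": {"inputTokens": 10}}}
)

fixed = extract_model_usage(sample_stdout)

# The old lookup used the snake_case key, which is absent, so it fell back to {}.
buggy = json.loads(sample_stdout).get("model_usage", {})
print(fixed, buggy)
```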

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          fix: add evaluate_scenario step to run_e2e_all (design doc step 4)

c46bb22

Design doc E2E flow step 4 requires evaluate.py to be called after
run_e2e_scenario. run_e2e_all was missing this step, leaving evaluation
results absent from all E2E benchmark runs.

Added --knowledge-dir argument to enable evaluation. When provided,
evaluation.json is saved per scenario alongside hearing/search/answer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — add no-confirmation rule

10ad4e5

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          feat: add --timeout CLI arg to run_e2e.py

3e7d8ab

Default 180s was too short for complex scenarios. Allows override per run.
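
A sketch of how such an override could be wired with argparse, assuming the default stays at 180 seconds and the value is an integer number of seconds; `build_parser` is a hypothetical name and the actual option handling in run_e2e.py may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="E2E benchmark runner (sketch)")
    parser.add_argument(
        "--timeout",
        type=int,
        default=180,  # assumed to remain the default; override per run
        help="per-scenario timeout in seconds",
    )
    return parser


default_args = build_parser().parse_args([])
override_args = build_parser().parse_args(["--timeout", "600"])
print(default_args.timeout, override_args.timeout)
```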

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          chore: add E2E baseline results (28 scenarios, current search)

e8b2da2

accuracy avg=0.96, hallucination PASS rate=0.19 (5/26)
Total cost: $19.04, avg 7.1 turns, avg 6.4 sections/scenario

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — B-1 complete

441d1f3

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — add baseline validity check task

b04b9a1

accuracy=0.96 may be too high for a no-hearing baseline.
Must verify must/fact design and PRESENT verdicts are grounded.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — baseline validity check: all 28 scenarios

fc7f83a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          feat: add generate_index_md to RBKC (index.md for semantic search Stage1)

0aefe71

Adds generate_index_md() alongside existing generate_index() (index.toon).
Both are generated on every create/update/delete run.
SE recommendation: Option A — co-locate in index.py, no blast radius on
existing index.toon / QO4 verify / current skill.

12 new tests GREEN, all 567 rbkc tests pass, v6 create+verify FAIL=0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — session end state

B-1 complete, B-2 index.md done, terms.json pending.
Next: baseline validity check (all 28 scenarios user review).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          revert: roll back baseline results, RBKC B-2, and tasks.md updates

32c7164

Reverts commits e8b2da2 (baseline results), 0aefe71 (generate_index_md),
and tasks.md update commits fc7f83a/b04b9a136/441d1f35e/8268638b7.

Reason: baseline-current was run with hearing_answer injected into the prompt,
giving the current skill an unfair context advantage. Scores do not reflect
the actual current skill capability. Re-run required with no hearing_answer.

Also adds three new rules and pre-run QA expert review steps to B-1 and B-4
to prevent the same mistake:
- Current-search benchmarks run without hearing results
- New-search benchmarks run with hearing results
- A QA expert review is required before any benchmark run

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          feat: add mode parameter to run_e2e for current vs new skill benchmarking

02d1949

Add mode="current"|"new" to build_e2e_prompt, run_e2e_scenario, run_e2e_all,
and --mode CLI arg. In "current" mode, hearing_answer is not injected into
the prompt so the current skill is measured as-is (no oracle context).
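
The mode switch might be sketched like this. The function name `build_e2e_prompt` and the `mode` values come from this commit message; the other parameters and the prompt layout are assumptions made for illustration.

```python
from typing import Optional


def build_e2e_prompt(question: str, hearing_answer: Optional[str], mode: str = "new") -> str:
    """Build the prompt; only "new" mode receives the hearing answer."""
    if mode not in ("current", "new"):
        raise ValueError(f"unknown mode: {mode!r}")
    parts = [question]
    if mode == "new" and hearing_answer:
        # Hypothetical layout; the real prompt format is not shown in the PR.
        parts.append(f"hearing_answer: {hearing_answer}")
    return "\n\n".join(parts)


current_prompt = build_e2e_prompt("How do I run a batch?", "batch processing", mode="current")
new_prompt = build_e2e_prompt("How do I run a batch?", "batch processing", mode="new")
print(current_prompt)
print(new_prompt)
```

Keeping the oracle context out of "current" mode is what prevents the unfair advantage that led to the earlier baseline revert.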

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — B-1 pre-run checks complete, ready for full run

eee6b40

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — B-1 run-1 in progress, add 3-run baseline plan

7e98d5f

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          feat: auto-generate timestamped output dir when --output-dir is omitted

52c9f78

run_e2e.py now creates tools/benchmark/results/YYYYMMDD-HHMMSS/ automatically
when --output-dir is not specified. Simplifies repeated benchmark runs.
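
A minimal sketch of the fallback, assuming local time and a hypothetical helper name `resolve_output_dir`; the sketch only computes the path and leaves directory creation to the caller.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

DEFAULT_BASE = Path("tools/benchmark/results")


def resolve_output_dir(
    output_dir: Optional[str],
    base: Path = DEFAULT_BASE,
    now: Optional[datetime] = None,
) -> Path:
    """Use --output-dir if given, else base/YYYYMMDD-HHMMSS."""
    if output_dir:
        return Path(output_dir)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return base / stamp


# Passing a fixed clock makes the result reproducible.
auto_dir = resolve_output_dir(None, now=datetime(2026, 5, 15, 17, 13, 0))
explicit_dir = resolve_output_dir("out/custom")
print(auto_dir, explicit_dir)
```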

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — run-1/run-2 complete, run-3 in progress

a3b4264

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          feat: add baseline-current E2E benchmark results (3 runs)

efbc320

28 QA scenarios × 3 runs for current-skill baseline.
All runs: total_scenarios=28, no empty model_usage.

Run dirs:
- 20260515-171300 (run-1)
- 20260515-181817 (run-2)
- 20260515-194124 (run-3)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — B-1 all 3 runs complete, baseline report done

3436c8f

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — add missing evaluation.json step to B-1

8928a35

evaluation.json (accuracy/hallucination) was not generated because
--knowledge-dir was omitted from the run commands. Added post-hoc
evaluation step before baseline report.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          fix: align benchmark scripts with design spec (14 violations fixed)

ef82f0c

- run_e2e.py: --knowledge-dir required (was optional, caused missing
  evaluation.json); --model removed (design spec: sonnet fixed)
- evaluate.py: calculate_accuracy_score returns None when any UNCERTAIN
  present (design spec: UNCERTAIN scenarios excluded from aggregation);
  --model removed
- report.py: performance summary now outputs all 7 spec metrics
  (duration_api_ms, num_turns, input/output tokens, cache, cost, P50/P95);
  format_comparison_report implemented (was missing entirely)
- run.py: aggregate_all_metrics now matches metrics.json schema
  (adds duration_api_ms, num_turns, usage.cache_*, model_usage); --model removed
- simulate_hearing/semantic_search/answer/verify/answer_verify:
  --model removed from all 5 scripts (sonnet hardcoded in call_llm)
- Tests updated to match corrected spec
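
The UNCERTAIN rule for evaluate.py can be sketched as below. Only the function name `calculate_accuracy_score`, the None-on-UNCERTAIN behavior, and the exclusion from aggregation come from this commit message; the scoring formula (share of PRESENT verdicts), the "ABSENT" label, and `aggregate_accuracy` are assumptions.

```python
from statistics import mean
from typing import Optional


def calculate_accuracy_score(verdicts: list[str]) -> Optional[float]:
    """Return None if any verdict is UNCERTAIN, else the share of PRESENT."""
    if not verdicts or "UNCERTAIN" in verdicts:
        return None
    return sum(v == "PRESENT" for v in verdicts) / len(verdicts)


def aggregate_accuracy(scores: list[Optional[float]]) -> Optional[float]:
    """Average only scenarios that produced a score."""
    valid = [s for s in scores if s is not None]
    return mean(valid) if valid else None


scores = [
    calculate_accuracy_score(["PRESENT", "PRESENT"]),    # 1.0
    calculate_accuracy_score(["PRESENT", "UNCERTAIN"]),  # None, excluded
    calculate_accuracy_score(["PRESENT", "ABSENT"]),     # 0.5
]
overall = aggregate_accuracy(scores)
print(scores, overall)
```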

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kiyotis and others added 30 commits

May 19, 2026 18:34


          docs: update tasks.md — session save 2026-05-19

a381462

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          feat: remove Critical Constraints, Error Handling, Knowledge Structure from SKILL.md

23cc98f

These sections belong in the individual workflow files (answer.md,
verify.md, etc.) where the LLM executes them. Keeping them in SKILL.md
gives the LLM a shortcut to respond without executing the workflow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — add B-0-4-H workflow boilerplate removal

bb2b35b

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          refactor: remove boilerplate from verify.md workflow

f0268df

Remove '**Tool**: In-memory (LLM generation)', 'Call LLM with the
following prompt, substituting the variables:', and '---' separators.
Prompt content is preserved as direct step instructions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          refactor: remove boilerplate from answer.md workflow

36ac296

Remove '**Tool**: In-memory (LLM generation)', 'Call LLM with the
following prompt, substituting the variables:', and '---' separators.
Prompt content is preserved as direct step instructions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          refactor: remove boilerplate from hearing.md workflow

258ff40

Remove '**Tool**: In-memory (LLM generation)', 'Call LLM with the
following prompt, substituting the variables:', and '---' separators
from Steps 2 and 3. Prompt content is preserved as direct step
instructions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          refactor: remove boilerplate from semantic-search.md workflow

1de7c04

Remove '**Tool**: In-memory (LLM generation)', 'Call LLM with the
following prompt, substituting the variables:', and '---' separators
from Steps 2 and 3. Prompt content is preserved as direct step
instructions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — B-0-4-H complete

8301cdf

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          revert: restore workflow MDs to pre-B-0-4-H state

b313d3a

Revert verify.md, answer.md, hearing.md, semantic-search.md to
bb2b35b. The boilerplate removal approach was wrong — files need
a proper rewrite, not simple deletion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          refactor: rewrite qa workflow as single flat instruction file

76268a9

Merge hearing.md, answer.md, verify.md into qa.md as sequential
steps. Remove 'You are...', 'Call LLM with the following prompt',
'Parse the JSON response' meta-instructions. Also rewrite
semantic-search.md in the same style.

Delete qa/hearing.md, qa/answer.md, qa/verify.md, qa/ directory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — B-0-4-H complete (qa rewrite)

55f8cad

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          fix: remove undefined greeting and LLM-step hint from SKILL.md

a3b5236

'Show greeting' had no definition — removed to prevent unpredictable
behavior. Also removed the step hint '(hearing → semantic search →
answer → verify)' from the qa.md dispatch line to avoid the agent
inferring sub-steps from the comment rather than reading the workflow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          fix: call semantic-search.md from qa.md instead of inlining logic

cf026e4

Replace Steps 4-5 (inlined page/section selection) with a single
call to workflows/semantic-search.md. Removes duplicated logic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          refactor: hardcode processing types in qa.md, document in rules

189b514

Replace index.md LLM parse with a hardcoded list. Processing types
are version-specific so qa.md cannot be mechanically copied across
versions. Document this constraint in nabledge-skill.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          refactor: inline processing types into Step 2, remove Step 1 indirection

119ff53

Step 1 was just a variable assignment for processing_types. Moved the
list directly into the AskUserQuestion options in Step 2 and renumbered
all steps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — session 3 complete (B-0-4-H done)

2d56ed2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — B-0-4 review approved, start B-4-1

5c0acdf

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — B-4-1 E2E running (brq8fxkin)

9f2f066

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — B-4-1 steps revised (measure+improve cycle)

0e970df

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — add trace.json review step to B-4-1

014020e

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — simplify B-4-1 report step

c93c4b1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — reference HOW-TO-RUN.md step 3 for report

9e253d7

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — B-4-1 cycle: report after each scenario

04bf794

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — B-4-1: bulk run, report per scenario

3c12c49

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — B-4-1: report per scenario as they complete

f850bed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: HOW-TO-RUN.md step 3 — report per scenario as they complete

73c3629

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: HOW-TO-RUN.md step 3 — add hallucination verification step (3c)

c7cc6ec

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — add evaluate.py investigation to B-4-1

22048cb

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — add B-4-1-A investigation tasks (21 items)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


          docs: update tasks.md — B-4-1 complete, B-4-1-A added (session 5)

c357838

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
