feat: improve search quality with new semantic+keyword search (#343)#346

Open: kiyotis wants to merge 185 commits into main from 343-improve-search-quality

kiyotis (Contributor) commented May 19, 2026

Closes #343

Approach

2-stage search architecture replacing the legacy full-text-search approach:

  • Semantic search: 2-stage (Stage1: page selection from index.md, Stage2: section selection from knowledge JSON) using LLM judgment
  • Keyword search: deterministic term→section_id lookup via RBKC-generated terms.json
  • QA workflow: hearing → semantic search → read sections → answer generation → hallucination verify
  • Code analysis workflow: keyword search → read sections → report
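The keyword path listed above can be pictured as a plain dictionary lookup. This is a minimal sketch, assuming terms.json maps each term to a list of section IDs. The schema that RBKC actually generates is not shown in this PR, and `lookup_sections` and the sample IDs are hypothetical.

```python
import json

# Hypothetical terms.json content (assumption): term -> list of section IDs.
SAMPLE_TERMS_JSON = '{"batch": ["sec-001", "sec-007"], "validation": ["sec-007", "sec-012"]}'


def lookup_sections(terms_index: dict[str, list[str]], query_terms: list[str]) -> list[str]:
    """Return section IDs for the given terms, de-duplicated, in first-seen order."""
    seen: dict[str, None] = {}
    for term in query_terms:
        # Unknown terms simply contribute nothing; no LLM call is involved.
        for section_id in terms_index.get(term, []):
            seen.setdefault(section_id, None)
    return list(seen)


terms_index = json.loads(SAMPLE_TERMS_JSON)
result = lookup_sections(terms_index, ["batch", "validation", "unknown"])
```

Because the lookup involves no model call, the same terms always yield the same sections, which is what makes this path deterministic.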

Phase A built and benchmarked each component independently. Phase B deployed to skill and ran E2E benchmarks.

Tasks

See tasks.md.

Expert Review

AI-driven expert reviews conducted before PR creation (see .claude/rules/expert-review.md):

Success Criteria Check

| Criterion | Status | Evidence |
|---|---|---|
| E2E benchmark runs without errors (all 30 scenarios) | ❌ Not Met | 13 errors in run-1 (B-4-1 in progress) |
| New search accuracy ≥ baseline-current (83.7%) | ❌ Not Met | Pending error fix + 3 runs |
| Hallucination PASS ≥ baseline-current (14.4%) | ❌ Not Met | Pending error fix + 3 runs |

🤖 Generated with Claude Code

kiyohome and others added 30 commits May 15, 2026 13:21
Co-authored-by: kiyobot <kiyohito.itoh+bot@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runs qa.md workflow end-to-end via claude -p, extracting structured
benchmark markers (BENCHMARK_HEARING/SEARCH/ANSWER) from the response.
Saves results in evaluate.py-compatible format (hearing.json, search.json,
answer.md, metrics.json). Enables current vs new search comparison.
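Marker extraction of this kind might look like the sketch below. The commit does not show the marker syntax, so the `BENCHMARK_<NAME>_START` / `BENCHMARK_<NAME>_END` delimiters and the `extract_markers` helper are assumptions made purely for illustration.

```python
import re

MARKER_NAMES = ("HEARING", "SEARCH", "ANSWER")


def extract_markers(response: str) -> dict[str, str]:
    """Return the text between each assumed START/END marker pair."""
    sections: dict[str, str] = {}
    for name in MARKER_NAMES:
        # Assumed delimiter format; the real markers may look different.
        pattern = rf"BENCHMARK_{name}_START\n(.*?)\nBENCHMARK_{name}_END"
        match = re.search(pattern, response, flags=re.DOTALL)
        if match:
            sections[name] = match.group(1).strip()
    return sections


sample = (
    "BENCHMARK_HEARING_START\n{\"topic\": \"batch\"}\nBENCHMARK_HEARING_END\n"
    "BENCHMARK_ANSWER_START\nUse the batch handler.\nBENCHMARK_ANSWER_END\n"
)
markers = extract_markers(sample)
```

A missing marker is left out of the result rather than raising, so the caller can decide whether an absent section counts as a scenario error.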

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude CLI rejects long prompts passed as command-line args.
Use input= (stdin) instead of appending prompt to argv.
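The change described here can be sketched with `subprocess.run` and its `input=` parameter. The helper name `run_with_stdin` is hypothetical, and a small Python child process stands in for the real CLI so the sketch runs anywhere. In the benchmark the command would presumably be the `claude -p` invocation.

```python
import subprocess
import sys


def run_with_stdin(cmd: list[str], prompt: str, timeout: float = 180.0) -> str:
    """Run cmd, feeding the prompt on stdin rather than appending it to argv."""
    completed = subprocess.run(
        cmd,
        input=prompt,          # delivered on stdin, so argv length limits do not apply
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return completed.stdout


# Stand-in for the real CLI (assumption): report how many characters arrived on stdin.
echo_cmd = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
long_prompt = "x" * 500_000
output = run_with_stdin(echo_cmd, long_prompt)
```

Operating systems cap the size of the argument list, whereas stdin has no comparable limit, which is why a long prompt is safer on stdin.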

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… all B tasks

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…criteria

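- answer.md check: non-empty → contains a "参照:" (references) section (confirms a normal answer)
- B-3 manual check: subjective judgment → mechanical check via run_e2e.py + exit code 0
- B-4 new-search check: same as B-1 → added checks for differences specific to the new search
- B-1 full-run check: 4 files exist → added a consistency check that model_usage is non-empty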

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eckable criteria

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
claude -p --output-format json returns modelUsage (camelCase), not
model_usage (snake_case). The old key caused model_usage to always be {}
in metrics.json, breaking cost analysis.
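The key mismatch is easy to reproduce. This is a minimal sketch: the `modelUsage` key name comes from the commit message, while the sample payload, its inner field names, and `extract_model_usage` are assumptions for illustration.

```python
import json


def extract_model_usage(cli_output: str) -> dict:
    """Read the camelCase modelUsage key from the CLI's JSON output."""
    payload = json.loads(cli_output)
    return payload.get("modelUsage", {})


# Hypothetical, trimmed-down CLI output; real output carries more fields.
sample_output = json.dumps(
    {"result": "ok", "modelUsage": {"some-model": {"inputTokens": 10}}}
)

model_usage = extract_model_usage(sample_output)

# The old snake_case lookup silently falls back to the default.
old_lookup = json.loads(sample_output).get("model_usage", {})
```

Since `.get()` with a default never raises, the wrong key produced an empty dict instead of an error, which is why the bug only surfaced during cost analysis.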

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Design doc E2E flow step 4 requires evaluate.py to be called after
run_e2e_scenario. run_e2e_all was missing this step, leaving evaluation
results absent from all E2E benchmark runs.

Added --knowledge-dir argument to enable evaluation. When provided,
evaluation.json is saved per scenario alongside hearing/search/answer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Default 180s was too short for complex scenarios. Allows override per run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
accuracy avg=0.96, hallucination PASS rate=0.19 (5/26)
Total cost: $19.04, avg 7.1 turns, avg 6.4 sections/scenario

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
accuracy=0.96 may be too high for a no-hearing baseline.
Must verify must/fact design and PRESENT verdicts are grounded.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ge1)

Adds generate_index_md() alongside existing generate_index() (index.toon).
Both are generated on every create/update/delete run.
SE recommendation: Option A — co-locate in index.py, no blast radius on
existing index.toon / QO4 verify / current skill.

12 new tests GREEN, all 567 rbkc tests pass, v6 create+verify FAIL=0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
B-1 complete, B-2 index.md done, terms.json pending.
Next: baseline validity check (all 28 scenarios user review).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reverts commits e8b2da2 (baseline results), 0aefe71 (generate_index_md),
and tasks.md update commits fc7f83a/b04b9a136/441d1f35e/8268638b7.

Reason: baseline-current was run with hearing_answer injected into the prompt,
giving the current skill an unfair context advantage. Scores do not reflect
the actual current skill capability. Re-run required with no hearing_answer.

Also adds three new rules and pre-run QA expert review steps to B-1 and B-4
to prevent the same mistake:
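- The current-search benchmark runs without hearing results
- The new-search benchmark runs with hearing results
- A QA expert review is required before running a benchmark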

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…king

Add mode="current"|"new" to build_e2e_prompt, run_e2e_scenario, run_e2e_all,
and --mode CLI arg. In "current" mode, hearing_answer is not injected into
the prompt so the current skill is measured as-is (no oracle context).
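The mode switch could be sketched as follows. The real signature of build_e2e_prompt is not shown in this PR, so `build_prompt_sketch`, its parameters, and the prompt wording are hypothetical. Only the rule that "current" mode omits hearing_answer comes from the commit message.

```python
from typing import Literal, Optional

Mode = Literal["current", "new"]


def build_prompt_sketch(question: str, hearing_answer: Optional[str], mode: Mode) -> str:
    """Assemble a prompt; only "new" mode may include the hearing answer."""
    if mode not in ("current", "new"):
        raise ValueError(f"unknown mode: {mode!r}")
    parts = [f"Question: {question}"]
    # In "current" mode the skill is measured as-is, with no oracle context.
    if mode == "new" and hearing_answer:
        parts.append(f"Hearing answer: {hearing_answer}")
    return "\n".join(parts)


current_prompt = build_prompt_sketch("How do I configure batch?", "batch application, v6", "current")
new_prompt = build_prompt_sketch("How do I configure batch?", "batch application, v6", "new")
```

Rejecting unknown modes explicitly keeps a typo in `--mode` from silently falling back to one of the two behaviours.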

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
run_e2e.py now creates tools/benchmark/results/YYYYMMDD-HHMMSS/ automatically
when --output-dir is not specified. Simplifies repeated benchmark runs.
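A sketch of the default-directory logic, assuming local time and a helper named `resolve_output_dir` (hypothetical). It only computes the path; the caller would still create the directory, for example with `mkdir(parents=True, exist_ok=True)`.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def resolve_output_dir(
    output_dir: Optional[str], base: Path, now: Optional[datetime] = None
) -> Path:
    """Use the explicit directory if given, else base/YYYYMMDD-HHMMSS."""
    if output_dir:
        return Path(output_dir)
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    return base / stamp


base = Path("tools/benchmark/results")
# A fixed timestamp keeps the example reproducible.
auto_dir = resolve_output_dir(None, base, now=datetime(2026, 5, 15, 17, 13, 0))
explicit_dir = resolve_output_dir("out/run-1", base)
```

Names in this format sort chronologically as plain strings, so successive runs list in execution order.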

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
28 QA scenarios × 3 runs for current-skill baseline.
All runs: total_scenarios=28, no empty model_usage.

Run dirs:
- 20260515-171300 (run-1)
- 20260515-181817 (run-2)
- 20260515-194124 (run-3)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
evaluation.json (accuracy/hallucination) was not generated because
--knowledge-dir was omitted from the run commands. Added post-hoc
evaluation step before baseline report.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- run_e2e.py: --knowledge-dir required (was optional, caused missing
  evaluation.json); --model removed (design spec: sonnet fixed)
- evaluate.py: calculate_accuracy_score returns None when any UNCERTAIN
  present (design spec: UNCERTAIN scenarios excluded from aggregation);
  --model removed
- report.py: performance summary now outputs all 7 spec metrics
  (duration_api_ms, num_turns, input/output tokens, cache, cost, P50/P95);
  format_comparison_report implemented (was missing entirely)
- run.py: aggregate_all_metrics now matches metrics.json schema
  (adds duration_api_ms, num_turns, usage.cache_*, model_usage); --model removed
- simulate_hearing/semantic_search/answer/verify/answer_verify:
  --model removed from all 5 scripts (sonnet hardcoded in call_llm)
- Tests updated to match corrected spec
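The UNCERTAIN rule from the evaluate.py item above can be sketched as below. The scoring formula (fraction of PRESENT verdicts), the "ABSENT" label, the treatment of an empty verdict list, and both function names are assumptions. Only the "return None when any UNCERTAIN is present, and exclude those scenarios from aggregation" behaviour comes from the commit message.

```python
from typing import Optional


def accuracy_or_none(verdicts: list[str]) -> Optional[float]:
    """Fraction of PRESENT verdicts, or None if any verdict is UNCERTAIN."""
    # An empty list is also treated as unscorable (assumption).
    if not verdicts or "UNCERTAIN" in verdicts:
        return None
    return sum(v == "PRESENT" for v in verdicts) / len(verdicts)


def mean_accuracy(per_scenario: list[Optional[float]]) -> Optional[float]:
    """Average over scored scenarios only; None scores are excluded."""
    scored = [s for s in per_scenario if s is not None]
    return sum(scored) / len(scored) if scored else None


scores = [
    accuracy_or_none(["PRESENT", "PRESENT"]),
    accuracy_or_none(["PRESENT", "ABSENT"]),      # "ABSENT" is a hypothetical label
    accuracy_or_none(["PRESENT", "UNCERTAIN"]),   # excluded from the average
]
overall = mean_accuracy(scores)
```

Returning None instead of 0.0 keeps an undecided scenario from dragging the average down as if it were a confirmed miss.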

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
kiyotis and others added 30 commits May 19, 2026 18:34
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e from SKILL.md

These sections belong in the individual workflow files (answer.md,
verify.md, etc.) where the LLM executes them. Keeping them in SKILL.md
gives the LLM a shortcut to respond without executing the workflow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove '**Tool**: In-memory (LLM generation)', 'Call LLM with the
following prompt, substituting the variables:', and '---' separators.
Prompt content is preserved as direct step instructions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove '**Tool**: In-memory (LLM generation)', 'Call LLM with the
following prompt, substituting the variables:', and '---' separators.
Prompt content is preserved as direct step instructions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove '**Tool**: In-memory (LLM generation)', 'Call LLM with the
following prompt, substituting the variables:', and '---' separators
from Steps 2 and 3. Prompt content is preserved as direct step
instructions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove '**Tool**: In-memory (LLM generation)', 'Call LLM with the
following prompt, substituting the variables:', and '---' separators
from Steps 2 and 3. Prompt content is preserved as direct step
instructions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Revert verify.md, answer.md, hearing.md, semantic-search.md to
bb2b35b. The boilerplate removal approach was wrong — files need
a proper rewrite, not simple deletion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Merge hearing.md, answer.md, verify.md into qa.md as sequential
steps. Remove 'You are...', 'Call LLM with the following prompt',
'Parse the JSON response' meta-instructions. Also rewrite
semantic-search.md in the same style.

Delete qa/hearing.md, qa/answer.md, qa/verify.md, qa/ directory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
'Show greeting' had no definition — removed to prevent unpredictable
behavior. Also removed the step hint '(hearing → semantic search →
answer → verify)' from the qa.md dispatch line to avoid the agent
inferring sub-steps from the comment rather than reading the workflow.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace Steps 4-5 (inlined page/section selection) with a single
call to workflows/semantic-search.md. Removes duplicated logic.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace index.md LLM parse with a hardcoded list. Processing types
are version-specific so qa.md cannot be mechanically copied across
versions. Document this constraint in nabledge-skill.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Step 1 was just a variable assignment for processing_types. Moved the
list directly into the AskUserQuestion options in Step 2 and renumbered
all steps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Labels

enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

As a nabledge user, I want search quality held to the same standard as RBKC so that answers are accurate and hallucination-free

2 participants