feat: improve search quality with new semantic+keyword search (#343) #346
Open
kiyotis wants to merge 185 commits into
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runs qa.md workflow end-to-end via claude -p, extracting structured benchmark markers (BENCHMARK_HEARING/SEARCH/ANSWER) from the response. Saves results in evaluate.py-compatible format (hearing.json, search.json, answer.md, metrics.json). Enables current vs new search comparison. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
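The marker extraction described in this commit could look roughly like the sketch below. The PR does not show how the markers are laid out in the response, so the `BENCHMARK_<NAME>_START` / `BENCHMARK_<NAME>_END` delimiters and both function names are hypothetical. Only the marker names and output file names come from the commit message. metrics.json is left out because it would presumably come from the CLI's JSON envelope rather than from a marker.

```python
import json
import re

# Hypothetical layout: each payload sits between BENCHMARK_<NAME>_START and
# BENCHMARK_<NAME>_END lines. The real delimiters are not shown in the PR.
MARKER_RE = re.compile(
    r"BENCHMARK_(HEARING|SEARCH|ANSWER)_START\n(.*?)\nBENCHMARK_\1_END",
    re.DOTALL,
)


def extract_markers(response: str) -> dict[str, str]:
    """Return the raw text found between each pair of markers."""
    return {m.group(1).lower(): m.group(2).strip() for m in MARKER_RE.finditer(response)}


def to_result_files(response: str) -> dict[str, str]:
    """Map marker payloads to the file names listed in the commit message."""
    parts = extract_markers(response)
    files: dict[str, str] = {}
    for name in ("hearing", "search"):
        if name in parts:
            # Round-trip through json so a malformed payload fails loudly.
            parsed = json.loads(parts[name])
            files[f"{name}.json"] = json.dumps(parsed, ensure_ascii=False, indent=2)
    if "answer" in parts:
        # The answer is kept as Markdown text, not JSON.
        files["answer.md"] = parts["answer"]
    return files
```

A missing marker simply produces no file here, so a caller can check for all expected files after the run.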
Claude CLI rejects long prompts passed as command-line args. Use input= (stdin) instead of appending prompt to argv. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
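A minimal sketch of the stdin approach, assuming the runner uses `subprocess.run`. The helper name `run_with_stdin_prompt` is hypothetical, and the 180-second default is taken from the timeout mentioned in a later commit. A small Python child process stands in for the real CLI so the example can run anywhere.

```python
import subprocess
import sys


def run_with_stdin_prompt(cmd: list[str], prompt: str, timeout: float = 180.0) -> str:
    """Run cmd, feeding the prompt on stdin instead of appending it to argv."""
    result = subprocess.run(
        cmd,
        input=prompt,         # sent to the child's stdin
        capture_output=True,
        text=True,
        timeout=timeout,      # raises subprocess.TimeoutExpired when exceeded
        check=True,           # raises CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Stand-in child process that reports how many characters arrived on stdin.
ECHO_LEN = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]

if __name__ == "__main__":
    long_prompt = "x" * 500_000  # far longer than is comfortable to pass in argv
    print(run_with_stdin_prompt(ECHO_LEN, long_prompt).strip())
```

In the benchmark runner the command would presumably be something like `["claude", "-p"]`, with the prompt supplied through `input=` exactly as above.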
… all B tasks Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
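…criteria
- answer.md check: non-empty → contains a 参照: (references) section, confirming a normal answer
- B-3 manual check: subjective judgment → mechanical pass/fail via run_e2e.py + exit code 0
- B-4 new-search check: identical to B-1 → added a check of the differences specific to the new search
- B-1 full check: 4 files exist → added a consistency check that model_usage is non-empty
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>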
…eckable criteria Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s files Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
claude -p --output-format json returns modelUsage (camelCase), not
model_usage (snake_case). The old key caused model_usage to always be {}
in metrics.json, breaking cost analysis.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
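The key mismatch can be illustrated with a short sketch. The function names and the sample payload, including the inner `costUSD` field, are assumptions for illustration. Only the `modelUsage` and `model_usage` key names come from the commit message.

```python
import json


def extract_model_usage(cli_stdout: str) -> dict:
    """Read per-model usage from the CLI's JSON output."""
    payload = json.loads(cli_stdout)
    # The CLI emits camelCase; looking up "model_usage" here would always yield {}.
    return payload.get("modelUsage", {})


def build_metrics(cli_stdout: str) -> dict:
    """Store the usage under the snake_case key used in metrics.json."""
    return {"model_usage": extract_model_usage(cli_stdout)}


if __name__ == "__main__":
    # Hypothetical payload; real output contains more fields.
    sample = '{"result": "ok", "modelUsage": {"example-model": {"costUSD": 0.25}}}'
    print(build_metrics(sample))
```

Because `dict.get` falls back silently, a check that `model_usage` is non-empty after each run is what catches this kind of mismatch.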
Design doc E2E flow step 4 requires evaluate.py to be called after run_e2e_scenario. run_e2e_all was missing this step, leaving evaluation results absent from all E2E benchmark runs. Added --knowledge-dir argument to enable evaluation. When provided, evaluation.json is saved per scenario alongside hearing/search/answer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Default 180s was too short for complex scenarios. Allows override per run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
accuracy avg=0.96, hallucination PASS rate=0.19 (5/26) Total cost: $19.04, avg 7.1 turns, avg 6.4 sections/scenario Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
accuracy=0.96 may be too high for a no-hearing baseline. Must verify must/fact design and PRESENT verdicts are grounded. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ge1) Adds generate_index_md() alongside existing generate_index() (index.toon). Both are generated on every create/update/delete run. SE recommendation: Option A — co-locate in index.py, no blast radius on existing index.toon / QO4 verify / current skill. 12 new tests GREEN, all 567 rbkc tests pass, v6 create+verify FAIL=0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
B-1 complete, B-2 index.md done, terms.json pending. Next: baseline validity check (all 28 scenarios user review). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reverts commits e8b2da2 (baseline results), 0aefe71 (generate_index_md), and the tasks.md update commits fc7f83a/b04b9a136/441d1f35e/8268638b7.
Reason: baseline-current was run with hearing_answer injected into the prompt, giving the current skill an unfair context advantage. The scores do not reflect the current skill's actual capability, so a re-run without hearing_answer is required.
Also adds three new rules, plus pre-run QA expert review steps for B-1 and B-4, to prevent the same mistake:
- The current-search benchmark runs without hearing results
- The new-search benchmark runs with hearing results
- A QA expert review is mandatory before running a benchmark
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…king Add mode="current"|"new" to build_e2e_prompt, run_e2e_scenario, run_e2e_all, and --mode CLI arg. In "current" mode, hearing_answer is not injected into the prompt so the current skill is measured as-is (no oracle context). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
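The mode switch can be sketched as follows. The parameter list, the default mode, and the prompt wording are assumptions. The commit message names `build_e2e_prompt`, `hearing_answer`, and the two mode values, but not the actual signature.

```python
from typing import Literal, Optional

Mode = Literal["current", "new"]


def build_e2e_prompt(question: str, hearing_answer: Optional[str], mode: Mode = "new") -> str:
    """Build the benchmark prompt; only "new" mode may see the hearing answer."""
    if mode not in ("current", "new"):
        raise ValueError(f"unknown mode: {mode!r}")
    parts = [f"Question: {question}"]
    # In "current" mode the hearing answer is withheld, so the current skill is
    # measured as-is, without oracle context.
    if mode == "new" and hearing_answer:
        parts.append(f"Hearing answer: {hearing_answer}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_e2e_prompt("How do I configure batch processing?", "REST, v6", mode="current"))
```

Rejecting unknown mode values outright guards against a typo on the command line silently running the wrong benchmark variant.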
run_e2e.py now creates tools/benchmark/results/YYYYMMDD-HHMMSS/ automatically when --output-dir is not specified. Simplifies repeated benchmark runs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
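One way to derive the timestamped directory, as a sketch only. The helper names are hypothetical, and local time is an assumption, since the commit message does not say which clock is used.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

RESULTS_ROOT = Path("tools/benchmark/results")


def default_run_dir(now: datetime, root: Path = RESULTS_ROOT) -> Path:
    """Return root/YYYYMMDD-HHMMSS for the given moment."""
    return root / now.strftime("%Y%m%d-%H%M%S")


def resolve_output_dir(output_dir: Optional[str]) -> Path:
    """Use --output-dir when given, otherwise fall back to a timestamped directory."""
    path = Path(output_dir) if output_dir else default_run_dir(datetime.now())
    path.mkdir(parents=True, exist_ok=True)
    return path


if __name__ == "__main__":
    print(default_run_dir(datetime(2026, 5, 15, 17, 13, 0)))
```

Because the name sorts lexicographically in chronological order, a plain directory listing shows repeated runs in the order they were started.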
28 QA scenarios × 3 runs for current-skill baseline. All runs: total_scenarios=28, no empty model_usage. Run dirs: - 20260515-171300 (run-1) - 20260515-181817 (run-2) - 20260515-194124 (run-3) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
evaluation.json (accuracy/hallucination) was not generated because --knowledge-dir was omitted from the run commands. Added post-hoc evaluation step before baseline report. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- run_e2e.py: --knowledge-dir is now required (it was optional, which caused missing evaluation.json); --model removed (design spec: sonnet fixed)
- evaluate.py: calculate_accuracy_score returns None when any UNCERTAIN verdict is present (design spec: UNCERTAIN scenarios are excluded from aggregation); --model removed
- report.py: the performance summary now outputs all 7 spec metrics (duration_api_ms, num_turns, input/output tokens, cache, cost, P50/P95); format_comparison_report implemented (it was missing entirely)
- run.py: aggregate_all_metrics now matches the metrics.json schema (adds duration_api_ms, num_turns, usage.cache_*, model_usage); --model removed
- simulate_hearing/semantic_search/answer/verify/answer_verify: --model removed from all 5 scripts (sonnet is hardcoded in call_llm)
- Tests updated to match the corrected spec
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
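The UNCERTAIN rule for accuracy scoring can be sketched as below. The scoring formula (fraction of PRESENT verdicts), the `ABSENT` label, the `None` result for an empty list, and the `mean_accuracy` helper are all assumptions. The PR only states that `calculate_accuracy_score` returns None when any UNCERTAIN verdict is present and that such scenarios are excluded from aggregation.

```python
from typing import Optional


def calculate_accuracy_score(verdicts: list[str]) -> Optional[float]:
    """Fraction of PRESENT verdicts, or None if the scenario cannot be scored."""
    # A single UNCERTAIN verdict makes the whole scenario unscorable.
    if not verdicts or "UNCERTAIN" in verdicts:
        return None
    return sum(v == "PRESENT" for v in verdicts) / len(verdicts)


def mean_accuracy(scores: list[Optional[float]]) -> Optional[float]:
    """Average over scored scenarios only; None scores are excluded, not counted as 0."""
    usable = [s for s in scores if s is not None]
    return sum(usable) / len(usable) if usable else None


if __name__ == "__main__":
    per_scenario = [
        calculate_accuracy_score(["PRESENT", "PRESENT"]),
        calculate_accuracy_score(["PRESENT", "UNCERTAIN"]),
        calculate_accuracy_score(["PRESENT", "ABSENT"]),
    ]
    print(per_scenario, mean_accuracy(per_scenario))
```

Returning None rather than 0.0 keeps an unscorable scenario from dragging the average down, at the cost of a smaller sample in the aggregate.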
…e from SKILL.md These sections belong in the individual workflow files (answer.md, verify.md, etc.) where the LLM executes them. Keeping them in SKILL.md gives the LLM a shortcut to respond without executing the workflow. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove '**Tool**: In-memory (LLM generation)', 'Call LLM with the following prompt, substituting the variables:', and '---' separators. Prompt content is preserved as direct step instructions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove '**Tool**: In-memory (LLM generation)', 'Call LLM with the following prompt, substituting the variables:', and '---' separators. Prompt content is preserved as direct step instructions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove '**Tool**: In-memory (LLM generation)', 'Call LLM with the following prompt, substituting the variables:', and '---' separators from Steps 2 and 3. Prompt content is preserved as direct step instructions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove '**Tool**: In-memory (LLM generation)', 'Call LLM with the following prompt, substituting the variables:', and '---' separators from Steps 2 and 3. Prompt content is preserved as direct step instructions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Revert verify.md, answer.md, hearing.md, semantic-search.md to bb2b35b. The boilerplate removal approach was wrong — files need a proper rewrite, not simple deletion. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Merge hearing.md, answer.md, verify.md into qa.md as sequential steps. Remove 'You are...', 'Call LLM with the following prompt', 'Parse the JSON response' meta-instructions. Also rewrite semantic-search.md in the same style. Delete qa/hearing.md, qa/answer.md, qa/verify.md, qa/ directory. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
'Show greeting' had no definition — removed to prevent unpredictable behavior. Also removed the step hint '(hearing → semantic search → answer → verify)' from the qa.md dispatch line to avoid the agent inferring sub-steps from the comment rather than reading the workflow. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace Steps 4-5 (inlined page/section selection) with a single call to workflows/semantic-search.md. Removes duplicated logic. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace index.md LLM parse with a hardcoded list. Processing types are version-specific so qa.md cannot be mechanically copied across versions. Document this constraint in nabledge-skill.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Step 1 was just a variable assignment for processing_types. Moved the list directly into the AskUserQuestion options in Step 2 and renumbered all steps. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes #343
Approach
2-stage search architecture replacing the legacy full-text-search approach:
Phase A built and benchmarked each component independently. Phase B deployed to skill and ran E2E benchmarks.
Tasks
See tasks.md.
Expert Review
AI-driven expert reviews conducted before PR creation (see .claude/rules/expert-review.md):
Success Criteria Check
🤖 Generated with Claude Code