Add A/A benchmark documentation #974

Open
YuanyuanTian-hh wants to merge 3 commits into main from
tianyuanyuan/add-aa-benchmark-docs

@YuanyuanTian-hh
Contributor

Summary

Add diskann-benchmark/AA_BENCHMARK.md documenting how the daily A/A benchmark stability test is conducted and scheduled.

Content Covered

  • Purpose: Detecting environment noise on CI runners (not code regressions)
  • Schedule: Daily at 9 AM UTC via cron, plus manual `workflow_dispatch`
  • Datasets: `wikipedia-100K` and `openai-100K`
  • Execution flow: Step-by-step description of the benchmark pipeline
  • Tolerance thresholds: Build time, QPS, recall, mean I/Os, mean comparisons, mean/P95 latency
  • Failure notification: Auto-created GitHub issue tagging @microsoft/diskann-disk-maintainers
  • Comparison: How A/A differs from the A/B benchmark workflow
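
For readers unfamiliar with tolerance-based A/A checks, the comparison described above can be sketched roughly as follows. This is an illustration only, not the workflow's actual code: the metric keys, the threshold values, and the choice to flag only changes in the "worse" direction are all assumptions made for the example. The real thresholds live in the documentation and workflow added by this PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tolerance:
    metric: str             # hypothetical metric key, not taken from the workflow
    max_rel_change: float   # allowed worsening, as a fraction of the baseline
    higher_is_better: bool


# Hypothetical thresholds, chosen only to make the example concrete.
TOLERANCES = [
    Tolerance("build_time_s", 0.10, False),
    Tolerance("qps", 0.10, True),
    Tolerance("recall", 0.01, True),
    Tolerance("mean_ios", 0.01, False),
    Tolerance("mean_comparisons", 0.01, False),
    Tolerance("mean_latency_us", 0.15, False),
    Tolerance("p95_latency_us", 0.15, False),
]


def worsening(baseline, target, higher_is_better):
    """Fractional change in the 'worse' direction; negative means improvement."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative comparison")
    delta = (target - baseline) / baseline
    return -delta if higher_is_better else delta


def check_run(baseline, target, tolerances=TOLERANCES):
    """Return the metrics whose worsening exceeds the allowed tolerance."""
    violations = []
    for tol in tolerances:
        change = worsening(baseline[tol.metric], target[tol.metric],
                           tol.higher_is_better)
        if change > tol.max_rel_change:
            violations.append(tol.metric)
    return violations
```

In an A/A run both result sets come from the same commit, so any metric returned by `check_run` points at runner noise rather than a code change.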

Motivation

This makes it easier for new contributors and maintainers to understand how the A/A stability test works without having to read the YAML workflow directly.

Contributor

Copilot AI left a comment

Pull request overview

Adds standalone documentation for the daily DiskANN A/A benchmark stability workflow, so maintainers can understand what it does and how to interpret failures without reading the workflow YAML.

Changes:

  • Add diskann-benchmark/AA_BENCHMARK.md describing the A/A benchmark purpose, schedule, datasets, and pipeline steps.
  • Document the tolerance thresholds used for baseline vs target comparisons.
  • Document failure notification behavior and how A/A differs from the A/B benchmark workflow.

@codecov-commenter
codecov-commenter commented Apr 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.43%. Comparing base (cb52a9f) to head (64c1e48).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #974      +/-   ##
==========================================
- Coverage   89.43%   89.43%   -0.01%     
==========================================
  Files         449      449              
  Lines       83779    83779              
==========================================
- Hits        74928    74926       -2     
- Misses       8851     8853       +2     
Flag Coverage Δ
miri 89.43% <ø> (-0.01%) ⬇️
unittests 89.27% <ø> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.
see 1 file with indirect coverage changes

Comment thread diskann-benchmark/AA_BENCHMARK.md Outdated
Comment thread diskann-benchmark/AA_BENCHMARK.md Outdated
Comment thread .github/docs/disk-benchmarks-aa.md Outdated
Comment thread .github/docs/disk-benchmarks-aa.md Outdated
@YuanyuanTian-hh force-pushed the tianyuanyuan/add-aa-benchmark-docs branch from 419230d to 49554d3 on April 28, 2026 08:25
Address review feedback from arrayka:
- Place doc in .github/docs/disk-benchmarks-aa.md (lowercase, next to workflows)
- Remove redundant Steps section to keep the doc concise
- Add back-link from disk-benchmarks-aa.yml to the doc
- Keep tolerance thresholds, failure notification, and A/B comparison sections
@YuanyuanTian-hh force-pushed the tianyuanyuan/add-aa-benchmark-docs branch from 49554d3 to d53e542 on April 28, 2026 08:31
Yuanyuan Tian (from Dev Box) and others added 2 commits April 28, 2026 16:52
Address review feedback: the A/A benchmark has a 95% reliability promise,
meaning 1 failure in 20 runs is expected. The notify-on-failure job now
checks the last 20 completed runs and only creates a GitHub issue if the
failure rate exceeds 5%. This avoids unnecessary noise from expected
environment variability.

Also update disk-benchmarks-aa.md to document this behavior.
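
The gating rule in the commit message above can be sketched as below, under stated assumptions. The function name, the newest-first ordering, and the use of GitHub's `"success"` / `"failure"` conclusion strings are illustrative; how the actual notify-on-failure job fetches and counts runs is not shown in this conversation.

```python
from fractions import Fraction


def should_open_issue(conclusions, window=20, max_failure_rate=Fraction(1, 20)):
    """Decide whether the recent failure rate exceeds the allowed budget.

    `conclusions` is assumed to list completed run conclusions, newest first.
    Exact fractions avoid floating-point surprises right at the 5% boundary.
    """
    recent = conclusions[:window]
    if not recent:
        return False
    failures = sum(1 for c in recent if c == "failure")
    return Fraction(failures, len(recent)) > max_failure_rate
```

With a full window of 20 runs, a single failure is exactly 5% and stays within budget, while a second failure in the same window pushes the rate to 10% and would trigger an issue.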
@YuanyuanTian-hh requested a review from arrayka April 28, 2026 08:53
4 participants