
Add request-level B2 Zipf-aware experiments #1

Draft
chr331 wants to merge 5 commits into main from codex/b2-zipf-phase1

chr331 (Owner) commented Jun 3, 2026

Summary

This PR implements Phase 1.1 for the edge-cache fallback study and adds a follow-up latency-difference audit after noticing that B2 and B1 are nearly identical in the original all-request latency bars.

  • Organizes the project with docs/project_map.md, docs/experiment_guide.md, CHANGELOG.md, and results/phase1_b2_zipf/manifest.json.
  • Changes B2 from a scenario-level fixed decision to a request-level Zipf-aware expected-delay decision.
  • Adds rank-aware neighbor cache probability and records request-level interpretability fields such as content_rank, missing_chunks, neighbor_cache_probability, b2_expected_neighbor_delay, and b2_neighbor_selected.
  • Adds fallback-stage metrics and a decision_boundary_neighbor diagnostic scenario to show where B2 actually differs from B1 in latency.
  • Adds Zipf/cache sensitivity experiments and rank-bucket analysis for hot, mid, and cold content.
  • Rebuilds Phase 1.1 figures in a Nature-style matplotlib workflow. SVG/PDF are the versioned primary artifacts; PNG/TIFF are local QA outputs and are ignored.
  • Adds docs/phase1_b2_zipf_report.md, a detailed Chinese experiment report with embedded figures, result tables, claim-evidence mapping, and reviewer-style limitations.

B2 model

For a local recovery miss:

  • missing_chunks = K - local_chunks
  • p_cache(rank) = cold + (hot - cold) * rank ** (-zipf_alpha * cache_rank_gamma)
  • p_chunk = neighbor_es_availability * p_cache(rank)
  • P_success = Pr(Binomial(neighbor_group_size, p_chunk) >= missing_chunks)
  • E_neighbor = P_success * neighbor_recovery_delay + (1 - P_success) * (neighbor_probe_delay + origin_delay)

B2 searches the neighbor group only when E_neighbor <= origin_delay.
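The decision rule above can be sketched in a few lines of Python. This is a minimal illustration of the formulas as written in this description, not the code in src; the function names and every numeric value in EXAMPLE are hypothetical and chosen only to exercise both branches of the rule.

```python
from math import comb


def cache_probability(rank, hot, cold, zipf_alpha, cache_rank_gamma):
    """p_cache(rank): decays from `hot` at rank 1 toward `cold` for large ranks."""
    return cold + (hot - cold) * rank ** (-zipf_alpha * cache_rank_gamma)


def success_probability(group_size, p_chunk, missing_chunks):
    """Pr(Binomial(group_size, p_chunk) >= missing_chunks)."""
    if missing_chunks <= 0:
        return 1.0
    if missing_chunks > group_size:
        return 0.0
    return sum(
        comb(group_size, k) * p_chunk ** k * (1 - p_chunk) ** (group_size - k)
        for k in range(missing_chunks, group_size + 1)
    )


def b2_decision(rank, local_chunks, *, K, hot, cold, zipf_alpha, cache_rank_gamma,
                neighbor_es_availability, neighbor_group_size,
                neighbor_recovery_delay, neighbor_probe_delay, origin_delay):
    """Return (E_neighbor, neighbor_selected) for one local recovery miss."""
    missing_chunks = K - local_chunks
    p_chunk = neighbor_es_availability * cache_probability(
        rank, hot, cold, zipf_alpha, cache_rank_gamma)
    p_success = success_probability(neighbor_group_size, p_chunk, missing_chunks)
    e_neighbor = (p_success * neighbor_recovery_delay
                  + (1 - p_success) * (neighbor_probe_delay + origin_delay))
    return e_neighbor, e_neighbor <= origin_delay


# Hypothetical parameters, for illustration only (not the PR's configuration).
EXAMPLE = dict(K=4, hot=0.9, cold=0.1, zipf_alpha=0.8, cache_rank_gamma=1.0,
               neighbor_es_availability=0.9, neighbor_group_size=4,
               neighbor_recovery_delay=20.0, neighbor_probe_delay=10.0,
               origin_delay=80.0)

if __name__ == "__main__":
    print(b2_decision(rank=1, local_chunks=3, **EXAMPLE))      # hot content, 1 chunk missing
    print(b2_decision(rank=1000, local_chunks=1, **EXAMPLE))   # cold content, 3 chunks missing
```

Under these assumed values the hot request goes to the neighbor group and the cold request goes straight to origin, because a failed probe costs neighbor_probe_delay on top of origin_delay.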

Latency-difference audit

The original three research-plan scenarios do not show a strong all-request latency gap between B1 and B2:

  • steady: B2 mean advantage over B1 = 0.2746 ms
  • low-reliability neighbor: 1.5561 ms
  • origin-delay increase: 0.0769 ms

That is expected because B2 often chooses the same neighbor-first action as B1 in steady/origin-delay-increase settings, and because local-success requests dilute fallback-policy differences.
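The dilution effect can be made concrete with a small sketch. It assumes that local-success requests have the same mean latency under both policies and that both policies see the same share of fallback requests; all numbers are hypothetical and are not results from this PR.

```python
def overall_mean(local_mean, fallback_mean, fallback_share):
    """All-request mean as a mix of local-success and fallback requests."""
    return (1 - fallback_share) * local_mean + fallback_share * fallback_mean


# Hypothetical values, for illustration only.
local_mean_ms = 5.0          # identical for B1 and B2 by assumption
fallback_share = 0.1         # fraction of requests with missing_chunks > 0
b1_fallback_mean_ms = 60.0
b2_fallback_mean_ms = 55.0

b1_overall = overall_mean(local_mean_ms, b1_fallback_mean_ms, fallback_share)
b2_overall = overall_mean(local_mean_ms, b2_fallback_mean_ms, fallback_share)

# The all-request gap equals the fallback-stage gap scaled by the fallback share.
overall_gap = b1_overall - b2_overall
fallback_gap = b1_fallback_mean_ms - b2_fallback_mean_ms
```

Under these assumptions a fallback-stage gap shrinks by the fallback share when averaged over all requests, which is why the all-request bars can look nearly identical.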

This update adds fallback-stage metrics and a decision-boundary diagnostic scenario:

  • fallback_mean_response_time and fallback_p95_response_time only count requests with missing_chunks > 0.
  • neighbor_skip_rate shows how often a local miss skips neighbor probing.
  • b2_fallback_advantage_vs_b1_mean reports B1 fallback mean minus B2 fallback mean.
  • In decision_boundary_neighbor, B2 mean advantage over B1 is 6.5971 ms.
  • In the same diagnostic scenario, B2 fallback-stage mean advantage is 14.7278 ms.
  • B2 neighbor choice rate in the diagnostic scenario is 0.15595, meaning B2 skips most low-value neighbor probes instead of behaving like B1.

So the bounded claim is: B2 is not faster than B1 in every overall latency bar. Its value is clearest after local recovery fails, when whether a neighbor search pays off depends on content rank and neighbor reliability.
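The fallback-stage metrics listed above could be computed from request-level records roughly as follows. This is a sketch, not the repository's implementation: the response_time and neighbor_selected field names, the nearest-rank p95 method, and the sample records are assumptions.

```python
import math
from statistics import mean


def fallback_metrics(requests, selected_field="neighbor_selected"):
    """Summarize only requests that missed locally (missing_chunks > 0)."""
    fallback = [r for r in requests if r["missing_chunks"] > 0]
    if not fallback:
        return None
    times = sorted(r["response_time"] for r in fallback)
    # Nearest-rank percentile; the PR does not state which method it uses.
    p95_index = math.ceil(0.95 * len(times)) - 1
    skipped = sum(1 for r in fallback if not r[selected_field])
    return {
        "fallback_mean_response_time": mean(times),
        "fallback_p95_response_time": times[p95_index],
        "neighbor_skip_rate": skipped / len(fallback),
    }


# Hypothetical records for the same three requests under each policy.
b1_requests = [
    {"missing_chunks": 0, "response_time": 5.0, "neighbor_selected": False},
    {"missing_chunks": 2, "response_time": 90.0, "neighbor_selected": True},
    {"missing_chunks": 1, "response_time": 30.0, "neighbor_selected": True},
]
b2_requests = [
    {"missing_chunks": 0, "response_time": 5.0, "neighbor_selected": False},
    {"missing_chunks": 2, "response_time": 80.0, "neighbor_selected": False},
    {"missing_chunks": 1, "response_time": 30.0, "neighbor_selected": True},
]

b1 = fallback_metrics(b1_requests)
b2 = fallback_metrics(b2_requests)
# B1 fallback mean minus B2 fallback mean, as defined in the list above.
b2_fallback_advantage_vs_b1_mean = (
    b1["fallback_mean_response_time"] - b2["fallback_mean_response_time"]
)
```

Filtering on missing_chunks > 0 before averaging is what keeps local-success requests from diluting the comparison.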

Experiment commands

Formal result generation:

python scripts/run_scenarios.py --trials 10 --num-requests 10000 --output-dir results/phase1_b2_zipf
python scripts/run_b2_zipf_sweep.py --trials 10 --num-requests 10000 --output-dir results/phase1_b2_zipf
python scripts/build_figures.py --results-dir results/phase1_b2_zipf
python scripts/write_manifest.py --output-dir results/phase1_b2_zipf --command "python scripts/run_scenarios.py --trials 10 --num-requests 10000 --output-dir results/phase1_b2_zipf" --command "python scripts/run_b2_zipf_sweep.py --trials 10 --num-requests 10000 --output-dir results/phase1_b2_zipf" --command "python scripts/build_figures.py --results-dir results/phase1_b2_zipf"

Smoke-test commands are documented in docs/experiment_guide.md:

python scripts/run_scenarios.py --trials 2 --num-requests 1000 --output-dir results/phase1_b2_zipf
python scripts/run_b2_zipf_sweep.py --trials 2 --num-requests 1000 --output-dir results/phase1_b2_zipf
python scripts/build_figures.py --results-dir results/phase1_b2_zipf

Checks

  • python -m unittest discover -s tests -> 18 tests OK
  • python -m compileall src scripts tests -> OK
  • Report figure-link check -> OK
  • git diff --check -> OK, with Windows line-ending warnings only
  • Visual QA completed for the scenario mean response time and fallback-stage mean response time figures.

Main observations

  • Original all-request scenario bars should be described as modest B2 improvements, not strong latency gains.
  • The new fallback-stage figure makes the decision-boundary difference visible.
  • The neighbor/origin heatmap still shows a max B2 advantage over B1 of about 6.626 ms at neighbor availability 0.20 and origin delay 80 ms.
  • Zipf/cache sensitivity shows up to 3.904 ms B2 advantage, especially when cache probability is more concentrated by rank.
  • Rank-bucket analysis shows the intended behavior clearly: B2 neighbor choice rate is 0.6823 for hot content and 0.0 for mid/cold content under the chosen decision-boundary setting.
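A rank-bucket breakdown like the one in the last observation could be produced along these lines. The bucket boundaries (hot_max_rank, mid_max_rank), the choice to compute the rate over local misses only, and the sample records are assumptions for illustration; the actual definitions live in the analysis scripts.

```python
from collections import defaultdict


def rank_bucket(content_rank, hot_max_rank=10, mid_max_rank=100):
    """Map a content rank to 'hot', 'mid', or 'cold' (hypothetical cut-offs)."""
    if content_rank <= hot_max_rank:
        return "hot"
    if content_rank <= mid_max_rank:
        return "mid"
    return "cold"


def neighbor_choice_rate_by_bucket(requests):
    """Share of local misses in each bucket where B2 chose the neighbor group."""
    totals = defaultdict(int)
    selected = defaultdict(int)
    for r in requests:
        if r["missing_chunks"] <= 0:
            continue  # local recovery succeeded, so no fallback decision was made
        bucket = rank_bucket(r["content_rank"])
        totals[bucket] += 1
        selected[bucket] += int(bool(r["b2_neighbor_selected"]))
    return {bucket: selected[bucket] / totals[bucket] for bucket in totals}


# Hypothetical request records, for illustration only.
sample = [
    {"content_rank": 1, "missing_chunks": 1, "b2_neighbor_selected": True},
    {"content_rank": 3, "missing_chunks": 2, "b2_neighbor_selected": False},
    {"content_rank": 50, "missing_chunks": 1, "b2_neighbor_selected": False},
    {"content_rank": 500, "missing_chunks": 3, "b2_neighbor_selected": False},
    {"content_rank": 2, "missing_chunks": 0, "b2_neighbor_selected": False},
]
rates = neighbor_choice_rate_by_bucket(sample)
```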

Scope

This remains a Phase 1 Monte Carlo simulation. It does not yet model real arrival processes, queues, service capacity, cache replacement, online trust learning, or true CDN/origin congestion. The internal scenario name origin_congestion is documented as an origin-delay increase scenario.
