Add request-level B2 Zipf-aware experiments #1
Draft
chr331 wants to merge 5 commits into
Summary
This PR implements Phase 1.1 for the edge-cache fallback study and adds a follow-up latency-difference audit after noticing that B2 and B1 are nearly identical in the original all-request latency bars.
- `docs/project_map.md`, `docs/experiment_guide.md`, `CHANGELOG.md`, and `results/phase1_b2_zipf/manifest.json`.
- `content_rank`, `missing_chunks`, `neighbor_cache_probability`, `b2_expected_neighbor_delay`, and `b2_neighbor_selected`.
- `decision_boundary_neighbor` diagnostic scenario to show where B2 actually differs from B1 in latency.
- `docs/phase1_b2_zipf_report.md`, a detailed Chinese experiment report with embedded figures, result tables, claim-evidence mapping, and reviewer-style limitations.

B2 model
For a local recovery miss:
- `missing_chunks = K - local_chunks`
- `p_cache(rank) = cold + (hot - cold) * rank ** (-zipf_alpha * cache_rank_gamma)`
- `p_chunk = neighbor_es_availability * p_cache(rank)`
- `P_success = Pr(Binomial(neighbor_group_size, p_chunk) >= missing_chunks)`
- `E_neighbor = P_success * neighbor_recovery_delay + (1 - P_success) * (neighbor_probe_delay + origin_delay)`

B2 searches the neighbor group only when `E_neighbor <= origin_delay`.

Latency-difference audit
The original three research-plan scenarios do not show a strong all-request latency gap between B1 and B2:
- 0.2746 ms
- 1.5561 ms
- 0.0769 ms

That is expected for two reasons: B2 often chooses the same neighbor-first action as B1 in the steady and origin-delay-increase settings, and requests that succeed locally dilute any difference between the fallback policies.
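The B2 rule from the model section can be traced numerically to see why it so often agrees with B1. The following is a minimal sketch, not the repository's implementation: the function names and every numeric parameter value are hypothetical and chosen only for illustration; only the formulas follow the PR description.

```python
from math import comb


def p_cache(rank, hot, cold, zipf_alpha, cache_rank_gamma):
    """Rank-dependent neighbor cache probability (formula from the B2 model)."""
    return cold + (hot - cold) * rank ** (-zipf_alpha * cache_rank_gamma)


def binomial_tail(n, p, k):
    """Pr(Binomial(n, p) >= k), computed by direct summation."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def b2_decision(rank, missing_chunks, *, hot, cold, zipf_alpha, cache_rank_gamma,
                neighbor_es_availability, neighbor_group_size,
                neighbor_recovery_delay, neighbor_probe_delay, origin_delay):
    """Return (E_neighbor, search_neighbors) for one local recovery miss."""
    p_chunk = neighbor_es_availability * p_cache(
        rank, hot, cold, zipf_alpha, cache_rank_gamma)
    p_success = binomial_tail(neighbor_group_size, p_chunk, missing_chunks)
    e_neighbor = (p_success * neighbor_recovery_delay
                  + (1 - p_success) * (neighbor_probe_delay + origin_delay))
    return e_neighbor, e_neighbor <= origin_delay


# Hypothetical parameter values, for illustration only.
params = dict(hot=0.9, cold=0.1, zipf_alpha=1.0, cache_rank_gamma=1.0,
              neighbor_group_size=4, neighbor_recovery_delay=10.0,
              neighbor_probe_delay=5.0, origin_delay=80.0)

# Hot content with fully available neighbors: B2 probes, exactly as B1 would.
hot_e, hot_search = b2_decision(1, 1, neighbor_es_availability=1.0, **params)

# Cold content with scarce neighbors: a failed probe costs more than going
# straight to the origin, so B2 skips the neighbor group.
cold_e, cold_search = b2_decision(1000, 2, neighbor_es_availability=0.2, **params)
```

Under these assumed values the hot request takes the same neighbor-first action as B1, and only the cold request diverges, which is consistent with the small all-request gaps reported above.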
This update adds fallback-stage metrics and a decision-boundary diagnostic scenario:
- `fallback_mean_response_time` and `fallback_p95_response_time` only count requests with `missing_chunks > 0`.
- `neighbor_skip_rate` shows how often a local miss skips neighbor probing.
- `b2_fallback_advantage_vs_b1_mean` reports B1 fallback mean minus B2 fallback mean.
- In `decision_boundary_neighbor`, B2 mean advantage over B1 is 6.5971 ms.
- 14.7278 ms.
- 0.15595, meaning B2 skips most low-value neighbor probes instead of behaving like B1.

So the bounded claim is: B2 is not universally faster than B1 in every overall latency bar; its value is clearest after local recovery fails and the neighbor search value is conditional.
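The fallback-stage metrics can be illustrated on per-request records. This is a sketch under stated assumptions: the record keys `response_time` and `neighbor_selected` are hypothetical (the PR names the B2 field `b2_neighbor_selected`), the nearest-rank percentile is an assumed choice that may differ from the repository's method, and the sample data is invented.

```python
import math
from statistics import mean


def nearest_rank_percentile(values, q):
    """Nearest-rank percentile (an assumed method, not confirmed by the PR)."""
    ordered = sorted(values)
    index = max(0, math.ceil(q / 100 * len(ordered)) - 1)
    return ordered[index]


def fallback_metrics(requests):
    """Compute fallback-stage metrics over requests with missing_chunks > 0."""
    fallback = [r for r in requests if r["missing_chunks"] > 0]
    if not fallback:
        raise ValueError("no local recovery misses in this run")
    times = [r["response_time"] for r in fallback]
    skipped = sum(1 for r in fallback if not r["neighbor_selected"])
    return {
        "fallback_mean_response_time": mean(times),
        "fallback_p95_response_time": nearest_rank_percentile(times, 95),
        "neighbor_skip_rate": skipped / len(fallback),
    }


def b2_fallback_advantage_vs_b1_mean(b1_metrics, b2_metrics):
    """B1 fallback mean minus B2 fallback mean; positive means B2 is faster."""
    return (b1_metrics["fallback_mean_response_time"]
            - b2_metrics["fallback_mean_response_time"])


# Invented sample: the first request succeeds locally and is excluded.
b1_run = [
    {"missing_chunks": 0, "response_time": 10, "neighbor_selected": False},
    {"missing_chunks": 2, "response_time": 90, "neighbor_selected": True},
    {"missing_chunks": 1, "response_time": 30, "neighbor_selected": True},
]
b2_run = [
    {"missing_chunks": 0, "response_time": 10, "neighbor_selected": False},
    {"missing_chunks": 2, "response_time": 80, "neighbor_selected": False},
    {"missing_chunks": 1, "response_time": 30, "neighbor_selected": True},
]

b1_metrics = fallback_metrics(b1_run)
b2_metrics = fallback_metrics(b2_run)
advantage = b2_fallback_advantage_vs_b1_mean(b1_metrics, b2_metrics)
```

Filtering on `missing_chunks > 0` before averaging is what keeps locally served requests from diluting the comparison.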
Experiment commands
Formal result generation:
Smoke-test commands are documented in `docs/experiment_guide.md`:

Checks
- `python -m unittest discover -s tests` -> 18 tests OK
- `python -m compileall src scripts tests` -> OK
- `git diff --check` -> OK, with Windows line-ending warnings only

Main observations
- 6.626 ms at neighbor availability 0.20 and origin delay 80 ms.
- 3.904 ms B2 advantage, especially when cache probability is more concentrated by rank.
- 0.6823 for hot content and 0.0 for mid/cold content under the chosen decision-boundary setting.

Scope
This remains a Phase 1 Monte Carlo simulation. It does not yet model real arrival processes, queues, service capacity, cache replacement, online trust learning, or true CDN/origin congestion. The internal scenario name `origin_congestion` is documented as an origin-delay increase scenario.