This repository contains a first-stage simulation study for fallback-control policies in a low-trust edge-cache environment. The goal is to make the experiment understandable, reproducible, and ready for a first progress report, even for a reader who has not seen the research proposal.
Modern video and content-delivery systems often place edge servers close to users so that every request does not have to go back to the origin server. This reduces latency and lowers the load on the core network. However, edge servers may be less reliable than a well-managed origin server because they can run on cheaper hardware, be deployed in weaker environments, or fail temporarily.
This project studies the following question:
When the local edge-server group cannot reconstruct the requested content, should the system go directly to the origin server, or should it first ask neighboring edge servers for help?
The study assumes an erasure-coded cache. A file is split into chunks, and the client needs at least K chunks to reconstruct the file. In the current baseline, K = 3. If the local edge-server group cannot provide enough chunks, the system must choose a fallback action.
The core idea is that neighboring edge servers may already store useful chunks because content requests are skewed by popularity, approximately following a Zipf distribution. Searching neighbors can therefore help, but it can also hurt: if the neighbors are unreliable, the system may waste time probing them and then still need to contact the origin.
This repository implements a preliminary Monte Carlo simulation. It is intentionally simple and is meant to test the fallback-control logic before building a full discrete-event simulator.
The current phase does implement:
- request generation from a Zipf-like popularity distribution;
- local edge-server recovery checks;
- neighbor edge-server recovery checks;
- three fallback policies,
B0,B1, andB2; - repeated trials with fixed seeds for reproducibility;
- confidence intervals for reported metrics;
- one-dimensional sensitivity sweeps;
- a two-dimensional heatmap for the advantage of
B2overB1; - Nature-style research figures;
- an Excel report for easier reading;
- Chinese and Japanese result-interpretation documents.
The current phase does not yet implement:
- request arrival processes;
- service capacity;
- queueing delay;
- real server congestion;
- online trust learning;
- real CDN traces;
- content-level cache placement;
- cache-capacity dynamics;
- cache replacement policies.
This distinction is important: the current origin_delay increase experiment is an origin-delay increase scenario, not a true congestion model.
Each simulated request follows this logic.
- A user requests a content item.
- The content item is assumed to require at least
Kchunks for reconstruction. - The simulator checks whether the local edge-server group can provide enough chunks.
- If local recovery succeeds, the request finishes at the edge.
- If local recovery fails, the selected fallback policy decides what to do next.
The simulation separates two kinds of edge-server availability:
local_es_availability: the reliability of the local edge servers close to the user.neighbor_es_availability: the reliability of the neighboring cooperative edge-server group.
This separation matters because one of the formal scenarios keeps the local servers normal but makes only the neighbor group unreliable. That is the key setting where B2 should avoid useless neighbor searches.
B0 is the traditional baseline. If the local edge-server group cannot reconstruct the file, the system directly fetches the missing chunks from the origin server.
This is simple and avoids wasting time on unreliable neighbors, but it also means the system may use the origin even when nearby edge servers could have helped.
B1 always searches the neighboring cooperative edge-server group after local recovery fails. If the neighbor group can provide enough chunks, the request finishes without using the origin. If the neighbor group also fails, the system then falls back to the origin.
This policy can reduce origin access when neighbors are reliable. However, when neighbors are unreliable, it can create a double delay: first the system spends time searching neighbors, then it still has to fetch from the origin.
B2 is a dynamic decision policy. After local recovery fails, it estimates whether neighbor search is worth trying. It compares:
- the expected delay of trying neighbors and falling back to origin if neighbors fail;
- the delay of going directly to the origin.
In simplified form, the current model estimates neighbor recovery success as:
P_success = Pr(X >= K), where X follows a Binomial distribution
Then it compares:
E_neighbor =
P_success * neighbor_recovery_delay
+ (1 - P_success) * (neighbor_probe_delay + origin_delay)
Neighbor search is selected only when E_neighbor <= origin_delay.
This is a simplified probability model, not a learned trust model. It is enough for the first-stage experiment because the goal is to test whether an expected-delay decision can avoid obviously unprofitable neighbor fallback.
The default first-stage experiment uses the following values.
| Parameter | Value | Meaning |
|---|---|---|
num_contents |
500 |
Number of content items in the simulated library. |
num_requests |
10000 |
Number of requests per trial in the default experiment. |
zipf_alpha |
1.1 |
Popularity skew of the request distribution. |
local_es_availability |
0.82 |
Normal local edge-server availability. |
neighbor_es_availability |
0.82 by default |
Normal neighbor availability unless a scenario overrides it. |
es_availability |
0.82 |
Compatibility field for older scripts. |
origin_delay |
180.0 ms |
Default origin access delay. |
local_es_count |
3 |
Number of local edge servers checked by the model. |
neighbor_group_size |
5 |
Number of neighboring edge servers in the cooperative group. |
k |
3 |
Minimum number of chunks needed to reconstruct the file. |
local_probe_delay |
12.0 ms |
Delay cost for local probing. |
neighbor_probe_delay |
28.0 ms |
Delay cost for neighbor probing. |
local_recovery_delay |
18.0 ms |
Recovery delay when local reconstruction succeeds. |
neighbor_recovery_delay |
48.0 ms |
Recovery delay when neighbor reconstruction succeeds. |
seed |
20260525 |
Base random seed for reproducibility. |
The first-stage study now uses three formal scenarios. These scenarios are the main bridge between the code and the research plan.
| Scenario key | External explanation | Local availability | Neighbor availability | Origin delay | Purpose |
|---|---|---|---|---|---|
steady |
steady scenario | 0.82 |
0.82 |
180 ms |
Tests whether neighbor fallback adds unnecessary overhead under normal conditions. |
low_reliability_neighbor |
low-reliability neighbor ES scenario | 0.82 |
0.25 |
180 ms |
Tests whether B2 avoids useless neighbor search when the neighbor group is unreliable. |
origin_congestion |
origin-delay increase scenario | 0.82 |
0.82 |
320 ms |
Tests the value of edge cooperation when the origin path is more expensive. |
The internal key origin_congestion is kept for compatibility with earlier output files. In reports and memos, it should be described as origin-delay increase, because the current model only increases origin_delay; it does not simulate queueing, request arrivals, service capacity, or real congestion.
scripts/run_experiment.py runs a quick baseline experiment for B0, B1, and B2. It is useful as a smoke test and produces a simple policy-level summary.
scripts/run_sweep.py runs quick one-dimensional sweeps. It keeps the old fast workflow available so that the model can be checked quickly before running repeated trials.
scripts/run_scenarios.py runs the three formal scenarios: steady, low_reliability_neighbor, and origin_congestion as the internal key for origin-delay increase.
For each scenario and each policy, the script runs repeated trials and reports statistics such as mean, standard deviation, standard error, and 95% confidence intervals.
scripts/run_repeated.py runs a larger repeated-trial workflow:
- baseline repeated trials;
origin_delaysweep;es_availabilitysweep for compatibility;- two-dimensional
origin_delay x neighbor_es_availabilitygrid.
The default number of trials is 10. The seed for each trial is generated as:
trial_seed = base_seed + trial_index
This makes the experiment reproducible while still allowing repeated-trial statistics.
scripts/run_memo_sweep.py creates a heatmap grid aligned with the formal scenario parameters. It uses:
neighbor_es_availability = [0.20, 0.25, 0.30, 0.35, 0.45, 0.55, 0.65, 0.82]
origin_delay = [80, 120, 180, 240, 320]
local_es_availability = 0.82
This grid covers:
- the low-reliability neighbor scenario at
neighbor_es_availability = 0.25; - the steady scenario at
neighbor_es_availability = 0.82andorigin_delay = 180; - the origin-delay increase scenario at
neighbor_es_availability = 0.82andorigin_delay = 320.
The main metrics are:
| Metric | Meaning |
|---|---|
mean_response_time |
Average response time across all simulated requests. |
p95_response_time |
95th percentile response time, used to observe tail latency. |
origin_free_rate |
Fraction of requests completed without accessing the origin. |
neighbor_failure_rate |
Fraction of neighbor fallback attempts that fail and then need origin fallback. |
b2_advantage_vs_b1_mean |
Difference between B1 and B2 mean response time. Defined as B1 - B2. Positive values mean B2 is faster than B1. |
The most important result pattern is conditional:
- Under normal conditions,
B1andB2are often similar becauseB2also chooses neighbor search when neighbors are reliable. - Under low neighbor reliability,
B2tends to behave closer toB0by suppressing unprofitable neighbor searches. - When origin delay is higher and neighbors are reliable, both
B1andB2can benefit from neighbor fallback.
| File or directory | Description |
|---|---|
results/summary.csv |
Baseline policy summary. |
results/sweep_summary.csv |
Quick single-seed sweep results. |
results/scenario_summary.csv |
Repeated-trial summary for the three formal scenarios. |
results/scenario_trials.csv |
Per-trial summaries for the formal scenarios. |
results/repeated_summary.csv |
Repeated-trial means, standard deviations, standard errors, and 95% confidence intervals. |
results/repeated_trials.csv |
Per-trial policy summaries for repeated experiments. |
results/grid_summary.csv |
Two-dimensional repeated grid over origin delay and neighbor availability. |
results/memo_heatmap_summary.csv |
Scenario-aligned heatmap grid covering the formal scenario settings. |
results/figures/ |
Publication-style figures exported as SVG, PDF, PNG, and TIFF. |
results/edge_cache_fallback_report.xlsx |
Formatted Excel report for reading results without manually opening CSV files. |
phase1_results.ch.md |
Chinese first-stage result interpretation. |
phase1_results.ja.md |
Japanese first-stage result interpretation. |
research_log.md |
Research-log language index. |
research_log.ch.md |
Chinese research log. |
research_log.ja.md |
Japanese research log. |
Create and activate the Python environment:
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txtRun quick checks:
python scripts\run_experiment.py
python scripts\run_sweep.pyRun the formal first-stage scenarios:
python scripts\run_scenarios.pyRun the repeated sensitivity and grid experiments:
python scripts\run_repeated.pyRun the scenario-aligned heatmap sweep:
python scripts\run_memo_sweep.pyGenerate figures and Excel report:
python scripts\build_figures.py
python scripts\build_report.pyFor faster development checks:
python scripts\run_scenarios.py --trials 3 --num-requests 1000
python scripts\run_repeated.py --trials 3 --num-requests 1000Run unit tests:
python -m unittest discover -s testsThe test suite checks the main simulation behavior, repeated-trial statistics, confidence-interval consistency, policy ordering, local/neighbor availability separation, and the heatmap sweep coverage for the formal scenario parameters.
The first-stage result should not be read as "B2 always wins." The better interpretation is:
B2is useful when the value of neighbor search is conditional. It behaves likeB1when neighbors are likely to help, and it behaves closer toB0when neighbor search is expected to be wasteful.
This is why the heatmap is important. It shows the region where B2 has a positive advantage over B1, based on:
B2 advantage = B1 mean response time - B2 mean response time
Positive values mean B2 is faster than B1.
The next modeling decision is whether to prioritize:
- content-level cache placement and cache capacity modeling; or
- a full discrete-event simulator with request arrivals, service capacity, and queueing delay.
The current repository is designed to make that decision easier by providing a reproducible first-stage baseline and clear evidence about when neighbor fallback is helpful or harmful.