MapShift is an executable benchmark and evaluation protocol for testing whether environment knowledge acquired during reward-free exploration remains useful after a structured environmental intervention. In the controlled post-intervention evaluation (CPE) protocol, an agent explores a base environment without task reward, the environment is changed along one controlled intervention family, and the agent is evaluated on post-intervention planning, inference, and adaptation tasks.
This repository is the executable code artifact for the current MapShift release. It tracks the environment generator, intervention operators, task samplers, benchmark health checks, baselines, study runner, output schemas, reproduction commands, and paper table/figure renderers. Full evaluation records, health reports, rendered tables, figures, and manifests are generated locally by the commands below rather than committed to the repository.
This repository is a self-contained executable artifact. It regenerates benchmark instances, health checks, evaluation records, summary tables, rendered figures, and provenance manifests from versioned source code and configuration files. Generated instances and outputs are produced by the code, not tracked as a separately hosted static dataset in the repository. Output filenames and some config names retain the internal protocol id cep for compatibility; reviewer-facing prose uses the paper terminology CPE.
| Artifact component | Location or command |
|---|---|
| Smoke command | python3 scripts/build_benchmark.py --tier mapshift_2d --study-config configs/analysis/mapshift_2d_full_study_smoke_v0_1.json --output-dir outputs/releases/mapshift_2d_v0_1_smoke --print-summary |
| One-command artifact audit | python3 scripts/audit_artifact.py --quick --output-dir outputs/audit/mapshift_quick |
| Full reproduction command | python3 scripts/build_benchmark.py --tier mapshift_2d --study-config configs/analysis/mapshift_2d_full_study_v0_1.json --output-dir outputs/releases/mapshift_2d_v0_1_full --print-summary |
| Deterministic mechanism diagnostic | python3 scripts/run_mapshift_2d_study.py configs/analysis/mapshift_2d_belief_update_diagnostic_v0_1.json --print-summary |
| MiniGrid port smoke | python3 scripts/smoke_minigrid_port.py --family topology --severity 2 --motif two_room_door |
| MiniGrid CPE sanity-transfer study | python3 scripts/run_minigrid_cpe_study.py --output-dir outputs/studies/minigrid_cpe_sanity_transfer --print-summary |
| High-capacity pretrained graph add-on | python3 scripts/generate_calibration_report.py configs/benchmark/release_v0_1.json --tier mapshift_2d --run-config configs/calibration/pretrained_structured_graph_world_model_1m_v0_1.json --model-seed 0 --samples-per-motif 1 --task-samples-per-class 3 --output outputs/studies/pretrained_structured_graph_world_model_1m_v0_1/cep_report.json --log-file outputs/studies/pretrained_structured_graph_world_model_1m_v0_1/logs/run.log --print-summary |
| Large persistent-memory add-on | python3 scripts/generate_calibration_report.py configs/benchmark/release_v0_1.json --tier mapshift_2d --run-config configs/calibration/persistent_memory_world_model_large_v0_1.json --model-seed 0 --samples-per-motif 1 --task-samples-per-class 3 --output outputs/studies/persistent_memory_world_model_large_v0_1/cep_report.json --log-file outputs/studies/persistent_memory_world_model_large_v0_1/logs/run.log --print-summary |
| Expected runtime | Smoke: minutes on CPU. Deterministic diagnostic: short CPU/GPU run. Full expanded study: single-GPU run recommended; budget roughly 30-36 wall-clock hours on one NVIDIA L4, with faster completion expected on L40S/H100-class GPUs. |
| CPU/GPU requirements | CPU works for validation and smoke. Full study works on CPU but is intended for a CUDA-capable PyTorch install when available. |
| Disk usage | Reserve 5GB for generated outputs and checkpoints; reserve more if installing CUDA PyTorch wheels into a fresh environment. |
| Output paths | outputs/releases/<run_name>/health, study, paper_outputs, manifests, and logs. |
| Tables/figures regenerate | python3 scripts/render_paper_outputs.py <study_bundle.json> --output-dir <paper_outputs> --print-summary |
| Version/config hash | Package version 0.1.0; schema version 0.1.0; release manifest records config_hash. |
| Dependency installation | python3 -m pip install -e . from a fresh Python 3.10+ environment. |
| Reviewed dependency pins | requirements-lock.txt |
| Baseline hyperparameters | See configs/calibration/*.json and the table below. |
| Evaluation record schema | See the "Evaluation Record Schema" section below. |
Use Python 3.10 or newer. The package dependencies are bounded in pyproject.toml:
numpy>=1.26,<3.0torch>=2.2,<3.0
CPU install:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e .Optional MiniGrid port install:
python -m pip install -e ".[minigrid]"
python3 scripts/smoke_minigrid_port.py --instantiate-minigridThe MiniGrid port is not required for reproducing the paper's primary MapShift-2D tables. It is included as an executable adapter showing how the CPE contract maps onto a richer external simulator API: deterministic matched base/intervened MiniGrid-compatible environments, declared metric/topology/dynamics/semantic interventions, invariant validators, a compact sanity-transfer study, and optional instantiation of an actual minigrid environment when the extra dependency is installed.
Run the compact MiniGrid CPE sanity-transfer study:
python3 scripts/run_minigrid_cpe_study.py \
--output-dir outputs/studies/minigrid_cpe_sanity_transfer \
--print-summaryThis CPU-only study uses six MiniGrid-style motifs with four deterministic seeds each by default. It reports validator failures, reference solvability, task rejections, weak-baseline saturation, family-wise deterministic probe scores, and a protocol-sensitivity table comparing same-environment evaluation with CPE.
Reviewed-environment pins are provided for reviewers who want a tighter dependency target:
python -m pip install -r requirements-lock.txtCUDA install:
Install a CUDA-enabled PyTorch build appropriate for the host first if needed, then install MapShift in editable mode:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e .Verify the selected Torch device:
python3 - <<'PY'
import torch
print("torch:", torch.__version__)
print("cuda_available:", torch.cuda.is_available())
print("device_count:", torch.cuda.device_count())
print("device_0:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else None)
PYTo force the learned baselines onto a particular device and isolate checkpoints:
export MAPSHIFT_TORCH_DEVICE=cuda:0
export MAPSHIFT_CHECKPOINT_DIR=/tmp/mapshift_learned_baselines_full_v0_1If MAPSHIFT_TORCH_DEVICE is unset, learned baselines use their config value. The release configs use torch_device: "auto", which selects CUDA only when PyTorch reports that CUDA is available.
For the artifact runbook, see ARTIFACT_EVALUATION.md.
Validate the release config bundle:
python3 scripts/validate_benchmark.py \
--tier mapshift_2d \
configs/benchmark/release_v0_1.jsonGenerate a benchmark health preflight:
python3 scripts/generate_benchmark_health.py \
configs/benchmark/release_v0_1.json \
--samples-per-motif 1 \
--task-samples-per-class 3 \
--output outputs/releases/mapshift_2d_v0_1_health_preflight.jsonRun tests:
python3 -m unittest discover -s tests -p 'test_*.py'Run the one-command artifact audit:
python3 scripts/audit_artifact.py \
--quick \
--output-dir outputs/audit/mapshift_quickThe audit validates the release config, runs tests, builds the reviewer smoke artifact, verifies required output paths, checks zero fatal leakage and zero validator failures, and confirms paper-facing rendered outputs exist.
The smoke command builds a small release artifact: config copies, split manifests, benchmark health, a learned-baseline smoke study, JSON tables, SVG figures, and manifests.
python3 scripts/build_benchmark.py \
--tier mapshift_2d \
--study-config configs/analysis/mapshift_2d_full_study_smoke_v0_1.json \
--output-dir outputs/releases/mapshift_2d_v0_1_smoke \
--print-summaryExpected runtime: a few minutes on a typical laptop or CPU VM. It is intended for artifact sanity checking, not for reproducing paper-scale confidence intervals.
Expected outputs:
outputs/releases/mapshift_2d_v0_1_smoke/
logs/build_benchmark.log
health/benchmark_health.json
study/study_bundle.json
study/raw/cep_report.json
study/raw/protocol_comparison_report.json
study/raw/benchmark_health_report.json
study/tables/*.json
study/figures/*.json
paper_outputs/tables/*.md
paper_outputs/figures/*.svg
manifests/release_manifest.json
The full command regenerates the paper-facing results from the release configs:
python3 scripts/build_benchmark.py \
--tier mapshift_2d \
--study-config configs/analysis/mapshift_2d_full_study_v0_1.json \
--output-dir outputs/releases/mapshift_2d_v0_1_full \
--print-summaryRecommended hardware:
- CPU: enough for validation, health checks, and smoke.
- GPU: a single CUDA GPU is recommended for the full study. The reference run was launched on one NVIDIA L4 with 23GB memory.
- RAM: 16GB is recommended for the full build because evaluation records are accumulated before the study bundle is written.
- Disk: reserve at least 5GB for outputs, logs, and checkpoints. A fresh CUDA PyTorch install may require several additional GB in the virtual environment or pip cache.
Expected runtime:
- Smoke: minutes.
- Full single-GPU study: overnight-scale to multi-day depending on GPU. The expanded 24-motif release should be budgeted at roughly 30-36 wall-clock hours on one NVIDIA L4, with faster completion expected on L40S/H100-class GPUs. Runtime is dominated by task generation, evaluation loops, bootstrap aggregation, and rendering. The learned training jobs are small.
- Full CPU-only study: possible but not recommended for deadline-sensitive reproduction.
Monitor progress:
tail -f outputs/releases/mapshift_2d_v0_1_full/logs/build_benchmark.logCount evaluation chunks:
grep -c "evaluating family=" outputs/releases/mapshift_2d_v0_1_full/logs/build_benchmark.logThe full config currently runs one primary CPE sweep (cep protocol id) plus four protocol-comparison sweeps. Each sweep has 24 motifs x 4 intervention families = 96 logged family chunks, so a completed run should log roughly 480 evaluating family= lines.
Check completion:
grep -E "Study complete|Study bundle written|Rendered|Release manifest written" \
outputs/releases/mapshift_2d_v0_1_full/logs/build_benchmark.logCore result files are written after the long study stage completes:
outputs/releases/mapshift_2d_v0_1_full/study/raw/cep_report.json
outputs/releases/mapshift_2d_v0_1_full/study/raw/protocol_comparison_report.json
outputs/releases/mapshift_2d_v0_1_full/study/tables/familywise_main_results.json
outputs/releases/mapshift_2d_v0_1_full/study/tables/severity_response.json
outputs/releases/mapshift_2d_v0_1_full/study/tables/protocol_sensitivity_and_rank_correlation.json
outputs/releases/mapshift_2d_v0_1_full/paper_outputs/tables/main_familywise_results.md
outputs/releases/mapshift_2d_v0_1_full/paper_outputs/figures/familywise_main_results.svg
outputs/releases/mapshift_2d_v0_1_full/paper_outputs/figures/severity_response_curves.svg
outputs/releases/mapshift_2d_v0_1_full/paper_outputs/figures/protocol_rank_reversal_comparison.svg
The complete mapshift_2d_full_study_v0_1 configuration runs the full baseline roster under five protocol variants. When wall-clock time is constrained, use two grid-aligned runs instead:
- A CPE-only full-roster run for the main family-wise results table.
- A deterministic full-protocol diagnostic for stale-map, weak-heuristic, and belief-update protocol sensitivity.
This preserves the 24 structural motifs, 10/6/8 split, all four intervention families, all severity levels, and the full task mix, while avoiding five repeated full-roster sweeps.
CPE-only full-roster run:
export MAPSHIFT_TORCH_DEVICE=cuda:0
export MAPSHIFT_CHECKPOINT_DIR=/tmp/mapshift_learned_baselines_cep_only_v0_1
python3 scripts/build_benchmark.py \
--tier mapshift_2d \
--study-config configs/analysis/mapshift_2d_main_cep_only_v0_1.json \
--output-dir outputs/releases/mapshift_2d_v0_1_cep_only \
--print-summaryDeterministic full-protocol diagnostic:
python3 scripts/run_mapshift_2d_study.py \
configs/analysis/mapshift_2d_belief_update_diagnostic_full_protocols_v0_1.json \
--output-dir outputs/studies/mapshift_2d_belief_update_diagnostic_full_protocols_v0_1 \
--log-file outputs/studies/mapshift_2d_belief_update_diagnostic_full_protocols_v0_1/logs/run.log \
--print-summaryThen generate held-out motif and paired-bootstrap delta tables:
python3 scripts/analyze_mechanism_diagnostic.py \
outputs/studies/mapshift_2d_belief_update_diagnostic_full_protocols_v0_1/study_bundle.json \
--output-dir outputs/studies/mapshift_2d_belief_update_diagnostic_full_protocols_v0_1/mechanism_diagnostic_analysis \
--split test \
--family topology \
--family semantic \
--resamples 1000 \
--print-summaryThe CPE-only run has about 96 logged family chunks instead of the full run's 480. The deterministic diagnostic has the same 480 family chunks but only four deterministic/reference methods and no learned baseline training, so it is much cheaper than the full-roster protocol sweep.
The paper's empirical claims come from two executable study paths. The submitted paper.pdf contains the publication figures; the artifact also regenerates code-produced protocol and intervention-example SVGs for reviewer inspection.
| Paper item | Reproduction source |
|---|---|
| Figure 1, CPE protocol diagram | Included in paper.pdf; a data-free SVG version can be regenerated with scripts/render_paper_outputs.py --protocol-diagram-only. |
| Figure 2, matched intervention pairs | Included in paper.pdf; generated intervention examples are rendered by scripts/render_paper_outputs.py as intervention_examples_2d.svg. |
| Benchmark health summary | Full reproduction run, outputs/releases/mapshift_2d_v0_1_full/study/tables/benchmark_health_summary.json. |
| Full-run family-wise calibration table | Full reproduction run, outputs/releases/mapshift_2d_v0_1_full/study/tables/familywise_main_results.json and paper_outputs/tables/main_familywise_results.md. |
| Full-run protocol sensitivity | Full reproduction run, outputs/releases/mapshift_2d_v0_1_full/study/tables/protocol_sensitivity_and_rank_correlation.json. |
| Severity-response and family stress profiles | Full reproduction run, outputs/releases/mapshift_2d_v0_1_full/study/tables/severity_response.json and paper_outputs/figures/severity_response_curves.svg. |
| Deterministic mechanism diagnostic | Belief-update diagnostic run, outputs/studies/mapshift_2d_belief_update_diagnostic_v0_1/study_bundle.json. |
| Conclusion audit | Combination of the full reproduction run and the belief-update diagnostic run. |
| Baseline metadata and hyperparameters | configs/calibration/*.json, outputs/releases/mapshift_2d_v0_1_full/study/raw/cep_report.json, and paper_outputs/tables/baseline_metadata.md. |
Run the full reproduction path first:
python3 scripts/build_benchmark.py \
--tier mapshift_2d \
--study-config configs/analysis/mapshift_2d_full_study_v0_1.json \
--output-dir outputs/releases/mapshift_2d_v0_1_full \
--print-summaryThen run the deterministic mechanism diagnostic used for the stale-map versus belief-update claims:
python3 scripts/run_mapshift_2d_study.py \
configs/analysis/mapshift_2d_belief_update_diagnostic_v0_1.json \
--print-summaryExpected diagnostic outputs:
outputs/studies/mapshift_2d_belief_update_diagnostic_v0_1/
logs/run_mapshift_2d_study.log
raw/cep_report.json
raw/protocol_comparison_report.json
raw/benchmark_health_report.json
tables/familywise_main_results.json
tables/protocol_sensitivity_and_rank_correlation.json
tables/benchmark_health_summary.json
figures/*.json
manifests/study_manifest.json
study_bundle.json
Inspect the full-run health gates cited in the paper:
python3 - <<'PY'
import json
from pathlib import Path
p = Path("outputs/releases/mapshift_2d_v0_1_full/study/tables/benchmark_health_summary.json")
print(json.dumps(json.loads(p.read_text()), indent=2))
PYInspect the full-run protocol sensitivity table:
python3 - <<'PY'
import json
from pathlib import Path
p = Path("outputs/releases/mapshift_2d_v0_1_full/study/tables/protocol_sensitivity_and_rank_correlation.json")
print(json.dumps(json.loads(p.read_text()), indent=2))
PYInspect the deterministic mechanism diagnostic scores and rank changes:
python3 - <<'PY'
import json
from pathlib import Path
b = json.loads(Path("outputs/studies/mapshift_2d_belief_update_diagnostic_v0_1/study_bundle.json").read_text())
print("=== Proposition support ===")
print(json.dumps(b["proposition_support"], indent=2))
print("\n=== Family-wise CPE scores ===")
rows = b["raw_reports"]["cep_report"]["familywise_summary"]["rows"]
for row in rows:
print(
row["baseline_name"],
row["family"],
round(row["family_primary_score"], 3),
"episodes=", row["episode_count"],
)
print("\n=== Protocol comparisons ===")
for name, comp in b["protocol_sensitivity"]["pairwise_comparisons"].items():
print(name, "tau=", comp.get("kendall_tau"), "best_changes=", comp.get("best_method_changes"))
PYAfter the 24-motif deterministic diagnostic finishes, generate the held-out motif consistency table and paired-bootstrap delta CIs used for the central stale-map versus belief-update claims:
python3 scripts/analyze_mechanism_diagnostic.py \
outputs/studies/mapshift_2d_belief_update_diagnostic_v0_1/study_bundle.json \
--output-dir outputs/studies/mapshift_2d_belief_update_diagnostic_v0_1/mechanism_diagnostic_analysis \
--split test \
--family topology \
--family semantic \
--resamples 1000 \
--print-summaryThis writes:
mechanism_diagnostic_analysis/p2_p3_summary.json
mechanism_diagnostic_analysis/tables/heldout_motif_consistency.json
mechanism_diagnostic_analysis/tables/heldout_motif_consistency.md
mechanism_diagnostic_analysis/tables/heldout_motif_summary.json
mechanism_diagnostic_analysis/tables/heldout_motif_summary.md
mechanism_diagnostic_analysis/tables/paired_delta_bootstrap.json
mechanism_diagnostic_analysis/tables/paired_delta_bootstrap.md
The held-out consistency table reports, per test motif and family, BeliefUpdate - StaleMap under CPE and same-environment evaluation, plus protocol deltas for each method. The paired-bootstrap table reports 95% CIs for BeliefUpdate_CPE - StaleMap_CPE, BeliefUpdate_same_env - StaleMap_same_env, and the protocol-reversal contrast (BeliefUpdate_CPE - StaleMap_CPE) - (BeliefUpdate_same_env - StaleMap_same_env).
The main full-study artifact leaves the original baseline roster unchanged. To add the higher-capacity learned rows without recomputing the older baselines, run CPE-only calibration reports for the 1.14M-parameter pretrained structured graph world model and the large persistent-memory world model. The generated reports contain all severities; the paper table reports the non-identity severity subset for these rows.
On a CUDA host:
export MAPSHIFT_TORCH_DEVICE=cuda:0
export MAPSHIFT_CHECKPOINT_DIR=/tmp/mapshift_pretrained_graph_world_model_1m_v0_1
python3 scripts/generate_calibration_report.py \
configs/benchmark/release_v0_1.json \
--tier mapshift_2d \
--run-config configs/calibration/pretrained_structured_graph_world_model_1m_v0_1.json \
--model-seed 0 \
--samples-per-motif 1 \
--task-samples-per-class 3 \
--output outputs/studies/pretrained_structured_graph_world_model_1m_v0_1/cep_report.json \
--log-file outputs/studies/pretrained_structured_graph_world_model_1m_v0_1/logs/run.log \
--print-summaryThis command trains one global pretrained world model per seed across generated train motifs, evaluates only that learned baseline on the CPE grid, and leaves all previously reported baselines untouched. The evaluator also runs an implicit oracle reference internally to populate oracle fields, so the output includes an oracle row; use the pretrained_structured_graph_world_model rows for the append-only learned-baseline table entry.
Expected output:
outputs/studies/pretrained_structured_graph_world_model_1m_v0_1/
cep_report.json
logs/run.log
Expected scale: 24 motifs x 4 families x 4 severities x 3 task classes x 3 task samples. The learned baseline contributes 3456 all-severity episode records, of which 2592 non-identity severity records are used for the paper row; the implicit oracle contributes 3456 reference records. This is substantially shorter than the full expanded reproduction because it skips the older baselines and protocol-comparison sweeps.
Large persistent-memory capacity run:
export MAPSHIFT_TORCH_DEVICE=cuda:0
export MAPSHIFT_CHECKPOINT_DIR=/tmp/mapshift_persistent_memory_world_model_large_v0_1
python3 scripts/generate_calibration_report.py \
configs/benchmark/release_v0_1.json \
--tier mapshift_2d \
--run-config configs/calibration/persistent_memory_world_model_large_v0_1.json \
--model-seed 0 \
--samples-per-motif 1 \
--task-samples-per-class 3 \
--output outputs/studies/persistent_memory_world_model_large_v0_1/cep_report.json \
--log-file outputs/studies/persistent_memory_world_model_large_v0_1/logs/run.log \
--print-summaryExtract the family-wise scores for the pretrained graph row; for the large persistent-memory row, use outputs/studies/persistent_memory_world_model_large_v0_1/cep_report.json and filter baseline_name == "persistent_memory_world_model":
python3 - <<'PY'
import json
from pathlib import Path
from mapshift.runners.evaluate import EvaluationRecord, _aggregate_metric_rows, _long_horizon_threshold
p = Path("outputs/studies/pretrained_structured_graph_world_model_1m_v0_1/cep_report.json")
payload = json.loads(p.read_text())
records = [
EvaluationRecord(**{**record, "adaptation_curve": tuple(record.get("adaptation_curve", ()))})
for record in payload["records"]
if record["baseline_name"] == "pretrained_structured_graph_world_model" and int(record["severity"]) > 0
]
rows = _aggregate_metric_rows(records, ("protocol_name", "baseline_name", "family"), _long_horizon_threshold(records))
for row in rows:
print(
row["baseline_name"],
row["family"],
"episodes=", row["episode_count"],
"score=", round(row["family_primary_score"], 3),
)
PYInspect trainable parameter count and device metadata:
python3 - <<'PY'
import json
from pathlib import Path
p = Path("outputs/studies/pretrained_structured_graph_world_model_1m_v0_1/cep_report.json")
metadata = json.loads(p.read_text())["baseline_metadata"]["pretrained_structured_graph_world_model"]
print("parameter_count_min:", metadata["parameter_count_min"])
print("parameter_count_max:", metadata["parameter_count_max"])
for run_name, run in sorted(metadata["runs"].items()):
print(run_name, "seed=", run["seed"], "device=", run.get("torch_device_resolved"))
PYThe submitted paper.pdf is built from the author manuscript and is not required to execute the artifact. The commands above regenerate the data products used to fill the paper tables.
After a smoke or full run has produced study/study_bundle.json, regenerate paper-facing Markdown tables and SVG figures without rerunning evaluation:
python3 scripts/render_paper_outputs.py \
outputs/releases/mapshift_2d_v0_1_full/study/study_bundle.json \
--output-dir outputs/releases/mapshift_2d_v0_1_full/paper_outputs \
--print-summaryThis writes:
paper_outputs/tables/main_familywise_results.md
paper_outputs/tables/benchmark_health_summary.md
paper_outputs/tables/baseline_metadata.md
paper_outputs/tables/protocol_sensitivity_summary.md
paper_outputs/figures/cep_protocol_diagram.svg
paper_outputs/figures/intervention_examples_2d.svg
paper_outputs/figures/familywise_main_results.svg
paper_outputs/figures/severity_response_curves.svg
paper_outputs/figures/protocol_rank_reversal_comparison.svg
paper_outputs/paper_outputs_manifest.json
To render only the data-free protocol diagram:
python3 scripts/render_paper_outputs.py \
--protocol-diagram-only \
--output-dir outputs/figure1 \
--print-summaryPrimary version identifiers:
- Package version:
0.1.0inpyproject.toml - Config schema version:
0.1.0inconfigs/**/*.json - Release identity:
mapshift_2d_v0_1inconfigs/benchmark/release_v0_1.json - Primary tier:
mapshift_2d
The build writes a release manifest with a config hash:
outputs/releases/<run_name>/manifests/release_manifest.json
Inspect the manifest:
python3 - <<'PY'
import json
from pathlib import Path
p = Path("outputs/releases/mapshift_2d_v0_1_full/manifests/release_manifest.json")
m = json.loads(p.read_text())
print("artifact_id:", m["artifact_id"])
print("benchmark_version:", m["benchmark_version"])
print("config_hash:", m["config_hash"])
print("tier:", m["metadata"]["tier"])
PYThe main full study uses eight scientific baseline classes. Deterministic reference and classical baselines have no trainable parameters. The high-capacity pretrained structured graph world model and large persistent-memory world model are append-only learned capacity rows that can be evaluated separately with the commands above.
| Baseline | Config | Key hyperparameters |
|---|---|---|
oracle_post_intervention_planner |
configs/calibration/oracle_post_intervention_planner_v0_1.json |
oracle access; no training |
same_environment_upper_baseline |
configs/calibration/same_environment_upper_baseline_v0_1.json |
same-environment reference; no training |
weak_heuristic_baseline |
configs/calibration/weak_heuristic_baseline_v0_1.json |
visited-state heuristic; no training |
classical_belief_update_planner |
configs/calibration/classical_belief_update_planner_v0_1.json |
occupancy-map plus belief update; edge/cost/dynamics/token updates; no training |
monolithic_recurrent_world_model |
configs/calibration/monolithic_recurrent_world_model_v0_1.json |
hidden size 12, observation stride 2, 6 epochs, lr 0.01, max rollout 8 |
persistent_memory_world_model |
configs/calibration/persistent_memory_world_model_v0_1.json; capacity row: configs/calibration/persistent_memory_world_model_large_v0_1.json |
Smoke/full-roster config: 16 memory slots, slot stride 3, readout width 8, 8 epochs, lr 0.01, max rollout 8. Capacity row: 2048 memory slots, readout width 672, 120 epochs, lr 0.0005, patience 20 |
relational_graph_world_model |
configs/calibration/relational_graph_world_model_v0_1.json |
hidden size 10, 2 message-passing steps, 10 epochs, lr 0.01 |
structured_dynamics_world_model |
configs/calibration/structured_dynamics_world_model_v0_1.json |
geometry width 10, dynamics width 6, 10 epochs, lr 0.01 |
pretrained_structured_graph_world_model |
configs/calibration/pretrained_structured_graph_world_model_1m_v0_1.json |
Approximately 1.14M trainable parameters; hidden size 256, 6 message-passing steps, pair width 256, dynamics width 128, 120 epochs, lr 0.0005, batch size 64, 4000 generated train environments, 800 generated validation environments; paper row reports 2592 non-identity evaluation episodes |
Except for pretrained_structured_graph_world_model, learned baselines train separately for each base environment and model seed from the reward-free exploration trace and are not pretrained across environments. The full study expands the original learned baselines over seeds [0, 1, 2, 3, 4] using configs/analysis/mapshift_2d_full_study_v0_1.json; the append-only capacity rows are evaluated with the commands above.
For a faster deterministic diagnostic that isolates stale maps, local heuristics, and explicit belief updates, run:
python3 scripts/run_mapshift_2d_study.py \
configs/analysis/mapshift_2d_belief_update_diagnostic_v0_1.json \
--print-summaryTraining targets are derived from the exploration-time graph representation:
- edge existence
- normalized geometric path cost
- normalized traversal cost
- semantic-token location
The shared learned-baseline loss combines edge binary cross-entropy, geometry mean-squared error, traversal mean-squared error, and token binary cross-entropy. Checkpoints are keyed by baseline name, environment id, model seed, and hyperparameter hash.
Generated episode records are written in:
outputs/releases/<run_name>/study/raw/cep_report.json
outputs/releases/<run_name>/study/raw/protocol_comparison_report.json
In cep_report.json, records are stored at:
records[]
In protocol_comparison_report.json, records are stored at:
protocol_reports.<protocol_name>.records[]
Each record has the schema below. The implementation source is mapshift/runners/evaluate.py::EvaluationRecord.
| Field | Type | Meaning |
|---|---|---|
baseline_name |
string | Scientific baseline class. |
baseline_run_id |
string | Concrete run id, including seed-expanded runs. |
run_name |
string | Alias for the concrete run used in manifests. |
protocol_name |
string | cep, same_environment, no_exploration, short_horizon, or long_horizon. |
family |
string | Intervention family: metric, topology, dynamics, or semantic. |
severity |
integer | Severity level 0, 1, 2, or 3. |
split_name |
string | train, val, or test. |
motif_tag |
string | Structural motif id. |
task_class |
string | planning, inference, or adaptation. |
task_type |
string | Concrete task type sampled within the class. |
environment_id |
string | Evaluated environment id. |
base_environment_id |
string | Pre-intervention environment id. |
task_id |
string | Deterministic task identifier. |
model_seed |
integer | Model/random seed for the baseline run. |
environment_model_seed_id |
string | Bootstrap grouping key combining environment and baseline run. |
task_horizon_steps |
integer | Evaluation horizon after protocol multipliers. |
success |
boolean | Baseline task success. |
solvable |
boolean | Whether the sampled task is oracle-solvable. |
primary_score |
number | Task-normalized primary score for the record. |
observed_length |
number or null | Baseline path length when applicable. |
oracle_length |
number or null | Oracle path length when applicable. |
path_efficiency |
number | Path-efficiency score. |
oracle_gap |
number or null | Gap to oracle path length/score when applicable. |
oracle_success |
boolean | Oracle success on the same task. |
oracle_primary_score |
number | Oracle primary score for the same task. |
predicted_answer |
any | Baseline answer for inference-style tasks. |
expected_answer |
any | Expected answer for inference-style tasks. |
correct |
boolean or null | Whether the answer is correct when applicable. |
adaptation_curve |
array[number] | Recovery/score curve for adaptation tasks. |
metadata |
object | Task- and baseline-specific diagnostic metadata. |
Quick evaluation-record inspection:
python3 - <<'PY'
import json
from pathlib import Path
p = Path("outputs/releases/mapshift_2d_v0_1_full/study/raw/cep_report.json")
report = json.loads(p.read_text())
print("records:", len(report["records"]))
print("first keys:", sorted(report["records"][0]))
print("first record:", json.dumps(report["records"][0], indent=2)[:2000])
PYThe current release uses:
- 96x96 occupancy maps
T_exp=800reward-free exploration steps- 24 structural motifs
- motif-level train/validation/test splits
- four intervention families: metric, topology, dynamics, semantic
- severity levels 0, 1, 2, 3
- planning, inference, and adaptation tasks
- 1000 bootstrap resamples with grouping by
environment_model_seed_id
The 3D-compatible files remain in the repository as prototype/future-work code and are not required for the current release claims. The MiniGrid port in mapshift/envs/minigrid_port/ is an executable adapter for the CPE contract, not a replacement for the primary MapShift-2D release. Use --tier mapshift_2d for validation and artifact building so prototype 3D status does not block the primary artifact.
mapshift/
core/ Config schemas, manifests, logging, registry
envs/map2d/ Map generator, dynamics, renderer
envs/minigrid_port/ Optional MiniGrid-compatible CPE adapter and validators
interventions/ Metric, topology, dynamics, semantic interventions
tasks/ Planning, inference, adaptation task definitions/samplers
baselines/ Baseline API and implementations
runners/ Exploration, evaluation, protocol comparisons
metrics/ Planning/inference/adaptation metrics
analysis/ Study orchestration, bootstrap, ranking, figure data
splits/ Motif splits and leakage checks
configs/
benchmark/ Top-level release config
env2d/ Environment config
interventions/ Family definitions and severity ladders
tasks/ Task classes and sampling
calibration/ Per-baseline run configs
analysis/ Smoke and full study configs
schema/ JSON schemas
scripts/
validate_benchmark.py
generate_benchmark_health.py
build_benchmark.py
render_paper_outputs.py
run_mapshift_2d_study.py
run_protocol_comparison.py
smoke_minigrid_port.py
docs/
README.md Documentation map and reviewer-facing entry points
internal/ Historical planning notes, superseded by this README and configs
CUDA is not used:
python3 - <<'PY'
import torch
print(torch.__version__, torch.cuda.is_available())
PYIf CUDA is available but MapShift uses CPU, set:
export MAPSHIFT_TORCH_DEVICE=cuda:0Checkpoint reuse is confusing:
export MAPSHIFT_CHECKPOINT_DIR=/tmp/mapshift_learned_baselines_new_runBuild appears stalled:
tail -f outputs/releases/mapshift_2d_v0_1_full/logs/build_benchmark.log
grep -c "evaluating family=" outputs/releases/mapshift_2d_v0_1_full/logs/build_benchmark.logLarge output directory:
The generated JSON records and protocol comparison reports are the largest generated artifacts. Generated release directories can be archived after preserving the files needed for reproduction. Learned checkpoints are in MAPSHIFT_CHECKPOINT_DIR or the config default /tmp/mapshift_learned_baselines.
Validation fails due to 3D prototype:
Use the primary tier path:
python3 scripts/validate_benchmark.py --tier mapshift_2d configs/benchmark/release_v0_1.jsonThe MapShift source code is licensed under the MIT License. Generated benchmark outputs, evaluation records, health reports, rendered result tables, and rendered result figures produced by the artifact commands are licensed under CC BY 4.0. Third-party dependencies and tooling are summarized in THIRD_PARTY_LICENSES.md.
Citation metadata is provided in CITATION.cff.