MapShift: Controlled Post-Intervention Evaluation for Embodied World Models

MapShift is an executable benchmark and evaluation protocol for testing whether environment knowledge acquired during reward-free exploration remains useful after a structured environmental intervention. In the controlled post-intervention evaluation (CPE) protocol, an agent explores a base environment without task reward, the environment is changed along one controlled intervention family, and the agent is evaluated on post-intervention planning, inference, and adaptation tasks.

This repository is the executable code artifact for the current MapShift release. It tracks the environment generator, intervention operators, task samplers, benchmark health checks, baselines, study runner, output schemas, reproduction commands, and paper table/figure renderers. Full evaluation records, health reports, rendered tables, figures, and manifests are generated locally by the commands below rather than committed to the repository.

Artifact Overview

This repository is a self-contained executable artifact. It regenerates benchmark instances, health checks, evaluation records, summary tables, rendered figures, and provenance manifests from versioned source code and configuration files. Generated instances and outputs are produced by the code, not tracked as a separately hosted static dataset in the repository. Output filenames and some config names retain the internal protocol id cep for compatibility; reviewer-facing prose uses the paper terminology CPE.

Artifact component	Location or command
Smoke command	`python3 scripts/build_benchmark.py --tier mapshift_2d --study-config configs/analysis/mapshift_2d_full_study_smoke_v0_1.json --output-dir outputs/releases/mapshift_2d_v0_1_smoke --print-summary`
One-command artifact audit	`python3 scripts/audit_artifact.py --quick --output-dir outputs/audit/mapshift_quick`
Full reproduction command	`python3 scripts/build_benchmark.py --tier mapshift_2d --study-config configs/analysis/mapshift_2d_full_study_v0_1.json --output-dir outputs/releases/mapshift_2d_v0_1_full --print-summary`
Deterministic mechanism diagnostic	`python3 scripts/run_mapshift_2d_study.py configs/analysis/mapshift_2d_belief_update_diagnostic_v0_1.json --print-summary`
MiniGrid port smoke	`python3 scripts/smoke_minigrid_port.py --family topology --severity 2 --motif two_room_door`
MiniGrid CPE sanity-transfer study	`python3 scripts/run_minigrid_cpe_study.py --output-dir outputs/studies/minigrid_cpe_sanity_transfer --print-summary`
High-capacity pretrained graph add-on	`python3 scripts/generate_calibration_report.py configs/benchmark/release_v0_1.json --tier mapshift_2d --run-config configs/calibration/pretrained_structured_graph_world_model_1m_v0_1.json --model-seed 0 --samples-per-motif 1 --task-samples-per-class 3 --output outputs/studies/pretrained_structured_graph_world_model_1m_v0_1/cep_report.json --log-file outputs/studies/pretrained_structured_graph_world_model_1m_v0_1/logs/run.log --print-summary`
Large persistent-memory add-on	`python3 scripts/generate_calibration_report.py configs/benchmark/release_v0_1.json --tier mapshift_2d --run-config configs/calibration/persistent_memory_world_model_large_v0_1.json --model-seed 0 --samples-per-motif 1 --task-samples-per-class 3 --output outputs/studies/persistent_memory_world_model_large_v0_1/cep_report.json --log-file outputs/studies/persistent_memory_world_model_large_v0_1/logs/run.log --print-summary`
Expected runtime	Smoke: minutes on CPU. Deterministic diagnostic: short CPU/GPU run. Full expanded study: single-GPU run recommended; budget roughly 30-36 wall-clock hours on one NVIDIA L4, with faster completion expected on L40S/H100-class GPUs.
CPU/GPU requirements	CPU works for validation and smoke. Full study works on CPU but is intended for a CUDA-capable PyTorch install when available.
Disk usage	Reserve 5GB for generated outputs and checkpoints; reserve more if installing CUDA PyTorch wheels into a fresh environment.
Output paths	`outputs/releases/<run_name>/health`, `study`, `paper_outputs`, `manifests`, and `logs`.
Tables/figures regenerate	`python3 scripts/render_paper_outputs.py <study_bundle.json> --output-dir <paper_outputs> --print-summary`
Version/config hash	Package version `0.1.0`; schema version `0.1.0`; release manifest records `config_hash`.
Dependency installation	`python3 -m pip install -e .` from a fresh Python 3.10+ environment.
Reviewed dependency pins	`requirements-lock.txt`
Baseline hyperparameters	See `configs/calibration/*.json` and the table below.
Evaluation record schema	See the "Evaluation Record Schema" section below.

Installation

Use Python 3.10 or newer. The package dependencies are bounded in pyproject.toml:

numpy>=1.26,<3.0
torch>=2.2,<3.0

CPU install:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e .

Optional MiniGrid port install:

python -m pip install -e ".[minigrid]"
python3 scripts/smoke_minigrid_port.py --instantiate-minigrid

The MiniGrid port is not required for reproducing the paper's primary MapShift-2D tables. It is included as an executable adapter showing how the CPE contract maps onto a richer external simulator API: deterministic matched base/intervened MiniGrid-compatible environments, declared metric/topology/dynamics/semantic interventions, invariant validators, a compact sanity-transfer study, and optional instantiation of an actual minigrid environment when the extra dependency is installed.

Run the compact MiniGrid CPE sanity-transfer study:

python3 scripts/run_minigrid_cpe_study.py \
  --output-dir outputs/studies/minigrid_cpe_sanity_transfer \
  --print-summary

This CPU-only study uses six MiniGrid-style motifs with four deterministic seeds each by default. It reports validator failures, reference solvability, task rejections, weak-baseline saturation, family-wise deterministic probe scores, and a protocol-sensitivity table comparing same-environment evaluation with CPE.

Reviewed-environment pins are provided for reviewers who want a tighter dependency target:

python -m pip install -r requirements-lock.txt

CUDA install:

Install a CUDA-enabled PyTorch build appropriate for the host first if needed, then install MapShift in editable mode:

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e .

Verify the selected Torch device:

python3 - <<'PY'
import torch
print("torch:", torch.__version__)
print("cuda_available:", torch.cuda.is_available())
print("device_count:", torch.cuda.device_count())
print("device_0:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else None)
PY

To force the learned baselines onto a particular device and isolate checkpoints:

export MAPSHIFT_TORCH_DEVICE=cuda:0
export MAPSHIFT_CHECKPOINT_DIR=/tmp/mapshift_learned_baselines_full_v0_1

If MAPSHIFT_TORCH_DEVICE is unset, learned baselines use their config value. The release configs use torch_device: "auto", which selects CUDA only when PyTorch reports that CUDA is available.

Quick Validation

For the artifact runbook, see ARTIFACT_EVALUATION.md.

Validate the release config bundle:

python3 scripts/validate_benchmark.py \
  --tier mapshift_2d \
  configs/benchmark/release_v0_1.json

Generate a benchmark health preflight:

python3 scripts/generate_benchmark_health.py \
  configs/benchmark/release_v0_1.json \
  --samples-per-motif 1 \
  --task-samples-per-class 3 \
  --output outputs/releases/mapshift_2d_v0_1_health_preflight.json

Run tests:

python3 -m unittest discover -s tests -p 'test_*.py'

Run the one-command artifact audit:

python3 scripts/audit_artifact.py \
  --quick \
  --output-dir outputs/audit/mapshift_quick

The audit validates the release config, runs tests, builds the reviewer smoke artifact, verifies required output paths, checks zero fatal leakage and zero validator failures, and confirms paper-facing rendered outputs exist.

Smoke Run

The smoke command builds a small release artifact: config copies, split manifests, benchmark health, a learned-baseline smoke study, JSON tables, SVG figures, and manifests.

python3 scripts/build_benchmark.py \
  --tier mapshift_2d \
  --study-config configs/analysis/mapshift_2d_full_study_smoke_v0_1.json \
  --output-dir outputs/releases/mapshift_2d_v0_1_smoke \
  --print-summary

Expected runtime: a few minutes on a typical laptop or CPU VM. It is intended for artifact sanity checking, not for reproducing paper-scale confidence intervals.

Expected outputs:

outputs/releases/mapshift_2d_v0_1_smoke/
  logs/build_benchmark.log
  health/benchmark_health.json
  study/study_bundle.json
  study/raw/cep_report.json
  study/raw/protocol_comparison_report.json
  study/raw/benchmark_health_report.json
  study/tables/*.json
  study/figures/*.json
  paper_outputs/tables/*.md
  paper_outputs/figures/*.svg
  manifests/release_manifest.json

Full Reproduction Run

The full command regenerates the paper-facing results from the release configs:

python3 scripts/build_benchmark.py \
  --tier mapshift_2d \
  --study-config configs/analysis/mapshift_2d_full_study_v0_1.json \
  --output-dir outputs/releases/mapshift_2d_v0_1_full \
  --print-summary

Recommended hardware:

CPU: enough for validation, health checks, and smoke.
GPU: a single CUDA GPU is recommended for the full study. The reference run was launched on one NVIDIA L4 with 23GB memory.
RAM: 16GB is recommended for the full build because evaluation records are accumulated before the study bundle is written.
Disk: reserve at least 5GB for outputs, logs, and checkpoints. A fresh CUDA PyTorch install may require several additional GB in the virtual environment or pip cache.

Expected runtime:

Smoke: minutes.
Full single-GPU study: overnight-scale to multi-day depending on GPU. The expanded 24-motif release should be budgeted at roughly 30-36 wall-clock hours on one NVIDIA L4, with faster completion expected on L40S/H100-class GPUs. Runtime is dominated by task generation, evaluation loops, bootstrap aggregation, and rendering. The learned training jobs are small.
Full CPU-only study: possible but not recommended for deadline-sensitive reproduction.

Monitor progress:

tail -f outputs/releases/mapshift_2d_v0_1_full/logs/build_benchmark.log

Count evaluation chunks:

grep -c "evaluating family=" outputs/releases/mapshift_2d_v0_1_full/logs/build_benchmark.log

The full config currently runs one primary CPE sweep (cep protocol id) plus four protocol-comparison sweeps. Each sweep has 24 motifs x 4 intervention families = 96 logged family chunks, so a completed run should log roughly 480 evaluating family= lines.

Check completion:

grep -E "Study complete|Study bundle written|Rendered|Release manifest written" \
  outputs/releases/mapshift_2d_v0_1_full/logs/build_benchmark.log

Core result files are written after the long study stage completes:

outputs/releases/mapshift_2d_v0_1_full/study/raw/cep_report.json
outputs/releases/mapshift_2d_v0_1_full/study/raw/protocol_comparison_report.json
outputs/releases/mapshift_2d_v0_1_full/study/tables/familywise_main_results.json
outputs/releases/mapshift_2d_v0_1_full/study/tables/severity_response.json
outputs/releases/mapshift_2d_v0_1_full/study/tables/protocol_sensitivity_and_rank_correlation.json
outputs/releases/mapshift_2d_v0_1_full/paper_outputs/tables/main_familywise_results.md
outputs/releases/mapshift_2d_v0_1_full/paper_outputs/figures/familywise_main_results.svg
outputs/releases/mapshift_2d_v0_1_full/paper_outputs/figures/severity_response_curves.svg
outputs/releases/mapshift_2d_v0_1_full/paper_outputs/figures/protocol_rank_reversal_comparison.svg

Deadline-Friendly Study Runs

The complete mapshift_2d_full_study_v0_1 configuration runs the full baseline roster under five protocol variants. When wall-clock time is constrained, use two grid-aligned runs instead:

A CPE-only full-roster run for the main family-wise results table.
A deterministic full-protocol diagnostic for stale-map, weak-heuristic, and belief-update protocol sensitivity.

This preserves the 24 structural motifs, 10/6/8 split, all four intervention families, all severity levels, and the full task mix, while avoiding five repeated full-roster sweeps.

CPE-only full-roster run:

export MAPSHIFT_TORCH_DEVICE=cuda:0
export MAPSHIFT_CHECKPOINT_DIR=/tmp/mapshift_learned_baselines_cep_only_v0_1

python3 scripts/build_benchmark.py \
  --tier mapshift_2d \
  --study-config configs/analysis/mapshift_2d_main_cep_only_v0_1.json \
  --output-dir outputs/releases/mapshift_2d_v0_1_cep_only \
  --print-summary

Deterministic full-protocol diagnostic:

python3 scripts/run_mapshift_2d_study.py \
  configs/analysis/mapshift_2d_belief_update_diagnostic_full_protocols_v0_1.json \
  --output-dir outputs/studies/mapshift_2d_belief_update_diagnostic_full_protocols_v0_1 \
  --log-file outputs/studies/mapshift_2d_belief_update_diagnostic_full_protocols_v0_1/logs/run.log \
  --print-summary

Then generate held-out motif and paired-bootstrap delta tables:

python3 scripts/analyze_mechanism_diagnostic.py \
  outputs/studies/mapshift_2d_belief_update_diagnostic_full_protocols_v0_1/study_bundle.json \
  --output-dir outputs/studies/mapshift_2d_belief_update_diagnostic_full_protocols_v0_1/mechanism_diagnostic_analysis \
  --split test \
  --family topology \
  --family semantic \
  --resamples 1000 \
  --print-summary

The CPE-only run has about 96 logged family chunks instead of the full run's 480. The deterministic diagnostic has the same 480 family chunks but only four deterministic/reference methods and no learned baseline training, so it is much cheaper than the full-roster protocol sweep.

Reproducing Paper Claims

The paper's empirical claims come from two executable study paths. The submitted paper.pdf contains the publication figures; the artifact also regenerates code-produced protocol and intervention-example SVGs for reviewer inspection.

Paper item	Reproduction source
Figure 1, CPE protocol diagram	Included in `paper.pdf`; a data-free SVG version can be regenerated with `scripts/render_paper_outputs.py --protocol-diagram-only`.
Figure 2, matched intervention pairs	Included in `paper.pdf`; generated intervention examples are rendered by `scripts/render_paper_outputs.py` as `intervention_examples_2d.svg`.
Benchmark health summary	Full reproduction run, `outputs/releases/mapshift_2d_v0_1_full/study/tables/benchmark_health_summary.json`.
Full-run family-wise calibration table	Full reproduction run, `outputs/releases/mapshift_2d_v0_1_full/study/tables/familywise_main_results.json` and `paper_outputs/tables/main_familywise_results.md`.
Full-run protocol sensitivity	Full reproduction run, `outputs/releases/mapshift_2d_v0_1_full/study/tables/protocol_sensitivity_and_rank_correlation.json`.
Severity-response and family stress profiles	Full reproduction run, `outputs/releases/mapshift_2d_v0_1_full/study/tables/severity_response.json` and `paper_outputs/figures/severity_response_curves.svg`.
Deterministic mechanism diagnostic	Belief-update diagnostic run, `outputs/studies/mapshift_2d_belief_update_diagnostic_v0_1/study_bundle.json`.
Conclusion audit	Combination of the full reproduction run and the belief-update diagnostic run.
Baseline metadata and hyperparameters	`configs/calibration/*.json`, `outputs/releases/mapshift_2d_v0_1_full/study/raw/cep_report.json`, and `paper_outputs/tables/baseline_metadata.md`.

Run the full reproduction path first:

python3 scripts/build_benchmark.py \
  --tier mapshift_2d \
  --study-config configs/analysis/mapshift_2d_full_study_v0_1.json \
  --output-dir outputs/releases/mapshift_2d_v0_1_full \
  --print-summary

Then run the deterministic mechanism diagnostic used for the stale-map versus belief-update claims:

python3 scripts/run_mapshift_2d_study.py \
  configs/analysis/mapshift_2d_belief_update_diagnostic_v0_1.json \
  --print-summary

Expected diagnostic outputs:

outputs/studies/mapshift_2d_belief_update_diagnostic_v0_1/
  logs/run_mapshift_2d_study.log
  raw/cep_report.json
  raw/protocol_comparison_report.json
  raw/benchmark_health_report.json
  tables/familywise_main_results.json
  tables/protocol_sensitivity_and_rank_correlation.json
  tables/benchmark_health_summary.json
  figures/*.json
  manifests/study_manifest.json
  study_bundle.json

Inspect the full-run health gates cited in the paper:

python3 - <<'PY'
import json
from pathlib import Path
p = Path("outputs/releases/mapshift_2d_v0_1_full/study/tables/benchmark_health_summary.json")
print(json.dumps(json.loads(p.read_text()), indent=2))
PY

Inspect the full-run protocol sensitivity table:

python3 - <<'PY'
import json
from pathlib import Path
p = Path("outputs/releases/mapshift_2d_v0_1_full/study/tables/protocol_sensitivity_and_rank_correlation.json")
print(json.dumps(json.loads(p.read_text()), indent=2))
PY

Inspect the deterministic mechanism diagnostic scores and rank changes:

python3 - <<'PY'
import json
from pathlib import Path
b = json.loads(Path("outputs/studies/mapshift_2d_belief_update_diagnostic_v0_1/study_bundle.json").read_text())

print("=== Proposition support ===")
print(json.dumps(b["proposition_support"], indent=2))

print("\n=== Family-wise CPE scores ===")
rows = b["raw_reports"]["cep_report"]["familywise_summary"]["rows"]
for row in rows:
    print(
        row["baseline_name"],
        row["family"],
        round(row["family_primary_score"], 3),
        "episodes=", row["episode_count"],
    )

print("\n=== Protocol comparisons ===")
for name, comp in b["protocol_sensitivity"]["pairwise_comparisons"].items():
    print(name, "tau=", comp.get("kendall_tau"), "best_changes=", comp.get("best_method_changes"))
PY

After the 24-motif deterministic diagnostic finishes, generate the held-out motif consistency table and paired-bootstrap delta CIs used for the central stale-map versus belief-update claims:

python3 scripts/analyze_mechanism_diagnostic.py \
  outputs/studies/mapshift_2d_belief_update_diagnostic_v0_1/study_bundle.json \
  --output-dir outputs/studies/mapshift_2d_belief_update_diagnostic_v0_1/mechanism_diagnostic_analysis \
  --split test \
  --family topology \
  --family semantic \
  --resamples 1000 \
  --print-summary

This writes:

mechanism_diagnostic_analysis/p2_p3_summary.json
mechanism_diagnostic_analysis/tables/heldout_motif_consistency.json
mechanism_diagnostic_analysis/tables/heldout_motif_consistency.md
mechanism_diagnostic_analysis/tables/heldout_motif_summary.json
mechanism_diagnostic_analysis/tables/heldout_motif_summary.md
mechanism_diagnostic_analysis/tables/paired_delta_bootstrap.json
mechanism_diagnostic_analysis/tables/paired_delta_bootstrap.md

The held-out consistency table reports, per test motif and family, BeliefUpdate - StaleMap under CPE and same-environment evaluation, plus protocol deltas for each method. The paired-bootstrap table reports 95% CIs for BeliefUpdate_CPE - StaleMap_CPE, BeliefUpdate_same_env - StaleMap_same_env, and the protocol-reversal contrast (BeliefUpdate_CPE - StaleMap_CPE) - (BeliefUpdate_same_env - StaleMap_same_env).

Capacity Add-On Runs

The main full-study artifact leaves the original baseline roster unchanged. To add the higher-capacity learned rows without recomputing the older baselines, run CPE-only calibration reports for the 1.14M-parameter pretrained structured graph world model and the large persistent-memory world model. The generated reports contain all severities; the paper table reports the non-identity severity subset for these rows.

On a CUDA host:

export MAPSHIFT_TORCH_DEVICE=cuda:0
export MAPSHIFT_CHECKPOINT_DIR=/tmp/mapshift_pretrained_graph_world_model_1m_v0_1

python3 scripts/generate_calibration_report.py \
  configs/benchmark/release_v0_1.json \
  --tier mapshift_2d \
  --run-config configs/calibration/pretrained_structured_graph_world_model_1m_v0_1.json \
  --model-seed 0 \
  --samples-per-motif 1 \
  --task-samples-per-class 3 \
  --output outputs/studies/pretrained_structured_graph_world_model_1m_v0_1/cep_report.json \
  --log-file outputs/studies/pretrained_structured_graph_world_model_1m_v0_1/logs/run.log \
  --print-summary

This command trains one global pretrained world model per seed across generated train motifs, evaluates only that learned baseline on the CPE grid, and leaves all previously reported baselines untouched. The evaluator also runs an implicit oracle reference internally to populate oracle fields, so the output includes an oracle row; use the pretrained_structured_graph_world_model rows for the append-only learned-baseline table entry.

Expected output:

outputs/studies/pretrained_structured_graph_world_model_1m_v0_1/
  cep_report.json
  logs/run.log

Expected scale: 24 motifs x 4 families x 4 severities x 3 task classes x 3 task samples. The learned baseline contributes 3456 all-severity episode records, of which 2592 non-identity severity records are used for the paper row; the implicit oracle contributes 3456 reference records. This is substantially shorter than the full expanded reproduction because it skips the older baselines and protocol-comparison sweeps.

Large persistent-memory capacity run:

export MAPSHIFT_TORCH_DEVICE=cuda:0
export MAPSHIFT_CHECKPOINT_DIR=/tmp/mapshift_persistent_memory_world_model_large_v0_1

python3 scripts/generate_calibration_report.py \
  configs/benchmark/release_v0_1.json \
  --tier mapshift_2d \
  --run-config configs/calibration/persistent_memory_world_model_large_v0_1.json \
  --model-seed 0 \
  --samples-per-motif 1 \
  --task-samples-per-class 3 \
  --output outputs/studies/persistent_memory_world_model_large_v0_1/cep_report.json \
  --log-file outputs/studies/persistent_memory_world_model_large_v0_1/logs/run.log \
  --print-summary

Extract the family-wise scores for the pretrained graph row; for the large persistent-memory row, use outputs/studies/persistent_memory_world_model_large_v0_1/cep_report.json and filter baseline_name == "persistent_memory_world_model":

python3 - <<'PY'
import json
from pathlib import Path

from mapshift.runners.evaluate import EvaluationRecord, _aggregate_metric_rows, _long_horizon_threshold

p = Path("outputs/studies/pretrained_structured_graph_world_model_1m_v0_1/cep_report.json")
payload = json.loads(p.read_text())
records = [
    EvaluationRecord(**{**record, "adaptation_curve": tuple(record.get("adaptation_curve", ()))})
    for record in payload["records"]
    if record["baseline_name"] == "pretrained_structured_graph_world_model" and int(record["severity"]) > 0
]
rows = _aggregate_metric_rows(records, ("protocol_name", "baseline_name", "family"), _long_horizon_threshold(records))
for row in rows:
    print(
        row["baseline_name"],
        row["family"],
        "episodes=", row["episode_count"],
        "score=", round(row["family_primary_score"], 3),
    )
PY

Inspect trainable parameter count and device metadata:

python3 - <<'PY'
import json
from pathlib import Path

p = Path("outputs/studies/pretrained_structured_graph_world_model_1m_v0_1/cep_report.json")
metadata = json.loads(p.read_text())["baseline_metadata"]["pretrained_structured_graph_world_model"]
print("parameter_count_min:", metadata["parameter_count_min"])
print("parameter_count_max:", metadata["parameter_count_max"])
for run_name, run in sorted(metadata["runs"].items()):
    print(run_name, "seed=", run["seed"], "device=", run.get("torch_device_resolved"))
PY

The submitted paper.pdf is built from the author manuscript and is not required to execute the artifact. The commands above regenerate the data products used to fill the paper tables.

Regenerating Tables and Figures

After a smoke or full run has produced study/study_bundle.json, regenerate paper-facing Markdown tables and SVG figures without rerunning evaluation:

python3 scripts/render_paper_outputs.py \
  outputs/releases/mapshift_2d_v0_1_full/study/study_bundle.json \
  --output-dir outputs/releases/mapshift_2d_v0_1_full/paper_outputs \
  --print-summary

This writes:

paper_outputs/tables/main_familywise_results.md
paper_outputs/tables/benchmark_health_summary.md
paper_outputs/tables/baseline_metadata.md
paper_outputs/tables/protocol_sensitivity_summary.md
paper_outputs/figures/cep_protocol_diagram.svg
paper_outputs/figures/intervention_examples_2d.svg
paper_outputs/figures/familywise_main_results.svg
paper_outputs/figures/severity_response_curves.svg
paper_outputs/figures/protocol_rank_reversal_comparison.svg
paper_outputs/paper_outputs_manifest.json

To render only the data-free protocol diagram:

python3 scripts/render_paper_outputs.py \
  --protocol-diagram-only \
  --output-dir outputs/figure1 \
  --print-summary

Release Version and Config Hash

Primary version identifiers:

Package version: 0.1.0 in pyproject.toml
Config schema version: 0.1.0 in configs/**/*.json
Release identity: mapshift_2d_v0_1 in configs/benchmark/release_v0_1.json
Primary tier: mapshift_2d

The build writes a release manifest with a config hash:

outputs/releases/<run_name>/manifests/release_manifest.json

Inspect the manifest:

python3 - <<'PY'
import json
from pathlib import Path
p = Path("outputs/releases/mapshift_2d_v0_1_full/manifests/release_manifest.json")
m = json.loads(p.read_text())
print("artifact_id:", m["artifact_id"])
print("benchmark_version:", m["benchmark_version"])
print("config_hash:", m["config_hash"])
print("tier:", m["metadata"]["tier"])
PY

Baselines and Hyperparameters

The main full study uses eight scientific baseline classes. Deterministic reference and classical baselines have no trainable parameters. The high-capacity pretrained structured graph world model and large persistent-memory world model are append-only learned capacity rows that can be evaluated separately with the commands above.

Baseline	Config	Key hyperparameters
`oracle_post_intervention_planner`	`configs/calibration/oracle_post_intervention_planner_v0_1.json`	oracle access; no training
`same_environment_upper_baseline`	`configs/calibration/same_environment_upper_baseline_v0_1.json`	same-environment reference; no training
`weak_heuristic_baseline`	`configs/calibration/weak_heuristic_baseline_v0_1.json`	visited-state heuristic; no training
`classical_belief_update_planner`	`configs/calibration/classical_belief_update_planner_v0_1.json`	occupancy-map plus belief update; edge/cost/dynamics/token updates; no training
`monolithic_recurrent_world_model`	`configs/calibration/monolithic_recurrent_world_model_v0_1.json`	hidden size 12, observation stride 2, 6 epochs, lr 0.01, max rollout 8
`persistent_memory_world_model`	`configs/calibration/persistent_memory_world_model_v0_1.json`; capacity row: `configs/calibration/persistent_memory_world_model_large_v0_1.json`	Smoke/full-roster config: 16 memory slots, slot stride 3, readout width 8, 8 epochs, lr 0.01, max rollout 8. Capacity row: 2048 memory slots, readout width 672, 120 epochs, lr 0.0005, patience 20
`relational_graph_world_model`	`configs/calibration/relational_graph_world_model_v0_1.json`	hidden size 10, 2 message-passing steps, 10 epochs, lr 0.01
`structured_dynamics_world_model`	`configs/calibration/structured_dynamics_world_model_v0_1.json`	geometry width 10, dynamics width 6, 10 epochs, lr 0.01
`pretrained_structured_graph_world_model`	`configs/calibration/pretrained_structured_graph_world_model_1m_v0_1.json`	Approximately 1.14M trainable parameters; hidden size 256, 6 message-passing steps, pair width 256, dynamics width 128, 120 epochs, lr 0.0005, batch size 64, 4000 generated train environments, 800 generated validation environments; paper row reports 2592 non-identity evaluation episodes

Except for pretrained_structured_graph_world_model, learned baselines train separately for each base environment and model seed from the reward-free exploration trace and are not pretrained across environments. The full study expands the original learned baselines over seeds [0, 1, 2, 3, 4] using configs/analysis/mapshift_2d_full_study_v0_1.json; the append-only capacity rows are evaluated with the commands above.

For a faster deterministic diagnostic that isolates stale maps, local heuristics, and explicit belief updates, run:

python3 scripts/run_mapshift_2d_study.py \
  configs/analysis/mapshift_2d_belief_update_diagnostic_v0_1.json \
  --print-summary

Training targets are derived from the exploration-time graph representation:

edge existence
normalized geometric path cost
normalized traversal cost
semantic-token location

The shared learned-baseline loss combines edge binary cross-entropy, geometry mean-squared error, traversal mean-squared error, and token binary cross-entropy. Checkpoints are keyed by baseline name, environment id, model seed, and hyperparameter hash.

Evaluation Record Schema

Generated episode records are written in:

outputs/releases/<run_name>/study/raw/cep_report.json
outputs/releases/<run_name>/study/raw/protocol_comparison_report.json

In cep_report.json, records are stored at:

records[]

In protocol_comparison_report.json, records are stored at:

protocol_reports.<protocol_name>.records[]

Each record has the schema below. The implementation source is mapshift/runners/evaluate.py::EvaluationRecord.

Field	Type	Meaning
`baseline_name`	string	Scientific baseline class.
`baseline_run_id`	string	Concrete run id, including seed-expanded runs.
`run_name`	string	Alias for the concrete run used in manifests.
`protocol_name`	string	`cep`, `same_environment`, `no_exploration`, `short_horizon`, or `long_horizon`.
`family`	string	Intervention family: `metric`, `topology`, `dynamics`, or `semantic`.
`severity`	integer	Severity level 0, 1, 2, or 3.
`split_name`	string	`train`, `val`, or `test`.
`motif_tag`	string	Structural motif id.
`task_class`	string	`planning`, `inference`, or `adaptation`.
`task_type`	string	Concrete task type sampled within the class.
`environment_id`	string	Evaluated environment id.
`base_environment_id`	string	Pre-intervention environment id.
`task_id`	string	Deterministic task identifier.
`model_seed`	integer	Model/random seed for the baseline run.
`environment_model_seed_id`	string	Bootstrap grouping key combining environment and baseline run.
`task_horizon_steps`	integer	Evaluation horizon after protocol multipliers.
`success`	boolean	Baseline task success.
`solvable`	boolean	Whether the sampled task is oracle-solvable.
`primary_score`	number	Task-normalized primary score for the record.
`observed_length`	number or null	Baseline path length when applicable.
`oracle_length`	number or null	Oracle path length when applicable.
`path_efficiency`	number	Path-efficiency score.
`oracle_gap`	number or null	Gap to oracle path length/score when applicable.
`oracle_success`	boolean	Oracle success on the same task.
`oracle_primary_score`	number	Oracle primary score for the same task.
`predicted_answer`	any	Baseline answer for inference-style tasks.
`expected_answer`	any	Expected answer for inference-style tasks.
`correct`	boolean or null	Whether the answer is correct when applicable.
`adaptation_curve`	array[number]	Recovery/score curve for adaptation tasks.
`metadata`	object	Task- and baseline-specific diagnostic metadata.

Quick evaluation-record inspection:

python3 - <<'PY'
import json
from pathlib import Path
p = Path("outputs/releases/mapshift_2d_v0_1_full/study/raw/cep_report.json")
report = json.loads(p.read_text())
print("records:", len(report["records"]))
print("first keys:", sorted(report["records"][0]))
print("first record:", json.dumps(report["records"][0], indent=2)[:2000])
PY

Benchmark Design

The current release uses:

96x96 occupancy maps
T_exp=800 reward-free exploration steps
24 structural motifs
motif-level train/validation/test splits
four intervention families: metric, topology, dynamics, semantic
severity levels 0, 1, 2, 3
planning, inference, and adaptation tasks
1000 bootstrap resamples with grouping by environment_model_seed_id

The 3D-compatible files remain in the repository as prototype/future-work code and are not required for the current release claims. The MiniGrid port in mapshift/envs/minigrid_port/ is an executable adapter for the CPE contract, not a replacement for the primary MapShift-2D release. Use --tier mapshift_2d for validation and artifact building so prototype 3D status does not block the primary artifact.

Repository Structure

mapshift/
  core/              Config schemas, manifests, logging, registry
  envs/map2d/        Map generator, dynamics, renderer
  envs/minigrid_port/ Optional MiniGrid-compatible CPE adapter and validators
  interventions/     Metric, topology, dynamics, semantic interventions
  tasks/             Planning, inference, adaptation task definitions/samplers
  baselines/         Baseline API and implementations
  runners/           Exploration, evaluation, protocol comparisons
  metrics/           Planning/inference/adaptation metrics
  analysis/          Study orchestration, bootstrap, ranking, figure data
  splits/            Motif splits and leakage checks

configs/
  benchmark/         Top-level release config
  env2d/             Environment config
  interventions/     Family definitions and severity ladders
  tasks/             Task classes and sampling
  calibration/       Per-baseline run configs
  analysis/          Smoke and full study configs
  schema/            JSON schemas

scripts/
  validate_benchmark.py
  generate_benchmark_health.py
  build_benchmark.py
  render_paper_outputs.py
  run_mapshift_2d_study.py
  run_protocol_comparison.py
  smoke_minigrid_port.py

docs/
  README.md          Documentation map and reviewer-facing entry points
  internal/          Historical planning notes, superseded by this README and configs

Troubleshooting

CUDA is not used:

python3 - <<'PY'
import torch
print(torch.__version__, torch.cuda.is_available())
PY

If CUDA is available but MapShift uses CPU, set:

export MAPSHIFT_TORCH_DEVICE=cuda:0

Checkpoint reuse is confusing:

export MAPSHIFT_CHECKPOINT_DIR=/tmp/mapshift_learned_baselines_new_run

Build appears stalled:

tail -f outputs/releases/mapshift_2d_v0_1_full/logs/build_benchmark.log
grep -c "evaluating family=" outputs/releases/mapshift_2d_v0_1_full/logs/build_benchmark.log

Large output directory:

The generated JSON records and protocol comparison reports are the largest generated artifacts. Generated release directories can be archived after preserving the files needed for reproduction. Learned checkpoints are in MAPSHIFT_CHECKPOINT_DIR or the config default /tmp/mapshift_learned_baselines.

Validation fails due to 3D prototype:

Use the primary tier path:

python3 scripts/validate_benchmark.py --tier mapshift_2d configs/benchmark/release_v0_1.json

License

The MapShift source code is licensed under the MIT License. Generated benchmark outputs, evaluation records, health reports, rendered result tables, and rendered result figures produced by the artifact commands are licensed under CC BY 4.0. Third-party dependencies and tooling are summarized in THIRD_PARTY_LICENSES.md.

Citation metadata is provided in CITATION.cff.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MapShift: Controlled Post-Intervention Evaluation for Embodied World Models

Artifact Overview

Installation

Quick Validation

Smoke Run

Full Reproduction Run

Deadline-Friendly Study Runs

Reproducing Paper Claims

Capacity Add-On Runs

Regenerating Tables and Figures

Release Version and Config Hash

Baselines and Hyperparameters

Evaluation Record Schema

Benchmark Design

Repository Structure

Troubleshooting

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 57 Commits
.github/workflows		.github/workflows
configs		configs
docs		docs
mapshift		mapshift
outputs		outputs
scripts		scripts
tests		tests
.gitignore		.gitignore
ARTIFACT_EVALUATION.md		ARTIFACT_EVALUATION.md
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
THIRD_PARTY_LICENSES.md		THIRD_PARTY_LICENSES.md
paper.pdf		paper.pdf
pyproject.toml		pyproject.toml
requirements-lock.txt		requirements-lock.txt

Folders and files

Latest commit

History

Repository files navigation

MapShift: Controlled Post-Intervention Evaluation for Embodied World Models

Artifact Overview

Installation

Quick Validation

Smoke Run

Full Reproduction Run

Deadline-Friendly Study Runs

Reproducing Paper Claims

Capacity Add-On Runs

Regenerating Tables and Figures

Release Version and Config Hash

Baselines and Hyperparameters

Evaluation Record Schema

Benchmark Design

Repository Structure

Troubleshooting

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages