fix: reward scaling, PPO clipping, ELA memory cap, and eval metrics#4
Conversation
| nn.Linear(64, 1), | ||
| nn.ReLU(), | ||
| # No second ReLU: movement vectors are signed displacements. | ||
| # Clamping to >= 0 discards direction — the network cannot tell |
| ALL_FUNCTIONS = set(range(1, 25)) | ||
| INSTANCE_IDS = [1, 2, 3, 4, 5, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80] | ||
| EASY_TRAIN_FUNCTIONS = {4, *range(6, 15), 18, 19, 20, 22, 23, 24} | ||
| EASY_TRAIN_FUNCTIONS = {1, 2, 3, 4, *range(6, 15), 18, 19, 20, 22, 23, 24} # Czy tutaj nie powinny być też funkcje 1,2,3? |
There was a problem hiding this comment.
"""
BBOB (F1-F24)
| Difficulty Mode | Training Set | Testing Set |
|---|---|---|
| easy | 4, 6-14, 18-20, 22-24 | 1, 2, 3, 5, 15, 16, 17, 21 |
| difficult | 1, 2, 3, 5, 15, 16, 17, 21 | 4, 6-14, 18-20, 22-24 |
| """ |
EASY_TRAIN_BBOB = {4, *range(6, 15), 18, 19, 20, 22, 23, 24}
ALL_FUNCTIONS = {_ for _ in range(1, 25)}
INSTANCE_IDS = [1, 2, 3, 4, 5, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80]
DIMENSIONS = [2, 3, 5, 10, 20, 40]
this is how it was handled in DynamicAlgorithmSelection 1. Easy refers here to sceanarion, not problem. I think that maybe naming could be changed here but not the easy/hard split. Or maybe we. could entirely omit it and focus only on random splits and CV?
There was a problem hiding this comment.
sorry if it looked like I was screaming, accidentally inserted "#"
| # random 2/3 – 1/3 split | ||
| all_ids = build_problem_ids(ALL_FUNCTIONS, dims) | ||
| rng = np.random.default_rng() | ||
| rng = np.random.default_rng(seed) |
There was a problem hiding this comment.
set_seed(args.seed) function should handle global seed setting. However I feel that it is good idea to use the same constant seed (0) to split train and test
There was a problem hiding this comment.
on the other hand it is in fact change of logis from RLDAS and DAS1, so if it is not necessary now, I would leave it as it was.
There was a problem hiding this comment.
Pull request overview
This PR updates the DAS training stack to improve reward scaling consistency, PPO stability, ELA feature computation efficiency/memory use, and the usefulness/reproducibility of evaluation metrics across BBOB problems.
Changes:
- Adds configurable/cached ELA feature recomputation and caps stored ELA history to avoid large RAM growth.
- Tightens PPO behavior (value clipping from first epoch; extra diagnostics logging; minor network/NaN-guard improvements; safer checkpoint loading).
- Improves evaluation reporting by switching from raw
best_yaggregation to gap-to-optimum metrics and by making final eval use a fresh env.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| train.py | Adds CLI flag to control ELA recomputation frequency. |
| das/training/rldas.py | Avoids mutating args, improves eval reproducibility, logs gap metrics, ensures results dir exists. |
| das/training/ppo.py | Passes ELA recomputation config into SB3 env factory config. |
| das/training/common.py | Threads ela_recompute_every through make_das_env config. |
| das/env/reward.py | Fixes first-step reward scaling to avoid out-of-range values for “linear/sparse/binary” rewards. |
| das/env/observation.py | Improves ELA robustness, reduces action-history feature cost, allows precomputed ELA injection. |
| das/env/das_env.py | Adds ELA caching + history cap to reduce compute and memory overhead. |
| das/env/bbob_splits.py | Adds deterministic seeding for random split; modifies easy-train function set. |
| agents/rl_das/trainer.py | Uses public env accessor, adds PPO diagnostics logging, clarifies bootstrap handling. |
| agents/rl_das/network.py | Removes invalid ReLU constraint for signed movement embeddings; adds NaN guard to critic. |
| agents/rl_das/env.py | Exposes problem_ids via a public property. |
| agents/rl_das/agent.py | Applies PPO value clipping from epoch 0; uses torch.load(..., weights_only=True) for safer loads. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def get_train_test_split(mode: str, dims: list[int], seed: int = 0) -> tuple[list[str], list[str]]: | ||
| """Return (train_ids, test_ids) for the given split mode and dimensions. |
| reward_option: int = 1, | ||
| n_individuals: int = 100, | ||
| seed: int | None = None, | ||
| ela_recompute_every: int = MAX_HISTORY_SAMPLE // 5 # ~500, |
There was a problem hiding this comment.
I don't think that caching ELA features is a good solution to reduce computations. Maybe some of them are unnecessary. I think, that better solution would be to omit some of them. I think that less than 10 (5-6) may be predictive. This is a good point to start and check their importances maybe. I'll send you models trained on the new portfolio I got, so we could analyse its behaviour and most important features to leave. Recently I also played with this: https://scikit-learn.org/stable/auto_examples/decomposition/plot_incremental_pca.html - may come to be very useful. Many ELA features are in general correlated. those are deleted here. However in DAS specific case it may be possible that other correlations still appear. IPCA could also tell us which features are correlated. We could incrementally fit it in the rollout filling phase when we also compute std and mean for rewards and state. This is just an idea (maybe crazy). In general I think that throwing out some features will be a better idea than caching them.
| unique_idx = np.sort(unique_idx) | ||
| x = x[unique_idx][-MAX_HISTORY_SAMPLE:] | ||
| y = y[unique_idx][-MAX_HISTORY_SAMPLE:] | ||
| # Slice to the most-recent samples first; deduplication is done below |
There was a problem hiding this comment.
deduplication is really a thing for ELA features computation to work. Pyflacco breaks if some x appears twice xD. This happens very rarely I think. Maybe in later phases when population collapses into a one point. but I think that an option (alternative to the one mentioned in the other comment) would be implementation of some early stopping mechanism. We detect 0 (or close to 0) variance in population -> we stop with our computation. The issue here might be that with incosistent optimization length, metrics such as AOCC stop making sense. What do you think @WojtAcht ?
| return (old_best_y - new_best_y) / (scale + 1e-10) | ||
|
|
||
|
|
||
| def reward_log_scaled(new_best_y, old_best_y, initial_range, is_final=False): |
There was a problem hiding this comment.
here's the issue with the reward you mentioned earlier. 0.0 in the beginning is certain to be hacked. Agent would choose worst possible optimizer first so the subsequent ones can make the biggest improvement (as nothing is optimized) - this happened to us before. Implementation of logarithm at least forces to choose most exploratory one, so difference between best and worst y is big. This is also going to be hacked, but hopefully in a less painful way. As you mentioned earlier, reward choice is both important and very hard. Maybe it would be nice to add more options for reward?
| # get_portfolio already raises for unknown names; this guard catches an | ||
| # empty --portfolio list before anything expensive is initialised. | ||
| raise ValueError( | ||
| "Portfolio is empty — specify at least one optimizer name via --portfolio." |
There was a problem hiding this comment.
I modified Idea a bit. Now RLDAS should have empty portfolio parameter as it should only use the original paper's one.
| """Binary: 1 if improvement >= 0.1%, else 0 (original r4).""" | ||
| if old_best_y == float("inf"): | ||
| return float(np.log(initial_range[1] - initial_range[0] + 1e-10)) | ||
| return 0.0 |
There was a problem hiding this comment.
this is important change of behaviour. I think further portfolio study will be needed. for now I'll leave the reward as it was before
| [choices_history.count(j) for j in range(n_actions)], dtype=np.float32 | ||
| ) | ||
| # O(n) instead of O(n_actions * n_steps) from calling list.count in a loop. | ||
| counts = np.bincount(choices_history, minlength=n_actions).astype(np.float32) |
…putation control" This reverts commit 2a1c641.
No description provided.