fix: reward scaling, PPO clipping, ELA memory cap, and eval metrics by Grzmro · Pull Request #4 · helix-agh/DynamicAlgorithmSelection

Grzmro · 2026-05-19T09:02:49Z

No description provided.

wniec · 2026-05-19T09:08:59Z

            nn.Linear(64, 1),
-            nn.ReLU(),
+            # No second ReLU: movement vectors are signed displacements.
+            # Clamping to >= 0 discards direction — the network cannot tell


makes sense

wniec · 2026-05-19T09:15:33Z

 ALL_FUNCTIONS = set(range(1, 25))
 INSTANCE_IDS = [1, 2, 3, 4, 5, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80]
-EASY_TRAIN_FUNCTIONS = {4, *range(6, 15), 18, 19, 20, 22, 23, 24}
+EASY_TRAIN_FUNCTIONS = {1, 2, 3, 4, *range(6, 15), 18, 19, 20, 22, 23, 24} # Czy tutaj nie powinny być też funkcje 1,2,3?


"""
BBOB (F1-F24)

Difficulty Mode Training Set Testing Set

easy 4, 6-14, 18-20, 22-24 1, 2, 3, 5, 15, 16, 17, 21

difficult 1, 2, 3, 5, 15, 16, 17, 21 4, 6-14, 18-20, 22-24

"""

EASY_TRAIN_BBOB = {4, *range(6, 15), 18, 19, 20, 22, 23, 24}
ALL_FUNCTIONS = {_ for _ in range(1, 25)}
INSTANCE_IDS = [1, 2, 3, 4, 5, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80]
DIMENSIONS = [2, 3, 5, 10, 20, 40]

this is how it was handled in DynamicAlgorithmSelection 1. Easy refers here to sceanarion, not problem. I think that maybe naming could be changed here but not the easy/hard split. Or maybe we. could entirely omit it and focus only on random splits and CV?

sorry if it looked like I was screaming, accidentally inserted "#"

wniec · 2026-05-19T09:19:08Z

    # random 2/3 – 1/3 split
    all_ids = build_problem_ids(ALL_FUNCTIONS, dims)
-    rng = np.random.default_rng()
+    rng = np.random.default_rng(seed)


set_seed(args.seed) function should handle global seed setting. However I feel that it is good idea to use the same constant seed (0) to split train and test

on the other hand it is in fact change of logis from RLDAS and DAS1, so if it is not necessary now, I would leave it as it was.

Copilot

Pull request overview

This PR updates the DAS training stack to improve reward scaling consistency, PPO stability, ELA feature computation efficiency/memory use, and the usefulness/reproducibility of evaluation metrics across BBOB problems.

Changes:

Adds configurable/cached ELA feature recomputation and caps stored ELA history to avoid large RAM growth.
Tightens PPO behavior (value clipping from first epoch; extra diagnostics logging; minor network/NaN-guard improvements; safer checkpoint loading).
Improves evaluation reporting by switching from raw best_y aggregation to gap-to-optimum metrics and by making final eval use a fresh env.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
train.py	Adds CLI flag to control ELA recomputation frequency.
das/training/rldas.py	Avoids mutating args, improves eval reproducibility, logs gap metrics, ensures results dir exists.
das/training/ppo.py	Passes ELA recomputation config into SB3 env factory config.
das/training/common.py	Threads `ela_recompute_every` through `make_das_env` config.
das/env/reward.py	Fixes first-step reward scaling to avoid out-of-range values for “linear/sparse/binary” rewards.
das/env/observation.py	Improves ELA robustness, reduces action-history feature cost, allows precomputed ELA injection.
das/env/das_env.py	Adds ELA caching + history cap to reduce compute and memory overhead.
das/env/bbob_splits.py	Adds deterministic seeding for random split; modifies easy-train function set.
agents/rl_das/trainer.py	Uses public env accessor, adds PPO diagnostics logging, clarifies bootstrap handling.
agents/rl_das/network.py	Removes invalid ReLU constraint for signed movement embeddings; adds NaN guard to critic.
agents/rl_das/env.py	Exposes `problem_ids` via a public property.
agents/rl_das/agent.py	Applies PPO value clipping from epoch 0; uses `torch.load(..., weights_only=True)` for safer loads.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

wniec · 2026-05-19T18:40:37Z

+def get_train_test_split(mode: str, dims: list[int], seed: int = 0) -> tuple[list[str], list[str]]:
    """Return (train_ids, test_ids) for the given split mode and dimensions.


wniec · 2026-05-19T18:53:00Z

        reward_option: int = 1,
        n_individuals: int = 100,
        seed: int | None = None,
+        ela_recompute_every: int = MAX_HISTORY_SAMPLE // 5 # ~500,


I don't think that caching ELA features is a good solution to reduce computations. Maybe some of them are unnecessary. I think, that better solution would be to omit some of them. I think that less than 10 (5-6) may be predictive. This is a good point to start and check their importances maybe. I'll send you models trained on the new portfolio I got, so we could analyse its behaviour and most important features to leave. Recently I also played with this: https://scikit-learn.org/stable/auto_examples/decomposition/plot_incremental_pca.html - may come to be very useful. Many ELA features are in general correlated. those are deleted here. However in DAS specific case it may be possible that other correlations still appear. IPCA could also tell us which features are correlated. We could incrementally fit it in the rollout filling phase when we also compute std and mean for rewards and state. This is just an idea (maybe crazy). In general I think that throwing out some features will be a better idea than caching them.

wniec · 2026-05-19T19:00:09Z

-        unique_idx = np.sort(unique_idx)
-        x = x[unique_idx][-MAX_HISTORY_SAMPLE:]
-        y = y[unique_idx][-MAX_HISTORY_SAMPLE:]
+        # Slice to the most-recent samples first; deduplication is done below


deduplication is really a thing for ELA features computation to work. Pyflacco breaks if some x appears twice xD. This happens very rarely I think. Maybe in later phases when population collapses into a one point. but I think that an option (alternative to the one mentioned in the other comment) would be implementation of some early stopping mechanism. We detect 0 (or close to 0) variance in population -> we stop with our computation. The issue here might be that with incosistent optimization length, metrics such as AOCC stop making sense. What do you think @WojtAcht ?

wniec · 2026-05-19T19:05:31Z

    return (old_best_y - new_best_y) / (scale + 1e-10)


 def reward_log_scaled(new_best_y, old_best_y, initial_range, is_final=False):


here's the issue with the reward you mentioned earlier. 0.0 in the beginning is certain to be hacked. Agent would choose worst possible optimizer first so the subsequent ones can make the biggest improvement (as nothing is optimized) - this happened to us before. Implementation of logarithm at least forces to choose most exploratory one, so difference between best and worst y is big. This is also going to be hacked, but hopefully in a less painful way. As you mentioned earlier, reward choice is both important and very hard. Maybe it would be nice to add more options for reward?

wniec · 2026-05-19T19:06:09Z

+        # get_portfolio already raises for unknown names; this guard catches an
+        # empty --portfolio list before anything expensive is initialised.
+        raise ValueError(
+            "Portfolio is empty — specify at least one optimizer name via --portfolio."


nice validation

I modified Idea a bit. Now RLDAS should have empty portfolio parameter as it should only use the original paper's one.

wniec · 2026-05-29T18:06:23Z

    """Binary: 1 if improvement >= 0.1%, else 0 (original r4)."""
    if old_best_y == float("inf"):
-        return float(np.log(initial_range[1] - initial_range[0] + 1e-10))
+        return 0.0


this is important change of behaviour. I think further portfolio study will be needed. for now I'll leave the reward as it was before

wniec · 2026-05-29T18:06:48Z

-            [choices_history.count(j) for j in range(n_actions)], dtype=np.float32
-        )
+        # O(n) instead of O(n_actions * n_steps) from calling list.count in a loop.
+        counts = np.bincount(choices_history, minlength=n_actions).astype(np.float32)


… control

…putation control" This reverts commit 2a1c641.

Grzmro requested a review from wniec May 19, 2026 09:03

wniec reviewed May 19, 2026

View reviewed changes

WojtAcht requested a review from Copilot May 19, 2026 17:04

Copilot started reviewing on behalf of WojtAcht May 19, 2026 17:04 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

wniec reviewed May 19, 2026

View reviewed changes

wniec reviewed May 29, 2026

View reviewed changes

wniec force-pushed the code-review branch from e79fe58 to 90d0a49 Compare May 29, 2026 18:12

Grzmro and others added 5 commits May 29, 2026 21:00

fix: reward scaling, PPO clipping, ELA memory cap, and eval metrics

d1028c3

feat: add ela_recompute_every parameter for ELA feature recomputation…

b44e47b

… control

Revert "feat: add ela_recompute_every parameter for ELA feature recom…

e840dba

…putation control" This reverts commit 2a1c641.

refactor

592a7c7

ruff fix

c93efb8

wniec force-pushed the code-review branch from 90d0a49 to c93efb8 Compare May 29, 2026 19:02

wniec merged commit d014d6a into DAS2 May 29, 2026
4 checks passed

Difficulty Mode	Training Set	Testing Set
easy	4, 6-14, 18-20, 22-24	1, 2, 3, 5, 15, 16, 17, 21
difficult	1, 2, 3, 5, 15, 16, 17, 21	4, 6-14, 18-20, 22-24
"""

		def get_train_test_split(mode: str, dims: list[int], seed: int = 0) -> tuple[list[str], list[str]]:
		"""Return (train_ids, test_ids) for the given split mode and dimensions.

		return (old_best_y - new_best_y) / (scale + 1e-10)


		def reward_log_scaled(new_best_y, old_best_y, initial_range, is_final=False):

Conversation

Grzmro commented May 19, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

this is how it was handled in DynamicAlgorithmSelection 1. Easy refers here to sceanarion, not problem. I think that maybe naming could be changed here but not the easy/hard split. Or maybe we. could entirely omit it and focus only on random splits and CV?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants