feat : deepseek r1 accuracy support (release/v0.5) (#351) by arekay-nv · Pull Request #378 · mlcommons/endpoints

arekay-nv · 2026-06-27T02:05:27Z

Cherry-pick DSR1 support back from the release v0.5 branch which has been thoroughly tested.

Adds end-to-end MLPerf DeepSeek-R1 combined-subset accuracy evaluation. DeepSeek-R1's accuracy set is an ensemble of five subsets (math500, aime, gpqa, mmlu_pro, livecodebench), each parsed and graded differently. This PR wires up dataset loading, the perf/accuracy phases, and a new scorer that produces a single headline exact_match on the same 0–100 scale as the MLPerf golden number (81.3582).
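To make the 0 to 100 scale concrete, here is a minimal sketch of how per-subset counts could collapse into one headline number. The PR text does not state the aggregation rule, so the micro-average (total correct over total graded) and the name `headline_exact_match` are assumptions for illustration, not the scorer's actual code.

```python
from typing import Mapping, Tuple


def headline_exact_match(counts: Mapping[str, Tuple[int, int]]) -> float:
    """Return a 0-100 score from per-subset (correct, total) pairs.

    Assumption: the headline is a micro-average over all graded samples.
    """
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    if total == 0:
        raise ValueError("no graded samples")
    return 100.0 * correct / total


# Illustrative counts only; these are not results from the PR.
example = {"math500": (1, 2), "livecodebench": (3, 6)}
print(headline_exact_match(example))
```

Under this assumption a subset with more samples carries more weight than a small one, which is what makes the denominator matter when a subset goes unscored.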

What's included

| Area | Change |
| --- | --- |
| Scorer | `DeepSeekR1Scorer` (`scorer_id="deepseek_r1"`) in `evaluation/scoring.py` |
| Isolated evaluator | New uv subproject `evaluation/deepseek_r1/` (runner, `pyproject.toml`, `uv.lock`, `setup_eval.sh`, `RUNBOOK.md`); excluded from the parent wheel |
| Dataset | `DeepSeekR1` loader (`dataset_id="deepseek_r1"`) in `dataset_manager/predefined/deepseek_r1/` + registration in `dataset_manager/__init__.py` |
| Plumbing | `commands/benchmark/execute.py`: per-scorer `complete` flag plumbed into results; clearer error on bad `accuracy_config.extras`; configurable service-ready timeout |
| Shared client | `_lcb_ws_evaluate()` LiveCodeBench WebSocket client, shared by `LiveCodeBenchScorer` and `DeepSeekR1Scorer` |
| Docs/CI/tests | `AGENTS.md`, CI workflow tweaks, `transforms.py`, and new unit tests for the dataset + scorer |

Design notes

- Out-of-process evaluation. The official MLCommons `eval_accuracy.py` pulls in heavy, conflicting pins (transformers, the prm800k math grader, the LiveCodeBench executor). Following the existing `VBenchScorer` pattern, the evaluator runs via `uv run --project` against the isolated subproject; the parent process never imports it. Path resolution is explicit arg → `$DEEPSEEK_EVAL_PROJECT_PATH` → editable-checkout fallback.
- Pre-tokenized prompt. The dataset emits an `input_tokens` column (MLPerf `tok_input`) so the `openai_completions` adapter's `Harmonize()` is a no-op and the server chat template is bypassed: the model sees the exact MLPerf prompt. One loader serves both the perf phase (issues `input_tokens`) and the accuracy phase (grades by `dataset`/`ground_truth`).
- LiveCodeBench sandboxing. LCB executes untrusted model code, which the in-process MLCommons executor can't sandbox. When a port is configured, the livecodebench subset is graded out-of-band against the lcb-service WebSocket container; text subsets go through the subprocess; results are merged into one 5-subset number.
- Partial-result honesty. A new `Scorer.complete` flag (default `True`) is set `False` whenever the headline can't cover every issued sample: a failed text subset, a diverging LCB count, or an unreachable lcb-service container (LCB left unscored rather than silently shrinking the denominator). The flag is surfaced in results and logs so a partial number is never mistaken for a complete one.
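The completeness rule in the last note can be sketched as a small predicate. This is a hedged illustration only: `is_complete`, `issued`, and `graded` are hypothetical names, and the real `Scorer.complete` flag may be computed differently.

```python
from typing import Dict, Optional, Tuple

# (passed, total) for a graded subset, or None when it could not be scored
# (for example a failed text subset or an unreachable grading service).
SubsetResult = Optional[Tuple[int, int]]


def is_complete(issued: Dict[str, int], graded: Dict[str, SubsetResult]) -> bool:
    """True only if every issued sample in every subset was graded."""
    for subset, n_issued in issued.items():
        result = graded.get(subset)
        if result is None:
            return False  # subset missing or left unscored
        _passed, total = result
        if total != n_issued:
            return False  # graded count diverges from issued count
    return True


issued = {"math500": 3, "livecodebench": 2}
print(is_complete(issued, {"math500": (2, 3), "livecodebench": (1, 2)}))
print(is_complete(issued, {"math500": (2, 3), "livecodebench": None}))
```

Keeping the check separate from the score itself means a caller can still report a partial number while marking it as partial.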

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

github-actions · 2026-06-27T02:05:35Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces support for the MLPerf DeepSeek-R1 combined-subset accuracy evaluation. It adds a new predefined dataset DeepSeekR1 and a corresponding DeepSeekR1Scorer that executes the official MLCommons evaluator out-of-process within an isolated uv subproject to avoid dependency conflicts. Additionally, the changes update the Dockerfile and benchmark execution flow to provision and handle this new evaluation mode. Feedback on the pull request focuses on enhancing robustness and security: mitigating a potential remote code execution (RCE) risk by setting trust_remote_code=False when loading the tokenizer, replacing several assert statements used for runtime validation with explicit conditional checks and exceptions, and adding safety guards in the DeepSeekR1 dataset loader to handle empty DataFrames and non-iterable token inputs.


gemini-code-assist · 2026-06-27T02:06:41Z

+    # basis for tokens_per_sample. Count with the DeepSeek tokenizer so the
+    # number matches MLPerf token accounting.
+    logger.info("Loading tokenizer: %s", args.tokenizer)
+    tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=True)


Setting trust_remote_code=True allows the execution of arbitrary code from the Hugging Face Hub repository when loading the tokenizer. Since DeepSeek-R1 uses standard tokenizer architectures that do not require custom remote code execution, this parameter should be set to False (or omitted) to mitigate potential remote code execution (RCE) security risks.

Suggested change

-    tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=True)
+    tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=False)

gemini-code-assist · 2026-06-27T02:06:41Z

+        Returns ``(passed, total)`` or None if the service is unreachable (so
+        score() can fall back to the in-process path).
+        """
+        assert self.lcb_websocket_url is not None


Using assert statements for runtime validation or flow control is discouraged because assertions can be globally disabled in Python when run with optimization flags (e.g., python -O). If assertions are disabled, this check will be bypassed, potentially leading to unexpected errors or behaviors. Consider replacing this with an explicit if check and raising an appropriate exception.

Suggested change

-        assert self.lcb_websocket_url is not None
+        if self.lcb_websocket_url is None:
+            raise ValueError("lcb_websocket_url must be configured to score LCB via container")

+1, raise is probably better
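The difference the review points at can be shown side by side. This is a standalone sketch with hypothetical function names, not code from the PR: under `python -O` the `assert` line is compiled away, so the first function would return `None` unchecked, while the second raises in every mode.

```python
from typing import Optional


def require_url_assert(url: Optional[str]) -> str:
    # Removed entirely when Python runs with -O, so None would pass through.
    assert url is not None
    return url


def require_url_explicit(url: Optional[str]) -> str:
    # Survives -O: the check is ordinary control flow, not an assertion.
    if url is None:
        raise ValueError(
            "lcb_websocket_url must be configured to score LCB via container"
        )
    return url


try:
    require_url_explicit(None)
except ValueError as exc:
    print(f"rejected: {exc}")
```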

gemini-code-assist · 2026-06-27T02:06:41Z

+        assert ref is not None, f"Dataset {self.dataset} has no dataframe loaded"
+        for col in (self.ground_truth_column, self.subset_column, self.question_column):
+            assert col in ref.columns, (
+                f"Column {col!r} not found in dataset {self.dataset}; "
+                f"available: {list(ref.columns)}"
+            )


Avoid using assert statements for runtime data validation as they can be disabled in optimized Python environments (using the -O flag). Replace these assertions with explicit conditional checks that raise descriptive exceptions (e.g., ValueError or RuntimeError) to ensure robust error handling and defensive programming.

    if ref is None:
        raise RuntimeError(f"Dataset {self.dataset} has no dataframe loaded")
    for col in (self.ground_truth_column, self.subset_column, self.question_column):
        if col not in ref.columns:
            raise ValueError(
                f"Column {col!r} not found in dataset {self.dataset}; "
                f"available: {list(ref.columns)}"
            )

gemini-code-assist · 2026-06-27T02:06:41Z

+                "input_tokens": raw["tok_input"].map(
+                    lambda t: t.tolist() if hasattr(t, "tolist") else list(t)
+                ),


If tok_input contains missing or non-iterable values (such as None or NaN), calling list(t) will raise a TypeError. To ensure robust data loading and defensive programming, add a guard to check if t is iterable before attempting to convert it to a list.

Suggested change

-                "input_tokens": raw["tok_input"].map(
-                    lambda t: t.tolist() if hasattr(t, "tolist") else list(t)
-                ),
+                "input_tokens": raw["tok_input"].map(
+                    lambda t: t.tolist() if hasattr(t, "tolist") else (list(t) if hasattr(t, "__iter__") else [])
+                ),
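For comparison, the same guard can be written as a named helper, which leaves room for explicit handling of each case. This is a sketch under assumptions: `to_token_list` is a hypothetical name, and rejecting strings is a design choice made here, not something the PR or the suggestion specifies. Note that `hasattr(t, "__iter__")` is also true for `str`, so the one-line guard would split a stray string into characters.

```python
import math
from typing import Any, List


def to_token_list(value: Any) -> List[int]:
    """Convert one tok_input cell to a plain list of token ids."""
    if value is None:
        return []  # missing cell
    if isinstance(value, float) and math.isnan(value):
        return []  # NaN is how a missing cell often appears in a table
    if isinstance(value, (str, bytes)):
        # Assumption: text here is malformed input, so fail loudly.
        raise TypeError(f"expected a token sequence, got {type(value).__name__}")
    if hasattr(value, "tolist"):
        return value.tolist()  # array-like objects
    return list(value)  # any other iterable, e.g. tuple


print(to_token_list((101, 102, 103)))
print(to_token_list(None))
```

Whether an empty list is the right result for a missing cell is a separate question, since it yields an empty prompt downstream; raising instead would surface the bad row earlier.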

gemini-code-assist · 2026-06-27T02:06:41Z

+        df: pd.DataFrame, max_samples: int, seed: int
+    ) -> pd.DataFrame:
+        frac = max_samples / len(df)
+        parts = [
+            group.sample(
+                n=min(max(1, round(len(group) * frac)), len(group)),
+                random_state=seed,
+            )
+            for _, group in df.groupby("dataset")
+        ]
+        return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


If the input DataFrame df is empty, len(df) will be 0, causing a ZeroDivisionError when calculating frac. Add a guard clause to return the empty DataFrame immediately if df is empty.

    def _stratified_subset(
        df: pd.DataFrame, max_samples: int, seed: int
    ) -> pd.DataFrame:
        if df.empty:
            return df
        frac = max_samples / len(df)
        parts = [
            group.sample(
                n=min(max(1, round(len(group) * frac)), len(group)),
                random_state=seed,
            )
            for _, group in df.groupby("dataset")
        ]
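The per-group allocation arithmetic in this hunk can be checked without pandas. The sketch below reuses the same `min(max(1, round(size * frac)), size)` expression on plain group sizes; `stratified_counts` is a hypothetical helper written for illustration, with the empty-input guard placed before the division.

```python
from typing import Dict


def stratified_counts(group_sizes: Dict[str, int], max_samples: int) -> Dict[str, int]:
    """Samples to draw per group, mirroring the expression in the diff."""
    total = sum(group_sizes.values())
    if total == 0:
        return {}  # guard: avoids dividing by zero on empty input
    frac = max_samples / total
    return {
        name: min(max(1, round(size * frac)), size)
        for name, size in group_sizes.items()
    }


print(stratified_counts({"math500": 100, "aime": 10}, max_samples=11))
print(stratified_counts({}, max_samples=11))
```

Because every non-empty group is floored at one sample, the total drawn can exceed `max_samples` when some groups are very small: sizes of 1000 and 1 with `max_samples=10` allocate 10 and 1.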

+            scoring_mod, "_lcb_ws_evaluate", lambda url, codes, timeout: None
+        )
+        scorer = self._scorer(dataset, staged, project)
+        score, n_repeats = scorer.score()




nvzhihanj · 2026-06-27T04:01:49Z

+_SERVICE_READY_TIMEOUT_ENV = "INFERENCE_ENDPOINT_SERVICE_READY_TIMEOUT_S"
+_SERVICE_READY_TIMEOUT_DEFAULT = 30.0
+
+
+def _service_ready_timeout() -> float:
+    """Service-startup readiness timeout, overridable via env.
+
+    Service imports off a shared/Lustre FS can exceed the default under heavy
+    login-node I/O contention. A non-numeric override is ignored with a warning


This seems to be a hack and a non-parameterized field?
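For context on the behavior under discussion, an env-overridable timeout of this kind could look as follows. Only the env var name, the 30.0 default, and the "non-numeric override is ignored with a warning" rule come from the diff; the function body, the `environ` parameter, and the log wording are assumptions, since the hunk above is truncated.

```python
import logging
import os
from typing import Mapping, Optional

logger = logging.getLogger(__name__)

ENV_NAME = "INFERENCE_ENDPOINT_SERVICE_READY_TIMEOUT_S"
DEFAULT_TIMEOUT_S = 30.0


def service_ready_timeout(environ: Optional[Mapping[str, str]] = None) -> float:
    """Return the readiness timeout, honoring a numeric env override."""
    env = os.environ if environ is None else environ
    raw = env.get(ENV_NAME)
    if raw is None:
        return DEFAULT_TIMEOUT_S
    try:
        return float(raw)
    except ValueError:
        # Non-numeric override: warn and keep the default.
        logger.warning("Ignoring non-numeric %s=%r", ENV_NAME, raw)
        return DEFAULT_TIMEOUT_S


print(service_ready_timeout({ENV_NAME: "120"}))
print(service_ready_timeout({ENV_NAME: "soon"}))
```

Passing the mapping as a parameter keeps the function testable without mutating the process environment; a config field, as the comment hints, would be the parameterized alternative.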

nvzhihanj · 2026-06-27T04:04:03Z

+logger = getLogger(__name__)
+
+#: Env var pointing at the local MLPerf DeepSeek-R1 source dataset.
+SOURCE_ENV = "DEEPSEEK_R1_DATASET_PKL"


Using a macro to point to the dataset is not a good design, especially a macro that points to another macro. We can push the parquet file to the repo, since it's small anyway.

nvzhihanj · 2026-06-27T04:04:31Z

+    ``.parquet``), or pass ``source=`` to :meth:`generate`.
+    """
+
+    COLUMN_NAMES = _OUTPUT_COLUMNS


Macro pointing to macro

nvzhihanj · 2026-06-27T04:06:18Z

@@ -0,0 +1,167 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.


We should probably call the dataset legacy_mlperf_deepseek_r1_dataset or something similar. Calling a dataset ds-r1 will be confusing.

nvzhihanj · 2026-06-27T04:07:31Z

+
+    @staticmethod
+    def _build_from_source(source: str | os.PathLike | None) -> pd.DataFrame:
+        resolved = source or os.environ.get(SOURCE_ENV)


Basically, we cannot build from source and have to use the existing one. We can probably raise NotImplementedError here and only allow local usage of the parquet file; otherwise the logic here is quite confusing.

nvzhihanj · 2026-06-27T04:08:56Z

+requires-python = ">=3.10"
+dependencies = [
+    # eval_accuracy.py core
+    "pandas>=2.0",


Would recommend pinning versions with == instead of >= to avoid maintenance issues.

nvzhihanj · 2026-06-27T04:09:23Z

@@ -0,0 +1,106 @@
+# DeepSeek-R1 Accuracy Evaluator Subproject


Not the right place for documentation. Example is the better place.

nvzhihanj · 2026-06-27T04:10:38Z

+EVAL_ACCURACY_URL="https://raw.githubusercontent.com/mlcommons/inference/${MLC_INFERENCE_COMMIT}/language/deepseek-r1/eval_accuracy.py"
+
+mkdir -p "${SUBMODULES_DIR}"
+
+echo "==> Fetching eval_accuracy.py @ ${MLC_INFERENCE_COMMIT}"
+curl -fsSL "${EVAL_ACCURACY_URL}" -o "${EVAL_DIR}/eval_accuracy.py"


I understand we need to fetch the LCB_REPO, but if we are only fetching one file from the MLCommons inference repo, can we just place it here?

nvzhihanj · 2026-06-27T04:12:50Z

+
+_DEFAULT_DEEPSEEK_EVAL_PROJECT_PATH = Path(__file__).resolve().parent / "deepseek_r1"
+
+_DEEPSEEK_EVAL_PROJECT_PATH_ENV = "DEEPSEEK_EVAL_PROJECT_PATH"


MACRO pointing to Macro (one-time)

nvzhihanj · 2026-06-27T04:18:15Z

+    def _resolve_project_path(explicit: os.PathLike | None) -> Path:
+        """Resolve the DeepSeek eval subproject path.
+
+        Lookup order: explicit ctor arg -> ``$DEEPSEEK_EVAL_PROJECT_PATH`` env


What would be the use case of this - and should we allow such ENV override for the eval directory?
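The lookup order named in the docstring (explicit argument, then `$DEEPSEEK_EVAL_PROJECT_PATH`, then the in-tree default) can be sketched as below. This is an illustration, not the PR's `_resolve_project_path`: the signature, the `environ` parameter, and treating an empty env value as unset are all assumptions.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Union

ENV_NAME = "DEEPSEEK_EVAL_PROJECT_PATH"
PathLike = Union[str, "os.PathLike[str]"]


def resolve_project_path(
    explicit: Optional[PathLike],
    default: PathLike,
    environ: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick the eval subproject path: explicit arg, then env var, then default."""
    env = os.environ if environ is None else environ
    if explicit is not None:
        return Path(explicit)  # caller's choice always wins
    from_env = env.get(ENV_NAME)
    if from_env:  # assumption: an empty string counts as unset
        return Path(from_env)
    return Path(default)  # e.g. the directory next to the scorer module


print(resolve_project_path(None, "evaluation/deepseek_r1", {ENV_NAME: "/opt/eval"}))
print(resolve_project_path(None, "evaluation/deepseek_r1", {}))
```

One plausible use for the env step is an installed wheel that excludes the subproject, where no in-tree default exists; whether that justifies the override is the question the comment raises.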

nvzhihanj · 2026-06-27T04:20:30Z

+                f"Last 50 lines:\n{tail}"
+            )
+
+    def _score_lcb_via_container(self, lcb_df: pd.DataFrame) -> tuple[int, int] | None:


Should this code be reused from the GPT-OSS LCB?

feat : deepseek r1 accuracy support (release/v0.5) (#351)

a33c944


arekay-nv requested a review from a team June 27, 2026 02:05

github-actions Bot requested a review from nvzhihanj June 27, 2026 02:05

gemini-code-assist Bot reviewed Jun 27, 2026

github-code-quality Bot found potential problems Jun 27, 2026

nvzhihanj reviewed Jun 27, 2026
