feat : deepseek r1 accuracy support (release/v0.5) (#351)#378
Conversation
Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com> --------- Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
There was a problem hiding this comment.
Code Review
This pull request introduces support for the MLPerf DeepSeek-R1 combined-subset accuracy evaluation. It adds a new predefined dataset DeepSeekR1 and a corresponding DeepSeekR1Scorer that executes the official MLCommons evaluator out-of-process within an isolated uv subproject to avoid dependency conflicts. Additionally, the changes update the Dockerfile and benchmark execution flow to provision and handle this new evaluation mode. Feedback on the pull request focuses on enhancing robustness and security: mitigating a potential remote code execution (RCE) risk by setting trust_remote_code=False when loading the tokenizer, replacing several assert statements used for runtime validation with explicit conditional checks and exceptions, and adding safety guards in the DeepSeekR1 dataset loader to handle empty DataFrames and non-iterable token inputs.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| # basis for tokens_per_sample. Count with the DeepSeek tokenizer so the | ||
| # number matches MLPerf token accounting. | ||
| logger.info("Loading tokenizer: %s", args.tokenizer) | ||
| tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=True) |
There was a problem hiding this comment.
Setting trust_remote_code=True allows the execution of arbitrary code from the Hugging Face Hub repository when loading the tokenizer. Since DeepSeek-R1 uses standard tokenizer architectures that do not require custom remote code execution, this parameter should be set to False (or omitted) to mitigate potential remote code execution (RCE) security risks.
| tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=True) | |
| tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=False) |
| Returns ``(passed, total)`` or None if the service is unreachable (so | ||
| score() can fall back to the in-process path). | ||
| """ | ||
| assert self.lcb_websocket_url is not None |
There was a problem hiding this comment.
Using assert statements for runtime validation or flow control is discouraged because assertions can be globally disabled in Python when run with optimization flags (e.g., python -O). If assertions are disabled, this check will be bypassed, potentially leading to unexpected errors or behaviors. Consider replacing this with an explicit if check and raising an appropriate exception.
| assert self.lcb_websocket_url is not None | |
| if self.lcb_websocket_url is None: | |
| raise ValueError("lcb_websocket_url must be configured to score LCB via container") |
There was a problem hiding this comment.
+1, raise is probably better
| assert ref is not None, f"Dataset {self.dataset} has no dataframe loaded" | ||
| for col in (self.ground_truth_column, self.subset_column, self.question_column): | ||
| assert col in ref.columns, ( | ||
| f"Column {col!r} not found in dataset {self.dataset}; " | ||
| f"available: {list(ref.columns)}" | ||
| ) |
There was a problem hiding this comment.
Avoid using assert statements for runtime data validation as they can be disabled in optimized Python environments (using the -O flag). Replace these assertions with explicit conditional checks that raise descriptive exceptions (e.g., ValueError or RuntimeError) to ensure robust error handling and defensive programming.
if ref is None:
raise RuntimeError(f"Dataset {self.dataset} has no dataframe loaded")
for col in (self.ground_truth_column, self.subset_column, self.question_column):
if col not in ref.columns:
raise ValueError(
f"Column {col!r} not found in dataset {self.dataset}; "
f"available: {list(ref.columns)}"
)| "input_tokens": raw["tok_input"].map( | ||
| lambda t: t.tolist() if hasattr(t, "tolist") else list(t) | ||
| ), |
There was a problem hiding this comment.
If tok_input contains missing or non-iterable values (such as None or NaN), calling list(t) will raise a TypeError. To ensure robust data loading and defensive programming, add a guard to check if t is iterable before attempting to convert it to a list.
| "input_tokens": raw["tok_input"].map( | |
| lambda t: t.tolist() if hasattr(t, "tolist") else list(t) | |
| ), | |
| "input_tokens": raw["tok_input"].map( | |
| lambda t: t.tolist() if hasattr(t, "tolist") else (list(t) if hasattr(t, "__iter__") else []) | |
| ), |
| df: pd.DataFrame, max_samples: int, seed: int | ||
| ) -> pd.DataFrame: | ||
| frac = max_samples / len(df) | ||
| parts = [ | ||
| group.sample( | ||
| n=min(max(1, round(len(group) * frac)), len(group)), | ||
| random_state=seed, | ||
| ) | ||
| for _, group in df.groupby("dataset") | ||
| ] | ||
| return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True) |
There was a problem hiding this comment.
If the input DataFrame df is empty, len(df) will be 0, causing a ZeroDivisionError when calculating frac. Add a guard clause to return the empty DataFrame immediately if df is empty.
def _stratified_subset(
df: pd.DataFrame, max_samples: int, seed: int
) -> pd.DataFrame:
if df.empty:
return df
frac = max_samples / len(df)
parts = [
group.sample(
n=min(max(1, round(len(group) * frac)), len(group)),
random_state=seed,
)
for _, group in df.groupby("dataset")
]| scoring_mod, "_lcb_ws_evaluate", lambda url, codes, timeout: None | ||
| ) | ||
| scorer = self._scorer(dataset, staged, project) | ||
| score, n_repeats = scorer.score() |
| scoring_mod, "_lcb_ws_evaluate", lambda url, codes, timeout: None | ||
| ) | ||
| scorer = self._scorer(dataset, staged, project) | ||
| score, n_repeats = scorer.score() |
| _SERVICE_READY_TIMEOUT_ENV = "INFERENCE_ENDPOINT_SERVICE_READY_TIMEOUT_S" | ||
| _SERVICE_READY_TIMEOUT_DEFAULT = 30.0 | ||
|
|
||
|
|
||
| def _service_ready_timeout() -> float: | ||
| """Service-startup readiness timeout, overridable via env. | ||
|
|
||
| Service imports off a shared/Lustre FS can exceed the default under heavy | ||
| login-node I/O contention. A non-numeric override is ignored with a warning |
There was a problem hiding this comment.
This seems to be a hack and non-parameterized field?
| logger = getLogger(__name__) | ||
|
|
||
| #: Env var pointing at the local MLPerf DeepSeek-R1 source dataset. | ||
| SOURCE_ENV = "DEEPSEEK_R1_DATASET_PKL" |
There was a problem hiding this comment.
Using MACRO to point dataset is not a good design, especially using a macro to point to another macro. We can push the parquet file to the repo since it's small anyway
| ``.parquet``), or pass ``source=`` to :meth:`generate`. | ||
| """ | ||
|
|
||
| COLUMN_NAMES = _OUTPUT_COLUMNS |
There was a problem hiding this comment.
Macro pointing to macro
| @@ -0,0 +1,167 @@ | |||
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |||
There was a problem hiding this comment.
we should probably call the dataset legacy_mlperf_deepseek_r1_dataset or something similar. Calling a dataset ds-r1 will be confusing
|
|
||
| @staticmethod | ||
| def _build_from_source(source: str | os.PathLike | None) -> pd.DataFrame: | ||
| resolved = source or os.environ.get(SOURCE_ENV) |
There was a problem hiding this comment.
Basically we cannot build from source, and have to use existing one. We can probably raise not implemented here and only allow local usage of the parquet then
(Otherwise the logic here is quite confusing)
| requires-python = ">=3.10" | ||
| dependencies = [ | ||
| # eval_accuracy.py core | ||
| "pandas>=2.0", |
There was a problem hiding this comment.
Would recommend locking the version == instead of >= to avoid maintenance issues
| @@ -0,0 +1,106 @@ | |||
| # DeepSeek-R1 Accuracy Evaluator Subproject | |||
There was a problem hiding this comment.
Not the right place for documentation. Example is the better place.
| EVAL_ACCURACY_URL="https://raw.githubusercontent.com/mlcommons/inference/${MLC_INFERENCE_COMMIT}/language/deepseek-r1/eval_accuracy.py" | ||
|
|
||
| mkdir -p "${SUBMODULES_DIR}" | ||
|
|
||
| echo "==> Fetching eval_accuracy.py @ ${MLC_INFERENCE_COMMIT}" | ||
| curl -fsSL "${EVAL_ACCURACY_URL}" -o "${EVAL_DIR}/eval_accuracy.py" |
There was a problem hiding this comment.
I understand we need to fetch the LCB_REPO, but if we are only fetching one file from MLCommon inference repo, can we just place it here?
|
|
||
| _DEFAULT_DEEPSEEK_EVAL_PROJECT_PATH = Path(__file__).resolve().parent / "deepseek_r1" | ||
|
|
||
| _DEEPSEEK_EVAL_PROJECT_PATH_ENV = "DEEPSEEK_EVAL_PROJECT_PATH" |
There was a problem hiding this comment.
MACRO pointing to Macro (one-time)
| def _resolve_project_path(explicit: os.PathLike | None) -> Path: | ||
| """Resolve the DeepSeek eval subproject path. | ||
|
|
||
| Lookup order: explicit ctor arg -> ``$DEEPSEEK_EVAL_PROJECT_PATH`` env |
There was a problem hiding this comment.
What would be the use case of this - and should we allow such ENV override for the eval directory?
| f"Last 50 lines:\n{tail}" | ||
| ) | ||
|
|
||
| def _score_lcb_via_container(self, lcb_df: pd.DataFrame) -> tuple[int, int] | None: |
There was a problem hiding this comment.
Should this code be reused from the GPT-OSS LCB?
Cherry-pick DSR1 support back from the release v0.5 branch which has been thoroughly tested.
Adds end-to-end MLPerf DeepSeek-R1 combined-subset accuracy evaluation. DeepSeek-R1's accuracy set is an ensemble of five subsets (
math500,aime,gpqa,mmlu_pro,livecodebench), each parsed and graded differently. This PR wires up dataset loading, the perf/accuracy phases, and a new scorer that produces a single headlineexact_matchon the same 0–100 scale as the MLPerf golden number (81.3582).What's included
DeepSeekR1Scorer(scorer_id="deepseek_r1") inevaluation/scoring.pyevaluation/deepseek_r1/(runner,pyproject.toml,uv.lock,setup_eval.sh,RUNBOOK.md) — excluded from the parent wheelDeepSeekR1loader (dataset_id="deepseek_r1") indataset_manager/predefined/deepseek_r1/+ registration indataset_manager/__init__.pycommands/benchmark/execute.py: per-scorercompleteflag plumbed into results; clearer error on badaccuracy_config.extras; configurable service-ready timeout_lcb_ws_evaluate()LiveCodeBench WebSocket client, shared byLiveCodeBenchScorerandDeepSeekR1ScorerAGENTS.md, CI workflow tweaks,transforms.py, and new unit tests for the dataset + scorerDesign notes
eval_accuracy.pypulls in heavy, conflicting pins (transformers, theprm800kmath grader, the LiveCodeBench executor). Following the existingVBenchScorerpattern, the evaluator runs viauv run --projectagainst the isolated subproject; the parent process never imports it. Path resolution is explicit arg →$DEEPSEEK_EVAL_PROJECT_PATH→ editable-checkout fallback.input_tokenscolumn (MLPerftok_input) so theopenai_completionsadapter'sHarmonize()is a no-op and the server chat template is bypassed — the model sees the exact MLPerf prompt. One loader serves both the perf phase (issuesinput_tokens) and the accuracy phase (grades bydataset/ground_truth).livecodebenchsubset is graded out-of-band against thelcb-serviceWebSocket container; text subsets go through the subprocess; results are merged into one 5-subset number.Scorer.completeflag (defaultTrue) is setFalsewhenever the headline can't cover every issued sample — failed text subset, diverging LCB count, or an unreachablelcb-servicecontainer (LCB left unscored rather than silently shrinking the denominator). The flag is surfaced in results and logs so a partial number is never mistaken for a complete one.Type of change
Related issues
Testing
Checklist