
feat : deepseek r1 accuracy support (release/v0.5) (#351) #378

Open
arekay-nv wants to merge 1 commit into main from arekay/cherry_pick_dsr1

Conversation

@arekay-nv

Collaborator

Cherry-picks DeepSeek-R1 (DSR1) support from the release/v0.5 branch, where it has been thoroughly tested.

Adds end-to-end MLPerf DeepSeek-R1 combined-subset accuracy evaluation. DeepSeek-R1's accuracy set is an ensemble of five subsets (math500, aime, gpqa, mmlu_pro, livecodebench), each parsed and graded differently. This PR wires up dataset loading, the perf/accuracy phases, and a new scorer that produces a single headline exact_match on the same 0–100 scale as the MLPerf golden number (81.3582).

What's included

| Area | Change |
| --- | --- |
| Scorer | DeepSeekR1Scorer (scorer_id="deepseek_r1") in evaluation/scoring.py |
| Isolated evaluator | New uv subproject evaluation/deepseek_r1/ (runner, pyproject.toml, uv.lock, setup_eval.sh, RUNBOOK.md), excluded from the parent wheel |
| Dataset | DeepSeekR1 loader (dataset_id="deepseek_r1") in dataset_manager/predefined/deepseek_r1/ + registration in dataset_manager/__init__.py |
| Plumbing | commands/benchmark/execute.py: per-scorer complete flag plumbed into results; clearer error on bad accuracy_config.extras; configurable service-ready timeout |
| Shared client | _lcb_ws_evaluate() LiveCodeBench WebSocket client, shared by LiveCodeBenchScorer and DeepSeekR1Scorer |
| Docs/CI/tests | AGENTS.md, CI workflow tweaks, transforms.py, and new unit tests for the dataset + scorer |

Design notes

  • Out-of-process evaluation. The official MLCommons eval_accuracy.py pulls in heavy, conflicting pins (transformers, the prm800k math grader, the LiveCodeBench executor). Following the existing VBenchScorer pattern, the evaluator runs via uv run --project against the isolated subproject; the parent process never imports it. Path resolution is explicit arg → $DEEPSEEK_EVAL_PROJECT_PATH → editable-checkout fallback.
  • Pre-tokenized prompt. The dataset emits an input_tokens column (MLPerf tok_input) so the openai_completions adapter's Harmonize() is a no-op and the server chat template is bypassed — the model sees the exact MLPerf prompt. One loader serves both the perf phase (issues input_tokens) and the accuracy phase (grades by dataset/ground_truth).
  • LiveCodeBench sandboxing. LCB executes untrusted model code, which the in-process MLCommons executor can't sandbox. When a port is configured, the livecodebench subset is graded out-of-band against the lcb-service WebSocket container; text subsets go through the subprocess; results are merged into one 5-subset number.
  • Partial-result honesty. A new Scorer.complete flag (default True) is set False whenever the headline can't cover every issued sample — failed text subset, diverging LCB count, or an unreachable lcb-service container (LCB left unscored rather than silently shrinking the denominator). The flag is surfaced in results and logs so a partial number is never mistaken for a complete one.
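To make the partial-result rule concrete, here is a minimal sketch of how per-subset results might be merged into one headline with a `complete` flag. It is not the PR's implementation: `Headline`, `merge_subsets`, and the choice to keep every issued sample in the denominator are assumptions made for illustration, based on one reading of "rather than silently shrinking the denominator".

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Headline:
    exact_match: float  # 0-100 scale, like the MLPerf golden number
    complete: bool      # False if any issued sample went ungraded


def merge_subsets(
    issued: dict[str, int],
    graded: dict[str, tuple[int, int] | None],
) -> Headline:
    """Combine per-subset (passed, total) pairs into one headline number.

    `issued` maps subset name -> number of samples sent to the model.
    `graded` maps subset name -> (passed, total), or None if grading failed
    (for example, an unreachable lcb-service container).
    """
    passed_sum = 0
    complete = True
    for name, n_issued in issued.items():
        result = graded.get(name)
        if result is None:
            # Subset left unscored: flag it instead of dropping it quietly.
            complete = False
            continue
        passed, total = result
        if total != n_issued:
            # Graded count diverges from what was issued.
            complete = False
        passed_sum += passed
    # Assumption: the denominator always covers every issued sample.
    denom = sum(issued.values())
    score = 100.0 * passed_sum / denom if denom else 0.0
    return Headline(exact_match=score, complete=complete)
```

With this shape, a caller can refuse to report `exact_match` as final whenever `complete` is False.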

Type of change

  • Bug fix
  • New feature
  • Documentation update
  • Refactor/cleanup

Related issues

Testing

  • Tests added/updated
  • All tests pass locally
  • Manual testing completed

Checklist

  • Code follows project style
  • Pre-commit hooks pass
  • Documentation updated (if needed)

Signed-off-by: Rashid Kaleem <230885705+arekay-nv@users.noreply.github.com>

@arekay-nv arekay-nv requested a review from a team June 27, 2026 02:05
@github-actions


MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions Bot requested a review from nvzhihanj June 27, 2026 02:05

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for the MLPerf DeepSeek-R1 combined-subset accuracy evaluation. It adds a new predefined dataset DeepSeekR1 and a corresponding DeepSeekR1Scorer that executes the official MLCommons evaluator out-of-process within an isolated uv subproject to avoid dependency conflicts. Additionally, the changes update the Dockerfile and benchmark execution flow to provision and handle this new evaluation mode. Feedback on the pull request focuses on enhancing robustness and security: mitigating a potential remote code execution (RCE) risk by setting trust_remote_code=False when loading the tokenizer, replacing several assert statements used for runtime validation with explicit conditional checks and exceptions, and adding safety guards in the DeepSeekR1 dataset loader to handle empty DataFrames and non-iterable token inputs.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

# basis for tokens_per_sample. Count with the DeepSeek tokenizer so the
# number matches MLPerf token accounting.
logger.info("Loading tokenizer: %s", args.tokenizer)
tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=True)

Contributor

security-high high

Setting trust_remote_code=True allows the execution of arbitrary code from the Hugging Face Hub repository when loading the tokenizer. Since DeepSeek-R1 uses standard tokenizer architectures that do not require custom remote code execution, this parameter should be set to False (or omitted) to mitigate potential remote code execution (RCE) security risks.

Suggested change
- tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=True)
+ tok = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=False)

        Returns ``(passed, total)`` or None if the service is unreachable (so
        score() can fall back to the in-process path).
        """
        assert self.lcb_websocket_url is not None

Contributor

medium

Using assert statements for runtime validation or flow control is discouraged because assertions can be globally disabled in Python when run with optimization flags (e.g., python -O). If assertions are disabled, this check will be bypassed, potentially leading to unexpected errors or behaviors. Consider replacing this with an explicit if check and raising an appropriate exception.

Suggested change
- assert self.lcb_websocket_url is not None
+ if self.lcb_websocket_url is None:
+     raise ValueError("lcb_websocket_url must be configured to score LCB via container")

Collaborator

+1, raise is probably better
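For anyone who wants to see the difference, the following sketch runs the same one-liner with and without `-O`, then shows the explicit-check form from the suggestion. `LcbClient` and `require_url` are hypothetical stand-ins, not names from the PR; only the guard pattern matters.

```python
import subprocess
import sys

# The assert guards the print. Under -O the assert is compiled away.
snippet = "assert False, 'guard tripped'; print('reached')"

normal = subprocess.run(
    [sys.executable, "-c", snippet], capture_output=True, text=True
)
optimized = subprocess.run(
    [sys.executable, "-O", "-c", snippet], capture_output=True, text=True
)

print(normal.returncode)         # nonzero: AssertionError stops the script
print(optimized.stdout.strip())  # reached


class LcbClient:
    """Hypothetical stand-in for the scorer; only the guard is of interest."""

    def __init__(self, lcb_websocket_url=None):
        self.lcb_websocket_url = lcb_websocket_url

    def require_url(self) -> str:
        # An explicit check survives python -O, unlike an assert.
        if self.lcb_websocket_url is None:
            raise ValueError(
                "lcb_websocket_url must be configured to score LCB via container"
            )
        return self.lcb_websocket_url
```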

Comment on lines +1962 to +1967
        assert ref is not None, f"Dataset {self.dataset} has no dataframe loaded"
        for col in (self.ground_truth_column, self.subset_column, self.question_column):
            assert col in ref.columns, (
                f"Column {col!r} not found in dataset {self.dataset}; "
                f"available: {list(ref.columns)}"
            )

Contributor

medium

Avoid using assert statements for runtime data validation as they can be disabled in optimized Python environments (using the -O flag). Replace these assertions with explicit conditional checks that raise descriptive exceptions (e.g., ValueError or RuntimeError) to ensure robust error handling and defensive programming.

        if ref is None:
            raise RuntimeError(f"Dataset {self.dataset} has no dataframe loaded")
        for col in (self.ground_truth_column, self.subset_column, self.question_column):
            if col not in ref.columns:
                raise ValueError(
                    f"Column {col!r} not found in dataset {self.dataset}; "
                    f"available: {list(ref.columns)}"
                )

Comment on lines +146 to +148
            "input_tokens": raw["tok_input"].map(
                lambda t: t.tolist() if hasattr(t, "tolist") else list(t)
            ),

Contributor

medium

If tok_input contains missing or non-iterable values (such as None or NaN), calling list(t) will raise a TypeError. To ensure robust data loading and defensive programming, add a guard to check if t is iterable before attempting to convert it to a list.

Suggested change
- "input_tokens": raw["tok_input"].map(
-     lambda t: t.tolist() if hasattr(t, "tolist") else list(t)
- ),
+ "input_tokens": raw["tok_input"].map(
+     lambda t: t.tolist() if hasattr(t, "tolist") else (list(t) if hasattr(t, "__iter__") else [])
+ ),
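A small standalone sketch of what the suggested guard does on different row values may help here. `to_token_list` mirrors the suggested lambda; `to_token_list_strict` is a hypothetical alternative, not something proposed in the PR, for the case where a bad row should fail loudly instead of becoming an empty prompt. `array.array` is used only as a standard-library type that has a `tolist()` method.

```python
import array


def to_token_list(t):
    """Mirror the suggested lambda: tolist() if present, else list(), else []."""
    if hasattr(t, "tolist"):
        return t.tolist()
    return list(t) if hasattr(t, "__iter__") else []


def to_token_list_strict(t):
    """Hypothetical alternative: reject rows that are not token sequences."""
    if hasattr(t, "tolist"):
        return t.tolist()
    if isinstance(t, (list, tuple)):
        return list(t)
    raise TypeError(f"tok_input row is not a token sequence: {t!r}")


print(to_token_list(array.array("i", [1, 2, 3])))  # [1, 2, 3]
print(to_token_list((4, 5)))                       # [4, 5]
print(to_token_list(None))                         # []
print(to_token_list(float("nan")))                 # []
```

Note that the `__iter__` check also accepts strings, so a stray string row would be split into characters. Which behavior is right depends on whether an empty or malformed prompt should ever reach the perf phase.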

Comment on lines +157 to +167
        df: pd.DataFrame, max_samples: int, seed: int
    ) -> pd.DataFrame:
        frac = max_samples / len(df)
        parts = [
            group.sample(
                n=min(max(1, round(len(group) * frac)), len(group)),
                random_state=seed,
            )
            for _, group in df.groupby("dataset")
        ]
        return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

Contributor

medium

If the input DataFrame df is empty, len(df) will be 0, causing a ZeroDivisionError when calculating frac. Add a guard clause to return the empty DataFrame immediately if df is empty.

    def _stratified_subset(
        df: pd.DataFrame, max_samples: int, seed: int
    ) -> pd.DataFrame:
        if df.empty:
            return df
        frac = max_samples / len(df)
        parts = [
            group.sample(
                n=min(max(1, round(len(group) * frac)), len(group)),
                random_state=seed,
            )
            for _, group in df.groupby("dataset")
        ]
        return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)
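The per-group sizing formula in the snippet above can be checked without pandas. The sketch below reproduces only the arithmetic; `stratified_sizes` and the group sizes are made up for illustration. It also shows a side effect of `max(1, ...)`: small groups are always represented, so the total can exceed `max_samples`.

```python
def stratified_sizes(group_sizes: dict[str, int], max_samples: int) -> dict[str, int]:
    """Per-group sample counts using the same formula as the loader snippet."""
    total = sum(group_sizes.values())
    if total == 0:
        # Mirrors the `if df.empty: return df` guard: nothing to sample,
        # and no division by zero.
        return {}
    frac = max_samples / total
    return {
        name: min(max(1, round(size * frac)), size)
        for name, size in group_sizes.items()
    }


# Illustrative group sizes, not taken from the real dataset.
sizes = stratified_sizes({"a": 500, "b": 30, "c": 198}, max_samples=10)
print(sizes)                # {'a': 7, 'b': 1, 'c': 3}
print(sum(sizes.values()))  # 11, one more than max_samples
print(stratified_sizes({}, max_samples=10))  # {}
```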

scoring_mod, "_lcb_ws_evaluate", lambda url, codes, timeout: None
)
scorer = self._scorer(dataset, staged, project)
score, n_repeats = scorer.score()
Comment on lines +109 to +117
_SERVICE_READY_TIMEOUT_ENV = "INFERENCE_ENDPOINT_SERVICE_READY_TIMEOUT_S"
_SERVICE_READY_TIMEOUT_DEFAULT = 30.0


def _service_ready_timeout() -> float:
    """Service-startup readiness timeout, overridable via env.

    Service imports off a shared/Lustre FS can exceed the default under heavy
    login-node I/O contention. A non-numeric override is ignored with a warning

Collaborator

This looks like a hack: the timeout is read from an environment variable instead of being a proper parameter?
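For discussion, here is a sketch of the behavior the docstring describes (default of 30.0, env override, non-numeric override ignored with a warning), plus one way to address the parameterization concern. The `explicit` argument and the injectable `env` mapping are assumptions added for illustration; the PR's function takes no arguments.

```python
from __future__ import annotations

import logging
import os
from collections.abc import Mapping

logger = logging.getLogger(__name__)

_SERVICE_READY_TIMEOUT_ENV = "INFERENCE_ENDPOINT_SERVICE_READY_TIMEOUT_S"
_SERVICE_READY_TIMEOUT_DEFAULT = 30.0


def service_ready_timeout(
    explicit: float | None = None,
    env: Mapping[str, str] = os.environ,
) -> float:
    """Resolve the readiness timeout: explicit value, then env, then default."""
    if explicit is not None:
        # Hypothetical: a config/CLI parameter would take precedence.
        return float(explicit)
    raw = env.get(_SERVICE_READY_TIMEOUT_ENV)
    if raw is None:
        return _SERVICE_READY_TIMEOUT_DEFAULT
    try:
        return float(raw)
    except ValueError:
        # Non-numeric override: warn and keep the default.
        logger.warning(
            "Ignoring non-numeric %s=%r; using %.1fs",
            _SERVICE_READY_TIMEOUT_ENV,
            raw,
            _SERVICE_READY_TIMEOUT_DEFAULT,
        )
        return _SERVICE_READY_TIMEOUT_DEFAULT
```

Passing `env` explicitly keeps the function easy to unit test without patching `os.environ`.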

logger = getLogger(__name__)

#: Env var pointing at the local MLPerf DeepSeek-R1 source dataset.
SOURCE_ENV = "DEEPSEEK_R1_DATASET_PKL"

Collaborator

Using a macro to point to the dataset is not a good design, especially a macro that points to another macro. We can push the parquet file to the repo, since it's small anyway.

``.parquet``), or pass ``source=`` to :meth:`generate`.
"""

COLUMN_NAMES = _OUTPUT_COLUMNS

Collaborator

Macro pointing to macro

@@ -0,0 +1,167 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Collaborator

We should probably call the dataset legacy_mlperf_deepseek_r1_dataset or something similar. Calling a dataset ds-r1 will be confusing.


    @staticmethod
    def _build_from_source(source: str | os.PathLike | None) -> pd.DataFrame:
        resolved = source or os.environ.get(SOURCE_ENV)

Collaborator

Basically, we cannot build from source and have to use an existing file. We could raise NotImplementedError here and only allow local usage of the parquet file.
(Otherwise the logic here is quite confusing.)

requires-python = ">=3.10"
dependencies = [
    # eval_accuracy.py core
    "pandas>=2.0",

Collaborator

I would recommend pinning the versions with == instead of >= to avoid maintenance issues.

@@ -0,0 +1,106 @@
# DeepSeek-R1 Accuracy Evaluator Subproject

Collaborator

Not the right place for documentation. Example is the better place.

Comment on lines +30 to +35
EVAL_ACCURACY_URL="https://raw.githubusercontent.com/mlcommons/inference/${MLC_INFERENCE_COMMIT}/language/deepseek-r1/eval_accuracy.py"

mkdir -p "${SUBMODULES_DIR}"

echo "==> Fetching eval_accuracy.py @ ${MLC_INFERENCE_COMMIT}"
curl -fsSL "${EVAL_ACCURACY_URL}" -o "${EVAL_DIR}/eval_accuracy.py"

Collaborator

I understand we need to fetch the LCB_REPO, but if we are only fetching one file from the MLCommons inference repo, can we just place it here?


_DEFAULT_DEEPSEEK_EVAL_PROJECT_PATH = Path(__file__).resolve().parent / "deepseek_r1"

_DEEPSEEK_EVAL_PROJECT_PATH_ENV = "DEEPSEEK_EVAL_PROJECT_PATH"

@nvzhihanj nvzhihanj Jun 27, 2026

Collaborator

MACRO pointing to Macro (one-time)

def _resolve_project_path(explicit: os.PathLike | None) -> Path:
    """Resolve the DeepSeek eval subproject path.

    Lookup order: explicit ctor arg -> ``$DEEPSEEK_EVAL_PROJECT_PATH`` env

Collaborator

What would be the use case of this - and should we allow such ENV override for the eval directory?
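To ground the question, the lookup order stated in the docstring and in the PR description (explicit arg, then `$DEEPSEEK_EVAL_PROJECT_PATH`, then the checkout fallback) can be sketched as below. This is an illustration, not the PR's code: `resolve_project_path`, the `default` argument, and the injectable `env` mapping are assumptions, and the real function may validate the path further.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from pathlib import Path

_DEEPSEEK_EVAL_PROJECT_PATH_ENV = "DEEPSEEK_EVAL_PROJECT_PATH"


def resolve_project_path(
    explicit: str | os.PathLike | None,
    default: Path,
    env: Mapping[str, str] = os.environ,
) -> Path:
    """Pick the eval subproject path: explicit arg, then env var, then default."""
    if explicit is not None:
        return Path(explicit)
    from_env = env.get(_DEEPSEEK_EVAL_PROJECT_PATH_ENV)
    if from_env:
        # An empty string is treated the same as unset (an assumption).
        return Path(from_env)
    return default


fallback = Path("evaluation") / "deepseek_r1"
print(resolve_project_path("/opt/eval", fallback, env={}))
print(resolve_project_path(None, fallback, env={"DEEPSEEK_EVAL_PROJECT_PATH": "/srv/eval"}))
print(resolve_project_path(None, fallback, env={}))
```

Dropping the middle branch would remove the env override entirely, which is the simplification the question seems to point toward.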

f"Last 50 lines:\n{tail}"
)

def _score_lcb_via_container(self, lcb_df: pd.DataFrame) -> tuple[int, int] | None:

Collaborator

Should this code be reused from the GPT-OSS LCB?
