Hugging Face Model integration in Superbench#803
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds HuggingFace Hub as a first-class model source across SuperBench training benchmarks and ORT/TensorRT inference micro-benchmarks, enabling users to benchmark arbitrary HF models via CLI flags (including gated models via HF_TOKEN).
Changes:
- Introduces
ModelSourceConfigandHuggingFaceModelLoaderfor unified HF model configuration/loading and memory-fit checks. - Extends PyTorch model benchmarks to optionally load HF backbones and wrap them with task-specific heads.
- Adds HF→ONNX export support and integrates HF flows into ORT and TensorRT inference micro-benchmarks, plus new tests and examples.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/benchmarks/micro_benchmarks/test_model_source_config.py | Adds unit tests for ModelSourceConfig validation/defaulting. |
| tests/benchmarks/micro_benchmarks/test_huggingface_loader.py | Adds unit tests for HF loader dtype handling, load flow, and size estimation. |
| tests/benchmarks/micro_benchmarks/test_huggingface_e2e.py | Adds integration tests that download real HF models and validate basic forward pass. |
| superbench/benchmarks/model_benchmarks/pytorch_mixtral_impl.py | Adds HF config customization + wrapper and HF-loading branch for Mixtral benchmark. |
| superbench/benchmarks/model_benchmarks/pytorch_lstm.py | Adds HF-loading path + wrapper and refactors in-house model creation. |
| superbench/benchmarks/model_benchmarks/pytorch_llama.py | Adds HF-loading path + wrapper and refactors in-house model creation. |
| superbench/benchmarks/model_benchmarks/pytorch_gpt2.py | Adds HF-loading path + wrapper and refactors in-house model creation. |
| superbench/benchmarks/model_benchmarks/pytorch_cnn.py | Adds HF-loading path + wrapper for HF vision backbones, keeps in-house torchvision path. |
| superbench/benchmarks/model_benchmarks/pytorch_bert.py | Adds HF-loading path + wrapper and refactors in-house model creation. |
| superbench/benchmarks/model_benchmarks/pytorch_base.py | Adds shared HF model loading flow, memory estimation, and CLI args for model source/identifier. |
| superbench/benchmarks/micro_benchmarks/tensorrt_inference_performance.py | Adds HF model preprocessing: config-only memory check, HF load, ONNX export, TRT build command. |
| superbench/benchmarks/micro_benchmarks/ort_inference_performance.py | Adds HF preprocessing (config memory check, HF load, ONNX export/quantize) + dynamic input handling. |
| superbench/benchmarks/micro_benchmarks/model_source_config.py | New dataclass encapsulating model source, identifier, dtype, token, and loader kwargs. |
| superbench/benchmarks/micro_benchmarks/huggingface_model_loader.py | New loader for HF Hub with tokenizer support, size/memory estimation utilities, and pre-checks. |
| superbench/benchmarks/micro_benchmarks/_export_torch_to_onnx.py | Adds HF model ONNX export with vision/NLP detection, dynamic axes, and optional external data output. |
| examples/benchmarks/tensorrt_inference_performance.py | Updates example script to show in-house vs HF usage via CLI. |
| examples/benchmarks/pytorch_huggingface_models.py | New example demonstrating HF-backed training benchmarks, incl. distributed option. |
| examples/benchmarks/ort_inference_performance.py | Updates ORT example script to show in-house vs HF usage via CLI. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
f689460 to
2a47dc8
Compare
2a47dc8 to
a61db26
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 13 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
a61db26 to
6bebb38
Compare
6bebb38 to
4eec2f7
Compare
4eec2f7 to
2f24e0f
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
2f24e0f to
18df07b
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
12a65ad to
7632427
Compare
dca9515 to
7094628
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…e benchmarks - Add HuggingFaceModelLoader for downloading and caching models from HF Hub - Support both NLP (AutoModelForCausalLM) and vision (AutoModelForImageClassification) models - Add model_source and model_identifier parameters to TensorRT/ORT benchmarks - Add ONNX export pipeline for HuggingFace models with dynamic axes - Derive vision input shapes from ONNX graph dims with HF config fallback - Filter ONNX initializers from graph.input for correct NLP input handling - Add PyTorch 2.8+ compatibility (external_data vs use_external_data_format) - Add example script, unit tests, and config schema updates - Support HF_TOKEN env var for gated model access
7094628 to
2ca3e68
Compare
| device = 'cuda' if torch.cuda.is_available() else 'cpu' | ||
|
|
||
| # Extract loading parameters | ||
| return self.load_model( |
There was a problem hiding this comment.
ModelSourceConfig.hf_token (and cache_dir) never get forwarded here, so passing a token via config has no effect unless it also exists in env vars.
That breaks private/gated model loading for callers that rely on explicit config instead of process environment. Consider wiring these fields through explicitly (for example by constructing HuggingFaceModelLoader(cache_dir=config.cache_dir, token=config.hf_token) or forwarding token/cache args into load_model).
| elif 'input_ids' in input_names: | ||
| # NLP model: use input_ids and attention_mask | ||
| seq_len = getattr(self._args, 'seq_length', 512) | ||
| input_ids = np.random.randint(0, 30000, (self._args.batch_size, seq_len)).astype(np.int64) |
There was a problem hiding this comment.
Using a fixed upper bound (30000) for synthetic token IDs can produce out-of-range indices for models with smaller vocabularies, causing runtime failures in embedding lookup.
Could we derive the max token ID from the model config/tokenizer vocab size (or use a conservative low bound) before generating input_ids?
| # Load config (use pre-downloaded config if provided) | ||
| if config is None: | ||
| logger.info('Loading model configuration...') | ||
| config = AutoConfig.from_pretrained(model_identifier, trust_remote_code=True, **load_kwargs) |
There was a problem hiding this comment.
trust_remote_code=True executes repository-provided Python during config/tokenizer/model load. Since model_identifier comes from CLI input, this effectively allows arbitrary code execution from untrusted model repos.
Recommend defaulting this to False and requiring an explicit opt-in flag for remote code execution (with clear warning in docs).
| import os | ||
| import re | ||
| from pathlib import Path | ||
|
|
||
| import torch |
| if 'pixel_values' in input_names: | ||
| # Vision model: use pixel_values (batch_size, 3, 224, 224) | ||
| pixel_values = np.random.randn(self._args.batch_size, 3, 224, 224).astype(dtype=precision) | ||
| inputs = {'pixel_values': pixel_values} |
| } | ||
|
|
||
| if normalized_dtype not in dtype_map: | ||
| raise ValueError(f"Invalid dtype '{dtype_str}'.Must be one of {list(dtype_map.keys())}") |
| def loader(self): | ||
| """Create a loader instance.""" | ||
| return HuggingFaceModelLoader(cache_dir='/tmp/hf_test_cache') |
|
|
||
| pytorch_test = unittest.skipIf(os.environ.get('SB_TEST_PYTORCH', '1') == '0', 'Skip PyTorch tests.') | ||
| directx_test = unittest.skipIf(os.environ.get('SB_TEST_DIRECTX', '0') == '0', 'Skip DirectX tests.') | ||
| hf_e2e_test = unittest.skipUnless(os.environ.get('SB_TEST_HF_E2E', '0') == '1', 'Skip HF E2E tests. Set SB_TEST_HF_E2E=1 to enable.') |
| from superbench.benchmarks.micro_benchmarks.model_source_config import ModelSourceConfig | ||
|
|
||
|
|
||
| @pytest.mark.skipif(os.environ.get('SB_TEST_HF_E2E', '0') != '1', reason='Skip HF E2E tests. Set SB_TEST_HF_E2E=1 to enable.') |
| # Extract loading parameters | ||
| return self.load_model( | ||
| model_identifier=config.identifier, | ||
| torch_dtype=config.torch_dtype, | ||
| device=device, | ||
| revision=config.revision, | ||
| device_map=config.device_map, | ||
| config=config_pretrained, | ||
| **config.additional_kwargs | ||
| ) | ||
|
|
| config = AutoConfig.from_pretrained(model_identifier, trust_remote_code=True, **load_kwargs) | ||
| else: | ||
| logger.info('Using pre-downloaded model configuration.') | ||
|
|
||
| # Load tokenizer (may fail for some models, that's ok) | ||
| tokenizer = None | ||
| try: | ||
| logger.info('Loading tokenizer...') | ||
| tokenizer = AutoTokenizer.from_pretrained(model_identifier, trust_remote_code=True, **load_kwargs) | ||
| except Exception as e: | ||
| logger.warning(f'Could not load tokenizer: {e}. Continuing without tokenizer.') | ||
|
|
||
| # Load model | ||
| logger.info(f'Loading model weights (dtype={torch_dtype}, device={device})...') | ||
| model_kwargs = load_kwargs.copy() | ||
| model_kwargs['trust_remote_code'] = True |
There was a problem hiding this comment.
[BLOCKER] Unconditional trust_remote_code=True enables RCE from a CLI argument
Every AutoConfig/AutoTokenizer/AutoModel.from_pretrained call in the loader (and the duplicate calls in _preprocess_huggingface_models in both ORT and TRT benchmarks) hardcodes trust_remote_code=True. The model identifier flows directly from the --model_identifier CLI flag — there is no allow-list, no opt-in flag, and no warning. HF then downloads and exec()s arbitrary configuration_*.py/modeling_*.py/tokenization_*.py from the supplied repo. Since SuperBench is commonly run as root in containers on GPU hosts with credentialed network access, the blast radius is full host compromise.
Default trust_remote_code=False. Add a separate --allow_remote_code opt-in flag, and log a prominent warning when it is set. Recommend pinning --revision <sha> in the example docs.
trust = bool(self._args.allow_remote_code)
if trust:
logger.warning('allow_remote_code=True: executing arbitrary Python from HF repo %r', model_identifier)
config = AutoConfig.from_pretrained(model_identifier, trust_remote_code=trust, **load_kwargs)| # Determine input format based on what the model expects | ||
| if 'pixel_values' in input_names: | ||
| # Vision model: use pixel_values (batch_size, 3, 224, 224) | ||
| pixel_values = np.random.randn(self._args.batch_size, 3, 224, 224).astype(dtype=precision) |
There was a problem hiding this comment.
[BLOCKER] ORT inference hardcodes (B, 3, 224, 224) for all vision HF models
The exporter (export_huggingface_model) correctly derives (C, H, W) from model.config.num_channels and image_size, so a 384×384 ViT or 1-channel medical model exports with the right shape — but __inference then always feeds (batch, 3, 224, 224) whenever pixel_values is an input. ORT will fail with a shape mismatch on any vision model that is not (3, 224, 224). The TRT path already reads dims from the ONNX graph; ORT must do the same.
meta = next(i for i in ort_sess.get_inputs() if i.name == 'pixel_values')
batch = self._args.batch_size
dims = [d if isinstance(d, int) else None for d in meta.shape]
_, c, h, w = (dims + [None]*4)[:4]
c, h, w = c or 3, h or 224, w or 224
pixel_values = np.random.randn(batch, c, h, w).astype(precision)
inputs = {'pixel_values': pixel_values}| '--explicitBatch', | ||
| f'--optShapes=input:{input_shape}', | ||
| '--workspace=8192', | ||
| '--memPoolSize=workspace:8192M', |
There was a problem hiding this comment.
[BLOCKER] --memPoolSize requires TensorRT 8.4+; in-house TRT path now silently breaks on older runtimes
The in-house branch had --explicitBatch --workspace=8192; this PR removes --explicitBatch and replaces --workspace= with --memPoolSize=workspace:8192M. --memPoolSize was introduced in TensorRT 8.4; --workspace= was deprecated in 8.4 but still accepted through TRT 10. The PR description does not declare a minimum TensorRT version bump, so any environment still on TRT 8.0–8.3 (common in CUDA 11.x base images) hits an unknown-argument error.
Either gate the flag on detected trtexec version (keep --workspace=8192 for < 8.4), or explicitly bump and document the minimum TensorRT version in README + Dockerfiles + changelog.
| return self.load_model( | ||
| model_identifier=config.identifier, | ||
| torch_dtype=config.torch_dtype, | ||
| device=device, | ||
| revision=config.revision, | ||
| device_map=config.device_map, | ||
| config=config_pretrained, | ||
| **config.additional_kwargs |
There was a problem hiding this comment.
[SHOULD-FIX] load_model_from_config silently drops config.hf_token and config.cache_dir
ModelSourceConfig exposes hf_token and cache_dir as part of its documented contract, but load_model_from_config extracts only identifier, torch_dtype, revision, device_map, and additional_kwargs. The loader then falls back to self.token/self.cache_dir populated from env vars in __init__. A caller who instantiates HuggingFaceModelLoader() (no args) and supplies a token only via ModelSourceConfig(hf_token=...) will get a silent 401 on private/gated models — exactly the use case the PR description advertises.
return self.load_model(
model_identifier=config.identifier,
torch_dtype=config.torch_dtype,
device=device,
revision=config.revision,
device_map=config.device_map,
config=config_pretrained,
token=config.hf_token or self.token,
cache_dir=config.cache_dir or self.cache_dir,
**config.additional_kwargs,
)| valid_dtypes = ['float32', 'float16', 'bfloat16', 'int8'] | ||
| if self.torch_dtype not in valid_dtypes: | ||
| raise ValueError(f"Invalid torch_dtype '{self.torch_dtype}'. Must be one of {valid_dtypes}.") |
There was a problem hiding this comment.
[SHOULD-FIX] ModelSourceConfig validates 'int8' but _get_torch_dtype rejects it
valid_dtypes = ['float32', 'float16', 'bfloat16', 'int8'] accepts 'int8' at construction time, but HuggingFaceModelLoader._get_torch_dtype('int8') raises ValueError. Two sources of truth disagree, producing a confusing late-stage failure. The ORT preprocess works around this by mapping INT8→FLOAT32 before constructing the config; any other caller will fail.
Drop 'int8' from valid_dtypes (INT8 is handled via post-export quantize_dynamic, not via torch_dtype) — or implement it consistently in the loader.
| elif 'input_ids' in input_names: | ||
| # NLP model: use input_ids and attention_mask | ||
| seq_len = getattr(self._args, 'seq_length', 512) | ||
| input_ids = np.random.randint(0, 30000, (self._args.batch_size, seq_len)).astype(np.int64) |
There was a problem hiding this comment.
[SHOULD-FIX] Random input_ids bound 30000 can exceed model vocabulary
np.random.randint(0, 30000, ...) produces out-of-range token IDs for any model with vocab_size < 30000 (many distilled / domain-specific HF models). Embedding Gather on out-of-range indices is undefined on CUDA — silent NaNs, garbage timings, or device-side assertions. The exporter already loaded the HF config; carry vocab_size through.
vocab_size = getattr(self._hf_config, 'vocab_size', 30000)
input_ids = np.random.randint(0, vocab_size, (self._args.batch_size, seq_len), dtype=np.int64)| f'--onnx={onnx_path}', | ||
| f'--optShapes={input_shapes}', | ||
| '--memPoolSize=workspace:8192M', | ||
| None if self._args.precision == 'fp32' else f'--{self._args.precision}', |
There was a problem hiding this comment.
[SHOULD-FIX] TRT --int8 is emitted without calibration data — INT8 numbers are not meaningful
The HF TRT args list emits f'--{self._args.precision}', so for INT8 trtexec receives --int8 but no --calib=<file> and no Q/DQ ONNX. trtexec falls back to fake dynamic ranges and reports latencies, but accuracy is undefined and the benchmark is misleading.
Either reject INT8 in the HF TRT path with a clear "requires calibration" error, generate a small synthetic calibration cache from dummy inputs and pass --calib=, or run ORT static quantization first (Q/DQ embedded) and feed the quantized ONNX to trtexec.
| # Require CUDAExecutionProvider — this benchmark targets GPU inference | ||
| available = ort.get_available_providers() | ||
| if 'CUDAExecutionProvider' not in available: | ||
| logger.error(f'CUDAExecutionProvider is not available (available: {available}).') | ||
| return False |
There was a problem hiding this comment.
[SHOULD-FIX] _benchmark hard-fails without CUDAExecutionProvider — behavioral regression for CPU/ROCm smoke tests
The previous behavior allowed ORT to fall back to whatever provider was registered. The new hard return False breaks CPU-only smoke tests, ROCm hosts using onnxruntime-rocm, and future provider extensions. Even if CUDA is the intended target, this should not be a silent fail-closed.
Either downgrade to a warning and let InferenceSession pick its own providers, or gate the strict check on an explicit --require_cuda flag so the default in-house path keeps its prior tolerance.
| import onnx as onnx_lib | ||
| onnx_model = onnx_lib.load(onnx_path) |
There was a problem hiding this comment.
[SHOULD-FIX] onnx.load(onnx_path) will OOM/fail for >2GB external-data models
The exporter switches to external-data format when model size > 2GB (LLaMA-7B and up). onnx.load() defaults to load_external_data=True and will try to materialize all sidecar tensor data into memory — solely to read graph input metadata. This crashes or OOMs the host node on exactly the models the external-data branch was written for.
onnx_model = onnx_lib.load(onnx_path, load_external_data=False)| from superbench.benchmarks.micro_benchmarks._export_torch_to_onnx import torch2onnxExporter | ||
| exporter = torch2onnxExporter() | ||
|
|
||
| model_name = self._args.model_identifier.replace('/', '_') |
There was a problem hiding this comment.
[SHOULD-FIX] Insufficient sanitization of --model_identifier before use in filesystem paths and from_pretrained
model_name = self._args.model_identifier.replace('/', '_') strips / but not .., \ (Windows separator), :, or control characters. The same model_identifier is also passed to AutoConfig.from_pretrained, which will load from a local directory if the string resolves as a path — letting a crafted argument load attacker-staged files from a mounted volume.
Validate at the CLI boundary with a strict allow-list and assert path containment:
import re
_SAFE_ID = re.compile(r'^[A-Za-z0-9][A-Za-z0-9._-]{0,127}(/[A-Za-z0-9._-]{1,128})?$')
if not _SAFE_ID.match(model_identifier):
raise ValueError(f'Invalid model_identifier: {model_identifier!r}')
out = (proc_output_path / (model_name + '.onnx')).resolve()
assert proc_output_path.resolve() in out.parents|
|
||
| pytorch_test = unittest.skipIf(os.environ.get('SB_TEST_PYTORCH', '1') == '0', 'Skip PyTorch tests.') | ||
| directx_test = unittest.skipIf(os.environ.get('SB_TEST_DIRECTX', '0') == '0', 'Skip DirectX tests.') | ||
| hf_e2e_test = unittest.skipUnless(os.environ.get('SB_TEST_HF_E2E', '0') == '1', 'Skip HF E2E tests. Set SB_TEST_HF_E2E=1 to enable.') |
There was a problem hiding this comment.
[SHOULD-FIX] New hf_e2e_test is dead code and exceeds the repo's 120-char flake8 limit
The new hf_e2e_test = unittest.skipUnless(...) line is 138 characters (repo flake8 max-line-length is 120), and the symbol is not referenced anywhere — tests/benchmarks/micro_benchmarks/test_huggingface_e2e.py uses its own inline @pytest.mark.skipif.
Either delete the helper, or wire TestHuggingFaceE2E to use it (one source of truth). Either way, wrap the line:
hf_e2e_test = unittest.skipUnless(
os.environ.get('SB_TEST_HF_E2E', '0') == '1',
'Skip HF E2E tests. Set SB_TEST_HF_E2E=1 to enable.',
)| Returns: | ||
| bool: True if preprocessing succeeds. | ||
| """ | ||
| import os |
There was a problem hiding this comment.
[SHOULD-FIX] Duplicate os import in TRT module
This PR adds a module-level import os, but _preprocess_huggingface_models also has a local import os. Lint (F401/F811) will flag at least one. Keep the module-level import; remove the inner one.
| def loader(self): | ||
| """Create a loader instance.""" | ||
| return HuggingFaceModelLoader(cache_dir='/tmp/hf_test_cache') |
There was a problem hiding this comment.
[SHOULD-FIX] E2E test fixture hardcodes /tmp/hf_test_cache
HuggingFaceModelLoader(cache_dir='/tmp/hf_test_cache') is non-portable (fails on Windows runners), predictable on shared CI (symlink/squatting risk), and collides across pytest-xdist workers. The loader-unit-test file already uses tmp_path — be consistent.
@pytest.fixture
def loader(self, tmp_path):
return HuggingFaceModelLoader(cache_dir=str(tmp_path / 'hf_cache'))
Adds support for loading and benchmarking models from HuggingFace Hub across Inference micro-benchmarks -ORT/TensorRT inference. Users can run any compatible HF-hosted model through the existing benchmark harness using --model_source huggingface --model_identifier <org/model>.
SuperBench previously only supported in-house model definitions with hardcoded architectures. Adding new models required code changes. This PR allows benchmarking any compatible HuggingFace model with a CLI flag change, including gated models via HF_TOKEN.
Key Changes
New modules:
HuggingFaceModelLoader — Downloads, caches, and loads models from HF Hub. Estimates parameter count from model config (few KB) and checks GPU
memory before downloading full weights to avoid failed multi-GB downloads.
ModelSourceConfig — Dataclass for model source configuration (in-house / huggingface), dtype, revision, auth token, and device mapping.
Micro-benchmarks (inference):
ORT inference — Downloads HF model → exports to ONNX → runs ORT inference. Handles both vision (pixel_values) and NLP (input_ids) inputs
automatically.
TensorRT inference — Same flow: download → ONNX export → trtexec engine build → inference. Includes dynamic input shape detection from the
exported ONNX graph.
ONNX exporter — New export_huggingface_model() method with vision/NLP auto-detection, dynamic axes, and external data support for large models
(>2GB).
Testing
Usage
Training benchmark
ORT inference
python examples/benchmarks/ort_inference_performance.py
--model_source huggingface --model_identifier bert-base-uncased
TensorRT inference
python examples/benchmarks/tensorrt_inference_performance.py
--model_source huggingface --model_identifier microsoft/resnet-50
Gated models
export HF_TOKEN=hf_xxxxx