Scenario-based LLM benchmark for Apple Silicon. Measures what you actually wait for, not just the generation tok/s counter. It runs real scenarios (agent workflows, document classification, creative writing) and compares backends, engines, and hardware on the same workloads. You can easily add your own single-shot or conversational scenarios to test your actual use case with different backends, models, etc.
Your LLM UI says "57 tok/s". But every response starts with a prefill phase where the model processes your entire conversation history before the first token appears. As context grows, prefill dominates. A model reporting 57 tok/s can deliver as low as 3 tok/s in practice.
This benchmark measures effective throughput: output tokens divided by total wall-clock time. The speed you experience, not the speed on screen.
effective tok/s = output_tokens / (prefill_time + generation_time)
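To make the formula concrete, here is a minimal Python sketch. The function name and the sample numbers (200 output tokens, a 60-second prefill) are illustrative assumptions, not measurements from this benchmark or code taken from bench.py.

```python
def effective_tok_s(output_tokens, prefill_time_s, generation_time_s):
    """Output tokens divided by total wall-clock time (prefill + generation)."""
    total = prefill_time_s + generation_time_s
    if total <= 0:
        raise ValueError("total time must be positive")
    return output_tokens / total


# Illustrative numbers only: 200 output tokens generated at 57 tok/s,
# first with no prefill at all, then after a hypothetical 60-second prefill.
output_tokens = 200
generation_time_s = output_tokens / 57.0
print(round(effective_tok_s(output_tokens, 0.0, generation_time_s), 1))
print(round(effective_tok_s(output_tokens, 60.0, generation_time_s), 1))
```

With these assumed inputs, the same 57 tok/s generation rate works out to roughly 3 tok/s effective once the prefill is counted.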
Part 1: MLX vs llama.cpp on Apple Silicon | Part 2: Five runtimes benchmarked
Python 3.8+, no dependencies. Just a running inference backend with a model loaded.
# Ollama (default)
python3 bench.py --model llama3.1:8b --model-label llama-3.1-8b-instruct
# LM Studio
python3 bench.py --backend lmstudio --model mlx-community/qwen3.5-35b-a3b \
--model-label qwen3.5-35b-a3b --no-think
# oMLX or any OpenAI-compatible endpoint
export OPENAI_API_KEY=omlx
python3 bench.py --backend openai --backend-label omlx \
--base-url http://localhost:8000 --model "Qwen3.5-35B-A3B-4bit" \
--model-label qwen3.5-35b-a3b --no-think
# Compare results
python3 compare.py results/<model>/<scenario>/*.json

Results auto-save to results/<model>/<scenario>/<chip>_<backend>.json.
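The result layout above can be sketched as a small path helper. This is a hypothetical illustration, not bench.py's code: the function name is invented, and the exact string the benchmark uses for the chip part is not specified here, so it stays a placeholder.

```python
from pathlib import Path


def result_path(model_label, scenario, chip, backend_label, root="results"):
    """Build results/<model>/<scenario>/<chip>_<backend>.json (hypothetical helper)."""
    return Path(root) / model_label / scenario / f"{chip}_{backend_label}.json"


# "<chip>" is left as a placeholder because the real chip naming is an unknown.
print(result_path("llama-3.1-8b-instruct", "ops-agent", "<chip>", "ollama").as_posix())
```

Because the model label and backend label both end up in the path, passing --model-label and --backend-label consistently keeps runs of the same model in one comparable directory.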
Before you run: Set your context window to at least 16K tokens. Add --no-think for Qwen3.5 models. See the Setup Guide.
Each cell shows effective tok/s, with generation tok/s in parentheses. Higher is better.
| Hardware | Backend | Format | ops-agent | doc-summary | prefill-test | creative-writing |
|---|---|---|---|---|---|---|
| M1 Max (64GB, 24 GPU) | oMLX | MLX 4-bit fp16 | 47.3 (65.2) | 33.1 (65.9) | 12.4 (62.5) | 63.4 (70.2) |
| M1 Max (64GB, 24 GPU) | oMLX | MLX 4-bit | 37.5 (53.3) | 29.4 (55.5) | 27.8 (52.0) | 53.7 (56.2) |
| M1 Max (64GB, 24 GPU) | Rapid-MLX | MLX 4-bit | 35.6 (59.9) | 28.7 (60.7) | 8.5 (57.3) | 56.5 (62.2) |
| M1 Max (64GB, 24 GPU) | mlx-openai-server | MLX 4-bit | 26.2 (59.3) | 26.2 (59.8) | 8.7 (57.5) | 57.8 (62.7) |
| M1 Max (64GB, 24 GPU) | LM Studio | MLX | 17.0 (56.6) | 13.4 (56.8) | 5.9 (54.4) | 38.3 (58.9) |
| M1 Max (64GB, 24 GPU) | LM Studio | GGUF | 17.6 (28.2) | 19.4 (29.3) | 7.8 (28.4) | 27.7 (28.6) |
| M2 Pro (32GB, 19 GPU) | LM Studio | MLX | 17.6 (58.4) | 14.3 (60.4) | 5.6 (57.9) | 42.9 (62.5) |
| M3 Max (128GB, 40 GPU) | oMLX | MLX 4-bit | 71.3 (90.8) | 61.4 (93.8) | 22.6 (87.9) | 90.1 (94.3) |
| M3 Max (128GB, 40 GPU) | LM Studio | MLX | 37.1 (83.5) | 22.5 (87.3) | 14.8 (85.8) | 59.0 (92.2) |
Visual comparison (ops-agent effective tok/s)
M3 Max 40GPU │ oMLX 4-bit ████████████████████████████████████ 71.3
M1 Max 24GPU │ oMLX fp16 ████████████████████████ 47.3
M1 Max 24GPU │ oMLX 4-bit ███████████████████ 37.5
M3 Max 40GPU │ LM Studio MLX ███████████████████ 37.1
M1 Max 24GPU │ Rapid-MLX ██████████████████ 35.6
M1 Max 24GPU │ mlx-openai █████████████ 26.2
M1 Max 24GPU │ LM Studio GGUF █████████ 17.6
M2 Pro 19GPU │ LM Studio MLX █████████ 17.6
M1 Max 24GPU │ LM Studio MLX █████████ 17.0
oMLX wins every scenario on the same hardware, thanks to its tiered KV cache. On M3 Max, effective throughput reaches 71 tok/s in ops-agent, nearly 2x the M1 Max 4-bit result (37.5 tok/s). Generation speed is similar across MLX engines (~55-93 tok/s depending on hardware), but prefill speed varies dramatically: at 8K context, LM Studio MLX takes 49s to prefill while oMLX takes 1.7s. The oMLX figure relies on its persistent SSD cache from a prior run; cold prefill takes longer.
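The gap between the two numbers in each table cell shows how much wall-clock time goes to something other than generation. A minimal sketch, assuming the reported generation tok/s is output tokens divided by generation time alone (the function name is illustrative):

```python
def non_generation_share(effective_tok_s, generation_tok_s):
    """Fraction of wall-clock time not spent generating tokens.

    With effective = N / (T_prefill + T_gen) and generation = N / T_gen,
    T_gen / (T_prefill + T_gen) equals effective / generation.
    """
    if generation_tok_s <= 0 or effective_tok_s > generation_tok_s:
        raise ValueError("expected 0 < effective <= generation")
    return 1.0 - effective_tok_s / generation_tok_s


# ops-agent cells copied from the table above: (effective, generation)
rows = {
    "M1 Max, LM Studio MLX": (17.0, 56.6),
    "M3 Max, oMLX 4-bit": (71.3, 90.8),
}
for label, (eff, gen) in rows.items():
    share = non_generation_share(eff, gen)
    print(f"{label}: {share:.0%} of wall time outside generation")
```

Under that assumption, the LM Studio MLX row spends about 70% of its time before or between generated tokens, while the oMLX row on M3 Max spends about 21%.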
Run this benchmark
Prerequisites: Disable thinking mode for Qwen3.5, which otherwise burns tokens on invisible <think> blocks.
- Ollama / LM Studio (GGUF): --no-think handles it automatically
- LM Studio (MLX) / oMLX: Run python3 scripts/qwen3.5-35b-a3b-toggle-thinking.py off, then reload the model
# oMLX (4-bit) — fastest effective throughput
export OPENAI_API_KEY=omlx
python3 bench.py --backend openai --backend-label omlx \
--base-url http://localhost:8000 --model Qwen3.5-35B-A3B-4bit \
--model-label qwen3.5-35b-a3b --no-think
# LM Studio (MLX)
python3 bench.py --backend lmstudio --backend-label lmstudio-mlx \
--model mlx-community/qwen3.5-35b-a3b \
--model-label qwen3.5-35b-a3b --no-think
# LM Studio (GGUF)
python3 bench.py --backend lmstudio --backend-label lmstudio-gguf \
--model qwen/qwen3.5-35b-a3b \
--model-label qwen3.5-35b-a3b --no-think
# Ollama
python3 bench.py --model qwen3.5:35b-a3b \
--model-label qwen3.5-35b-a3b --no-think

| Hardware | Backend | Format | ops-agent | doc-summary | prefill-test | creative-writing |
|---|---|---|---|---|---|---|
| M4 Max (128GB, 40 GPU) | LM Studio | MLX | 70.5 (85.5) | 56.6 (91.7) | 35.7 (85.5) | 86.6 (91.6) |
| M1 Max (64GB, 24 GPU) | oMLX | UD-MLX 4-bit | 26.7 (49.3) | 25.7 (50.7) | 18.7 (47.5) | 49.1 (51.0) |
Visual comparison (ops-agent effective tok/s)
M4 Max 40GPU │ LM Studio MLX ████████████████████████████████████ 70.5
M1 Max 24GPU │ oMLX UD-4bit ██████████████ 26.7
Qwen3.6 adds vision capabilities (document classification, OCR extraction) while keeping the same MoE architecture. On M4 Max with LM Studio, performance is on par with Qwen3.5 on the same hardware. The Unsloth Dynamic (UD) quantization keeps 312 layers at 8-bit for better accuracy, at the cost of ~15% slower generation vs Qwen3.5. Vision benchmarks (vision-classify, vision-extract) are included in the results directory.
bf16 and M1/M2: The UD model ships with bf16 weights, which are software-emulated on M1/M2. Unlike standard MLX quantizations, the UD quant cannot be converted to fp16 — the conversion produces garbage output on vision tasks and likely degrades text quality. M3+ chips with native bf16 support will see significantly better prefill performance.
Run this benchmark
Prerequisites: Disable thinking mode — Qwen3.6 supports enable_thinking natively via oMLX model settings (no template patching needed).
# oMLX — disable thinking via admin API, then run
curl -X POST http://localhost:8888/admin/api/login \
-H "Content-Type: application/json" -d '{"api_key": "YOUR_KEY"}' -c /tmp/omlx.txt
curl -X PUT http://localhost:8888/admin/api/models/Qwen3.6-35B-A3B-UD-MLX-4bit/settings \
-H "Content-Type: application/json" -b /tmp/omlx.txt \
-d '{"chat_template_kwargs": {"enable_thinking": false}}'
export OPENAI_API_KEY=YOUR_KEY
python3 bench.py --backend openai --backend-label omlx \
--base-url http://localhost:8888 --model Qwen3.6-35B-A3B-UD-MLX-4bit \
--model-label qwen3.6-35b-a3b-ud-mlx-4bit
# Vision scenarios
python3 bench.py --backend openai --backend-label omlx \
--base-url http://localhost:8888 --model Qwen3.6-35B-A3B-UD-MLX-4bit \
--model-label qwen3.6-35b-a3b-ud-mlx-4bit --vision-only

| Hardware | Backend | Format | ops-agent | doc-summary | prefill-test | creative-writing |
|---|---|---|---|---|---|---|
| M1 Max (64GB, 24 GPU) | LM Studio | MLX | 40.7 (55.0) | 21.9 (59.6) | 8.4 (51.8) | 58.9 (62.1) |
| M1 Max (64GB, 24 GPU) | LM Studio | GGUF | 30.6 (36.4) | 18.5 (40.7) | 7.1 (33.4) | 38.1 (39.1) |
| M1 Max (64GB, 24 GPU) | Ollama | GGUF | 27.1 (33.4) | 18.9 (37.8) | 5.8 (30.7) | 38.6 (39.6) |
| M3 Max (128GB, 40 GPU) | LM Studio | MLX | 57.6 (70.8) | 38.2 (76.0) | 14.4 (65.6) | 75.2 (78.5) |
| M3 Max (128GB, 40 GPU) | oMLX | MLX | 53.3 (69.4) | 35.1 (71.1) | 14.5 (63.2) | 73.6 (76.9) |
Visual comparison (ops-agent effective tok/s)
M3 Max 40GPU │ LM Studio MLX █████████████████████████████ 57.6
M3 Max 40GPU │ oMLX MLX ███████████████████████████ 53.3
M1 Max 24GPU │ LM Studio MLX ████████████████████ 40.7
M1 Max 24GPU │ LM Studio GGUF ████████████████ 30.6
M1 Max 24GPU │ Ollama GGUF ██████████████ 27.1
MLX wins across the board. At 8B the model fits comfortably in memory, prefill stays fast, and the ~1.5x generation speed advantage dominates. On M3 Max, LM Studio MLX edges out oMLX in three of four scenarios (prefill-test is effectively a tie) thanks to lower TTFT overhead at this model size.
Run this benchmark
# LM Studio (MLX)
python3 bench.py --backend lmstudio --backend-label lmstudio-mlx \
--model mlx-community/meta-llama-3.1-8B-instruct \
--model-label llama-3.1-8b-instruct
# LM Studio (GGUF)
python3 bench.py --backend lmstudio --backend-label lmstudio-gguf \
--model lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF \
--model-label llama-3.1-8b-instruct
# oMLX (MLX)
export OPENAI_API_KEY=omlx
python3 bench.py --backend openai --backend-label omlx \
--base-url http://localhost:8000 --model Llama-3.1-8B-Instruct \
--model-label llama-3.1-8b-instruct
# Ollama (GGUF)
python3 bench.py --model llama3.1:8b \
--model-label llama-3.1-8b-instruct

| Hardware | Backend | Format | ops-agent | doc-summary | prefill-test | creative-writing |
|---|---|---|---|---|---|---|
| M2 Pro (32GB, 19 GPU) | oMLX | MLX 4-bit | 25.4 (36.7) | 15.3 (39.1) | 5.2 (34.3) | 40.9 (42.8) |
| M2 Pro (32GB, 19 GPU) | LM Studio | MLX | 24.3 (38.5) | 16.4 (41.3) | 5.1 (35.4) | 41.4 (43.8) |
Visual comparison (ops-agent effective tok/s)
M2 Pro 19GPU │ oMLX 4-bit █████████████ 25.4
M2 Pro 19GPU │ LM Studio MLX ████████████ 24.3
Run this benchmark
# LM Studio (MLX)
python3 bench.py --backend lmstudio --backend-label lmstudio-mlx \
--model mlx-community/glm-4.7-flash \
--model-label glm-4.7-flash
# oMLX (4-bit)
export OPENAI_API_KEY=omlx
python3 bench.py --backend openai --backend-label omlx \
--base-url http://localhost:8000 --model GLM-4.7-Flash-4bit \
--model-label glm-4.7-flash

Run the benchmark on your hardware and open a PR. Five minutes, no dependencies.
| Hardware | Backend | Format | ops-agent | doc-summary | prefill-test | creative-writing |
|---|---|---|---|---|---|---|
| M1 Max (64GB, 24 GPU) | Ollama | GGUF | 27.1 (33.4) | 18.9 (37.8) | 5.8 (30.7) | 38.6 (39.6) |
| M2 Pro (32GB, 19 GPU) | LM Studio | MLX | 17.6 (58.4) | 14.3 (60.4) | 5.6 (57.9) | 42.9 (62.5) |
| M3 Max (128GB, 40 GPU) | oMLX | MLX 4-bit | 71.3 (90.8) | 61.4 (93.8) | 22.6 (87.9) | 90.1 (94.3) |
| M4 / Pro / Max | | | | | | |
Four real-world scenarios. All run by default, or pick one with --scenario.
| Scenario | Mode | What it tests |
|---|---|---|
| ops-agent | 8-turn conversation | Agent with tool calls. Context grows every turn. |
| doc-summary | 5 single-shot turns | Document classification. Long input, short output. |
| prefill-test | 4 single-shot turns | Prefill scaling: 655 to 8.5K context, same short reply. |
| creative-writing | 3 single-shot turns | Short prompt, long output (up to 2K tokens). |
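The ops-agent row notes that context grows every turn. The sketch below shows why that matters for prefill, under simplified assumptions: each turn adds a fixed number of new tokens, and a prefix cache is either absent or ideal. The function name and the 1,000-token turn size are hypothetical, and the model does not describe how any particular backend's cache behaves.

```python
from itertools import accumulate


def prefill_tokens(new_tokens_per_turn, prefix_cached=False):
    """Tokens to prefill on each turn of a growing conversation.

    Without a cache, every turn reprocesses the full history so far.
    With an ideal prefix cache, only the newly added tokens are processed.
    """
    if prefix_cached:
        return list(new_tokens_per_turn)
    return list(accumulate(new_tokens_per_turn))


turns = [1000] * 8  # assumed: 8 turns, 1,000 new tokens each
print(sum(prefill_tokens(turns)))
print(sum(prefill_tokens(turns, prefix_cached=True)))
```

In this toy setup the uncached conversation prefills 36,000 tokens in total against 8,000 with an ideal cache, which is the kind of gap a multi-turn scenario exposes and a single-shot test does not.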
You can create your own scenarios as JSON files.
| Backend | Flag | Default URL |
|---|---|---|
| Ollama | --backend ollama | localhost:11434 |
| LM Studio | --backend lmstudio | localhost:1234 |
| oMLX | --backend openai --backend-label omlx | localhost:8000 |
| llama-server | --backend llama-server | localhost:8090 |
| MiniMax | --backend minimax | api.minimax.io |
| Any OpenAI-compatible | --backend openai | localhost:8080 |
Override with --base-url. Use --backend-label to customize the name in result paths.
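The default-plus-override behaviour can be sketched as a lookup table. This is an illustration rather than bench.py's actual implementation: the names are hypothetical, and the values are copied from the table as written, without a URL scheme. The oMLX row is not a separate key because it uses --backend openai with an explicit --base-url.

```python
# Defaults copied from the backend table; keys are the --backend values.
DEFAULT_URLS = {
    "ollama": "localhost:11434",
    "lmstudio": "localhost:1234",
    "llama-server": "localhost:8090",
    "minimax": "api.minimax.io",
    "openai": "localhost:8080",
}


def resolve_base_url(backend, base_url=None):
    """Return the explicit --base-url if given, else the backend's default."""
    if base_url:
        return base_url
    try:
        return DEFAULT_URLS[backend]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None


print(resolve_base_url("ollama"))
print(resolve_base_url("openai", "http://localhost:8000"))
```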
One data point is an anecdote. A table full of hardware is useful.
# 1. Fork and clone
git clone https://github.com/<you>/local-llm-bench && cd local-llm-bench
# 2. Run — pick a command from the "Run this benchmark" sections above, or:
python3 bench.py --model llama3.1:8b --model-label llama-3.1-8b-instruct
# 3. Commit and PR
git checkout -b results/my-hardware
git add results/
git commit -m "results: my hardware"
git push -u origin HEAD
gh pr create --title "results: my hardware" --body "Benchmark results"

Use --model-label to ensure results land in the correct directory (see copy-paste commands above). Filenames encode your hardware specs, so there are no merge conflicts between contributors.
The benchmark auto-detects Ollama flags and includes them in result filenames.
Ollama: set the environment variables OLLAMA_FLASH_ATTENTION=1 (faster prefill) and OLLAMA_KV_CACHE_TYPE=q4_0 (roughly 4x smaller KV cache than the f16 default), and raise the macOS sysctl iogpu.wired_limit_mb=8192 (more GPU memory). Restart Ollama after changes.
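To see what the KV cache setting buys, here is a rough size estimate. It is a sketch under stated assumptions: Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dimension 128), a 16K context, and llama.cpp's q4_0 block layout of 18 bytes per 32 values. Real memory use also depends on backend overhead not modeled here.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_element):
    """Approximate KV cache size: keys and values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element


F16_BYTES = 2.0
Q4_0_BYTES = 18 / 32  # 16 bytes of 4-bit values plus a 2-byte scale per 32 values

# Assumed model shape: Llama 3.1 8B at a 16K context window.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=16 * 1024)
f16 = kv_cache_bytes(bytes_per_element=F16_BYTES, **shape)
q4 = kv_cache_bytes(bytes_per_element=Q4_0_BYTES, **shape)

print(f"f16:  {f16 / 1024**3:.2f} GiB")
print(f"q4_0: {q4 / 1024**3:.2f} GiB")
print(f"ratio: {f16 / q4:.2f}x")
```

Under these assumptions the cache shrinks from about 2 GiB to a little over half a GiB. The ratio comes out nearer 3.6x than a flat 4x because each q4_0 block carries its own scale.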
LM Studio MLX: Default prefill chunk size is 512. Raising to 4096 nearly doubles prefill speed.
All backends: See the Setup Guide for context window configuration, Qwen3.5 thinking mode, and step-by-step verification.
MIT