Installation Guide

Complete installation instructions for madengine.

Prerequisites

  • Python 3.8+ with pip
  • Docker with GPU support (ROCm for AMD, CUDA for NVIDIA)
  • Git for repository management
  • MAD package - Required for model discovery and execution
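
The prerequisites above can be checked before installing. The following is a minimal sketch, not part of madengine: `have_cmd` and `version_ge` are hypothetical helper names, and the script only reports what it finds on `PATH` (it does not test Docker GPU support).

```shell
# Hypothetical helper: succeed if the named command is available on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: succeed if "major.minor" version $1 is at least $2.
version_ge() {
  have_major=${1%%.*}; have_minor=${1#*.}; have_minor=${have_minor%%.*}
  want_major=${2%%.*}; want_minor=${2#*.}; want_minor=${want_minor%%.*}
  [ "$have_major" -gt "$want_major" ] ||
    { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}

# Report each required tool as present or missing.
for tool in python3 pip docker git; do
  if have_cmd "$tool"; then
    echo "ok:      $tool"
  else
    echo "missing: $tool"
  fi
done

# Compare the interpreter version against the 3.8 minimum.
if have_cmd python3; then
  py_version=$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')
  if version_ge "$py_version" "3.8"; then
    echo "ok:      Python $py_version"
  else
    echo "too old: Python $py_version (need 3.8+)"
  fi
fi
```

The version comparison is numeric per component, so 3.10 is correctly treated as newer than 3.8.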

Quick Install

From GitHub

# Install madengine (all dependencies, including Kubernetes support, are included)
pip install git+https://github.com/ROCm/madengine.git

Development Installation

# Clone repository
git clone https://github.com/ROCm/madengine.git
cd madengine

# Create virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in editable mode (dev dependencies are included)
pip install -e .

# Setup pre-commit hooks (optional, for contributors)
pre-commit install

Dependencies

All dependencies — including Kubernetes deployment support and development tools (pytest, black, mypy, etc.) — are installed by default. There are no optional extras to select.

Note: SLURM deployment requires no additional Python dependencies because it uses CLI commands.

MAD Package Setup

madengine requires the MAD package for model definitions and execution scripts.

# Clone MAD package
git clone https://github.com/ROCm/MAD.git
cd MAD

# Install madengine from within the MAD directory
pip install git+https://github.com/ROCm/madengine.git

# Verify installation
madengine --version
madengine discover  # Test model discovery

Docker GPU Setup

AMD ROCm

# Test ROCm GPU access
docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video \
  rocm/pytorch:latest rocm-smi

# Verify with madengine
madengine run --tags dummy \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

Non-default ROCm location (host): If ROCm is not installed under /opt/rocm (e.g. a TheRock or pip install), set ROCM_PATH on the host, or set the top-level MAD_ROCM_PATH key in --additional-context, so that host GPU checks (amd-smi, etc.) resolve correctly. The in-container ROCM_PATH for Docker workloads is set separately at run time (from the image's OCI env, an in-image probe, or docker_env_vars.MAD_ROCM_PATH); it is not copied from the host. See Configuration: ROCm path.
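
As a sketch of the two host-side options just described: `/path/to/rocm` is a placeholder, and the variable names `ROCM_DIR` and `CONTEXT` are only illustrative. The JSON places MAD_ROCM_PATH at the top level next to the keys used in the verification command above.

```shell
# Assumption: /path/to/rocm stands in for the actual ROCm location on the host.
ROCM_DIR="/path/to/rocm"

# Option 1: export ROCM_PATH on the host before invoking madengine.
export ROCM_PATH="$ROCM_DIR"

# Option 2: pass MAD_ROCM_PATH as a top-level key in --additional-context.
CONTEXT=$(printf '{"gpu_vendor": "AMD", "guest_os": "UBUNTU", "MAD_ROCM_PATH": "%s"}' "$ROCM_DIR")
echo "$CONTEXT"

# Run the dummy model only if madengine is installed.
if command -v madengine >/dev/null 2>&1; then
  madengine run --tags dummy --additional-context "$CONTEXT"
fi
```

Either option should be enough on its own; both are shown here only for comparison.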

NVIDIA CUDA

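# Test CUDA GPU access
# (nvidia/cuda has no "latest" tag; the tag below is an example, use one that matches your driver)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi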

# Verify with madengine
madengine run --tags dummy \
  --additional-context '{"gpu_vendor": "NVIDIA", "guest_os": "UBUNTU"}'

Verify Installation

# Check installation
madengine --version

# Test basic functionality (requires MAD package)
cd /path/to/MAD
madengine discover --tags dummy
madengine run --tags dummy \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'

Troubleshooting

Import Errors

If you get import errors, ensure your virtual environment is activated and madengine is installed:

pip list | grep madengine
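
A slightly more detailed check, as a sketch: it reports whether `python3` is running inside a virtual environment and whether the package is visible to that interpreter. `venv_status` is a hypothetical helper name, and the distribution name `madengine` is assumed from the `pip list` check above.

```shell
# Print whether python3 is running inside a virtual environment.
# Inside a venv, sys.prefix differs from sys.base_prefix.
venv_status() {
  if ! command -v python3 >/dev/null 2>&1; then
    echo "python3 not found"
    return 0
  fi
  python3 -c 'import sys; print("venv active" if sys.prefix != sys.base_prefix else "no venv active")'
}

venv_status

# Show the installed package for this interpreter, if any.
# Assumption: the distribution is named "madengine".
if command -v python3 >/dev/null 2>&1; then
  python3 -m pip show madengine 2>/dev/null ||
    echo "madengine is not installed for $(command -v python3)"
fi
```

Using `python3 -m pip` rather than a bare `pip` ensures the query goes to the same interpreter that reported the venv status.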

Docker Permission Issues

If you encounter Docker permission errors:

# Add user to docker group (Linux)
sudo usermod -aG docker $USER
newgrp docker

ROCm GPU Not Detected

# Check ROCm installation
rocm-smi

# Verify devices are accessible
ls -la /dev/kfd /dev/dri
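
The `ls` output shows ownership and mode bits but not whether the current user can actually open the devices. A minimal sketch of that check follows; `check_device` is a hypothetical helper name, not a madengine or ROCm command.

```shell
# Report whether a device node exists and is readable and writable by the current user.
check_device() {
  if [ ! -e "$1" ]; then
    echo "$1: missing"
  elif [ -r "$1" ] && [ -w "$1" ]; then
    echo "$1: accessible"
  else
    echo "$1: present, but not readable/writable by $(id -un)"
  fi
}

check_device /dev/kfd

# /dev/dri is a directory; check only the character devices inside it.
for node in /dev/dri/*; do
  if [ -c "$node" ]; then
    check_device "$node"
  fi
done
```

If a node is present but not accessible, group membership is a likely cause; the Docker test command in this guide passes `--group-add video` for the same reason.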

If ROCm is installed in a non-default path on the host (e.g. TheRock or pip), run export ROCM_PATH=/path/to/rocm or pass MAD_ROCM_PATH in --additional-context. Both affect host validation only; see ROCm path (run only) for in-container behavior.

MAD Package Not Found

Ensure you're running madengine commands from within a MAD package directory:

cd /path/to/MAD
export MODEL_DIR=$(pwd)
madengine discover

Next Steps