
Investigate StableHLO → IREE compilation maturity #261

@michalharakal


SKaiNET Backend Architecture Strategy

Overview

SKaiNET uses a hybrid backend approach combining direct execution for development with MLIR/XLA compilation for production deployment.

Architecture Layers

Layer 1: Development Backend (CPU)

  • Purpose: Fast iteration, testing, and debugging
  • Implementation: skainet-backend-cpu with direct Kotlin implementations
  • Use Cases: Unit tests, local development, CI/CD validation
  • Platforms: All Kotlin Multiplatform targets (JVM, Native, JS, WASM)

Layer 2: Production Compilation (MLIR-based)

  • Purpose: High-performance production deployment
  • Implementation: StableHLO MLIR → Multiple compiler backends
  • Compiler Options:
    • XLA: Google's mature compiler for datacenter deployment
    • IREE: Modern MLIR compiler optimized for edge and mobile deployment
  • Use Cases: Production inference, training, edge deployment
  • Platforms: Any MLIR-supported hardware (CPU, GPU, TPU, mobile accelerators)

Why This Hybrid Approach?

Direct CPU Backend Benefits

  1. Fast Development Cycle: No compilation step needed for testing
  2. Multiplatform Support: Runs on all Kotlin targets including JS/WASM
  3. Debugging: Standard Kotlin debugging tools work directly
  4. Reference Implementation: Validates correctness of MLIR compilation
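Using the CPU backend as a reference implementation comes down to comparing its output with a compiled executable's output under a numeric tolerance. The following is a minimal sketch of such a check; the function name `allClose` and the default tolerances are assumptions for illustration, not part of SKaiNET.

```kotlin
import kotlin.math.abs

/**
 * Returns true when every element of [candidate] matches [reference] within
 * the combined tolerance atol + rtol * |reference[i]|.
 * Hypothetical helper; the tolerances are illustrative defaults.
 */
fun allClose(
    reference: FloatArray,
    candidate: FloatArray,
    rtol: Float = 1e-4f,
    atol: Float = 1e-6f,
): Boolean {
    if (reference.size != candidate.size) return false
    for (i in reference.indices) {
        val expected = reference[i]
        val actual = candidate[i]
        // A NaN on either side is always treated as a mismatch.
        if (expected.isNaN() || actual.isNaN()) return false
        // Exact equality also covers matching infinities.
        if (expected == actual) continue
        // A remaining infinity cannot be within any finite tolerance.
        if (expected.isInfinite() || actual.isInfinite()) return false
        if (abs(expected - actual) > atol + rtol * abs(expected)) return false
    }
    return true
}
```

A test could run the same input through the CPU backend and through a compiled executable, then assert `allClose(cpuOutput, compiledOutput)`. Appropriate tolerances depend on the operators involved and on the precision the compiler uses.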

MLIR Compilation Benefits

  1. Hardware Portability: One intermediate representation (StableHLO) serves as the compilation target for every supported hardware backend
  2. Automatic Optimization: Compiler passes such as operator fusion are applied without hand-written kernels
  3. Ecosystem Integration: Compatible with JAX, TensorFlow, PyTorch (via ONNX)
  4. Future-Proof: New hardware becomes available through compiler updates rather than new SKaiNET backends

XLA vs IREE: Choosing the Right Compiler

SKaiNET plans to support both XLA and IREE as MLIR compilation targets (per the roadmap, XLA integration is in progress and IREE integration has not started). Each is optimized for different deployment scenarios:

XLA (Accelerated Linear Algebra)

Best for: Datacenter deployment, integration with Google ecosystem

Strengths:

  • Mature and battle-tested in production (TensorFlow, JAX)
  • Excellent optimization for large-scale training workloads
  • Strong TPU support and Google Cloud integration
  • Comprehensive operator coverage
  • Stable API and extensive documentation

Limitations:

  • Heavier runtime dependencies
  • Less optimized for mobile/edge deployment
  • Primarily Google-controlled development
  • Limited customization for specialized hardware

IREE (Intermediate Representation Execution Environment)

Best for: Edge deployment, mobile applications, standalone executables

Strengths:

  • Lightweight Runtime: Minimal dependencies, suitable for embedded systems
  • Standalone Executables: No heavy runtime required for deployment
  • Mobile-First Design: Optimized for power and memory constraints
  • Flexible Backends: CPU, GPU (Vulkan/Metal/CUDA), WebGPU, bare-metal
  • Modern Architecture: Built from the ground up on MLIR, with lessons from XLA
  • Open Governance: Developed in the open (formerly under OpenXLA, now an LF AI & Data Foundation project) with broad community input

Advantages for SKaiNET:

  • Multiplatform Alignment: Better suited for Kotlin Multiplatform's diverse targets
  • WebGPU Support: Enables high-performance browser execution for JS/WASM
  • Bare-Metal Support: Can target embedded systems without OS
  • Smaller Footprint: Critical for mobile and edge applications

Compilation Flow

flowchart TD
    A[SKaiNET Kotlin DSL]
    B["Compute Graph (DAG)"]

    C[CPU Backend<br/>Direct Execution]
    D[StableHLO Converter<br/>MLIR Generation]

    E[Compiler Choice]
    F[XLA Compiler]
    G[IREE Compiler]

    H["Hardware Executables<br/>CPU | GPU | TPU"]
    I["Standalone Executables<br/>CPU | GPU | WebGPU"]

    A --> B
    B -->|Development Path| C
    B -->|Production Path| D
    D --> E
    E --> F
    E --> G
    F --> H
    G --> I

    %% Styles
    classDef input fill:#1f77b4,color:#fff,stroke:#0d3c61,stroke-width:2px;
    classDef graphNode fill:#9467bd,color:#fff,stroke:#4b2e83,stroke-width:2px;
    classDef dev fill:#2ca02c,color:#fff,stroke:#1b6e1b,stroke-width:2px;
    classDef prod fill:#ff7f0e,color:#fff,stroke:#b35400,stroke-width:2px;
    classDef compiler fill:#d62728,color:#fff,stroke:#7f1d1d,stroke-width:2px;
    classDef output fill:#7f7f7f,color:#fff,stroke:#3f3f3f,stroke-width:2px;

    %% Class assignments
    class A input
    class B graphNode
    class C dev
    class D,E prod
    class F,G compiler
    class H,I output
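To make the "StableHLO Converter" step of the flow concrete, here is a minimal sketch that walks a tiny expression DAG and emits StableHLO text. The `Node`, `Input`, and `Binary` types and the `emitStableHlo` function are hypothetical stand-ins, not SKaiNET's actual ComputeGraph API, and the sketch assumes every tensor shares one static type.

```kotlin
/** A value in the graph: either a function argument or the result of an op. */
sealed interface Node
data class Input(val index: Int) : Node
data class Binary(val op: String, val lhs: Node, val rhs: Node) : Node

/**
 * Emits a single-function StableHLO module for [root]. Every tensor uses
 * [tensorType] (for example "tensor<4xf32>") to keep the sketch short.
 */
fun emitStableHlo(root: Node, inputCount: Int, tensorType: String): String {
    val body = StringBuilder()
    // SSA name already assigned to a node; structurally equal nodes share one name.
    val names = HashMap<Node, String>()
    var nextId = 0

    fun visit(node: Node): String = names.getOrPut(node) {
        when (node) {
            is Input -> "%arg${node.index}"
            is Binary -> {
                // Operands are emitted first so every use follows its definition.
                val lhs = visit(node.lhs)
                val rhs = visit(node.rhs)
                val name = "%${nextId++}"
                body.append("    $name = stablehlo.${node.op} $lhs, $rhs : $tensorType\n")
                name
            }
        }
    }

    val result = visit(root)
    val args = (0 until inputCount).joinToString(", ") { "%arg$it: $tensorType" }
    return buildString {
        append("module {\n")
        append("  func.func @main($args) -> $tensorType {\n")
        append(body.toString())
        append("    func.return $result : $tensorType\n")
        append("  }\n")
        append("}\n")
    }
}
```

For example, `emitStableHlo(Binary("add", Input(0), Input(1)), 2, "tensor<4xf32>")` yields a module whose body contains `%0 = stablehlo.add %arg0, %arg1 : tensor<4xf32>`. A real converter would also need per-value shapes and element types, op attributes, and ops that are not elementwise binary.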

Deployment Strategy by Use Case

Datacenter & Cloud Deployment

Recommended: XLA compilation

val model = neuralNetwork { /* DSL */ }
val executable = model.toStableHlo().compileWithXla(
    target = XlaTarget.GPU_CUDA,
    optimization = OptimizationLevel.AGGRESSIVE
)

Mobile & Edge Deployment

Recommended: IREE compilation

val model = neuralNetwork { /* DSL */ }
val executable = model.toStableHlo().compileWithIree(
    target = IreeTarget.CPU_ARM64,
    optimization = OptimizationLevel.SIZE
)

Web Applications

Recommended: IREE with WebGPU

val model = neuralNetwork { /* DSL */ }
val executable = model.toStableHlo().compileWithIree(
    target = IreeTarget.WEBGPU,
    optimization = OptimizationLevel.LATENCY
)

Development & Testing

Recommended: CPU Backend

val model = neuralNetwork { /* DSL */ }
val result = model.execute(input, CpuBackend())

When to Use Each Path

Use CPU Backend When:

  • Running unit tests
  • Developing new operators
  • Debugging model behavior
  • Targeting JS/WASM platforms without WebGPU
  • Quick prototyping without compilation overhead

Use XLA Compilation When:

  • Deploying to datacenter/cloud environments
  • Targeting TPU hardware
  • Integrating with TensorFlow/JAX ecosystems
  • Large-scale training workloads
  • Maximum performance on server-class hardware

Use IREE Compilation When:

  • Deploying to mobile devices (iOS/Android)
  • Targeting edge devices with limited resources
  • Building standalone applications
  • Using WebGPU in browsers
  • Deploying to embedded/bare-metal systems
  • Optimizing for binary size and startup time
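The selection rules above can be written down as a small decision function. This is only a sketch of the guidance in this section; the `Deployment` and `ExecutionPath` enums and `recommendPath` are hypothetical names, not existing SKaiNET types.

```kotlin
/** Hypothetical deployment categories taken from the lists above. */
enum class Deployment { DEVELOPMENT, DATACENTER, TPU, MOBILE, EDGE, BROWSER, BARE_METAL }

/** The three execution paths described in this document. */
enum class ExecutionPath { CPU_BACKEND, XLA, IREE }

/** Maps a deployment scenario to the path this document recommends for it. */
fun recommendPath(deployment: Deployment, webGpuAvailable: Boolean = true): ExecutionPath =
    when (deployment) {
        Deployment.DEVELOPMENT -> ExecutionPath.CPU_BACKEND
        Deployment.DATACENTER, Deployment.TPU -> ExecutionPath.XLA
        Deployment.MOBILE, Deployment.EDGE, Deployment.BARE_METAL -> ExecutionPath.IREE
        // JS/WASM targets without WebGPU fall back to the direct CPU backend.
        Deployment.BROWSER ->
            if (webGpuAvailable) ExecutionPath.IREE else ExecutionPath.CPU_BACKEND
    }
```

Keeping the rules in one `when` expression means the compiler flags any newly added `Deployment` value that lacks a recommendation.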

Hardware Backend Support Comparison

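| Target Platform | CPU Backend | XLA | IREE |
|---|---|---|---|
| CPU (x64/ARM) | ✅ Direct | ✅ Optimized | ✅ Lightweight |
| NVIDIA GPU | - | ✅ CUDA | ✅ CUDA/Vulkan |
| AMD GPU | - | ✅ ROCm | ✅ ROCm/Vulkan |
| Apple GPU | - | - | ✅ Metal |
| Intel GPU | - | ✅ Level Zero | ✅ Vulkan |
| TPU | - | ✅ Native | - |
| WebGPU | - | - | ✅ WGSL |
| Mobile GPU | - | - | ✅ Vulkan/Metal |
| Bare Metal | - | ✅ Limited | ✅ LLVM |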

No Separate Hardware Backends Needed

Unlike frameworks that implement separate backends for each hardware target (CUDA, Metal, ROCm, etc.), SKaiNET relies on MLIR compilers for hardware targeting. This means:

  • No skainet-backend-cuda: XLA/IREE handle NVIDIA GPU compilation
  • No skainet-backend-metal: IREE handles Apple GPU compilation
  • No skainet-backend-rocm: XLA/IREE handle AMD GPU compilation
  • No skainet-backend-vulkan: IREE handles cross-platform GPU via Vulkan

The only exception is the CPU backend, which serves as a reference implementation and development tool rather than a production execution engine.
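One way to picture "no separate hardware backends" is that a hardware target becomes a piece of data passed to the compiler rather than a module in SKaiNET. The sketch below builds an `iree-compile` command line for a StableHLO input. The enum and function names are hypothetical, and the flag spellings follow IREE's command-line documentation at the time of writing, so they should be checked against the installed IREE version.

```kotlin
/** Hypothetical target descriptor: a hardware target is data, not a backend module. */
enum class IreeTargetBackend(val flagValue: String) {
    CPU("llvm-cpu"),
    VULKAN("vulkan-spirv"),
    CUDA("cuda"),
}

/**
 * Builds the argument list for compiling a StableHLO file with iree-compile.
 * Flag names are assumptions based on IREE's documentation and may differ by version.
 */
fun ireeCompileCommand(
    input: String,
    output: String,
    backend: IreeTargetBackend,
): List<String> = listOf(
    "iree-compile",
    "--iree-input-type=stablehlo",
    "--iree-hal-target-backends=${backend.flagValue}",
    input,
    "-o", output,
)
```

On the JVM the resulting list could be handed to `ProcessBuilder`. Switching from CPU to Vulkan then changes one enum value instead of requiring a different backend module.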

Implementation Roadmap

Phase 1: StableHLO Foundation (Current)

  • ✅ Basic StableHLO export for core operations
  • ✅ Integration with existing ComputeGraph infrastructure
  • 🔄 Comprehensive operation coverage (in progress)

Phase 2: XLA Integration

  • 🔄 XLA compiler integration and runtime
  • 🔄 Performance benchmarking vs CPU backend
  • ⏳ Production deployment tooling
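The benchmarking item in Phase 2 could start from a small timing harness such as the one below. This is a rough sketch, not a substitute for a proper benchmarking tool; `medianNanos` is a hypothetical helper, and it relies on `kotlin.system.measureNanoTime`, which is available on the JVM and Native targets.

```kotlin
import kotlin.system.measureNanoTime

/**
 * Runs [block] a few times to warm up, then times [iterations] further runs
 * and returns the median duration in nanoseconds (the upper median when
 * [iterations] is even).
 */
fun medianNanos(warmup: Int = 3, iterations: Int = 10, block: () -> Unit): Long {
    require(iterations > 0) { "iterations must be positive" }
    // Warm-up runs are executed but not measured.
    repeat(warmup) { block() }
    val samples = LongArray(iterations) { measureNanoTime(block) }
    samples.sort()
    return samples[iterations / 2]
}
```

Calling it once with a block that runs the CPU backend and once with a block that runs a compiled executable on the same input gives two comparable medians.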

Phase 3: IREE Integration

  • ⏳ IREE compiler integration
  • ⏳ Mobile-optimized compilation profiles
  • ⏳ WebGPU backend for browser deployment
  • ⏳ Bare-metal deployment for embedded systems

Phase 4: Advanced Features

  • ⏳ Dynamic shape support
  • ⏳ Mixed precision compilation
  • ⏳ Custom operator integration
  • ⏳ Distributed execution support

Migration Path

For existing code using the CPU backend:

// Development: Direct execution
val cpuResult = model.execute(input, CpuBackend())

// Export to StableHLO once; each production path below compiles the same module
val mlir = model.toStableHlo()

// Production (Datacenter): Compile with XLA
val xlaExecutable = XlaCompiler.compile(mlir, target = XlaTarget.GPU_CUDA)
val xlaResult = xlaExecutable.run(input)

// Production (Mobile): Compile with IREE
val mobileExecutable = IreeCompiler.compile(mlir, target = IreeTarget.CPU_ARM64)
val mobileResult = mobileExecutable.run(input)

// Web Deployment: IREE with WebGPU
val webExecutable = IreeCompiler.compile(mlir, target = IreeTarget.WEBGPU)
val webResult = webExecutable.run(input)
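Migration is easier if calling code does not care which path produced the executable. A minimal sketch of that idea follows. The `Executable` interface, `cpuReference`, and `infer` are hypothetical, and the CPU stand-in simply doubles each element so the example has visible behavior.

```kotlin
/** Hypothetical common interface for direct and compiled execution paths. */
fun interface Executable {
    fun run(input: FloatArray): FloatArray
}

/** Stand-in for direct CPU execution: doubles every element of the input. */
val cpuReference = Executable { input -> FloatArray(input.size) { i -> input[i] * 2f } }

/** Call sites depend only on [Executable], so switching paths does not touch them. */
fun infer(executable: Executable, input: FloatArray): FloatArray = executable.run(input)
```

With this shape, moving from the CPU backend to an XLA- or IREE-compiled artifact changes only where the `Executable` is constructed.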

Performance Characteristics

Compilation Time

  • CPU Backend: Instant (no compilation)
  • XLA: Moderate (optimized for throughput)
  • IREE: Fast (optimized for deployment)

Runtime Performance

  • CPU Backend: Good (reference implementation)
  • XLA: Excellent (datacenter workloads)
  • IREE: Excellent (mobile/edge workloads)

Binary Size

  • CPU Backend: Small (Kotlin bytecode on the JVM; native or JS/WASM output on other targets)
  • XLA: Large (includes runtime)
  • IREE: Small (standalone executables)

Memory Usage

  • CPU Backend: Moderate (includes JVM overhead on the JVM target)
  • XLA: High (optimization metadata)
  • IREE: Low (minimal runtime)

Future Considerations

Potential Additional Direct Backends

  • WebGPU: For browser-based GPU acceleration when IREE compilation overhead is too high
  • Custom Hardware: For specialized accelerators without MLIR support

These would follow the same pattern as the CPU backend: direct implementation for specific use cases where MLIR compilation isn't suitable.

Emerging Technologies

  • WebAssembly SIMD: Enhanced performance for browser deployment
  • RISC-V: Support for open hardware architectures
  • Neuromorphic Hardware: Specialized AI chips with unique programming models
