A distributed inference system demonstrating consistent hashing, dynamic batching, model sharding, and horizontal scaling
This project is developed on a Linux based system. If you are using a different OS, changes might be required all of which are not included here.
pip install numpy requests matplotlibSave all the provided Python files in a single directory:
consistent_hash.pybatch_processor.pyinference_engine.pyworker_node.pygateway.pybenchmark.pyanalyze_results.pyrun.shstop.sh
Linux/Mac:
chmod +x run_system.sh
./run.shManual (any OS) - use 5 separate terminals:
# Terminal 1
python worker_node.py --port 8001 --node-id worker_1
# Terminal 2
python worker_node.py --port 8002 --node-id worker_2
# Terminal 3
python worker_node.py --port 8003 --node-id worker_3
# Terminal 4 (wait 2 seconds after starting workers)
python gateway.py --port 8000
# Terminal 5 (wait 2 seconds after starting gateway)
python benchmark.py --requests 5000 --concurrent 50python analyze_results.pyThis generates:
latency_distribution.pngnode_distribution.pngperformance_comparison.pngperformance_report.txtbenchmark_results.json
distributed-inference-native/
├── consistent_hash.py # Consistent hashing implementation
├── batch_processor.py # Dynamic batching logic
├── inference_engine.py # Simulated ML inference
├── worker_node.py # Worker server with batching
├── gateway.py # Gateway with routing
├── benchmark.py # Load testing tool
├── analyze_results.py # Results visualization
├── run.sh # Start script (Linux/Mac)
├── stop.sh # Stop script (Linux/Mac)
└── README.md # This file
Clients
│
▼
┌───────────────┐
│ Gateway │
│ Port: 8000 │
│ │
│ Consistent │
│ Hashing │
└───────┬───────┘
│
┌─────────────┼─────────────┐
│ │ │
▼ ▼ ▼
┌────────┐ ┌────────-┐ ┌────────-┐
│Worker 1│ │Worker 2 │ │Worker 3 │
│ 8001 │ │ 8002 | │ 8003 │
│ │ │ │ │ │
│ Batch │ │ Batch │ │ Batch │
│Process │ │Process │ │Process │
│ │ │ │ │ │
│Inference│ │Inference│ │Inference│
│Engine │ │Engine │ │Engine │
└────────┘ └────────-┘ └────────-┘
- 150 virtual nodes per physical node
- Uniform load distribution
- Minimal request redistribution on node changes
- Max batch size: 32 requests
- Timeout: 20ms
- Automatic batch optimization
- Simulated model partitioning across nodes
- Reduced memory footprint per node
- Easy to add more worker nodes
- Linear throughput scaling
# Light load
python benchmark.py --requests 1000 --concurrent 20
# Medium load
python benchmark.py --requests 5000 --concurrent 50
# Heavy load
python benchmark.py --requests 10000 --concurrent 100# Start additional workers
python worker_node.py --port 8004 --node-id worker_4
python worker_node.py --port 8005 --node-id worker_5
# Update gateway (edit gateway.py or pass as arguments)
python gateway.py --workers http://localhost:8001 http://localhost:8002 http://localhost:8003 http://localhost:8004 http://localhost:8005Edit worker_node.py around line 21:
self.batch_processor = BatchProcessor(
max_batch_size=64, # Increase batch size
timeout_ms=50, # Increase timeout
process_fn=self._process_batch
)Linux/Mac:
# Find and kill processes
lsof -ti:8000 | xargs kill -9
lsof -ti:8001 | xargs kill -9
lsof -ti:8002 | xargs kill -9
lsof -ti:8003 | xargs kill -9Windows:
netstat -ano | findstr :8000
taskkill /PID <PID> /F- Make sure you have Python 3.8+:
python --version - Install dependencies:
pip install numpy requests matplotlib - Check for error messages in terminal
- Try starting workers manually in separate terminals
- Ensure all workers are running: check terminals
- Ensure gateway is running: check terminal
- Wait 2-3 seconds after starting gateway before running benchmark
- Test connectivity:
curl http://localhost:8000/stats
# Reinstall dependencies
pip install --upgrade numpy requests matplotlibStarting load test: 5000 requests with 50 concurrent
Target: http://localhost:8000/infer
------------------------------------------------------------
Progress: 5000/5000 (100%) - 1087.3 req/s
------------------------------------------------------------
BENCHMARK RESULTS
============================================================
Total Requests: 5000
Successful: 2576
Failed: 2424
Total Time: 683.77s
Throughput: 3.77 req/s
Latency Distribution (ms):
Mean: 4729.26
Median (p50): 4001.24
p95: 9061.05
p99: 12667.61
Min: 316.64
Max: 17302.05
Std Dev: 2442.98
Node Distribution:
worker_1: 786 (30.5%)
worker_2: 854 (33.2%)
worker_3: 936 (36.3%)
Load Balance Variance: 7.14%
============================================================
Linux/Mac:
./stop.shManual:
# Linux/Mac
pkill -f worker_node.py
pkill -f gateway.py
# Windows
taskkill /F /IM python.exe- Add caching layer for repeated requests
- Implement circuit breakers for fault tolerance
- Add Prometheus metrics export
- Implement request prioritization
- Add authentication/API keys
- Support multiple model versions (A/B testing)
- Add GPU inference support
- Implement request hedging
The C++ version (from earlier artifacts) offers:
- 3-5x better performance
- Lower latency (sub-millisecond)
- More efficient memory usage
- gRPC for faster communication
MIT License - Free to use for portfolios and resumes!
This is a learning/portfolio project. The code prioritizes clarity and educational value.
Built with: Python • NumPy • Matplotlib • HTTP
Concepts: Distributed Systems • Load Balancing • Performance Engineering • System Design