diff --git a/mkdocs.yml b/mkdocs.yml index 0bab8c329..4c057b500 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -321,6 +321,7 @@ nav: - NCCL/RCCL tests: docs/examples/clusters/nccl-rccl-tests.md - Inference: - SGLang: docs/examples/inference/sglang.md + - Dynamo: docs/examples/inference/dynamo.md - vLLM: docs/examples/inference/vllm.md - NIM: docs/examples/inference/nim.md - TensorRT-LLM: docs/examples/inference/trtllm.md diff --git a/mkdocs/docs/concepts/services.md b/mkdocs/docs/concepts/services.md index d9c61ec54..c998bc47e 100644 --- a/mkdocs/docs/concepts/services.md +++ b/mkdocs/docs/concepts/services.md @@ -342,13 +342,13 @@ Setting the minimum number of replicas to `0` allows the service to scale down t -Since 0.20.17, `dstack` supports serving a model using PD disaggregation. To use it, configure three replica groups: one for [Shepherd Model Gateway (SMG)](https://docs.sglang.io/advanced_features/sgl_model_gateway.html), one for prefill workers, and one for decode workers. +Since 0.20.17, `dstack` supports serving a model using Prefill-Decode disaggregation. To use it, configure three replica groups: one for the router, one for prefill workers, and one for decode workers. -> Currently, Prefill-Decode disaggregation is supported only for SGLang. +`dstack` integrates with two routers for PD disaggregation: [Shepherd Model Gateway (SMG)](https://docs.sglang.io/advanced_features/sgl_model_gateway.html) and [NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo). Below is an example for running `zai-org/GLM-4.5-Air-FP8`: -=== "NVIDIA" +=== "SMG"
@@ -372,10 +372,10 @@ Below is an example for running `zai-org/GLM-4.5-Air-FP8`: --port 8000 \ --pd-disaggregation \ --prefill-policy cache_aware - router: - type: sglang resources: cpu: 4 + router: + type: sglang - count: 1..4 scaling: @@ -418,6 +418,111 @@ Below is an example for running `zai-org/GLM-4.5-Air-FP8`:
+ > With the `sglang` router, you can use SGLang prefill and decode workers. Support for vLLM and TensorRT-LLM workers is coming soon. + +=== "Dynamo" + +
+ + ```yaml + type: service + name: dynamo-pd + + env: + - HF_TOKEN + - MODEL_ID=zai-org/GLM-4.5-Air-FP8 + + replicas: + - count: 1 + docker: true + commands: + - apt-get update + - apt-get install -y python3-dev python3-venv + - python3 -m venv ~/dyn-venv + - source ~/dyn-venv/bin/activate + - pip install -U pip + - pip install "ai-dynamo[sglang]==1.1.1" + - git clone https://github.com/ai-dynamo/dynamo.git + # Brings up the NATS / etcd compose stack and runs the Dynamo HTTP frontend. + - docker compose -f dynamo/deploy/docker-compose.yml up -d + - | + python3 -m dynamo.frontend \ + --http-host 0.0.0.0 --http-port 8000 \ + --discovery-backend etcd --router-mode kv \ + --kv-cache-block-size 64 + resources: + cpu: 4 + router: + type: dynamo + + - count: 1..4 + scaling: + metric: rps + target: 3 + python: "3.12" + nvcc: true + commands: + # dstack injects DSTACK_ROUTER_INTERNAL_IP after the router replica + # is provisioned. Compose the etcd/NATS endpoints from it. + - export ETCD_ENDPOINTS="http://$DSTACK_ROUTER_INTERNAL_IP:2379" + - export NATS_SERVER="nats://$DSTACK_ROUTER_INTERNAL_IP:4222" + # Set to enable /health endpoint required by dstack probes. + - export DYN_SYSTEM_PORT="8000" + # Wait until the router's etcd and NATS ports are actually accepting connections. + - | + until (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/2379) 2>/dev/null \ + && (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/4222) 2>/dev/null; do + echo "waiting for etcd/NATS on $DSTACK_ROUTER_INTERNAL_IP..."; sleep 3 + done + - pip install "ai-dynamo[sglang]==1.1.1" + - | + python3 -m dynamo.sglang \ + --model-path $MODEL_ID --served-model-name $MODEL_ID \ + --discovery-backend etcd --host 0.0.0.0 \ + --page-size 64 \ + --disaggregation-mode prefill --disaggregation-transfer-backend nixl + resources: + gpu: H200 + + - count: 1..8 + scaling: + metric: rps + target: 2 + python: "3.12" + nvcc: true + commands: + - export ETCD_ENDPOINTS="http://$DSTACK_ROUTER_INTERNAL_IP:2379" + - export NATS_SERVER="nats://$DSTACK_ROUTER_INTERNAL_IP:4222" + - export DYN_SYSTEM_PORT="8000" + - | + until (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/2379) 2>/dev/null \ + && (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/4222) 2>/dev/null; do + echo "waiting for etcd/NATS on $DSTACK_ROUTER_INTERNAL_IP..."; sleep 3 + done + - pip install "ai-dynamo[sglang]==1.1.1" + - | + python3 -m dynamo.sglang \ + --model-path $MODEL_ID --served-model-name $MODEL_ID \ + --discovery-backend etcd --host 0.0.0.0 \ + --page-size 64 \ + --disaggregation-mode decode --disaggregation-transfer-backend nixl + resources: + gpu: H200 + + port: 8000 + model: zai-org/GLM-4.5-Air-FP8 + + # Custom probe is required for PD disaggregation. + probes: + - type: http + url: /health + interval: 15s + ``` + +
+ + > With the the `dynamo` router, you can use SGLang, vLLM, and TensorRT-LLM prefill and decode workers. + !!! info "Cluster" PD disaggregation requires the service to run in a fleet with `placement` set to `cluster`, because the replicas require an interconnect between instances. diff --git a/mkdocs/docs/examples/inference/dynamo.md b/mkdocs/docs/examples/inference/dynamo.md new file mode 100644 index 000000000..0c30f19e7 --- /dev/null +++ b/mkdocs/docs/examples/inference/dynamo.md @@ -0,0 +1,166 @@ +--- +title: Dynamo +description: Deploying zai-org/GLM-4.5-Air-FP8 using NVIDIA Dynamo +--- + +# Dynamo + +This example shows how to deploy `zai-org/GLM-4.5-Air-FP8` using +[NVIDIA Dynamo](https://github.com/ai-dynamo/dynamo) and `dstack`. + + +## Apply a configuration + +Here's an example of a service that deploys `zai-org/GLM-4.5-Air-FP8` using +Dynamo with PD disaggregation. + +
+ +```yaml +type: service +name: dynamo-pd + +env: + - HF_TOKEN + - MODEL_ID=zai-org/GLM-4.5-Air-FP8 + +replicas: + - count: 1 + docker: true + commands: + - apt-get update + - apt-get install -y python3-dev python3-venv + - python3 -m venv ~/dyn-venv + - source ~/dyn-venv/bin/activate + - pip install -U pip + - pip install "ai-dynamo[sglang]==1.1.1" + - git clone https://github.com/ai-dynamo/dynamo.git + # Brings up the NATS / etcd compose stack and runs the Dynamo HTTP frontend. + - docker compose -f dynamo/deploy/docker-compose.yml up -d + - | + python3 -m dynamo.frontend \ + --http-host 0.0.0.0 --http-port 8000 \ + --discovery-backend etcd --router-mode kv \ + --kv-cache-block-size 64 + resources: + cpu: 4 + router: + type: dynamo + + - count: 1..4 + scaling: + metric: rps + target: 3 + python: "3.12" + nvcc: true + commands: + # dstack injects DSTACK_ROUTER_INTERNAL_IP after the router replica + # is provisioned. Compose the etcd/NATS endpoints from it. + - export ETCD_ENDPOINTS="http://$DSTACK_ROUTER_INTERNAL_IP:2379" + - export NATS_SERVER="nats://$DSTACK_ROUTER_INTERNAL_IP:4222" + # Set to enable /health endpoint required by dstack probes. + - export DYN_SYSTEM_PORT="8000" + # Wait until the router's etcd and NATS ports are actually accepting connections. + - | + until (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/2379) 2>/dev/null \ + && (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/4222) 2>/dev/null; do + echo "waiting for etcd/NATS on $DSTACK_ROUTER_INTERNAL_IP..."; sleep 3 + done + - pip install "ai-dynamo[sglang]==1.1.1" + - | + python3 -m dynamo.sglang \ + --model-path $MODEL_ID --served-model-name $MODEL_ID \ + --discovery-backend etcd --host 0.0.0.0 \ + --page-size 64 \ + --disaggregation-mode prefill --disaggregation-transfer-backend nixl + resources: + gpu: H200 + + - count: 1..8 + scaling: + metric: rps + target: 2 + python: "3.12" + nvcc: true + commands: + - export ETCD_ENDPOINTS="http://$DSTACK_ROUTER_INTERNAL_IP:2379" + - export NATS_SERVER="nats://$DSTACK_ROUTER_INTERNAL_IP:4222" + - export DYN_SYSTEM_PORT="8000" + - | + until (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/2379) 2>/dev/null \ + && (echo > /dev/tcp/$DSTACK_ROUTER_INTERNAL_IP/4222) 2>/dev/null; do + echo "waiting for etcd/NATS on $DSTACK_ROUTER_INTERNAL_IP..."; sleep 3 + done + - pip install "ai-dynamo[sglang]==1.1.1" + - | + python3 -m dynamo.sglang \ + --model-path $MODEL_ID --served-model-name $MODEL_ID \ + --discovery-backend etcd --host 0.0.0.0 \ + --page-size 64 \ + --disaggregation-mode decode --disaggregation-transfer-backend nixl + resources: + gpu: H200 + +port: 8000 +model: zai-org/GLM-4.5-Air-FP8 + +# Custom probe is required for PD disaggregation. +probes: + - type: http + url: /health + interval: 15s +``` + +
+ +> With the the `dynamo` router, you can use SGLang, vLLM, and TensorRT-LLM prefill and decode workers. + +Save the configuration as `service.dstack.yml`, then use the +[`dstack apply`](../../reference/cli/dstack/apply.md) command. + +
+ +```shell +$ dstack apply -f service.dstack.yml +``` + +
+ +If no gateway is created, the service endpoint will be available at `/proxy/services///`. + +
+ +```shell +curl http://127.0.0.1:3000/proxy/services/main/dynamo-pd/v1/chat/completions \ + -X POST \ + -H 'Authorization: Bearer <user token>' \ + -H 'Content-Type: application/json' \ + -d '{ + "model": "zai-org/GLM-4.5-Air-FP8", + "messages": [ + { + "role": "user", + "content": "What is prefill-decode disaggregation?" + } + ], + "max_tokens": 1024 + }' +``` + +
+ +> If a [gateway](../../concepts/gateways.md) is configured (e.g. to enable auto-scaling, HTTPS, rate limits, etc.), the service endpoint will be available at `https://dynamo-pd./`. + +## Configuration options + +Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon. + +!!! info "Cluster" + PD disaggregation requires the service to run in a fleet with `placement` set to `cluster`, because the replicas require an interconnect between instances. + + While the prefill and decode replicas run on GPUs, the router replica requires a CPU instance in the same cluster. + +## What's next? + +1. Read about [services](../../concepts/services.md) and [gateways](../../concepts/gateways.md) +2. Browse the [NVIDIA Dynamo GitHub repository](https://github.com/ai-dynamo/dynamo) and the [SGLang](./sglang.md) example diff --git a/mkdocs/docs/examples/inference/sglang.md b/mkdocs/docs/examples/inference/sglang.md index e900a5f0b..6e67eecdd 100644 --- a/mkdocs/docs/examples/inference/sglang.md +++ b/mkdocs/docs/examples/inference/sglang.md @@ -92,7 +92,6 @@ Here's an example of a service that deploys The AMD example keeps the deployment close to the upstream Qwen and SGLang guidance: a pinned ROCm image, tensor parallelism across all four GPUs, and the standard `qwen3` reasoning parser without extra ROCm-specific tuning flags. -The first startup on MI300X can take longer while SGLang compiles ROCm kernels. Save one of the configurations above as `service.dstack.yml`, then use the [`dstack apply`](../../reference/cli/dstack/apply.md) command. @@ -164,10 +163,10 @@ To run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/ --port 8000 \ --pd-disaggregation \ --prefill-policy cache_aware - router: - type: sglang resources: cpu: 4 + router: + type: sglang - count: 1..4 scaling: @@ -212,6 +211,8 @@ To run SGLang with [PD disaggregation](https://docs.sglang.io/advanced_features/ +> With the `sglang` router, you can use SGLang prefill and decode workers. Support for vLLM and TensorRT-LLM workers is coming soon. + Currently, auto-scaling only supports `rps` as the metric. TTFT and ITL metrics are coming soon. !!! info "Cluster"