[fix] Honor served_model_name and surface HTTP errors in RemoteInferenceEngine#1783
[fix] Honor served_model_name and surface HTTP errors in RemoteInferenceEngine#1783discobot wants to merge 1 commit into
Conversation
…nceEngine create_remote_inference_engines_from_config hardcoded the policy model path as the model name, ignoring generator.inference_engine.served_model_name, so requests to a vLLM server started with --served-model-name were rejected with a 404. RemoteInferenceEngine.generate then parsed the error body as if it were a normal response and returned empty outputs, which surfaced as an opaque IndexError in InferenceEngineClient.generate. Honor served_model_name when set and raise a RuntimeError with the status and error body on non-200 responses. Adds CPU tests with a mock vLLM server covering both paths. Fixes NovaSky-AI#1672.
There was a problem hiding this comment.
Code Review
This pull request improves error handling in the remote inference engine by raising a RuntimeError with the response body when a non-200 status code is returned. It also updates the engine initialization to use served_model_name if available, falling back to the policy model path, and adds comprehensive tests. The review feedback suggests handling potential decoding errors when reading the response text and adding validation to ensure a model name is resolved.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if resp.status != 200: | ||
| # Surface the error body (e.g. vLLM's 404 for an unknown model name) instead of | ||
| # silently parsing it into empty outputs. | ||
| error_body = await resp.text() |
There was a problem hiding this comment.
Using await resp.text() without specifying error handling can raise a UnicodeDecodeError if the remote server returns a non-UTF-8 error response (e.g., binary data or corrupted encoding on a 500 Internal Server Error). This secondary exception would mask the original HTTP status code and make debugging harder. Consider using errors="replace" to gracefully handle any decoding issues.
| error_body = await resp.text() | |
| error_body = await resp.text(errors="replace") |
| # Use served_model_name if provided, otherwise fall back to the model path. | ||
| # served_model_name allows using a different model name for HTTP requests than the actual | ||
| # model path. See InferenceEngineConfig.served_model_name in skyrl/train/config/config.py. | ||
| model_name = ie_cfg.served_model_name if ie_cfg.served_model_name is not None else cfg.trainer.policy.model.path |
There was a problem hiding this comment.
If both generator.inference_engine.served_model_name and trainer.policy.model.path are None or empty, model_name will be None or empty. This will cause type mismatches or cryptic errors later when making remote inference requests. It is safer to validate that a valid model name is resolved and raise a clear ValueError early.
model_name = ie_cfg.served_model_name if ie_cfg.served_model_name is not None else cfg.trainer.policy.model.path
if not model_name:
raise ValueError(
"Model name must be specified. Please set either `generator.inference_engine.served_model_name` "
"or `trainer.policy.model.path`."
)
Fixes #1672.
Remote engines ignore
served_model_nameandRemoteInferenceEngine.generate()swallows the resulting vLLM 404, so a model-name mismatch surfaces asIndexError: list index out of range. Both failure points confirmed on current main (f4d7990).create_remote_inference_engines_from_confignow resolvesmodel_nameasserved_model_namewhen set, falling back to the policy model path — the same resolutionInferenceEngineClient.__init__and the local-engine path already do. Removes the now-resolvedTODO(tgriggs).generate()raises aRuntimeErrorwith the request URL, model name, status, and error body on non-200 responses, matching the patternpause_generation/resume_generationalready use in the same file.chat_completion()/completion()are deliberately unchanged: the HTTP endpoint proxies their response bodies back to callers, so raising there would change proxy semantics.Testing: new CPU tests in
tests/backends/skyrl_train/inference_engines/test_remote_inference_engine.pyagainst a mock vLLM server that returns vLLM's exact 404 error body — they fail before the fix and pass after. The surrounding inference-engine, remote inference client, and remote weight loader tests pass locally (147 tests); pre-commit is clean on the touched files.