Local Model
Benchmark a local checkpoint with vLLM
By default eval benchmarks a remote upstream API. To benchmark a local checkpoint (e.g. an SFT or RL output from the pipeline) instead, serve it with vLLM and point llm_api at it — no code change to start.sh is needed, because the block only ever talks to an OpenAI/Anthropic-compatible HTTP endpoint.
Topology
The eval node is CPU-only, so vLLM must run on a GPU node:
eval node (CPU) GPU node
agent container → eval LiteLLM :4101 ─────────▶ vLLM :8000/v1 (local ckpt)
(start.sh, trajectory_logger) api_base_urlvLLM only — no second proxy
The model-serving layer is vLLM only. start.sh already runs the per-job LiteLLM proxy (with trajectory_logger, sticky routing, and Anthropic-format support). Do not start a second LiteLLM next to vLLM — it would collide on the port and bypass that logging.
0. One-time: create the vLLM env (GPU node)
The eval node's /eval:setup does not build this — it only builds the CPU-side Harbor uv env + LiteLLM venv. For a standard bf16/fp16 checkpoint a plain pip install is enough:
conda create -y -n vllm_0.18.1 python=3.12
conda activate vllm_0.18.1
pip install vllm==0.18.1Only FP8 models with custom kernels (e.g. GLM-5.1-FP8) need the heavy source build in repos/harbor/scripts/serve_llm/install_vllm_32b717_cu128.sh. serve_local_model.sh activates $VLLM_CONDA_ENV (default vllm_0.18.1) and errors with this exact recipe if vllm isn't found.
1. Serve the checkpoint (GPU node)
bash scripts/serve_local_model.sh # runs in foreground; use tmuxThe checkpoint path and served name are read from config.yaml → runtime_info.input.local_model_serving (model_path / model_name) — that is the source of truth, not a hardcoded script default. Everything else is env-overridable (MODEL_PATH, MODEL_NAME, VLLM_PORT, TENSOR_PARALLEL_SIZE, API_KEY, VLLM_CONDA_ENV, …); it serves on :8000 by default. When ready it prints the exact llm_api block to paste.
local_model_serving:
model_path: /mnt/.../sft/artifacts/model/qwen3_8b_glm5_...
model_name: Qwen3-8B2. Point llm_api at it (eval node)
Edit config.yaml → runtime_info.input.llm_api to the local-vLLM recipe (commented at the top of runtime_info.input):
llm_api:
api_key: dummy-key # must match vLLM --api-key
api_base_url: "http://<GPU_NODE_IP>:8000/v1"
model: "openai/Qwen3-8B" # openai/<served-model-name>
protocols: [openai_compatible, anthropic_compatible]
served_via: per_job_litellm_proxy
input_cost_per_token: 0.0
output_cost_per_token: 0.0The served name must match the basename of llm_api.model (openai/<model_name>).
3. Validate and run
bash scripts/dryrun.sh # probe_llm_completion.sh sends a real completion to verify the endpoint
bash scripts/start.shdryrun.sh sends a real completion to the configured endpoint, so a live vLLM server is verified before the job launches.
Tagging endpoint stays separate
If the local checkpoint is a reasoning model (emits <think>…), keep job_analysis.tag_llm pointed at a JSON-clean endpoint — gold-dataset tagging needs clean JSON. See Dataset Prep.