Run Jobs
Select a benchmark, start the proxy, and run the eval
An eval job rolls a coding agent out across a registry-resolved benchmark inside Harbor containers, capturing one trajectory and one set of evaluation artifacts per task. This section covers the full run path: selecting a benchmark, fronting your model with a LiteLLM proxy, and inspecting the outputs.
The whole flow is driven from a single config.yaml and executed by scripts/start.sh:
bash scripts/update_repos.sh # clone/update Harbor at the pinned commit (first time)
bash scripts/dryrun.sh # validate config, repo, envs, model API, registry entry
bash scripts/start.sh # generate proxy config, start proxy, run Harbor job, analyze
start.sh runs the dry-run preflight, generates a per-job LiteLLM config, starts the proxy on the configured port, builds the Harbor command (--dataset <name> --registry-path repos/harbor/registry.json, plus --exclude-task-name flags from HARBOR_EXCLUDE_TASKS), runs the job under artifacts/jobs/, and then runs post-eval job analysis (unless disabled).
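To make the command-building step concrete, here is a minimal sketch of how the exclude list could be expanded into flags. It is not the actual logic in scripts/start.sh: the function name build_harbor_args is hypothetical, and the sketch assumes HARBOR_EXCLUDE_TASKS holds a comma-separated list of task names, which the text above does not specify.

```shell
# Sketch only. Assumption: HARBOR_EXCLUDE_TASKS is a comma-separated list of
# task names. build_harbor_args is a hypothetical helper, not part of start.sh.
build_harbor_args() {
  dataset=$1
  excludes=${2:-}

  # Fixed part of the argument list, one argument per output line.
  printf '%s\n' --dataset "$dataset" --registry-path repos/harbor/registry.json

  # Split on commas; `read` with the default IFS trims surrounding whitespace.
  printf '%s\n' "$excludes" | tr ',' '\n' | while read -r name; do
    if [ -n "$name" ]; then
      printf '%s\n' --exclude-task-name "$name"
    fi
  done
}

# Example: print the arguments for a benchmark with two excluded tasks.
build_harbor_args "my-benchmark" "task-a, task-b"
```

Each excluded task yields its own --exclude-task-name flag, and an empty or unset list adds no flags at all.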
- Select Benchmark — the registry-driven benchmark catalog and how to switch
- LiteLLM Proxy — the per-job proxy in front of your model API
- Results & Artifacts — where trajectories, results, analysis, and archives land
Stop the proxy when done
If a job's inference finishes but the LiteLLM proxy is still running, stop the proxy process started for this job. Do not kill unrelated LiteLLM processes on the host.
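One way to stop only this job's proxy is to target it by PID rather than by process name. The sketch below assumes the proxy's PID was written to a file when the proxy was started. Both the function name stop_job_proxy and the PID file path are hypothetical, and the text above does not say whether start.sh records a PID this way.

```shell
# Sketch only. Assumption: the proxy's PID was saved to a file at start-up.
# stop_job_proxy and the PID file path are hypothetical.
stop_job_proxy() {
  pid_file=$1

  # Nothing recorded means there is nothing to stop.
  [ -f "$pid_file" ] || return 0
  pid=$(cat "$pid_file")

  if kill -0 "$pid" 2>/dev/null; then
    kill "$pid"   # SIGTERM to the recorded PID only, never a name-based kill
    # Give the process up to about 5 seconds to exit.
    tries=0
    while kill -0 "$pid" 2>/dev/null && [ "$tries" -lt 5 ]; do
      sleep 1
      tries=$((tries + 1))
    done
  fi

  rm -f "$pid_file"
}

# Example call (hypothetical path); a missing PID file is a no-op.
stop_job_proxy "artifacts/jobs/proxy.pid"
```

Because the signal goes to a single recorded PID, other LiteLLM processes on the host are left alone, which a name-based kill such as pkill would not guarantee.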