Getting Started
Set up the block and run your first benchmark
This page walks through preparing the eval block and running a benchmark end to end. eval wraps a pinned Harbor checkout, so most steps are about getting that runtime in place — there is no task staging step, because the benchmark is resolved from Harbor's registry.
Prerequisites
- Docker: Harbor rolls each task out inside a container, so a working Docker daemon is required on the run host. eval is CPU-only, but every task still runs in a container.
- uv: used to build the Harbor environment and run analysis; also required by start.sh (uv run harbor …).
- Python 3.13: required by the LiteLLM venv.
- PyYAML in system python3: every scripts/*.sh uses inline python3 - config readers. Run pip install pyyaml if you see ERROR: PyYAML is required.
- An OpenAI- or Anthropic-compatible LLM endpoint: the model the agent will call, fronted by a per-job LiteLLM proxy. This can be a remote API or a local checkpoint served with vLLM (see Local Model).
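The tool prerequisites can be checked before touching the block. The following is a minimal sketch and not part of eval: the helper names (missing_tools, has_pyyaml, docker_daemon_ok) are hypothetical, and it assumes the script is run with the same system python3 that the scripts use. It does not probe the LLM endpoint or verify that Python 3.13 is available to uv.

```python
import importlib.util
import shutil
import subprocess

# Executables named in the prerequisites list.
REQUIRED_TOOLS = ("docker", "uv", "python3")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def has_pyyaml():
    """True if the interpreter running this script can import PyYAML."""
    return importlib.util.find_spec("yaml") is not None


def docker_daemon_ok(timeout_s=30):
    """True if `docker info` exits with 0, i.e. the daemon is reachable."""
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # docker is not installed, not executable, or not responding.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("missing tools:", missing_tools() or "none")
    print("PyYAML importable:", has_pyyaml())
    print("Docker daemon reachable:", docker_daemon_ok())
```

The which parameter is injectable only so the PATH lookup can be exercised without the real tools installed.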
Where to run
eval runs on the node declared in config.yaml (meta_info.resources.ip, currently 192.168.35.240). If your shell is on a different host, SSH there first; run its scripts inside a tmux session (e.g. eval) so a job survives shell disconnects.
1. Update the managed repo
eval depends on one local-only repository, harbor, pinned to a specific commit in config.yaml. Clone or update it to the pinned ref:
bash scripts/update_repos.sh
The script clones or fetches the repo, checks out the pinned commit in detached-HEAD state, initializes submodules, and sets the worktree read-only (readonly: true). It refuses to update a worktree with local modifications. The block reads registry.json from this checkout.
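If the update is refused, it can help to see what git considers modified. How update_repos.sh detects local modifications is not shown on this page, so the sketch below is an assumption: it treats any line of git status --porcelain output (including untracked files) as a modification. The function names are hypothetical.

```python
import subprocess


def has_local_modifications(porcelain_output):
    """True if `git status --porcelain` output lists any changed or untracked path."""
    return any(line.strip() for line in porcelain_output.splitlines())


def worktree_is_clean(repo_dir="repos/harbor"):
    """Run `git status --porcelain` in repo_dir and report whether it is clean."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "status", "--porcelain"],
        capture_output=True,
        text=True,
        check=True,  # raise if repo_dir is not a git worktree
    )
    return not has_local_modifications(result.stdout)
```

The real script may apply a stricter or looser rule, for example ignoring untracked files.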
2. Build the environments
eval has no setup_*_env.sh helper scripts (unlike trajgen). The two environments are built with explicit uv commands — scripts/dryrun.sh also prints these when an env is missing. There is no swe_data_process env, because eval does not convert trajectories.
# Harbor uv env — build from inside the read-only repo, to a path OUTSIDE it
( cd repos/harbor && UV_PROJECT_ENVIRONMENT="$PWD/../../artifacts/env/harbor-uv" uv sync --all-extras )
# LiteLLM venv — Python 3.13, pinned litellm
uv venv artifacts/env/litellm-venv --python 3.13
uv pip install --python artifacts/env/litellm-venv/bin/python 'litellm[proxy]==1.83.14'
The Harbor env path must be outside repos/harbor while the repo is read-only; artifacts/env/harbor-uv already satisfies this.
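The "outside the repo" constraint is easy to get wrong when choosing a custom env path. As a minimal sketch (the helper name is_outside is hypothetical, and paths are resolved relative to the current working directory, assumed here to be the block root):

```python
from pathlib import Path


def is_outside(env_path, repo_path="repos/harbor"):
    """True if env_path is neither the repo directory nor anywhere beneath it."""
    env = Path(env_path).resolve()
    repo = Path(repo_path).resolve()
    try:
        # relative_to succeeds only when env is the repo or inside it.
        env.relative_to(repo)
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    print(is_outside("artifacts/env/harbor-uv"))
```

Resolving both paths first means a value such as repos/harbor/../../artifacts/env/harbor-uv is judged by where it actually points.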
3. Pick a benchmark
eval is registry-driven and stages no tasks locally. Select a benchmark by editing two fields in config.yaml → runtime_info.input.task_source:
task_source:
  provider: harbor_registry
  dataset_name: swebench-verified # or any supported benchmark
  version: "1.0"
  registry_path: repos/harbor/registry.json
See Select Benchmark for the full table. For a fast first run, pick a -100 subset (e.g. swebench-verified-100) or set harbor_job.n_tasks to a small integer.
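Before editing the two fields, it can be useful to see which (dataset_name, version) pairs the registry actually offers. The sketch below rests on an assumption about the file layout that this page does not confirm: that registry.json is a JSON array of objects carrying "name" and "version" keys. Entries that do not fit that shape are skipped, and the function names are hypothetical.

```python
import json
from pathlib import Path

DEFAULT_REGISTRY = "repos/harbor/registry.json"


def list_datasets(registry_path=DEFAULT_REGISTRY):
    """Return sorted (name, version) pairs found in the registry file."""
    entries = json.loads(Path(registry_path).read_text(encoding="utf-8"))
    pairs = set()
    for entry in entries:
        # ASSUMPTION: each entry is an object with "name" and "version" keys.
        if isinstance(entry, dict) and "name" in entry and "version" in entry:
            pairs.add((str(entry["name"]), str(entry["version"])))
    return sorted(pairs)


def registry_has(dataset_name, version, registry_path=DEFAULT_REGISTRY):
    """True if the configured pair appears in the registry file."""
    return (dataset_name, str(version)) in list_datasets(registry_path)
```

Versions are compared as strings, which matches the quoted "1.0" in the config above.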
4. Validate the config
Run the dry-run preflight. It validates the config, the Harbor repo state, both environments, the model API (a real completion probe), and confirms the configured (dataset_name, version) exists in registry.json — all without side effects:
bash scripts/dryrun.sh
Fix anything it reports before launching a job.
5. Launch a job
Launch the eval only after the dry run passes. start.sh runs the dry-run preflight, generates a per-job LiteLLM config, starts the proxy on the configured port, builds the Harbor command (--dataset <name> --registry-path repos/harbor/registry.json, plus any --exclude-task-name flags), runs the job, and then runs post-eval job analysis:
bash scripts/start.sh
6. Inspect results
Each rollout writes a trajectory and per-task evaluation artifacts to:
artifacts/jobs/<job>/<task>/agent/litellm-trajectory.jsonl
artifacts/jobs/<job>/<task>/evaluation/ # per-task verdict, scoring, test logs
Post-eval analysis lands under artifacts/jobs/<job>/analysis/. For a visual view, open the dashboard.
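For a quick look without the dashboard, the trajectory files can be enumerated directly from the layout above. This is a minimal sketch: the function name is hypothetical, and because the record schema of litellm-trajectory.jsonl is not described here, it only counts non-empty lines per task rather than parsing them.

```python
from pathlib import Path


def trajectory_line_counts(job_dir):
    """Map each task directory name to the number of non-empty trajectory lines."""
    counts = {}
    pattern = "*/agent/litellm-trajectory.jsonl"
    for path in sorted(Path(job_dir).glob(pattern)):
        task = path.parent.parent.name  # <job>/<task>/agent/<file>
        with path.open(encoding="utf-8") as fh:
            counts[task] = sum(1 for line in fh if line.strip())
    return counts


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        for task, n in trajectory_line_counts(sys.argv[1]).items():
            print(f"{task}\t{n}")
```

Directories without an agent/ subfolder, such as analysis/, do not match the glob and are skipped. A task that is missing from the result produced no trajectory file.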
Operating with the agent plugin
If you operate the block through its Claude plugin, the same lifecycle maps to slash commands:
/root:check eval # preflight: config, repo, envs, LLM endpoint, registry lookup
/eval:setup # update repo, build envs, fill config
/eval:check # dryrun preflight
/eval:run # LiteLLM proxy + Harbor benchmark execution + analysis
/eval:dashboard # browse jobs, reports, and trajectories