eval

Getting Started

Set up the block and run your first benchmark

This page walks through preparing the eval block and running a benchmark end to end. eval wraps a pinned Harbor checkout, so most steps are about getting that runtime in place — there is no task staging step, because the benchmark is resolved from Harbor's registry.

Prerequisites

  • Docker: Harbor runs each task inside a container, so the run host needs a working Docker daemon even though eval itself is CPU-only.
  • uv: used to build the Harbor environment and run analysis; start.sh also invokes it (uv run harbor …).
  • Python 3.13: required by the LiteLLM venv.
  • PyYAML in system python3: every script under scripts/ reads the config through an inline python3 - snippet. Run pip install pyyaml if you see ERROR: PyYAML is required.
  • An OpenAI- or Anthropic-compatible LLM endpoint: the model the agent calls, fronted by a per-job LiteLLM proxy. This can be a remote API or a local checkpoint served with vLLM (see Local Model).
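
The list above can be spot-checked before going further. The following is a minimal sketch, not part of eval: missing_commands, has_pyyaml, and check_prereqs are hypothetical helper names, and the LLM endpoint is deliberately left out.

```shell
# Hypothetical helpers (not shipped with eval) for checking the prerequisites.

# Print the names of the given commands that are not on PATH, one per line.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Return 0 if system python3 can import PyYAML.
has_pyyaml() {
  python3 -c 'import yaml' >/dev/null 2>&1
}

# Run every check, print one line per problem, return non-zero if any failed.
check_prereqs() {
  local status=0 cmd
  for cmd in $(missing_commands docker uv python3); do
    echo "missing command: $cmd"
    status=1
  done
  if ! docker info >/dev/null 2>&1; then
    echo "Docker daemon is not reachable"
    status=1
  fi
  if ! has_pyyaml; then
    echo "PyYAML is not importable from system python3 (try: pip install pyyaml)"
    status=1
  fi
  return "$status"
}
```

Sourcing the snippet and calling check_prereqs on the run host reports anything missing; it only reads state and changes nothing.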

Where to run

eval runs on the node declared in config.yaml (meta_info.resources.ip, currently 192.168.35.240). If your shell is on a different host, SSH there first. Run the scripts inside a tmux session (for example, one named eval) so that a job survives shell disconnects.
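
As a rough sketch of the tmux advice, the helpers below guard against launching outside a session. Both function names are hypothetical; the only tmux behavior relied on is that tmux exports TMUX into the shells it starts and that new-session -A attaches when the session already exists.

```shell
# Return 0 when the current shell is running inside a tmux session.
# tmux sets the TMUX environment variable in every shell it starts.
in_tmux() {
  [ -n "${TMUX:-}" ]
}

# Hypothetical wrapper: create the named session, or attach if it already exists.
# The session name defaults to "eval".
open_eval_session() {
  tmux new-session -A -s "${1:-eval}"
}
```

A typical use would be `in_tmux || open_eval_session` before running any long job.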

1. Update the managed repo

eval depends on one local-only repository, harbor, pinned to a specific commit in config.yaml. Clone or update it to the pinned ref:

bash scripts/update_repos.sh

The script clones or fetches the repo, checks out the pinned commit as a detached HEAD, initializes submodules, and marks the worktree read-only (readonly: true). It refuses to update a worktree that has local modifications. The block reads registry.json from this checkout.
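
To spot-check the result by hand, something like the following could be used. This is a sketch that is independent of update_repos.sh: the function names are hypothetical, and the pinned commit is passed in as a full SHA because the config key that holds it is not shown here.

```shell
# Return 0 if the checkout at $1 has HEAD equal to commit $2 (full SHA expected).
at_pinned_commit() {
  local repo="$1" pinned="$2"
  [ "$(git -C "$repo" rev-parse HEAD)" = "$pinned" ]
}

# Return 0 if the worktree at $1 has no local modifications or untracked files.
worktree_clean() {
  [ -z "$(git -C "$1" status --porcelain)" ]
}
```

For example, `worktree_clean repos/harbor && echo clean` confirms there is nothing that would make the update script refuse to run.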

2. Build the environments

eval has no setup_*_env.sh helper scripts (unlike trajgen). The two environments are built with explicit uv commands — scripts/dryrun.sh also prints these when an env is missing. There is no swe_data_process env, because eval does not convert trajectories.

# Harbor uv env — build from inside the read-only repo, to a path OUTSIDE it
( cd repos/harbor && UV_PROJECT_ENVIRONMENT="$PWD/../../artifacts/env/harbor-uv" uv sync --all-extras )

# LiteLLM venv — Python 3.13, pinned litellm
uv venv artifacts/env/litellm-venv --python 3.13
uv pip install --python artifacts/env/litellm-venv/bin/python 'litellm[proxy]==1.83.14'

The Harbor env path must be outside repos/harbor while the repo is read-only — artifacts/env/harbor-uv already satisfies this.
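
If you choose a different environment location, the constraint can be checked before running uv sync. The sketch below assumes GNU coreutils realpath (its -m flag resolves paths that do not exist yet); outside_repo is a hypothetical name.

```shell
# Return 0 if path $1 does NOT live inside directory $2.
# Both paths are normalized to absolute form first, so ".." segments are handled.
# ASSUMPTION: GNU realpath is available (-m tolerates missing path components).
outside_repo() {
  local env_path repo
  env_path="$(realpath -m -- "$1")"
  repo="$(realpath -m -- "$2")"
  case "$env_path/" in
    "$repo"/*) return 1 ;;   # env path is the repo itself or somewhere below it
    *) return 0 ;;
  esac
}
```

Run from the block root, `outside_repo artifacts/env/harbor-uv repos/harbor` succeeds, while a path such as repos/harbor/.venv is rejected.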

3. Pick a benchmark

eval is registry-driven and stages no tasks locally. Select a benchmark by editing two fields, dataset_name and version, under runtime_info.input.task_source in config.yaml:

task_source:
  provider: harbor_registry
  dataset_name: swebench-verified   # or any supported benchmark
  version: "1.0"
  registry_path: repos/harbor/registry.json

See Select Benchmark for the full table. For a fast first run, pick a -100 subset (e.g. swebench-verified-100) or set harbor_job.n_tasks to a small integer.
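
To confirm which benchmark the config currently selects, one option is a small reader in the same inline python3 style the scripts use. This is a sketch only: yaml_get is a hypothetical helper, not one of eval's scripts, and it needs PyYAML in system python3.

```shell
# Print the value at a dotted key path in a YAML file.
# Usage: yaml_get config.yaml runtime_info.input.task_source.dataset_name
yaml_get() {
  python3 - "$1" "$2" <<'PY'
import sys

try:
    import yaml
except ImportError:
    sys.exit("ERROR: PyYAML is required")

path, dotted_key = sys.argv[1], sys.argv[2]
with open(path) as fh:
    node = yaml.safe_load(fh)
# Walk one mapping level per dotted segment; a missing key raises KeyError.
for part in dotted_key.split("."):
    node = node[part]
print(node)
PY
}
```

With the fragment shown earlier, `yaml_get config.yaml runtime_info.input.task_source.dataset_name` would print swebench-verified.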

4. Validate the config

Run the dry-run preflight. It validates the config, the Harbor repo state, both environments, the model API (a real completion probe), and confirms the configured (dataset_name, version) exists in registry.json — all without side effects:

bash scripts/dryrun.sh

Fix anything it reports before launching a job.
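
If the registry check fails, it can help to look the pair up by hand. The sketch below is not how dryrun.sh does it. It ASSUMES registry.json is a JSON array of objects with "name" and "version" keys, which you should verify against your checkout; registry_has is a hypothetical name.

```shell
# Return 0 if registry file $1 lists dataset $2 at version $3.
# ASSUMPTION: the registry is a JSON array of objects with "name" and "version" keys.
registry_has() {
  python3 - "$1" "$2" "$3" <<'PY'
import json
import sys

path, name, version = sys.argv[1:4]
with open(path) as fh:
    entries = json.load(fh)
# Compare versions as strings so "1.0" in the config matches "1.0" in the registry.
found = any(
    isinstance(e, dict) and e.get("name") == name and str(e.get("version")) == version
    for e in entries
)
sys.exit(0 if found else 1)
PY
}
```

For example: `registry_has repos/harbor/registry.json swebench-verified 1.0 && echo found`.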

5. Launch a job

Launch the eval only after the dry run passes. start.sh runs the dry-run preflight itself, generates a per-job LiteLLM config, starts the proxy on the configured port, builds the Harbor command (--dataset <name> --registry-path repos/harbor/registry.json, plus any --exclude-task-name flags), runs the job, and finally runs the post-eval job analysis:

bash scripts/start.sh
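
To make the command-building step concrete, here is a minimal sketch of how the dataset-selection flags could be assembled. It is not the code in start.sh: harbor_dataset_args is a hypothetical name, the task names in the usage note are placeholders, and only the three flags named above are used.

```shell
# Print the dataset-selection arguments, one per line.
# Usage: harbor_dataset_args <dataset_name> <registry_path> [excluded_task ...]
harbor_dataset_args() {
  local dataset="$1" registry="$2" task
  shift 2
  printf '%s\n' --dataset "$dataset" --registry-path "$registry"
  # One --exclude-task-name flag per excluded task.
  for task in "$@"; do
    printf '%s\n' --exclude-task-name "$task"
  done
}
```

In bash 4 or later the output can be collected into an array with `mapfile -t args < <(harbor_dataset_args swebench-verified repos/harbor/registry.json task-a task-b)` and then appended to the Harbor invocation as "${args[@]}", which keeps each argument correctly quoted.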

6. Inspect results

Each rollout writes a trajectory and per-task evaluation artifacts to:

artifacts/jobs/<job>/<task>/agent/litellm-trajectory.jsonl
artifacts/jobs/<job>/<task>/evaluation/        # per-task verdict, scoring, test logs

Post-eval analysis lands under artifacts/jobs/<job>/analysis/. For a visual view, open the dashboard.
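
For a quick look from the shell, the sketch below lists which tasks in a job produced a trajectory and how many lines each one has. It assumes the layout shown above and one record per line in the .jsonl file; trajectory_line_counts is a hypothetical helper.

```shell
# Print "<task> <line count>" for every task in a job directory that has a trajectory.
# Usage: trajectory_line_counts artifacts/jobs/<job>
trajectory_line_counts() {
  local job_dir="$1" file task
  for file in "$job_dir"/*/agent/litellm-trajectory.jsonl; do
    [ -f "$file" ] || continue   # glob matched nothing
    # The task name is the directory two levels above the trajectory file.
    task="$(basename "$(dirname "$(dirname "$file")")")"
    printf '%s %s\n' "$task" "$(wc -l < "$file" | tr -d ' ')"
  done
}
```

Directories without an agent/litellm-trajectory.jsonl file, such as analysis/, are simply skipped.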

Operating with the agent plugin

If you operate the block through its Claude plugin, the same lifecycle maps to slash commands:

/root:check eval     # preflight: config, repo, envs, LLM endpoint, registry lookup
/eval:setup          # update repo, build envs, fill config
/eval:check          # dryrun preflight
/eval:run            # LiteLLM proxy + Harbor benchmark execution + analysis
/eval:dashboard      # browse jobs, reports, and trajectories
