Core Concepts

eval has the following core concepts:

Benchmark and registry

A benchmark is a named, versioned set of evaluation tasks — for example swebench-verified@1.0 (500 human-validated SWE-bench tasks). eval does not author or stage tasks. It selects a benchmark by name and lets Harbor resolve it against its registry, repos/harbor/registry.json, which maps each (dataset_name, version) pair to the underlying task data and fetches it automatically.
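
As a rough illustration of the (dataset_name, version) lookup described above, the sketch below splits a reference such as swebench-verified@1.0 into its two parts and uses the pair as a lookup key. The name@version string form is taken from the example above. The parse_benchmark_ref helper and the in-memory registry dict are hypothetical stand-ins, since the actual schema of registry.json is not shown on this page.

```python
from typing import NamedTuple


class BenchmarkRef(NamedTuple):
    dataset_name: str
    version: str


def parse_benchmark_ref(ref: str) -> BenchmarkRef:
    """Split 'name@version' on the last '@' (hypothetical helper)."""
    name, sep, version = ref.rpartition("@")
    if not sep or not name or not version:
        raise ValueError(f"expected 'name@version', got {ref!r}")
    return BenchmarkRef(name, version)


# Hypothetical stand-in for the registry: the real registry.json layout
# is an assumption here, only the (dataset_name, version) key is sourced.
registry = {("swebench-verified", "1.0"): "<task data location>"}

ref = parse_benchmark_ref("swebench-verified@1.0")
print(ref.dataset_name, ref.version, ref in registry)
```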

This is the key difference from trajgen: trajgen consumes a locally-staged, manifest-filtered task export from swegen, while eval is purely registry-driven and needs no artifacts/tasks/ staging step.

Task / trial

A task is a single evaluation instance: an instruction, a container environment, and a test/grading script. A trial is one attempt by the agent to complete a task; it produces a reward — typically 1.0 if the task's grader passes and 0.0 otherwise. eval can retry failed trials up to max_retries.
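
The retry behavior can be pictured with a minimal sketch. It assumes that a "failed" trial is one that raises an error (this page does not define failure precisely) and that max_retries counts extra attempts after the first. The run_trial_with_retries function and the flaky_trial stand-in are hypothetical and are not part of eval or Harbor.

```python
from typing import Callable


def run_trial_with_retries(run_trial: Callable[[], float], max_retries: int) -> float:
    """Run one trial, retrying on error up to max_retries extra times.

    Assumption: a failure is a raised RuntimeError; the real retry
    condition in eval may differ.
    """
    last_error = None
    for _ in range(max_retries + 1):
        try:
            return run_trial()  # reward, e.g. 1.0 or 0.0
        except RuntimeError as err:
            last_error = err
    raise last_error


# Stand-in trial that errors twice, then passes the grader.
calls = {"n": 0}


def flaky_trial() -> float:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("environment failed to start")
    return 1.0


reward = run_trial_with_retries(flaky_trial, max_retries=2)
print(reward, calls["n"])
```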

Job

A job is a full benchmark run driven by a single config.yaml. Harbor runs up to n_concurrent trials in parallel and writes per-task results plus a results.json summary under artifacts/jobs/<job>/. Set n_tasks to a small integer for a smoke run, or leave it null to run the full benchmark.
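
To make the roles of n_concurrent and n_tasks concrete, here is a small sketch that caps parallel trials with a semaphore. It is not Harbor's scheduler. The run_job function is hypothetical, the sleep stands in for a real trial, and taking the first n_tasks tasks is an assumption about how the subset is chosen.

```python
import asyncio
from typing import Optional


async def run_job(task_ids: list, n_concurrent: int, n_tasks: Optional[int] = None):
    # n_tasks=None means the full benchmark; an integer selects a subset
    # (first-N selection is an assumption made for this sketch).
    selected = task_ids if n_tasks is None else task_ids[:n_tasks]
    semaphore = asyncio.Semaphore(n_concurrent)
    running = 0
    peak = 0

    async def run_one(task_id: str):
        nonlocal running, peak
        async with semaphore:  # at most n_concurrent trials hold the semaphore
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for running a trial
            running -= 1
            return task_id, 1.0  # (task, reward)

    pairs = await asyncio.gather(*(run_one(t) for t in selected))
    return dict(pairs), peak


tasks = [f"task-{i}" for i in range(10)]
results, peak = asyncio.run(run_job(tasks, n_concurrent=3, n_tasks=5))
print(len(results), peak)
```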

LiteLLM proxy

eval starts a per-job LiteLLM proxy in front of your upstream model API. The proxy normalizes the endpoint so the agent can speak OpenAI- or Anthropic-compatible protocols, attaches the trajectory logger, and supports sticky routing. Each job gets its own generated config under artifacts/litellm/<job>/. Because eval only ever talks to an HTTP endpoint, the same proxy works whether the upstream is a remote API or a local vLLM checkpoint.

Agent scaffold

A scaffold is the agent harness that drives the model through a task. Three scaffolds are validated end-to-end: custom-claude-code (Anthropic protocol), custom-openhands-sdk (OpenAI protocol), and custom-opencode (OpenAI-compatible). To switch scaffolds, change name, version, runtime_image, and runtime_host_path together in config.yaml. The runtime is bind-mounted from runtime_host_path, which must hold the agent runtime tree extracted from runtime_image.
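
Since the four scaffold fields have to change together, a pre-flight check along these lines might catch a half-edited config. This is only a sketch. The check_scaffold function is hypothetical, and it takes the scaffold settings as a plain dict because the nesting of these keys inside config.yaml is not shown on this page.

```python
# Scaffold names and protocols as listed above.
SCAFFOLD_PROTOCOLS = {
    "custom-claude-code": "Anthropic",
    "custom-openhands-sdk": "OpenAI",
    "custom-opencode": "OpenAI-compatible",
}

REQUIRED_FIELDS = ("name", "version", "runtime_image", "runtime_host_path")


def check_scaffold(agent: dict) -> list:
    """Return a list of problems; an empty list means the fields look consistent."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if not agent.get(key)]
    name = agent.get("name")
    if name and name not in SCAFFOLD_PROTOCOLS:
        problems.append(f"unvalidated scaffold: {name}")
    return problems


# Placeholder values for illustration only.
half_edited = {"name": "custom-opencode", "version": "<version>"}
print(check_scaffold(half_edited))
```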

Job analysis

After each eval run, job analysis runs Harbor's job_analysis pipeline to produce attribution and scoring artifacts under <job_dir>/analysis/. These are the exact files the dashboard reads: failure/resolve reports, task analysis, score comparison, and instance analysis. Job analysis is on by default, non-fatal, and pure-CPU; the LLM judge is off by default, so it consumes no tokens. See Job Analysis.

Gold dataset

Job analysis needs a gold dataset — per-instance reference data (<id>/tests/config.json) under artifacts/datasets/<gold_base>/. When it is missing, analyze_job.sh auto-runs scripts/prepare_dataset.sh to build it (adapter → tagger), then proceeds; if it still cannot be produced, analysis skips cleanly instead of crashing.
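
The layout described above (<id>/tests/config.json under the gold dataset directory) can be checked with a few lines of Python. The sketch below is not what analyze_job.sh does internally. The missing_gold_instances function is a hypothetical helper, and the demo builds a throwaway directory in place of artifacts/datasets/<gold_base>/.

```python
import tempfile
from pathlib import Path


def missing_gold_instances(datasets_root: Path, gold_base: str, instance_ids: list) -> list:
    """List instance ids that lack <id>/tests/config.json under the gold dataset."""
    base = datasets_root / gold_base
    return [
        instance_id
        for instance_id in instance_ids
        if not (base / instance_id / "tests" / "config.json").is_file()
    ]


# Demo on a temporary directory with one prepared instance.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    config = root / "gold" / "instance-a" / "tests" / "config.json"
    config.parent.mkdir(parents=True)
    config.write_text("{}")
    missing = missing_gold_instances(root, "gold", ["instance-a", "instance-b"])

print(missing)
```

An empty result would mean every listed instance has its reference data in place; a non-empty one corresponds to the case where the dataset still has to be prepared.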

Remote API vs local checkpoint

eval supports two interchangeable backends, selected by which llm_api block is active in config.yaml:

Mode A — remote API: point llm_api at a shared endpoint (e.g. a GLM-5 API). No local serving.
Mode B — local checkpoint: serve an SFT/RL output with vLLM on a GPU node (scripts/serve_local_model.sh) and point llm_api at it. See Local Model.
