Core Concepts
Core concepts and terminology in eval
eval has the following core concepts:
Benchmark and registry
A benchmark is a named, versioned set of evaluation tasks — for example swebench-verified@1.0 (500 human-validated SWE-bench tasks). eval does not author or stage tasks. It selects a benchmark by name and lets Harbor resolve it against its registry, repos/harbor/registry.json, which maps each (dataset_name, version) pair to the underlying task data and fetches it automatically.
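As a rough illustration of registry-driven selection, the sketch below splits a name@version string into a (dataset_name, version) pair and looks it up. The REGISTRY dict, the BenchmarkRef type, and the functions parse_benchmark and resolve are hypothetical stand-ins; the actual layout of registry.json and Harbor's lookup code are not shown in this section and may differ.

```python
from typing import NamedTuple


class BenchmarkRef(NamedTuple):
    dataset_name: str
    version: str


def parse_benchmark(spec: str) -> BenchmarkRef:
    """Split '<name>@<version>' into its two parts."""
    name, sep, version = spec.rpartition("@")
    if not sep or not name or not version:
        raise ValueError(f"expected '<name>@<version>', got {spec!r}")
    return BenchmarkRef(name, version)


# Hypothetical in-memory stand-in for the registry. The real registry.json
# schema is an assumption here; only the (dataset_name, version) key is
# taken from the text above.
REGISTRY = {
    BenchmarkRef("swebench-verified", "1.0"): {"n_tasks": 500},
}


def resolve(spec: str) -> dict:
    """Return the registry entry for a benchmark, or fail loudly."""
    ref = parse_benchmark(spec)
    if ref not in REGISTRY:
        raise KeyError(f"benchmark {spec!r} is not in the registry")
    return REGISTRY[ref]
```

The point of the sketch is that the caller only supplies a name and version; nothing is staged locally before the lookup.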
This is the key difference from trajgen: trajgen consumes a locally-staged, manifest-filtered task export from swegen, while eval is purely registry-driven and needs no artifacts/tasks/ staging step.
Task / trial
A task is a single evaluation instance: an instruction, a container environment, and a test/grading script. A trial is one attempt by the agent to complete a task; it produces a reward — typically 1.0 if the task's grader passes and 0.0 otherwise. eval can retry failed trials up to max_retries.
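The retry behaviour can be pictured with a minimal sketch. It assumes that a "failed" trial is one that raises an exception (for example an infrastructure error) rather than one that scores 0.0; this section does not say which is meant. The function run_with_retries and its signature are hypothetical, not part of eval or Harbor.

```python
from typing import Callable, Tuple


def run_with_retries(run_trial: Callable[[], float], max_retries: int) -> Tuple[float, int]:
    """Run one trial, retrying on exceptions; return (reward, attempts used).

    Assumption: a trial "fails" when it raises. A trial that completes with
    reward 0.0 is treated as a valid result and is not retried.
    """
    last_error = None
    # One initial attempt plus up to max_retries further attempts.
    for attempt in range(1, max_retries + 2):
        try:
            return run_trial(), attempt
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"trial failed after {max_retries + 1} attempts") from last_error
```

With max_retries set to 2, a trial that errors twice and then passes returns its reward on the third attempt.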
Job
A job is a full benchmark run driven by a single config.yaml. Harbor runs up to n_concurrent trials in parallel and writes per-task results plus a results.json summary under artifacts/jobs/<job>/. Set n_tasks to a small integer for a smoke run, or leave it null to run the full benchmark.
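To make the roles of n_concurrent and n_tasks concrete, here is a small asyncio sketch of a job loop. It is not Harbor's scheduler: run_job, the run_trial callback, and the summary fields are hypothetical, and the sketch assumes that n_tasks selects the first N tasks, which this section does not specify.

```python
import asyncio
from typing import Awaitable, Callable, Hashable, Iterable, Optional


async def run_job(
    task_ids: Iterable[Hashable],
    run_trial: Callable[[Hashable], Awaitable[float]],
    n_concurrent: int,
    n_tasks: Optional[int] = None,
) -> dict:
    """Run one trial per selected task, at most n_concurrent at a time."""
    tasks = list(task_ids)
    # n_tasks=None means the full benchmark; an integer gives a smoke run.
    selected = tasks if n_tasks is None else tasks[:n_tasks]
    gate = asyncio.Semaphore(n_concurrent)

    async def bounded(task_id):
        async with gate:  # blocks while n_concurrent trials are in flight
            return task_id, await run_trial(task_id)

    rewards = dict(await asyncio.gather(*(bounded(t) for t in selected)))
    mean = sum(rewards.values()) / len(rewards) if rewards else 0.0
    # Hypothetical summary shape; the real results.json fields may differ.
    return {"n_trials": len(rewards), "mean_reward": mean, "rewards": rewards}
```

A caller would drive it with asyncio.run(run_job(task_ids, run_trial, n_concurrent=4, n_tasks=5)) for a five-task smoke run.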
LiteLLM proxy
eval starts a per-job LiteLLM proxy in front of your upstream model API. The proxy normalizes the endpoint so the agent can speak OpenAI- or Anthropic-compatible protocols, attaches the trajectory logger, and supports sticky routing. Each job gets its own generated config under artifacts/litellm/<job>/. Because eval only ever talks to an HTTP endpoint, the same proxy works whether the upstream is a remote API or a local vLLM checkpoint.
Agent scaffold
A scaffold is the agent harness that drives the model through a task. Three scaffolds are validated end-to-end with eval: custom-claude-code (Anthropic protocol), custom-openhands-sdk (OpenAI protocol), and custom-opencode (OpenAI-compatible). To switch scaffolds, change name, version, runtime_image, and runtime_host_path together in config.yaml. The runtime is bind-mounted from runtime_host_path, which must hold the agent runtime tree extracted from runtime_image.
Job analysis
After each eval, job analysis runs Harbor's job_analysis pipeline to produce attribution and scoring artifacts under <job_dir>/analysis/. These are the exact files the dashboard reads: failure/resolve reports, task analysis, score comparison, and instance analysis. Job analysis is on by default, non-fatal, and CPU-only; the LLM judge is off by default, so analysis costs no tokens. See Job Analysis.
Gold dataset
Job analysis needs a gold dataset — per-instance reference data (<id>/tests/config.json) under artifacts/datasets/<gold_base>/. When it is missing, analyze_job.sh auto-runs scripts/prepare_dataset.sh to build it (adapter → tagger), then proceeds; if it still cannot be produced, analysis skips cleanly instead of crashing.
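The check-build-skip flow described above can be sketched as follows. This is an illustration of the decision logic only, not the contents of analyze_job.sh. The functions gold_dataset_ready and plan_analysis are hypothetical, and the prepare callback stands in for running scripts/prepare_dataset.sh. The sketch assumes the dataset counts as present once at least one <id>/tests/config.json exists.

```python
from pathlib import Path
from typing import Callable


def gold_dataset_ready(datasets_root: Path, gold_base: str) -> bool:
    """True if at least one <id>/tests/config.json exists under the gold base."""
    base = datasets_root / gold_base
    return base.is_dir() and any(base.glob("*/tests/config.json"))


def plan_analysis(datasets_root: Path, gold_base: str, prepare: Callable[[], None]) -> str:
    """Return 'analyze' or 'skip', building the gold dataset once if missing."""
    if gold_dataset_ready(datasets_root, gold_base):
        return "analyze"
    # Hypothetical hook: stands in for the automatic dataset preparation step.
    prepare()
    # If preparation did not produce the data, skip instead of failing the job.
    return "analyze" if gold_dataset_ready(datasets_root, gold_base) else "skip"
```

Returning a status rather than raising mirrors the non-fatal behaviour: a missing gold dataset downgrades the run to "no analysis" without affecting the eval results.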
Remote API vs local checkpoint
eval supports two interchangeable backends, selected by which llm_api block is active in config.yaml:
- Mode A (remote API): point llm_api at a shared endpoint (e.g. a GLM-5 API). No local serving.
- Mode B (local checkpoint): serve an SFT/RL output with vLLM on a GPU node (scripts/serve_local_model.sh) and point llm_api at it. See Local Model.