Select Benchmark

eval is registry-driven: it does not stage tasks locally. To switch benchmarks, edit two fields, dataset_name and version, under runtime_info.input.task_source in config.yaml:

task_source:
  provider: harbor_registry
  dataset_name: swebench-verified
  version: "1.0"
  registry_path: repos/harbor/registry.json

Harbor resolves (dataset_name, version) against repos/harbor/registry.json and fetches the underlying task data automatically. No artifacts/tasks/ staging is required.
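Before launching a run, it can help to confirm that the pair you put in config.yaml actually exists in the registry. The following is a minimal sketch, not part of eval or Harbor: the helper name find_entry is hypothetical, and it assumes registry.json is a JSON list of objects with string "name" and "version" keys (the same shape the listing one-liner on this page relies on).

```python
import json
from pathlib import Path


def find_entry(registry_path, dataset_name, version):
    """Return the registry entry matching (dataset_name, version), or None.

    Hypothetical helper: assumes the registry is a JSON list of objects
    that each carry string "name" and "version" keys.
    """
    entries = json.loads(Path(registry_path).read_text(encoding="utf-8"))
    for entry in entries:
        # Versions are compared as exact strings, so "1.0" and "1.0.0"
        # are treated as different versions.
        if entry.get("name") == dataset_name and entry.get("version") == version:
            return entry
    return None
```

For example, find_entry("repos/harbor/registry.json", "swebench-verified", "1.0") returns the matching entry, or None if the pair is not in that file. Keeping version quoted in config.yaml matters here: an unquoted 1.0 is parsed by YAML as a float, not the string "1.0".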

Curated benchmarks

These are validated to run end-to-end with the agents that eval configures (e.g. custom-claude-code):

`dataset_name`	`version`	Tasks	Source
`swebench-verified`	`1.0`	500	human-validated SWE-bench
`swebench-verified-100`	`1.0`	100	100-task subset of swebench-verified
`swebench_multilingual`	`1.0`	300	multilingual SWE-bench
`swebench_multilingual-100`	`1.0`	100	random subset of swebench_multilingual (seed=42)
`swebenchpro`	`1.0`	731	SWE-bench Pro multi-language
`swebenchpro-100`	`1.0`	100	100-task subset of swebenchpro
`terminal-bench`	`2.0`	89	Terminal-Bench 2.0
`aider-polyglot`	`1.0`	225	polyglot code editing
`livecodebench`	`6.0`	100	competitive programming
`humanevalfix`	`1.0`	164	bug fixing on HumanEval programs
`bigcodebench-hard-complete`	`1.0.0`	145	function-level code completion (hard split)

Mind the version

The version field is not always 1.0: terminal-bench is 2.0, livecodebench is 6.0, and bigcodebench-hard-complete is 1.0.0. -100 subsets share the same registry entry shape as their full-set parents.

Smoke runs

For a fast run, either pick a -100 subset from the table, or cap the full benchmark with harbor_job.n_tasks:

harbor_job:
  n_tasks: 16   # null = full benchmark; an int caps the run

Other registry entries

The table above is the curated set. The full registry.json contains ~80 more entries (e.g. terminal-bench-pro, swesmith, swtbench-verified, plus non-code benchmarks like gpqa-diamond, aime, gaia, lawbench, …). The scripts accept any of them as dataset_name without code changes, but two caveats apply:

Agent compatibility is your problem. Math/MCQ/QA benchmarks (aime, gpqa-diamond, simpleqa, …) and benchmarks that ship their own agent runtime (gaia, mlgym-bench, …) generally do not work as-is with the coding agents configured here. Verify against the adapter README under Harbor's adapters/<name>/ before adding one to a production run profile.

List the full set with:

python3 -c 'import json; [print(e["name"]+"@"+e["version"]) for e in json.load(open("repos/harbor/registry.json"))]'

Pin and registry move together

Before changing meta_info.repositories.harbor.commit, confirm that the new commit's registry.json still contains your chosen (dataset_name, version) pair. The Harbor pin advances as the block tracks newer registry contents, so the set of available entries can change between commits.
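One way to run that check without moving the checkout is to read registry.json directly from the candidate commit. The sketch below is an illustration under stated assumptions: the helper names registry_at_commit and pair_survives are hypothetical, repos/harbor is assumed to be a git checkout with registry.json at its root, and the registry is assumed to be a JSON list of objects with "name" and "version" keys.

```python
import json
import subprocess


def registry_at_commit(repo_dir, commit, path_in_repo="registry.json"):
    """Load the registry as it exists at `commit`, without checking it out.

    Assumption: `repo_dir` is a git checkout and `path_in_repo` is the
    registry's path relative to the repository root.
    """
    result = subprocess.run(
        ["git", "-C", repo_dir, "show", f"{commit}:{path_in_repo}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def pair_survives(entries, dataset_name, version):
    """Return True if (dataset_name, version) is present in `entries`."""
    return any(
        e.get("name") == dataset_name and e.get("version") == version
        for e in entries
    )
```

A typical use would be pair_survives(registry_at_commit("repos/harbor", new_commit), "swebench-verified", "1.0"), where new_commit is the hash you intend to pin. If it returns False, keep the old pin or pick a pair that the new registry does contain.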
