Select Benchmark

Registry-driven benchmark selection

eval is registry-driven: it does not stage tasks locally. To switch benchmarks, edit two fields, dataset_name and version, under runtime_info.input.task_source in config.yaml:

task_source:
  provider: harbor_registry
  dataset_name: swebench-verified
  version: "1.0"
  registry_path: repos/harbor/registry.json

Harbor resolves (dataset_name, version) against repos/harbor/registry.json and fetches the underlying task data automatically. No artifacts/tasks/ staging is required.
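The lookup can be pictured with a minimal sketch. This is not Harbor's resolver: it assumes only that registry.json is a JSON list whose entries carry "name" and "version" keys (the same shape the listing one-liner on this page relies on). The function name resolve_entry is hypothetical.

```python
import json


def resolve_entry(registry_path, dataset_name, version):
    """Return the registry entry matching (dataset_name, version).

    Hypothetical helper for illustration; assumes the registry is a JSON
    list of objects with "name" and "version" keys.
    """
    with open(registry_path, encoding="utf-8") as fh:
        entries = json.load(fh)

    # Compare versions as strings so "1.0" and "1.0.0" stay distinct.
    wanted = str(version)
    for entry in entries:
        if entry["name"] == dataset_name and str(entry["version"]) == wanted:
            return entry

    # Report which versions exist for this name to make typos easy to spot.
    available = sorted(
        str(e["version"]) for e in entries if e["name"] == dataset_name
    )
    raise KeyError(
        f"{dataset_name}@{version} not in registry; "
        f"versions found for this name: {available}"
    )
```

Calling resolve_entry("repos/harbor/registry.json", "swebench-verified", "1.0") before launching a job is a cheap way to catch a mistyped name or version early.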

Curated benchmarks

These are validated to run end-to-end under the agents eval configures (e.g. custom-claude-code):

| dataset_name | version | Tasks | Source |
|---|---|---|---|
| swebench-verified | 1.0 | 500 | human-validated SWE-bench |
| swebench-verified-100 | 1.0 | 100 | 100-task subset of swebench-verified |
| swebench_multilingual | 1.0 | 300 | multilingual SWE-bench |
| swebench_multilingual-100 | 1.0 | 100 | random subset of swebench_multilingual (seed=42) |
| swebenchpro | 1.0 | 731 | SWE-bench Pro multi-language |
| swebenchpro-100 | 1.0 | 100 | 100-task subset of swebenchpro |
| terminal-bench | 2.0 | 89 | Terminal-Bench 2.0 |
| aider-polyglot | 1.0 | 225 | polyglot code editing |
| livecodebench | 6.0 | 100 | competitive programming |
| humanevalfix | 1.0 | 164 | bug fixing on HumanEval programs |
| bigcodebench-hard-complete | 1.0.0 | 145 | function-level code completion (hard split) |

Mind the version

The version field is not always 1.0: terminal-bench is 2.0, livecodebench is 6.0, and bigcodebench-hard-complete is 1.0.0. -100 subsets share the same registry entry shape as their full-set parents.

Smoke runs

For a fast run, either pick a -100 subset from the table, or cap the full benchmark with harbor_job.n_tasks:

harbor_job:
  n_tasks: 16   # null = full benchmark; an int caps the run
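The null-versus-integer behaviour of n_tasks can be sketched as below. This only illustrates the semantics stated in the comment above. Which tasks Harbor actually keeps when capping (first N, random, or otherwise) is not documented here, so taking the first N is an assumption, and cap_tasks is a hypothetical name.

```python
def cap_tasks(task_ids, n_tasks=None):
    """Return every task when n_tasks is None, else at most n_tasks of them.

    Hypothetical helper; keeping the *first* n_tasks is an assumption made
    for illustration, not documented Harbor behaviour.
    """
    tasks = list(task_ids)
    if n_tasks is None:
        return tasks  # null in YAML: run the full benchmark
    if n_tasks < 0:
        raise ValueError("n_tasks must be null or a non-negative integer")
    return tasks[:n_tasks]  # a cap larger than the benchmark is harmless
```

With terminal-bench (89 tasks), a cap of 16 would leave 16 tasks, while a cap of 100 would still leave 89.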

Other registry entries

The table above is the curated set. The full registry.json contains ~80 more entries (e.g. terminal-bench-pro, swesmith, swtbench-verified, plus non-code benchmarks like gpqa-diamond, aime, gaia, lawbench, …). The scripts accept any of them as dataset_name without code changes, but keep the following in mind:

  • Agent compatibility is your problem. Math/MCQ/QA benchmarks (aime, gpqa-diamond, simpleqa, …) and benchmarks that ship their own agent runtime (gaia, mlgym-bench, …) generally do not work as-is with the coding agents configured here. Verify against the adapter README under Harbor's adapters/<name>/ before adding one to a production run profile.

  • List the full set with:

    python3 -c 'import json; [print(e["name"]+"@"+e["version"]) for e in json.load(open("repos/harbor/registry.json"))]'

Pin and registry move together

Before changing meta_info.repositories.harbor.commit, confirm that the new commit's registry.json still contains your chosen (dataset_name, version) pair. The registry is read from the pinned Harbor checkout, so moving the pin also changes which registry entries are available.
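One way to run that check without switching the checkout is to read registry.json straight from the candidate commit. The sketch below makes several assumptions: the Harbor clone lives at repos/harbor, registry.json sits at the repository root, and entries carry "name" and "version" keys. Both function names are hypothetical.

```python
import json
import subprocess


def registry_at_commit(repo_dir, commit, path="registry.json"):
    """Return the text of `path` as it exists at `commit`, without a checkout.

    Uses `git show <commit>:<path>`; the default path assumes registry.json
    is at the repository root.
    """
    result = subprocess.run(
        ["git", "-C", repo_dir, "show", f"{commit}:{path}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def registry_has(registry_text, dataset_name, version):
    """Return True if the registry JSON text lists (dataset_name, version)."""
    entries = json.loads(registry_text)
    return any(
        e["name"] == dataset_name and str(e["version"]) == str(version)
        for e in entries
    )
```

For example, registry_has(registry_at_commit("repos/harbor", "<new-commit>"), "swebench-verified", "1.0") should return True before the pin is moved.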
