Select Benchmark
Registry-driven benchmark selection
eval is registry-driven: it does not stage tasks locally. To switch benchmarks, edit two fields, dataset_name and version, under runtime_info.input.task_source in config.yaml:

    task_source:
      provider: harbor_registry
      dataset_name: swebench-verified
      version: "1.0"
      registry_path: repos/harbor/registry.json

Harbor resolves (dataset_name, version) against repos/harbor/registry.json and fetches the underlying task data automatically. No artifacts/tasks/ staging is required.
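
The lookup described above can be pictured with a minimal sketch. It is not Harbor's actual resolver. It assumes only that registry.json is a list of objects with "name" and "version" keys, which is what the listing one-liner in this document implies. The function name find_entry and the in-memory sample data are hypothetical.

```python
from typing import Optional


def find_entry(entries: list[dict], dataset_name: str, version: str) -> Optional[dict]:
    """Return the first entry matching (dataset_name, version), or None.

    Assumption: each registry entry carries "name" and "version" string keys.
    """
    for entry in entries:
        if entry.get("name") == dataset_name and entry.get("version") == version:
            return entry
    return None


# Tiny in-memory stand-in for repos/harbor/registry.json (illustrative only).
sample_registry = [
    {"name": "swebench-verified", "version": "1.0"},
    {"name": "terminal-bench", "version": "2.0"},
]

print(find_entry(sample_registry, "terminal-bench", "2.0") is not None)  # True
print(find_entry(sample_registry, "terminal-bench", "1.0") is not None)  # False
```

A None result here corresponds to a (dataset_name, version) pair that the registry does not contain, which is the case to rule out before starting a run.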
Curated benchmarks
These are validated to run end-to-end under the agents eval configures (e.g. custom-claude-code):
| dataset_name | version | Tasks | Source |
|---|---|---|---|
| swebench-verified | 1.0 | 500 | human-validated SWE-bench |
| swebench-verified-100 | 1.0 | 100 | 100-task subset of swebench-verified |
| swebench_multilingual | 1.0 | 300 | multilingual SWE-bench |
| swebench_multilingual-100 | 1.0 | 100 | random subset of swebench_multilingual (seed=42) |
| swebenchpro | 1.0 | 731 | SWE-bench Pro multi-language |
| swebenchpro-100 | 1.0 | 100 | 100-task subset of swebenchpro |
| terminal-bench | 2.0 | 89 | Terminal-Bench 2.0 |
| aider-polyglot | 1.0 | 225 | polyglot code editing |
| livecodebench | 6.0 | 100 | competitive programming |
| humanevalfix | 1.0 | 164 | bug fixing on HumanEval programs |
| bigcodebench-hard-complete | 1.0.0 | 145 | function-level code completion (hard split) |
Mind the version
The version field is not always 1.0: terminal-bench is 2.0, livecodebench is 6.0, and bigcodebench-hard-complete is 1.0.0. -100 subsets share the same registry entry shape as their full-set parents.
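
To see which version strings a given dataset actually has, the registry entries can be grouped by name. The sketch below is illustrative and makes two assumptions: entries expose "name" and "version" keys, and versions are matched as exact strings (so "1.0" and "1.0.0" are different). The helper name versions_by_name and the sample entries are hypothetical.

```python
from collections import defaultdict


def versions_by_name(entries: list[dict]) -> dict[str, list[str]]:
    """Group version strings by dataset name, preserving registry order."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for entry in entries:
        grouped[entry["name"]].append(entry["version"])
    return dict(grouped)


# Illustrative entries only; the real data lives in repos/harbor/registry.json.
sample = [
    {"name": "livecodebench", "version": "6.0"},
    {"name": "bigcodebench-hard-complete", "version": "1.0.0"},
    {"name": "swebench-verified", "version": "1.0"},
]

print(versions_by_name(sample)["bigcodebench-hard-complete"])  # ['1.0.0']
print("1.0" == "1.0.0")  # False: compared as strings, these differ
```

Quoting the value in config.yaml, as in version: "1.0", keeps it a string. An unquoted 1.0 would be read as a float by a YAML parser.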
Smoke runs
For a fast run, either pick a -100 subset from the table, or cap the full benchmark with harbor_job.n_tasks:

    harbor_job:
      n_tasks: 16  # null = full benchmark; an int caps the run

Other registry entries
The table above is the curated set. The full registry.json contains ~80 more entries (e.g. terminal-bench-pro, swesmith, swtbench-verified, plus non-code benchmarks like gpqa-diamond, aime, gaia, lawbench, …). The scripts accept any of them as dataset_name without code changes, but two caveats apply:
- Agent compatibility is your problem. Math/MCQ/QA benchmarks (aime, gpqa-diamond, simpleqa, …) and benchmarks that ship their own agent runtime (gaia, mlgym-bench, …) generally do not work as-is with the coding agents configured here. Verify against the adapter README under Harbor's adapters/<name>/ before adding one to a production run profile.
- List the full set with:

      python3 -c 'import json; [print(e["name"]+"@"+e["version"]) for e in json.load(open("repos/harbor/registry.json"))]'
Pin and registry move together
Before changing meta_info.repositories.harbor.commit, confirm the new commit's registry.json still contains your chosen (dataset_name, version) pair. The Harbor pin moves as the block tracks newer registry contents.
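
One way to run that check is a small script pointed at the registry.json from the candidate commit. This is a sketch under the same assumption as the listing one-liner: the file is a JSON list of objects with "name" and "version" keys. The function name missing_pairs and the wanted list are hypothetical; substitute the pairs your run profiles actually use.

```python
import json
from pathlib import Path


def missing_pairs(registry_path, wanted):
    """Return the (dataset_name, version) pairs from `wanted` that are absent
    from the registry file at `registry_path`."""
    entries = json.loads(Path(registry_path).read_text(encoding="utf-8"))
    available = {(e["name"], e["version"]) for e in entries}
    return [pair for pair in wanted if pair not in available]


# Hypothetical pairs to verify; adjust to your own configuration.
wanted = [("swebench-verified", "1.0"), ("terminal-bench", "2.0")]

registry = Path("repos/harbor/registry.json")
if registry.exists():
    # An empty list means every wanted pair is still present.
    print(missing_pairs(registry, wanted))
```

Run it after checking out the new Harbor commit and before updating meta_info.repositories.harbor.commit. Any pair it prints needs a different version or dataset_name before the pin can move.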