Job Analysis

A raw eval job tells you which tasks resolved and which didn't. Job analysis turns that into the attribution and scoring artifacts the dashboard reads — failure distributions, task breakdowns, resolved-vs-unresolved metrics, and per-instance summaries.

After each eval, start.sh runs scripts/analyze_job.sh "$JOB_DIR" automatically. It is non-fatal: a failed analysis never fails a completed eval run.
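The non-fatal behavior can be pictured with a small wrapper. This is a minimal sketch, not the code in start.sh: `run_analysis_nonfatal` is a hypothetical helper name, and the log messages are illustrative.

```shell
# Sketch only: run_analysis_nonfatal is a hypothetical helper, not taken from start.sh.
# It runs the given command and always returns 0, so a failing analysis
# cannot change the exit status of the surrounding eval run.
run_analysis_nonfatal() {
  if "$@"; then
    echo "job analysis: ok"
  else
    # $? still holds the exit status of the failed command here
    status=$?
    echo "job analysis: failed (exit ${status}), continuing" >&2
  fi
  return 0
}

# Possible usage inside a launcher script:
#   run_analysis_nonfatal bash scripts/analyze_job.sh "$JOB_DIR"
```

Wrapping the call this way keeps the failure visible in the log while leaving the eval's own exit code untouched.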

What it produces

The pipeline runs Harbor's job_analysis over a completed job and writes results into <job_dir>/analysis/ — the exact layout the dashboard reads:

artifacts/jobs/<job>/analysis/
├── report_failed.json / report_resolved.json
├── report_task_analysis.json
├── traj_analysis/score_comparison.json
├── instance_analysis/{summary,correlations}.json
├── instances.jsonl
└── analysis_config.yaml
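
To check a job against the layout above, a short helper can list which files are absent. This is a sketch under assumptions: `missing_analysis_files` is a hypothetical name, the file list is copied from the tree, and a missing file is not necessarily an error (some reports may legitimately be skipped).

```shell
# Sketch: print every expected analysis artifact that is missing from a job dir.
# missing_analysis_files is a hypothetical helper; the list mirrors the tree above.
missing_analysis_files() {
  analysis_dir="$1/analysis"
  for rel in \
    report_failed.json \
    report_resolved.json \
    report_task_analysis.json \
    traj_analysis/score_comparison.json \
    instance_analysis/summary.json \
    instance_analysis/correlations.json \
    instances.jsonl \
    analysis_config.yaml
  do
    # Print the relative path only when the file does not exist
    [ -f "$analysis_dir/$rel" ] || echo "$rel"
  done
}

# Possible usage:
#   missing_analysis_files artifacts/jobs/<job>
```

Empty output means every file from the tree is present.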

Running it

# Analyze the newest job under jobs_dir (no arg), or a specific job dir
bash scripts/analyze_job.sh
bash scripts/analyze_job.sh artifacts/jobs/<job>

Safe to re-run, and safe to run on old jobs.
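
For the no-argument form, the following sketch shows one way "newest job" could be resolved. It assumes newest means most recently modified and that the jobs directory defaults to artifacts/jobs; the script itself may choose differently, and `newest_job_dir` is a hypothetical name.

```shell
# Sketch: resolve the most recently modified job directory.
# newest_job_dir is hypothetical; "newest by mtime" and the artifacts/jobs
# default are assumptions, not confirmed behavior of analyze_job.sh.
newest_job_dir() {
  jobs_dir="${1:-artifacts/jobs}"
  # ls -td lists the matching directories newest first; keep only the first one
  newest=$(ls -td "$jobs_dir"/*/ 2>/dev/null | head -n 1)
  # Fail if the jobs directory has no subdirectories
  [ -n "$newest" ] || return 1
  # Strip the trailing slash left by the */ glob
  printf '%s\n' "${newest%/}"
}

# Possible usage:
#   bash scripts/analyze_job.sh "$(newest_job_dir artifacts/jobs)"
```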

| Aspect | Behavior |
| --- | --- |
| Opt-out | Set `runtime_info.input.job_analysis.enabled: false` in `config.yaml` |
| Cost | LLM judge is off by default → pure-CPU, zero token cost. Enable with `JOB_ANALYSIS_JUDGE=1` (needs `ANTHROPIC_API_KEY`) |
| Engine | Uses `artifacts/env/harbor-uv` (has `scipy` + `pyyaml`); the read-only Harbor repo is only `cd`-ed into for imports, and all output lands in the writable job dir |
| Gold dependency | Needs a gold dataset; auto-generated when missing, skips cleanly if it can't be built (see Dataset Prep) |
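
Before enabling the judge, a preflight check can catch a missing key early. This is a sketch only: `judge_preflight` is a hypothetical helper, and only the two environment variable names come from the table above. How the script treats values other than `1` is an assumption.

```shell
# Sketch: preflight check before turning on the LLM judge.
# judge_preflight is hypothetical; only JOB_ANALYSIS_JUDGE and
# ANTHROPIC_API_KEY are names taken from the documentation.
judge_preflight() {
  # Assumption: any value other than "1" leaves the judge off
  if [ "${JOB_ANALYSIS_JUDGE:-0}" != "1" ]; then
    echo "judge disabled: CPU-only analysis"
    return 0
  fi
  # The judge needs an API key; refuse early when it is missing
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    echo "JOB_ANALYSIS_JUDGE=1 requires ANTHROPIC_API_KEY" >&2
    return 1
  fi
  echo "judge enabled"
}

# Possible usage:
#   export JOB_ANALYSIS_JUDGE=1
#   judge_preflight && bash scripts/analyze_job.sh
```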

Multilingual domain classification

task_analysis domain classification relies on the repo→domain map in Harbor's classifier.py, which is built around the Python repos of SWE-bench Verified. For multilingual repos (apache/druid, lucene, …) the domain therefore falls back to other. Any local tweak to that classifier is overwritten by update_repos.sh, so keep such changes out of the read-only repo.
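
The fallback can be illustrated with a lookup that has a default branch. This is a sketch of the general pattern, not Harbor's classifier: `classify_repo` is a hypothetical function, and the single mapped entry is a made-up placeholder rather than a real entry from classifier.py.

```shell
# Sketch of a repo-to-domain lookup with an "other" fallback.
# The mapped entry below is a made-up placeholder, NOT the contents
# of Harbor's classifier.py.
classify_repo() {
  case "$1" in
    # Hypothetical repo that the map knows about
    example-org/example-python-lib) echo "example-domain" ;;
    # Any repo missing from the map lands here
    *) echo "other" ;;
  esac
}

# Possible usage:
#   classify_repo apache/druid
```

Any repo that is absent from the map, such as a multilingual one, takes the default branch and is reported as other.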

Learn more

Dataset Prep — how the gold dataset is built and tagged
