Motivation
Why we built eval
eval runs registered evaluation benchmarks through Harbor against a configurable upstream LLM and stores per-task results and trajectories.
It is the evaluation stage of the SWE-Lego-Live pipeline, a leaf block with no downstream consumer. Where trajgen generates training data and sft/rl produce checkpoints, eval measures how well a given model performs on a chosen benchmark. It mirrors trajgen's runtime shell (a pinned Harbor checkout fronted by a per-job LiteLLM proxy) but loads its task set from Harbor's registry.json instead of a locally staged swegen export.
swegen ─▶ trajgen ─▶ sft ─▶ rl
│
▼
eval ← benchmarks any model (remote API or local checkpoint)
eval provides:
- Registry-driven benchmark selection — switch between SWE-bench Verified, Multilingual, Pro, Terminal-Bench, and a dozen other benchmarks by editing two config fields; no local task staging
- A per-job LiteLLM proxy fronting your model API with OpenAI- and Anthropic-compatible endpoints and trajectory logging
- Containerized rollouts at scale via Harbor, with configurable concurrency, retries, and timeouts
- Pluggable agent scaffolds — Claude Code, OpenHands SDK, and OpenCode validated end-to-end
- Remote API or local checkpoint evaluation — point at a shared endpoint or serve an SFT/RL output with vLLM, with no code change
- Automatic post-eval job analysis — attribution, scoring, and failure breakdowns the dashboard reads, at zero token cost by default
- A self-contained dashboard for browsing jobs, analysis reports, and agent trajectories
Where to go next
- Getting Started — install the dependencies and run your first benchmark
- Core Concepts — benchmarks, the registry, jobs, agents, and analysis
- Run Jobs — select a benchmark, start the proxy, and run the eval
- Job Analysis — post-eval attribution and scoring
- Local Model — benchmark a local checkpoint with vLLM
- Dashboard — browse jobs, reports, and trajectories