eval

Motivation

Why we built eval

eval runs registered evaluation benchmarks through Harbor against any configurable upstream LLM, and stores per-task results and trajectories.

It is the evaluation stage of the SWE-Lego-Live pipeline — a leaf block with no downstream consumer. Where trajgen generates training data and sft/rl produce checkpoints, eval measures how good any model is on a chosen benchmark. It mirrors trajgen's runtime shell (a pinned Harbor checkout fronted by a per-job LiteLLM proxy), but loads its task set from Harbor's registry.json instead of a locally-staged swegen export.

swegen ─▶ trajgen ─▶ sft ─▶ rl


                            eval   ← benchmarks any model (remote API or local checkpoint)

eval provides:

  • Registry-driven benchmark selection — switch between SWE-bench Verified, Multilingual, Pro, Terminal-Bench, and a dozen other benchmarks by editing two config fields; no local task staging
  • A per-job LiteLLM proxy fronting your model API with OpenAI- and Anthropic-compatible endpoints and trajectory logging
  • Containerized rollouts at scale via Harbor, with configurable concurrency, retries, and timeouts
  • Pluggable agent scaffolds — Claude Code, OpenHands SDK, and OpenCode validated end-to-end
  • Remote API or local checkpoint evaluation — point at a shared endpoint or serve an SFT/RL output with vLLM, with no code change
  • Automatic post-eval job analysis — attribution, scoring, and failure breakdowns the dashboard reads, at zero token cost by default
  • A self-contained dashboard for browsing jobs, analysis reports, and agent trajectories

Where to go next

  • Getting Started — install the dependencies and run your first benchmark
  • Core Concepts — benchmarks, the registry, jobs, agents, and analysis
  • Run Jobs — select a benchmark, start the proxy, and run the eval
  • Job Analysis — post-eval attribution and scoring
  • Local Model — benchmark a local checkpoint with vLLM
  • Dashboard — browse jobs, reports, and trajectories

On this page