eval

Dashboard

Browse jobs, analysis reports, and agent trajectories

eval ships a self-contained web dashboard for browsing Harbor job results, the analysis reports produced after each eval, and step-by-step agent trajectories. It is a stdlib-only Python server with a vanilla-JS frontend — no build step — styled after the LLaMA-Factory webui.

Live board: harbor-dashboard.pages.dev

What it shows

| View | Content |
| --- | --- |
| Overview | Aggregate stats across all jobs: scaffolds, datasets, models, resolve rates |
| All jobs | Sortable table of jobs with key metrics |
| Single job | Analysis reports, primary failure distributions, task breakdowns, trial-level detail |
| Compare | Side-by-side comparison of multiple jobs (shift-click jobs to add to the compare set) |
| Trajectory viewer | Step-by-step agent execution: message / tool-call / observation inspection |

It reads the analysis artifacts written under each job's artifacts/jobs/<job>/analysis/ directory (see Job Analysis), so run an eval and let its analysis complete before expecting populated reports.
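To see which jobs already have analysis artifacts before opening the dashboard, a small check along these lines may help. It is only a sketch: it assumes the artifacts/jobs/<job>/analysis/ layout described above and treats any non-empty analysis/ folder as "ready", which may not match what the dashboard itself requires. The function name analysis_status is hypothetical.

```python
from pathlib import Path


def analysis_status(jobs_root: Path) -> dict[str, bool]:
    """Map each job directory name to whether its analysis/ folder has files."""
    status: dict[str, bool] = {}
    if not jobs_root.is_dir():
        return status
    for job_dir in sorted(p for p in jobs_root.iterdir() if p.is_dir()):
        analysis_dir = job_dir / "analysis"
        # Assumption: a job counts as analyzed once analysis/ exists and is non-empty.
        status[job_dir.name] = analysis_dir.is_dir() and any(analysis_dir.iterdir())
    return status


if __name__ == "__main__":
    for job, ready in analysis_status(Path("artifacts/jobs")).items():
        print(f"{job}: {'analysis ready' if ready else 'analysis missing'}")
```

Jobs reported as missing would be expected to show empty or partial reports until their analysis finishes.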

Run it locally

cd dashboard
python3 server.py --port 8092

Open http://localhost:8092. The server auto-discovers jobs under the block's jobs directory (customizable with --jobs-dir). The only dependencies are the Python standard library and Chart.js, which is loaded from a CDN.
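When the server is started from a script, it can be useful to wait until it actually answers before opening a browser or running further steps. The helper below is a minimal sketch using only the standard library. The name wait_for_dashboard is hypothetical, and it assumes the root URL returns HTTP 200 once the server is ready.

```python
import time
import urllib.error
import urllib.request


def wait_for_dashboard(url: str, timeout_s: float = 10.0, interval_s: float = 0.25) -> bool:
    """Poll url until it answers with HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server is not accepting connections yet; retry after a pause.
        time.sleep(interval_s)
    return False


# Example (port taken from the command above):
# wait_for_dashboard("http://localhost:8092")
```

A return value of False after the timeout usually means the server did not start or is listening on a different port.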

When operating through the eval plugin, /eval:dashboard covers launching the dashboard and surfacing per-job task counts and accuracy.

Publish to Cloudflare Pages

For long-term public access, dashboard/export_static.py exports the dynamic dashboard into static JSON/HTML, and dashboard/run_cloudflare_pages_sync.sh loops the export and deploys site/ to Cloudflare Pages. The public site stays interactive (search, sort, filter, compare, charts, trajectory browsing) but updates only after each export/deploy cycle.

cd dashboard
bash run_cloudflare_pages_sync.sh

The free chunk mode (the default) groups trajectories into size-limited chunk files, so opening a job downloads no trajectory data until you click a trial, and no Cloudflare R2 is required. An optional R2 mode serves each trajectory on demand for the fastest per-trial loads. See dashboard/README.md for credentials, overrides, and the R2 setup.
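The idea behind size-limited chunks can be illustrated with a greedy grouping. This is a sketch of the general technique, not the logic in export_static.py: the function names, the chunk-NNNN.json file naming, and the 1000-byte limit are all assumptions made for the example.

```python
def chunk_by_size(sizes: dict[str, int], max_bytes: int) -> list[list[str]]:
    """Greedily group trajectory IDs so each chunk's total size stays within max_bytes.

    A single trajectory larger than max_bytes ends up in a chunk of its own.
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    current_bytes = 0
    for traj_id, size in sizes.items():
        # Close the current chunk if adding this trajectory would exceed the limit.
        if current and current_bytes + size > max_bytes:
            chunks.append(current)
            current, current_bytes = [], 0
        current.append(traj_id)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks


def build_index(chunks: list[list[str]]) -> dict[str, str]:
    """Map each trajectory ID to the (hypothetical) chunk file that holds it."""
    return {
        traj_id: f"chunk-{i:04d}.json"
        for i, chunk in enumerate(chunks)
        for traj_id in chunk
    }


sizes = {"trial-a": 400, "trial-b": 500, "trial-c": 300, "trial-d": 1200}
chunks = chunk_by_size(sizes, max_bytes=1000)
print(chunks)  # [['trial-a', 'trial-b'], ['trial-c'], ['trial-d']]
print(build_index(chunks)["trial-c"])  # chunk-0001.json
```

With an index like this, a frontend only needs to fetch the one chunk file containing the trial that was clicked, which is why nothing trajectory-related has to load up front.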

Token scope

wrangler pages deploy needs the account-level Cloudflare Pages: Edit permission. None of the built-in templates map to it exactly, so create a Custom Token with Account → Cloudflare Pages → Edit. The Account ID is separate from the token.

Temporary sharing

For quick debugging of the live dynamic server, dashboard/share_pinggy.sh opens an auto-reconnecting Pinggy tunnel to the locally-running dashboard. Free Pinggy URLs are temporary and can change on reconnect — prefer Cloudflare Pages for anything long-lived.
