Inputs & Outputs

eval is configured entirely through config.yaml. It uses config.yaml rather than a separate inputs.yaml because the file describes a full Harbor evaluation profile, not just upstream inputs. Treat config.yaml as the source of truth before launching a job.

Inputs

Name	Description	Source
`repositories`	Git remote/ref/path/read-only policy for `harbor`	external
`environment`	Harbor uv env and LiteLLM runtime versions	external
`llm_api`	Raw upstream model API used by the per-job LiteLLM proxy	external
`local_model_serving`	Checkpoint path / served name for local vLLM serving	external
`litellm_proxy`	LiteLLM config template, port, and master key	external
`task_source`	Benchmark selection (`provider: harbor_registry`, `dataset_name`, `version`, `registry_path`)	external
`harbor_job`	jobs dir, concurrency, retries, timeout multiplier, smoke cap	external
`job_analysis`	Post-eval analysis toggle and tagging endpoint	external
`agent`	Harbor agent scaffold, version, runtime image/host path, sampling controls	external
`HARBOR_EXCLUDE_TASKS`	Space-separated task IDs Harbor must skip	derived

eval is standalone — meta_info.dependencies is empty. The benchmark comes from Harbor's registry, not an upstream block, so the only values you normally fill are llm_api (or local_model_serving for a local checkpoint) and task_source.

Active runtime values

Excerpted from config.yaml:

llm_api:
  api_key: dummy-cf
  api_base_url: "http://llm10.jierungogogo.com/v1"
  model: "openai/GLM-5-FP8"
  protocols: [openai_compatible, anthropic_compatible]
  served_via: per_job_litellm_proxy
  input_cost_per_token: 0.0000021
  output_cost_per_token: 0.0000084
litellm_proxy:
  config_template: scripts/serve_llm/litellm_config.example.yaml
  port: 4101
  master_key: dummy-key-cf
task_source:
  provider: harbor_registry
  dataset_name: swebench-verified
  version: "1.0"
  registry_path: repos/harbor/registry.json
harbor_job:
  jobs_dir: artifacts/jobs
  n_concurrent: 2
  n_tasks: null               # null = full benchmark; int = smoke cap
  max_retries: 2
  timeout_multiplier: 1
job_analysis:
  enabled: true
  tag_llm:
    base_url: "http://llm10.jierungogogo.com/v1"
    model: "GLM-5-FP8"
    api_key: "dummy-cf"
agent:
  name: custom-openhands-sdk
  version: 1.14.0
  runtime_image: docker.io/jierun/c-oh-sdk-1.14.0:v0.5
  runtime_host_path: artifacts/runtime/openhands-sdk
  max_turns: 200
  temperature: 0.7

The active llm_api block determines the backend (remote API vs local vLLM checkpoint); the alternate recipe is kept commented in config.yaml. See LiteLLM Proxy and Local Model.

Outputs

eval declares its handoff contract in config.yaml → runtime_info.output:

Output	Path	Format	Consumer
`eval_results_dir`	`artifacts/jobs/`	`artifacts/jobs/<job>/<task>/{agent,evaluation}/`	none (terminal block)

The per-task trajectory is artifacts/jobs/<job>/<task>/agent/litellm-trajectory.jsonl, the aggregate summary is artifacts/jobs/<job>/results.json, and post-eval analysis lands under artifacts/jobs/<job>/analysis/. See Results & Artifacts for the full layout.

Inputs & Outputs

Inputs

Active runtime values

Outputs

On this page