Inputs & Outputs
The config-driven input/output contract
eval is configured entirely through config.yaml. It uses config.yaml rather than a separate inputs.yaml because the file describes a full Harbor evaluation profile, not just upstream inputs. Treat config.yaml as the source of truth, and review it before launching a job.
Inputs
| Name | Description | Source |
|---|---|---|
| repositories | Git remote/ref/path/read-only policy for harbor | external |
| environment | Harbor uv env and LiteLLM runtime versions | external |
| llm_api | Raw upstream model API used by the per-job LiteLLM proxy | external |
| local_model_serving | Checkpoint path / served name for local vLLM serving | external |
| litellm_proxy | LiteLLM config template, port, and master key | external |
| task_source | Benchmark selection (provider: harbor_registry, dataset_name, version, registry_path) | external |
| harbor_job | jobs dir, concurrency, retries, timeout multiplier, smoke cap | external |
| job_analysis | Post-eval analysis toggle and tagging endpoint | external |
| agent | Harbor agent scaffold, version, runtime image/host path, sampling controls | external |
| HARBOR_EXCLUDE_TASKS | Space-separated task IDs Harbor must skip | derived |
eval is standalone — meta_info.dependencies is empty. The benchmark comes from Harbor's registry, not an upstream block, so the only values you normally fill are llm_api (or local_model_serving for a local checkpoint) and task_source.
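The rule above (task_source plus either llm_api or local_model_serving) can be checked before launching a job. The following is a minimal sketch, not part of eval itself: check_eval_config is a hypothetical helper, and it assumes config.yaml has already been parsed into a plain dict by whatever YAML loader you use.

```python
def check_eval_config(cfg: dict) -> list[str]:
    """Return a list of problems found in a parsed config.yaml (hypothetical helper)."""
    problems = []
    # task_source selects the benchmark from Harbor's registry.
    if not cfg.get("task_source"):
        problems.append("task_source is missing")
    # Exactly which backend is used depends on the config; here we only
    # require that at least one of the two backend sections is present.
    if not (cfg.get("llm_api") or cfg.get("local_model_serving")):
        problems.append("need llm_api or local_model_serving")
    return problems


example = {
    "task_source": {"provider": "harbor_registry", "dataset_name": "swebench-verified"},
    "llm_api": {"model": "openai/GLM-5-FP8"},
}
print(check_eval_config(example))
```

An empty list means the two normally user-filled sections are present; it says nothing about whether the remaining sections are valid.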
Active runtime values
Excerpted from config.yaml:

    llm_api:
      api_key: dummy-cf
      api_base_url: "http://llm10.jierungogogo.com/v1"
      model: "openai/GLM-5-FP8"
      protocols: [openai_compatible, anthropic_compatible]
      served_via: per_job_litellm_proxy
      input_cost_per_token: 0.0000021
      output_cost_per_token: 0.0000084
    litellm_proxy:
      config_template: scripts/serve_llm/litellm_config.example.yaml
      port: 4101
      master_key: dummy-key-cf
    task_source:
      provider: harbor_registry
      dataset_name: swebench-verified
      version: "1.0"
      registry_path: repos/harbor/registry.json
    harbor_job:
      jobs_dir: artifacts/jobs
      n_concurrent: 2
      n_tasks: null  # null = full benchmark; int = smoke cap
      max_retries: 2
      timeout_multiplier: 1
    job_analysis:
      enabled: true
      tag_llm:
        base_url: "http://llm10.jierungogogo.com/v1"
        model: "GLM-5-FP8"
        api_key: "dummy-cf"
    agent:
      name: custom-openhands-sdk
      version: 1.14.0
      runtime_image: docker.io/jierun/c-oh-sdk-1.14.0:v0.5
      runtime_host_path: artifacts/runtime/openhands-sdk
      max_turns: 200
      temperature: 0.7

The active llm_api block determines the backend (remote API vs local vLLM checkpoint); the alternate recipe is kept commented in config.yaml. See LiteLLM Proxy and Local Model.
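The two per-token prices in llm_api allow a rough cost estimate for a run. The sketch below is illustrative only: estimate_cost is a hypothetical helper, the token counts are made up, and the currency unit is whatever the config author intended (the excerpt does not state it).

```python
# Prices copied from the llm_api excerpt in config.yaml.
INPUT_COST_PER_TOKEN = 0.0000021
OUTPUT_COST_PER_TOKEN = 0.0000084


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate spend as a linear function of token counts (hypothetical helper)."""
    return input_tokens * INPUT_COST_PER_TOKEN + output_tokens * OUTPUT_COST_PER_TOKEN


# Hypothetical usage figures: 1,000,000 input tokens and 100,000 output tokens.
print(f"{estimate_cost(1_000_000, 100_000):.2f}")  # prints 2.94
```

Output tokens are priced at four times the input rate here, so long agent trajectories with verbose completions tend to dominate the total.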
Outputs
eval declares its handoff contract in config.yaml → runtime_info.output:
| Output | Path | Format | Consumer |
|---|---|---|---|
| eval_results_dir | artifacts/jobs/ | artifacts/jobs/<job>/<task>/{agent,evaluation}/ | none (terminal block) |
The per-task trajectory is artifacts/jobs/<job>/<task>/agent/litellm-trajectory.jsonl, the aggregate summary is artifacts/jobs/<job>/results.json, and post-eval analysis lands under artifacts/jobs/<job>/analysis/. See Results & Artifacts for the full layout.
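The paths named above can be assembled programmatically when post-processing a run. This is a minimal sketch under stated assumptions: job_paths is a hypothetical helper, the job and task names in the usage line are invented, and jobs_dir is taken to be the artifacts/jobs value from the excerpt.

```python
from pathlib import PurePosixPath

# Matches harbor_job.jobs_dir in the config excerpt.
JOBS_DIR = PurePosixPath("artifacts/jobs")


def job_paths(job: str, task: str) -> dict:
    """Build the documented artifact locations for one job/task pair (hypothetical helper)."""
    job_dir = JOBS_DIR / job
    return {
        "trajectory": job_dir / task / "agent" / "litellm-trajectory.jsonl",
        "results": job_dir / "results.json",
        "analysis": job_dir / "analysis",
    }


# "job-1" and "task-a" are placeholder names, not real Harbor identifiers.
for name, path in job_paths("job-1", "task-a").items():
    print(name, path)
```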