Inputs & Outputs

The config-driven input/output contract

eval is configured entirely through config.yaml. It uses config.yaml rather than a separate inputs.yaml because the file describes a full Harbor evaluation profile, not just upstream inputs. Treat config.yaml as the source of truth before launching a job.

Inputs

| Name | Description | Source |
| --- | --- | --- |
| repositories | Git remote/ref/path/read-only policy for harbor | external |
| environment | Harbor uv env and LiteLLM runtime versions | external |
| llm_api | Raw upstream model API used by the per-job LiteLLM proxy | external |
| local_model_serving | Checkpoint path / served name for local vLLM serving | external |
| litellm_proxy | LiteLLM config template, port, and master key | external |
| task_source | Benchmark selection (provider: harbor_registry, dataset_name, version, registry_path) | external |
| harbor_job | jobs dir, concurrency, retries, timeout multiplier, smoke cap | external |
| job_analysis | Post-eval analysis toggle and tagging endpoint | external |
| agent | Harbor agent scaffold, version, runtime image/host path, sampling controls | external |
| HARBOR_EXCLUDE_TASKS | Space-separated task IDs Harbor must skip | derived |
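
Because HARBOR_EXCLUDE_TASKS is a space-separated string, it has to be split before individual task IDs can be compared. The sketch below shows one way to do that, assuming the value is available as a plain string; how eval derives and passes it is not specified here, and the function name and the sample IDs are hypothetical.

```python
from typing import List, Optional


def parse_exclude_tasks(raw: Optional[str]) -> List[str]:
    """Split a space-separated task-ID string into an ordered, de-duplicated list.

    Hypothetical helper: the name and the de-duplication step are
    illustrative assumptions, not part of eval or Harbor.
    """
    if not raw:
        return []
    ordered = {}
    # str.split() with no argument collapses runs of whitespace,
    # so stray double spaces or trailing newlines are harmless.
    for task_id in raw.split():
        ordered.setdefault(task_id, None)
    return list(ordered)


# Placeholder task IDs, for illustration only.
excluded = parse_exclude_tasks("task-a  task-b task-a\n")
print(excluded)
```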

eval is standalone: meta_info.dependencies is empty. The benchmark comes from Harbor's registry, not an upstream block, so the only values you normally fill are llm_api (or local_model_serving for a local checkpoint) and task_source.
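
A pre-launch check for these values could look like the following minimal sketch. It assumes config.yaml has already been loaded into a plain dict (for example with a YAML parser of your choice); the function name is hypothetical and eval itself may validate its config differently.

```python
from typing import List


def missing_required_sections(config: dict) -> List[str]:
    """Return the sections a user normally has to fill that are absent or empty.

    Hypothetical helper: it only mirrors the rule stated in the text
    (task_source plus either llm_api or local_model_serving).
    """
    missing = []
    if not config.get("task_source"):
        missing.append("task_source")
    # Either a remote API block or a local checkpoint block is enough.
    if not (config.get("llm_api") or config.get("local_model_serving")):
        missing.append("llm_api or local_model_serving")
    return missing


# Example with a config that has a benchmark but no model backend.
example = {"task_source": {"provider": "harbor_registry"}}
print(missing_required_sections(example))
```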

Active runtime values

Excerpted from config.yaml:

llm_api:
  api_key: dummy-cf
  api_base_url: "http://llm10.jierungogogo.com/v1"
  model: "openai/GLM-5-FP8"
  protocols: [openai_compatible, anthropic_compatible]
  served_via: per_job_litellm_proxy
  input_cost_per_token: 0.0000021
  output_cost_per_token: 0.0000084
litellm_proxy:
  config_template: scripts/serve_llm/litellm_config.example.yaml
  port: 4101
  master_key: dummy-key-cf
task_source:
  provider: harbor_registry
  dataset_name: swebench-verified
  version: "1.0"
  registry_path: repos/harbor/registry.json
harbor_job:
  jobs_dir: artifacts/jobs
  n_concurrent: 2
  n_tasks: null               # null = full benchmark; int = smoke cap
  max_retries: 2
  timeout_multiplier: 1
job_analysis:
  enabled: true
  tag_llm:
    base_url: "http://llm10.jierungogogo.com/v1"
    model: "GLM-5-FP8"
    api_key: "dummy-cf"
agent:
  name: custom-openhands-sdk
  version: 1.14.0
  runtime_image: docker.io/jierun/c-oh-sdk-1.14.0:v0.5
  runtime_host_path: artifacts/runtime/openhands-sdk
  max_turns: 200
  temperature: 0.7

The active llm_api block determines the backend (remote API vs local vLLM checkpoint); the alternate recipe is kept commented in config.yaml. See LiteLLM Proxy and Local Model.
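
The input_cost_per_token and output_cost_per_token values in the excerpt can be used for a rough budget estimate before launching a full benchmark. The sketch below assumes cost scales linearly with token counts and uses the two rates from the excerpt as defaults; the function name and the token counts in the example are hypothetical, and the currency is whatever the config's rates are denominated in.

```python
def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    input_cost_per_token: float = 0.0000021,
    output_cost_per_token: float = 0.0000084,
) -> float:
    """Linear cost estimate: tokens in each direction times the per-token rate.

    Hypothetical helper; the default rates are copied from the llm_api
    excerpt, and the linear model is an assumption.
    """
    return (
        prompt_tokens * input_cost_per_token
        + completion_tokens * output_cost_per_token
    )


# Illustrative token counts: 1,000,000 prompt and 100,000 completion tokens
# come to roughly 2.1 + 0.84 = 2.94 in the config's currency.
print(round(estimate_cost(1_000_000, 100_000), 2))
```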

Outputs

eval declares its handoff contract in config.yaml → runtime_info.output:

| Output | Path | Format | Consumer |
| --- | --- | --- | --- |
| eval_results_dir | artifacts/jobs/ | artifacts/jobs/<job>/<task>/{agent,evaluation}/ | none (terminal block) |

The per-task trajectory is artifacts/jobs/<job>/<task>/agent/litellm-trajectory.jsonl, the aggregate summary is artifacts/jobs/<job>/results.json, and post-eval analysis lands under artifacts/jobs/<job>/analysis/. See Results & Artifacts for the full layout.
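
The path layout described above can be sketched as a small helper that builds each location from a job and task name. The function name, its keys, and the example job and task names are hypothetical; the path segments themselves come from the text, and jobs_dir defaults to the harbor_job.jobs_dir value shown in the excerpt.

```python
from pathlib import PurePosixPath
from typing import Dict


def job_paths(job: str, task: str, jobs_dir: str = "artifacts/jobs") -> Dict[str, PurePosixPath]:
    """Build the documented output locations for one job/task pair.

    Hypothetical helper: it only assembles paths and does not check
    that anything exists on disk.
    """
    job_dir = PurePosixPath(jobs_dir) / job
    task_dir = job_dir / task
    return {
        "trajectory": task_dir / "agent" / "litellm-trajectory.jsonl",
        "evaluation_dir": task_dir / "evaluation",
        "results": job_dir / "results.json",
        "analysis_dir": job_dir / "analysis",
    }


# Placeholder job and task names, for illustration only.
for name, path in job_paths("example-job", "example-task").items():
    print(f"{name}: {path}")
```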
