eval

Job Analysis

Dataset Prep

Building and tagging the gold dataset analysis depends on

Job analysis needs a gold dataset — per-instance reference data at artifacts/datasets/<gold_base>/<id>/tests/config.json. When it is absent, analyze_job.sh auto-runs scripts/prepare_dataset.sh <dataset_name> to build it, then proceeds; if it still can't be produced, analysis skips cleanly instead of crashing (disable the auto-build with JOB_ANALYSIS_PREPARE_DATASET=0).

The two steps

bash scripts/prepare_dataset.sh [<dataset_name>]   # no arg → the configured dataset
  1. Adapter → gold. Runs the matching repos/harbor/adapters/<name> adapter (pulling from HuggingFace) to produce each instance's tests/config.json gold.
  2. Tagger → metadata. Runs repos/harbor/scripts/task_analysis/tag_task_metadata.py to complete each task.toml's [language, area, topic, bug_class] and difficulty tags via an LLM.

It is idempotent (skips populated datasets unless PREP_FORCE=1). Tagging is best-effort: the gold is still produced even if tagging fails — only the Language/Area breakdown is then limited.

Supported datasets

dataset_nameAdapterHuggingFace source
swebench-verifiedswebenchprinceton-nlp/SWE-bench_Verified
swebench_multilingualSWE-bench/SWE-bench_Multilingual
swebenchproScaleAI/SWE-bench_Pro

-100 subsets map to the same base. Generation needs network access (HuggingFace) and the harbor-uv env (it has datasets + swebench). No Docker — only the gold metadata files are written.

The tagging model must return clean JSON

tag_task_metadata.py POSTs to <base_url>/chat/completions and parses a JSON object, so it needs a model that returns clean JSON.

Reasoning models break tagging

A reasoning model that emits <think>… into the content field (e.g. Qwen3-8B served with thinking on) produces unparseable output, and tags fail. GLM-5-FP8 keeps reasoning in a separate field and works. Point tagging at a JSON-clean / instruct endpoint via PREP_TAG_BASE_URL / PREP_TAG_MODEL / PREP_TAG_API_KEY.

The default tagging endpoint is config.yaml → runtime_info.input.job_analysis.tag_llm:

job_analysis:
  enabled: true
  tag_llm:
    base_url: "http://llm10.jierungogogo.com/v1"
    model: "GLM-5-FP8"
    api_key: "dummy-cf"

This is deliberately separate from the eval llm_api: the model under evaluation may be a reasoning model unsuitable for tagging, so the gold-tagging step gets its own JSON-clean endpoint.

On this page