Supported agent scaffolds and the runtime bind-mount

The agent scaffold is the harness that drives the model through each task. eval validates three scaffolds end-to-end. To switch scaffolds, edit the four agent fields in config.yaml together: name, version, runtime_image, and runtime_host_path.

Supported agents

| name | Protocol | Version | Runtime image |
| --- | --- | --- | --- |
| custom-claude-code | Anthropic | 2.1.118 | docker.io/jierun/c-cc-2.1.118:v0.1 |
| custom-openhands-sdk | OpenAI | 1.14.0 | docker.io/jierun/c-oh-sdk-1.14.0:v0.5 |
| custom-opencode | OpenAI-compatible | 1.14.22 | docker.io/yjiangcm/c-oc-1.14.22:v0.2 |
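Because name, version, and runtime_image have to change together, it can help to keep the three validated combinations in one place. The following is a minimal sketch, not part of eval itself: `agent_runtime` is a hypothetical helper name, and the values are copied from the table above.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with eval): prints "version runtime_image"
# for a supported agent name, using the values from the table above.
agent_runtime() {
  case "$1" in
    custom-claude-code)   echo "2.1.118 docker.io/jierun/c-cc-2.1.118:v0.1" ;;
    custom-openhands-sdk) echo "1.14.0 docker.io/jierun/c-oh-sdk-1.14.0:v0.5" ;;
    custom-opencode)      echo "1.14.22 docker.io/yjiangcm/c-oc-1.14.22:v0.2" ;;
    *) echo "unsupported agent: $1" >&2; return 1 ;;
  esac
}
```

Calling `agent_runtime custom-openhands-sdk` prints the version and image to copy into config.yaml; an unknown name returns a non-zero status instead of silently printing nothing.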

max_turns is reused as max_iterations when the agent is openhands-sdk (same semantics). The per-job LiteLLM proxy bridges the OpenAI and Anthropic formats, so all three agents can run against the same upstream model.

The runtime bind-mount

runtime_host_path must point at a directory that already contains the extracted agent runtime tree from runtime_image. start.sh bind-mounts it into each task container.

Pre-extract the runtime

If runtime_host_path is empty, the agent falls back to an in-container install (curl https://claude.ai/install.sh for claude-code, pip for openhands-sdk), which 403s or times out on isolated networks. Bind-mount is also preferred over image-mount, which trips an overlayfs filename-too-long bug for some task images.

Pre-extract once per runtime_image. The steps are idempotent, so re-run them whenever you bump the image. Set SUBPATH to claude-code or oh-sdk to match the agent:

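# Values come from config.yaml; replace the <...> placeholders before running.
RUNTIME_IMAGE="<runtime_image from config>"
SUBPATH=oh-sdk            # or: claude-code
HOST_DIR="artifacts/runtime/<runtime_host_path basename>"
docker pull "$RUNTIME_IMAGE"
# Create a stopped container so its filesystem can be copied out.
CID=$(docker create "$RUNTIME_IMAGE")
# Start from a clean target so re-runs do not mix old and new files.
rm -rf "$HOST_DIR" && mkdir -p "$(dirname "$HOST_DIR")"
docker cp "$CID:/opt/custom-agent-runtime/$SUBPATH" "$HOST_DIR"
# Remove the temporary container once the copy is done.
docker rm "$CID"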

Agent vs benchmark compatibility

The three agents above are validated against the curated SWE / terminal benchmarks. Math/MCQ/QA benchmarks and benchmarks that ship their own agent runtime generally do not work with these coding agents; see Select Benchmark.
