Agents
Supported agent scaffolds and the runtime bind-mount
The agent scaffold is the harness that drives the model through each task. eval validates three scaffolds end-to-end. To switch scaffolds, edit the four agent fields in config.yaml together: name, version, runtime_image, and runtime_host_path.
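Because the four fields have to change together, a quick presence check can catch a partial edit. The following is a minimal sketch, not part of eval: the function name `has_agent_fields` is hypothetical, and it assumes each field appears in config.yaml as a `key: value` line at any indentation.

```shell
# Hypothetical helper: report any of the four agent fields that are absent.
# Assumes fields are written as "key: value" lines (any indentation).
has_agent_fields() {
  file=$1
  missing=0
  for key in name version runtime_image runtime_host_path; do
    # Anchor on optional leading whitespace so "name:" does not match "xname:".
    if ! grep -Eq "^[[:space:]]*${key}:" "$file"; then
      echo "missing field: $key" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `has_agent_fields config.yaml` returns non-zero and names each missing field on stderr. It only checks that the keys are present, not that their values agree with each other.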
Supported agents
| name | Protocol | Version | Runtime image |
|---|---|---|---|
| custom-claude-code | Anthropic | 2.1.118 | docker.io/jierun/c-cc-2.1.118:v0.1 |
| custom-openhands-sdk | OpenAI | 1.14.0 | docker.io/jierun/c-oh-sdk-1.14.0:v0.5 |
| custom-opencode | OpenAI-compatible | 1.14.22 | docker.io/yjiangcm/c-oc-1.14.22:v0.2 |
max_turns is reused as max_iterations when the agent is custom-openhands-sdk (the semantics are the same). The per-job LiteLLM proxy bridges the OpenAI and Anthropic formats, so any of these agents works against the same upstream model.
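When scripting a switch between agents, it can help to keep the table's values in one place. The sketch below is illustrative only: `agent_field` is a hypothetical helper, and its values are copied from the table above, so it needs updating whenever the table changes.

```shell
# Hypothetical lookup: agent_field <name> <version|runtime_image>
# Values mirror the "Supported agents" table.
agent_field() {
  case "$1:$2" in
    custom-claude-code:version)         echo "2.1.118" ;;
    custom-claude-code:runtime_image)   echo "docker.io/jierun/c-cc-2.1.118:v0.1" ;;
    custom-openhands-sdk:version)       echo "1.14.0" ;;
    custom-openhands-sdk:runtime_image) echo "docker.io/jierun/c-oh-sdk-1.14.0:v0.5" ;;
    custom-opencode:version)            echo "1.14.22" ;;
    custom-opencode:runtime_image)      echo "docker.io/yjiangcm/c-oc-1.14.22:v0.2" ;;
    *) echo "unknown agent or field: $1 $2" >&2; return 1 ;;
  esac
}
```

For example, `RUNTIME_IMAGE=$(agent_field custom-opencode runtime_image)` yields the image to pull for that agent; an unsupported name or field returns a non-zero status.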
The runtime bind-mount
runtime_host_path must point at a directory that already contains the extracted agent runtime tree from runtime_image. start.sh bind-mounts it into each task container.
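Since the directory must already be populated before start.sh mounts it, a preflight check is cheap insurance. This is a minimal sketch under that assumption; `runtime_dir_ready` is a hypothetical name, and it only checks that the directory exists and is non-empty, not that its contents match the runtime image.

```shell
# Hypothetical preflight: succeed only if the runtime directory exists
# and contains at least one entry.
runtime_dir_ready() {
  dir=$1
  if [ ! -d "$dir" ]; then
    echo "missing runtime directory: $dir" >&2
    return 1
  fi
  # ls -A includes dotfiles; empty output means the directory is empty.
  if [ -z "$(ls -A "$dir")" ]; then
    echo "runtime directory is empty: $dir" >&2
    return 1
  fi
  return 0
}
```

Run it against the value of runtime_host_path before launching a job; a non-zero status means the runtime still needs to be extracted.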
Pre-extract the runtime
If runtime_host_path is empty, the agent falls back to an in-container install (curl https://claude.ai/install.sh for claude-code, pip for openhands-sdk), which 403s or times out on isolated networks. Bind-mount is also preferred over image-mount, which trips an overlayfs filename-too-long bug for some task images.
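The choice described above depends only on whether runtime_host_path is set. The sketch below restates that rule so a wrapper script can warn early; `install_mode` is a hypothetical helper and does not reflect how start.sh is actually implemented.

```shell
# Hypothetical helper: print the install path implied by runtime_host_path.
install_mode() {
  if [ -z "$1" ]; then
    # Empty path: the agent installs itself inside the task container,
    # which needs outbound network access.
    echo "in-container-install"
  else
    echo "bind-mount"
  fi
}
```

On an isolated network, a wrapper could refuse to continue when `install_mode "$RUNTIME_HOST_PATH"` prints `in-container-install`, where `RUNTIME_HOST_PATH` is an assumed variable holding the configured value.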
Pre-extract once per runtime_image. The procedure is idempotent, so re-run it whenever you bump the image. Set SUBPATH to claude-code or oh-sdk to match the agent:
RUNTIME_IMAGE=<runtime_image from config>
SUBPATH=oh-sdk # or: claude-code
HOST_DIR=artifacts/runtime/<runtime_host_path basename>
docker pull "$RUNTIME_IMAGE"
CID=$(docker create "$RUNTIME_IMAGE")
rm -rf "$HOST_DIR" && mkdir -p "$(dirname "$HOST_DIR")"
docker cp "$CID:/opt/custom-agent-runtime/$SUBPATH" "$HOST_DIR"
docker rm "$CID"
Agent vs benchmark compatibility
The three agents above are validated against the curated SWE / terminal benchmarks. Math/MCQ/QA benchmarks and benchmarks that ship their own agent runtime generally do not work with these coding agents — see Select Benchmark.