Evaluating AI Progress on SWE-Bench Pro

As frontier AI models approach the performance ceiling of earlier benchmarks, developers are shifting from SWE-Bench Verified to SWE-Bench Pro. This update provides a clearer signal of advanced coding capabilities. To effectively monitor capabilities advancement and ground AI safety discussions, the CAIS AI Dashboard will now track model performance on SWE-Bench Pro.

SWE-Bench Pro

Released by Scale AI, SWE-Bench Pro is a more difficult coding benchmark inspired by the original SWE-Bench design. Compared to the 500 Python-only tasks in SWE-Bench Verified, SWE-Bench Pro broadens the scope to 731 tasks in its public set, spanning Python, Go, TypeScript, and JavaScript.

Methodology

To ensure a fair comparison, we test all models using a uniform setup rather than developer-specific agent scaffolds. We use the Terminus-2 agent scaffold, a terminal command-line-only agent scaffold without any specific tools. Appendix A describes the setup in detail.

We conduct these evaluations using the Harbor framework. During our initial large-scale runs, we identified a few issues within the framework and have contributed those fixes. We also released a Harbor Dataset Registry cais/swebenchpro that we used in our development. Additional implementation details are in Appendix B.

Given that infrastructure may impact coding eval results, Appendix C documents our hardware and resource setup that we used for our evaluation.

Current Results

Results of current frontier AI models on SWE-Bench Pro. Source: CAIS AI Dashboard

AI progress on SWE-Bench Pro since GPT-4o. Source: CAIS AI Dashboard

Appendix A: Validating our Scaffold

To validate our evaluation setup, we reproduced the reported GPT-5.4 numbers from OpenAI on Codex CLI Agent through Harbor (after applying all of the fixes discussed in Appendix B). We observe that GPT-5.4 models perform higher when evaluated using Codex CLI compared to the terminal command-line only agent scaffold (Terminus 2).

Model Accuracy (Terminus-2) Accuracy (Codex) Reported
gpt-5.4-xhigh 50.50% 54.70% 57.7%
gpt-5.4-mini-xhigh 38.30% 50.7% 54.4%
gpt-5.4-nano-xhigh 37.40% 52.10% 52.4%

We also compare our results with Anthropic’s reported results:

Model Accuracy (Terminus-2) Reported
Claude Opus 4.6 56.7 53.4
Claude Opus 4.5 51.4 51.6

Codex CLI vs. Bash-Only

During our evaluations, we observed performance differences when the same model is evaluated on different agent scaffolds. GPT models consistently score higher on Codex CLI than on Terminus-2's raw bash interface, with gaps ranging from 4% for GPT-5.4 to 25% for GPT-5.

Appendix B: Network and Git History Exploitations

Inside bash-based scaffolds such as mini-swe-agent and Terminus-2, the agent has unrestricted access to shell commands. Therefore, without safeguards, agents can use git commands to find existing fixes in the git history or search for the patched version of the code from upstream repositories over the network.

We compiled agent trajectories demonstrating these exploits in this trajectory collection. We patched the upstream harbor repository with Issue #1067, PR #1070, and #1077. The patched version of the dataset can be accessed through our registry at cais/swebenchpro.

Our adapter-side fixes address these exploits by:

1. Isolating git history. The Dockerfile moves .git to a hidden path (/var/lib/apt/.a8f1c), then creates a fresh git init with a single "base" commit. The agent sees a clean repo with no exploitable history. The original .git is restored only at verification time inside test.sh. The original SWE-agent exposes the full git history to the agent then only a git reset --hard <base_commit> is performed, with no history stripping.

2. Blocking Network Access. An entrypoint.sh script blocks GitHub domains (github.com, raw.githubusercontent.com, api.github.com, codeload.github.com, and objects.githubusercontent.com) via /etc/hosts at container startup.

These patches are applied at the dataset/adapter level. The official Harbor SWE-Bench Pro registry images may remain unpatched and can be vulnerable to the same exploits during evaluations. Also, our fixes may not be comprehensive, as more powerful models may find ways to game the evaluation in unintended ways.

Appendix C: Run Configurations

​​All recorded runs used 8 CPUs and 16 GB of memory per task. 731 isolated containers were built from Docker images totaling 2.3 TB. Each Docker image build had a 1,800-second (30-minute) timeout. Agent and verifier timeouts were set to 6,000 seconds (100 minutes) per task. At every batch, containers, images, and networks were pruned to save disk space and avoid intra-task contamination.

Footnotes