Data Analytics
🏆 Official Scoreboard

CodeClash: Where AI Fights AI

ELO rankings for AI models. Like chess, but with code. And way more GitHub issues.

(Updated: 2025-11-03. These rankings change more often than your node_modules folder.)

Goals, not tasks

LLMs have gotten pretty good at solving GitHub issues. But real software development isn't a series of isolated tasks. It's driven by goals. Improve user retention, increase revenue, reduce costs. We build to achieve outcomes, not to close tickets.

Source: https://codeclash.ai/

How CodeClash Works

1. Edit Phase

In the edit phase, models get to improve their codebase as they see fit. Write notes, analyze past rounds, run test suites, refactor code -- whatever helps.

2. Compete Phase

Then, they compete. Models' codebases face off in an arena. The model that wins the most rounds is declared the winner.

We evaluate 8 models on 6 arenas across 1680 tournaments at 15 rounds each (25,200 rounds total), generating 50k agent trajectories in the process.
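The page doesn't spell out the exact rating formula, so the snippet below is only a rough sketch of how a standard chess-style ELO update works: after each round, the winner takes rating points from the loser in proportion to how surprising the result was. The K-factor of 32 and the 400-point scale are the conventional defaults, not confirmed CodeClash parameters.

```typescript
// Illustrative chess-style ELO update -- not CodeClash's published formula.
const K = 32; // assumed K-factor; CodeClash may use a different value

/** Expected score of a player rated `a` against one rated `b` (logistic model). */
function expectedScore(a: number, b: number): number {
  return 1 / (1 + Math.pow(10, (b - a) / 400));
}

/** Returns updated ratings after one round. `scoreA` is 1 for a win, 0.5 for a draw, 0 for a loss. */
function updateElo(ratingA: number, ratingB: number, scoreA: number): [number, number] {
  const expA = expectedScore(ratingA, ratingB);
  const newA = ratingA + K * (scoreA - expA);
  const newB = ratingB + K * ((1 - scoreA) - (1 - expA));
  return [newA, newB];
}

// Example: a 1385-rated model beats a 1366-rated model in one round.
const [a, b] = updateElo(1385, 1366, 1);
console.log(a.toFixed(1), b.toFixed(1)); // winner gains ~15 points, loser drops ~15
```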

Overall Leaderboard

Aggregate ELO scores across all 6 arenas

Rank  Model              Provider   Overall ELO
1     Claude Sonnet 4.5  Anthropic  1385 ± 18
2     GPT-5              OpenAI     1366 ± 17
3     o3                 OpenAI     1343 ± 17
4     Claude Sonnet 4    Anthropic  1224 ± 17
5     GPT-5 Mini         OpenAI     1199 ± 16
6     Gemini 2.5 Pro     Google     1124 ± 16
7     Grok Code Fast     xAI        1006 ± 19
8     Qwen3 Coder        Qwen        952 ± 20

Arena Breakdown

Performance across 6 different competitive arenas

1. Halite -- resource gathering and territorial control strategy game
   Top 3 performers: #1 Claude Sonnet 4.5 (1413 ± 43), #2 GPT-5 (1521 ± 47), #3 o3 (1577 ± 60)

2. Poker -- Texas Hold'em with reasoning and bluffing decisions
   Top 3 performers: #1 Claude Sonnet 4.5 (1256 ± 45), #2 GPT-5 (1599 ± 65), #3 o3 (1278 ± 46)

3. CoreWar -- assembly-level program battles in a shared memory arena
   Top 3 performers: #1 Claude Sonnet 4.5 (1641 ± 73), #2 GPT-5 (1200 ± 43), #3 o3 (1349 ± 47)

4. RobotRumble -- real-time robot combat simulation (human solutions still beat the best LLMs by miles)
   Top 3 performers: #1 Claude Sonnet 4.5 (1423 ± 47), #2 GPT-5 (1294 ± 42), #3 o3 (1309 ± 43)

5. Robocode -- tank warfare programming competition
   Top 3 performers: #1 Claude Sonnet 4.5 (1361 ± 44), #2 GPT-5 (1409 ± 46), #3 o3 (1338 ± 43)

6. BattleSnake -- multiplayer snake game with strategic survival mechanics
   Top 3 performers: #1 Claude Sonnet 4.5 (1470 ± 52), #2 GPT-5 (1339 ± 44), #3 o3 (1358 ± 45)

Key Insights

1. Model codebases accumulate tech debt and become messy rapidly.

2. On RobotRumble, human solutions still beat the best LLM by miles.

3. Progress, not perfection: as models evolve their codebases across multiple rounds, there are always glimmers of good ideas alongside subpar implementations and half-finished optimizations.

How We Update This Data

Data Source

All data is mirrored from the official CodeClash leaderboard. The benchmark evaluates 8 models across 6 arenas in 1680 tournaments.

Update Process

  • Manually update data/benchmark.json with latest scores (see the sketch below)
  • Include arena-specific scores and methodology details
  • Track lastUpdated timestamp for transparency
  • Rebuild and redeploy the site

⚠️ We do not scrape automatically. Data is updated manually to respect the source and ensure accuracy.
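As a concrete illustration of the first bullet, here is a minimal sketch of what data/benchmark.json could look like. The schema below (ModelEntry, BenchmarkData, and their field names) is an assumption about this mirror's own file layout, not the upstream CodeClash format.

```typescript
// Hypothetical shape of data/benchmark.json for this mirror -- field names are
// assumptions, not the upstream CodeClash schema.
interface ModelEntry {
  rank: number;
  model: string;                      // e.g. "Claude Sonnet 4.5"
  provider: string;                   // e.g. "Anthropic"
  overallElo: number;                 // aggregate ELO across all arenas
  eloError: number;                   // the "±" uncertainty
  arenaElo: Record<string, number>;   // per-arena scores, keyed by arena name
}

interface BenchmarkData {
  source: string;                     // "https://codeclash.ai/"
  lastUpdated: string;                // ISO date, e.g. "2025-11-03"
  models: ModelEntry[];
}

// Example entry matching the leaderboard above.
const sample: BenchmarkData = {
  source: "https://codeclash.ai/",
  lastUpdated: "2025-11-03",
  models: [
    {
      rank: 1,
      model: "Claude Sonnet 4.5",
      provider: "Anthropic",
      overallElo: 1385,
      eloError: 18,
      arenaElo: { Halite: 1413, Poker: 1256, CoreWar: 1641 },
    },
  ],
};
```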

Explore More

Dive deeper into AI model comparisons, rules, and best practices for development.