Keep benchmark pages (homepage snapshots + dedicated pages) synced with external leaderboards: CodeClash for goal-oriented coding and SWE-bench for real-world GitHub issues.
What it includes
🎯 CodeClash
Source: https://codeclash.ai/ (mirrored into data/benchmark.json).
Methodology: "Goals, not tasks" — real software development is goal-driven, not isolated issue-solving.
Two-phase approach: Edit phase (models improve codebase) + Compete phase (arena battles).
Scale: 8 models across 6 arenas; 1,680 tournaments of 15 rounds each (25,200 rounds total), generating over 50k agent trajectories.
Arenas: Halite, Poker, CoreWar, RobotRumble, Robocode, BattleSnake (each testing different strategic and coding skills).
Insights: models accumulate tech debt rapidly; humans still beat the best LLMs in some arenas; a "progress over perfection" mindset prevails.
Data structure: Overall ELO + per-arena breakdown + methodology details + key insights.
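The mirrored file's exact schema isn't shown in this doc; a minimal TypeScript sketch of what data/benchmark.json could hold, with every field name assumed from the description above:

```ts
// Illustrative shape for data/benchmark.json.
// All field names here are assumptions, not a confirmed schema.
interface CodeClashEntry {
  model: string;                      // e.g. "gpt-5" (hypothetical)
  overallElo: number;                 // aggregate Elo across all arenas
  arenaElo: Record<string, number>;   // per-arena breakdown, keyed by arena name
}

interface CodeClashData {
  lastUpdated: string;    // ISO date, so viewers can judge staleness
  source: string;         // "https://codeclash.ai/"
  methodology: string;    // short "goals, not tasks" summary
  insights: string[];     // key takeaways rendered on the page
  leaderboard: CodeClashEntry[];
}
```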
🔧 SWE-bench
Source: https://www.swebench.com/ (mirrored into data/swebench.json).
Benchmark: Evaluates models on 2,294 real-world software engineering problems from 12 popular Python repositories.
Variants: Full (2,294), Verified (500, human-filtered), Lite (300, cost-efficient), Bash Only (500, mini-SWE-agent), Multimodal (517, with visuals).
Metric: % Resolved — percentage of GitHub issues successfully fixed by the model.
Real tasks: Actual issues from Django, Flask, Matplotlib, Scikit-learn, SymPy, Requests, etc.
Context understanding: Tests models' ability to navigate complex codebases and make appropriate changes.
Data structure: Rank + model + % resolved + organization + date + release version.
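Under the same caveat (field names assumed from the columns above, not a confirmed schema), a matching sketch for data/swebench.json:

```ts
// Illustrative shape for data/swebench.json; all names are assumptions.
interface SWEBenchEntry {
  rank: number;
  model: string;
  pctResolved: number;    // "% Resolved", e.g. 65.4
  organization: string;
  date: string;           // submission date, ISO format
  release: string;        // model release version
}

interface SWEBenchData {
  lastUpdated: string;
  source: string;         // "https://www.swebench.com/"
  variant: "full" | "verified" | "lite" | "bash-only" | "multimodal";
  leaderboard: SWEBenchEntry[];
}
```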
⚙️ Update Process:
To refresh CodeClash: update data/benchmark.json (or wire getBenchmarkData to live JSON) and rebuild.
To refresh SWE-bench: update data/swebench.json (or wire getSWEBenchData to live JSON) and rebuild.
Track lastUpdated + the source URL so viewers can judge staleness.
Do not scrape automatically without permission; prefer a published JSON feed or manual update.
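As a sketch of the refresh path: only getBenchmarkData and data/benchmark.json are named in this doc, so the signature, the inline type, and the validation below are assumptions about how the loader might read the mirrored file.

```ts
import { readFile } from "node:fs/promises";

// Minimal inline stand-in for the CodeClashData shape sketched earlier.
type CodeClashData = {
  lastUpdated: string;
  source: string;
  leaderboard: unknown[];
};

// Reads the mirrored CodeClash snapshot. To go live, swap the file read
// for a fetch() of a published JSON feed; never scrape HTML.
export async function getBenchmarkData(): Promise<CodeClashData> {
  const data = JSON.parse(
    await readFile("data/benchmark.json", "utf8"),
  ) as CodeClashData;

  // Fail loudly if staleness metadata is missing, since the page is
  // expected to render lastUpdated + source next to the leaderboard.
  if (!data.lastUpdated || !data.source) {
    throw new Error("benchmark.json is missing lastUpdated/source");
  }
  return data;
}
```

getSWEBenchData would follow the same pattern against data/swebench.json.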
Both benchmarks appear on the homepage as condensed leaderboards with bar charts and logos.
Full details live at /benchmark (CodeClash); SWE-bench entries link out to swebench.com.
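The chart code itself isn't part of this doc; if a helper is needed, a purely illustrative normalizer for the condensed bars could look like this:

```ts
// Map scores to percentage widths relative to the current leader, so
// the top model's bar renders full-width. Purely illustrative helper.
function barWidths(scores: number[]): number[] {
  const max = Math.max(0, ...scores);
  return scores.map((s) => (max > 0 ? (s / max) * 100 : 0));
}

// barWidths([71.3, 65.4, 60.2]) -> [100, 91.7..., 84.4...]
```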
Trigger: when asked to "update codeclash", "update swebench", or "update benchmarks", refresh the appropriate data files and redeploy.
CodeClash: goal-oriented coding benchmark with ELO ratings across competitive arenas.
SWE-bench: real-world GitHub issue resolution benchmark across Python repositories.
Scope: leaderboard content on homepage snapshots + dedicated pages; keep formatting consistent.
Data files: data/benchmark.json (CodeClash) and data/swebench.json (SWE-bench).