Measuring how well AI agents can post-train language models
Can AI agents improve the performance of base LLMs? We give each agent 4 small target LLMs, an H100 GPU, and 10 hours to post-train them.
1 The average is taken across all post-trained LLMs (Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, Gemma 3 4B IT) and benchmarks (AIME 2025, BFCL, GPQA Main, GSM8K, HumanEval). For each run, we ask a CLI agent to maximize the performance of a specific base LLM on a specific benchmark.
2 "Human Post-Trained" is not directly comparable to the rest since it usually exceeds the 10h + 1 GPU constraint.
| Rank | Method | Average Score | AIME 2025 | BFCL | GPQA Main | GSM8K | HumanEval |
|---|---|---|---|---|---|---|---|
More agents coming soon...
Time taken by each agent to complete post-training (out of 10 hours).
Different agents demonstrate varying levels of persistence; some give up well before the time limit expires.
Post-trained models are evaluated across these benchmarks to measure improvement in reasoning, knowledge, and problem-solving capabilities. We use Inspect for evaluation and respect each model's generation_config.json.
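As a sketch of what this looks like in practice (assuming the inspect_ai and inspect_evals packages; the task name, hf/local provider usage, and checkpoint path below are illustrative assumptions, not the benchmark's exact harness):

```python
import json
from pathlib import Path

from inspect_ai import eval

# Illustrative sketch (not the benchmark's exact harness): score a post-trained
# checkpoint on GSM8K via Inspect's Hugging Face provider and the inspect_evals registry.
checkpoint = Path("checkpoints/qwen3-1.7b-posttrained")  # hypothetical output directory

# Respect the model's own generation_config.json for sampling parameters.
gen_cfg = json.loads((checkpoint / "generation_config.json").read_text())

eval(
    "inspect_evals/gsm8k",                      # assumed registry name for the GSM8K task
    model="hf/local",                           # assumed: local Hugging Face checkpoint provider
    model_args={"model_path": str(checkpoint)},
    temperature=gen_cfg.get("temperature", 1.0),
    top_p=gen_cfg.get("top_p", 1.0),
)
```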
PostTrainBench measures AI R&D automation by testing whether AI agents can successfully post-train other language models. Each agent receives 4 base models (Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, and Gemma 3 4B), access to an H100 GPU, and a 10-hour time limit to improve model performance through post-training.
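For orientation only, here is a minimal sketch of the kind of supervised fine-tuning run an agent might launch inside that budget, assuming recent trl/transformers versions; the dataset, hyperparameters, and output path are placeholders, not values prescribed by the benchmark or used by any agent:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder instruction-following data; a real run would curate its own dataset.
dataset = load_dataset("trl-lib/Capybara", split="train")

config = SFTConfig(
    output_dir="checkpoints/qwen3-1.7b-posttrained",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-5,
    bf16=True,            # H100 supports bfloat16
    logging_steps=50,
)
# Note: the 10-hour wall-clock budget has to be enforced outside the trainer,
# e.g. by the agent's own timer.

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B-Base",  # one of the four target base models
    args=config,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model(config.output_dir)
```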
Agents had 3-10 hour limits. Behaviors varied significantly:
- Checking timer.sh for the remaining time

Claude found that Qwen/Qwen3-1.7B (the instruct-tuned version) works "perfectly" for function calling. However, it then explicitly acknowledged:
"However, the user specifically said to use Qwen/Qwen3-1.7B-Base. Let me re-read the user's constraint... So I must use the BASE model."
All agents showed awareness of contamination rules.
- Dataset quality > training duration: GPT-5.1-codex-max's success came from careful dataset curation, not longer training.
- Constraint awareness: Almost all agents showed an understanding of the rules and avoided contamination.
- Self-correction: Claude corrected itself and avoided reward hacking via model substitution.
- Library issues: Many errors came from library version mismatches (trl, transformers).
- Format alignment matters: For function calling, matching the exact expected output format was essential for high scores (see the sketch after this list).
- Longer traces ≠ better results: GPT-5.1-codex had the longest traces but inconsistent results; GPT-5.1-codex-max had shorter traces but better outcomes.
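As a toy illustration of the format-alignment point (the bracketed call-string format below is an assumption for illustration, not BFCL's official answer specification):

```python
# Hypothetical helper showing the idea of matching an exact function-call output format.
# The target format here is assumed for illustration, not taken from BFCL's spec.
def format_call(name: str, arguments: dict) -> str:
    """Serialize a tool call as a single bracketed, Python-style call string."""
    args = ", ".join(f"{k}={v!r}" for k, v in arguments.items())
    return f"[{name}({args})]"

# Training targets and model outputs must agree on this surface form exactly:
print(format_call("get_weather", {"city": "Paris", "unit": "celsius"}))
# -> [get_weather(city='Paris', unit='celsius')]
```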
If you found PostTrainBench useful, please cite us as:
@misc{posttrainbench_2025,
  title={PostTrainBench: Measuring AI Ability to Perform LLM Post-Training},
  author={Rank, Ben and Bhatnagar, Hardik and Bethge, Matthias and Andriushchenko, Maksym},
  year={2025}
}