PostTrainBench

Measuring how well AI agents can post-train language models

Can AI agents improve the performance of base LLMs? We give each agent 4 small target LLMs, a single H100 GPU, and 10 hours to post-train them.


Leaderboard

Rank | Method | Average Score | AIME 2025 | Arena Hard | BFCL | GPQA Main | GSM8K

* "Human Post-Trained" is not directly comparable since it exceeds the 10-hour, single-GPU constraint

Evaluation Benchmarks

Post-trained models are evaluated across these benchmarks to measure improvements in reasoning, knowledge, and problem-solving capabilities.

About

PostTrainBench measures AI R&D automation by testing whether AI agents can successfully post-train other language models. Each agent receives 4 base models (Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, and Gemma 3 4B IT), access to a single H100 GPU, and a 10-hour time limit in which to improve model performance through post-training.

Experimental Setup

  • Models: Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, Gemma 3 4B IT
  • Hardware: Single H100 GPU per agent
  • Time Limit: 10 hours per agent
  • Evaluation: Average score across 5 benchmarks
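As a concrete illustration, the leaderboard's Average Score is the unweighted mean of a method's scores on the 5 evaluation benchmarks. A minimal sketch (the function name and the example scores are illustrative, not taken from the benchmark's actual code):

```python
# Benchmark names as they appear on the leaderboard.
BENCHMARKS = ["AIME 2025", "Arena Hard", "BFCL", "GPQA Main", "GSM8K"]

def average_score(per_benchmark: dict[str, float]) -> float:
    """Unweighted mean of a method's scores over the 5 benchmarks.

    Raises if any benchmark score is missing, so partial runs
    cannot silently inflate the average.
    """
    missing = set(BENCHMARKS) - per_benchmark.keys()
    if missing:
        raise ValueError(f"missing benchmark scores: {sorted(missing)}")
    return sum(per_benchmark[b] for b in BENCHMARKS) / len(BENCHMARKS)

# Example with made-up scores (percent accuracy):
scores = {"AIME 2025": 10.0, "Arena Hard": 20.0, "BFCL": 60.0,
          "GPQA Main": 30.0, "GSM8K": 80.0}
print(average_score(scores))  # → 40.0
```

The strict missing-key check reflects that the ranking only makes sense when every method is scored on all 5 benchmarks.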

Team

Citation

If you find PostTrainBench useful, please cite it as:

@misc{posttrainbench_2025,
  title={PostTrainBench: Measuring AI Ability to Perform LLM Post-Training},
  author={Rank, Ben and Bhatnagar, Hardik and Bethge, Matthias and Andriushchenko, Maksym},
  year={2025}
}