PostTrainBench

Measuring how well AI agents can post-train language models

Can AI agents improve the performance of base LLMs? We give each agent 4 small target LLMs, a single H100 GPU, and 10 hours to post-train them.


Leaderboard

Rank | Method | Average Score | AIME 2025 | Arena Hard | BFCL | GPQA Main | GSM8K

* "Human Post-Trained" is not directly comparable since it exceeds the 10-hour, single-GPU constraint

Evaluation Benchmarks

Post-trained models are evaluated across these benchmarks to measure improvements in reasoning, knowledge, and problem-solving capabilities.

About

PostTrainBench measures AI R&D automation by testing whether AI agents can successfully post-train other language models. Each agent receives 4 base models (Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, and Gemma 3 4B IT), access to a single H100 GPU, and a 10-hour time limit in which to improve model performance through post-training.

Experimental Setup

  • Models: Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, Gemma 3 4B IT
  • Hardware: Single H100 GPU per agent
  • Time Limit: 10 hours per agent
  • Evaluation: Average score across 5 benchmarks
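As a concrete illustration, the leaderboard's Average Score is the unweighted mean of a method's scores on the 5 evaluation benchmarks. A minimal sketch (the function name and the example scores are illustrative, not taken from the benchmark's actual code):

```python
# Benchmark names as they appear on the leaderboard.
BENCHMARKS = ["AIME 2025", "Arena Hard", "BFCL", "GPQA Main", "GSM8K"]

def average_score(per_benchmark: dict[str, float]) -> float:
    """Unweighted mean of a method's scores over the 5 benchmarks.

    Raises if any benchmark score is missing, so partial runs
    cannot silently inflate the average.
    """
    missing = set(BENCHMARKS) - per_benchmark.keys()
    if missing:
        raise ValueError(f"missing benchmark scores: {sorted(missing)}")
    return sum(per_benchmark[b] for b in BENCHMARKS) / len(BENCHMARKS)

# Example with made-up scores (percent accuracy):
scores = {"AIME 2025": 10.0, "Arena Hard": 20.0, "BFCL": 60.0,
          "GPQA Main": 30.0, "GSM8K": 80.0}
print(average_score(scores))  # → 40.0
```

The strict missing-key check reflects that the ranking only makes sense when every method is scored on all 5 benchmarks.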

Team

Citation

If you find PostTrainBench useful, please cite it as:

@misc{posttrainbench_2025,
  title={PostTrainBench: Measuring AI Ability to Perform LLM Post-Training},
  author={Rank, Ben and Bhatnagar, Hardik and Bethge, Matthias and Andriushchenko, Maksym},
  year={2025}
}