Measuring how well AI agents can post-train language models
Can AI agents improve the performance of base LLMs? We give each agent four small target LLMs, an H100 GPU, and 10 hours to post-train them.
* "Human Post-Trained" is not directly comparable, since it exceeds the 10-hour, single-GPU constraint
| Rank | Method | Average Score | AIME 2025 | Arena Hard | BFCL | GPQA Main | GSM8K |
|---|---|---|---|---|---|---|---|
Post-trained models are evaluated on these benchmarks to measure improvements in reasoning, knowledge, and problem-solving capabilities.
PostTrainBench measures AI R&D automation by testing whether AI agents can successfully post-train other language models. Each agent receives four base models (Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, and Gemma 3 4B IT), access to a single H100 GPU, and a 10-hour time limit in which to improve model performance through post-training.
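As a rough illustration of how the leaderboard's Average Score column could be computed, here is a minimal Python sketch. It assumes an unweighted mean over the five listed benchmarks; the benchmark names and the example scores are placeholders, not results from the actual leaderboard.

```python
# Hypothetical sketch: Average Score as an unweighted mean over the
# five benchmarks shown in the leaderboard table (assumption).
BENCHMARKS = ["AIME 2025", "Arena Hard", "BFCL", "GPQA Main", "GSM8K"]

def average_score(scores: dict[str, float]) -> float:
    """Return the unweighted mean over the five benchmark scores."""
    missing = [b for b in BENCHMARKS if b not in scores]
    if missing:
        raise ValueError(f"missing benchmark scores: {missing}")
    return sum(scores[b] for b in BENCHMARKS) / len(BENCHMARKS)

# Example with made-up scores (not real leaderboard numbers):
example = {"AIME 2025": 20.0, "Arena Hard": 35.0, "BFCL": 60.0,
           "GPQA Main": 30.0, "GSM8K": 80.0}
print(average_score(example))  # 45.0
```

A weighted average (e.g. down-weighting saturated benchmarks like GSM8K) would be an easy variation, but equal weighting is the simplest reading of an "Average Score" column.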
If you found PostTrainBench useful, please cite us as:
```bibtex
@misc{posttrainbench_2025,
  title={PostTrainBench: Measuring AI Ability to Perform LLM Post-Training},
  author={Rank, Ben and Bhatnagar, Hardik and Bethge, Matthias and Andriushchenko, Maksym},
  year={2025}
}
```