Measuring how well AI agents can post-train language models
Can AI agents improve the performance of base LLMs? We give each agent 4 small target LLMs, an H100 GPU, and 10 hours to post-train them.
1 The average is taken across all post-trained LLMs (Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, Gemma 3 4B IT) and benchmarks (AIME 2025, BFCL, GPQA Main, GSM8K, HumanEval). For each run, we ask a CLI agent to maximize the performance of a specific base LLM on a specific benchmark.
2 "Human Post-Trained" is not directly comparable to the rest since it usually exceeds the 10h + 1 GPU constraint.
| Rank | Method | Average Score | AIME 2025 | BFCL | GPQA Main | GSM8K | HumanEval |
|---|---|---|---|---|---|---|---|
More agents coming soon...
Time taken by each agent to complete post-training (out of 10 hours).
Different agents demonstrate varying levels of persistence; some give up well before the time limit expires.
Post-trained models are evaluated across these benchmarks to measure improvement in reasoning, knowledge, and problem-solving capabilities. We use Inspect for evaluation and respect each model's generation_config.json.
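As a sketch of what this looks like in practice (assuming the inspect_ai and inspect_evals packages; the task name, hf/local provider usage, and checkpoint path below are illustrative assumptions, not the benchmark's exact harness):

```python
import json
from pathlib import Path

from inspect_ai import eval

# Illustrative sketch (not the benchmark's exact harness): score a post-trained
# checkpoint on GSM8K via Inspect's Hugging Face provider and the inspect_evals registry.
checkpoint = Path("checkpoints/qwen3-1.7b-posttrained")  # hypothetical output directory

# Respect the model's own generation_config.json for sampling parameters.
gen_cfg = json.loads((checkpoint / "generation_config.json").read_text())

eval(
    "inspect_evals/gsm8k",                      # assumed registry name for the GSM8K task
    model="hf/local",                           # assumed: local Hugging Face checkpoint provider
    model_args={"model_path": str(checkpoint)},
    temperature=gen_cfg.get("temperature", 1.0),
    top_p=gen_cfg.get("top_p", 1.0),
)
```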
PostTrainBench measures AI R&D automation by testing whether AI agents can successfully post-train other language models. Each agent receives 4 base models (Qwen 3 1.7B, Qwen 3 4B, SmolLM3-3B, and Gemma 3 4B), access to an H100 GPU, and a 10-hour time limit to improve model performance through post-training.
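For orientation only, here is a minimal sketch of the kind of supervised fine-tuning run an agent might launch inside that budget, assuming recent trl/transformers versions; the dataset, hyperparameters, and output path are placeholders, not values prescribed by the benchmark or used by any agent:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder instruction-following data; a real run would curate its own dataset.
dataset = load_dataset("trl-lib/Capybara", split="train")

config = SFTConfig(
    output_dir="checkpoints/qwen3-1.7b-posttrained",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-5,
    bf16=True,            # H100 supports bfloat16
    logging_steps=50,
)
# Note: the 10-hour wall-clock budget has to be enforced outside the trainer,
# e.g. by the agent's own timer.

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B-Base",  # one of the four target base models
    args=config,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model(config.output_dir)
```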
Agents had 3-10 hour limits. Behaviors varied significantly:
- Checking timer.sh for the remaining time

Claude found that Qwen/Qwen3-1.7B (the instruct-tuned version) works "perfectly" for function calling. However, it then explicitly acknowledged:
"However, the user specifically said to use Qwen/Qwen3-1.7B-Base. Let me re-read the user's constraint... So I must use the BASE model."
All agents showed awareness of contamination rules.
- Dataset quality > training duration: GPT-5.1-codex-max's success came from careful dataset curation, not longer training.
- Constraint awareness: Almost all agents showed an understanding of the rules and avoided contamination.
- Self-correction: Claude corrected itself and avoided reward hacking via model substitution.
- Library issues: Many errors came from library version mismatches (trl, transformers).
- Format alignment matters: For function calling, matching the exact expected output format was essential for high scores (see the sketch after this list).
- Longer traces ≠ better results: GPT-5.1-codex had the longest traces but inconsistent results; GPT-5.1-codex-max had shorter traces but better outcomes.
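As a toy illustration of the format-alignment point (the bracketed call-string format below is an assumption for illustration, not BFCL's official answer specification):

```python
# Hypothetical helper showing the idea of matching an exact function-call output format.
# The target format here is assumed for illustration, not taken from BFCL's spec.
def format_call(name: str, arguments: dict) -> str:
    """Serialize a tool call as a single bracketed, Python-style call string."""
    args = ", ".join(f"{k}={v!r}" for k, v in arguments.items())
    return f"[{name}({args})]"

# Training targets and model outputs must agree on this surface form exactly:
print(format_call("get_weather", {"city": "Paris", "unit": "celsius"}))
# -> [get_weather(city='Paris', unit='celsius')]
```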
If you found PostTrainBench useful, please cite us as:
@misc{posttrainbench_2025,
  title={PostTrainBench: Measuring AI Ability to Perform LLM Post-Training},
  author={Rank, Ben and Bhatnagar, Hardik and Bethge, Matthias and Andriushchenko, Maksym},
  year={2025}
}