
Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full context.
In this episode, OpenAI post-training researcher Josh McGrath discusses the evolution of post-training from GPT-4.1 to GPT-5.1, sharing insights from shipping thinking models and the new shopping model. McGrath explains how post-training can change model behavior by 40% where pre-training delivers compute-efficiency wins closer to 3%, and highlights the infrastructure challenges of scaling RL systems. (01:24) He explores the shift from the PPO-vs-DPO debates to RLVR and a focus on data quality, breakthroughs in token efficiency, and the interplay between long context and agent capabilities.
Josh McGrath is a post-training researcher at OpenAI who has worked on GPT-4o, o1, o3, GPT-5 thinking models, and the recently launched shopping model. He previously worked on pre-training data curation before transitioning to post-training research, where he focuses on search-related functionality and RL systems. McGrath has lived through OpenAI's complete post-training evolution from early PPO vs DPO debates to today's RLVR era.
McGrath switched from pre-training to post-training because he wanted to "change behavior by 40%" rather than make "compute efficiency wins of 3%." (01:35) This reflects the fundamental difference between these two approaches: pre-training focuses on incremental improvements to model capabilities, while post-training can dramatically alter how models behave and interact with users. The ability to fundamentally reshape model behavior through techniques like RLHF and RLVR makes post-training an incredibly powerful lever for improving user experience.
Unlike pre-training, where you're "moving tokens to many machines and getting basically a scalar from them," RL involves multiple tasks with different grading setups, creating far more infrastructure complexity. (02:11) McGrath describes staying up late troubleshooting runs that could fail because of any of numerous interconnected systems. This complexity means post-training researchers need to understand unfamiliar codebases quickly, often jumping between internal and external partner systems at 12:30 AM when something breaks.
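To make the contrast concrete, here is a minimal sketch of that idea (my own illustration, not OpenAI's infrastructure; the `RLTask` abstraction and grader names are hypothetical): each RL task carries its own prompt construction and grading setup, any of which can fail independently, whereas pre-training reduces to one scalar loss.

```python
# Hypothetical sketch: why RL post-training multiplies infrastructure surface area.
from dataclasses import dataclass
from typing import Callable

@dataclass
class RLTask:
    name: str
    build_prompt: Callable[[dict], str]   # turns a sample into a prompt
    grade: Callable[[str, dict], float]   # task-specific reward signal

def math_grader(completion: str, sample: dict) -> float:
    # Verifiable reward: exact match against a known answer.
    return 1.0 if sample["answer"] in completion else 0.0

def preference_grader(completion: str, sample: dict) -> float:
    # Learned reward model stands in for human preference (placeholder interface).
    return sample["reward_model"].score(sample["prompt"], completion)

TASKS = [
    RLTask("math", lambda s: s["problem"], math_grader),
    RLTask("chat", lambda s: s["prompt"], preference_grader),
]

def rollout_and_grade(policy, task: RLTask, sample: dict) -> float:
    # In a real system each task may need its own sandbox or grading service;
    # a failure in any one of them can stall the whole training run.
    completion = policy.generate(task.build_prompt(sample))
    return task.grade(completion, sample)
```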
McGrath emphasizes thinking in the "actual number of tokens" rather than time, because it provides a different optimization target. (13:28) From GPT-5 to 5.1, while overall evals improved modestly, token usage "went way down" - the models achieved better results with fewer tokens. This efficiency directly impacts user experience by enabling more tool calls and agent actions within a reasonable token budget, making the difference between a usable and an unusable AI application.
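As a back-of-the-envelope illustration (the budget and per-step costs below are assumed, not figures from the episode), the arithmetic shows why per-step token cost dictates how many tool calls an agent can make:

```python
# Assumed numbers for illustration only: with a fixed token budget, fewer tokens
# per reasoning + tool-call step means more agent steps fit before it runs out.
def max_agent_steps(total_budget: int, tokens_per_step: int) -> int:
    return total_budget // tokens_per_step

BUDGET = 128_000  # assumed context/token budget

verbose_steps = max_agent_steps(BUDGET, 8_000)    # 16 tool calls
efficient_steps = max_agent_steps(BUDGET, 3_000)  # 42 tool calls

print(verbose_steps, efficient_steps)
```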
McGrath argues that RLHF and RLVR "are both policy gradient methods" where "what's different is just the input data." (09:26) The real innovation isn't in optimization techniques but in the spectrum of signal quality - from human preferences (harder to verify) to mathematical correctness (easily verifiable). He highlights how GRPO from DeepSeek Math was underappreciated because it represented a shift toward more trustworthy reward signals, not just an optimization trick.
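A compact way to see the point (a hand-written REINFORCE-style sketch, not code from the episode or from OpenAI): the policy-gradient machinery is identical, and only the source of the reward changes; the GRPO helper follows the group-normalized advantage described in the DeepSeekMath paper.

```python
import torch

def policy_gradient_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """logprobs: summed token log-probs per sampled completion, shape [batch]
    rewards:  scalar reward per completion, shape [batch]"""
    advantages = rewards - rewards.mean()           # simple mean baseline
    return -(advantages.detach() * logprobs).mean()

# RLHF-style reward: a learned preference model scores the completion (noisy signal).
def rlhf_reward(reward_model, prompt: str, completion: str) -> float:
    return reward_model.score(prompt, completion)

# RLVR-style reward: a programmatic verifier checks correctness (trustworthy signal).
def rlvr_reward(verifier, prompt: str, completion: str) -> float:
    return 1.0 if verifier(prompt, completion) else 0.0

# GRPO-style advantages: normalize rewards within the group of completions
# sampled for the same prompt, rather than using a learned value baseline.
def grpo_advantages(group_rewards: torch.Tensor) -> torch.Tensor:
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-6)
```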
McGrath identifies a critical hiring challenge: the industry struggles to find people who excel at both distributed systems and ML research. (21:40) He notes that "the education system we have right now isn't really optimized for that," as frontier research requires seamlessly moving between systems work and ML work as bottlenecks shift. This hybrid skillset is essential because in frontier AI development, "you don't know which place is currently bottlenecking the frontier, and it changes all the time."