
Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full context.
In this episode, OpenAI post-training researcher Josh McGrath discusses the evolution of post-training from GPT-4.1 to GPT-5.1, sharing insights from shipping thinking models and the new shopping model. McGrath explains how post-training can change model behavior by 40% where pre-training delivers compute-efficiency wins closer to 3%, and highlights the infrastructure challenges of scaling RL systems. (01:24) He explores the shift from the PPO-vs-DPO debates to RLVR and a focus on data quality, breakthroughs in token efficiency, and the interplay between long context and agent capabilities.
Josh McGrath is a post-training researcher at OpenAI who has worked on GPT-4o, o1, o3, GPT-5 thinking models, and the recently launched shopping model. He previously worked on pre-training data curation before transitioning to post-training research, where he focuses on search-related functionality and RL systems. McGrath has lived through OpenAI's complete post-training evolution from early PPO vs DPO debates to today's RLVR era.
McGrath switched from pre-training to post-training because he wanted to "change behavior by 40%" rather than make "compute efficiency wins of 3%." (01:35) This reflects the fundamental difference between these two approaches: pre-training focuses on incremental improvements to model capabilities, while post-training can dramatically alter how models behave and interact with users. The ability to fundamentally reshape model behavior through techniques like RLHF and RLVR makes post-training an incredibly powerful lever for improving user experience.
Unlike pre-training, where you're "moving tokens to many machines and getting basically a scalar from them," RL involves multiple tasks with different grading setups, creating far more infrastructure complexity. (02:11) McGrath describes staying up late troubleshooting runs that could fail because of any of numerous interconnected systems. This complexity means post-training researchers need to understand unfamiliar codebases quickly, often jumping between internal and external partner systems at 12:30 AM when something breaks.
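To make the contrast concrete, here is a minimal sketch of that idea (my own illustration, not OpenAI's infrastructure; the `RLTask` abstraction and grader names are hypothetical): each RL task carries its own prompt construction and grading setup, any of which can fail independently, whereas pre-training reduces to one scalar loss.

```python
# Hypothetical sketch: why RL post-training multiplies infrastructure surface area.
from dataclasses import dataclass
from typing import Callable

@dataclass
class RLTask:
    name: str
    build_prompt: Callable[[dict], str]   # turns a sample into a prompt
    grade: Callable[[str, dict], float]   # task-specific reward signal

def math_grader(completion: str, sample: dict) -> float:
    # Verifiable reward: exact match against a known answer.
    return 1.0 if sample["answer"] in completion else 0.0

def preference_grader(completion: str, sample: dict) -> float:
    # Learned reward model stands in for human preference (placeholder interface).
    return sample["reward_model"].score(sample["prompt"], completion)

TASKS = [
    RLTask("math", lambda s: s["problem"], math_grader),
    RLTask("chat", lambda s: s["prompt"], preference_grader),
]

def rollout_and_grade(policy, task: RLTask, sample: dict) -> float:
    # In a real system each task may need its own sandbox or grading service;
    # a failure in any one of them can stall the whole training run.
    completion = policy.generate(task.build_prompt(sample))
    return task.grade(completion, sample)
```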
McGrath emphasizes thinking in the "actual number of tokens" rather than time, because it provides a different optimization target. (13:28) From GPT-5 to 5.1, while overall evals improved modestly, token usage "went way down" - the models achieved better results with fewer tokens. This efficiency directly impacts user experience by enabling more tool calls and agent actions within a reasonable token budget, making the difference between a usable and an unusable AI application.
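As a back-of-the-envelope illustration (the budget and per-step costs below are assumed, not figures from the episode), the arithmetic shows why per-step token cost dictates how many tool calls an agent can make:

```python
# Assumed numbers for illustration only: with a fixed token budget, fewer tokens
# per reasoning + tool-call step means more agent steps fit before it runs out.
def max_agent_steps(total_budget: int, tokens_per_step: int) -> int:
    return total_budget // tokens_per_step

BUDGET = 128_000  # assumed context/token budget

verbose_steps = max_agent_steps(BUDGET, 8_000)    # 16 tool calls
efficient_steps = max_agent_steps(BUDGET, 3_000)  # 42 tool calls

print(verbose_steps, efficient_steps)
```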
McGrath argues that RLHF and RLVR "are both policy gradient methods" where "what's different is just the input data." (09:26) The real innovation isn't in optimization techniques but in the spectrum of signal quality - from human preferences (harder to verify) to mathematical correctness (easily verifiable). He highlights how GRPO from DeepSeek Math was underappreciated because it represented a shift toward more trustworthy reward signals, not just an optimization trick.
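A compact way to see the point (a hand-written REINFORCE-style sketch, not code from the episode or from OpenAI): the policy-gradient machinery is identical, and only the source of the reward changes; the GRPO helper follows the group-normalized advantage described in the DeepSeekMath paper.

```python
import torch

def policy_gradient_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """logprobs: summed token log-probs per sampled completion, shape [batch]
    rewards:  scalar reward per completion, shape [batch]"""
    advantages = rewards - rewards.mean()           # simple mean baseline
    return -(advantages.detach() * logprobs).mean()

# RLHF-style reward: a learned preference model scores the completion (noisy signal).
def rlhf_reward(reward_model, prompt: str, completion: str) -> float:
    return reward_model.score(prompt, completion)

# RLVR-style reward: a programmatic verifier checks correctness (trustworthy signal).
def rlvr_reward(verifier, prompt: str, completion: str) -> float:
    return 1.0 if verifier(prompt, completion) else 0.0

# GRPO-style advantages: normalize rewards within the group of completions
# sampled for the same prompt, rather than using a learned value baseline.
def grpo_advantages(group_rewards: torch.Tensor) -> torch.Tensor:
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-6)
```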
McGrath identifies a critical hiring challenge: the industry struggles to find people who excel at both distributed systems and ML research. (21:40) He notes that "the education system we have right now isn't really optimized for that," as frontier research requires seamlessly moving between systems work and ML work as bottlenecks shift. This hybrid skillset is essential because in frontier AI development, "you don't know which place is currently bottlenecking the frontier, and it changes all the time."