
Timestamps are approximate and may be slightly off. We encourage you to listen to the full context.
In this episode, Yi Tay returns to discuss his journey from Reka back to Google DeepMind, where he established the Reasoning and AGI team in Singapore. The conversation covers his pivotal role in developing Gemini Deep Think and achieving IMO Gold: a major breakthrough in which an AI system competed live in the International Mathematical Olympiad and reached gold-medal performance. (12:00) Yi explains the bold decision to abandon AlphaProof's symbolic approach in favor of an end-to-end Gemini model trained with reinforcement learning.
Yi Tay is a researcher at Google DeepMind who has played key roles in major AI milestones, including shipping Gemini Deep Think and the IMO Gold result. He previously worked at Google Brain and then Reka before returning to Google DeepMind to lead model training efforts and establish the Reasoning and AGI team in Singapore. His research background spans architectures and pre-training, and he now specializes in reinforcement learning for reasoning systems, having helped scale teams from dozens to 300+ researchers while driving Gemini to top leaderboard positions across categories.
Yi emphasizes the fundamental difference between on-policy and off-policy learning. Off-policy learning is essentially imitation: copying someone else's successful trajectories, as in supervised fine-tuning on another model's outputs. On-policy learning has the model generate its own outputs, receive rewards, and train on its own experience. (05:25) This mirrors human learning, where we make mistakes and learn from feedback rather than just copying others. This philosophy drove much of the team's RL approach for reasoning systems, allowing models to develop their own problem-solving trajectories rather than simply imitating existing solutions.
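The distinction Yi draws can be sketched with a toy two-arm bandit. Everything below is illustrative and not from the episode (the function names, learning rate, and reward are invented for the sketch), but the two update rules show the structural difference: the off-policy step copies a teacher's action regardless of reward, while the on-policy step samples from the model's own policy and reinforces actions by the reward they earn.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def off_policy_step(logits, teacher_action, lr=0.5):
    """Imitation (like SFT on another model's outputs): push the policy
    toward whatever the teacher did, with no reward signal involved."""
    probs = softmax(logits)
    for a in range(len(logits)):
        target = 1.0 if a == teacher_action else 0.0
        logits[a] += lr * (target - probs[a])

def on_policy_step(logits, reward_fn, lr=0.5):
    """REINFORCE-style update: sample from the model's *own* policy,
    observe a reward, and reinforce the sampled action accordingly."""
    probs = softmax(logits)
    action = random.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)
    for a in range(len(logits)):
        indicator = 1.0 if a == action else 0.0
        logits[a] += lr * reward * (indicator - probs[a])

random.seed(0)
logits = [0.0, 0.0]
for _ in range(200):
    # Only action 1 is rewarded; the policy learns from its own rollouts.
    on_policy_step(logits, reward_fn=lambda a: 1.0 if a == 1 else 0.0)
probs = softmax(logits)
print(probs[1])  # the policy now strongly prefers the rewarded action
```

In the off-policy step the model can only ever approach the teacher's behavior; in the on-policy loop it discovers the rewarded action through its own trial and error, which is the property Yi highlights for reasoning systems.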
The team made a controversial decision to throw away AlphaProof's symbolic approach and bet everything on end-to-end Gemini with RL. (14:01) Yi's reasoning was simple: "If one model can't do it, can we get to AGI?" Rather than building specialized systems for each domain, they believed one model should eventually handle everything. This decision proved correct when they achieved IMO Gold in live competition, demonstrating that general intelligence approaches could surpass specialized symbolic systems.
Yi describes a major shift in his own usage patterns, in which AI coding became genuinely useful rather than just a curiosity. (23:36) He now runs jobs, hits bugs, pastes them into Gemini, and relaunches without even reading the fix. The model has become better than him at certain debugging tasks, a clear emergence of practical utility. This shows how capabilities can suddenly cross a threshold where tools become genuinely productive rather than just impressive demos.
Contrary to the narrative that recent progress is just "blind scaling," Yi argues that the last five years required multiple breakthrough ideas working together. (45:06) Transformers, pre-training, RL, self-consistency, and reasoning approaches all had to be invented and play well together to reach current capabilities. Each idea builds on previous work, creating a compounding effect in which new research must be compatible with existing infrastructure and techniques to succeed.
The gap between frontier labs and open source is growing because ideas compound over time: researchers keep finding new tricks that build on everything developed before. (53:46) This gives organizations with sustained research efforts a compounding advantage, as each small improvement builds on all previous work. The days of massive breakthrough papers that reset the field are becoming rarer; instead, progress comes through accumulated incremental improvements that require continuous research infrastructure.