
Timestamps are approximate and may be slightly off. We encourage you to listen to the full context.
In this episode, Yi Tay returns to discuss his journey from Reka back to Google DeepMind, where he established the Reasoning and AGI team in Singapore. The conversation covers his pivotal role in developing Gemini Deep Think and achieving IMO Gold: a major breakthrough in which an AI system competed live in the International Mathematical Olympiad and reached gold-medal performance. (12:00) Yi explains the bold decision to abandon AlphaProof's symbolic approach in favor of an end-to-end Gemini model trained with reinforcement learning.
Yi Tay is a researcher at Google DeepMind who has played key roles in major AI milestones, including shipping Gemini Deep Think and the IMO Gold result. He previously worked at Google Brain and then Reka before returning to Google DeepMind to lead model training efforts and establish the Reasoning and AGI team in Singapore. His research background spans architectures and pre-training, and he now specializes in reinforcement learning for reasoning systems, having helped scale teams from dozens to 300+ researchers while driving Gemini to top leaderboard positions across categories.
Yi emphasizes the fundamental difference between on-policy and off-policy learning. Off-policy learning is essentially imitation: copying someone else's successful trajectories, as in supervised fine-tuning on another model's outputs. On-policy learning has the model generate its own outputs, receive rewards, and train on its own experience. (05:25) This mirrors human learning, where we make mistakes and learn from feedback rather than just copying others. This philosophy drove much of the team's RL approach for reasoning systems, allowing models to develop their own problem-solving trajectories rather than simply imitating existing solutions.
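The distinction Yi draws can be sketched with a toy two-arm bandit. Everything below is illustrative and not from the episode (the function names, learning rate, and reward are invented for the sketch), but the two update rules show the structural difference: the off-policy step copies a teacher's action regardless of reward, while the on-policy step samples from the model's own policy and reinforces actions by the reward they earn.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def off_policy_step(logits, teacher_action, lr=0.5):
    """Imitation (like SFT on another model's outputs): push the policy
    toward whatever the teacher did, with no reward signal involved."""
    probs = softmax(logits)
    for a in range(len(logits)):
        target = 1.0 if a == teacher_action else 0.0
        logits[a] += lr * (target - probs[a])

def on_policy_step(logits, reward_fn, lr=0.5):
    """REINFORCE-style update: sample from the model's *own* policy,
    observe a reward, and reinforce the sampled action accordingly."""
    probs = softmax(logits)
    action = random.choices(range(len(logits)), weights=probs)[0]
    reward = reward_fn(action)
    for a in range(len(logits)):
        indicator = 1.0 if a == action else 0.0
        logits[a] += lr * reward * (indicator - probs[a])

random.seed(0)
logits = [0.0, 0.0]
for _ in range(200):
    # Only action 1 is rewarded; the policy learns from its own rollouts.
    on_policy_step(logits, reward_fn=lambda a: 1.0 if a == 1 else 0.0)
probs = softmax(logits)
print(probs[1])  # the policy now strongly prefers the rewarded action
```

In the off-policy step the model can only ever approach the teacher's behavior; in the on-policy loop it discovers the rewarded action through its own trial and error, which is the property Yi highlights for reasoning systems.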
The team made a controversial decision to throw away AlphaProof's symbolic approach and bet everything on end-to-end Gemini with RL. (14:01) Yi's reasoning was simple: "If one model can't do it, can we get to AGI?" Rather than building specialized systems for each domain, they believed one model should eventually handle everything. This decision proved correct when they achieved IMO Gold in live competition, demonstrating that general intelligence approaches could surpass specialized symbolic systems.
Yi describes a major shift in his own usage patterns, in which AI coding became genuinely useful rather than just a curiosity. (23:36) He now runs jobs, hits bugs, pastes them into Gemini, and relaunches without even reading the fix. The model has become better than him at certain debugging tasks, a clear emergence of practical utility. This shows how capabilities can suddenly cross a threshold where tools become genuinely productive rather than just impressive demos.
Contrary to the narrative that recent progress is just "blind scaling," Yi argues that the last five years required multiple breakthrough ideas working together. (45:06) Transformers, pre-training, RL, self-consistency, and reasoning approaches all had to be invented and play well together to reach current capabilities. Each idea builds on previous work, creating a compounding effect in which new research must be compatible with existing infrastructure and techniques to succeed.
The gap between frontier labs and open source is growing because ideas compound over time: researchers keep finding new tricks that build on everything developed before. (53:46) This gives organizations with sustained research efforts a compounding advantage, as each small improvement builds on all previous work. The days of massive breakthrough papers that reset the field are becoming rarer; instead, progress comes through accumulated incremental improvements that require continuous research infrastructure.