Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full context.
The RL1000 team from Princeton made headlines at NeurIPS 2025 by winning the Best Paper Award for their groundbreaking work scaling reinforcement learning networks to 1,000 layers deep—something the RL community thought impossible for over a decade. (02:24) The research challenges conventional wisdom by demonstrating that deep networks can work in RL, but only with the right architectural components and a fundamentally different objective. Rather than traditional value-based RL that maximizes rewards directly, their approach uses self-supervised representation learning where states along the same trajectory are pushed together while states from different trajectories are pushed apart. (04:44) This shift from regression-based TD errors to classification-based representation learning unlocks the scalability that has made deep learning successful in language and vision. The team discovered critical architectural ingredients including residual connections and layer normalization that, when combined with sufficient data (15M+ transitions), create a "critical depth" phenomenon where performance doesn't just improve—it multiplies dramatically.
Kevin is a recent Princeton University graduate who led this groundbreaking research project as an undergraduate. He started this work in an independent research seminar and was instrumental in discovering the critical architectural combinations that made deep RL scaling possible. This project represents one of his first significant experiences in machine learning research.
Ishaan is a Princeton researcher who collaborated closely with Kevin during the independent work seminar where this project originated. He has been actively exploring vision-language-action models and robotics applications, particularly interested in how representation learning can advance embodied AI systems.
Michał is a Princeton PhD student working on reinforcement learning, currently researching stitching in RL: generalizing by composing shorter sub-behaviors at test time. He contributed significantly to the architectural insights that made the scaling breakthrough possible.
Ben is a Princeton professor who taught the independent work seminar where this research began. Although prior failed attempts had left him skeptical that deep networks would work in RL, he was willing to back the bet. His lab focuses on deep reinforcement learning and had previously built infrastructure that made these experiments feasible.
The breakthrough came not just from making networks deeper, but from fundamentally changing the learning objective. (04:48) Instead of traditional value-based RL, which regresses on noisy, biased TD targets, their approach learns representations in which states along the same trajectory are pushed together while states from different trajectories are pushed apart. This turns RL from a regression problem into a classification problem, leveraging cross-entropy loss and representation learning, the same scalable paradigms that work in language and vision. The key insight is that their code doesn't even have a line saying "maximize rewards": it's pure self-supervised representation learning. (08:17)
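To make that shift concrete, here is a minimal sketch in JAX of an in-batch contrastive loss, an illustration of the idea described above rather than the team's actual code: each anchor state's representation is scored against a batch of future-state representations, the matching pair from the same trajectory is treated as the correct class, and the loss is plain cross-entropy. The function name, dot-product similarity, and temperature are assumptions.

```python
# Minimal sketch (not the authors' code): in-batch contrastive classification.
# Row i of `anchor_repr` and row i of `future_repr` come from the same
# trajectory (positive pair); every other pairing in the batch is a negative.
import jax.numpy as jnp
from jax.nn import log_softmax


def contrastive_loss(anchor_repr, future_repr, temperature=1.0):
    """anchor_repr, future_repr: [batch, dim] encoder outputs."""
    logits = anchor_repr @ future_repr.T / temperature   # [batch, batch] similarity scores
    log_probs = log_softmax(logits, axis=-1)              # classify "which future state is mine"
    # Cross-entropy with the positives on the diagonal; note there is no
    # reward term anywhere in this objective.
    return -jnp.mean(jnp.diagonal(log_probs))
```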
Simply making networks deeper initially degraded performance; it was the combination of specific architectural components that unlocked dramatic gains. (05:25) The team found that residual connections, layer normalization, and sufficient training data (15M+ transitions) create a "critical depth" at which performance doesn't improve gradually but jumps by large multiples. This wasn't a hyperparameter-tuning exercise in which each individual change helped; the breakthrough required the precise combination of depth, these architectural components, and data scale.
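A hedged sketch of what those ingredients can look like in code, using Flax: a pre-norm residual block (layer normalization, then dense layers, added back to the input) stacked many times to form a deep encoder. The module names, the swish activation, and the default sizes are illustrative assumptions, not the team's exact architecture.

```python
# Illustrative encoder combining the ingredients above: residual connections,
# layer normalization, and large depth. Not the paper's exact architecture.
import flax.linen as nn


class ResidualBlock(nn.Module):
    width: int

    @nn.compact
    def __call__(self, x):
        h = nn.LayerNorm()(x)               # layer normalization before the dense layers
        h = nn.Dense(self.width)(h)
        h = nn.swish(h)
        h = nn.Dense(x.shape[-1])(h)
        return x + h                        # residual connection


class DeepEncoder(nn.Module):
    depth: int = 1000                       # depth only pays off past the data threshold
    width: int = 256
    out_dim: int = 64

    @nn.compact
    def __call__(self, x):
        x = nn.Dense(self.width)(x)
        for _ in range(self.depth):
            x = ResidualBlock(width=self.width)(x)
        return nn.Dense(self.out_dim)(x)
```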
When comparing scaling strategies, parameter count grows linearly with depth but quadratically with width, making depth the more parameter-efficient and sample-efficient axis for the same performance gains. (13:57) The team's experiments showed that scaling depth produced steeper performance curves than scaling width at approximately the same parameter count. This has practical implications for resource-constrained settings where you want maximum performance per parameter.
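A quick back-of-the-envelope check of that parameter-count claim, using a plain MLP with arbitrary layer sizes (exact counts depend on the actual architecture):

```python
# Rough MLP parameter count: doubling depth roughly doubles parameters,
# while doubling width roughly quadruples them.
def mlp_params(depth, width, in_dim=64, out_dim=64):
    input_layer = in_dim * width + width
    hidden_layers = (depth - 1) * (width * width + width)
    output_layer = width * out_dim + out_dim
    return input_layer + hidden_layers + output_layer


base = mlp_params(depth=4, width=256)
print(mlp_params(depth=8, width=256) / base)   # ~2.1x parameters for 2x depth
print(mlp_params(depth=4, width=512) / base)   # ~3.7x parameters for 2x width
```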
The breakthrough was enabled by JAX-based GPU-accelerated environments that can collect thousands of trajectories in parallel, generating hundreds of millions of transitions in just hours. (16:10) This data abundance was crucial because the performance improvements only emerged after crossing 15M+ transitions. Unlike traditional RL where data collection is the bottleneck, this infrastructure makes forward passes through deep networks the limiting factor, fundamentally changing the economics of RL training.
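The pattern behind that infrastructure can be sketched in a few lines of JAX: a pure environment step function is vmapped over thousands of parallel environments and scanned over time, so an entire batch of trajectories comes out of one compiled call. The `env_step` dynamics, the array shapes, and the sizes in the comments are placeholders, not the actual simulator.

```python
# Sketch of GPU-parallel data collection: vmap over environments, scan over time.
# `env_step` is a stand-in for a real JAX-native simulator step.
import jax
import jax.numpy as jnp


def env_step(state, action):
    # Placeholder dynamics; any pure JAX function of (state, action) works here.
    next_state = state + 0.1 * action
    reward = -jnp.sum(next_state ** 2)
    return next_state, reward


def rollout(initial_states, actions):
    """initial_states: [num_envs, dim]; actions: [horizon, num_envs, dim]."""
    batched_step = jax.vmap(env_step)        # step every environment in parallel

    def scan_fn(states, actions_t):
        next_states, rewards = batched_step(states, actions_t)
        return next_states, (next_states, rewards)

    _, trajectories = jax.lax.scan(scan_fn, initial_states, actions)
    return trajectories                      # (states, rewards), each [horizon, num_envs, ...]


# e.g. 4096 environments x 1000 steps gives roughly 4M transitions per jitted call.
rollout_jit = jax.jit(rollout)
```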
Traditional RL doesn't benefit from larger batch sizes because networks are too small to exploit the signal, but once you scale depth, batch size becomes another effective scaling dimension. (22:50) The team hypothesizes that previous RL approaches couldn't leverage larger batches because they lacked sufficient network capacity. Their deep networks unlock this additional axis of scaling, similar to how language models benefit from both larger architectures and larger batch sizes simultaneously.