Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full context.
The RL1000 team from Princeton made headlines at NeurIPS 2025 by winning the Best Paper Award for their groundbreaking work scaling reinforcement learning networks to 1,000 layers deep—something the RL community thought impossible for over a decade. (02:24) The research challenges conventional wisdom by demonstrating that deep networks can work in RL, but only with the right architectural components and a fundamentally different objective. Rather than traditional value-based RL that maximizes rewards directly, their approach uses self-supervised representation learning where states along the same trajectory are pushed together while states from different trajectories are pushed apart. (04:44) This shift from regression-based TD errors to classification-based representation learning unlocks the scalability that has made deep learning successful in language and vision. The team discovered critical architectural ingredients including residual connections and layer normalization that, when combined with sufficient data (15M+ transitions), create a "critical depth" phenomenon where performance doesn't just improve—it multiplies dramatically.
Kevin is a recent Princeton University graduate who led this groundbreaking research project as an undergraduate. He started this work in an independent research seminar and was instrumental in discovering the critical architectural combinations that made deep RL scaling possible. This project represents one of his first significant experiences in machine learning research.
Ishaan is a Princeton researcher who collaborated closely with Kevin during the independent work seminar where this project originated. He has been actively exploring vision-language-action models and robotics applications, particularly interested in how representation learning can advance embodied AI systems.
Michał is a Princeton PhD student working on reinforcement learning, currently researching stitching in RL: generalizing by composing shorter sub-behaviors at test time. He contributed significantly to the architectural insights that made the scaling breakthrough possible.
Ben is a Princeton professor who taught the independent work seminar where this research began. Although prior failed attempts had left him skeptical that deep networks would work in RL, he was willing to back the bet. His lab focuses on deep reinforcement learning and had previously built infrastructure that made these experiments feasible.
The breakthrough came not just from making networks deeper, but from fundamentally changing the learning objective. (04:48) Instead of traditional value-based RL, which regresses on noisy, biased TD targets, their approach learns representations in which states along the same trajectory are pushed together while states from different trajectories are pushed apart. This turns RL from a regression problem into a classification problem, leveraging cross-entropy loss and representation learning, the same scalable paradigms that work in language and vision. The key insight is that their code doesn't even have a line saying "maximize rewards": it's pure self-supervised representation learning. (08:17)
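To make that shift concrete, here is a minimal sketch in JAX of an in-batch contrastive loss, an illustration of the idea described above rather than the team's actual code: each anchor state's representation is scored against a batch of future-state representations, the matching pair from the same trajectory is treated as the correct class, and the loss is plain cross-entropy. The function name, dot-product similarity, and temperature are assumptions.

```python
# Minimal sketch (not the authors' code): in-batch contrastive classification.
# Row i of `anchor_repr` and row i of `future_repr` come from the same
# trajectory (positive pair); every other pairing in the batch is a negative.
import jax.numpy as jnp
from jax.nn import log_softmax


def contrastive_loss(anchor_repr, future_repr, temperature=1.0):
    """anchor_repr, future_repr: [batch, dim] encoder outputs."""
    logits = anchor_repr @ future_repr.T / temperature   # [batch, batch] similarity scores
    log_probs = log_softmax(logits, axis=-1)              # classify "which future state is mine"
    # Cross-entropy with the positives on the diagonal; note there is no
    # reward term anywhere in this objective.
    return -jnp.mean(jnp.diagonal(log_probs))
```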
Simply making networks deeper initially degraded performance; it was the combination of specific architectural components that unlocked dramatic gains. (05:25) The team found that residual connections, layer normalization, and sufficient training data (15M+ transitions) create a "critical depth" at which performance doesn't improve gradually but jumps by large multiples. This wasn't a hyperparameter-tuning exercise in which each individual change helped; the breakthrough required the precise combination of depth, these architectural components, and data scale.
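A hedged sketch of what those ingredients can look like in code, using Flax: a pre-norm residual block (layer normalization, then dense layers, added back to the input) stacked many times to form a deep encoder. The module names, the swish activation, and the default sizes are illustrative assumptions, not the team's exact architecture.

```python
# Illustrative encoder combining the ingredients above: residual connections,
# layer normalization, and large depth. Not the paper's exact architecture.
import flax.linen as nn


class ResidualBlock(nn.Module):
    width: int

    @nn.compact
    def __call__(self, x):
        h = nn.LayerNorm()(x)               # layer normalization before the dense layers
        h = nn.Dense(self.width)(h)
        h = nn.swish(h)
        h = nn.Dense(x.shape[-1])(h)
        return x + h                        # residual connection


class DeepEncoder(nn.Module):
    depth: int = 1000                       # depth only pays off past the data threshold
    width: int = 256
    out_dim: int = 64

    @nn.compact
    def __call__(self, x):
        x = nn.Dense(self.width)(x)
        for _ in range(self.depth):
            x = ResidualBlock(width=self.width)(x)
        return nn.Dense(self.out_dim)(x)
```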
When comparing scaling strategies, parameter count grows linearly with depth but quadratically with width, making depth the more parameter-efficient and sample-efficient axis for the same performance gains. (13:57) The team's experiments showed that scaling depth produced steeper performance curves than scaling width at approximately the same parameter count. This has practical implications for resource-constrained settings where you want maximum performance per parameter.
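A quick back-of-the-envelope check of that parameter-count claim, using a plain MLP with arbitrary layer sizes (exact counts depend on the actual architecture):

```python
# Rough MLP parameter count: doubling depth roughly doubles parameters,
# while doubling width roughly quadruples them.
def mlp_params(depth, width, in_dim=64, out_dim=64):
    input_layer = in_dim * width + width
    hidden_layers = (depth - 1) * (width * width + width)
    output_layer = width * out_dim + out_dim
    return input_layer + hidden_layers + output_layer


base = mlp_params(depth=4, width=256)
print(mlp_params(depth=8, width=256) / base)   # ~2.1x parameters for 2x depth
print(mlp_params(depth=4, width=512) / base)   # ~3.7x parameters for 2x width
```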
The breakthrough was enabled by JAX-based GPU-accelerated environments that can collect thousands of trajectories in parallel, generating hundreds of millions of transitions in just hours. (16:10) This data abundance was crucial because the performance improvements only emerged after crossing 15M+ transitions. Unlike traditional RL where data collection is the bottleneck, this infrastructure makes forward passes through deep networks the limiting factor, fundamentally changing the economics of RL training.
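The pattern behind that infrastructure can be sketched in a few lines of JAX: a pure environment step function is vmapped over thousands of parallel environments and scanned over time, so an entire batch of trajectories comes out of one compiled call. The `env_step` dynamics, the array shapes, and the sizes in the comments are placeholders, not the actual simulator.

```python
# Sketch of GPU-parallel data collection: vmap over environments, scan over time.
# `env_step` is a stand-in for a real JAX-native simulator step.
import jax
import jax.numpy as jnp


def env_step(state, action):
    # Placeholder dynamics; any pure JAX function of (state, action) works here.
    next_state = state + 0.1 * action
    reward = -jnp.sum(next_state ** 2)
    return next_state, reward


def rollout(initial_states, actions):
    """initial_states: [num_envs, dim]; actions: [horizon, num_envs, dim]."""
    batched_step = jax.vmap(env_step)        # step every environment in parallel

    def scan_fn(states, actions_t):
        next_states, rewards = batched_step(states, actions_t)
        return next_states, (next_states, rewards)

    _, trajectories = jax.lax.scan(scan_fn, initial_states, actions)
    return trajectories                      # (states, rewards), each [horizon, num_envs, ...]


# e.g. 4096 environments x 1000 steps gives roughly 4M transitions per jitted call.
rollout_jit = jax.jit(rollout)
```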
Traditional RL doesn't benefit from larger batch sizes because networks are too small to exploit the signal, but once you scale depth, batch size becomes another effective scaling dimension. (22:50) The team hypothesizes that previous RL approaches couldn't leverage larger batches because they lacked sufficient network capacity. Their deep networks unlock this additional axis of scaling, similar to how language models benefit from both larger architectures and larger batch sizes simultaneously.