
Timestamps are approximate and may be slightly off. We encourage you to listen to the episode for full context.
This podcast episode features World Labs co-founders Fei-Fei Li and Justin Johnson discussing their journey from pioneering computer vision breakthroughs to building spatial intelligence systems. The conversation explores the evolution from ImageNet's foundational work in 2009 to today's generative AI revolution, highlighting why the next chapter of AI isn't about better language models but about understanding the 3D world as fundamentally as we understand text. Li and Johnson explain how spatial intelligence differs from current multimodal AI approaches, which still operate on one-dimensional token sequences despite processing pixels. (17:00)
Co-founder of World Labs and renowned AI pioneer who led the creation of ImageNet, the foundational dataset that unlocked modern computer vision. A Stanford professor and former director of the Stanford AI Lab, Li has spent more than two decades advancing visual intelligence research and is considered one of the most influential figures in the development of deep learning for computer vision.
Co-founder of World Labs and former Stanford PhD student under Fei-Fei Li, Johnson made significant contributions to neural style transfer and early generative AI work. His research spans from real-time artistic style transfer to text-to-image generation using GANs, representing the evolution from discriminative to generative computer vision models.
General Partner at Andreessen Horowitz (a16z) who hosts this conversation. Casado brings a systems and infrastructure perspective to AI discussions, having previously founded networking company Nicira which was acquired by VMware for $1.26 billion.
Current multimodal language models fundamentally operate on one-dimensional token sequences, even when processing images and videos. Li and Johnson argue that true spatial intelligence requires native 3D representation at the algorithmic core. (25:16) This isn't just about processing pixels differently: it's about understanding that the 3D world follows the laws of physics and has inherent structure that can't be captured through 1D sequences. Johnson explains that while you can theoretically model 3D projections with 2D representations, putting a 3D representation at the heart of the model creates better affordances for spatial tasks like moving objects or cameras in virtual environments.
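The "affordances" point can be made concrete with a toy illustration (this is an assumed sketch for intuition, not World Labs' actual method): when a scene is held as explicit 3D points, moving the camera is a single rigid transform applied to every point, whereas a scene flattened into a 1D token sequence has no such direct handle and would need to be re-generated.

```python
import math

def rotate_y(point, angle):
    """Rotate a 3D point about the y-axis (a camera-orbit primitive)."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)

def move_camera(scene, angle):
    """'Moving the camera' over a native 3D scene is just one transform
    applied to each point -- an edit a flat token sequence can't express
    without regenerating the whole sequence."""
    return [rotate_y(p, angle) for p in scene]

# A tiny scene: three points on the coordinate axes.
scene = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
rotated = move_camera(scene, math.pi / 2)  # orbit the camera 90 degrees
```

The design point is that the edit is local and cheap precisely because the representation matches the structure of the task.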
Li highlights a critical but underappreciated development in computer vision: the merger of reconstruction (understanding existing 3D scenes) and generation (creating new content). (23:52) This convergence means that whether you see something real or imagine something new, both can lead to generating 3D content. This represents a fundamental shift from the traditional separation between analyzing existing visual data and creating new visual content, enabling more seamless transitions between understanding and creating spatial environments.
Li's experience with ImageNet demonstrates that letting data drive models can unleash unprecedented power. (06:59) The field was focused on sophisticated algorithms like Bayesian models and kernel methods, but Li's lab realized that scaling data to internet-scale (millions rather than thousands of images) was the missing ingredient. This insight led to the "crazy bet" on ImageNet, which ultimately enabled the deep learning revolution in computer vision and showed that sometimes the breakthrough comes from thinking about data scale rather than algorithmic complexity.
Johnson illustrates the dramatic compute improvements by comparing AlexNet's 2012 training (six days on two GTX 580s) to what would take under five minutes on a single modern GB200 chip. (08:49) This roughly thousand-fold improvement in compute accessibility has democratized AI research and enabled new algorithmic approaches like NeRF (Neural Radiance Fields) that can train 3D models in hours on a single GPU, allowing academic researchers to contribute to cutting-edge 3D computer vision research without massive infrastructure investments.
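The "roughly thousand-fold" figure can be sanity-checked with back-of-the-envelope arithmetic (the six-day and five-minute numbers are taken from the conversation, not independent benchmarks):

```python
# AlexNet (2012): ~6 days of training on two GTX 580s.
alexnet_minutes = 6 * 24 * 60   # 6 days expressed in minutes = 8640

# Johnson's estimate for a single modern GB200: under 5 minutes.
gb200_minutes = 5               # upper-bound estimate from the episode

speedup = alexnet_minutes / gb200_minutes
print(f"~{speedup:.0f}x faster wall-clock")  # ~1728x, on the order of a thousand-fold
```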
Unlike language models that work with purely human-generated signals, spatial intelligence must interface with the physical 3D world, which follows the laws of physics and has inherent material properties. (25:46) This creates unique opportunities for applications spanning virtual world generation, augmented reality interfaces, and robotics. Li emphasizes that spatial intelligence serves as the essential connection between a robot's digital brain and its physical environment, making it fundamental for any agent operating in the real world.