
Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full context.
Fei-Fei Li and Justin Johnson of World Labs join the podcast to discuss their groundbreaking work on spatial intelligence and their newly launched Marble model—the first system that generates explorable 3D worlds from text or images. (01:34) The conversation explores why spatial intelligence represents a fundamental shift beyond language models, examining how humans naturally integrate visual and spatial reasoning in ways that current AI cannot replicate. (25:50) They delve into the technical architecture of their world models, the role of physics in 3D generation, and how their approach differs from traditional video generation by creating true spatial representations rather than frame-by-frame sequences.
Fei-Fei Li is a Stanford professor and co-director of the Stanford Institute for Human-Centered Artificial Intelligence, as well as co-founder of World Labs. She created ImageNet, the dataset that sparked the deep learning revolution and fundamentally changed computer vision. (01:15) She has been advocating for proper resourcing of academic AI research and worked with policymakers on the National AI Research Resource (NAIRR) bill during the first Trump administration.
Justin Johnson is Fei-Fei's former PhD student, ex-professor at the University of Michigan, former Meta researcher, and now co-founder of World Labs. (01:24) He joined Fei-Fei's lab in 2012, the same quarter that AlexNet was released, and has focused extensively on 3D vision, computer graphics, and generative modeling throughout his career. He was instrumental in early image captioning research and dense captioning work that combined computer vision with natural language processing.
Fei-Fei Li emphasizes that spatial intelligence isn't competing with traditional language models but represents a complementary form of intelligence. (44:04) Drawing from psychologist Howard Gardner's theory of multiple intelligences, she explains that human intelligence encompasses linguistic, spatial, logical, and emotional dimensions. Spatial intelligence specifically enables reasoning, understanding, movement, and interaction in space—capabilities that are nearly impossible to reduce to pure language. The example of DNA structure discovery illustrates how spatial reasoning of molecules and chemical bonds in 3D space led to the double helix conjecture, a process that couldn't be achieved through language alone. For professionals, this means recognizing that different types of problems require different cognitive approaches, and future AI systems will need to integrate multiple forms of intelligence rather than relying solely on language processing.
The conversation reveals a critical limitation in current generative models: they excel at pattern matching but lack causal understanding of physical laws. (23:56) When discussing a Harvard paper about orbital prediction, the hosts noted that while an LLM could generate visually accurate planetary orbits, it failed to understand the underlying force vectors driving those movements. Justin and Fei-Fei acknowledge this challenge in their own work—while Marble can generate beautiful architectural structures like arches, the model doesn't necessarily understand the engineering principles that make those structures stable. This represents a fundamental gap between pattern fitting and true causal reasoning. For professionals working with AI-generated content, this means understanding the difference between visually plausible outputs and physically accurate ones, especially in applications where structural integrity or real-world implementation matters.
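The orbital example above hinges on the difference between fitting a curve to past positions and applying the underlying force law. A minimal sketch of the causal side, using a toy two-body integrator in arbitrary units (GM = 1; all names here are illustrative, not from the episode):

```python
import numpy as np

GM = 1.0  # gravitational parameter in arbitrary units

def accel(r):
    """Inverse-square gravitational acceleration: the causal force law."""
    return -GM * r / np.linalg.norm(r) ** 3

def step(r, v, dt=1e-3):
    # Semi-implicit (symplectic) Euler: update velocity from the force
    # vector first, then position. The trajectory follows from the law,
    # not from a pattern fitted to previously observed frames.
    v = v + accel(r) * dt
    r = r + v * dt
    return r, v

# A circular orbit at radius 1 requires speed sqrt(GM / r) = 1.
r, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for _ in range(10_000):
    r, v = step(r, v)

# The radius stays near 1 because the force law enforces it at every step;
# a purely visual model has no such constraint and can drift arbitrarily.
print(np.linalg.norm(r))
```

The contrast is the point: a generative model that merely reproduces plausible-looking ellipses has nothing playing the role of `accel`, which is why it can render an orbit without understanding the force vectors that produce it.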
Justin Johnson provides a crucial insight that transformers are fundamentally set models rather than sequence models, despite their common application to sequential data. (57:19) The only component that injects order into transformers is the positional embedding—otherwise, all operations (FFN, QKV projections, attention mechanisms) are either token-wise or permutation equivariant. This architectural insight suggests vast unexplored possibilities for applying transformers to non-sequential data structures. For world models, this means the same fundamental architecture that powers language models can be adapted to handle 3D spatial data, particle systems, or other non-sequential representations. Professionals should recognize that the success of transformers extends far beyond language, opening opportunities to apply these architectures to entirely new domains and data types.
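The permutation-equivariance claim is easy to verify directly: with no positional embedding, shuffling the input tokens of an attention layer just shuffles its output rows the same way. A small self-contained check (random weights standing in for trained ones):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # model dimension

# Token-wise QKV projections; note there is no positional embedding anywhere.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X):
    """Single-head self-attention over a set of tokens (rows of X)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

X = rng.standard_normal((5, d))   # 5 tokens
perm = rng.permutation(5)

# Permuting the inputs permutes the outputs identically: a set operation.
assert np.allclose(attention(X[perm]), attention(X)[perm])
```

Because the FFN is applied token-wise, the same property holds for a full transformer block minus positional embeddings, which is exactly what makes the architecture applicable to unordered data like point clouds or particle systems.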
The discussion highlights a crucial division of labor between academia and industry in the current AI landscape. (10:14) Justin argues that academia shouldn't try to compete with industry on training the largest models, as the compute requirements have grown exponentially beyond what academic labs can afford. Instead, academic researchers should focus on experimental algorithms, new architectures, theoretical understanding, and interdisciplinary "blue sky" problems. The example of exploring new computational primitives beyond matrix multiplication for future distributed systems illustrates this approach. Academic researchers can explore ideas that might not pay off for years but could revolutionize the field. For professionals in research settings, this suggests being strategic about choosing problems that leverage academic strengths rather than trying to compete directly with well-funded industry labs.
Marble's ability to provide precise camera control and real-time editing stems from its use of Gaussian splats rather than traditional video generation approaches. (34:57) Unlike frame-by-frame video models, Marble represents scenes as Gaussian splats, particle-like 3D primitives that can be rendered in real time on mobile devices, enabling users to navigate and edit 3D scenes with precise control. This architectural choice enables capabilities like recording camera movements with exact positioning and making interactive edits to scenes. The broader principle here is that the choice of data representation fundamentally determines what interactions become possible. For professionals designing AI systems, this demonstrates the importance of choosing representations that enable the desired end-user experiences rather than simply optimizing for generation quality.
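To make the representation concrete: a 3D Gaussian splat is typically stored as a position, an orientation and per-axis scale (which together define a covariance), a color, and an opacity. A minimal sketch of such a primitive; the field layout here is illustrative and not Marble's actual internal format, which is not described in the episode:

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class GaussianSplat:
    # Hypothetical minimal fields; production systems typically store more,
    # e.g. spherical-harmonic color coefficients for view dependence.
    mean: np.ndarray      # (3,) center position in world space
    scale: np.ndarray     # (3,) per-axis extent of the Gaussian
    rotation: np.ndarray  # (4,) unit quaternion (w, x, y, z)
    color: np.ndarray     # (3,) RGB
    opacity: float        # alpha in [0, 1]

def covariance(g: GaussianSplat) -> np.ndarray:
    """Build the 3x3 covariance R S S^T R^T from quaternion and scale."""
    w, x, y, z = g.rotation
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    S = np.diag(g.scale)
    return R @ S @ S.T @ R.T
```

Because each splat is an explicit, independently addressable object in 3D space, a renderer can draw the set from any camera pose and an editor can move, recolor, or delete individual splats, which is precisely the kind of interaction a frame-by-frame video representation cannot offer.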