
Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full context.
Fei-Fei Li and Justin Johnson of World Labs join the podcast to discuss their groundbreaking work on spatial intelligence and their newly launched Marble model—the first system that generates explorable 3D worlds from text or images. (01:34) The conversation explores why spatial intelligence represents a fundamental shift beyond language models, examining how humans naturally integrate visual and spatial reasoning in ways that current AI cannot replicate. (25:50) They delve into the technical architecture of their world models, the role of physics in 3D generation, and how their approach differs from traditional video generation by creating true spatial representations rather than frame-by-frame sequences.
Fei-Fei Li is a Stanford professor and co-director of the Stanford Institute for Human-Centered Artificial Intelligence, as well as co-founder of World Labs. She created ImageNet, the dataset that sparked the deep learning revolution and fundamentally changed computer vision. (01:15) She has been advocating for proper resourcing of academic AI research and worked with policymakers on the National AI Research Resource (NAIRR) bill during the first Trump administration.
Justin Johnson is Fei-Fei's former PhD student, ex-professor at the University of Michigan, former Meta researcher, and now co-founder of World Labs. (01:24) He joined Fei-Fei's lab in 2012, the same quarter that AlexNet was released, and has focused extensively on 3D vision, computer graphics, and generative modeling throughout his career. He was instrumental in early image captioning research and dense captioning work that combined computer vision with natural language processing.
Fei-Fei Li emphasizes that spatial intelligence isn't competing with traditional language models but represents a complementary form of intelligence. (44:04) Drawing from psychologist Howard Gardner's theory of multiple intelligences, she explains that human intelligence encompasses linguistic, spatial, logical, and emotional dimensions. Spatial intelligence specifically enables reasoning, understanding, movement, and interaction in space—capabilities that are nearly impossible to reduce to pure language. The example of DNA structure discovery illustrates how spatial reasoning of molecules and chemical bonds in 3D space led to the double helix conjecture, a process that couldn't be achieved through language alone. For professionals, this means recognizing that different types of problems require different cognitive approaches, and future AI systems will need to integrate multiple forms of intelligence rather than relying solely on language processing.
The conversation reveals a critical limitation in current generative models: they excel at pattern matching but lack causal understanding of physical laws. (23:56) When discussing a Harvard paper about orbital prediction, the hosts noted that while an LLM could generate visually accurate planetary orbits, it failed to understand the underlying force vectors driving those movements. Justin and Fei-Fei acknowledge this challenge in their own work—while Marble can generate beautiful architectural structures like arches, the model doesn't necessarily understand the engineering principles that make those structures stable. This represents a fundamental gap between pattern fitting and true causal reasoning. For professionals working with AI-generated content, this means understanding the difference between visually plausible outputs and physically accurate ones, especially in applications where structural integrity or real-world implementation matters.
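The orbital example above hinges on the difference between fitting a curve to past positions and applying the underlying force law. A minimal sketch of the causal side, using a toy two-body integrator in arbitrary units (GM = 1; all names here are illustrative, not from the episode):

```python
import numpy as np

GM = 1.0  # gravitational parameter in arbitrary units

def accel(r):
    """Inverse-square gravitational acceleration: the causal force law."""
    return -GM * r / np.linalg.norm(r) ** 3

def step(r, v, dt=1e-3):
    # Semi-implicit (symplectic) Euler: update velocity from the force
    # vector first, then position. The trajectory follows from the law,
    # not from a pattern fitted to previously observed frames.
    v = v + accel(r) * dt
    r = r + v * dt
    return r, v

# A circular orbit at radius 1 requires speed sqrt(GM / r) = 1.
r, v = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for _ in range(10_000):
    r, v = step(r, v)

# The radius stays near 1 because the force law enforces it at every step;
# a purely visual model has no such constraint and can drift arbitrarily.
print(np.linalg.norm(r))
```

The contrast is the point: a generative model that merely reproduces plausible-looking ellipses has nothing playing the role of `accel`, which is why it can render an orbit without understanding the force vectors that produce it.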
Justin Johnson provides a crucial insight that transformers are fundamentally set models rather than sequence models, despite their common application to sequential data. (57:19) The only component that injects order into transformers is the positional embedding—otherwise, all operations (FFN, QKV projections, attention mechanisms) are either token-wise or permutation equivariant. This architectural insight suggests vast unexplored possibilities for applying transformers to non-sequential data structures. For world models, this means the same fundamental architecture that powers language models can be adapted to handle 3D spatial data, particle systems, or other non-sequential representations. Professionals should recognize that the success of transformers extends far beyond language, opening opportunities to apply these architectures to entirely new domains and data types.
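The permutation-equivariance claim is easy to verify directly: with no positional embedding, shuffling the input tokens of an attention layer just shuffles its output rows the same way. A small self-contained check (random weights standing in for trained ones):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # model dimension

# Token-wise QKV projections; note there is no positional embedding anywhere.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X):
    """Single-head self-attention over a set of tokens (rows of X)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V

X = rng.standard_normal((5, d))   # 5 tokens
perm = rng.permutation(5)

# Permuting the inputs permutes the outputs identically: a set operation.
assert np.allclose(attention(X[perm]), attention(X)[perm])
```

Because the FFN is applied token-wise, the same property holds for a full transformer block minus positional embeddings, which is exactly what makes the architecture applicable to unordered data like point clouds or particle systems.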
The discussion highlights a crucial division of labor between academia and industry in the current AI landscape. (10:14) Justin argues that academia shouldn't try to compete with industry on training the largest models, as the compute requirements have grown exponentially beyond what academic labs can afford. Instead, academic researchers should focus on experimental algorithms, new architectures, theoretical understanding, and interdisciplinary "blue sky" problems. The example of exploring new computational primitives beyond matrix multiplication for future distributed systems illustrates this approach. Academic researchers can explore ideas that might not pay off for years but could revolutionize the field. For professionals in research settings, this suggests being strategic about choosing problems that leverage academic strengths rather than trying to compete directly with well-funded industry labs.
Marble's ability to provide precise camera control and real-time editing stems from its use of Gaussian splats rather than traditional video generation approaches. (34:57) Unlike frame-by-frame video models, Marble represents scenes as Gaussian splats, particle-like 3D primitives that can be rendered in real time on mobile devices, enabling users to navigate and edit 3D scenes with precise control. This architectural choice enables capabilities like recording camera movements with exact positioning and making interactive edits to scenes. The broader principle here is that the choice of data representation fundamentally determines what interactions become possible. For professionals designing AI systems, this demonstrates the importance of choosing representations that enable the desired end-user experiences rather than simply optimizing for generation quality.
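To make the representation concrete: a 3D Gaussian splat is typically stored as a position, an orientation and per-axis scale (which together define a covariance), a color, and an opacity. A minimal sketch of such a primitive; the field layout here is illustrative and not Marble's actual internal format, which is not described in the episode:

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class GaussianSplat:
    # Hypothetical minimal fields; production systems typically store more,
    # e.g. spherical-harmonic color coefficients for view dependence.
    mean: np.ndarray      # (3,) center position in world space
    scale: np.ndarray     # (3,) per-axis extent of the Gaussian
    rotation: np.ndarray  # (4,) unit quaternion (w, x, y, z)
    color: np.ndarray     # (3,) RGB
    opacity: float        # alpha in [0, 1]

def covariance(g: GaussianSplat) -> np.ndarray:
    """Build the 3x3 covariance R S S^T R^T from quaternion and scale."""
    w, x, y, z = g.rotation
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    S = np.diag(g.scale)
    return R @ S @ S.T @ R.T
```

Because each splat is an explicit, independently addressable object in 3D space, a renderer can draw the set from any camera pose and an editor can move, recolor, or delete individual splats, which is precisely the kind of interaction a frame-by-frame video representation cannot offer.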