Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full episode for context.
In this fascinating episode of Latent Space, hosts Alessio Fanelli and Shawn Wang sit down with Fei-Fei Li and Justin Johnson, the powerhouse duo behind World Labs and their groundbreaking spatial intelligence model, Marble. (00:37) The conversation explores their journey from Stanford research to building the world's first publicly available generative 3D world model, diving deep into the technical architecture, use cases, and philosophical implications of spatial intelligence as the next frontier beyond language models.
Co-founder and CEO of World Labs, Fei-Fei Li is also a professor of computer science at Stanford University and founding co-director of Stanford's Institute for Human-Centered AI (HAI). She led the creation of ImageNet, the dataset that helped launch the deep learning revolution, and has been instrumental in advancing computer vision research for over a decade.
Co-founder of World Labs, Justin Johnson was formerly a professor at the University of Michigan and a research scientist at Meta. As one of Fei-Fei's former PhD students at Stanford, he made significant contributions to early vision-language work, including dense captioning and image captioning research that bridged computer vision and natural language processing.
Fei-Fei emphasizes that spatial intelligence should be viewed as complementary to linguistic intelligence rather than a replacement. (42:42) She draws from psychologist Howard Gardner's concept of multiple intelligences, explaining that human intelligence encompasses linguistic, spatial, logical, and emotional dimensions. The ability to reason, understand, move, and interact in space represents a fundamental form of intelligence that language struggles to capture efficiently. For example, the process of grasping a mug involves complex spatial reasoning about geometry, affordance points, and 3D positioning that would be nearly impossible to describe adequately through language alone. This insight suggests that the future of AI lies not in choosing between modalities but in building multimodal systems that leverage the strengths of each intelligence type.
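To make the mug example concrete (this sketch is ours, not from the episode): even the most minimal machine representation of a single grasp is a continuous 6-DoF pose plus a gripper aperture, seven real numbers where small errors mean failure, which is exactly the kind of state that resists compact verbal description.

```python
import numpy as np

# Illustrative sketch (not from the episode): a minimal grasp
# parameterization is a 6-DoF end-effector pose plus gripper width.
def grasp_pose(position, rotation, aperture_m):
    """Pack a grasp as a 4x4 homogeneous transform plus gripper aperture."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = position
    return T, aperture_m

# Hypothetical side grasp on a mug handle: rotated 90 degrees about z,
# 30 cm forward, 10 cm above the table, 2 cm gripper opening.
theta = np.pi / 2
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
T, width = grasp_pose(np.array([0.30, 0.00, 0.10]), Rz, 0.02)
print(T)       # seven continuous numbers underlie this one action
print(width)
```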
A compelling mathematical insight emerges when considering the bandwidth constraints of language communication. (44:07) Speaking continuously at 150 words per minute for 24 hours generates only about 215,000 tokens per day, while our lived experience in a rich 3D/4D world contains vastly more information. This bandwidth limitation explains why language serves as a "lossy, low-bandwidth channel" for describing spatial reality. Historical examples like Newton's discovery of gravity or the deduction of DNA's structure required spatial reasoning that couldn't be reduced to pure linguistic description. (43:42) This suggests that as AI systems become more capable, they'll need to process and understand the world through richer, higher-bandwidth channels beyond just text sequences.
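The arithmetic behind the figure is easy to verify; the quick check below also contrasts it with the raw data rate of a single visual stream (our numbers, assuming roughly one token per word):

```python
# Back-of-the-envelope check of the episode's bandwidth claim, assuming
# ~150 words/minute and about one token per word (a simplification;
# real tokenizers emit closer to 1.3 tokens per English word).
words_per_minute = 150
tokens_per_word = 1.0            # assumption, for a round number
minutes_per_day = 24 * 60

tokens_per_day = words_per_minute * tokens_per_word * minutes_per_day
print(f"{tokens_per_day:,.0f} tokens/day")   # 216,000 -- the episode's ~215K figure

# Compare with one uncompressed 1080p 30fps visual stream:
pixels_per_second = 1920 * 1080 * 30
bytes_per_day = pixels_per_second * 3 * 86_400   # 3 bytes/pixel (RGB)
print(f"{bytes_per_day / 1e12:.1f} TB/day of raw visual input")  # ~16.1 TB
```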
One of the most intellectually honest discussions centers on whether current models truly "understand" physics or simply fit patterns in data. (23:46) The hosts reference a Harvard paper showing that while an LLM could predict planetary orbits accurately, it failed to draw correct force vectors, revealing a gap between pattern matching and causal understanding. Fei-Fei acknowledges this limitation, stating that current deep learning remains fundamentally about "fitting patterns" rather than achieving genuine causal reasoning like humans do. (27:07) However, she suggests that distilling physics engines into neural networks and attaching physical properties to Gaussian splats could bridge this gap, potentially leading to models that exhibit more genuine understanding of physical laws rather than just plausible-looking outputs.
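As a rough illustration of what "attaching physical properties to Gaussian splats" could look like (a minimal sketch under our own assumptions, not World Labs' actual approach), each splat would carry dynamics state alongside its rendering parameters, and a distilled or learned simulator would update that state:

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class PhysicalSplat:
    # Standard 3D Gaussian splat rendering parameters.
    mean: np.ndarray       # (3,) center position
    scale: np.ndarray      # (3,) per-axis extent
    rotation: np.ndarray   # (4,) orientation quaternion
    color: np.ndarray      # (3,) RGB
    opacity: float
    # Hypothetical physical attributes of the kind the episode alludes to.
    mass: float = 0.1      # kg
    velocity: np.ndarray = field(default_factory=lambda: np.zeros(3))

def gravity_step(splats, dt=1.0 / 60.0):
    """Explicit-Euler stand-in for a distilled physics engine: splats fall
    under gravity and stop at a ground plane at y = 0."""
    g = np.array([0.0, -9.81, 0.0])
    for s in splats:
        s.velocity = s.velocity + g * dt
        s.mean = s.mean + s.velocity * dt
        if s.mean[1] < 0.0:            # crude ground collision
            s.mean[1] = 0.0
            s.velocity[:] = 0.0

splat = PhysicalSplat(mean=np.array([0.0, 1.0, 0.0]),
                      scale=np.full(3, 0.01),
                      rotation=np.array([1.0, 0.0, 0.0, 0.0]),
                      color=np.array([0.8, 0.2, 0.2]),
                      opacity=1.0)
for _ in range(120):       # two simulated seconds
    gravity_step([splat])
print(splat.mean)          # the splat has come to rest on the ground
```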
Both speakers express concern about the growing resource disparity between academic institutions and industry labs, though they frame it differently than typical "open vs. closed" debates. (06:17) Fei-Fei advocates for initiatives like the National AI Research Resource (NAIRR) bill to create public-sector compute clouds and data repositories. (08:38) Justin argues that academia's role should shift toward "wacky ideas" and theoretical understanding rather than trying to compete on training the largest models. He worries that too many academics are treating their programs as "vocational training" for big tech rather than pursuing fundamental research. (10:00) The solution isn't about business models but ensuring academia has sufficient resources to explore blue-sky problems and interdisciplinary research that industry might not prioritize.
A fascinating technical insight emerges around the relationship between hardware constraints and neural architecture design. (10:57) Justin explains that current transformers succeeded because matrix multiplication aligns well with GPU architectures, but future distributed computing systems may require fundamentally different primitives. As systems scale from single GPUs to massive clusters, the atomic unit of computation shifts, potentially demanding new architectures optimized for distributed processing rather than monolithic models. (12:32) He also notes that even the latest hardware improvements (Hopper to Blackwell) show diminishing returns in performance per watt, suggesting we're approaching scaling limits that will necessitate architectural innovation rather than just bigger chips.
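A standard way to see why matrix multiplication fits GPUs so well is arithmetic intensity, the ratio of compute to memory traffic; the sketch below uses our own illustrative numbers, not figures from the episode:

```python
# Why matmul suits GPUs: arithmetic intensity (FLOPs per byte moved)
# grows with matrix size, so large matmuls keep compute units busy
# instead of waiting on memory.

def matmul_arithmetic_intensity(m, n, k, bytes_per_elem=2):  # fp16
    flops = 2 * m * n * k                                    # multiply-accumulates
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)   # read A, B; write C
    return flops / bytes_moved

for size in (128, 1024, 8192):
    print(size, round(matmul_arithmetic_intensity(size, size, size), 1))
# Intensity scales roughly linearly with matrix size (n/3 here), while
# element-wise ops stay at O(1) FLOPs/byte -- one reason architectures
# built around matmul won on current hardware.
```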