Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full context.
This episode features Fei-Fei Li and Justin Johnson, co-founders of World Labs, discussing their groundbreaking work on Marble - a generative world model that creates editable 3D environments from text and images. (01:00) The conversation explores how their journey from ImageNet research led to developing spatial intelligence as the next frontier beyond language models. Li and Johnson share insights on how world models could fundamentally change how machines understand and interact with 3D spaces, from creative applications in gaming and film to robotics simulation and architectural design.
Fei-Fei Li is co-founder of World Labs and professor of computer science at Stanford University, where she serves as co-director of the Stanford Institute for Human-Centered AI. She is renowned for creating ImageNet, the dataset that launched the deep learning revolution, and has been a leading advocate for responsible AI development and academic research funding through initiatives like the National AI Research Resource (NAIRR).
Justin Johnson is co-founder of World Labs and a former professor at the University of Michigan, where he worked on computer vision and 3D modeling after completing his PhD under Fei-Fei Li at Stanford. He was instrumental in pioneering early vision-language work, including dense captioning research, and has extensive experience in computer graphics, generative modeling, and three-dimensional vision systems.
Spatial intelligence represents a fundamentally different capability from language intelligence, focusing on reasoning, understanding, moving, and interacting in 3D space. (26:38) Li emphasizes that while language models excel at abstract reasoning, they operate through a "lossy, low-bandwidth channel" that cannot fully capture the rich 3D/4D world we inhabit. This distinction is crucial for professionals working in AI - rather than viewing spatial intelligence as competing with language models, it should be seen as complementary. For example, while you can narrate the process of picking up a mug, that narrated language cannot actually enable someone to perform the physical action. Professionals should consider how spatial reasoning could enhance their work in fields ranging from robotics to architectural design.
Marble's architecture using Gaussian splats enables real-time rendering on mobile devices and VR headsets while providing precise camera control. (33:21) This represents a fundamental shift from frame-by-frame video generation to persistent 3D world creation. Johnson explains that each Gaussian splat is a semi-transparent particle with position and orientation in 3D space, allowing scenes to be built from millions of these particles. The key insight is that by generating persistent 3D representations rather than sequential frames, users gain unprecedented control over camera placement and scene editing. This approach enables applications like architectural visualization where you can reconstruct a kitchen from photos and interactively modify elements like countertops or floors.
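The core idea Johnson describes, a scene built from millions of semi-transparent particles, can be illustrated with a toy sketch. This is not World Labs' implementation; it is a minimal, hypothetical example of how depth-ordered alpha compositing over Gaussian-splat-like particles produces a pixel color:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Splat:
    position: np.ndarray  # 3D center of the particle
    opacity: float        # semi-transparency in [0, 1]
    color: np.ndarray     # RGB contribution

def composite_along_ray(splats, camera_pos):
    # Sort splats front-to-back by distance from the camera, then
    # alpha-composite: each splat contributes its color weighted by
    # its opacity and the transmittance accumulated so far.
    ordered = sorted(splats, key=lambda s: np.linalg.norm(s.position - camera_pos))
    color = np.zeros(3)
    transmittance = 1.0
    for s in ordered:
        color += transmittance * s.opacity * s.color
        transmittance *= (1.0 - s.opacity)
    return color
```

Because the splats persist as a 3D representation rather than being regenerated per frame, moving `camera_pos` and re-compositing is all that camera control requires, which is what makes real-time rendering on mobile and VR hardware feasible. Real systems also project each anisotropic Gaussian onto the image plane and rasterize it, which this sketch omits.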
World Labs deliberately designed Marble to serve both as a stepping stone toward spatial intelligence and as a commercially viable product today. (27:57) Johnson emphasizes they wanted to avoid building "just a science project" by creating something people could find useful immediately in gaming, VFX, and film industries. This balance is achieved through what they call "horizontal technology" - building foundational capabilities that naturally extend to multiple use cases. For professionals in AI startups, this demonstrates the importance of identifying core technologies that can address immediate market needs while building toward longer-term visions. The kitchen remodeling example illustrates how general 3D world generation capabilities can serve specific commercial applications without requiring custom development.
Traditional physics engines can be distilled into neural networks to create more capable world models for robotics training. (29:09) Li explains that robotics suffers from data starvation - high-fidelity real-world data is critical but scarce, while internet video lacks the controllability needed for training embodied agents. Synthetic simulation data provides a crucial middle ground, but the biggest pain point has been curating assets and building complex scenarios. Marble addresses this by generating synthetic simulated worlds that can provide the diverse states and interactions needed for embodied agent training. Professionals working in robotics should consider how generative world models can supplement traditional data collection methods and provide more controlled training environments.
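The "diverse states and interactions" Li describes are typically obtained through domain randomization: sampling many scene configurations so an embodied agent never trains on the same world twice. The sketch below is a hypothetical illustration of that sampling step, not Marble's API; the asset names and ranges are made up:

```python
import random

def sample_scene_config(assets, rng):
    # Randomize which assets appear, where they sit, and the lighting,
    # so each training rollout sees a different simulated world state.
    n = rng.randint(3, len(assets))
    placed = rng.sample(assets, n)
    return {
        "objects": [
            {"asset": a,
             "position": [rng.uniform(-2.0, 2.0), 0.0, rng.uniform(-2.0, 2.0)],
             "yaw_deg": rng.uniform(0.0, 360.0)}
            for a in placed
        ],
        "light_intensity": rng.uniform(0.3, 1.0),
    }

rng = random.Random(0)  # seeded for reproducible scenario generation
config = sample_scene_config(["mug", "table", "chair", "lamp"], rng)
```

A generative world model removes the bottleneck Li identifies, curating the `assets` list by hand, by synthesizing the assets and layouts themselves; the controllability missing from internet video comes from the fact that every sampled configuration is fully known to the training loop.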
As AI systems scale from single GPUs to massive distributed clusters, new architectural paradigms may be needed beyond current matrix multiplication-based approaches. (12:00) Johnson suggests that while transformers work well on GPUs through matrix multiplication primitives, future hardware scaling will require thinking about neural networks as distributed systems rather than monolithic entities. He proposes exploring architectures that better fit large-scale distributed systems, potentially discovering new primitives for the next generation of hardware. For AI practitioners, this highlights the importance of considering how current architectural assumptions may not hold as compute infrastructure evolves, and the value of exploring alternative approaches that could be better suited for future distributed computing paradigms.
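One concrete form of "neural networks as distributed systems" that exists today is tensor parallelism, where a single matrix multiplication is sharded across devices. The toy sketch below simulates this on one machine with NumPy; the shard count and shapes are illustrative, and real systems add communication collectives that this omits:

```python
import numpy as np

def sharded_matmul(x, W, n_shards):
    # Split the weight matrix column-wise across n_shards "devices",
    # compute each partial product independently, then concatenate.
    # The result is mathematically identical to the monolithic matmul,
    # but each shard only ever holds a slice of W.
    shards = np.array_split(W, n_shards, axis=1)
    partials = [x @ Wk for Wk in shards]
    return np.concatenate(partials, axis=-1)
```

Johnson's point is that this decomposition is still built from the same matrix-multiplication primitive; the open question he raises is whether future distributed hardware rewards architectures designed around different primitives entirely.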