Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full context.
This episode features Fei-Fei Li and Justin Johnson, co-founders of World Labs, discussing their groundbreaking work on Marble - a generative world model that creates editable 3D environments from text and images. (01:00) The conversation explores how their journey from ImageNet research led to developing spatial intelligence as the next frontier beyond language models. Li and Johnson share insights on how world models could fundamentally change how machines understand and interact with 3D spaces, from creative applications in gaming and film to robotics simulation and architectural design.
Fei-Fei Li is co-founder of World Labs and professor of computer science at Stanford University, where she serves as co-director of the Stanford Institute for Human-Centered AI. She is renowned for creating ImageNet, the dataset that launched the deep learning revolution, and has been a leading advocate for responsible AI development and academic research funding through initiatives like the National AI Research Resource (NAIRR).
Justin Johnson is co-founder of World Labs and a former professor at the University of Michigan, where he worked on computer vision and 3D modeling after completing his PhD under Fei-Fei Li at Stanford. He was instrumental in pioneering early vision-language work, including dense captioning research, and has extensive experience in computer graphics, generative modeling, and three-dimensional vision systems.
Spatial intelligence represents a fundamentally different capability from language intelligence, focusing on reasoning, understanding, moving, and interacting in 3D space. (26:38) Li emphasizes that while language models excel at abstract reasoning, they operate through a "lossy, low-bandwidth channel" that cannot fully capture the rich 3D/4D world we inhabit. This distinction is crucial for professionals working in AI - rather than viewing spatial intelligence as competing with language models, it should be seen as complementary. For example, while you can narrate the process of picking up a mug, that narrated language cannot actually enable someone to perform the physical action. Professionals should consider how spatial reasoning could enhance their work in fields ranging from robotics to architectural design.
Marble's architecture using Gaussian splats enables real-time rendering on mobile devices and VR headsets while providing precise camera control. (33:21) This represents a fundamental shift from frame-by-frame video generation to persistent 3D world creation. Johnson explains that each Gaussian splat is a semi-transparent particle with position and orientation in 3D space, allowing scenes to be built from millions of these particles. The key insight is that by generating persistent 3D representations rather than sequential frames, users gain unprecedented control over camera placement and scene editing. This approach enables applications like architectural visualization where you can reconstruct a kitchen from photos and interactively modify elements like countertops or floors.
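The core idea Johnson describes, a scene built from millions of semi-transparent particles, can be illustrated with a toy sketch. This is not World Labs' implementation; it is a minimal, hypothetical example of how depth-ordered alpha compositing over Gaussian-splat-like particles produces a pixel color:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Splat:
    position: np.ndarray  # 3D center of the particle
    opacity: float        # semi-transparency in [0, 1]
    color: np.ndarray     # RGB contribution

def composite_along_ray(splats, camera_pos):
    # Sort splats front-to-back by distance from the camera, then
    # alpha-composite: each splat contributes its color weighted by
    # its opacity and the transmittance accumulated so far.
    ordered = sorted(splats, key=lambda s: np.linalg.norm(s.position - camera_pos))
    color = np.zeros(3)
    transmittance = 1.0
    for s in ordered:
        color += transmittance * s.opacity * s.color
        transmittance *= (1.0 - s.opacity)
    return color
```

Because the splats persist as a 3D representation rather than being regenerated per frame, moving `camera_pos` and re-compositing is all that camera control requires, which is what makes real-time rendering on mobile and VR hardware feasible. Real systems also project each anisotropic Gaussian onto the image plane and rasterize it, which this sketch omits.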
World Labs deliberately designed Marble to serve both as a stepping stone toward spatial intelligence and as a commercially viable product today. (27:57) Johnson emphasizes they wanted to avoid building "just a science project" by creating something people could find useful immediately in gaming, VFX, and film industries. This balance is achieved through what they call "horizontal technology" - building foundational capabilities that naturally extend to multiple use cases. For professionals in AI startups, this demonstrates the importance of identifying core technologies that can address immediate market needs while building toward longer-term visions. The kitchen remodeling example illustrates how general 3D world generation capabilities can serve specific commercial applications without requiring custom development.
Traditional physics engines can be distilled into neural networks to create more capable world models for robotics training. (29:09) Li explains that robotics suffers from data starvation - high-fidelity real-world data is critical but scarce, while internet video lacks the controllability needed for training embodied agents. Synthetic simulation data provides a crucial middle ground, but the biggest pain point has been curating assets and building complex scenarios. Marble addresses this by generating synthetic simulated worlds that can provide the diverse states and interactions needed for embodied agent training. Professionals working in robotics should consider how generative world models can supplement traditional data collection methods and provide more controlled training environments.
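The "diverse states and interactions" Li describes are typically obtained through domain randomization: sampling many scene configurations so an embodied agent never trains on the same world twice. The sketch below is a hypothetical illustration of that sampling step, not Marble's API; the asset names and ranges are made up:

```python
import random

def sample_scene_config(assets, rng):
    # Randomize which assets appear, where they sit, and the lighting,
    # so each training rollout sees a different simulated world state.
    n = rng.randint(3, len(assets))
    placed = rng.sample(assets, n)
    return {
        "objects": [
            {"asset": a,
             "position": [rng.uniform(-2.0, 2.0), 0.0, rng.uniform(-2.0, 2.0)],
             "yaw_deg": rng.uniform(0.0, 360.0)}
            for a in placed
        ],
        "light_intensity": rng.uniform(0.3, 1.0),
    }

rng = random.Random(0)  # seeded for reproducible scenario generation
config = sample_scene_config(["mug", "table", "chair", "lamp"], rng)
```

A generative world model removes the bottleneck Li identifies, curating the `assets` list by hand, by synthesizing the assets and layouts themselves; the controllability missing from internet video comes from the fact that every sampled configuration is fully known to the training loop.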
As AI systems scale from single GPUs to massive distributed clusters, new architectural paradigms may be needed beyond current matrix multiplication-based approaches. (12:00) Johnson suggests that while transformers work well on GPUs through matrix multiplication primitives, future hardware scaling will require thinking about neural networks as distributed systems rather than monolithic entities. He proposes exploring architectures that better fit large-scale distributed systems, potentially discovering new primitives for the next generation of hardware. For AI practitioners, this highlights the importance of considering how current architectural assumptions may not hold as compute infrastructure evolves, and the value of exploring alternative approaches that could be better suited for future distributed computing paradigms.
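One concrete form of "neural networks as distributed systems" that exists today is tensor parallelism, where a single matrix multiplication is sharded across devices. The toy sketch below simulates this on one machine with NumPy; the shard count and shapes are illustrative, and real systems add communication collectives that this omits:

```python
import numpy as np

def sharded_matmul(x, W, n_shards):
    # Split the weight matrix column-wise across n_shards "devices",
    # compute each partial product independently, then concatenate.
    # The result is mathematically identical to the monolithic matmul,
    # but each shard only ever holds a slice of W.
    shards = np.array_split(W, n_shards, axis=1)
    partials = [x @ Wk for Wk in shards]
    return np.concatenate(partials, axis=-1)
```

Johnson's point is that this decomposition is still built from the same matrix-multiplication primitive; the open question he raises is whether future distributed hardware rewards architectures designed around different primitives entirely.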