Timestamps are approximate and may be slightly off. We encourage you to listen to the episode for full context.
The OpenAI Sora 2 team of Bill Peebles (inventor of diffusion transformers), Thomas Dimson (engineering lead), and Rohan Sahai (product lead) discusses how they've revolutionized video generation by compressing filmmaking timelines from months to days. (00:00) Bill explains how space-time tokens enable object permanence and physics understanding in AI-generated video, a fundamental shift from previous-generation models that often failed at complex physical interactions. (06:22) The team shares their deliberate design philosophy against mindless scrolling, instead optimizing for creative inspiration through features like Cameos that put users directly into generated videos. (26:00) They envision a future in which Sora becomes a world simulator capable of running scientific experiments, with digital clones of users interacting in alternate realities for both entertainment and knowledge work. (49:48)
Head of the Sora team at OpenAI and inventor of the diffusion transformer architecture that powers Sora and most modern video generation models. He came to OpenAI directly from Berkeley, where he conducted research on video generation, and started work on Sora on his first day at the company.
Engineering lead on Sora, with seven years at Instagram, where he developed early machine learning systems and recommender algorithms when the company had just 40 people. After leaving Instagram, he founded a startup building Minecraft in the browser, which OpenAI acquired for the team's product expertise.
Product lead for Sora who has been at OpenAI for two and a half years, initially working as an individual contributor on ChatGPT before transitioning to lead the Sora product team. He has a background in startups and large companies throughout Silicon Valley.
Bill Peebles explains that Sora's breakthrough comes from treating video as "space-time tokens": small cuboids that combine spatial (x, y) and temporal dimensions. (04:30) Unlike traditional autoregressive models that generate sequentially, diffusion transformers process the entire video at once, giving every position in space and time full global context. This architecture addresses failure modes such as broken object permanence and temporal inconsistency that plagued earlier video generation systems. For professionals working with AI video tools, understanding this shift from sequential to simultaneous generation helps explain why newer models produce more coherent, physics-respecting content.
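To make the token construction concrete, here is a minimal sketch in Python/NumPy of carving a video tensor into space-time cuboids and flattening each into a token. The patch sizes and function name are illustrative assumptions; Sora's actual patching and latent encoding are not public, so treat this as a sketch of the concept rather than the real pipeline.

```python
import numpy as np

def video_to_spacetime_tokens(video, t=4, h=16, w=16):
    """Split a video of shape (T, H, W, C) into space-time cuboids of
    shape (t, h, w, C), each flattened into one token.

    Assumes dimensions divide evenly; real systems would pad or resize.
    """
    T, H, W, C = video.shape
    assert T % t == 0 and H % h == 0 and W % w == 0
    tokens = (
        video.reshape(T // t, t, H // h, h, W // w, w, C)
             .transpose(0, 2, 4, 1, 3, 5, 6)  # group the grid axes first
             .reshape(-1, t * h * w * C)      # one flat vector per cuboid
    )
    return tokens  # shape: (num_tokens, token_dim)

# Example: a 16-frame 128x128 RGB clip becomes a short token sequence.
clip = np.random.rand(16, 128, 128, 3)
tokens = video_to_spacetime_tokens(clip)
print(tokens.shape)  # (4 * 8 * 8, 4 * 16 * 16 * 3) = (256, 3072)
```

Each flattened cuboid becomes one token, so the transformer attends across the whole clip in space and time at once rather than frame by frame, which is what gives the model its global context.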
Thomas Dimson shares crucial insights from his Instagram experience about the dangers of optimizing purely for consumption. (25:30) At Instagram, they initially implemented algorithmic feeds to solve the problem of heavy content creators crowding out personal posts from friends. However, over time, ad pressure and consumption metrics led to mindless scrolling behavior. With Sora, they've intentionally designed against this pattern by optimizing the recommendation algorithm for creative inspiration rather than passive consumption. The result: nearly 100% of users create content on day one, and 70% continue creating when they return. (30:17) This demonstrates how platform design fundamentally shapes user behavior.
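As a purely illustrative sketch of that design choice, the difference comes down to which predicted outcomes the feed's scoring function rewards. The field names, weights, and scoring formula below are hypothetical, not Sora's actual ranking system:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    video_id: str
    p_watch: float   # predicted probability the user watches to completion
    p_create: float  # predicted probability the video inspires a new creation

def rank_for_inspiration(candidates, w_watch=0.2, w_create=0.8):
    """Toy scoring function: a consumption-only feed would set w_create
    to 0; weighting predicted creation higher steers the feed toward
    videos that prompt users to make something themselves."""
    return sorted(
        candidates,
        key=lambda c: w_watch * c.p_watch + w_create * c.p_create,
        reverse=True,
    )

feed = rank_for_inspiration([
    Candidate("a", p_watch=0.9, p_create=0.1),  # bingeable but passive
    Candidate("b", p_watch=0.5, p_create=0.7),  # likely to spark a remix
])
print([c.video_id for c in feed])  # ['b', 'a']
```

Under a consumption objective, video "a" would rank first; shifting the weight toward predicted creation flips the ordering, which is the kind of incentive change the team describes.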
The team emphasizes OpenAI's philosophy of iterative deployment rather than "dropping bombshells" on society. (49:38) They position Sora 2 as the "GPT-3.5 moment" for video: capable enough to demonstrate the technology's potential at mass scale while giving society time to adapt and establish norms. Bill envisions a future with digital clones running tasks in alternate realities, but recognizes the importance of introducing these capabilities gradually. (49:58) This approach allows for learning, adjustment, and responsible scaling while building public comfort with transformative technologies.
Sora 2 exhibits a telling failure mode that points to genuine world understanding: when asked to show a basketball player making a shot, if the player misses, Sora respects physics and shows the ball rebounding rather than magically guiding it into the hoop. (07:07) This is "agent failure" rather than "model failure": the AI is simulating intelligent agents operating within physical constraints, not simply fulfilling user requests. (07:54) As Bill notes, this physics understanding emerges from scale, much as language models develop world models in order to predict tokens effectively. For practitioners, this suggests that advanced AI systems aren't just pattern matching but are developing internal representations of how the world actually functions.
The breakthrough Cameo feature, which lets users place themselves directly into generated videos, emerged from recognizing that pure AI content lacks human connection. (20:45) Thomas initially doubted the feature would work technically, but when the team tested it internally, their entire feed became nothing but Cameos of each other. (21:28) This humanized AI-generated content and created genuine social dynamics: users could tag friends, create response videos, and build on each other's creations. The lesson: even in an AI-powered world, the human element remains essential for engagement and meaning.