
Timestamps are approximate and may be slightly off. We encourage you to listen to the episode for full context.
In this episode, Dwarkesh Patel interviews Sergey Levine, co-founder of Physical Intelligence and UC Berkeley professor, about the current state and future of robotic foundation models. Physical Intelligence aims to build general-purpose AI models that can control any robot to perform any task, which Levine sees as a fundamental component of the broader AI problem. (00:42) The company has made significant progress in getting robots to perform dexterous tasks like folding laundry, cleaning kitchens, and even handling complex situations like turning inside-out clothes right-side-out before folding them.
• Core Theme: The discussion centers on the technical challenges, timelines, and societal implications of achieving general robotic intelligence, with particular focus on data collection strategies, the potential for human-robot collaboration, and the economic transformation that widespread automation could bring.
Sergey Levine is co-founder of Physical Intelligence, a robotics foundation model company, and a professor at UC Berkeley. He is recognized as one of the world's leading researchers in robotics, reinforcement learning, and AI. Before founding Physical Intelligence, he worked at Google, where he was involved in groundbreaking robotics research that underpins much of the current work in robotic foundation models.
Dwarkesh Patel is the host of this podcast, known for conducting in-depth technical interviews with leading AI researchers and technology experts. He demonstrates deep technical knowledge and asks probing questions about timelines, capabilities, and implications of emerging AI technologies.
Rather than waiting for perfect robotic capabilities, the key is to get robots deployed in real-world scenarios as soon as they can perform useful tasks, even with limited scope. (04:00) Levine emphasizes that once robots are operating in the real world, they can collect experience and leverage that experience to improve continuously. The approach mirrors what we've seen with AI assistants - starting with basic competence and gradually expanding capabilities. This strategy allows for rapid iteration and improvement through real-world feedback loops, making it more valuable than trying to perfect systems in laboratory conditions before deployment.
The most effective approach to robotic deployment involves human-robot collaboration rather than fully autonomous operation. (14:54) Levine describes how robots can learn from various types of human feedback - from direct teleoperation to language instructions to natural workplace guidance. This collaborative approach provides multiple sources of supervision and creates natural learning opportunities when mistakes occur. Physical tasks offer better error correction opportunities than pure knowledge work because mistakes in the physical world are observable and recoverable, allowing robots to learn from corrections in real-time.
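As a rough illustration of how such corrections could become training data, here is a minimal, hypothetical sketch of an intervention loop (in the spirit of DAgger-style data collection): the robot acts, a human occasionally overrides it, and the override is recorded as a labeled example. The policy, override stub, and all names are invented for illustration and are not Physical Intelligence's actual pipeline.

```python
"""Toy sketch of correction-driven learning: human overrides become supervision."""

import random

class ToyPolicy:
    def __init__(self):
        self.data = []                       # (observation, corrective action) pairs

    def predict(self, obs):
        return random.choice(["grasp", "fold", "place"])   # placeholder behavior

    def update(self, obs, corrective_action):
        self.data.append((obs, corrective_action))         # stand-in for fine-tuning

def human_override(obs, robot_action):
    """Stub for human supervision: a teleop takeover, a verbal correction, etc.
    Returns a corrective action, or None if the robot's action is acceptable."""
    return "fold" if robot_action == "place" else None

def run_episode(policy, observations):
    for obs in observations:
        action = policy.predict(obs)
        correction = human_override(obs, action)
        if correction is not None:
            policy.update(obs, correction)   # the observable mistake becomes a training example
            action = correction
        # execute `action` on the robot here

run_episode(ToyPolicy(), ["shirt_on_table", "shirt_half_folded"])
```

The point of the sketch is simply that physical mistakes are visible and recoverable, so every human correction doubles as a new supervised datapoint.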
The breakthrough enabling current robotic progress comes from incorporating prior knowledge from large language models and vision-language models into robotic systems. (34:50) Physical Intelligence's approach involves adapting vision-language models with an "action expert" - essentially grafting a motor cortex onto existing AI systems. This allows robots to benefit from the vast world knowledge embedded in these foundation models, including common sense reasoning about physical interactions, object recognition, and task understanding that would be impossible to learn from robotic data alone.
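To make the "action expert" idea concrete, here is a minimal PyTorch sketch in which a stand-in vision-language backbone fuses an image and an instruction into a context embedding, and a small action head decodes that context into a chunk of continuous robot actions. All shapes, module names, and the overall two-stage design are assumptions for illustration, not the published architecture.

```python
"""Illustrative VLM-plus-action-expert sketch; not Physical Intelligence's actual model."""

import torch
import torch.nn as nn

class VisionLanguageBackbone(nn.Module):
    """Stand-in for a pretrained VLM that embeds an image and an instruction."""
    def __init__(self, dim=512):
        super().__init__()
        self.image_proj = nn.Linear(3 * 224 * 224, dim)
        self.text_proj = nn.Embedding(30000, dim)

    def forward(self, image, token_ids):
        img = self.image_proj(image.flatten(1))          # (B, dim) image features
        txt = self.text_proj(token_ids).mean(dim=1)      # (B, dim) instruction features
        return torch.cat([img, txt], dim=-1)             # fused task context

class ActionExpert(nn.Module):
    """The grafted "motor cortex": maps the fused context to an action chunk."""
    def __init__(self, dim=512, action_dim=7, horizon=16):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(),
            nn.Linear(dim, action_dim * horizon),
        )
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, context):
        out = self.head(context)
        return out.view(-1, self.horizon, self.action_dim)  # (B, horizon, action_dim)

backbone, expert = VisionLanguageBackbone(), ActionExpert()
image = torch.randn(1, 3, 224, 224)
tokens = torch.randint(0, 30000, (1, 12))          # e.g. "fold the shirt on the table"
actions = expert(backbone(image, tokens))          # continuous targets over a short horizon
print(actions.shape)                               # torch.Size([1, 16, 7])
```

The design choice the sketch tries to capture is that the backbone carries the broad world knowledge while the comparatively small action head is the only part that must be learned from robot data.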
Unlike general video prediction models that struggle with the overwhelming complexity of visual data, robots benefit from having a specific purpose that focuses their perception. (40:28) When a robot is trying to accomplish a specific task, its perception becomes purpose-driven, allowing it to filter out irrelevant information and focus on what matters for the goal at hand. This focusing mechanism, similar to human attention, makes robotic perception more robust and efficient than trying to predict everything that might happen in a scene without specific objectives.
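One way to picture purpose-driven perception is task-conditioned attention: a goal embedding scores visual patches for relevance, so features that do not matter for the current task are down-weighted. The toy sketch below is purely illustrative and assumes nothing about the specific models discussed in the episode.

```python
"""Toy sketch of task-conditioned perception: the goal decides which patches matter."""

import torch
import torch.nn.functional as F

def task_conditioned_pool(patch_features, task_embedding):
    """patch_features: (B, N, D) visual patches; task_embedding: (B, D) goal.
    Returns a (B, D) summary dominated by task-relevant patches."""
    scores = torch.einsum("bnd,bd->bn", patch_features, task_embedding)   # relevance per patch
    weights = F.softmax(scores / patch_features.shape[-1] ** 0.5, dim=-1)
    return torch.einsum("bn,bnd->bd", weights, patch_features)

patches = torch.randn(1, 196, 256)      # e.g. 14x14 grid of visual patch features
goal = torch.randn(1, 256)              # embedding of "pick up the sock"
summary = task_conditioned_pool(patches, goal)
print(summary.shape)                    # torch.Size([1, 256])
```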
Even current robotic models demonstrate emerging capabilities through compositional generalization - combining learned behaviors in novel ways to handle unexpected situations. (45:54) Levine shares examples of robots spontaneously handling scenarios they weren't explicitly trained for, such as picking up objects that fall during folding tasks or properly orienting inside-out clothing before folding. This suggests that with sufficient diversity of training behaviors, robotic foundation models can compose these behaviors intelligently as situations demand, similar to how language models can combine concepts in novel ways.