Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full episode for complete context.
In this comprehensive episode, the Meta AI team reveals SAM 3, their groundbreaking unified model for concept-prompted segmentation, detection, and tracking across both images and videos. (01:22) Unlike previous SAM versions, SAM 3 enables users to find all instances of objects using natural language prompts like "yellow school bus" or "watering can," eliminating the need to manually click on each instance. The model achieves remarkable real-time performance - 30ms per image with 100 detected objects on H200 GPUs, and scales to real-time video processing through parallel inference across multiple GPUs. (10:00) The episode covers the revolutionary data engine that reduced annotation time from 2 minutes to 25 seconds per image using AI verifiers, the new SACO benchmark with 200,000+ unique concepts, and how SAM 3 integrates with multimodal LLMs to solve complex visual reasoning tasks.
Lead researcher on Meta's FAIR team and head of the Segment Anything Model (SAM) project for nearly four years. She has been at Meta for eight and a half years, witnessing the evolution of computer vision firsthand, and has overseen the development of SAM 1, SAM 2, and now SAM 3 across this transformative period in AI.
Computer vision researcher on Meta's SAM team who has worked in the field for nine years, since 2017. He previously spent five years at Microsoft Research and worked at Meta Reality Labs on egocentric foundation models before joining the SAM team in 2023, contributing significantly to achieving human-level performance in detection, segmentation, and tracking.
Co-founder and CEO of Roboflow, a platform focused on making computer vision accessible to millions of developers and Fortune 100 companies. Under his leadership, Roboflow has grown into one of the largest hosted deployments of the SAM models, facilitating over 106 million smart polygon creations and saving an estimated 130+ years of human labeling time across applications ranging from cancer research to autonomous vehicles.
SAM 3 introduces "concept prompts" - short text phrases like "purple umbrella" or "watering can" that find all instances of an object category without manually clicking on each one. (06:00) This represents a fundamental shift from the interactive segmentation approach of previous SAM versions. The model focuses on atomic visual concepts rather than open-ended text input, which keeps performance reliable across diverse scenarios. Users can still refine prompts with visual exemplars such as clicks or boxes when the model misses instances, creating a hybrid approach that combines the efficiency of text prompts with the precision of manual correction when needed.
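To make the workflow concrete, here is a minimal Python sketch of the hybrid text-plus-exemplar idea; `ConceptPrompt` and `refine` are names invented for this sketch, not the released SAM 3 API.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptPrompt:
    """A concept prompt: a short noun phrase plus optional visual exemplars."""
    phrase: str                                            # e.g. "purple umbrella"
    positive_boxes: list = field(default_factory=list)     # boxes drawn on missed instances
    negative_boxes: list = field(default_factory=list)     # boxes drawn on false positives

def refine(prompt: ConceptPrompt, missed_box=None, wrong_box=None) -> ConceptPrompt:
    """Hybrid refinement: keep the text phrase, add exemplars only where the model erred."""
    if missed_box is not None:
        prompt.positive_boxes.append(missed_box)
    if wrong_box is not None:
        prompt.negative_boxes.append(wrong_box)
    return prompt

# Start from text alone, then correct with exemplars as needed.
prompt = ConceptPrompt("yellow school bus")
prompt = refine(prompt, missed_box=(120, 40, 360, 220))    # one bus the model missed
```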
SAM 3 achieves real-time speed, with 30ms of inference per image while handling 100 detected objects on H200 GPUs. (09:20) For video, capacity scales with parallel inference across GPUs: 10 objects tracked in real time on 2×H200, 28 on 4×H200, and 64 on 8×H200. The key architectural innovation separates the detection and tracking components - the detector remains identity-agnostic so it can find all instances of a concept, while the tracker maintains a unique identity for each object across video frames. This decoupling resolves the fundamental conflict between detection, which should treat all instances of a concept alike, and tracking, which must keep each instance distinct.
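As a rough illustration of why that split matters, the sketch below keeps detector output identity-free (just boxes) and leaves identity assignment to a separate step that greedily matches detections to existing tracks by IoU. This is a simplified stand-in, not SAM 3's actual tracker; the point is the division of labor between finding instances and maintaining identities.

```python
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    box: tuple  # (x1, y1, x2, y2) in the most recent frame

def iou(a, b):
    """Intersection-over-union of two boxes, used here as the matching score."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def update_tracks(tracks, detections, next_id, match_thresh=0.5):
    """Detections carry no identity; this step assigns and maintains identities."""
    unmatched = list(detections)
    for track in tracks:
        best = max(unmatched, key=lambda d: iou(track.box, d), default=None)
        if best is not None and iou(track.box, best) >= match_thresh:
            track.box = best            # existing identity follows its best match
            unmatched.remove(best)
    for det in unmatched:               # newly appearing instances get fresh identities
        tracks.append(Track(next_id, det))
        next_id += 1
    return tracks, next_id
```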
Meta developed a data engine that cut annotation time from 2 minutes per image to just 25 seconds through strategic AI integration. (39:48) The process uses model-in-the-loop proposals to generate candidate masks, then employs AI verifiers built by fine-tuning Llama 3.2 to check mask quality and exhaustiveness. Human annotators intervene only when AI verification fails, focusing their expertise on the most challenging cases. The result is a largely automated annotation pipeline in which over 70% of the training annotations are negative examples that teach the model what not to detect.
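A minimal sketch of that triage loop, with `propose_masks`, `ai_verify`, and `human_annotate` as placeholder callables for the proposal model, the Llama-based verifier, and the human fallback; the real data engine is considerably more elaborate.

```python
def annotate(images, propose_masks, ai_verify, human_annotate):
    """Model-in-the-loop triage: AI proposes and verifies; humans handle only the failures."""
    accepted, escalated = [], []
    for image in images:
        candidates = propose_masks(image)        # candidate masks from the current model
        verdict = ai_verify(image, candidates)   # quality/exhaustiveness check; returns an object with a boolean .ok
        if verdict.ok:
            accepted.append((image, candidates))                           # no human time spent
        else:
            escalated.append((image, human_annotate(image, candidates)))   # hardest cases only
    return accepted, escalated
```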
The new Segment Anything with Concepts (SACO) benchmark contains over 200,000 unique concepts, compared to the previous standard of 1,200 concepts in existing benchmarks. (12:12) This scale better reflects the diversity of natural language that people use in real-world scenarios. The benchmark is designed to measure human-level exhaustivity - finding every instance of a specified concept - with AI annotators helping ensure comprehensive coverage. This breadth is what lets SAM 3 cover the visual concepts that users actually want to segment in practical applications.
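As a toy illustration of the exhaustivity idea, the snippet below computes a count-based per-concept recall: for each concept, the fraction of annotated instances that were actually found. The official SACO evaluation is mask-based and more involved; this is only a sketch.

```python
from collections import defaultdict

def per_concept_recall(ground_truth, predictions):
    """ground_truth / predictions map (image_id, concept) -> number of instances."""
    found, total = defaultdict(int), defaultdict(int)
    for (image_id, concept), n_true in ground_truth.items():
        total[concept] += n_true
        found[concept] += min(predictions.get((image_id, concept), 0), n_true)
    return {c: found[c] / total[c] for c in total if total[c]}

gt = {("img1", "watering can"): 3, ("img2", "watering can"): 1, ("img1", "purple umbrella"): 2}
pred = {("img1", "watering can"): 2, ("img2", "watering can"): 1, ("img1", "purple umbrella"): 2}
print(per_concept_recall(gt, pred))   # {'watering can': 0.75, 'purple umbrella': 1.0}
```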
SAM 3 Agents demonstrate how the model serves as a visual tool for multimodal LLMs like Gemini and Llama, enabling complex queries such as "find the bigger character" or "what distinguishes male from female in this image." (30:18) The combination shows significant performance gains over either model alone, with SAM 3 providing precise visual grounding while the LLM contributes language understanding and reasoning. The synergy works both ways - the LLM helps correct SAM's errors, while SAM supplies the visual precision that LLMs often lack, particularly for tasks requiring exact counting or spatial understanding.
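A schematic of such an agent loop is sketched below; `llm_plan`, `sam3_ground`, and `llm_answer` are placeholder callables standing in for the multimodal LLM and SAM 3, since the episode does not spell out the exact interface.

```python
def run_visual_agent(question, image, llm_plan, sam3_ground, llm_answer, max_rounds=3):
    """The LLM breaks a query into concept prompts, SAM 3 grounds them as masks/boxes,
    and the LLM reasons over that evidence, re-prompting if it is not yet confident."""
    evidence, answer = [], None
    for _ in range(max_rounds):
        for phrase in llm_plan(question, evidence):                 # e.g. ["character"] for "find the bigger character"
            evidence.append((phrase, sam3_ground(image, phrase)))   # all instances of the phrase
        answer = llm_answer(question, evidence)                     # None signals "not confident yet"
        if answer is not None:
            break
    return answer, evidence                                         # grounded evidence backs the final answer
```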