Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full episode for complete context.
In this comprehensive episode, the Meta AI team reveals SAM 3, their groundbreaking unified model for concept-prompted segmentation, detection, and tracking across both images and videos. (01:22) Unlike previous SAM versions, SAM 3 enables users to find all instances of objects using natural language prompts like "yellow school bus" or "watering can," eliminating the need to manually click on each instance. The model achieves remarkable real-time performance - 30ms per image with 100 detected objects on H200 GPUs, and scales to real-time video processing through parallel inference across multiple GPUs. (10:00) The episode covers the revolutionary data engine that reduced annotation time from 2 minutes to 25 seconds per image using AI verifiers, the new SACO benchmark with 200,000+ unique concepts, and how SAM 3 integrates with multimodal LLMs to solve complex visual reasoning tasks.
Lead researcher on Meta's FAIR team and head of the Segment Anything Model (SAM) project for nearly four years. She has been at Meta for eight and a half years, witnessing the evolution of computer vision firsthand, and has overseen the development of SAM 1, SAM 2, and now SAM 3 across this transformative period in AI.
Computer vision researcher on Meta's SAM team who has worked in the field for nine years, since 2017. He previously spent five years at Microsoft Research and worked at Meta Reality Labs on egocentric foundation models before joining the SAM team in 2023, contributing significantly to achieving human-level performance in detection, segmentation, and tracking.
Co-founder and CEO of Roboflow, a platform focused on making computer vision accessible to millions of developers and Fortune 100 companies. Under his leadership, Roboflow has grown into one of the largest hosted deployments of the SAM models, facilitating over 106 million smart polygon creations and saving an estimated 130+ years of human labeling time across applications ranging from cancer research to autonomous vehicles.
SAM 3 introduces "concept prompts" - short text phrases like "purple umbrella" or "watering can" that find all instances of an object category without manually clicking on each one. (06:00) This represents a fundamental shift from the interactive segmentation approach of previous SAM versions. The model focuses on atomic visual concepts rather than open-ended text input, which keeps performance reliable across diverse scenarios. Users can still refine prompts with visual exemplars such as clicks or boxes when the model misses instances, creating a hybrid approach that combines the efficiency of text prompts with the precision of manual correction when needed.
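To make the workflow concrete, here is a minimal Python sketch of the hybrid text-plus-exemplar idea; `ConceptPrompt` and `refine` are names invented for this sketch, not the released SAM 3 API.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptPrompt:
    """A concept prompt: a short noun phrase plus optional visual exemplars."""
    phrase: str                                            # e.g. "purple umbrella"
    positive_boxes: list = field(default_factory=list)     # boxes drawn on missed instances
    negative_boxes: list = field(default_factory=list)     # boxes drawn on false positives

def refine(prompt: ConceptPrompt, missed_box=None, wrong_box=None) -> ConceptPrompt:
    """Hybrid refinement: keep the text phrase, add exemplars only where the model erred."""
    if missed_box is not None:
        prompt.positive_boxes.append(missed_box)
    if wrong_box is not None:
        prompt.negative_boxes.append(wrong_box)
    return prompt

# Start from text alone, then correct with exemplars as needed.
prompt = ConceptPrompt("yellow school bus")
prompt = refine(prompt, missed_box=(120, 40, 360, 220))    # one bus the model missed
```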
SAM 3 achieves real-time speed, with 30ms of inference per image while handling 100 detected objects on H200 GPUs. (09:20) For video, capacity scales with parallel inference across GPUs: 10 objects tracked in real time on 2×H200, 28 on 4×H200, and 64 on 8×H200. The key architectural innovation separates the detection and tracking components - the detector remains identity-agnostic so it can find all instances of a concept, while the tracker maintains a unique identity for each object across video frames. This decoupling resolves the fundamental conflict between detection, which should treat all instances of a concept alike, and tracking, which must keep each instance distinct.
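As a rough illustration of why that split matters, the sketch below keeps detector output identity-free (just boxes) and leaves identity assignment to a separate step that greedily matches detections to existing tracks by IoU. This is a simplified stand-in, not SAM 3's actual tracker; the point is the division of labor between finding instances and maintaining identities.

```python
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    box: tuple  # (x1, y1, x2, y2) in the most recent frame

def iou(a, b):
    """Intersection-over-union of two boxes, used here as the matching score."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def update_tracks(tracks, detections, next_id, match_thresh=0.5):
    """Detections carry no identity; this step assigns and maintains identities."""
    unmatched = list(detections)
    for track in tracks:
        best = max(unmatched, key=lambda d: iou(track.box, d), default=None)
        if best is not None and iou(track.box, best) >= match_thresh:
            track.box = best            # existing identity follows its best match
            unmatched.remove(best)
    for det in unmatched:               # newly appearing instances get fresh identities
        tracks.append(Track(next_id, det))
        next_id += 1
    return tracks, next_id
```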
Meta developed a data engine that cut annotation time from 2 minutes per image to just 25 seconds through strategic AI integration. (39:48) The process uses model-in-the-loop proposals to generate candidate masks, then employs AI verifiers built by fine-tuning Llama 3.2 to check mask quality and exhaustiveness. Human annotators intervene only when AI verification fails, focusing their expertise on the most challenging cases. The result is a largely automated annotation pipeline in which over 70% of the training annotations are negative examples that teach the model what not to detect.
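A minimal sketch of that triage loop, with `propose_masks`, `ai_verify`, and `human_annotate` as placeholder callables for the proposal model, the Llama-based verifier, and the human fallback; the real data engine is considerably more elaborate.

```python
def annotate(images, propose_masks, ai_verify, human_annotate):
    """Model-in-the-loop triage: AI proposes and verifies; humans handle only the failures."""
    accepted, escalated = [], []
    for image in images:
        candidates = propose_masks(image)        # candidate masks from the current model
        verdict = ai_verify(image, candidates)   # quality/exhaustiveness check; returns an object with a boolean .ok
        if verdict.ok:
            accepted.append((image, candidates))                           # no human time spent
        else:
            escalated.append((image, human_annotate(image, candidates)))   # hardest cases only
    return accepted, escalated
```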
The new Segment Anything with Concepts (SACO) benchmark contains over 200,000 unique concepts, compared to the previous standard of 1,200 concepts in existing benchmarks. (12:12) This scale better reflects the diversity of natural language that people use in real-world scenarios. The benchmark is designed to measure human-level exhaustivity - finding every instance of a specified concept - with AI annotators helping ensure comprehensive coverage. This breadth is what lets SAM 3 cover the visual concepts that users actually want to segment in practical applications.
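As a toy illustration of the exhaustivity idea, the snippet below computes a count-based per-concept recall: for each concept, the fraction of annotated instances that were actually found. The official SACO evaluation is mask-based and more involved; this is only a sketch.

```python
from collections import defaultdict

def per_concept_recall(ground_truth, predictions):
    """ground_truth / predictions map (image_id, concept) -> number of instances."""
    found, total = defaultdict(int), defaultdict(int)
    for (image_id, concept), n_true in ground_truth.items():
        total[concept] += n_true
        found[concept] += min(predictions.get((image_id, concept), 0), n_true)
    return {c: found[c] / total[c] for c in total if total[c]}

gt = {("img1", "watering can"): 3, ("img2", "watering can"): 1, ("img1", "purple umbrella"): 2}
pred = {("img1", "watering can"): 2, ("img2", "watering can"): 1, ("img1", "purple umbrella"): 2}
print(per_concept_recall(gt, pred))   # {'watering can': 0.75, 'purple umbrella': 1.0}
```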
SAM 3 Agents demonstrate how the model serves as a visual tool for multimodal LLMs like Gemini and Llama, enabling complex queries such as "find the bigger character" or "what distinguishes male from female in this image." (30:18) The combination shows significant performance gains over either model alone, with SAM 3 providing precise visual grounding while the LLM contributes language understanding and reasoning. The synergy works both ways - the LLM helps correct SAM's errors, while SAM supplies the visual precision that LLMs often lack, particularly for tasks requiring exact counting or spatial understanding.
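A schematic of such an agent loop is sketched below; `llm_plan`, `sam3_ground`, and `llm_answer` are placeholder callables standing in for the multimodal LLM and SAM 3, since the episode does not spell out the exact interface.

```python
def run_visual_agent(question, image, llm_plan, sam3_ground, llm_answer, max_rounds=3):
    """The LLM breaks a query into concept prompts, SAM 3 grounds them as masks/boxes,
    and the LLM reasons over that evidence, re-prompting if it is not yet confident."""
    evidence, answer = [], None
    for _ in range(max_rounds):
        for phrase in llm_plan(question, evidence):                 # e.g. ["character"] for "find the bigger character"
            evidence.append((phrase, sam3_ground(image, phrase)))   # all instances of the phrase
        answer = llm_answer(question, evidence)                     # None signals "not confident yet"
        if answer is not None:
            break
    return answer, evidence                                         # grounded evidence backs the final answer
```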