Training Data • October 28, 2025

Nvidia CTO Michael Kagan: Scaling Beyond Moore's Law to Million-GPU Clusters

Michael Kagan, Nvidia's CTO, discusses how Mellanox transformed Nvidia's AI infrastructure by solving network scaling challenges, enabling massive GPU clusters and driving exponential computing performance beyond Moore's Law.
Topics: AI & Machine Learning, Tech Policy & Ethics, Developer Culture, Hardware & Gadgets, B2B SaaS Business
People: Jensen Huang, Michael Kagan, Pat Grady

Summary Sections

  • Podcast Summary
  • Speakers
  • Key Takeaways
  • Statistics & Facts
  • Compelling Stories (Premium)
  • Thought-Provoking Quotes (Premium)
  • Strategies & Frameworks (Premium)
  • Similar Strategies (Plus)
  • Additional Context (Premium)
  • Key Takeaways Table (Plus)
  • Critical Analysis (Plus)
  • Books & Articles Mentioned (Plus)
  • Products, Tools & Software Mentioned (Plus)

Timestamps are as accurate as possible but may be slightly off. We encourage you to listen to the episode for full context.


Podcast Summary

This episode features Michael Kagan, CTO of NVIDIA and co-founder of Mellanox, discussing the critical role of networking in AI infrastructure and how the $7 billion Mellanox acquisition transformed NVIDIA from a chip company into the architect of AI infrastructure. (02:29) Kagan explains that AI workloads require exponential performance growth: where Moore's Law delivered 2x every two years, AI demands 10x or even 16x annually, necessitating massive scale-up and scale-out solutions. (02:38) The conversation explores the technical challenges of building data centers with more than 100,000 GPUs, the shift from training to inference workloads, and NVIDIA's culture of expanding markets rather than competing for existing ones.

  • Main themes: Network performance as the determining factor in AI system efficiency, the evolution from single chips to massive interconnected data centers, and AI's potential as humanity's "spaceship of the mind"

Speakers

Michael Kagan

Michael Kagan is the CTO of NVIDIA and co-founder of Mellanox, which NVIDIA acquired for $7 billion in a deal announced in March 2019. He previously served as a chief architect at Intel for 16 years before co-founding Mellanox 25 years ago. Kagan has been a major driver of NVIDIA's dominance as the AI compute platform and has been pushing the compute frontier forward for more than four decades.

Sonya Huang

Sonya Huang is a host of the podcast and partner at Sequoia Capital, focusing on AI and enterprise technology investments.

Pat Grady

Pat Grady is a co-host of the podcast and partner at Sequoia Capital, specializing in enterprise software and AI infrastructure investments.

Key Takeaways

Network Performance Determines AI System Efficiency

The key insight from Kagan is that network performance, not just compute power, ultimately determines how efficiently AI systems can scale. (09:52) When splitting AI workloads across thousands of GPUs, communication latency becomes the bottleneck: if network communication takes too long, all GPUs must wait, drastically reducing system efficiency. Kagan explains that networks need consistent, low-latency performance across every connection, not just hero numbers for peak bandwidth. This insight reveals why NVIDIA's acquisition of Mellanox was so critical: without high-performance networking, even the most powerful GPUs sit underutilized in large-scale deployments.
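To make the bottleneck concrete, here is a minimal sketch of the arithmetic. All numbers are illustrative assumptions, not figures from the episode, and the model deliberately ignores compute/communication overlap:

```python
# Toy model: how communication time erodes multi-GPU efficiency.
# All numbers are illustrative assumptions, not NVIDIA measurements.

def step_efficiency(compute_ms: float, comm_ms: float) -> float:
    """Fraction of a training step spent on useful compute,
    assuming compute and communication do not overlap."""
    return compute_ms / (compute_ms + comm_ms)

# Because every GPU waits for the slowest exchange to finish,
# the whole cluster runs at the efficiency of a single step.
for comm_ms in (1, 5, 20):
    eff = step_efficiency(compute_ms=20, comm_ms=comm_ms)
    print(f"comm {comm_ms:>2} ms per step -> cluster efficiency {eff:.0%}")
```

Even modest latency on the slowest link drags the entire cluster down, which is why consistent latency matters more than peak bandwidth.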

Data Centers Are Now Single Computing Units

Modern AI requires thinking of an entire data center as a single computing unit rather than a collection of individual machines. (10:58) Kagan describes how NVIDIA architects its hardware and software at the data center level, designing 100,000-GPU systems that work together as one massive computer. This shift in perspective means that reliability, cooling, power, and networking must all be optimized for the entire system. The practical implication is that component reliability becomes critical: with millions of components, something is always broken, so systems must be designed to keep operating efficiently despite failures.
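A back-of-envelope calculation shows why "something is always broken" at this scale. The per-component health probability below is a made-up illustrative figure:

```python
# Back-of-envelope: at data-center scale, something is always down.
# The per-component health probability is an illustrative assumption.

p_healthy = 0.999999  # chance a given component is up at any instant

for n_components in (1_000, 1_000_000, 10_000_000):
    p_all_up = p_healthy ** n_components
    print(f"{n_components:>10,} components -> P(all working) = {p_all_up:.3g}")
```

Even at six-nines availability per component, a system with millions of parts almost never has every part healthy at once, so graceful degradation must be designed in from the start.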

Inference Workloads Now Exceed Training Demands

Contrary to popular belief, inference workloads now require as much computing power as training, or more. (21:02) Kagan explains that modern generative AI requires multiple inference passes for each response, and reasoning capabilities add even more computational layers. Additionally, while you train a model once, you perform inference billions of times as users constantly interact with the system. This shift means that organizations need to plan for inference-heavy workloads and optimize their infrastructure accordingly, potentially using specialized GPU configurations for different types of inference tasks.
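The "train once, infer billions of times" point can be sketched with rough numbers. Every figure below is a hypothetical assumption chosen to show the shape of the math, not an estimate of any real model:

```python
# Rough comparison: one-time training cost vs. cumulative inference cost.
# Every number is a hypothetical assumption, not data from the episode.

TRAIN_FLOPS = 1e25        # one-time cost to train the model
FLOPS_PER_QUERY = 1e15    # per-query cost; reasoning chains multiply this
QUERIES_PER_DAY = 1e8     # steady production traffic

breakeven_queries = TRAIN_FLOPS / FLOPS_PER_QUERY
days_to_breakeven = breakeven_queries / QUERIES_PER_DAY
print(f"Inference compute passes training after {breakeven_queries:.0e} queries")
print(f"... roughly {days_to_breakeven:.0f} days at {QUERIES_PER_DAY:.0e} queries/day")
```

Under these assumptions, cumulative inference overtakes the entire training run within months, and reasoning-heavy workloads pull the crossover even earlier.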

NVIDIA's Win-Win Culture Drives Market Expansion

NVIDIA's success comes from expanding markets rather than competing for existing market share. (30:51) Kagan emphasizes that NVIDIA focuses on "baking a bigger pie for everybody" rather than taking a larger piece of the existing pie. This philosophy drove the Intel partnership, where instead of viewing Intel as competition, NVIDIA sees an opportunity to fuse accelerated computing with general-purpose computing. This approach has enabled NVIDIA to grow markets exponentially rather than engage in zero-sum competition, contributing to their 45x market cap growth since the Mellanox acquisition.

AI Will Become Humanity's "Spaceship of the Mind"

Kagan envisions AI as humanity's "spaceship of the mind," far more powerful than Steve Jobs' description of the computer as a "bicycle for the mind." (39:32) He believes AI will enable people to accomplish 10 times more work, but will also inspire them to want to do 100 times more. This exponential amplification of human capability could revolutionize fields like experimental history through simulation, help discover new laws of physics, and fundamentally change how we approach problem-solving and innovation across all domains of human endeavor.

Statistics & Facts

  1. AI model performance requirements have shifted from Moore's Law's 2x growth every two years to demanding 10x or 16x performance growth annually. (03:03) This exponential acceleration began around 2010-2011 when GPUs transitioned from graphics processing to general AI processing units. (A compounding sketch follows this list.)
  2. NVIDIA's current GPU systems are rack-sized machines that require a forklift to move, not individual chips. A single "GPU" now consists of up to 72 individual processing units connected through NVLink technology. (05:14)
  3. The latest large-scale data centers consume 100-150 megawatts, with plans for gigawatt and even 10-gigawatt data centers in development. (28:39) These massive power requirements are driving the shift to liquid cooling to enable much denser compute configurations.
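To see how quickly the two growth curves in item 1 diverge, here is a minimal compounding sketch; the 10x-per-year pace is the episode's figure, while the ten-year horizon is an arbitrary choice for illustration:

```python
# Compounding comparison: Moore's-law-style 2x every two years
# versus the 10x-per-year pace described for AI compute demand.

years = 10  # arbitrary horizon for illustration
moores_law = 2 ** (years / 2)  # 2x every two years
ai_demand = 10 ** years        # 10x every year
print(f"After {years} years: Moore's law ~{moores_law:.0f}x, "
      f"AI demand ~{ai_demand:.0e}x")
```

A 32x improvement versus roughly ten billion x is the gap that forces scale-up and scale-out rather than waiting on process technology.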

Compelling Stories

Available with a Premium subscription

Thought-Provoking Quotes

Available with a Premium subscription

Strategies & Frameworks

Available with a Premium subscription

Similar Strategies

Available with a Plus subscription

Additional Context

Available with a Premium subscription

Key Takeaways Table

Available with a Plus subscription

Critical Analysis

Available with a Plus subscription

Books & Articles Mentioned

Available with a Plus subscription

Products, Tools & Software Mentioned

Available with a Plus subscription

More episodes like this

  • Figma CEO: From Idea to IPO, Design at Scale and AI’s Impact on Creativity (In Good Company with Nicolai Tangen, January 14, 2026)
  • BTC257: Bitcoin Mastermind Q1 2026 w/ Jeff Ross, Joe Carlasare, and American HODL (Bitcoin Podcast) (We Study Billionaires - The Investor’s Podcast Network, January 14, 2026)
  • Rory Sutherland on why luck beats logic in marketing (Uncensored CMO, January 14, 2026)
  • How to Make Billions from Exposing Fraud | E2234 (This Week in Startups, January 13, 2026)