Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full context.
This episode features Michael Kagan, CTO of NVIDIA and co-founder of Mellanox, discussing the critical role of networking in AI infrastructure and how the $7 billion Mellanox acquisition transformed NVIDIA from a chip company into the architect of AI infrastructure. (02:29) Kagan explains how AI workloads require exponential performance growth - from Moore's Law's 2x every two years to AI's demand for 10x or 16x performance annually - necessitating massive scale-up and scale-out solutions. (02:38) The conversation explores the technical challenges of building 100,000+ GPU data centers, the shift from training to inference workloads, and NVIDIA's culture of expanding markets rather than competing for existing ones.
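The gap between those two growth rates compounds quickly. A back-of-the-envelope calculation (illustrative numbers only, using the 2x-per-two-years and 10x-per-year figures mentioned above):

```python
# Illustrative arithmetic: Moore's Law doubling vs. AI's demanded growth rate.
# Moore's Law: performance doubles every two years.
# AI demand (per Kagan's framing): roughly 10x per year.

years = 4
moore_gain = 2 ** (years / 2)   # 2x every two years -> 4x over four years
ai_demand = 10 ** years         # 10x annually -> 10,000x over four years

# The remaining gap is what scale-up and scale-out must close.
gap = ai_demand / moore_gain
print(f"Moore's Law: {moore_gain:.0f}x, AI demand: {ai_demand:.0f}x, gap: {gap:.0f}x")
```

Single-chip improvements alone cannot close a gap of that size, which is why the conversation centers on connecting many GPUs together efficiently.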
Michael Kagan is the CTO of NVIDIA and co-founder of Mellanox, which NVIDIA acquired for $7 billion in a deal announced in March 2019. He spent 16 years at Intel, where he served as a chief architect, before co-founding Mellanox 25 years ago. Kagan has been a major driver of NVIDIA's dominance as the AI compute platform and has been pushing the compute frontier forward for more than four decades.
Sonya Huang is a host of the podcast and partner at Sequoia Capital, focusing on AI and enterprise technology investments.
Pat Grady is a co-host of the podcast and partner at Sequoia Capital, specializing in enterprise software and AI infrastructure investments.
The key insight from Kagan is that network performance, not just compute power, ultimately determines how efficiently AI systems can scale. (09:52) When splitting AI workloads across thousands of GPUs, communication latency becomes the bottleneck. If network communication takes too long, all GPUs must wait, drastically reducing system efficiency. Kagan explains that networks need consistent, low-latency performance across all connections - not just hero numbers for peak bandwidth. This insight reveals why NVIDIA's acquisition of Mellanox was so critical: without high-performance networking, even the most powerful GPUs become underutilized in large-scale deployments.
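This waiting effect can be sketched with a simple strong-scaling model (hypothetical numbers; the function name and figures are illustrative, not from the episode): each training step splits a fixed amount of compute across the GPUs, but the synchronizing communication adds a latency term that every GPU must wait out.

```python
# Minimal sketch of why communication latency caps scaling efficiency.
# Model: step_time = compute_time / n_gpus + comm_time, since all GPUs
# must wait for the collective exchange before the next step.

def scaling_efficiency(n_gpus: int, compute_time: float, comm_time: float) -> float:
    """Fraction of the ideal linear speedup actually achieved."""
    ideal_step = compute_time / n_gpus               # perfect scaling, no communication
    actual_step = compute_time / n_gpus + comm_time  # communication stalls every GPU
    return ideal_step / actual_step

# Hypothetical: 1.0 s of compute per step, 1 ms of collective-communication latency.
for n in (8, 1_000, 100_000):
    print(f"{n:>7} GPUs: {scaling_efficiency(n, 1.0, 0.001):.1%} efficient")
```

At small scale the 1 ms barely registers, but at 100,000 GPUs the per-GPU compute slice shrinks below the communication time and efficiency collapses, which is why consistent low latency on every link matters more than peak bandwidth on the best one.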
Modern AI requires thinking of entire data centers as single computing units rather than collections of individual machines. (10:58) Kagan describes how NVIDIA architects its hardware and software at the data center level, designing for 100,000 GPU systems that work together as one massive computer. This shift in perspective means that reliability, cooling, power, and networking must all be optimized for the entire system. The practical implication is that component reliability becomes critical - with millions of components, something is always broken, so systems must be designed to continue operating efficiently despite failures.
Contrary to popular belief, inference workloads now require as much computing power as training, or more. (21:02) Kagan explains that modern generative AI requires multiple inference passes for each response, and reasoning capabilities add even more computational layers. Additionally, while you train a model once, you perform inference billions of times with users constantly interacting with the system. This shift means that organizations need to plan for inference-heavy workloads and optimize their infrastructure accordingly, potentially using specialized GPU configurations for different types of inference tasks.
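A rough back-of-the-envelope sketch makes the train-once, infer-billions-of-times point concrete. All the numbers below are hypothetical placeholders, not figures from the episode:

```python
# Hypothetical lifetime-compute comparison: one-time training vs. serving.
# Every constant here is an assumed placeholder for illustration.

training_flops = 1e25           # one-time training cost (assumed)
flops_per_pass = 1e12           # single inference pass (assumed)
passes_per_request = 5          # reasoning multiplies passes per response (assumed)
lifetime_requests = 5e12        # requests served over the model's lifetime (assumed)

inference_flops = flops_per_pass * passes_per_request * lifetime_requests
ratio = inference_flops / training_flops
print(f"lifetime inference / training compute: {ratio:.1f}x")
```

Even with a tiny per-request cost, the request volume and the extra passes that reasoning adds push aggregate inference compute past the one-time training cost, which is the shift Kagan describes.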
NVIDIA's success comes from expanding markets rather than competing for existing market share. (30:51) Kagan emphasizes that NVIDIA focuses on "baking a bigger pie for everybody" rather than taking a larger piece of the existing pie. This philosophy drove the Intel partnership: instead of viewing Intel as competition, NVIDIA saw an opportunity to fuse accelerated computing with general-purpose computing. This approach has enabled NVIDIA to grow markets exponentially rather than engage in zero-sum competition, contributing to their 45x market cap growth since the Mellanox acquisition.
Kagan envisions AI as humanity's "spaceship of the mind" - far more powerful than Steve Jobs' description of the computer as a "bicycle for the mind." (39:32) He believes AI will enable people to accomplish 10 times more work, but will also inspire them to want to do 100 times more. This exponential amplification of human capability could revolutionize fields like experimental history through simulation, help discover new laws of physics, and fundamentally change how we approach problem-solving and innovation across all domains of human endeavor.