
Timestamps are as accurate as possible but may be slightly off. We encourage you to listen to the full episode for context.
In this episode, NVIDIA's Ian Buck demystifies mixture-of-experts (MoE) architecture, the powerhouse behind today's leading frontier AI models. (00:54) Buck explains how MoE enables models to achieve higher intelligence scores while dramatically reducing computational costs by activating only the necessary "experts" rather than every parameter in the model. The discussion covers the DeepSeek moment that catalyzed widespread MoE adoption, (09:18) and explores how NVIDIA's extreme co-design approach - combining advanced hardware like NVLink connectivity with sophisticated software optimization - delivers exponential performance improvements that drive down the cost per token by up to 10x.
Ian Buck serves as Vice President of Hyperscale and High Performance Computing at NVIDIA, where he leads initiatives in AI infrastructure and extreme co-design strategies. He oversees the development of cutting-edge GPU architectures and works directly with leading AI companies to optimize performance and reduce costs for frontier AI models. Buck has been instrumental in advancing NVIDIA's platform capabilities from early Kepler GPUs to modern GB200 NVL72 systems.
Noah Kravitz hosts the NVIDIA AI Podcast, bringing complex AI concepts to broader audiences through engaging conversations with industry leaders and technical experts.
Instead of activating all parameters in a neural network, MoE models strategically activate only the relevant "experts" needed for a specific query. (03:12) Buck illustrates this with a compelling comparison: OpenAI's GPT-OSS model has 120 billion total parameters but activates only about 5 billion when answering a question, achieving a 61 intelligence score, while Llama's dense 405-billion-parameter model, which activates every parameter, scores 28. This approach delivers superior performance at dramatically lower computational cost, reducing benchmark costs from $200 to $75 while improving intelligence scores. The key insight is that, like human teams with specialized expertise, AI models work more efficiently when different experts handle different domains rather than one massive system handling everything.
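To make the gating idea concrete, here is a minimal Python/NumPy sketch of top-k expert routing; the expert count, hidden size, and weights are hypothetical placeholders rather than values from the episode:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # hypothetical expert count for one MoE layer
TOP_K = 2         # experts activated per token
D_MODEL = 16      # hypothetical hidden size

# Router and expert weights are random placeholders for illustration.
router_w = rng.standard_normal((D_MODEL, NUM_EXPERTS))
expert_w = rng.standard_normal((NUM_EXPERTS, D_MODEL, D_MODEL))

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs."""
    logits = x @ router_w                               # (tokens, experts)
    top_idx = np.argsort(logits, axis=-1)[:, -TOP_K:]   # chosen experts per token
    out = np.zeros_like(x)
    for t, token in enumerate(x):
        chosen = logits[t, top_idx[t]]
        weights = np.exp(chosen) / np.exp(chosen).sum() # softmax over chosen experts only
        for w, e in zip(weights, top_idx[t]):
            out[t] += w * (token @ expert_w[e])         # only TOP_K of NUM_EXPERTS experts run
    return out

tokens = rng.standard_normal((4, D_MODEL))
print(moe_layer(tokens).shape)                          # (4, 16)
print(f"experts touched per token: {TOP_K} of {NUM_EXPERTS}")
```

Every token still passes through the router, but only the selected experts' weights participate in the matrix multiplies, which is where the compute savings come from.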
The hidden cost of MoE isn't computation but communication between experts. (23:42) Buck notes that although an MoE model may activate only about 11% of the parameters a dense model would, it achieves roughly a 3x cost reduction rather than the theoretical ~10x because of communication overhead. NVIDIA's NVLink technology addresses this by letting every GPU communicate with every other GPU at full speed without bottlenecks. The GB200 NVL72 system connects 72 GPUs as one unified system, delivering 15x performance improvements over the previous generation while reducing cost per token by 10x, from $1 per million tokens to 10 cents.
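A rough Amdahl-style calculation shows why the savings are capped. In the sketch below, the 11% active-parameter fraction comes from the episode, while the 75/25 split between compute and non-shrinking overhead is an illustrative assumption:

```python
# Back-of-the-envelope model (illustrative assumptions, not NVIDIA's methodology) of why
# activating ~11% of a dense model's parameters yields ~3x savings instead of ~10x:
# compute shrinks with the active fraction, but expert-to-expert communication does not.

active_fraction = 0.11   # share of parameters an MoE query activates (figure from the episode)
compute_share = 0.75     # hypothetical share of dense per-token cost that is pure compute
overhead_share = 1.0 - compute_share   # communication and other costs that don't shrink

moe_cost = compute_share * active_fraction + overhead_share   # dense cost normalized to 1.0
print(f"theoretical speedup: {1 / active_fraction:.1f}x")     # ~9x if only compute mattered
print(f"modeled speedup:     {1 / moe_cost:.1f}x")            # ~3x once overhead is counted
```

Faster interconnects like NVLink attack the overhead term directly, which is why they matter more for MoE than for dense models.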
NVIDIA's approach combines hardware innovation, software optimization, and direct collaboration with model builders to achieve breakthrough performance. (28:35) Buck describes working with customers to apply the latest optimization techniques, achieving 2x performance improvements within two weeks through kernel fusion and overlapping NVLink communication with compute. This co-design philosophy ensures that each generation delivers exponential rather than incremental improvements, and NVIDIA employs more software engineers than hardware engineers to maximize end-to-end system performance across 72 GPUs running hundreds of experts simultaneously.
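As a rough illustration of how those two techniques compound, the toy timing model below (all figures are invented for the example, not NVIDIA measurements) separates per-kernel launch overhead, compute, and communication within a layer:

```python
# Toy per-layer timing model of the two optimizations mentioned above: kernel fusion
# removes per-kernel launch overhead, and overlapping NVLink communication with
# compute hides the transfer time.

kernels = 6               # hypothetical small kernels per layer before fusion
launch_us = 5.0           # hypothetical launch overhead per kernel, in microseconds
compute_us = 40.0         # compute time per layer, unchanged by fusion
comm_us = 30.0            # all-to-all expert communication per layer

baseline   = kernels * launch_us + compute_us + comm_us
fused      = 1 * launch_us + compute_us + comm_us            # one fused kernel per layer
overlapped = 1 * launch_us + max(compute_us, comm_us)        # communication hidden under compute

print(f"baseline layer time:    {baseline:.0f} us")
print(f"after kernel fusion:    {fused:.0f} us")
print(f"fusion + comm overlap:  {overlapped:.0f} us  (~{baseline / overlapped:.1f}x faster)")
```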
Modern MoE models like Kimi K2 demonstrate the power of extreme sparsity, featuring a trillion total parameters while activating only 32 billion (about 3%) for any given query. (25:56) This architecture requires sophisticated routing, with dozens of experts per layer and each token consulting multiple experts, but the complexity pays off with frontier-level intelligence at dramatically reduced cost. The approach mirrors effective human organizations where specialized teams collaborate rather than having one person handle everything, proving that AI efficiency comes from intelligent resource allocation rather than brute-force computation.
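The quoted sparsity follows directly from the parameter counts:

```python
# Quick check of the sparsity figures quoted for Kimi K2:
# roughly 1 trillion total parameters, about 32 billion active per query.
total_params = 1_000_000_000_000
active_params = 32_000_000_000

print(f"active fraction: {active_params / total_params:.1%}")          # ~3.2% of the model
print(f"parameters left idle per query: {total_params - active_params:,}")
```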
The sparsity optimization principles of MoE apply across vision, video, robotics, and scientific computing applications. (32:43) Buck emphasizes that any intelligent system benefits from using only the neural pathways needed for specific tasks, whether detecting objects, planning robot movements, or simulating physical phenomena. This universality means organizations investing in MoE-optimized infrastructure today are future-proofing their AI capabilities for emerging applications in drug discovery, materials science, and autonomous systems that will all benefit from the same efficient expert activation principles.