Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full context.
In this episode of the MAD Podcast, host Matt Turck explores two opposing perspectives on AGI with Tim Dettmers from the Allen Institute for AI and Dan Fu from Together AI. (01:08) Tim argues in his provocative essay "Why AGI Will Not Happen" that we're hitting fundamental physical constraints and diminishing returns in computation, particularly citing memory movement bottlenecks and hardware limitations. (08:20) Meanwhile, Dan counters with his essay "Yes, AGI Will Happen," contending that current models are severely underutilizing available hardware and are lagging indicators of computational progress. (16:16) The conversation then shifts to practical applications, with both experts agreeing that AI agents have already reached a critical threshold for transforming software engineering and knowledge work. (32:12) They discuss the "software singularity" where coding agents can now tackle even the most complex programming challenges, and emphasize that professionals who don't adapt to using agents effectively will be left behind.
Tim is an Assistant Professor at Carnegie Mellon University in the Machine Learning and Computer Science departments and a Research Scientist at the Allen Institute for AI. He's renowned for his pioneering work in efficient deep learning and quantization, including the development of QLoRA, a breakthrough method for efficient fine-tuning that cuts memory use roughly 16-fold compared with standard 16-bit fine-tuning. Before moving into AI research, he worked for three years in the German automation industry.
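The rough 16x figure can be reproduced with back-of-the-envelope arithmetic. The sketch below uses illustrative numbers loosely based on the QLoRA paper's 65B-parameter setup; the exact byte counts (optimizer states, adapter fraction) are assumptions for illustration, not the paper's precise accounting.

```python
# Back-of-the-envelope memory estimate: full 16-bit fine-tuning vs a
# QLoRA-style setup (4-bit frozen base weights + small trained adapters).
# All constants are illustrative assumptions, not exact published figures.

PARAMS = 65e9  # 65B-parameter model

def full_finetune_gb(params, bytes_per_param=2):
    # 16-bit weights + 16-bit gradients + Adam optimizer states
    # (two fp32 moments per parameter)
    weights = params * bytes_per_param
    grads = params * bytes_per_param
    optimizer = params * 4 * 2
    return (weights + grads + optimizer) / 1e9

def qlora_gb(params, trainable_fraction=0.01):
    # 4-bit frozen base weights (0.5 bytes each); only ~1% of parameters
    # are trainable LoRA adapters, which need grads and optimizer states
    base = params * 0.5
    adapters = params * trainable_fraction * 2
    adapter_grads_opt = params * trainable_fraction * (2 + 8)
    return (base + adapters + adapter_grads_opt) / 1e9

print(f"full fine-tune: ~{full_finetune_gb(PARAMS):.0f} GB")
print(f"QLoRA-style:    ~{qlora_gb(PARAMS):.0f} GB")
```

Under these assumptions, full fine-tuning needs on the order of 780 GB while the quantized-plus-adapters setup fits in roughly 40 GB, which is where memory reductions in the 16x range come from.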
Dan is an Assistant Professor at UC San Diego and VP of Kernels at Together AI, where he focuses on making AI models run faster through specialized GPU programming. During his PhD, he co-developed FlashAttention, a crucial optimization for transformer models, and researched alternative architectures such as state-space models. At Together AI, he leads efforts to maximize hardware utilization and recently collaborated with Cursor to accelerate their Composer model for the Cursor 2.0 launch on NVIDIA's Blackwell GPUs.
Tim argues that computational progress faces fundamental physical barriers, particularly the von Neumann bottleneck that limits how efficiently data can be moved between memory and processors. (13:01) He explains that useful computation requires two key components: gathering data from different locations and transforming it into new information. The geometric constraints of moving information from large, slow memory (DRAM) to fast processing units create unavoidable latency issues. Modern optimizations like stacked HBM memory and quantization to 4-bit precision have reached their practical limits, with manufacturing yields becoming prohibitively difficult and no new breakthrough technologies on the horizon to overcome these bottlenecks.
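Tim's memory-movement argument can be made concrete with a roofline-style estimate: whether an operation is limited by compute or by memory bandwidth depends on its arithmetic intensity (FLOPs performed per byte moved from memory). The hardware numbers below are rough H100-like figures chosen for illustration, not a claim about any specific chip configuration.

```python
# Roofline-style sketch: is an operation compute-bound or memory-bound?
# Peak figures are rough, H100-like numbers used purely for illustration.

PEAK_FLOPS = 1.0e15   # ~1 PFLOP/s of dense 16-bit math
PEAK_BW = 3.35e12     # ~3.35 TB/s of HBM bandwidth

def bound_by(flops, bytes_moved):
    intensity = flops / bytes_moved     # FLOPs per byte moved
    ridge = PEAK_FLOPS / PEAK_BW        # ~300 FLOPs/byte break-even point
    return "compute" if intensity >= ridge else "memory"

# Matrix-vector product (e.g. batch-1 LLM decoding): ~2mn FLOPs while
# streaming ~2mn bytes of 16-bit weights -> ~1 FLOP/byte, memory-bound
m = n = 8192
print(bound_by(2 * m * n, 2 * m * n))

# Large matrix-matrix product (training-style GEMM): 2mnk FLOPs against
# far less traffic per FLOP -> compute-bound
k = 8192
bytes_gemm = 2 * (m * k + k * n + m * n)
print(bound_by(2 * m * n * k, bytes_gemm))
```

This is the shape of the bottleneck Tim describes: low-intensity workloads like single-stream inference sit on the memory side of the ridge, so faster arithmetic alone does not speed them up.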
Dan presents compelling evidence that today's best models operate at only 20% hardware utilization (MFU, Model FLOPs Utilization), compared to the 50-60% achieved in training runs of the early 2020s. (18:05) He points to DeepSeek's model, trained on 2,000 export-restricted ("nerfed") H800 GPUs for about a month in 2024, as an example of this massive underutilization. Since then, companies like Poolside and Reflection have built clusters with tens of thousands of next-generation B200 chips that are 2-3x faster, creating potential for up to 100x more available compute when combined with optimization improvements.
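MFU is straightforward to compute: the FLOPs a training run actually needed, divided by the FLOPs the cluster could theoretically deliver in that time. The sketch below uses the standard ~6ND approximation for dense-model training FLOPs; the run parameters are hypothetical, chosen only to show how a sub-20% figure arises.

```python
# Model FLOPs Utilization (MFU): achieved training FLOP/s as a fraction
# of hardware peak. All run parameters below are hypothetical examples.

def mfu(params, tokens, train_seconds, n_gpus, peak_flops_per_gpu):
    # ~6*N*D approximation for training FLOPs of an N-parameter dense
    # model over D tokens (forward + backward pass)
    flops_needed = 6 * params * tokens
    flops_available = n_gpus * peak_flops_per_gpu * train_seconds
    return flops_needed / flops_available

# Hypothetical run: 70B dense model, 15T tokens, 10,000 GPUs at
# ~1 PFLOP/s (16-bit) each, over 40 days
u = mfu(70e9, 15e12, 40 * 24 * 3600, 10_000, 1e15)
print(f"MFU: {u:.0%}")  # ~18% for these assumed numbers
```

The point of Dan's argument follows directly from the denominator: at 20% MFU, closing the utilization gap alone would multiply effective compute severalfold before any new hardware arrives.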
The models we interact with today were trained on hardware clusters built 1.5-2 years ago, creating a significant gap between current computational capabilities and deployed AI systems. (22:01) This lag occurs because large pre-training runs require substantial time for cluster setup, training execution, and post-training refinement including RLHF. As Dan explains, even OpenAI's GPT-4.5 was only partially trained on newer hardware, with most pre-training occurring on older H100 clusters while newer GB200 chips were used primarily for fine-tuning phases.
Dan describes his pivotal moment in June 2025, when AI agents became capable of writing GPU kernels - traditionally considered the "final boss" of programming challenges. (33:21) These highly specialized, parallel programs, typically written in CUDA C++, have historically required expert-level skills and weeks of development time. With agent assistance, Dan's team accomplished in single days what previously took months, achieving 5-10x productivity improvements. This breakthrough suggests agents have reached a level where they can accelerate even the most technically demanding programming work when guided by domain experts.
Both experts emphasize that using agents effectively requires treating them like junior team members who need clear context, task decomposition, and expert oversight. (44:02) Dan compares agent management to onboarding new interns - you wouldn't ask them to double company revenue, but with proper guidance and tools, they can be highly productive. Tim adds that 90% of code and text should be written by agents, but the critical 10% of human review and editing makes the difference between mediocre and excellent output. The key insight: agents amplify existing expertise rather than replace the need for deep domain knowledge.