
Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full context.
In this episode, Quentin Anthony, Head of Model Training at Zyphra and advisor at EleutherAI, shares his journey from working on Oak Ridge National Lab's Frontier supercomputer to leading Zyphra's ambitious transition to AMD MI300X GPUs. (02:35) He reveals how they're achieving performance that beats NVIDIA H100s on certain workloads while dramatically reducing costs, particularly for non-FP8 dense transformers and MoE models.
The conversation dives deep into the technical challenges of kernel development, with Anthony explaining why he often bypasses high-level frameworks like Triton to write directly in ROCm or even GPU assembly when necessary. (10:14) He discusses how Zyphra's hybrid transformer-Mamba models like Zamba 2 can match Llama 3 8B performance at 7B parameters, optimized specifically for edge deployment across a spectrum from 1.2B models for phones to 7B for desktops.
Anthony also candidly discusses his experience in the controversial METR software engineering productivity study, where he was one of the few developers who showed measurable speedup from AI tools while most participants were actually 20% slower despite feeling 20% faster. (29:47) He shares practical insights on avoiding the "slot machine effect" of endlessly prompting models, the importance of context rot awareness, and why he prefers direct API access over tools like Cursor to maintain complete control over model context.
Quentin Anthony is the Head of Model Training at Zyphra, a startup building foundation models focused on edge deployment, and serves as an advisor at EleutherAI, where he leads HPC initiatives. Previously, he worked on Oak Ridge National Lab's Frontier supercomputer and led development on GPT-NeoX, a widely adopted model training framework used by academic labs. Anthony has extensive experience in GPU kernel development across both the NVIDIA CUDA and AMD ROCm ecosystems, and has contributed significantly to open-source AI research with a focus on interpretability and model training dynamics.
Anthony emphasizes that understanding your hardware deeply can provide significant competitive advantages, particularly when everyone else assumes certain hardware won't work. (02:35) At Zyphra, they moved entirely to AMD MI300X GPUs and found they could beat NVIDIA H100 performance on certain workloads while reducing costs. The key insight is that the MI300X, with 192GB of HBM and higher memory bandwidth than the H100, excels when a workload spends less time in dense compute (e.g., tensor-core-style GEMMs) and more time on parallelism and on moving data to and from HBM, as the rough roofline sketch below illustrates. Competitive advantage, in other words, often comes from challenging conventional wisdom and doing the technical work others avoid.
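To make that trade-off concrete, here is a quick roofline-style calculation (not from the episode; the peak figures are approximate public spec-sheet numbers) showing why a memory-bound kernel favors the part with more HBM bandwidth:

```python
# Rough roofline check: is a kernel memory-bound or compute-bound?
# Peak numbers are approximate spec-sheet figures:
# (dense BF16 TFLOPS, HBM bandwidth in TB/s, HBM capacity in GB).
SPECS = {
    "MI300X": (1307.0, 5.3, 192),
    "H100 SXM": (989.0, 3.35, 80),
}

def classify(flops: float, hbm_bytes: float, gpu: str) -> str:
    peak_tflops, bw_tbs, _ = SPECS[gpu]
    intensity = flops / hbm_bytes                    # FLOPs per byte of HBM traffic
    ridge = (peak_tflops * 1e12) / (bw_tbs * 1e12)   # intensity where both limits meet
    kind = "compute-bound" if intensity > ridge else "memory-bound"
    return f"{gpu}: {intensity:.2f} FLOP/B vs ridge {ridge:.0f} FLOP/B -> {kind}"

# Elementwise add of two BF16 tensors with N elements: 1 FLOP per element,
# 6 bytes of HBM traffic per element (read x, read y, write out).
N = 1 << 30
for gpu in SPECS:
    print(classify(N, 6 * N, gpu))
```

An elementwise op like this sits far below either GPU's ridge point, so its runtime is set almost entirely by HBM bandwidth, where the MI300X's spec advantage is largest; dense FP8 GEMMs sit on the other side of the ridge, which matches the workloads where the H100 still wins.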
When building performance-critical systems, Anthony advocates for working from the hardware up rather than from high-level frameworks down. (10:14) He often bypasses Triton and writes directly in ROCm or even GPU assembly because he needs total control over tensor placement and memory management. His approach involves understanding exactly what the hardware can do, designing the optimal algorithm on a whiteboard, and then finding the highest-level tool that will actually implement that design. This bottom-up approach ensures you extract maximum performance rather than being limited by compiler decisions.
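As a rough illustration of what "the highest-level tool" can look like in practice (this is not code from the episode), here is a trivial fused scale-and-add kernel written in Triton. At this level the compiler, not the developer, decides tiling, register allocation, and scheduling, which is exactly the control Anthony sometimes reclaims by dropping to ROCm or assembly:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scaled_add_kernel(x_ptr, y_ptr, out_ptr, n_elements, scale,
                      BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Fusing the multiply and add keeps the intermediate out of HBM entirely.
    tl.store(out_ptr + offsets, x * scale + y, mask=mask)

def scaled_add(x: torch.Tensor, y: torch.Tensor, scale: float = 2.0) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    scaled_add_kernel[grid](x, y, out, n, scale, BLOCK_SIZE=1024)
    return out
```

In the bottom-up workflow he describes, you would check whether the generated ISA matches the whiteboard design before trusting an abstraction like this, and drop a level only if it doesn't.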
Anthony's hiring philosophy prioritizes intellectual velocity over position-specific knowledge. (49:16) He looks for people who are intellectually curious, demonstrate high learning velocity, and have a track record of diving deep into hard problems. He specifically mentions that physicists make excellent ML engineers because, like "embryonic stem cells", they can adapt to any domain. The goal is to find people who will eventually supersede more senior engineers who have plateaued, rather than hiring for current knowledge of specific technologies like CUDA.
Anthony was one of the few developers in the METR study who actually achieved positive speedup from AI tools, and he attributes this to strict "digital hygiene" practices. (43:25) He avoids the "slot machine effect" by time-boxing AI interactions and being brutally honest about whether he's actually being sped up or going down rabbit holes. He maintains complete control over model context, preferring direct API access over tools like Cursor, and keeps conversations short (1-2 turns) before starting fresh to avoid context rot. The key is being self-aware about your own tendencies to want to do less work.
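A minimal sketch of what that workflow can look like with direct API access (this uses the standard OpenAI Python client; the model name and prompts are placeholders, and none of it is code from the episode). Each task gets a fresh, hand-picked context instead of an ever-growing chat history:

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_once(task: str, context_files: list[str]) -> str:
    """One self-contained exchange: hand-picked context, no chat history."""
    context = "\n\n".join(Path(p).read_text() for p in context_files)
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            # The message list is rebuilt from scratch for every task,
            # so earlier turns can never silently rot this one.
            {"role": "system", "content": "You are a concise coding assistant."},
            {"role": "user", "content": f"{context}\n\nTask: {task}"},
        ],
    )
    return resp.choices[0].message.content
```

The design choice mirrors his hygiene rules: the developer, not the tool, decides exactly what the model sees, and a conversation that doesn't converge in a turn or two is discarded rather than extended.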
Rather than advocating for large collaborative efforts, Anthony believes in small, focused, well-funded teams with guaranteed resources from beginning to end. (55:56) He argues that most of EleutherAI's success has come from providing specific teams with compute, mentoring, and stability to work on very targeted problems, rather than attempting grand collaborative projects. This approach reduces coordination overhead and allows teams to go deep on specific technical challenges rather than getting lost in the noise of too many competing priorities and stakeholders.