
Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full context.
In this episode, Quentin Anthony, Head of Model Training at Zyphra and advisor at EleutherAI, shares his journey from working on Oak Ridge National Lab's Frontier supercomputer to leading Zyphra's ambitious transition to AMD MI300X GPUs. (02:35) He reveals how they're achieving performance that beats NVIDIA H100s on certain workloads while dramatically reducing costs, particularly for non-FP8 dense transformers and MoE models.
The conversation dives deep into the technical challenges of kernel development, with Anthony explaining why he often bypasses high-level frameworks like Triton to write directly in ROCm or even GPU assembly when necessary. (10:14) He discusses how Zyphra's hybrid transformer-Mamba models like Zamba 2 can match Llama 3 8B performance at 7B parameters, optimized specifically for edge deployment across a spectrum from 1.2B models for phones to 7B for desktops.
Anthony also candidly discusses his experience in the controversial METR software engineering productivity study, where he was one of the few developers who showed measurable speedup from AI tools while most participants were actually 20% slower despite feeling 20% faster. (29:47) He shares practical insights on avoiding the "slot machine effect" of endlessly prompting models, the importance of context rot awareness, and why he prefers direct API access over tools like Cursor to maintain complete control over model context.
Quentin Anthony is the Head of Model Training at Zyphra, a startup building foundation models focused on edge deployment, and serves as an advisor at EleutherAI, where he leads HPC initiatives. Previously, he worked on Oak Ridge National Lab's Frontier supercomputer and led development on GPT-NeoX, a widely adopted model training framework used by academic labs. Anthony has extensive experience in GPU kernel development across both the NVIDIA CUDA and AMD ROCm ecosystems, and has contributed significantly to open-source AI research with a focus on interpretability and model training dynamics.
Anthony emphasizes that understanding your hardware deeply can provide significant competitive advantages, particularly when everyone else assumes certain hardware won't work. (02:35) At Zyphra, they moved entirely to AMD MI300X GPUs and found they could beat NVIDIA H100 performance on certain workloads while reducing costs. The key insight is that the MI300X, with 192GB of HBM and higher memory bandwidth than the H100, excels when a workload spends less time in dense compute (e.g., tensor-core-style GEMMs) and more time on parallelism and on moving data to and from HBM, as the rough roofline sketch below illustrates. Competitive advantage, in other words, often comes from challenging conventional wisdom and doing the technical work others avoid.
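To make that trade-off concrete, here is a quick roofline-style calculation (not from the episode; the peak figures are approximate public spec-sheet numbers) showing why a memory-bound kernel favors the part with more HBM bandwidth:

```python
# Rough roofline check: is a kernel memory-bound or compute-bound?
# Peak numbers are approximate spec-sheet figures:
# (dense BF16 TFLOPS, HBM bandwidth in TB/s, HBM capacity in GB).
SPECS = {
    "MI300X": (1307.0, 5.3, 192),
    "H100 SXM": (989.0, 3.35, 80),
}

def classify(flops: float, hbm_bytes: float, gpu: str) -> str:
    peak_tflops, bw_tbs, _ = SPECS[gpu]
    intensity = flops / hbm_bytes                    # FLOPs per byte of HBM traffic
    ridge = (peak_tflops * 1e12) / (bw_tbs * 1e12)   # intensity where both limits meet
    kind = "compute-bound" if intensity > ridge else "memory-bound"
    return f"{gpu}: {intensity:.2f} FLOP/B vs ridge {ridge:.0f} FLOP/B -> {kind}"

# Elementwise add of two BF16 tensors with N elements: 1 FLOP per element,
# 6 bytes of HBM traffic per element (read x, read y, write out).
N = 1 << 30
for gpu in SPECS:
    print(classify(N, 6 * N, gpu))
```

An elementwise op like this sits far below either GPU's ridge point, so its runtime is set almost entirely by HBM bandwidth, where the MI300X's spec advantage is largest; dense FP8 GEMMs sit on the other side of the ridge, which matches the workloads where the H100 still wins.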
When building performance-critical systems, Anthony advocates for working from the hardware up rather than from high-level frameworks down. (10:14) He often bypasses Triton and writes directly in ROCm or even GPU assembly because he needs total control over tensor placement and memory management. His approach involves understanding exactly what the hardware can do, designing the optimal algorithm on a whiteboard, and then finding the highest-level tool that will actually implement that design. This bottom-up approach ensures you extract maximum performance rather than being limited by compiler decisions.
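As a rough illustration of what "the highest-level tool" can look like in practice (this is not code from the episode), here is a trivial fused scale-and-add kernel written in Triton. At this level the compiler, not the developer, decides tiling, register allocation, and scheduling, which is exactly the control Anthony sometimes reclaims by dropping to ROCm or assembly:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def scaled_add_kernel(x_ptr, y_ptr, out_ptr, n_elements, scale,
                      BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Fusing the multiply and add keeps the intermediate out of HBM entirely.
    tl.store(out_ptr + offsets, x * scale + y, mask=mask)

def scaled_add(x: torch.Tensor, y: torch.Tensor, scale: float = 2.0) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    scaled_add_kernel[grid](x, y, out, n, scale, BLOCK_SIZE=1024)
    return out
```

In the bottom-up workflow he describes, you would check whether the generated ISA matches the whiteboard design before trusting an abstraction like this, and drop a level only if it doesn't.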
Anthony's hiring philosophy prioritizes intellectual velocity over position-specific knowledge. (49:16) He looks for people who are intellectually curious, demonstrate high learning velocity, and have a track record of diving deep into hard problems. He specifically mentions that physicists make excellent ML engineers because, like "embryonic stem cells", they can adapt to any domain. The goal is to find people who will eventually supersede more senior engineers who have plateaued, rather than hiring for current knowledge of specific technologies like CUDA.
Anthony was one of the few developers in the METR study who actually achieved positive speedup from AI tools, and he attributes this to strict "digital hygiene" practices. (43:25) He avoids the "slot machine effect" by time-boxing AI interactions and being brutally honest about whether he's actually being sped up or going down rabbit holes. He maintains complete control over model context, preferring direct API access over tools like Cursor, and keeps conversations short (1-2 turns) before starting fresh to avoid context rot. The key is being self-aware about your own tendencies to want to do less work.
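A minimal sketch of what that workflow can look like with direct API access (this uses the standard OpenAI Python client; the model name and prompts are placeholders, and none of it is code from the episode). Each task gets a fresh, hand-picked context instead of an ever-growing chat history:

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_once(task: str, context_files: list[str]) -> str:
    """One self-contained exchange: hand-picked context, no chat history."""
    context = "\n\n".join(Path(p).read_text() for p in context_files)
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            # The message list is rebuilt from scratch for every task,
            # so earlier turns can never silently rot this one.
            {"role": "system", "content": "You are a concise coding assistant."},
            {"role": "user", "content": f"{context}\n\nTask: {task}"},
        ],
    )
    return resp.choices[0].message.content
```

The design choice mirrors his hygiene rules: the developer, not the tool, decides exactly what the model sees, and a conversation that doesn't converge in a turn or two is discarded rather than extended.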
Rather than advocating for large collaborative efforts, Anthony believes in small, focused, well-funded teams with guaranteed resources from beginning to end. (55:56) He argues that most of EleutherAI's success has come from providing specific teams with compute, mentoring, and stability to work on very targeted problems, rather than attempting grand collaborative projects. This approach reduces coordination overhead and allows teams to go deep on specific technical challenges rather than getting lost in the noise of too many competing priorities and stakeholders.