
Timestamps are as accurate as possible but may be slightly off. We encourage you to listen to the episode for full context.
In this episode of the Stack Overflow podcast, Ryan Donovan interviews Benjamin Klieger, lead engineer at Groq, about the infrastructure powering AI agents. (00:57) They explore how to transform a one-minute agent response into a ten-second response, diving deep into fast inference, effective evaluations, and Groq's Compound agent system. (02:23) The conversation covers the technical challenges of building efficient agent frameworks, from model routing and parallelization to the complexities of creating reliable evaluation systems for real-time agent performance.
• Main themes: Agent infrastructure optimization, fast inference techniques, evaluation methodologies, and the engineering challenges of building production-ready AI agent systems that balance speed, cost, and quality.

Ryan is the podcast host and blog editor at Stack Overflow. He focuses on covering software development trends and interviewing industry leaders about emerging technologies and best practices in the developer community.
Benjamin is the lead engineer for AI agents at Groq, specializing in building fast, efficient agent infrastructure. He started his career on the product side seven years ago, co-founding a startup focused on creating healthier social media spaces for teenagers, before transitioning to become increasingly technical and eventually leading engineering efforts in AI agent development.
The difference between a 60-second agent response and a 10-second response is transformational for user experience and practical adoption. (04:14) Benjamin emphasizes that this speed improvement requires three key components: fast inference (using optimized hardware like Groq's LPU chip), fast tools (working directly with providers to reduce search latency from 10-15 seconds), and effective parallelization through sub-agents that can tackle different portions of complex tasks simultaneously.
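A rough sketch of the parallelization idea: fan independent sub-tasks out to concurrent sub-agents instead of running them one after another. This assumes Groq's Python SDK exposes an OpenAI-style async chat-completions client (AsyncGroq); the model name and sub-task prompts are illustrative, not from the episode.

```python
import asyncio
from groq import AsyncGroq  # assumes the official Groq Python SDK is installed

client = AsyncGroq()  # reads GROQ_API_KEY from the environment

async def run_subagent(task: str) -> str:
    """Run one sub-agent on a single portion of the larger task."""
    resp = await client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # illustrative model name
        messages=[{"role": "user", "content": task}],
    )
    return resp.choices[0].message.content

async def main() -> None:
    # Hypothetical decomposition of a research request into independent sub-tasks
    subtasks = [
        "Summarize recent benchmarks for fast LLM inference hardware.",
        "List common sources of latency in web-search tools.",
        "Outline strategies for parallelizing agent tool calls.",
    ]
    # Fan out: all sub-agents run concurrently, so wall-clock time is roughly
    # the slowest sub-task rather than the sum of all of them.
    results = await asyncio.gather(*(run_subagent(t) for t in subtasks))
    print("\n---\n".join(results))

asyncio.run(main())
```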
Rather than relying on a single model for all tasks, successful agent systems route different tasks to the models best suited for each job. (02:58) Benjamin explains that while this appears "model agnostic" to users, it requires sophisticated backend routing to leverage the strengths of different models - some excel at search tasks, others at code execution, and others at reasoning through complex problems.
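A minimal sketch of what that backend routing could look like. The task categories, model names, and keyword classifier are hypothetical; a production router would more likely classify the request with a small, fast model.

```python
# Illustrative mapping from task type to the model best suited for it
MODEL_ROUTES = {
    "search": "model-tuned-for-tool-use",       # hypothetical model names
    "code": "model-tuned-for-code-execution",
    "reasoning": "large-reasoning-model",
}

def classify_task(prompt: str) -> str:
    """Naive keyword classifier; stands in for a small routing model."""
    lowered = prompt.lower()
    if any(kw in lowered for kw in ("search", "latest", "news", "look up")):
        return "search"
    if any(kw in lowered for kw in ("code", "function", "traceback", "bug")):
        return "code"
    return "reasoning"

def route(prompt: str) -> str:
    """Pick the model for this request; the caller never sees which one was chosen."""
    return MODEL_ROUTES[classify_task(prompt)]

print(route("Look up the latest LPU benchmark results"))  # -> model-tuned-for-tool-use
```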
Standard evaluation benchmarks like SimpleQA have become saturated and unrepresentative of real-world agent usage. (11:58) Benjamin developed a real-time evaluation system that generates fresh questions from recent news stories every 24-48 hours, ensuring agents are tested on novel information rather than potentially memorized training data, which yields more accurate performance assessments.
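A sketch of how that rotating evaluation set might be structured. The headline fetcher and question generator below are hypothetical stand-ins; the point is the cache-and-regenerate logic on a roughly 24-hour window.

```python
import json
import time
from pathlib import Path

ROTATION_SECONDS = 24 * 60 * 60        # regenerate the set at least every 24 hours
EVAL_FILE = Path("eval_questions.json")

def fetch_recent_headlines() -> list[str]:
    # Hypothetical stand-in for pulling the last day's headlines from a news feed
    return ["Example headline about a product launch announced today"]

def generate_questions(headlines: list[str]) -> list[dict]:
    # Hypothetical stand-in for prompting a model to turn headlines into Q/A pairs
    return [{"question": f"What was announced? ({h})", "source": h} for h in headlines]

def load_or_refresh_eval_set() -> list[dict]:
    """Reuse the cached question set unless it is older than the rotation window."""
    if EVAL_FILE.exists():
        data = json.loads(EVAL_FILE.read_text())
        if time.time() - data["generated_at"] < ROTATION_SECONDS:
            return data["questions"]
    questions = generate_questions(fetch_recent_headlines())
    EVAL_FILE.write_text(json.dumps({"generated_at": time.time(), "questions": questions}))
    return questions

print(load_or_refresh_eval_set())
```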
Even with million-token context windows, simply stuffing more information doesn't improve performance due to "context rot" - models become overloaded and performance degrades. (20:34) Smart context management through auto-compacting, strategic summarization, and agentic loops for retrieving specific information when needed maintains both quality and efficiency while reducing costs.
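A minimal sketch of the auto-compacting idea, assuming a word-count proxy for tokens and a hypothetical summarize() call: once the history grows past a budget, older turns are folded into a single summary message while recent turns are kept verbatim.

```python
def estimate_tokens(messages: list[dict]) -> int:
    # Rough proxy; a real system would use the model's own tokenizer
    return sum(len(m["content"].split()) for m in messages)

def summarize(messages: list[dict]) -> str:
    # Hypothetical stand-in for an LLM call that condenses the older turns
    return "Summary of earlier conversation: " + " / ".join(
        m["content"][:40] for m in messages
    )

def compact(messages: list[dict], budget: int = 2000, keep_recent: int = 4) -> list[dict]:
    """If the history exceeds the budget, replace older turns with one summary message."""
    if estimate_tokens(messages) <= budget:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + recent
```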
Speculative decoding uses a smaller "draft model" to propose several tokens ahead, which are then validated by the full model, creating significant speed improvements without sacrificing quality. (09:03) Benjamin likens this to having productive interns: if they're right 70% of the time, the productivity boost is substantial compared to generating one token at a time.
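A toy greedy variant of that accept-or-reject loop, with two hypothetical next-token functions standing in for the draft and target models. Real implementations verify all draft tokens in a single forward pass of the target model (which is where the speedup comes from) and use probabilistic acceptance when sampling; this sketch only shows the prefix-acceptance logic.

```python
# Toy speculative decoding: the cheap draft model proposes k tokens,
# the expensive target model checks them, and we keep the agreeing prefix.

def draft_next(context: list[str]) -> str:
    order = ["fast", "inference", "needs", "fast", "tools", "and", "parallel", "agents"]
    return order[len(context) % len(order)]

def target_next(context: list[str]) -> str:
    order = ["fast", "inference", "needs", "good", "tools", "and", "parallel", "agents"]
    return order[len(context) % len(order)]

def speculative_step(context: list[str], k: int = 4) -> list[str]:
    """Draft k tokens cheaply, then accept the longest prefix the target agrees with."""
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(context)
    for tok in proposed:
        if target_next(ctx) == tok:             # target model confirms the draft token
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target_next(ctx))   # fall back to the target's own token
            break                               # discard the rest of the draft
    return accepted

tokens: list[str] = []
while len(tokens) < 8:
    tokens.extend(speculative_step(tokens))
print(" ".join(tokens))  # -> fast inference needs good tools and parallel agents
```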