Latent Space: The AI Engineer Podcast • January 9, 2026

Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith

George Cameron and Micah Hill-Smith detail the journey of Artificial Analysis, an independent AI benchmarking platform that has evolved from a side project to a comprehensive resource for evaluating AI models across intelligence, performance, cost, and openness metrics.
Tags: AI & Machine Learning, Indie Hackers & SaaS Builders, Tech Policy & Ethics, Developer Culture, B2B SaaS Business, Nat Friedman, Daniel Gross, George Cameron

Summary Sections

  • Podcast Summary
  • Speakers
  • Key Takeaways
  • Statistics & Facts
  • Compelling Stories (Premium)
  • Thought-Provoking Quotes (Premium)
  • Strategies & Frameworks (Premium)
  • Similar Strategies (Plus)
  • Additional Context (Premium)
  • Key Takeaways Table (Plus)
  • Critical Analysis (Plus)
  • Books & Articles Mentioned (Plus)
  • Products, Tools & Software Mentioned (Plus)

Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full context.


Podcast Summary

George Cameron and Micah Hill-Smith of Artificial Analysis discuss their journey from launching a side project in 2023 to becoming the independent gold standard for AI benchmarking. (04:00) The conversation covers their business model, which combines enterprise benchmarking subscriptions with private custom evaluations while preserving independence: no one pays to appear on the public leaderboard. They detail their evolution from running standard evaluations to developing proprietary benchmarks such as the Omniscience Index for hallucination rates and GDPval-AA for agentic evaluation. (27:57) The discussion explores key industry trends, including the "smiling curve" phenomenon where GPT-4-level intelligence costs 100-1000x less than at launch, yet frontier reasoning models in agentic workflows cost more than ever due to sparsity, long context, and multi-turn interactions.

  • Main themes: Independent AI benchmarking evolution, cost paradoxes in AI inference, hallucination measurement, agentic evaluation frameworks, and model transparency through openness scoring

Speakers

George Cameron

Co-founder of Artificial Analysis, an Australian who moved to San Francisco and has been involved in AI for several years. George focuses on the technical development of benchmarking methodologies and has spoken at major AI conferences, including the AI Engineer World's Fair, about industry trends and cost analysis.

Micah Hill-Smith

Co-founder of Artificial Analysis, based in Sydney, Australia, before later moving to San Francisco through AI Grant. He previously worked on a legal AI research assistant in 2023, and the need to independently evaluate models for that product sparked the creation of Artificial Analysis. He handles much of the business development and enterprise customer relationships.

Key Takeaways

Independent Benchmarking Requires "Mystery Shopper" Methodology

Artificial Analysis discovered that labs often serve different models on private endpoints than on public APIs, prompting them to implement a "mystery shopper" policy: they register accounts outside their own domain and run benchmarks incognito. (13:15) This ensures they are evaluating the same models that real users access and prevents labs from gaming results through special endpoints. The approach has been accepted by labs because they also want assurance that competitors cannot manipulate benchmarks.

Hallucination Rate Doesn't Correlate with Intelligence

Through their Omniscience Index, they discovered that smarter models aren't necessarily better at saying "I don't know" when they lack information. (31:54) Claude models consistently show the lowest hallucination rates despite not always being the most intelligent, suggesting this is a post-training optimization rather than an inherent capability. This finding challenges assumptions about model behavior and highlights the importance of measuring different dimensions of AI performance separately.
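To make the distinction concrete, here is a minimal sketch of one way an abstention-aware metric can separate accuracy from hallucination. The scoring definition, the `Answer` structure, and the toy data are illustrative assumptions, not the actual Omniscience Index methodology:

```python
from dataclasses import dataclass

@dataclass
class Answer:
    correct: bool    # model gave the right answer
    abstained: bool  # model declined to answer ("I don't know")

def accuracy(answers: list[Answer]) -> float:
    return sum(a.correct for a in answers) / len(answers)

def hallucination_rate(answers: list[Answer]) -> float:
    """Of the questions the model did NOT get right, what share did it
    answer wrongly instead of abstaining? Illustrative definition only."""
    wrong = sum(1 for a in answers if not a.correct and not a.abstained)
    abstained = sum(1 for a in answers if a.abstained)
    missed = wrong + abstained
    return wrong / missed if missed else 0.0

# Toy data: a model that always guesses vs. one post-trained to abstain.
guesser   = [Answer(True, False)] * 70 + [Answer(False, False)] * 30
abstainer = [Answer(True, False)] * 65 + [Answer(False, True)] * 30 + [Answer(False, False)] * 5

for name, answers in [("guesser", guesser), ("abstainer", abstainer)]:
    print(f"{name:9s} accuracy={accuracy(answers):.2f} "
          f"hallucination_rate={hallucination_rate(answers):.2f}")
```

In this toy data, the guessing model scores higher on raw accuracy yet answers every unknown question wrongly, while the abstaining model trades a few points of accuracy for a far lower hallucination rate, mirroring the Claude observation above.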

The Cost Paradox of AI: Smarter and More Expensive Simultaneously

While GPT-4-level intelligence costs 100-1000x less than at launch due to smaller models achieving similar performance, total AI spending has increased dramatically due to frontier reasoning models, agentic workflows, and longer context requirements. (57:57) This "smiling curve" phenomenon means organizations can achieve basic intelligence cheaply while advanced use cases become increasingly expensive, fundamentally changing how companies should budget for AI capabilities.
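Rough back-of-the-envelope arithmetic shows how both halves of the paradox can hold at once. All prices and token counts below are hypothetical placeholders chosen for illustration, not figures quoted in the episode:

```python
# All numbers are hypothetical placeholders for illustration.
price_gpt4_launch = 30.00 / 1e6   # $/token for GPT-4-class output in 2023 (~$30 per 1M tokens)
price_small_today = 0.10 / 1e6    # $/token for a small model of similar capability today
price_frontier    = 15.00 / 1e6   # $/token for a frontier reasoning model today

simple_task_tokens  = 1_000            # a one-shot chat completion
agentic_task_tokens = 40 * 50_000      # 40 agent turns, ~50k context + reasoning tokens each

print(f"GPT-4-level one-shot, 2023:   ${price_gpt4_launch * simple_task_tokens:.4f}")
print(f"Same capability one-shot now: ${price_small_today * simple_task_tokens:.6f}")  # ~300x cheaper
print(f"Frontier agentic task now:    ${price_frontier * agentic_task_tokens:.2f}")    # dwarfs both
```

Even a triple-digit drop in the per-token price of GPT-4-level output is swamped once a single agentic task consumes millions of frontier-model tokens.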

Sparsity May Have Further to Go Than Expected

Current frontier models are already extremely sparse: GPT-4 activates roughly 5% of its parameters per token, while models like Kimi K2 activate only about 3%. (64:04) Their data shows model accuracy correlates more with total parameters than with active parameters, suggesting massive sparse models could be the future architecture, with the active-parameter fraction potentially pushed even lower.
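For intuition, here is a small sketch of what these active-parameter fractions look like for mixture-of-experts models. Only the rounded Kimi K2 row reflects publicly reported figures (~1T total, ~32B active); the other rows are hypothetical:

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Share of a model's weights used per token (a lower fraction means a sparser model)."""
    return active_b / total_b

# Parameter counts in billions; only the Kimi K2 row uses public (rounded) figures.
models = {
    "dense 70B model":   (70, 70),     # every weight active on every token
    "hypothetical MoE":  (400, 20),    # e.g. a few experts routed per token
    "Kimi K2 (approx.)": (1000, 32),   # ~1T total, ~32B active => ~3%
}

for name, (total, active) in models.items():
    print(f"{name:18s} {active_fraction(total, active):6.1%} active "
          f"({active}B of {total}B parameters per token)")
```

The episode's point is that accuracy tracks the total-parameter column more closely than the active one, which is what makes even sparser architectures attractive.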

Token Efficiency Matters More Than Raw Speed for Agentic Workflows

As models become more integrated into multi-turn agentic workflows, the ability to use more tokens only when needed becomes crucial for cost management. (68:45) Models like GPT-5 may cost more per token but solve complex tasks in fewer turns, making them cheaper overall for certain applications. This shift requires evaluating models on turn efficiency rather than just token efficiency for real-world deployment decisions.
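A hedged sketch of the turn-efficiency point: the per-token prices, turn counts, and tokens per turn below are made-up numbers chosen to show how a pricier model can still be cheaper per task, not measurements of GPT-5 or any other model:

```python
def task_cost(price_per_m_tokens: float, turns: int, tokens_per_turn: int) -> float:
    """Total cost of an agentic task = turns * tokens/turn * $/token."""
    return turns * tokens_per_turn * price_per_m_tokens / 1e6

# Hypothetical models: pricier-but-efficient vs. cheap-but-chatty.
frontier_cost = task_cost(price_per_m_tokens=15.0, turns=8,  tokens_per_turn=20_000)
budget_cost   = task_cost(price_per_m_tokens=2.0,  turns=60, tokens_per_turn=25_000)

print(f"frontier model: ${frontier_cost:.2f} per task")  # 8 turns * 20k tokens at $15/M = $2.40
print(f"budget model:   ${budget_cost:.2f} per task")    # 60 turns * 25k tokens at $2/M = $3.00
```

On these assumed numbers, the model that costs 7.5x more per token finishes the task for less money because it needs far fewer turns, which is why per-task rather than per-token comparison matters for deployment decisions.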

Statistics & Facts

  1. GPT-4-level intelligence now costs 100-1000x less than GPT-4 did at launch, with Amazon Nova models providing extremely cheap access to this level of capability. (57:57) This dramatic cost reduction is driven by smaller models achieving similar performance levels.
  2. Current frontier models are already extremely sparse: GPT-4 models activate approximately 5% of their parameters per token, while Kimi K2 activates around 3%. (64:04) This suggests significant room to push the active-parameter fraction even lower in future architectures.
  3. Artificial Analysis runs benchmarks with 95% confidence intervals requiring multiple repeats, significantly multiplying their actual costs beyond the single-run costs they report publicly. (12:31) This ensures statistical reliability but makes independent benchmarking increasingly expensive; a minimal sketch of the repeat-and-interval arithmetic follows below.
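
As a minimal sketch of why the repeats multiply cost, the following averages repeated runs and reports a normal-approximation 95% interval; the run scores and the interval method are assumptions, not Artificial Analysis's published procedure:

```python
import statistics

def mean_and_ci95(scores: list[float]) -> tuple[float, float]:
    """Mean score across repeats and a ~95% half-width (normal approximation, 1.96 * SE)."""
    mean = statistics.mean(scores)
    std_err = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, 1.96 * std_err

# Hypothetical accuracies from five independent runs of one eval on one model.
runs = [0.712, 0.698, 0.705, 0.721, 0.709]
mean, half_width = mean_and_ci95(runs)
print(f"score = {mean:.3f} +/- {half_width:.3f} (95% CI)")
print(f"cost multiplier vs. a single run: {len(runs)}x")
```

Tightening the interval requires more repeats, and every repeat is another full pass over the eval at API prices.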

More episodes like this

  • Finding Mastery with Dr. Michael Gervais • January 14, 2026: How To Stay Calm Under Stress | Dan Harris
  • The James Altucher Show • January 14, 2026: From the Archive: Sara Blakely on Fear, Failure, and the First Big Win
  • In Good Company with Nicolai Tangen • January 14, 2026: Figma CEO: From Idea to IPO, Design at Scale and AI’s Impact on Creativity
  • Uncensored CMO • January 14, 2026: Rory Sutherland on why luck beats logic in marketing