
Timestamps are as accurate as possible but may be slightly off. We encourage you to listen to the full episode for context.
George Cameron and Micah Hill-Smith of Artificial Analysis discuss their journey from launching a side project in 2023 to becoming the independent gold standard for AI benchmarking. (04:00) The conversation covers their business model, which includes enterprise benchmarking subscriptions and private custom evaluations; they maintain independence by ensuring no one pays to appear on their public leaderboard. They detail their evolution from running standard evaluations to developing proprietary benchmarks like the Omniscience Index for hallucination rates and GDPval AA for agentic evaluation. (27:57) The discussion explores key industry trends, including the "smiling curve" phenomenon: GPT-4-level intelligence costs 100-1000x less than at launch, yet frontier reasoning models in agentic workflows cost more than ever due to sparsity, long context, and multi-turn interactions.
Co-founder of Artificial Analysis, an Australian who moved to San Francisco and has been involved in AI for several years. George focuses on the technical development of benchmarking methodologies and has spoken at major AI conferences, including the AI Engineer World's Fair, about industry trends and cost analysis.
Co-founder of Artificial Analysis, based in Sydney, Australia (though he later moved to SF through AI Grant). In 2023 he was building a legal AI research assistant, an experience that exposed the need for independent model evaluation and sparked the creation of Artificial Analysis. Micah handles much of the business development and enterprise customer relationships.
Artificial Analysis discovered that labs often serve different models on private endpoints than on public APIs, prompting them to adopt a "mystery shopper" policy: they register accounts not tied to their domain and run benchmarks incognito. (13:15) This ensures they evaluate the same models that real users access and prevents labs from gaming results through special endpoints. Labs have accepted the approach because they also want assurance that competitors can't manipulate the benchmarks.
Through their Omniscience Index, they discovered that smarter models aren't necessarily better at saying "I don't know" when they lack information. (31:54) Claude models consistently show the lowest hallucination rates despite not always being the most intelligent, suggesting this is a post-training optimization rather than an inherent capability. This finding challenges assumptions about model behavior and highlights the importance of measuring different dimensions of AI performance separately.
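As a rough illustration, an abstention-aware scoring scheme along these lines separates hallucination rate from plain accuracy. The fields and formula below are a hypothetical sketch, not Artificial Analysis's exact methodology.

```python
# Hypothetical sketch: score answers so that hallucination rate is measured
# separately from accuracy. A wrong answer only counts as a hallucination when
# the model actually asserts it; "I don't know" counts as an abstention.
from dataclasses import dataclass


@dataclass
class Response:
    answer: str        # the model's answer text
    is_correct: bool   # graded against known ground truth
    abstained: bool    # model declined to answer ("I don't know")


def summarize(responses: list[Response]) -> dict[str, float]:
    n = len(responses)
    correct = sum(r.is_correct for r in responses)
    attempted = [r for r in responses if not r.abstained]
    hallucinated = sum(not r.is_correct for r in attempted)
    return {
        "accuracy": correct / n,
        # Wrong answers as a share of attempted (non-abstained) answers.
        "hallucination_rate": hallucinated / len(attempted) if attempted else 0.0,
        "abstention_rate": (n - len(attempted)) / n,
    }


if __name__ == "__main__":
    sample = [
        Response("Paris", is_correct=True, abstained=False),
        Response("I don't know", is_correct=False, abstained=True),
        Response("42 BC", is_correct=False, abstained=False),  # confident but wrong
    ]
    print(summarize(sample))
```

Under this kind of scoring, a model that abstains when unsure can have a lower hallucination rate than a "smarter" model that guesses, which is the separation of dimensions the episode describes.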
While GPT-4-level intelligence costs 100-1000x less than at launch due to smaller models achieving similar performance, total AI spending has increased dramatically due to frontier reasoning models, agentic workflows, and longer context requirements. (57:57) This "smiling curve" phenomenon means organizations can achieve basic intelligence cheaply while advanced use cases become increasingly expensive, fundamentally changing how companies should budget for AI capabilities.
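For a back-of-the-envelope feel for the two opposing trends, the sketch below uses purely hypothetical prices and token counts: the per-token cost of GPT-4-level output collapses, while an agentic workload burns enough tokens that total spend per task still climbs.

```python
# All prices and token counts are hypothetical placeholders, not measured figures.
PRICE_GPT4_LAUNCH = 30.00        # $/1M tokens, GPT-4-class model at launch (assumed)
PRICE_SMALL_MODEL = 0.15         # $/1M tokens, small model matching GPT-4-level scores (assumed)
PRICE_FRONTIER_REASONER = 30.00  # $/1M tokens, current frontier reasoning model (assumed)

CHAT_TASK_TOKENS = 2_000     # single-turn chat completion
AGENT_TASK_TOKENS = 400_000  # multi-turn agentic workflow: reasoning traces + long context

chat_2023 = CHAT_TASK_TOKENS / 1e6 * PRICE_GPT4_LAUNCH
chat_now = CHAT_TASK_TOKENS / 1e6 * PRICE_SMALL_MODEL
agent_now = AGENT_TASK_TOKENS / 1e6 * PRICE_FRONTIER_REASONER

print(f"GPT-4-level chat task in 2023:    ${chat_2023:.4f}")
print(f"Same task on a small model today: ${chat_now:.4f} ({chat_2023 / chat_now:.0f}x cheaper)")
print(f"Frontier agentic task today:      ${agent_now:.2f} (total spend per task still rises)")
```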
Current frontier models activate only a small fraction of their parameters per token: GPT-4 at roughly 5% active parameters, and some models, like Kimi K2, at just ~3%. (64:04) Their data shows model accuracy correlates more with total parameters than with active parameters, suggesting massive sparse models could be the future architecture, with the active fraction potentially pushed even lower than today's levels.
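For reference, the active-parameter share is simply active parameters divided by total parameters. The snippet below uses Kimi K2's publicly stated sizes alongside one hypothetical configuration for comparison.

```python
# Active-parameter share = active parameters / total parameters.
# Kimi K2 sizes are Moonshot AI's published figures; the other entry is hypothetical.
models = {
    "Kimi K2 (~32B active / ~1T total)": (32e9, 1_000e9),
    "Hypothetical denser frontier MoE": (90e9, 1_800e9),
}

for name, (active, total) in models.items():
    print(f"{name}: {active / total:.1%} of parameters active per token")
```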
As models become more integrated into multi-turn agentic workflows, the ability to use more tokens only when needed becomes crucial for cost management. (68:45) Models like GPT-5 may cost more per token but solve complex tasks in fewer turns, making them cheaper overall for certain applications. This shift requires evaluating models on turn efficiency rather than just token efficiency for real-world deployment decisions.
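A quick sketch of that trade-off: per-task cost is turns × tokens per turn × price per token, so a pricier but more turn-efficient model can come out cheaper overall. The model profiles, prices, and turn counts below are hypothetical.

```python
# Hypothetical comparison: an expensive-per-token model that finishes a task in
# fewer turns can be cheaper per task than a cheap-per-token model that needs many.
def cost_per_task(price_per_m_tokens: float, tokens_per_turn: int, turns: int) -> float:
    """Total dollars to complete one task across all agent turns."""
    return turns * tokens_per_turn / 1e6 * price_per_m_tokens


expensive_but_efficient = cost_per_task(price_per_m_tokens=15.0, tokens_per_turn=8_000, turns=4)
cheap_but_turn_hungry = cost_per_task(price_per_m_tokens=2.0, tokens_per_turn=8_000, turns=40)

print(f"High-price, turn-efficient model: ${expensive_but_efficient:.2f} per task")
print(f"Low-price, turn-hungry model:     ${cheap_but_turn_hungry:.2f} per task")
```

Evaluating on per-task cost rather than per-token price is the "turn efficiency" framing the episode argues for.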