a16z Podcast • October 13, 2025

Columbia CS Professor: Why LLMs Can’t Discover New Science

A discussion with Columbia CS Professor Vishal Misra about the limitations of Large Language Models (LLMs) and why they cannot discover fundamentally new science or create entirely new paradigms of knowledge.
Topics: AI & Machine Learning, Tech Policy & Ethics, Developer Culture
People & organizations: Martin Casado, Vishal Misra, Albert Einstein, Claude Shannon, OpenAI

Summary Sections

  • Podcast Summary
  • Speakers
  • Key Takeaways
  • Statistics & Facts
  • Compelling Stories (Premium)
  • Thought-Provoking Quotes (Premium)
  • Strategies & Frameworks (Premium)
  • Similar Strategies (Plus)
  • Additional Context (Premium)
  • Key Takeaways Table (Plus)
  • Critical Analysis (Plus)
  • Products, Tools & Software Mentioned (Plus)

Timestamps are as accurate as possible but may be slightly off. We encourage you to listen to the episode for full context.


Podcast Summary

In this episode of the a16z podcast, Vishal Misra, a Columbia University professor and cricket enthusiast, shares how his attempt to fix a broken cricket stats website accidentally led to one of AI's biggest breakthroughs: retrieval-augmented generation (RAG). (13:58) The conversation with Martin Casado dives deep into Vishal's formal models of what large language models can and cannot do, drawing on concepts from information theory and networking. (02:49) They explore why LLMs may be hitting fundamental limits, how true reasoning differs from computational steps within existing knowledge, and what architectural advances would be needed for artificial general intelligence. (34:29) The discussion offers rare mathematical clarity on LLM limitations while staying optimistic about their transformative potential for productivity.

  • Main Theme: Understanding LLMs through formal mathematical models, exploring their capabilities and fundamental limitations, and discussing what architectural advances would be needed to achieve AGI beyond current transformer-based approaches.

Speakers

Vishal Misra

Professor at Columbia University with a distinguished background in computer networking and information theory. He co-founded CricInfo in the 1990s, which became the world's most popular cricket statistics website and at its peak drew more hits than Yahoo. (11:33) He is also a minority owner of the San Francisco Unicorns cricket team, and he accidentally invented retrieval-augmented generation (RAG) in 2020 while trying to build a natural-language interface to cricket statistics with GPT-3.

Martin Casado

General Partner at Andreessen Horowitz (a16z) with a technical background in networking and systems. He has a strong academic publication record (a high h-index) and brings deep technical expertise to venture capital, focusing on understanding the fundamental capabilities and limits of AI systems through formal models rather than purely empirical approaches.

Key Takeaways

LLMs Operate Through Bayesian Manifolds with Entropy Constraints

LLMs create confidence through what Vishal calls "Bayesian manifolds": reduced-dimensional spaces where they can reason confidently. (04:09) Models perform best when a prompt carries high information content (a rare, specific context) but leads to low prediction entropy (only a few confident next-token choices). For example, "I'm going to dinner with Martin Casado" is information-rich yet constrains the likely restaurant choices to high-end establishments, reducing the model's uncertainty and improving output quality.
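To make "prediction entropy" concrete, here is a minimal sketch (mine, not from the episode; the token distributions and restaurant names are invented) that computes the Shannon entropy of two hypothetical next-token distributions. The constrained dinner prompt yields a peaked, low-entropy distribution, while an unconstrained prompt yields a much flatter, high-entropy one.

```python
import math

def shannon_entropy(dist) -> float:
    """Shannon entropy (in bits) of a next-token probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical next-token distributions (illustrative numbers only).
# Constrained prompt: "I'm going to dinner with Martin Casado at ..."
constrained = {"French": 0.55, "Saison": 0.25, "Atelier": 0.15, "Quince": 0.05}
# Unconstrained prompt: "I'm going to dinner at ..." -- roughly uniform.
unconstrained = {f"option_{i}": 1 / 50 for i in range(50)}

print(f"constrained:   {shannon_entropy(constrained):.2f} bits")   # ~1.60
print(f"unconstrained: {shannon_entropy(unconstrained):.2f} bits") # ~5.64
```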

Chain-of-Thought Works by Reducing Prediction Entropy

The effectiveness of chain-of-thought prompting stems from breaking complex problems into smaller, familiar steps that the model has seen during training. (08:33) Just as humans need to write down multiplication steps rather than guess the answer to 769 × 1,025, LLMs gain confidence by decomposing problems into algorithmic sequences where each step has low prediction entropy, allowing them to traverse familiar reasoning patterns rather than making high-uncertainty leaps.
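As a loose analogy for that decomposition (my sketch, not code from the episode), long multiplication replaces one high-uncertainty guess with a sequence of small, familiar steps, each of which is easy to get right:

```python
def multiply_stepwise(a: int, b: int) -> int:
    """Long multiplication as a chain-of-thought trace: one low-uncertainty
    partial product per digit of b, then a final sum."""
    total = 0
    for place, digit in enumerate(reversed(str(b))):
        partial = a * int(digit) * 10**place
        print(f"{a} x {digit} x 10^{place} = {partial}")
        total += partial
    print(f"sum of partials = {total}")
    return total

multiply_stepwise(769, 1025)  # prints four easy steps, returns 788225
```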

LLMs Cannot Recursively Self-Improve Beyond Their Training Distribution

Using his matrix abstraction model, Vishal demonstrates that LLMs can only operate within the "inductive closure" of their training data. (28:37) While they can fill in missing pieces of knowledge through interpolation and unwrap embedded algorithms, they cannot create fundamentally new paradigms. Any LLM trained on pre-1915 physics could never discover relativity because that would require rejecting core assumptions and creating new conceptual frameworks - something beyond current architectures' capabilities.
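A toy rendering of that matrix abstraction (an illustrative assumption on my part, not Misra's actual formalism): treat the model as a lookup from context to a next-token distribution, with "interpolation" as backing off to contexts it has seen. A context outside the inductive closure of the rows has nothing to interpolate from.

```python
# Toy "matrix": rows are contexts, entries are next-token distributions.
matrix = {
    ("the", "speed", "of"): {"light": 0.9, "sound": 0.1},
    ("laws", "of"):         {"physics": 0.7, "motion": 0.3},
}

def predict(context: tuple):
    if context in matrix:                 # exact row: return it
        return matrix[context]
    for i in range(1, len(context)):      # "interpolate": back off to a
        if context[i:] in matrix:         # shorter suffix already in the matrix
            return matrix[context[i:]]
    return None                           # outside the inductive closure

print(predict(("the", "speed", "of")))        # known row
print(predict(("Newton's", "laws", "of")))    # backs off to ("laws", "of")
print(predict(("curved", "spacetime")))       # None: no row to extend
```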

In-Context Learning is Bayesian Reasoning with Evidence

Vishal's cricket database example reveals that in-context learning works through Bayesian posterior computation using prompt examples as evidence. (25:20) When he showed GPT-3 examples of his custom domain-specific language for cricket queries, the model learned it immediately despite never seeing this language before. The model used the examples as Bayesian evidence to create appropriate probability distributions for new queries, demonstrating that the same underlying mechanism handles both simple continuation and few-shot learning tasks.
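A minimal sketch of examples-as-evidence (my own, with made-up likelihoods): each in-context example multiplies the prior over candidate "tasks" by how well that task explains the example, and the posterior quickly concentrates on the demonstrated query language.

```python
def bayes_update(prior: dict, likelihood: dict) -> dict:
    """Posterior over hypotheses after observing one piece of evidence."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

# Hypotheses about what the prompt wants (all numbers are illustrative).
posterior = {"english_prose": 0.60, "sql": 0.30, "custom_dsl": 0.10}
# Each few-shot example fits the custom DSL far better than the alternatives.
evidence = {"english_prose": 0.01, "sql": 0.05, "custom_dsl": 0.90}

for i in range(3):  # three in-context examples
    posterior = bayes_update(posterior, evidence)
    print(f"after example {i + 1}:",
          {h: round(p, 3) for h, p in posterior.items()})
# custom_dsl dominates after just a few examples (~0.99 by the third)
```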

AGI Requires Creating New Manifolds, Not Just Navigating Existing Ones

True artificial general intelligence will be achieved when systems can create entirely new knowledge domains rather than just becoming better at traversing existing ones. (34:04) Current LLMs excel at connecting known results and following learned patterns, but AGI would require the ability to formulate new axioms, discover fundamental principles, and create paradigm shifts like Einstein's relativity theory. This likely demands architectural breakthroughs beyond simply scaling existing transformer models with more data and compute.

Statistics & Facts

  1. GPT-3's original context window was only 2,048 tokens, making it impossible to fit complex database schemas for natural-language querying, which led to Vishal's accidental invention of RAG using 1,500 example queries for context, as sketched after this list. (14:24)
  2. CricInfo became so popular in the 1990s that it had more website hits than Yahoo at its peak, demonstrating the global appetite for detailed sports statistics and complex data querying interfaces. (11:15)
  3. The theoretical matrix representing all possible LLM prompts and token distributions would have more rows than the number of atoms across all known galaxies, illustrating the massive compression and approximation these models must perform. (22:04)
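The first statistic hints at the mechanism. Here is a crude sketch of the workaround (hypothetical code; real RAG systems use embedding similarity rather than word overlap, and the DSL strings are invented): score the stored example queries against the user's question and pack only the best matches into the small context window.

```python
def build_prompt(query: str, example_bank: list[str],
                 budget_tokens: int = 2048, k: int = 5) -> str:
    """Select the most relevant stored examples and prepend them to the query,
    staying within a small context window (word count as a rough token proxy)."""
    overlap = lambda ex: len(set(query.lower().split()) & set(ex.lower().split()))
    prompt, used = [], len(query.split())
    for ex in sorted(example_bank, key=overlap, reverse=True)[:k]:
        cost = len(ex.split())
        if used + cost > budget_tokens:
            break
        prompt.append(ex)
        used += cost
    return "\n".join(prompt + [query])

bank = [
    "most runs in 1999 -> STATS runs YEAR=1999 SORT=desc",
    "highest score at Eden Gardens -> STATS score GROUND='Eden Gardens'",
    "best bowling in a World Cup -> STATS bowling EVENT='World Cup'",
]
print(build_prompt("most runs in a World Cup", bank))
```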

Compelling Stories

Available with a Premium subscription

Thought-Provoking Quotes

Available with a Premium subscription

Strategies & Frameworks

Available with a Premium subscription

Similar Strategies

Available with a Plus subscription

Additional Context

Available with a Premium subscription

Key Takeaways Table

Available with a Plus subscription

Critical Analysis

Available with a Plus subscription

Products, Tools & Software Mentioned

Available with a Plus subscription
