The MAD Podcast with Matt Turck • January 29, 2026

State of LLMs 2026: RLVR, GRPO, Inference Scaling — Sebastian Raschka

Sebastian Raschka provides an in-depth exploration of the LLM landscape in 2026, highlighting key developments in post-training techniques like RLVR and GRPO, inference scaling, tool use, and the ongoing importance of transformer architectures with incremental improvements.

Summary Sections

  • Podcast Summary
  • Speakers
  • Key Takeaways
  • Statistics & Facts

Timestamps are approximate and may be slightly off. We encourage you to listen to the episode for full context.

Podcast Summary

Sebastian Raschka joins the MAD Podcast for an in-depth exploration of what actually changed in LLMs in 2025 and what matters heading into 2026. (00:00) The conversation starts with examining whether transformers remain the winning architecture, covering world models, small "recursive" reasoning models, and text diffusion approaches. (01:05)

The real story of 2025 emerges as post-training and reasoning, with Sebastian breaking down RLVR (reinforcement learning with verifiable rewards) and GRPO (group relative policy optimization), explaining why they pair well together and why they make scaling cheaper than classic RLHF. (18:03) He discusses how these techniques "unlock" reasoning already latent in base models and why "benchmaxxing" is warping evaluation.

  • Main themes: The shift from architectural innovation to post-training optimization, with inference-time scaling and tool use as underappreciated drivers of progress, alongside the emergence of private data as the new competitive moat

Speakers

Sebastian Raschka

Sebastian Raschka is an AI researcher and one of the best educators in the field, well known for his in-depth technical blog posts and his book "Build a Large Language Model from Scratch." He has a computational biology background and maintains a popular technical blog and Substack where he breaks down complex AI concepts into understandable content. His work focuses on making cutting-edge AI research accessible through both theoretical explanations and practical code implementations.

Matt Turck

Matt Turck is the Managing Director at FirstMark, a venture capital firm focused on early-stage technology investments. He hosts the MAD Podcast and maintains a popular blog covering data, AI, and technology trends. He is known for his annual MAD (Machine Learning, Artificial Intelligence & Data) Landscape, which maps the ecosystem of data and AI companies.

Key Takeaways

Pre-training Is Boring, Post-training Is Where the Action Is

Sebastian emphasizes that while pre-training isn't dead, it's no longer where the low-hanging fruit exists for LLM improvement. (17:37) The real advances in 2025 came from post-training techniques, particularly RLVR (Reinforcement Learning with Verifiable Rewards) and GRPO. As he puts it, "Pre-training is not dead, but pre-training is boring. It's not where the low-hanging fruit is anymore." This represents a fundamental shift in how companies should allocate their AI development budgets, moving resources from massive pre-training efforts to sophisticated post-training optimization. The practical implication is that organizations can achieve better results by refining existing models rather than training larger ones from scratch.

RLVR Unlocks Reasoning Already Present in Base Models

One of the most significant discoveries is that reasoning capabilities already exist in base models but need to be unlocked through proper post-training techniques. Sebastian demonstrates this with a compelling example: taking the Qwen3 model and training it for just 50 RLVR steps increased accuracy on math problems from 15% to 50%. (24:07) This suggests that base models contain latent reasoning abilities that traditional training methods fail to activate. The technique works by using verifiable rewards (like correct math answers) rather than human preferences, making it both cheaper and more scalable than traditional RLHF approaches.
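The episode stays conceptual, but the RLVR-plus-GRPO recipe described here can be sketched in a few lines. The snippet below is a hypothetical illustration, not any lab's actual training code: `verifiable_reward` scores a completion by exact match against a known math answer (the "verifiable reward" that replaces a human preference model), and `grpo_advantages` computes GRPO's group-relative advantage by normalizing each sampled completion's reward against its own group, which is what lets GRPO drop the learned value network that PPO-style RLHF needs. Function names and the toy completions are invented for the example.

```python
import re
import statistics

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """RLVR-style reward: 1.0 if the last number in the completion
    matches the known answer, else 0.0. No preference model needed."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and numbers[-1] == ground_truth else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO's group-relative advantage: normalize each sampled
    completion's reward by the mean and std of its own group,
    removing the need for a separate learned value (critic) network."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]  # all equal: no learning signal
    return [(r - mean) / std for r in rewards]

# Example: 4 completions sampled for one math prompt; 2 are correct.
rewards = [verifiable_reward(c, "42") for c in [
    "The answer is 42", "I think it's 41", "So we get 42", "Probably 40",
]]
advs = grpo_advantages(rewards)  # correct answers get positive advantage
```

In a real RLVR loop these advantages would weight the policy-gradient update for each completion's tokens; here they just show how correct answers in a group come out positive and incorrect ones negative.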

Benchmarking Is Becoming Less Reliable Due to "Benchmaxxing"

Sebastian introduces the concept of "benchmaxxing," where models are optimized specifically for benchmark performance rather than real-world utility. (38:41) He explains that leaderboards reward style over correctness because human evaluators can't always verify technical accuracy. This creates models that look impressive on paper but may not perform better in practice. His personal solution is telling: "Honestly, personally, I stopped looking at the benchmark numbers. I just use the model for a few days and see if it's better or not." Organizations should prioritize real-world testing over benchmark scores when evaluating AI systems.

Inference-Time Scaling Is an Underappreciated Driver of Progress

While much attention focuses on training larger models, Sebastian highlights inference-time scaling as equally important for improving performance. (43:10) This includes reasoning models that generate more tokens, parallel sampling with majority voting, and self-refinement techniques. The key insight is that you can improve model performance by spending more compute during usage rather than just during training. This approach offers more flexibility since you can adjust the compute investment based on the complexity of each specific task, making it both more cost-effective and practically useful than simply scaling model size.
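One of the inference-time techniques mentioned, parallel sampling with majority voting (often called self-consistency), fits in a few lines. This is an illustrative sketch, assuming a model has already produced the `sampled` answers at nonzero temperature; `majority_vote` and the toy answers are invented for the example.

```python
from collections import Counter

def majority_vote(samples: list[str]) -> str:
    """Return the most common final answer across N sampled
    completions. More samples means more inference-time compute
    and, typically, higher accuracy, with no weight updates."""
    return Counter(samples).most_common(1)[0][0]

# Stand-in for answers a model would generate for one math prompt.
sampled = ["108", "112", "108", "108", "96"]
final = majority_vote(sampled)  # "108" wins 3 of 5 votes
```

The same budget knob works per query: easy prompts can get one sample and hard ones many, which is the flexibility the takeaway above describes.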

Private Data Will Become the Primary Competitive Moat

As public LLMs converge in capability, Sebastian identifies private data as the next competitive battleground. (49:57) He observes that large companies are starting to train LLMs in-house rather than sharing their proprietary data with external providers. Companies with decades of industry-specific data - whether in finance, healthcare, or manufacturing - have treasure troves that could significantly improve model performance in their domains. The practical implication is that organizations should focus on collecting, organizing, and leveraging their unique datasets rather than relying solely on general-purpose models.

Statistics & Facts

  1. DeepSeek R1's training cost approximately $300,000, compared with DeepSeek-V3's $5 million price tag, making post-training techniques more than 10 times cheaper than pre-training. (31:24) Sebastian notes this dramatic cost difference when discussing the scalability of RLVR techniques.
  2. A simple 50-step RLVR training run increased math accuracy from 15% to 50% on the Qwen3 model, demonstrating a more than three-fold improvement with minimal additional training. (24:07) This statistic illustrates how post-training can unlock existing capabilities in base models.
  3. Tool calling can improve model performance by approximately 1.2 times according to GPT-OSS benchmarks, showing measurable gains from allowing models to use external tools. (49:34) Sebastian cites this as evidence that inference-time enhancements provide concrete performance improvements.
