
Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full context.
This episode dives deep into the world of AI inference with Simon Mo and Woosuk Kwon, the creators of vLLM and co-founders of Inferact. (03:28) The conversation explores how vLLM began as a simple optimization project for running Meta's OPT model in 2022 and evolved into one of the fastest-growing open source AI infrastructure projects. (22:00) The discussion identifies three major trends making AI inference increasingly complex: scale (models reaching trillions of parameters), diversity (varied model architectures and hardware), and agents (which require persistent state management). The founders explain their vision for building a "universal inference layer" that can run any model on any chip efficiently, while maintaining their commitment to open source development. (35:00)
Simon Mo is co-founder of Inferact and a key contributor to the vLLM open source project. He previously worked on the Ray project at Anyscale, bringing extensive experience in managing large-scale open source communities and distributed systems. Simon has been instrumental in building vLLM's performance benchmarking systems and in managing the project's rapid growth to over 2,000 contributors.
Woosuk Kwon is co-founder of Inferact and the original creator of vLLM during his PhD at UC Berkeley. He pioneered the PagedAttention algorithm that became foundational to modern LLM inference optimization. His research began in 2022 with optimizing serving for Meta's OPT model, which evolved into the widely-adopted vLLM project that now powers inference on 400-500k GPUs globally.
Matt Bornstein is a General Partner at Andreessen Horowitz, where he focuses on AI infrastructure investments. He was instrumental in providing early grant funding to the vLLM project and later led a16z's investment in Inferact. Matt brings deep expertise in enterprise software and AI infrastructure to help guide portfolio companies through critical scaling challenges.
The challenge of running AI models is becoming exponentially more difficult due to three key factors: scale, diversity, and agents. (22:52) Unlike traditional ML workloads with predictable, static inputs, LLMs handle dynamic requests ranging from a single word to entire documents, with non-deterministic output lengths. As Woosuk explains, models are reaching trillion-parameter scale while their architectures diverge significantly - some use sparse attention, others explore linear attention mechanisms. This creates an n-by-m problem in which every model must run efficiently on every type of hardware, requiring sophisticated scheduling and memory management techniques that didn't exist just a few years ago.
The success of vLLM demonstrates how open source projects can outpace proprietary solutions through community-driven development. (13:14) With over 2,000 contributors and 50+ full-time developers, vLLM benefits from diverse perspectives spanning model providers (who want their models to run efficiently), hardware vendors (who need compatibility), and infrastructure companies (who require reliable serving). Simon notes that many organizations tell them "we just cannot keep up with vLLM" despite having internal teams, highlighting how open source velocity can exceed what any single entity can achieve independently.
The invention of PagedAttention solved one of the most fundamental problems in LLM inference: efficiently managing the key-value (KV) cache that stores the attention state for a conversation's context. (29:06) Traditional approaches pre-allocated contiguous memory for the maximum possible sequence length, leading to massive waste. PagedAttention instead allocates fixed-size memory blocks on demand, much as an operating system manages virtual memory with pages. This breakthrough enables much higher throughput and better GPU utilization, which becomes especially crucial as conversations grow longer and more complex with agent-based applications.
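The paging idea can be sketched in a few lines. The following is a minimal illustration of the concept, not vLLM's actual implementation: memory is split into fixed-size blocks, and each sequence maps logical token positions to physical blocks through a block table, so nothing is reserved up front for a sequence's worst-case length. The class and method names here are hypothetical.

```python
BLOCK_SIZE = 16  # tokens per physical block (illustrative value)

class PagedKVCache:
    """Toy block allocator in the spirit of PagedAttention (not vLLM's API)."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id: int, position: int) -> int:
        """Ensure a physical block backs this token position; return its id."""
        table = self.block_tables.setdefault(seq_id, [])
        logical_block = position // BLOCK_SIZE
        if logical_block == len(table):  # last block is full: allocate one more
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a sequence must be preempted")
            table.append(self.free_blocks.pop())
        return table[logical_block]

    def free_sequence(self, seq_id: int):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=4)
for pos in range(20):            # a 20-token sequence needs only 2 blocks
    cache.append_token(seq_id=0, position=pos)
print(len(cache.block_tables[0]))  # 2 blocks used, not a worst-case preallocation
cache.free_sequence(0)
print(len(cache.free_blocks))      # all 4 blocks reclaimed for other sequences
```

Because blocks are uniform and shared across all sequences, fragmentation stays bounded and freed memory is immediately reusable by any other request - the same property that lets an OS page allocator serve many processes from one physical pool.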
The shift from single-turn interactions to multi-agent workflows is fundamentally changing inference requirements. (30:02) Agents can pause for seconds, minutes, or hours while interacting with external tools, making it impossible to predict when cached state can be safely evicted. Unlike traditional human-to-AI interactions with relatively predictable patterns, agent workflows involve external environment interactions with highly variable timing. This uncertainty forces inference systems to become much smarter about state management, creating challenges that didn't exist in the simpler text-in, text-out paradigm.
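One way to reason about this trade-off is a recency-based eviction policy. The sketch below is a hedged illustration (the class and method names are hypothetical, not a vLLM API): since an agent's resume time is unknowable, per-session KV state is kept keyed by last access, and the least-recently-used session is evicted only under memory pressure - accepting a costly prefill recomputation if that agent later returns.

```python
from collections import OrderedDict

class SessionCache:
    """Toy LRU policy for agent-session KV state (illustrative only)."""

    def __init__(self, capacity: int):
        self.capacity = capacity       # max sessions held in accelerator memory
        self.sessions = OrderedDict()  # session_id -> cached KV state

    def touch(self, session_id: str, kv_state=None):
        """Mark a session active; evict the stalest sessions if over capacity."""
        if session_id in self.sessions:
            self.sessions.move_to_end(session_id)  # now most recently used
        else:
            self.sessions[session_id] = kv_state
        evicted = []
        while len(self.sessions) > self.capacity:
            stale_id, _ = self.sessions.popitem(last=False)  # oldest access
            evicted.append(stale_id)  # this agent must re-prefill on return
        return evicted

cache = SessionCache(capacity=2)
cache.touch("agent-a")
cache.touch("agent-b")
cache.touch("agent-a")           # agent-a resumes after a tool call
print(cache.touch("agent-c"))    # agent-b, the stalest session, is evicted
```

The policy is deliberately simple: without knowing whether an agent will return in two seconds or two hours, recency of access is one of the few signals the inference system can act on, which is why smarter state management is such an open problem for agent workloads.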
Success in AI applications increasingly depends on tailoring the entire stack - from model architecture to hardware - for specific use cases, but this creates a need for universal infrastructure layers. (39:59) Simon argues that just as operating systems abstract CPUs and databases abstract storage, inference engines must abstract accelerated computing devices for AI models. Companies building vertical solutions need this horizontal abstraction layer to achieve optimal performance across diverse hardware and model combinations, making inference infrastructure as foundational as traditional systems software.