
Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full context.
This episode dives deep into the world of AI inference with Simon Mo and Woosuk Kwon, the creators of vLLM and co-founders of Inferact. (03:28) The conversation explores how vLLM began as a simple optimization project for running Meta's OPT model in 2022 and evolved into one of the fastest-growing open source AI infrastructure projects. (22:00) The discussion identifies three major trends making AI inference increasingly complex: scale (models reaching trillions of parameters), diversity (varied model architectures and hardware), and agents (which require persistent state management). The founders explain their vision for building a "universal inference layer" that can run any model on any chip efficiently, while maintaining their commitment to open source development. (35:00)
Simon Mo is co-founder of Inferact and a key contributor to the vLLM open source project. He previously worked on the Ray project at Anyscale, bringing extensive experience in managing large-scale open source communities and distributed systems. Simon has been instrumental in building vLLM's performance benchmarking systems and in managing the project's rapid growth to over 2,000 contributors.
Woosuk Kwon is co-founder of Inferact and the original creator of vLLM during his PhD at UC Berkeley. He pioneered the PagedAttention algorithm that became foundational to modern LLM inference optimization. His research began in 2022 with optimizing serving for Meta's OPT model, which evolved into the widely-adopted vLLM project that now powers inference on 400-500k GPUs globally.
Matt Bornstein is a General Partner at Andreessen Horowitz, where he focuses on AI infrastructure investments. He was instrumental in providing early grant funding to the vLLM project and later led a16z's investment in Inferact. Matt brings deep expertise in enterprise software and AI infrastructure to help guide portfolio companies through critical scaling challenges.
The challenge of running AI models is becoming exponentially more difficult due to three key factors: scale, diversity, and agents. (22:52) Unlike traditional ML workloads with predictable, static inputs, LLMs handle dynamic requests ranging from a single word to entire documents, with non-deterministic output lengths. As Woosuk explains, models are reaching trillion-parameter scale while their architectures diverge significantly - some use sparse attention, others explore linear attention mechanisms. This creates an n-by-m problem in which every model must run efficiently on every type of hardware, requiring sophisticated scheduling and memory management techniques that didn't exist just a few years ago.
The success of vLLM demonstrates how open source projects can outpace proprietary solutions through community-driven development. (13:14) With over 2,000 contributors and 50+ full-time developers, vLLM benefits from diverse perspectives spanning model providers (who want their models to run efficiently), hardware vendors (who need compatibility), and infrastructure companies (who require reliable serving). Simon notes that many organizations tell them "we just cannot keep up with vLLM" despite having internal teams, highlighting how open source velocity can exceed what any single entity can achieve independently.
The invention of PagedAttention solved one of the most fundamental problems in LLM inference: efficiently managing the key-value (KV) cache that stores the attention state for a conversation's context. (29:06) Traditional approaches pre-allocated contiguous memory for the maximum possible sequence length, leading to massive waste. PagedAttention instead allocates fixed-size memory blocks on demand, much as an operating system manages virtual memory with pages. This breakthrough enables much higher throughput and better GPU utilization, which becomes especially crucial as conversations grow longer and more complex with agent-based applications.
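The paging idea can be sketched in a few lines. The following is a minimal illustration of the concept, not vLLM's actual implementation: memory is split into fixed-size blocks, and each sequence maps logical token positions to physical blocks through a block table, so nothing is reserved up front for a sequence's worst-case length. The class and method names here are hypothetical.

```python
BLOCK_SIZE = 16  # tokens per physical block (illustrative value)

class PagedKVCache:
    """Toy block allocator in the spirit of PagedAttention (not vLLM's API)."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id: int, position: int) -> int:
        """Ensure a physical block backs this token position; return its id."""
        table = self.block_tables.setdefault(seq_id, [])
        logical_block = position // BLOCK_SIZE
        if logical_block == len(table):  # last block is full: allocate one more
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a sequence must be preempted")
            table.append(self.free_blocks.pop())
        return table[logical_block]

    def free_sequence(self, seq_id: int):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=4)
for pos in range(20):            # a 20-token sequence needs only 2 blocks
    cache.append_token(seq_id=0, position=pos)
print(len(cache.block_tables[0]))  # 2 blocks used, not a worst-case preallocation
cache.free_sequence(0)
print(len(cache.free_blocks))      # all 4 blocks reclaimed for other sequences
```

Because blocks are uniform and shared across all sequences, fragmentation stays bounded and freed memory is immediately reusable by any other request - the same property that lets an OS page allocator serve many processes from one physical pool.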
The shift from single-turn interactions to multi-agent workflows is fundamentally changing inference requirements. (30:02) Agents can pause for seconds, minutes, or hours while interacting with external tools, making it impossible to predict when cached state can be safely evicted. Unlike traditional human-to-AI interactions with relatively predictable patterns, agent workflows involve external environment interactions with highly variable timing. This uncertainty forces inference systems to become much smarter about state management, creating challenges that didn't exist in the simpler text-in, text-out paradigm.
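One way to reason about this trade-off is a recency-based eviction policy. The sketch below is a hedged illustration (the class and method names are hypothetical, not a vLLM API): since an agent's resume time is unknowable, per-session KV state is kept keyed by last access, and the least-recently-used session is evicted only under memory pressure - accepting a costly prefill recomputation if that agent later returns.

```python
from collections import OrderedDict

class SessionCache:
    """Toy LRU policy for agent-session KV state (illustrative only)."""

    def __init__(self, capacity: int):
        self.capacity = capacity       # max sessions held in accelerator memory
        self.sessions = OrderedDict()  # session_id -> cached KV state

    def touch(self, session_id: str, kv_state=None):
        """Mark a session active; evict the stalest sessions if over capacity."""
        if session_id in self.sessions:
            self.sessions.move_to_end(session_id)  # now most recently used
        else:
            self.sessions[session_id] = kv_state
        evicted = []
        while len(self.sessions) > self.capacity:
            stale_id, _ = self.sessions.popitem(last=False)  # oldest access
            evicted.append(stale_id)  # this agent must re-prefill on return
        return evicted

cache = SessionCache(capacity=2)
cache.touch("agent-a")
cache.touch("agent-b")
cache.touch("agent-a")           # agent-a resumes after a tool call
print(cache.touch("agent-c"))    # agent-b, the stalest session, is evicted
```

The policy is deliberately simple: without knowing whether an agent will return in two seconds or two hours, recency of access is one of the few signals the inference system can act on, which is why smarter state management is such an open problem for agent workloads.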
Success in AI applications increasingly depends on tailoring the entire stack - from model architecture to hardware - for specific use cases, but this creates a need for universal infrastructure layers. (39:59) Simon argues that just as operating systems abstract CPUs and databases abstract storage, inference engines must abstract accelerated computing devices for AI models. Companies building vertical solutions need this horizontal abstraction layer to achieve optimal performance across diverse hardware and model combinations, making inference infrastructure as foundational as traditional systems software.