
Timestamps are approximate and may be slightly off. We encourage you to listen to the episode for full context.
In this episode, Alessio and Swyx sit down with Gorkem and Batuhan from FAL to explore how their company became the dominant force in generative media infrastructure. They reveal how their early insight about Stable Diffusion's terrible GPU utilization (04:27) led to a crucial pivot from a general Python runtime to specialized inference, eventually attracting 2 million developers and scaling past $100 million in annual revenue. The conversation covers their aggressive optimization strategy built on custom CUDA kernels, the explosive growth of video models to 50% of their revenue, and their partnerships with major AI labs such as Google DeepMind for video models like Veo-3 (06:06), offering both technical leaders and ambitious builders a blueprint for riding the generative AI wave while staying ahead of commoditization.
Co-founder of FAL (formerly Features and Labels), a generative media platform serving 2 million developers and powering over $100M in annual revenue. Former Amazon engineer who pivoted from building feature stores to becoming a leader in AI inference optimization, specializing in custom CUDA kernels and GPU performance engineering for diffusion models.
Head of Engineering at FAL and a core developer of the Python programming language (CPython). Joined FAL in 2021, before its seed round, bringing deep expertise in developer tools and Python runtime optimization. Now leads the engineering effort behind custom kernels, distributed systems, and the infrastructure serving thousands of H100 GPUs across 24 data centers.
Founder of Kernel Labs and co-host of the Latent Space podcast. Previously worked in venture capital and has extensive experience in AI infrastructure and developer tools.
Founder of Smol AI and co-host of the Latent Space podcast. Former Temporal and Netlify engineer, known for his insights on AI engineering and developer experience.
When faced with the choice between competing against Google, OpenAI, and Anthropic in language models and becoming a leader in a fast-growing niche like generative media, choose the niche. (09:08) Language model hosting is a losing battle when you're ultimately competing against Google, which can offer search-integrated models for free because search is existential to its business. Generative media was a net-new market with no incumbent whose core business model it threatened.
Write custom kernels and optimize inference engines when the open source community hasn't caught up yet. (10:20) At Stable Diffusion 1.5's launch, basic convolutions on A100s achieved only 30% GPU utilization because nobody had bothered to optimize them. That created a 10x performance gap that translated directly into a revenue advantage. The key insight: new architectures and chips constantly open optimization gaps that justify a dedicated performance team.
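To make that utilization number concrete, here is a minimal sketch (our illustration, not FAL's tooling; assumes PyTorch on a CUDA GPU) that times a Stable Diffusion-sized convolution and compares achieved throughput against the A100's dense FP16 peak:

```python
import torch

def conv_utilization(batch=2, channels=320, hw=64, k=3, iters=50):
    # A conv layer roughly shaped like one in the SD 1.5 UNet.
    x = torch.randn(batch, channels, hw, hw, device="cuda", dtype=torch.float16)
    conv = torch.nn.Conv2d(channels, channels, k, padding=1).cuda().half()

    # FLOPs per forward pass: 2 * output_elements * (in_channels * k * k).
    out_elems = batch * channels * hw * hw
    flops = 2 * out_elems * channels * k * k

    for _ in range(10):  # warm-up so timing excludes one-off setup costs
        conv(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        conv(x)
    end.record()
    torch.cuda.synchronize()
    seconds = start.elapsed_time(end) / 1000 / iters  # elapsed_time is in ms

    achieved = flops / seconds / 1e12
    peak = 312  # A100 dense FP16 tensor-core peak in TFLOPS; adjust per GPU
    print(f"{achieved:.1f} TFLOPS, {achieved / peak:.0%} of peak")

conv_utilization()
```

Numbers far below peak on a stock build are exactly the kind of gap the custom-kernel work closed.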
Every 10% improvement in generation speed correlates directly with user engagement and monetization, mirroring Amazon's page-load-time studies. (16:38) One customer's A/B test showed that artificially slowing down FAL's image generation significantly reduced user engagement and image creation volume. In generative media, users can't stream responses the way they can with LLMs; they need the full artifact, which makes latency optimization critical for retention.
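The slowdown experiment is simple to picture in code. Below is a hypothetical sketch of such an A/B test; `generate_image` and `log_event` are placeholder names for the inference call and analytics pipeline, not real fal APIs:

```python
import hashlib
import time

def bucket(user_id: str) -> str:
    # Deterministic 50/50 split so a user always lands in the same arm.
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return "slow" if digest % 2 == 0 else "control"

def handle_request(user_id: str, prompt: str):
    arm = bucket(user_id)
    if arm == "slow":
        time.sleep(1.0)  # artificial delay layered on top of real inference
    image = generate_image(prompt)              # placeholder inference call
    log_event(user_id, arm, "image_generated")  # feeds the engagement metric
    return image
```

Comparing images generated per user across the two arms is what surfaced the engagement drop described above.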
Recruit directly from the community Discord servers and Twitter Spaces where people are already building obsessively with your technology. (58:50) FAL's best applied ML engineers came from people already training LoRAs on the platform, posting viral Hugging Face Spaces, and creating innovative workflows. These builders have proven product intuition and domain expertise that traditional hiring pipelines miss entirely.
When you dominate inference for one model type, expand into related technical challenges rather than building a general-purpose platform. (22:30) FAL evolved from optimizing Stable Diffusion into building distributed file systems, container runtimes, CDNs, and content moderation, all specialized for generative media workloads across 24 data centers and 6 cloud providers. Each adjacent problem leveraged their core competency while avoiding the generic-platform trap.