
Timestamps are as accurate as they can be but may be slightly off. We encourage you to listen to the full context.
In this exclusive interview, we sat down with Pim de Witte, CEO and co-founder of General Intuition (GI), a groundbreaking AI startup that spun out of gaming platform Medal. GI has secured a $134 million seed round from Khosla Ventures—Vinod Khosla's largest single seed bet since OpenAI—to develop world models using Medal's unprecedented dataset of 3.8 billion action-labeled game clips. The conversation reveals how Medal's retroactive clipping technology has created one of the world's most valuable training datasets for spatial-temporal AI agents. (00:00)
CEO and co-founder of General Intuition and founder of Medal, a gaming platform with 12 million users and 3.8 billion action-labeled video clips. Previously built the largest RuneScape private server and worked with Doctors Without Borders on satellite-based map generation for disaster response. Self-taught engineer who recently completed intensive AI coursework to master the fundamentals of deep learning and world models.
Medal's decision to capture actions rather than raw keystrokes was initially driven by privacy concerns, but the approach created one of the world's most valuable AI training datasets. (17:58) Instead of logging specific keys like W, A, S, D, Medal converts inputs to semantic actions (jump, walk left, aim up), which preserves user privacy while providing clean training signals. Building this mapping required thousands of human labelers working over 18 months to cover every possible action across different games, producing ground-truth action labels for 3.8 billion clips. The result is a dataset that captures the perception-action loop (perceive, act, state update, repeat) fundamental to training intelligent agents, without compromising user privacy.
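To make the abstraction concrete, here is a minimal Python sketch of the kind of key-to-semantic-action mapping described above. The binding table, game name, and function names are hypothetical illustrations, not Medal's actual pipeline.

```python
# Hypothetical sketch of privacy-preserving action abstraction:
# raw device inputs are mapped to per-game semantic actions, so the
# stored labels never contain the player's literal keystrokes.

from dataclasses import dataclass

# Per-game binding table (illustrative only). Each game gets its own
# mapping from raw inputs to semantic actions, built by human labelers.
GAME_BINDINGS = {
    "example_shooter": {
        "KEY_W": "move_forward",
        "KEY_A": "move_left",
        "KEY_S": "move_backward",
        "KEY_D": "move_right",
        "KEY_SPACE": "jump",
        "MOUSE_LEFT": "fire",
    }
}

@dataclass
class ActionEvent:
    timestamp_ms: int
    action: str  # semantic action, e.g. "jump"

def to_semantic_actions(game_id, raw_events):
    """Convert raw input events to semantic action labels.

    Unmapped inputs (chat keys, menus, etc.) are dropped entirely,
    so nothing identifying is retained.
    """
    bindings = GAME_BINDINGS.get(game_id, {})
    labeled = []
    for ts, raw_key in raw_events:
        action = bindings.get(raw_key)
        if action is not None:
            labeled.append(ActionEvent(timestamp_ms=ts, action=action))
    return labeled

# Example: raw keystrokes in, semantic actions out.
events = [(0, "KEY_W"), (120, "KEY_SPACE"), (450, "KEY_T")]  # KEY_T (chat) is dropped
print(to_semantic_actions("example_shooter", events))
```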
Medal's core innovation is retroactive video recording—the system continuously records gameplay in memory, and players hit a button to save the last 30 seconds only after something interesting happens. (21:21) This approach is similar to Tesla's FSD bug reporting system and creates a natural selection bias toward exceptional moments. Unlike traditional recording where you must remember to start and stop, retroactive clipping captures authentic peak performance without changing player behavior. The baseline of Medal's dataset is peak human performance because players only clip their best moments, creating training data that represents the upper bounds of human capability rather than average gameplay.
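The retroactive mechanism is essentially a rolling in-memory buffer. Below is a minimal sketch assuming a fixed frame rate; the frame source and the `save_clip` hook are stand-ins for illustration, not Medal's capture pipeline.

```python
# Minimal sketch of retroactive clipping: keep the last N seconds of
# frames in a bounded in-memory buffer; when the player presses the
# clip button, flush the buffer. The encoder/upload step is omitted.

from collections import deque

FPS = 60
CLIP_SECONDS = 30

class RetroactiveClipper:
    def __init__(self, fps=FPS, clip_seconds=CLIP_SECONDS):
        # deque with maxlen silently evicts the oldest frame,
        # so memory use stays bounded while recording continuously.
        self.buffer = deque(maxlen=fps * clip_seconds)

    def on_frame(self, frame):
        """Called once per captured frame, continuously."""
        self.buffer.append(frame)

    def on_clip_button(self):
        """Player decided *after the fact* that the moment was worth keeping."""
        clip = list(self.buffer)  # snapshot of the last 30 seconds
        self.save_clip(clip)

    def save_clip(self, frames):
        # Placeholder: a real system would encode and upload the frames.
        print(f"Saved clip with {len(frames)} frames")

# Usage: feed frames continuously, clip only after something happens.
clipper = RetroactiveClipper()
for i in range(FPS * 45):          # 45 seconds of gameplay
    clipper.on_frame(f"frame_{i}")
clipper.on_clip_button()           # keeps only the last 30 seconds
```

The key design property is that nothing is written until the player decides a moment mattered, which is what biases the saved data toward peak performance.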
True world models go beyond video generation—they must understand physics, maintain spatial memory, and handle partial observability like smoke or camera shake. (08:43) GI's world models demonstrate sophisticated capabilities: maintaining position through smoke clouds, handling rapid camera movements with mouse sensitivity, and even inheriting real-world physics like camera shake during explosions (which doesn't occur in the actual game). The models use 4-second memory windows and can unstick themselves from spatial errors, showing genuine spatial-temporal reasoning rather than simple pattern matching.
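As a rough illustration of acting from a bounded memory window under partial observability, the sketch below keeps only the most recent few seconds of observations when choosing an action. The agent and policy interfaces are assumptions for the example, not GI's architecture.

```python
# Illustrative perception-action loop with a fixed-length observation
# memory (roughly analogous to the 4-second window mentioned above).

from collections import deque

FPS = 30
MEMORY_SECONDS = 4

class WindowedAgent:
    def __init__(self, policy, fps=FPS, memory_seconds=MEMORY_SECONDS):
        self.policy = policy
        self.memory = deque(maxlen=fps * memory_seconds)

    def step(self, observation):
        # Perceive: append the newest frame; anything older than the
        # memory window is evicted automatically.
        self.memory.append(observation)
        # Act: the policy sees only the recent context, so it must
        # infer hidden state (e.g. position inside smoke) from it.
        return self.policy(list(self.memory))

def persistence_policy(context):
    # If the current frame is occluded (smoke), act on the most recent
    # clear observation in memory instead of acting blindly.
    for obs in reversed(context):
        if obs != "smoke":
            return f"continue_toward({obs})"
    return "hold_position"

agent = WindowedAgent(persistence_policy)
for obs in ["doorway", "doorway", "smoke", "smoke"]:
    print(agent.step(obs))
```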
Video games offer advantages over YouTube videos for training spatial intelligence because they eliminate multiple layers of information loss. (13:09) With real-world videos, you must solve pose estimation, then inverse dynamics, then account for optical dynamics of eye movement—three levels of information loss. In games, players directly control the camera with their hands, simulating optical dynamics perfectly. Games also provide diverse environments (tens of thousands on PC vs. hundreds in VR) and represent every type of spatial reasoning task from navigation to tool use across different simulated worlds.
GI successfully demonstrated transfer from arcade-style games to realistic games to real-world video using the same perception-action architecture. (05:27) Their models can label any video on the internet by predicting what actions a human would take if controlling that scenario with keyboard and mouse. This transfer capability suggests that spatial intelligence learned in simulated environments can generalize to physical reality, making games a viable foundation for training general intelligence agents that could eventually control robots or navigate the real world.
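One way to picture "labeling any video by predicting actions" is an inverse-dynamics-style pass over consecutive frames, sketched below. `load_frames` and `ActionPredictor` are hypothetical placeholders, not GI's actual interface.

```python
# Hedged sketch of labeling arbitrary video with keyboard-and-mouse
# actions: slide over consecutive frame pairs and ask a trained model
# which control input would explain each transition.

def load_frames(video_path):
    """Placeholder frame loader; a real pipeline would decode the video."""
    return [f"frame_{i}" for i in range(5)]

class ActionPredictor:
    """Stand-in for a trained inverse-dynamics model."""
    def predict_action(self, prev_frame, next_frame):
        # A real model would infer the action (e.g. "aim_up",
        # "move_left") that best explains the change between frames.
        return "move_forward"

def label_video(video_path, model):
    frames = load_frames(video_path)
    labels = []
    for prev_frame, next_frame in zip(frames, frames[1:]):
        labels.append(model.predict_action(prev_frame, next_frame))
    return labels

print(label_video("example.mp4", ActionPredictor()))
```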