Timestamps are as accurate as possible but may be slightly off; we encourage you to listen to the episode for full context.
Pim de Witte, founder and CEO of General Intuition (GI), discusses how his company spun out from Medal, a gaming clip platform with 12 million users, to build world models trained on 3.8 billion action-labeled gaming highlights. The episode explores GI's vision-based AI agents that can play games at human or superhuman levels by seeing only pixels and predicting actions, their world models that understand physics and spatial reasoning, and how they're transferring these capabilities from games to real-world applications. (29:00) The conversation covers Pim's journey from RuneScape private servers to raising a $134 million seed round from Khosla Ventures, the technical foundations of their approach, and their ambitious 2030 vision of powering 80% of AI-driven atoms-to-atoms interactions through spatial-temporal foundation models.
Pim is the founder and CEO of General Intuition, which spun out from Medal, the gaming clip platform he built into a 12-million-user social network with 3.8 billion highlight clips. Previously, he ran one of the largest private RuneScape servers and worked at DigitalGlobe on satellite-based map generation for disaster response. He's a self-taught engineer who took extensive coursework in deep learning fundamentals to prepare for launching GI.
Medal's unique dataset of 3.8 billion action-labeled gaming clips offers advantages over traditional video data for training spatial reasoning models. (13:33) Gaming environments capture the full perception-action loop, in which humans perceive, act, receive a state update, and perceive again, which is precisely what's needed to train agents. Unlike YouTube videos, which require pose estimation and inverse dynamics modeling to recover actions, gaming data provides direct action labels, with the dynamics already played out through players' actual hand movements, creating a cleaner training signal for spatial-temporal reasoning.
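To make the contrast concrete, here is a minimal sketch of what an action-labeled clip record could look like next to a video-only record. All class and field names are illustrative assumptions for this summary, not Medal's or GI's actual schema.

```python
from dataclasses import dataclass

@dataclass
class Step:
    frame: bytes       # encoded game frame: the pixels the player actually saw
    action: str        # semantic action label recorded alongside it, e.g. "move_forward"
    timestamp_ms: int  # when in the clip the action occurred

@dataclass
class Clip:
    clip_id: str
    game: str
    steps: list[Step]  # the full perception-action loop: see, act, see again

# A video-only corpus (e.g. scraped YouTube footage) would carry just the
# frames; the `action` field would have to be inferred afterwards with pose
# estimation or an inverse dynamics model, adding noise to the training signal.
```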
GI deliberately chose not to log specific keystrokes (such as WASD) for privacy reasons, instead converting inputs to semantic actions through thousands of human annotators. (18:17) This approach protects individual privacy while preserving training utility: models learn from semantic actions and convert them back to computer inputs at inference time. The privacy-preserving design became a competitive advantage, letting GI build ethical AI systems without sacrificing the data quality needed for world model training.
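As a rough illustration of that round trip, the sketch below collapses raw key presses into semantic labels at annotation time and maps model outputs back to concrete inputs at inference time. The label names and key bindings are invented for the example; GI's actual action taxonomy is not public.

```python
# Annotation time: raw inputs are collapsed into semantic actions, so
# individual keystroke logs never need to be stored. (Hypothetical labels.)
SEMANTIC_LABELS = {
    "w": "move_forward",
    "a": "strafe_left",
    "s": "move_backward",
    "d": "strafe_right",
    "space": "jump",
}

# Inference time: the model emits semantic actions, which are mapped back
# onto whatever input scheme the target environment expects.
DEFAULT_BINDINGS = {label: key for key, label in SEMANTIC_LABELS.items()}

def to_inputs(predicted_actions: list[str]) -> list[str]:
    """Convert model-predicted semantic actions into concrete key presses."""
    return [DEFAULT_BINDINGS[action] for action in predicted_actions]

print(to_inputs(["move_forward", "jump"]))  # -> ['w', 'space']
```

In practice, many different personal keybindings collapse to the same semantic label, which is what makes the stored data non-identifying while keeping it usable for training.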
True world models must go beyond video generation to predict the full range of possible outcomes conditioned on the actions taken. (08:43) GI's models demonstrate spatial memory, handle partial observability through smoke and occlusion, and maintain consistency across different camera views and rapid movements. The models can unstick themselves from errors, navigate complex environments, and even exhibit superhuman behaviors learned from peak gaming moments, showing they understand physics and spatial relationships.
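One way to read "beyond video generation" is as the difference between modeling p(next frame | past frames) and modeling p(next frame | past frames, action). Below is a deliberately simplified sketch of that action-conditioned interface, assuming PyTorch; the tiny linear-plus-GRU architecture is a placeholder for illustration, not GI's model.

```python
import torch
import torch.nn as nn

class ActionConditionedWorldModel(nn.Module):
    """Predicts the next frame given past frames AND the action taken."""

    def __init__(self, frame_dim: int, num_actions: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Linear(frame_dim, hidden)            # perceive pixels
        self.action_embed = nn.Embedding(num_actions, hidden)  # condition on actions
        self.dynamics = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, frame_dim)            # predict next frame

    def forward(self, frames: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, frame_dim) floats; actions: (batch, time) int ids
        x = self.encoder(frames) + self.action_embed(actions)
        h, _ = self.dynamics(x)   # recurrent state acts as spatial memory,
        return self.decoder(h)    # carrying occluded regions through time
```

The recurrent state is what lets even this toy version "remember" things it can no longer see, a crude stand-in for the spatial memory and occlusion handling described above.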
GI's approach involves training on arcade-style games first, then transferring to realistic games, and finally to real-world video footage. (06:17) This graduated transfer allows the models to label any video on the internet as if it were controllable with keyboard and mouse inputs. The strategy leverages the fact that gaming environments already contain many real-world behaviors and physics, making them ideal stepping stones for developing general spatial intelligence.
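A sketch of that three-stage curriculum, with stub functions standing in for the unpublished training and labeling machinery; every name here is a hypothetical stand-in:

```python
def train(model, labeled_clips):
    """Placeholder for training on clips that carry real action labels."""
    return model

def label_actions(model, video):
    """Placeholder for inverse-dynamics labeling: infer, for each frame
    transition, the keyboard/mouse action that would explain it."""
    return ["<inferred_action>" for _ in video]

def curriculum(model, arcade_clips, realistic_clips, web_videos):
    model = train(model, arcade_clips)     # stage 1: simple physics, dense labels
    model = train(model, realistic_clips)  # stage 2: visually realistic games
    # Stage 3: treat arbitrary internet video as if it were controllable,
    # attaching pseudo action labels that extend the training corpus.
    return model, [(video, label_actions(model, video)) for video in web_videos]
```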
Pim's transformation from business founder to technical AI leader demonstrates the value of deep, first-principles learning. (38:43) He completed François Fleuret's comprehensive deep learning course, covering linear algebra, calculus, neural networks, and the complete history of deep learning from first principles. This foundational understanding enabled him to contribute meaningfully to technical discussions and make informed strategic decisions about model architecture and research direction.