
Timestamps are as accurate as possible but may be slightly off. We encourage you to listen to the full episode for context.
At the AI Engineer Conference, we caught up with Bryan and Bill from OpenAI's Codex team right after their launch of Codex Max, a new long-running coding agent designed to work for 24+ hours straight while managing its own context window. (01:27) The discussion reveals how OpenAI is shifting from traditional model training to building agents with distinct personalities that developers can trust. (03:02) Both researchers shared insights on how they train models to exhibit specific behavioral characteristics like communication, planning, and self-checking—essentially turning software engineering best practices into measurable model behaviors. The conversation also explored how the abstraction layer is moving from individual models to complete agent systems that can spawn sub-agents and work in parallel across entire codebases. (12:03)
A key member of OpenAI's Codex training team who worked closely with the GPT-5 training team, focusing on personality development for coding models. He has launched open source projects written entirely by Codex and represents the bleeding edge of agent-first development at OpenAI, where approximately 50% of employees now use Codex daily.
Part of OpenAI's Codex team working on frontier coding model development and agent optimization. He specializes in the technical implementation of long-running coding agents and collaborates closely with coding partners to develop tool integrations and discover model capabilities that weren't initially anticipated by the training team.
OpenAI discovered that trust between developers and coding agents requires specific behavioral characteristics beyond raw capability. (03:02) They identified communication, planning, context gathering, and self-checking as essential personality traits that mirror best software engineering practices. This approach transforms abstract behaviors into measurable training targets, allowing models to act more like trusted colleagues than mere tools. The practical impact is significant: roughly 50% of OpenAI employees now use Codex daily, a shift that took hold once the model learned to communicate its thought process and planning steps.
Codex has developed specific tool preferences through training, such as strongly preferring "rg" (ripgrep) over "grep" for search operations. (07:48) Rather than fighting these preferences, successful implementations work with them by naming tools to match the model's training patterns. Partners discovered that renaming tools to match Codex's terminal-style expectations dramatically improved tool-call performance, demonstrating that understanding and accommodating model habits can unlock better results than trying to force generalization.
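A minimal sketch of that naming pattern in Python, assuming a generic JSON-style tool schema rather than any specific Codex integration; the run_ripgrep helper and both tool definitions are illustrative, and only the contrast between the tool names reflects the point above.

```python
# Sketch only: exposing the same file-search capability under two tool names.
# The schema format and run_ripgrep helper are hypothetical; the idea is that
# naming the tool after the terminal habit the model already has ("rg")
# improves how reliably it gets called.
import subprocess

def run_ripgrep(pattern: str, path: str = ".") -> str:
    """Run ripgrep and return matching lines (assumes rg is installed)."""
    result = subprocess.run(
        ["rg", "--line-number", pattern, path],
        capture_output=True, text=True,
    )
    return result.stdout

# A generic, descriptive name the model rarely reaches for on its own.
search_tool_generic = {
    "name": "search_files",
    "description": "Search the repository for a regex pattern.",
    "parameters": {"pattern": "regex to search for", "path": "directory to search"},
}

# The same capability named to match the model's terminal-style expectations.
search_tool_rg = {
    "name": "rg",
    "description": "ripgrep-style regex search over the repository.",
    "parameters": {"pattern": "regex to search for", "path": "directory to search"},
}
```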
The future of AI development is shifting from optimizing individual model calls to packaging complete agent systems. (12:03) Rather than constantly adapting to new model releases and API changes, developers can build on top of complete agents like Codex that include their own harness, tooling, and behavioral patterns. This allows teams to focus on higher-level integration work while the agent handles the complexity of optimal model usage, sandboxing, and context management internally.
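A rough sketch of what building on a packaged agent looks like, assuming a locally installed Codex-style CLI with a non-interactive exec subcommand; the exact command and arguments are assumptions. The point is that a single delegation call replaces hand-rolled model selection, tool loops, and context management.

```python
# Sketch only: delegate a whole task to a packaged agent instead of
# orchestrating raw model calls. The "codex exec" invocation here is an
# assumed interface, not a documented one.
import subprocess

def delegate_task(task: str, repo_path: str) -> str:
    """Hand a complete task to the agent; it manages model calls, tooling,
    sandboxing, and context internally and returns its final summary."""
    result = subprocess.run(
        ["codex", "exec", task],
        cwd=repo_path,
        capture_output=True,
        text=True,
    )
    return result.stdout

if __name__ == "__main__":
    print(delegate_task("Add unit tests for the date-parsing helpers.", "."))
```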
Codex Max was specifically designed to manage its own context window and spawn sub-agents for parallel work across different parts of a codebase. (14:45) This architecture lets an agent hand off context to specialized sub-agents, so complex problems can be decomposed and solved simultaneously. The practical application extends beyond coding: agents can create custom tools by spinning up Codex instances to write integrations or plugins for specific APIs, making software self-customizable at runtime.
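A minimal sketch of that fan-out pattern, not Codex Max's internal implementation; run_subagent is a hypothetical placeholder for however a sub-agent is actually launched (an API call, a CLI process, and so on), and the subtasks are invented for illustration.

```python
# Sketch only: a parent agent decomposes a goal into independent slices and
# fans them out to sub-agents that work in parallel, each receiving only the
# context it needs.
from concurrent.futures import ThreadPoolExecutor

def run_subagent(task: str, context: str) -> str:
    """Placeholder for handing a focused slice of work to a sub-agent."""
    return f"[result for: {task}]"

def solve_in_parallel(goal: str, shared_context: str) -> list[str]:
    # Decompose the goal into independent pieces of the codebase.
    subtasks = [
        "Update the API layer for the new endpoint.",
        "Migrate the database schema.",
        "Write integration tests for the new flow.",
    ]
    # Run the sub-agents simultaneously and collect their results.
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        return list(pool.map(lambda t: run_subagent(t, shared_context), subtasks))
```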
OpenAI shifted focus from academic evaluations to "applied evals" that capture real-world use cases and customer needs. (18:03) This approach treats model development like hiring and mentoring, where models need job descriptions (prompts), mentorship (guardrails), and performance reviews (evals) to improve at specific tasks. Multi-turn evaluations using LLM-as-a-judge can assess entire agent trajectories, enabling models to self-improve by reviewing their own performance and updating their instructions for future tasks.
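A minimal sketch of an LLM-as-a-judge pass over a full agent trajectory, using the OpenAI Python client; the rubric wording, judge model name, and scoring format are assumptions, not OpenAI's internal eval harness. The rubric deliberately mirrors the behavioral traits named earlier: communication, planning, context gathering, and self-checking.

```python
# Sketch only: grade an entire multi-turn trajectory, not a single reply.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "You are reviewing a coding agent's full trajectory. Score 1-5 on each of: "
    "communication, planning, context gathering, and self-checking. "
    "Return one line per criterion as 'name: score - reason'."
)

def judge_trajectory(trajectory: str) -> str:
    """Ask a judge model to review the whole run and return rubric scores."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; any capable model works
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": trajectory},
        ],
    )
    return response.choices[0].message.content
```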