Reflections on OpenAI
I left OpenAI three weeks ago. I had joined the company back in May 2024.
I wanted to share my reflections because there's a lot of smoke and noise around what OpenAI is doing, but not a lot of first-hand accounts of what the culture of working there actually feels like.
Nabeel Qureshi has an amazing post called Reflections on Palantir, where he ruminates on what made Palantir special. I wanted to do the same for OpenAI while it's fresh in my mind. You won't find any trade secrets here, more just reflections on this current iteration of one of the most fascinating organizations in history at an extremely interesting time.
To put it up-front: there wasn't any personal drama in my decision to leave–in fact I was deeply conflicted about it. It's hard to go from being a founder of your own thing to an employee at a 3,000-person organization. Right now I'm craving a fresh start.
It's entirely possible that the quality of the work will draw me back. It's hard to imagine building anything as impactful as AGI, and LLMs are easily the technological innovation of the decade. I feel lucky to have seen some of the developments first-hand and also been a part of the Codex launch.
Obviously these aren't the views of the company–these observations are my own. OpenAI is a big place, and this is my little window into it.
Culture
The first thing to know about OpenAI is how quickly it's grown. When I joined, the company was a little over 1,000 people. One year later, it is over 3,000 and I was in the top 30% by tenure. Nearly everyone in leadership is doing a drastically different job than they were ~2-3 years ago. 1
Of course, everything breaks when you scale that quickly: how to communicate as a company, the reporting structures, how to ship product, how to manage and organize people, the hiring processes, etc. Teams vary significantly in culture: some are sprinting flat-out all the time, others are babysitting big runs, some are moving along at a much more consistent pace. There's no single OpenAI experience, and research, applied, and GTM operate on very different time horizons.
An unusual part of OpenAI is that everything, and I mean everything, runs on Slack. There is no email. I maybe received ~10 emails in my entire time there. If you aren't organized, you will find this incredibly distracting. If you curate your channels and notifications, you can make it pretty workable.
OpenAI is incredibly bottoms-up, especially in research. When I first showed up, I started asking questions about the roadmap for the next quarter. The answer I got was: "this doesn't exist" (though now it does). Good ideas can come from anywhere, and it's often not really clear which ideas will prove most fruitful ahead of time. Rather than a grand 'master plan', progress is iterative and uncovered as new research bears fruit.
Thanks to this bottoms-up culture, OpenAI is also very meritocratic. Historically, leaders in the company are promoted primarily based upon their ability to have good ideas and then execute upon them. Many leaders who were incredibly competent weren't very good at things like presenting at all-hands or political maneuvering. That matters less at OpenAI than it might at other companies. The best ideas do tend to win. 2
There's a strong bias to action (you can just do things). It wasn't unusual for similar but unrelated teams to converge on various ideas. I started out working on a parallel (but internal) effort similar to ChatGPT Connectors. There must've been ~3-4 different Codex prototypes floating around before we decided to push for a launch. These efforts are usually taken on by a small handful of individuals without asking permission. Teams tend to quickly form around them as they show promise.
Andrey (the Codex lead) used to tell me that you should think of researchers as their own "mini-executive". There is a strong bias to work on your own thing and see how it pans out. There's a corollary here–most research gets done by nerd-sniping a researcher into a particular problem. If something is considered boring or 'solved', it probably won't get worked on.
Good research managers are insanely impactful and also incredibly limited. The best ones manage to connect the dots between many different research efforts and pull them together into a bigger model training run. The same goes for great PMs (shoutout ae).
The ChatGPT EMs I worked with (Akshay, Rizzo, Sulman) were some of the coolest customers I've ever seen. It really felt like they had seen everything at this point 3. Most of them were relatively hands-off, but hired good people and tried to make sure they were set up for success.
OpenAI changes direction on a dime. This was a thing we valued a lot at Segment–it's much better to do the right thing as you get new information, vs decide to stay the course just because you had a plan. It's remarkable that a company as large as OpenAI still maintains this ethos–Google clearly doesn't. The company makes decisions quickly, and when deciding to pursue a direction, goes all in.
There is a ton of scrutiny on the company. Coming from a B2B enterprise background, I found this a bit of a shock. I'd regularly see news stories broken in the press that hadn't yet been announced internally. I'd tell people I work at OpenAI and be met with a pre-formed opinion on the company. A number of Twitter users run automated bots which check to see if there are new feature launches coming up.
As a result, OpenAI is a very secretive place. I couldn't tell anyone what I was working on in detail. There's a handful of Slack workspaces with various permissions. Revenue and burn numbers are more closely guarded.
OpenAI is also a more serious place than you might expect, in part because the stakes feel really high. On the one hand, there's the goal of building AGI–which means there is a lot to get right. On the other hand, you're trying to build a product that hundreds of millions of users leverage for everything from medical advice to therapy. And on the other, other hand, the company is competing in the biggest arena in the world. We'd pay close attention to what was happening at Meta, Google, and Anthropic–and I'm sure they were all doing the same. All of the major world governments are watching this space with a keen interest.
As often as OpenAI is maligned in the press, everyone I met there is actually trying to do the right thing. Given the consumer focus, it is the most visible of the big labs, and consequently there's a lot of slander for it.
That said, you probably shouldn't view OpenAI as a single monolith. I think of OpenAI as an organization that started like Los Alamos. It was a group of scientists and tinkerers investigating the cutting edge of science. That group happened to accidentally spawn the most viral consumer app in history. And then grew to have ambitions to sell to governments and enterprises. People of different tenure and in different parts of the org subsequently have very different goals and viewpoints. The longer you've been there, the more you probably view things through the "research lab" or "non-profit for good" lens.
The thing that I appreciate most is that the company "walks the walk" in terms of distributing the benefits of AI. Cutting-edge models aren't reserved for some enterprise-grade tier with an annual agreement. Anybody in the world can jump onto ChatGPT and get an answer, even if they aren't logged in. There's an API you can sign up for and use–and most of the models (even if SOTA or proprietary) tend to quickly make it into the API for startups to use. You could imagine an alternate regime that operates very differently from the one we're in today. OpenAI deserves a ton of credit for this, and it's still core to the DNA of the company.
Safety is actually more of a thing than you might guess if you read a lot from Zvi or Lesswrong. There's a large number of people working to develop safety systems. Given the nature of OpenAI, I saw more focus on practical risks (hate speech, abuse, manipulating political biases, crafting bio-weapons, self-harm, prompt injection) than theoretical ones (intelligence explosion, power-seeking). That's not to say that nobody is working on the latter–there are definitely people focusing on the theoretical risks. But from my viewpoint, it's not the focus. Most of the work which is done isn't published, and OpenAI really should do more to get it out there.
Unlike other companies which freely hand out their swag at every career fair, OpenAI doesn't really give much swag (even to new employees). Instead there are 'drops' where you can order in-stock items. The first one had so much demand that it brought down the Shopify store. There was an internal post which circulated on how to POST the right JSON payloads and circumvent this.
Nearly everything is a rounding error compared to GPU cost. To give you a sense: a niche feature that was built as part of the Codex product had the same GPU cost footprint as our entire Segment infrastructure (not the same scale as ChatGPT but saw a decent portion of internet traffic).
OpenAI is perhaps the most frighteningly ambitious org I've ever seen. You might think that having one of the top consumer apps on the planet might be enough, but there's a desire to compete across dozens of arenas: the API product, deep research, hardware, coding agents, image generation, and a handful of others which haven't been announced. It's a fertile ground for taking ideas and running with them.
The company pays a lot of attention to twitter. If you tweet something related to OpenAI that goes viral, chances are good someone will read about it and consider it. A friend of mine joked, "this company runs on twitter vibes". As a consumer company, perhaps that's not so wrong. There's certainly still a lot of analytics around usage, user growth, and retention–but the vibes are just as important.
Teams at OpenAI are much more fluid than they might be elsewhere. When launching Codex, we needed some help from a few experienced ChatGPT engineers to hit our launch date. We met with some of the ChatGPT EMs to make the request. The next day we had two badass folks ready to dive in and help. There was no "waiting for quarterly planning" or "re-shuffling headcount". It moved really quickly.
Leadership is quite visible and heavily involved. This might be obvious at a company such as OpenAI, but every exec seemed quite dialed in. You'd see gdb, sama, kw, mark, dane, et al chime in regularly on Slack. There are no absentee leaders.
Code
OpenAI uses a giant monorepo which is ~mostly Python (though there is a growing set of Rust services and a handful of Golang services sprinkled in for things like network proxies). This creates a lot of strange-looking code because there are so many ways you can write Python. You will encounter both libraries designed for scale by 10-year Google veterans and throwaway Jupyter notebooks from newly-minted PhDs. Pretty much everything operates around FastAPI to create APIs and Pydantic for validation. But there are no style guides enforced writ large.
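For flavor, here's a minimal sketch of that FastAPI + Pydantic pattern. The endpoint and model names are hypothetical examples, not anything from the actual codebase:

```python
# Minimal sketch of the FastAPI + Pydantic service pattern described above.
# Endpoint and model names are hypothetical, purely for illustration.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 256  # validated and defaulted by Pydantic


class CompletionResponse(BaseModel):
    text: str


@app.post("/v1/complete", response_model=CompletionResponse)
async def complete(req: CompletionRequest) -> CompletionResponse:
    # The request body is parsed and validated into CompletionRequest automatically;
    # the return value is serialized and checked against CompletionResponse.
    return CompletionResponse(text=f"echo: {req.prompt[:req.max_tokens]}")
```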
OpenAI runs everything on Azure. What's funny about this is there are exactly three services that I would consider trustworthy: Azure Kubernetes Service, CosmosDB (Azure's document storage), and BlobStore. There are no true equivalents of Dynamo, Spanner, Bigtable, BigQuery, Kinesis, or Aurora. It's a bit rarer to think in auto-scaling units. The IAM implementations tend to be way more limited than what you'd get from AWS. And there's a strong bias to implement in-house.
When it comes to personnel (at least in eng), there's a very significant Meta → OpenAI pipeline. In many ways, OpenAI resembles early Meta: a blockbuster consumer app, nascent infra, and a desire to move really quickly. Most of the infra talent I've seen brought over from Meta + Instagram has been quite strong.
Put these things together, and you see a lot of core parts of infra that feel reminiscent of Meta. There was an in-house reimplementation of TAO. An effort to consolidate auth identity at the edge. And I'm sure a number of others I don't know about.
Chat runs really deep. Since ChatGPT took off, a lot of the codebase is structured around the idea of chat messages and conversations. These primitives are so baked in at this point that you ignore them at your own peril. We did deviate from them a bit in Codex (leaning more into learnings from the responses API), but we leveraged a lot of prior art.
Code wins. Rather than having some central architecture or planning committee, decisions are typically made by whichever team plans to do the work. The result is that there's a strong bias for action, and often a number of duplicate parts of the codebase. I must've seen half a dozen libraries for things like queue management or agent loops.
There were a few areas where having a rapidly scaled eng team and not a lot of tooling created issues. sa-server (the backend monolith) was a bit of a dumping ground. CI broke a lot more frequently than you might expect on master. Test cases could take ~30 minutes to run on GPUs, even when running in parallel and factoring in only a subset of dependencies. These weren't unsolvable problems, but it's a good reminder that these sorts of problems exist everywhere, and they are likely to get worse when you scale super quickly. To the credit of the internal teams, there's a lot of focus going into improving this story.
Other things I learned
What a big consumer brand looks like. I hadn't really internalized this until we started working on Codex. Everything is measured in terms of 'pro subs'. Even for a product like Codex, we thought of onboarding primarily in terms of individual usage rather than teams. It broke my brain a bit, coming from a predominantly B2B / enterprise background. You flip a switch and you get traffic from day 1.
How large models are trained (at a high-level). There's a spectrum from "experimentation" to "engineering". Most ideas start out as small-scale experiments. If the results look promising, they then get incorporated into a bigger run. Experimentation is as much about tweaking the core algorithms as it is tweaking the data mix and carefully studying the results. On the large end, doing a big run almost looks like giant distributed systems engineering. There will be weird edge cases and things you didn't expect. It's up to you to debug them.
How to do GPU-math. We had to forecast out the load capacity requirements as part of the Codex launch, and it was the first time I'd really spent time benchmarking GPUs. The gist is that you should actually start from the latency requirements you need (overall latency, # of tokens, time-to-first-token) vs doing a bottoms-up analysis of what a GPU can support. Every new model iteration can change the load patterns wildly.
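To make that concrete, here's a back-of-envelope sketch of the kind of math involved, working backwards from latency targets. Every number below is a made-up assumption for illustration, not a real Codex or benchmark figure:

```python
# Back-of-envelope GPU capacity sketch, starting from latency requirements.
# All numbers are invented assumptions for illustration only.

peak_requests_per_sec = 50          # assumed peak traffic
tokens_per_request = 2_000          # assumed average output tokens per task
target_completion_secs = 60         # we want each task to finish within a minute

# Total decode throughput the fleet must sustain at peak (tokens/sec).
required_tokens_per_sec = peak_requests_per_sec * tokens_per_request

# Assumed per-GPU decode throughput at our batch size and sequence length,
# which you'd get from benchmarking the model, not from a spec sheet.
tokens_per_sec_per_gpu = 1_500

gpus_for_throughput = required_tokens_per_sec / tokens_per_sec_per_gpu

# Separate check: a single request must also hit its own latency target
# (time-to-first-token would be a further constraint on top of this).
per_request_tokens_per_sec = tokens_per_request / target_completion_secs
assert per_request_tokens_per_sec <= tokens_per_sec_per_gpu, "single-stream decode too slow"

print(f"~{gpus_for_throughput:.0f} GPUs needed at peak, before headroom")
```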
How to work in a large Python codebase. Segment was a combination of microservices, mostly in Golang and TypeScript. We didn't really have the breadth of code that OpenAI does. I learned a lot about how to scale a codebase based upon the number of developers contributing to it. You have to put in a lot more guardrails for things like "works by default", "keep master clean", and "hard to misuse".
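As a generic illustration of the "works by default" and "hard to misuse" flavor of guardrail (my own example, not code from OpenAI or Segment): safe defaults, eager validation, and keyword-only arguments go a long way in a shared Python library.

```python
# A generic illustration of "hard to misuse" guardrails in a shared Python library.
# This is an invented example, not code from OpenAI or Segment.
from dataclasses import dataclass
from enum import Enum


class Region(Enum):
    US = "us"
    EU = "eu"


@dataclass(frozen=True)  # immutable config: callers can't mutate shared state
class ClientConfig:
    region: Region
    timeout_secs: float = 10.0  # sensible default so it "works by default"

    def __post_init__(self) -> None:
        if self.timeout_secs <= 0:  # fail loudly at construction, not at call time
            raise ValueError("timeout_secs must be positive")


def create_client(*, config: ClientConfig) -> dict:
    # Keyword-only argument keeps call sites explicit and readable.
    return {"region": config.region.value, "timeout": config.timeout_secs}


client = create_client(config=ClientConfig(region=Region.EU))
```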
Launching Codex
A big part of my last three months at OpenAI was launching Codex. It's unquestionably one of the highlights of my career.
To set the stage, back in November 2024, OpenAI had set a 2025 goal to launch a coding agent. By February 2025 we had a few internal tools floating around which were using the models to great effect. And we were feeling the pressure to launch a coding-specific agent. Clearly the models had gotten to the point where they were really useful for coding (witness the explosion of vibe-coding tools in the market).
I returned early from my paternity leave to help participate in the Codex launch. A week after I returned we had a (slightly chaotic) merger of two teams, and began a mad-dash sprint. From start (the first lines of code written) to finish, the whole product was built in just 7 weeks.
The Codex sprint was probably the hardest I've worked in nearly a decade. Most nights I was up until 11 or midnight. Waking up to a newborn at 5:30 every morning. Heading to the office again at 7a. Working most weekends. We all pushed hard as a team, because every week counted. It reminded me of being back at YC.
It's hard to overstate how incredible this level of pace was. I haven't seen organizations large or small go from an idea to a fully launched + freely available product in such a short window. The scope wasn't small either; we built a container runtime, made optimizations on repo downloading, fine-tuned a custom model to deal with code edits, handled all manner of git operations, introduced a completely new surface area, enabled internet access, and ended up with a product that was generally a delight to use. 4
Say what you will, OpenAI still has that launching spirit. 5
The good news is that the right people can make magic happen. We were a senior team of ~8 engineers, ~4 researchers, 2 designers, 2 GTM and a PM. Had we not had that group, I think we would've failed. Nobody needed much direction, but we did need a decent amount of coordination. If you get the chance to work with anyone on the Codex team, know that every one of them is fantastic.
The night before launch, five of us stayed up until 4a trying to deploy the main monolith (a multi-hour affair). Then it was back to the office for the 8a launch announcement and livestream. We turned on the flags, and started to see the traffic pour in. I've never seen a product get so much immediate uptick just from appearing in a left-hand sidebar, but that's the power of ChatGPT.
In terms of the product shape, we settled on a form factor which was entirely asynchronous. Unlike tools like Cursor (at the time, it now supports a similar mode) or Claude Code, we aimed to allow users to kick off tasks and let the agent run in its own environment. Our bet was that in the end-game, users would treat a coding agent like a co-worker: they'd send messages to the agent, it would get some time to do its work, and then it would come back with a PR.
This was a bit of a gamble: we're in a slightly weird state today where the models are good, but not great. They can work for minutes at a time, but not yet hours. Users have widely varying degrees of trust in the models' capabilities. And we're not even clear what the true capabilities of the models are.
Over the long arc of time, I do believe most programming will look more like Codex. In the meantime, it's going to be interesting to see how all the products unfold.
Codex (maybe unsurprisingly) is really good at working in a large codebase and understanding how to navigate it. The biggest differentiator I've seen vs other tools is the ability to kick off multiple tasks at once and compare their output.
I recently saw that there are public numbers comparing the PRs made by different LLM agents. Just going by the public numbers, Codex has generated 630,000 PRs. That's about 78k public PRs per engineer on our ~8-engineer team in the 53 days since launch (you can make your own guesses about the multiple of private PRs). I'm not sure I've ever worked on something so impactful in my life.
Parting thoughts
Truth be told, I was originally apprehensive about joining OpenAI. I wasn't sure what it would be like to sacrifice my freedom, to have a boss, to be a much smaller piece of a much larger machine. I kept it fairly low-key that I had joined, just in case it wasn't the right fit.
I did want to get three things from the experience...
- to build intuition for how the models were trained and where the capabilities were going
- to work with and learn from amazing people
- to launch a great product
In reflecting on the year, I think it was one of the best moves I've ever made. It's hard to imagine learning more anywhere else.
If you're a founder and feeling like your startup really isn't going anywhere, you should either 1) deeply re-assess how you can take more shots on goal or 2) go join one of the big labs. Right now is an incredible time to build. But it's also an incredible time to peer into where the future is headed.
As I see it, the path to AGI is a three-horse race right now: OpenAI, Anthropic, and Google. Each of these organizations is going to take a different path to get there based upon its DNA (consumer vs business vs rock-solid-infra + data). 6 Working at any of them will be an eye-opening experience.
Thank you to Leah for being incredibly supportive and taking the majority of the childcare throughout the late nights. Thanks to PW, GDB, and Rizzo for giving me a shot. Thanks to the SA teammates for teaching me the ropes: Andrew, Anup, Bill, Kwaz, Ming, Simon, Tony, and Val. And thanks to the Codex core team for giving me the ride of a lifetime: Albin, AE, Andrey, Bryan, Channing, DavidK, Gabe, Gladstone, Hanson, Joey, Josh, Katy, KevinT, Max, Sabrina, SQ, Tibo, TZ and Will. I'll never forget this sprint.
Wham.
Footnotes
1. It's easy to try and read into a lot of drama whenever there's a departing leader, but I would chalk ~70% of them up to this fact alone.
2. I do think we're in a slight phase change here. There's a lot of senior leadership hires being made from outside the company. I'm generally in favor of this; I think the company benefits a lot from infusing new external DNA.
3. I get the sense that scaling the fastest growing consumer product ever tends to build a lot of muscle.
4. Of course, we were also standing on the shoulders of giants. The CaaS team, core RL teams, human data, and general applied infra made this all possible.
5. We kept it going too.
6. We saw some big hires at Meta a few weeks ago. xAI launched Grok 4, which performs well on benchmarks. Mira and Ilya both have great talent. Maybe that will change things (the people are good). They have some catching up to do.