What a week. The feeling isn’t just acceleration anymore; it’s delirium. You blink, and another model is released, another rumour flies. The pace is truly accelerating beyond all reasonable expectation. Welcome to artificial general confusion.
This week felt like a microcosm of the entire AI race. OpenAI dropped o3 and o4-mini, their latest reasoning models. o3 is really quite impressive: it can use an array of tools, from Python to web search and image analysis, to do your bidding.
o3 has shown its mettle in my early tests. A particularly tricky one is my flight challenge. The test involves making a transatlantic booking with specific constraints (like my preference for particular planes). o3 did really well, beating all other models, but, for now, I’m still better. Another has been a multi-factor, quite complex, real-world strategic problem. o3 worked its way through the study like a master strategist: pulling out the key issues and addressing them with just enough detail.
One of the more formal measures I’m tracking closely is METR’s time horizon, which tests AI’s ability to complete long tasks. Mastering long tasks is a key unlock for significant productivity enhancements.
On this o3 does not disappoint.
Benchmarks offer a glimpse but we should be cautious about simplistic readings. The true, messier reality of real-world application shows a palpable rate of improvement, even as capabilities feel like jagged edges pushed into the market.
Predictably, there was a stampede online, people breathlessly declaring this “artificial general intelligence,” as if we’ve finally tripped over some obvious finish line. It’s a line that, frankly, remains stubbornly undefined.
Within a day, Google fired back, launching Gemini 2.5 Flash. This is the faster, more efficient sibling to the 2.5 Pro model that’s become my go-to. At least on one benchmark, Google seems to be carving out a fascinating and potentially dominant space right now. Their price-performance frontier is helped, no doubt, by their hardware expertise.
Anthropic’s releases this week went rather unnoticed. Claude can now search your email, Google Drive, and calendar, although it is ponderously slow, and I use other AI tools on my email.
Then rumours started that OpenAI is considering buying Windsurf, one of those increasingly indispensable code completion tools that software engineers use to magnify their output. The purchase price? A cool $3 billion. What is more, there are rumours that Sam Altman might turn the billion-user base of ChatGPT into a social network full of yeets.
How do we make sense of this four-ring circus of releases and more?
First, the pace is unsustainable for traditional product cycles. Products are being released faster than they can be properly described, product-managed, or even benchmarked in a way that’s useful to anyone outside the lab. The capabilities, while impressive, feel like jagged edges pushed out into the market. As
says: