A benchmark published this week set out to measure something most AI teams quietly avoid measuring: how well the best available models perform on the kind of work people actually do at their desks. Not coding puzzles. Not trivia. Not summarizing a clean PDF. Real, multi-step knowledge work, the kind that requires judgment, context switching, and the ability to recover from your own earlier mistakes.
The result: the top-performing model completed roughly 3 percent of tasks fully and correctly.
That number will surprise people who have been watching demos. It will not surprise anyone who has tried to run a real workflow on top of a large language model.



