A new study published on November 4, 2025, and led by researchers at the Oxford Internet Institute has cast serious doubt on how we measure artificial intelligence. It finds that the benchmarks used to evaluate AI systems regularly oversell their performance.
The paper was co-authored by researchers from more than three dozen institutions, including the Weizenbaum Institute Berlin, the UK AI Security Institute, UC Berkeley, the Allen Institute for AI, and Stanford University.
That’s disturbing because it implies that AI models are not as advanced as their developers claim.
The Illusion of Progress
Benchmarks are standardized tests that measure an AI model’s capabilities: things like reasoning, coding, or language understanding. They’re what developers and researchers rely on to make claims of progress. “Our model scores higher than the last one, therefore it’s smarter.”
According to the study, these tests don’t actually measure what they say they do. They rely on recycled datasets, narrow tasks, or overly specific problems that fail to capture real-world intelligence, adaptability, and creativity.
This means AI systems are trained to excel at tests rather than to understand the world. Developers optimize models to perform well on these benchmarks, not for genuine reasoning or flexible problem-solving.
One example the researchers pointed to involved measuring a model’s command of Russian. Rather than probing the language broadly, one benchmark checked only whether a model could answer yes-or-no questions pulled from the Russian-language Wikipedia, a narrow stand-in for a much wider skill.
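To make the concern concrete, here is a minimal sketch of how a yes-or-no benchmark score is typically tallied. The dataset, the score function, and the always_yes “model” below are hypothetical illustrations, not taken from the study. Because the final number is just the fraction of correct answers, a model that exploits a skew toward “yes” answers can beat chance without any real grasp of Russian.

```python
# Minimal sketch of scoring a yes/no benchmark (hypothetical names throughout).
# A model that always answers the majority label can look strong on a skewed
# dataset without understanding the language at all.

def score(model_answer, examples):
    """Return the accuracy of `model_answer` over (question, label) pairs."""
    correct = sum(1 for question, label in examples if model_answer(question) == label)
    return correct / len(examples)

# Toy dataset skewed toward "yes", standing in for questions drawn from Wikipedia.
examples = [
    ("Is Moscow the capital of Russia?", "yes"),
    ("Was the article published after 2000?", "yes"),
    ("Is the subject a type of bird?", "no"),
    ("Does the text mention a river?", "yes"),
]

# A "model" that ignores the question entirely.
always_yes = lambda question: "yes"

print(score(always_yes, examples))  # 0.75 -- well above the 0.5 chance level
```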
Why Is This a Problem?
Flawed benchmarks have real-world consequences.
When companies tout benchmark scores as proof of intelligence, they’re misleading their audience. Claims that a model meets or surpasses benchmarks are used to attract investors and users. Worse, flawed benchmarks mean developers could overlook bias or failures that only appear outside of testing.
This study arrives at a time when tech companies are racing to achieve artificial general intelligence (AGI). That’s a type of AI that would match or even surpass human intelligence.
The whole point of the hype surrounding AI is that it has the ability to do things that are too complicated or tedious for a human. People are relying on AI for tasks like medical diagnostics, help with complex mathematical equations, or coding. Everyone is rushing to integrate this technology into every corner of our lives.
Now researchers are warning it might not be capable of doing the things tech companies said it would do?
How We Should Measure Intelligence
The study’s authors argue that benchmarks need a major overhaul. Instead of testing how well AI models can recall facts or perform niche tasks, new standards should focus on skills like creativity, adaptability, and contextual reasoning. New benchmarks should have clear definitions of the concepts they’re supposed to measure.
Experts recommend supplementing benchmark tests with “uplift” studies: evaluations in which AI models are tested alongside humans in real-world settings to see if they’re actually as intelligent as developers claim. These kinds of tests could reveal whether AI can truly match human capabilities or only simulate them.
Until then, it’s best to be skeptical whenever OpenAI says ChatGPT can rival a human in certain areas. Claims of superiority based on a benchmark score should be treated as marketing rather than as a legitimate measurement.