A new competition in the AI world: who can really think?

Artificial intelligence has long been more than a buzzword. It is shaping our future in many areas, from medicine to business and entertainment. But one question remains unanswered: can machines actually think, or are they merely imitating human thinking without understanding it? A new challenge from Google DeepMind could answer this question with a clear "no", at least for now.

In March 2025, Google DeepMind presented a new benchmark for AI models: BIG-Bench Extra Hard (BBEH). It is designed to put the reasoning abilities of language models to the test, and one thing is already clear: even the most advanced models still seem a long way from matching the human mind.

Why the new benchmark was necessary

The original BIG-Bench benchmark, launched in 2021 as a kind of universal test, already placed enormous demands on the models. By now, however, even the best of them, such as Google's Gemini 2.0 Flash, achieve astonishingly high accuracies of over 90 percent. That is an impressive figure, but also a problem: once models approach the ceiling of a test, it can no longer measure further progress. To keep the benchmark meaningful, the researchers developed BBEH.

Compared to its predecessor, BBEH goes much further. It replaces the existing tasks with significantly more demanding variants that require a broader range of reasoning skills. The models no longer just solve short, self-contained problems; they have to carry out long, complex chains of thought in which even a small slip can derail the entire answer.
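To make the idea of "accuracy on a benchmark" concrete, here is a minimal sketch of how such an evaluation is typically scored. The task list, the ask_model stub, and the exact-match comparison are hypothetical illustrations, not the actual BBEH harness.

```python
# Minimal sketch of benchmark-style scoring via exact-match accuracy.
# ask_model() and the tasks below are hypothetical stand-ins; a real
# harness would query a model API and use the benchmark's own tasks.

def ask_model(question: str) -> str:
    """Stand-in for a call to a language model (hypothetical)."""
    return "42"  # a real harness would send the question to the model here

def accuracy(tasks: list[dict]) -> float:
    """Fraction of tasks where the model's answer matches the reference."""
    correct = sum(
        ask_model(t["question"]).strip() == t["answer"].strip()
        for t in tasks
    )
    return correct / len(tasks)

tasks = [
    {"question": "What is 6 * 7?", "answer": "42"},
    {"question": "Name the capital of France.", "answer": "Paris"},
]
print(f"Accuracy: {accuracy(tasks):.1%}")  # prints 50.0% with the stub above
```

The point of a benchmark like BBEH is that raising the difficulty of each task pushes this score well below the ceiling again, so differences between models become measurable once more.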

What does the model comparison show?

It was a real showdown between the AI giants: general-purpose models such as Google DeepMind's Gemini 2.0 Flash and OpenAI's GPT-4o went up against specialized reasoning models from OpenAI and others. The results? Surprising.

The best general-purpose model from Google, Gemini 2.0 Flash, achieved only 9.8 percent accuracy in the tests, shockingly low for a model in this class. Even more striking is the result of the Chinese model DeepSeek R1, which performs so poorly on several tasks that it cannot even produce an answer. And yet there is a winner: OpenAI's o3-mini (high) outperforms the competition in many tests, especially on formal tasks. However, even this model shows clear weaknesses on softer reasoning demands such as humor or causal understanding.

The tests make one thing clear: current AI excels at formalized tasks such as counting or logic problems, but it quickly reaches its limits when faced with the messier aspects of human thought, such as distinguishing relevant from irrelevant information.

Why "putting your head in the clouds" is far from enough

What does all this mean for the future of AI? Despite impressive progress, AI still falls far short of expectations in many areas. It can excel at formulas and structures, but it fails when asked to interpret context or to process emotions and experience, the very things that make up the human mind.

The road to an AI that really "thinks" still seems long and rocky. Google DeepMind and OpenAI have made enormous progress, but it is obvious that research still has a long way to go before we have an AI that even comes close to the complexity of human thinking.

The real leap in AI: why speed alone is not enough

Although the advances in AI research are undeniable, one may wonder whether we are simply sharpening the machines for ever faster and more accurate answers while losing sight of the real goal. The real leap will not come when machines can count faster, but when they actually understand what they are doing, and that still requires a great deal of work. Until then, all we can do is watch with interest to see how this race unfolds.
