Mathematicians designed challenging problems to test the reasoning of Gemini, Claude, and GPT-4o - the models failed almost all of them


The latest artificial intelligence models have successfully solved only 2% of the complex mathematical problems created by leading mathematicians worldwide.

The Epoch AI Research Institute has introduced a new test set called FrontierMath, which demands doctoral-level mathematical expertise. Professors of mathematics, including Fields Medal laureates, contributed to its development. (The Fields Medal is the most prestigious international award in mathematics, presented every four years to mathematicians under the age of 40 for outstanding achievements; it is often referred to as the "Nobel Prize of Mathematics.") Solving such problems can take mathematicians with doctoral degrees anywhere from several hours to several days.

On previous tests such as MMLU (Measuring Massive Multitask Language Understanding), a standardized benchmark covering more than 57 subject areas including mathematics, physics, history, law, and medicine, AI models accurately solved 98% of high school and university level math problems. The new challenges present a vastly different scenario.

“These problems are exceptionally difficult. Currently, they can only be solved with the involvement of a specialist in the field or with the help of a graduate student in a related area, in conjunction with modern AI and other algebraic tools,” noted 2006 Fields Medal laureate Terence Tao.

The study evaluated six top AI models. Google’s Gemini 1.5 Pro (002) and Anthropic’s Claude 3.5 Sonnet achieved the best results with 2% correct answers. OpenAI's models o1-preview, o1-mini, and GPT-4o managed to solve only 1% of the problems, while xAI's Grok-2 Beta failed to solve any.

FrontierMath encompasses various areas of mathematics, from number theory to algebraic geometry. All test problems are available on the Epoch AI website. The developers have created unique problems that are not included in the training data of the AI models.

Researchers note that even when a model provided a correct answer, it did not always indicate a sound reasoning process; sometimes the result could be achieved through simple simulations without a deep understanding of the mathematics involved.

Source: Live Science
