Claude 3 artificial intelligence model outperforms GPT-4 for the first time on Chatbot Arena

The large language model (LLM) Claude 3 Opus from Anthropic has surpassed GPT-4 from OpenAI for the first time on Chatbot Arena.

"The king is dead," wrote software developer Nick Dobos in a post on X (Twitter), comparing GPT-4 Turbo and Claude 3 Opus.

The king is dead
RIP GPT-4
Claude opus #1 ELo
Haiku beats GPT-4 0613 & Mistral large
That's insane for how cheap & fast it is https://t.co/XWmvTE6h75 pic.twitter.com/fAwzJScLTH
— Nick Dobos (@NickADobos) March 26, 2024

Chatbot Arena is an open, crowdsourced platform for evaluating large language models. Its ranking is compiled by applying the Elo rating system to a large number of human judgments of the models' answers. The test works as follows: a user enters a query and picks the best answer from responses produced by different models. The leaderboard is then built from thousands of these user comparisons.
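As a rough illustration of how individual votes can turn into a ranking, here is a minimal Python sketch of a classic Elo update. The K-factor of 32, the starting rating of 1000, and the model names are assumptions made for this example; Chatbot Arena's actual computation may differ in its details.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return both ratings after one comparison.

    score_a is 1.0 if A's answer was preferred, 0.0 if B's was, 0.5 for a tie.
    k = 32 is an assumed K-factor, not a value taken from the leaderboard.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical models and votes, for illustration only.
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [
    ("model_x", "model_y", 1.0),  # user preferred model_x
    ("model_x", "model_y", 1.0),  # user preferred model_x
    ("model_x", "model_y", 0.0),  # user preferred model_y
]

for a, b, score_a in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)

# Sort from highest to lowest rating to form the leaderboard.
leaderboard = sorted(ratings.items(), key=lambda item: item[1], reverse=True)
print(leaderboard)
```

Each vote moves points from the losing model to the winning one, and an upset (a win against a higher-rated model) moves more points than an expected result.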

The Chatbot Arena leaderboard was launched on May 3, 2023, and GPT-4 was added to the ranking on May 10. Since then, various versions of GPT-4 had consistently topped the ranking, until now, which is why the arrival of a new leader is attracting attention. One of Anthropic's smaller models, Haiku, has also drawn attention with its performance on the leaderboard.

"For the first time, the best available models - Opus for complex tasks, Haiku for efficiency and cost-effectiveness - are available from a provider other than OpenAI," said independent AI researcher Simon Willison. "It's reassuring - we all benefit from diversity in leading providers in this field. But GPT-4 has been around for over a year, and it took this year for someone to catch up to it."

Behind Claude 3 Opus and two versions of GPT-4, the next model in the ranking is Google's Bard (Gemini Pro). The top three positions are separated by only 2-3 Elo points, whereas Bard trails third place by 45 points. All other competitors scored below 1200 points.
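To give a sense of what these gaps mean, the sketch below converts an Elo point difference into an expected head-to-head win rate. It assumes the standard Elo formula with a 400-point scale; the leaderboard's exact method may differ, so the figures are indicative only.

```python
def win_probability(elo_gap: float) -> float:
    """Expected win rate of the higher-rated model, given its Elo lead."""
    return 1.0 / (1.0 + 10 ** (-elo_gap / 400.0))


# Gaps mentioned in the article: 3 points within the top three, 45 points to Bard.
for gap in (3, 45):
    print(f"{gap}-point lead -> {win_probability(gap):.1%} expected win rate")
```

Under that assumption, a 3-point lead corresponds to roughly a 50.4% expected win rate, which is close to a coin flip, while a 45-point lead corresponds to roughly 56.4%.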

Source: Ars Technica