Agent Arena With the growing interest in Large Language Model (LLM) agents, there is a need for a unified and systematic way to evaluate agents.
Statistical Extensions of the Bradley-Terry and Elo Models Chatbot Arena uses the Bradley-Terry model for statistical inference on model strength. Recently, we have developed some extensions of the Bradley-Terry model, and of the closely related Elo model, for binary-comparison inference problems.
RedTeam Arena We are excited to launch RedTeam Arena, a community-driven redteaming platform, built in collaboration with Pliny and the BASI community!
Does Style Matter? We controlled for the effect of length and markdown, and indeed, the ranking changed. This is just a first step towards our larger goal of disentangling substance and style in the Chatbot Arena leaderboard.
Chatbot Arena Conversation Dataset Release Since its launch three months ago, Chatbot Arena has become a widely cited LLM evaluation platform that emphasizes large-scale, community-based, and interactive human evaluation. In that short time span, we collected around 53K votes from 19K unique IP addresses for 22 models. In this blog post, we are releasing an
The Multimodal Arena is Here! You can now chat with your favorite vision-language models from OpenAI, Anthropic, Google, and most other major LLM providers to help discover how these models stack up against each other. Contributors: Christopher Chou*, Lisa Dunlap*, Wei-Lin Chiang, Ying Sheng, Lianmin Zheng, Anastasios Angelopoulos, Trevor Darrell, Ion Stoica, Joseph E. Gonzalez
Introducing Hard Prompts Category in Chatbot Arena Introducing Hard Prompts, a new and challenging category in the Chatbot Arena Leaderboard. Contributors: Tianle Li, Wei-Lin Chiang, Lisa Dunlap. Background: Over the past few months, the community has shown a growing interest in more challenging prompts