Copilot Arena Copilot Arena has been downloaded 2.5K times on the VSCode Marketplace, served over 100K completions, and accumulated over 10K code completion battles.
Chatbot Arena Categories By grouping tasks into categories, we can assess models’ strengths and weaknesses in a more granular way.
Preference Proxy Evaluations Most LLMs are optimized using an LLM judge or reward model to approximate human preference. These training processes can cost hundreds of thousands or millions of dollars. How can we know whether to trust an LLM judge or reward model, given its critical role in guiding LLM training?
Agent Arena With the growing interest in Large Language Model (LLM) agents, there is a need for a unified and systematic way to evaluate agents.
Statistical Extensions of the Bradley-Terry and Elo Models Chatbot Arena uses the Bradley-Terry model for statistical inference on model strength. Recently, we have developed some extensions of the Bradley-Terry model, and of the closely related Elo model, for binary-comparison inference problems.
RedTeam Arena We are excited to launch RedTeam Arena, a community-driven red-teaming platform, built in collaboration with Pliny and the BASI community!
Does Style Matter? We controlled for the effect of length and markdown, and indeed, the ranking changed. This is just a first step towards our larger goal of disentangling substance and style in the Chatbot Arena leaderboard.