Arena-Rank: Open Sourcing the Leaderboard Methodology
Open and community-driven AI evaluation has been at the core of LMArena’s goals and identity since our launch in 2023. While we were incubating within LMSYS, the code behind the leaderboards was open-sourced in the FastChat repo. However, since we graduated into a company of our own, that repo has not been maintained.
At LMArena, we believe transparency is paramount in AI evaluations. Building community trust with open science is critical for the development of AI and its alignment with the needs and preferences of all users.
With that in focus, we’re delighted to publish Arena-Rank, an open-source Python package for ranking that powers the LMArena leaderboard! The new codebase includes a number of methodological upgrades we have made in the past few months, including a reweighting feature to ensure fair treatment of models for which we have fewer battles, closed-form confidence interval calculation, and a more than 30x speedup compared to the FastChat version. The Arena-Rank package is installable from PyPI. Released under the Apache 2.0 open-source license, this is the code that powers all of the leaderboards on our site today.
Getting Started
To get started using Arena-Rank, you can either install it from PyPI:
uv pip install arena-rank
or clone our repo and install the source code directly:
git clone https://github.com/lmarena/arena-rank && cd arena-rank && uv sync
The quickest way to start using Arena-Rank is to use one of the publicly released LMArena datasets. Below is a minimal example that downloads our data released in July, fits a basic Bradley-Terry ranking model on it, and prints ratings and confidence intervals for the top 10 models, all in only a handful of lines.
# Minimal example of how to produce a leaderboard from LMArena data
import pandas as pd
import datasets
from arena_rank.utils.data_utils import PairDataset
from arena_rank.models.bradley_terry import BradleyTerry
df = datasets.load_dataset(
    "lmarena-ai/arena-human-preference-140k",
    columns=["model_a", "model_b", "winner"],
)["train"].to_pandas()
dataset = PairDataset.from_pandas(df)
model = BradleyTerry(n_competitors=len(dataset.competitors))
# compute ratings and 95% confidence intervals
results = model.compute_ratings_and_cis(dataset, significance_level=0.05)
# print top 10 competitors with ratings and confidence intervals
leaderboard = pd.DataFrame(results).sort_values("ratings", ascending=False).head(10)
print(leaderboard.to_markdown(index=False))
We have several more advanced example notebooks in the examples folder of the repo, covering techniques such as the style-controlled leaderboard on LMArena, analysis of voter patterns on the PRISM alignment dataset, and applications of the same general Bradley-Terry methodology to sports and video game competitions, including professional basketball seasons and Super Smash Bros. tournaments.
Design Choices
In today’s release, Arena-Rank implements the Bradley-Terry model and an extension for handling contextual features, which we use for the style-controlled leaderboards. We’ve disentangled the upstream data pipeline logic from the leaderboard calculation, allowing for faster iteration on leaderboard-related experiments and easier extensibility to more model variants and applications.
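For intuition, here is a conceptual sketch (not Arena-Rank’s API) of how contextual features can enter a Bradley-Terry model: per-battle style features shift the win probability alongside the rating difference. Every name and number below is illustrative.

import numpy as np

def win_probability(rating_a, rating_b, style_features, style_coefs):
    # Contextual Bradley-Terry idea: the logit is the rating difference
    # plus a learned linear term over per-battle style features
    # (e.g., differences in response length or markdown usage)
    logit = (rating_a - rating_b) + style_features @ style_coefs
    return 1.0 / (1.0 + np.exp(-logit))

# illustrative values: model A is rated higher, but part of the observed
# preference is attributed to a style feature via its coefficient
print(win_probability(1.2, 1.0, np.array([0.8]), np.array([0.5])))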
We’ve also decoupled data preprocessing from model optimization by adopting a pattern of dataset classes and model classes, where a dataset can be preprocessed once and then have many different ranking models fit on it, allowing for efficient hyperparameter sweeping and computation of many leaderboard variants at the same time.
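As a minimal sketch of this preprocess-once, fit-many pattern, continuing from the example above (df is the battles dataframe loaded there), the same preprocessed dataset can be reused across several leaderboard variants. Anything not shown in that example is an assumption rather than documented API.

from arena_rank.utils.data_utils import PairDataset
from arena_rank.models.bradley_terry import BradleyTerry

dataset = PairDataset.from_pandas(df)  # preprocess the battles once
model = BradleyTerry(n_competitors=len(dataset.competitors))

# reuse the same preprocessed dataset for several leaderboard variants,
# here with different confidence levels for the intervals
for alpha in (0.10, 0.05, 0.01):
    results = model.compute_ratings_and_cis(dataset, significance_level=alpha)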
We’ve opted to use the JAX package as the computational backend. Its just-in-time compilation and efficient automatic differentiation enable a significant speedup over our previous NumPy/SciPy implementation, and there is still more room for improvement thanks to JAX's support for scaling on hardware accelerators like GPUs and TPUs. Together with other computational improvements, such as the use of closed-form confidence intervals instead of the bootstrap, the overall speedup compared to the previous FastChat version is more than 30x.
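As a rough illustration of the kind of computation JAX accelerates (and not Arena-Rank’s actual implementation), here is a tiny self-contained Bradley-Terry fit in which jax.jit compiles the gradient of the negative log-likelihood and plain gradient descent recovers ratings from toy battle data; all data and names here are illustrative.

import jax
import jax.numpy as jnp

# toy battle outcomes among 3 competitors: winner and loser index per battle
winners = jnp.array([0, 0, 1, 2, 0])
losers = jnp.array([1, 2, 2, 1, 1])

def neg_log_likelihood(ratings):
    # Bradley-Terry: P(i beats j) = sigmoid(rating_i - rating_j)
    margins = ratings[winners] - ratings[losers]
    return -jnp.sum(jax.nn.log_sigmoid(margins))

# jit-compile the gradient once; subsequent calls run the compiled kernel
grad_fn = jax.jit(jax.grad(neg_log_likelihood))

ratings = jnp.zeros(3)
for _ in range(500):
    ratings = ratings - 0.1 * grad_fn(ratings)  # plain gradient descent

print(ratings)  # higher value = stronger competitor (up to a constant shift)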
The Arena-Rank package is, of course, built with AI evaluation in mind, but we’ve also intentionally developed it to be general purpose and easy to use for calculating rankings for any type of competition data. Our repo includes examples of using Arena-Rank to compute leaderboards for sports and for competitive video games.
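As an illustrative sketch of ranking non-AI competition data, the same pipeline shown earlier can be applied to any pairwise outcomes; here we assume the same column convention (model_a, model_b, winner) as the LMArena dataset, with teams standing in for models.

import pandas as pd
from arena_rank.utils.data_utils import PairDataset
from arena_rank.models.bradley_terry import BradleyTerry

# toy game results; winner values assumed to name the winning side,
# as in the LMArena battles dataset
games = pd.DataFrame({
    "model_a": ["Lakers", "Celtics", "Warriors", "Lakers"],
    "model_b": ["Celtics", "Warriors", "Lakers", "Warriors"],
    "winner":  ["model_a", "model_b", "model_a", "model_b"],
})
dataset = PairDataset.from_pandas(games)
model = BradleyTerry(n_competitors=len(dataset.competitors))
results = model.compute_ratings_and_cis(dataset, significance_level=0.05)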
Looking Ahead
With today’s release, we are proud to take this step to prioritize openness and empower the AI evaluation and rating systems communities. But it doesn’t end here. We are committed to maintaining and improving this package both as our own methodologies evolve and as we get feedback from users and researchers who try it out.
As part of our broader commitment to transparent evaluation and open science, we’re looking forward to building out our framework for more regular leaderboard and data releases to build a fruitful ecosystem of open and reproducible AI evaluation.
- Check out the code on GitHub: https://github.com/lmarena/arena-rank
- Swing by to ask questions and request features in our Discord: discord.gg/LMArena