LMArena Leaderboard Policy
Last Updated: September 8, 2025
Live and Community-Driven LLM Evaluation
Transparency. The model evaluation and ranking pipelines have been open-sourced in the FastChat repository, and we also release a portion of the data collected on the platform. Together, this means that anyone can audit our leaderboard using publicly released data. The methodology and technical details behind LMArena have been published in a sequence of academic papers (1, 2, 3). As of September 2025, all updates to the leaderboard methodology are also logged in our Leaderboard Changelog. Many of the changes and improvements to our evaluation process are driven by community feedback.
Listing models on the leaderboard. The leaderboard will only include models that are generally available to the public. Specifically, models must meet at least one of the following criteria to qualify as publicly available:
- Open weights: The model’s weights are publicly accessible.
- Public APIs: The model is accessible via an API (e.g., OpenAI’s GPT-4o, Anthropic’s Claude) with transparent pricing and documentation.
- Public services: The model is available through a widely accessible public-facing service (e.g., Gemini App, ChatGPT).
Evaluating publicly released models. Evaluating a public model consists of the following steps:
- Add the model to Arena for testing and announce the addition to the community. The model provider may specify a system prompt as part of the model’s configuration.
- Accumulate votes until the model’s rating stabilizes (at least 1,000 votes; typically more). See the sketch after this list for one way stability can be gauged.
- After the rating stabilizes, list the model on the leaderboard. If the votes were collected while the model was unreleased (see “Evaluating unreleased models” section), we will mark the model score as preliminary until enough fresh votes have been collected after the model’s public release.
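For context on what “stabilizes” means here: as described in our methodology papers, Arena scores are derived from pairwise battle outcomes with a Bradley-Terry model, and stability can be gauged by how tight the bootstrap confidence interval around a model’s score is. The sketch below is a minimal illustration of that idea, not the production pipeline; the function names, the unregularized logistic-regression fit, and the omission of ties are simplifying assumptions.

```python
# Minimal sketch: fit Bradley-Terry scores from battle outcomes and gauge
# rating stability with a bootstrap confidence interval. Illustrative only;
# ties and anchoring of the score scale are omitted for brevity.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_bradley_terry(battles, n_models):
    """battles: list of (model_a, model_b, winner) with winner in {"a", "b"}.
    Returns one score per model; only score differences are meaningful."""
    X = np.zeros((len(battles), n_models))
    y = np.zeros(len(battles))
    for row, (a, b, winner) in enumerate(battles):
        X[row, a], X[row, b] = 1.0, -1.0
        y[row] = 1.0 if winner == "a" else 0.0
    # Very large C ~= no regularization; coefficients are BT log-strengths.
    clf = LogisticRegression(fit_intercept=False, C=1e9)
    clf.fit(X, y)
    return clf.coef_[0]

def bootstrap_interval(battles, n_models, n_rounds=100, seed=0):
    """Resample battles with replacement and refit to see how much each
    model's score still moves; a wide interval means 'not yet stable'."""
    rng = np.random.default_rng(seed)
    fits = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(battles), size=len(battles))
        fits.append(fit_bradley_terry([battles[i] for i in idx], n_models))
    fits = np.array(fits)
    return np.percentile(fits, 2.5, axis=0), np.percentile(fits, 97.5, axis=0)
```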
Evaluating unreleased models. We collaborate with model providers to bring their unreleased models to our community for preview testing.
Model providers can test an unreleased model with the model’s name anonymized. A model is considered “unreleased” if its weights are neither open nor available via a public API or service. Evaluating an unreleased model consists of the following steps:
- Add the model to Arena with an anonymous label. Each anonymous model has its own unique label.
- Keep testing the model until we accumulate enough votes for its rating to stabilize (at least 1,000; typically more) or until the model provider withdraws it.
- Once enough votes have accumulated, share the results privately with the model provider.
- Remove the model from Arena.
If a model is tested anonymously and is subsequently released publicly, we mark its score as preliminary until enough fresh votes have been collected after the model’s public release (see “Evaluating publicly released models”). Model providers may test multiple variants of a model before making it public, subject to our system’s constraints.
Sampling policy. The policy with which we sample model pairs in a battle is based on several principles:
- In every battle, at least one of the models is a publicly available model. At least 20% of battles will be between publicly available models only.
- We reserve the right to deprecate models. This may happen, for example, because a model is no longer publicly accessible, there is a more recent model in the same series (e.g., gpt-4o-0513 vs gpt-4o-0806), or multiple model providers offer cheaper and strictly better models according to the overall Arena score. To ensure transparency, all models that have been retired from battle mode are recorded in a public list.
- A publicly available model’s probability of being sampled increases with its overall Arena score and with the uncertainty around that score, as captured by the size of its confidence interval. This ensures both the best community experience and accurate evaluation for all public models. The regression that computes Arena scores applies reweighting so that the scores remain unbiased regardless of how the sampling probabilities are set; see the sketch following this list.
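As a rough illustration of the last point, the sketch below shows one way a sampling rule of this shape and the corresponding inverse-propensity reweighting could be implemented. The specific weighting formula, coefficients, and function names are assumptions for illustration, not the production policy.

```python
# Minimal sketch: sample public-model pairs with probability that grows with
# Arena score and confidence-interval width, then assign each battle an
# inverse-propensity weight so the rating regression stays unbiased.
# The weighting formula below is illustrative, not the production policy.
import numpy as np

def sampling_probs(scores, ci_widths, score_coef=1.0, ci_coef=1.0):
    """Per-model sampling probabilities: a higher score or a wider interval
    (more uncertainty) makes a model more likely to be sampled."""
    z = score_coef * (scores - scores.mean()) / scores.std() + ci_coef * ci_widths
    w = np.exp(z)
    return w / w.sum()

def sample_pair(probs, rng):
    """Draw two distinct models sequentially, proportional to their probs."""
    i = rng.choice(len(probs), p=probs)
    rest = probs.copy()
    rest[i] = 0.0
    j = rng.choice(len(probs), p=rest / rest.sum())
    return i, j

def inverse_propensity_weight(i, j, probs):
    """Weight each battle by 1 / P(this pair was sampled), so oversampled
    pairs do not pull the fitted Arena scores in their direction."""
    p_pair = probs[i] * probs[j] / (1.0 - probs[i]) + \
             probs[j] * probs[i] / (1.0 - probs[j])
    return 1.0 / p_pair
```

In a regression like the rating sketch above, these weights would be passed as per-battle sample weights (e.g., `sample_weight` in scikit-learn), which is what keeps the fitted scores independent of the sampling policy.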
Sharing data. We periodically share portions of our data with the community to support research and transparency. When we test unreleased models, we share conversation data with model providers to help them improve their models (see “Evaluating unreleased models”). Before sharing any data, we remove user PII via GCP’s Sensitive Data Protection API.
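For illustration, a de-identification call of this kind might look like the sketch below, assuming the google-cloud-dlp Python client; the project ID, info-type list, and transformation choice are placeholders rather than the exact configuration used in production.

```python
# Illustrative sketch (not the production pipeline): redacting common PII
# from a conversation string with GCP's Sensitive Data Protection (Cloud DLP)
# API before sharing any data. The project ID and info-type list are
# placeholders chosen for this example.
from google.cloud import dlp_v2

def redact_pii(text: str, project_id: str) -> str:
    client = dlp_v2.DlpServiceClient()
    parent = f"projects/{project_id}/locations/global"
    inspect_config = {
        "info_types": [
            {"name": "EMAIL_ADDRESS"},
            {"name": "PHONE_NUMBER"},
            {"name": "PERSON_NAME"},
        ]
    }
    # Replace each detected value with its info-type name, e.g. "[EMAIL_ADDRESS]".
    deidentify_config = {
        "info_type_transformations": {
            "transformations": [
                {"primitive_transformation": {"replace_with_info_type_config": {}}}
            ]
        }
    }
    response = client.deidentify_content(
        request={
            "parent": parent,
            "deidentify_config": deidentify_config,
            "inspect_config": inspect_config,
            "item": {"value": text},
        }
    )
    return response.item.value
```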
Any feedback?
Feel free to email us at contact@lmarena.ai or leave feedback on GitHub!