LMArena Leaderboard Policy
Last Updated: December 1, 2025
Live and Community-Driven LLM Evaluation
Transparency. The model evaluation and ranking pipelines have been open sourced in the FastChat repository. We also release a portion of the data collected on the platform. Together, this means that anyone can audit our leaderboard using the publicly released data. The methodology and technical details behind LMArena have been published in a series of academic papers (1, 2, 3). As of July 2025, all updates to the leaderboard methodology are also logged in our Leaderboard Changelog. Many of the changes and improvements to our evaluation process are driven by community feedback.
Listing models on the leaderboard. The leaderboard will only include models that are generally available to the public. Specifically, models must meet at least one of the following criteria to qualify as publicly available:
- Open weights: The model’s weights are publicly accessible.
- Public APIs: The model is accessible via an API (e.g., OpenAI’s GPT-4o, Anthropic’s Claude) with transparent pricing and documentation.
- Public services: The model is available through a widely accessible public-facing service (e.g., Gemini App, ChatGPT).
- Public early release on LMArena: The model is made available in Direct Chat on LMArena at the time of release and the following conditions are met:
- The model provider makes a public commitment (e.g., a blog post or X post) about the early access on LMArena, noting that the model will be available for public access at a later date.
- The model provider must confirm in writing that the pre-release model is identical to the model they intend to release publicly.
- If it is determined that the publicly released model differs from the pre-release version tested on LMArena, Arena will remove the model from the leaderboard until the model can be re-evaluated under the requirements of this policy.
- The score will be added to the leaderboard at launch as preliminary until the official public release (See “Evaluating unreleased models” section).
- The model provider must provide model access to LMArena for a minimum of 30 days.
- If model access is revoked before the 30 days have elapsed, Arena will remove the model from the leaderboard until the model can be re-evaluated under the requirements of this policy.

Evaluating publicly released models. Evaluating a public model consists of the following steps:
Evaluating unreleased models. We collaborate with model providers to bring their unreleased models to our community for preview testing.
Model providers can test an unreleased model with the model’s name anonymized. A model is considered “unreleased” if its weights are neither open nor available via a public API or service. Evaluating an unreleased model consists of the following steps:
- Add the model to Arena with an anonymous label. Each anonymous model has its own unique label.
- Keep testing the model until we accumulate enough votes for its rating to stabilize (at least 1,000; typically more) or until the model provider withdraws it (a sketch of this stopping criterion follows this list).
- Share the results privately with the model provider, once we accumulate enough votes.
- Remove the model from Arena.
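As an illustration of the stopping criterion in the second step, here is a minimal sketch in Python. It is not our production pipeline: the 1,000-vote floor comes from the policy above, while the hypothetical battle-outcome format, the bootstrap over a simple win-rate statistic, and the interval-width threshold are illustrative stand-ins for the confidence interval we compute on the full Arena score.

```python
import random

def rating_has_stabilized(outcomes, min_votes=1000, ci_width_threshold=0.04, resamples=2000):
    """Decide whether an anonymous model has accumulated enough votes.

    outcomes: one entry per battle for this model (1.0 = win, 0.5 = tie, 0.0 = loss).
    A bootstrap confidence interval on the mean outcome stands in for the
    interval we compute on the model's Arena score.
    """
    n = len(outcomes)
    if n < min_votes:  # policy floor: at least 1,000 votes, typically more
        return False

    # Bootstrap the mean outcome to estimate how uncertain the rating still is.
    means = []
    for _ in range(resamples):
        sample = [random.choice(outcomes) for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo, hi = means[int(0.025 * resamples)], means[int(0.975 * resamples)]

    # The rating counts as stable once the 95% interval is narrow enough.
    return (hi - lo) <= ci_width_threshold


# Example: 1,200 simulated votes for a model that wins about 55% of its battles.
votes = [1.0 if random.random() < 0.55 else 0.0 for _ in range(1200)]
print(rating_has_stabilized(votes))
```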
If a model is tested anonymously and is subsequently released publicly, we mark its score as preliminary until enough fresh votes have been collected after the model’s public release (see “Evaluating publicly released models”). Model providers are allowed to test multiple variants of their models before public release, subject to our system’s constraints.
Sampling policy. The policy with which we sample model pairs in a battle is based on several principles:
- In every battle, at least one of the models is a publicly available model. At least 20% of battles will be between publicly available models only.
- We reserve the right to deprecate models. This may happen, for example, because a model is no longer publicly accessible, there is a more recent model in the same series (e.g., gpt-4o-0513 vs gpt-4o-0806), or multiple model providers offer cheaper and strictly better models according to the overall Arena score. To ensure transparency, all models that have been retired from battle mode are recorded in a public list.
- A publicly available model’s probability of being sampled increases with its overall Arena score and the uncertainty around its score, captured by the confidence interval size. This is to ensure the best community experience as well as accurate evaluation for all public models. The regression for computing Arena scores uses reweighting, such that no matter how the sampling probabilities are set, the Arena scores remain unbiased.
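To make the last bullet concrete, the sketch below shows one way the two ideas could fit together: a weighting function that samples public models more often when their Arena score is high or their confidence interval is wide, and a Bradley-Terry style logistic regression with inverse-propensity weights so the fitted scores do not depend on how pairs were sampled. The specific formula, constants, and data format are illustrative assumptions, not our production code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# (a) Illustrative sampling weights: a higher score or a wider confidence
#     interval -> the model is sampled more often.
def sampling_weights(scores, ci_widths, score_scale=400.0):
    raw = np.exp(np.asarray(scores) / score_scale) * (1.0 + np.asarray(ci_widths))
    return raw / raw.sum()

# (b) Bradley-Terry fit with inverse-propensity reweighting.
def fit_bt_scores(battles, n_models, pair_sampling_prob):
    """battles: list of (model_a, model_b, winner) with winner in {"a", "b"}.
    pair_sampling_prob: probability with which each (sorted) model pair was
    sampled. Weighting each battle by 1/p keeps the fitted scores unbiased
    with respect to the sampling policy."""
    X, y, w = [], [], []
    for a, b, winner in battles:
        row = np.zeros(n_models)
        row[a], row[b] = 1.0, -1.0          # logit P(a beats b) = s_a - s_b
        X.append(row)
        y.append(1 if winner == "a" else 0)
        w.append(1.0 / pair_sampling_prob[tuple(sorted((a, b)))])
    reg = LogisticRegression(fit_intercept=False, penalty=None)
    reg.fit(np.array(X), np.array(y), sample_weight=np.array(w))
    return reg.coef_[0] * 400 / np.log(10)  # rescale to Elo-like units
```

The inverse-propensity weights are what make the estimate insensitive to the sampling distribution: battles from rarely sampled pairs count for more, so changing how often a pair is sampled changes the variance of the estimate but not its expectation.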
Sharing data. We periodically share portions of our data with the community to support research and transparency. When we test unreleased models, we share conversation data with model providers to help them improve their models (see “Evaluating unreleased models”). Before sharing any data, we use tools (e.g., Google Cloud’s Sensitive Data Protection service) to remove personal and sensitive data.
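For illustration, a minimal sketch of that redaction step using the google-cloud-dlp Python client is shown below; the project ID, the chosen info types, and the wrapper function are assumptions for the example, not a description of our internal pipeline.

```python
from google.cloud import dlp_v2

def redact_conversation(text, project_id="my-gcp-project"):
    """Replace detected identifiers with their info-type names,
    e.g. 'reach me at 555-0100' -> 'reach me at PHONE_NUMBER'."""
    client = dlp_v2.DlpServiceClient()
    info_types = [{"name": t} for t in ("EMAIL_ADDRESS", "PHONE_NUMBER", "PERSON_NAME")]
    response = client.deidentify_content(
        request={
            "parent": f"projects/{project_id}",
            "inspect_config": {"info_types": info_types},
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [
                        {"primitive_transformation": {"replace_with_info_type_config": {}}}
                    ]
                }
            },
            "item": {"value": text},
        }
    )
    return response.item.value

print(redact_conversation("Hi, I'm Jane Doe, reach me at jane@example.com."))
```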
Any feedback?
Feel free to email us at contact@lmarena.ai or leave feedback on GitHub!