Leaderboard Changelog
This page documents notable updates to our leaderboard—new models, new arenas, updates to the methodology, and more. Stay tuned!

July 25, 2025

New model announcements! Imagen 4 Generate Preview 06-06 v2 and Imagen 4 Ultra Generate Preview 06-06 v2 are now on the Text-to-Image leaderboard.

July 23, 2025

We made improvements to the methodology behind Arena scores!

Our leaderboard uses confidence intervals to represent the uncertainty and variability inherent in estimating scores from human voting. Until now, our confidence intervals have been computed via bootstrapping: we resample the dataset many times, calculate scores on each resample, and then look at the distribution of scores across all the runs. While statistically sound, this is computationally intensive, especially with a large number of battles. We've recently moved to a new method based on the Central Limit Theorem (CLT) for M-estimators, which allows us to compute confidence intervals via a closed-form expression.
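The difference between the two approaches can be sketched with a toy comparison: a bootstrap percentile interval versus a closed-form normal (CLT) interval, here for a simple win rate rather than the full M-estimator used on the real leaderboard. All data below is synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic outcomes of hypothetical battles for one model (1 = win, 0 = loss).
outcomes = rng.integers(0, 2, size=5000).astype(float)

# Bootstrap 95% CI: resample the dataset many times, recompute the score
# each time, then take percentiles of the resulting distribution.
boot_means = [rng.choice(outcomes, size=outcomes.size, replace=True).mean()
              for _ in range(2000)]
boot_lo, boot_hi = np.percentile(boot_means, [2.5, 97.5])

# CLT 95% CI: a single closed-form expression, no resampling needed.
mean = outcomes.mean()
se = outcomes.std(ddof=1) / np.sqrt(outcomes.size)
clt_lo, clt_hi = mean - 1.96 * se, mean + 1.96 * se
```

The two intervals come out nearly identical, but the CLT version costs one pass over the data instead of thousands.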

We validated this approach by comparing confidence intervals computed via bootstrapping with those computed via the CLT, and confirmed that the results are in very close agreement (at a fraction of the compute cost and time!). See below:

On LMArena, every vote counts toward producing the leaderboard—but what happens when some models appear more often than others? Newly released models inevitably have fewer votes than models that have been available for a while, and once a model is deprecated it becomes impossible to collect more votes for it.

To counteract this imbalance and produce a leaderboard that is fair and equally representative of all models, we use an improved reweighting scheme: each battle is weighted inversely proportional to how frequently its model pairing appears.
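A minimal sketch of inverse-frequency reweighting, using a hypothetical battle log (the model names and data are made up for illustration):

```python
from collections import Counter

# Hypothetical battle log: each entry is an unordered model pair.
battles = [("model_a", "model_b")] * 8 + [("model_a", "model_c")] * 2

# Count how often each model pairing appears.
counts = Counter(frozenset(pair) for pair in battles)

# Weight each battle inversely proportional to its pairing's frequency,
# so every pairing contributes the same total weight in aggregate.
weights = [1.0 / counts[frozenset(pair)] for pair in battles]
```

With these weights, the over-represented (model_a, model_b) pairing and the rarer (model_a, model_c) pairing each contribute a total weight of 1 to the score computation.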

The CLT confidence intervals above take these weights into account. Reweighting increases the variance of Arena scores, and we observe wider confidence intervals as a result. This means that the new rankings will have more ties due to overlapping confidence intervals, especially when there are fewer votes per model, as in the vision arena.

July 17, 2025

New model announcements! Kimi K2 is on the Text leaderboard, Seededit 3 is on the Image Edit leaderboard, and Grok 4 is on the Vision leaderboard.

July 15, 2025

We're announcing four new models! Grok 4 is on the Text and WebDev leaderboards, Claude Opus 4 Thinking is on the Text leaderboard, Claude Sonnet 4 Thinking is on the Text leaderboard, and Seedream 3 is on the Text-to-Image leaderboard.

July 14, 2025

We made improvements to our data processing—in particular, we strengthened our deduplication and identity leak detection pipelines.

Deduplication aims to reduce the impact of over-represented or repetitive conversations using a hash-based approach. We count how many times each unique prompt appears. Prompts in the top 0.5% by frequency are considered high-frequency. For these high-frequency prompts, we keep only a limited number of samples and discard the rest. Deduplication filters out around 10% of all submitted votes.
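The hash-count-cap approach can be sketched as follows. The cap of five samples, the threshold logic, and the toy prompt set are all assumptions for illustration, not the production values.

```python
import hashlib
from collections import Counter, defaultdict

MAX_KEPT = 5  # hypothetical cap on samples kept per high-frequency prompt

# Toy dataset: one heavily repeated prompt plus many unique ones.
prompts = ["what is 2+2?"] * 100 + [f"unique prompt {i}" for i in range(100)]

def key(p: str) -> str:
    # Hash-based identity for each prompt.
    return hashlib.sha256(p.encode()).hexdigest()

counts = Counter(key(p) for p in prompts)

# Frequency threshold marking the top 0.5% of unique prompts (illustrative).
threshold = sorted(counts.values(), reverse=True)[max(1, int(len(counts) * 0.005)) - 1]

kept, seen = [], defaultdict(int)
for p in prompts:
    k = key(p)
    if counts[k] >= threshold and seen[k] >= MAX_KEPT:
        continue  # discard extra samples of a high-frequency prompt
    seen[k] += 1
    kept.append(p)
```

Here the repeated prompt is capped at five samples while every unique prompt survives intact.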

Identity leak detection filters out user prompts whose intent is to reveal model information. We first use an LLM classifier to label conversations as identity_leak if they include user prompts that directly attempt to extract or expose model details (e.g., "What is your name?"). We filter out conversations labeled as identity_leak, as well as associated conversations. Less than 4% of all votes are labeled as identity_leak.
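Once the LLM classifier has labeled each conversation, the filtering step itself is straightforward. The sketch below assumes "associated conversations" means other conversations from the same battle; the record schema and field names are hypothetical.

```python
# Hypothetical conversation records, already labeled by an LLM classifier.
conversations = [
    {"id": 1, "battle": "b1", "label": "ok"},
    {"id": 2, "battle": "b1", "label": "identity_leak"},  # tries to extract model identity
    {"id": 3, "battle": "b2", "label": "ok"},
]

# Drop flagged conversations and any conversation from the same battle.
leaked_battles = {c["battle"] for c in conversations if c["label"] == "identity_leak"}
kept = [c for c in conversations if c["battle"] not in leaked_battles]
```

Removing the whole battle, not just the flagged side, prevents a vote influenced by a revealed model identity from counting at all.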

We're excited to continue iterating and improving our data processing pipeline!