Leaderboard Changelog

This page documents notable updates to our leaderboard—new models, new arenas, updates to the methodology, and more. Stay tuned!
For model deprecations, check the public updates on GitHub.
September 8, 2025
New model announcements:
Qwen3-max-preview and Kimi-K2-0905-preview have been added to the Text Leaderboard.
We also enabled filtering of mistaken image generation and image edit requests in the text arena.
September 2, 2025
Due to the increase in image generation traffic brought by nano-banana, we noticed prompts in our vision arena data that asked for image generation but did not have image output enabled. We've implemented an LLM-based rule to filter these rows out of the vision leaderboard calculation.
August 29, 2025
New model announcements:
Diffbot-small-xl has been added to the Search Leaderboard.
Qwen-3-Image-Prompt-Extend has been added to the Text-to-Image Leaderboard.
The following have been added to the Text Leaderboard:
- DeepSeek V3.1 (thinking and non-thinking)
- Hunyuan-t1-20250711
August 28, 2025
New model announcement: MAI-1-preview has been added to the Text Leaderboard.
August 26, 2025
New model announcement: Gemini-2.5-Flash-Image-Preview ("nano-banana") has been added to the Text-to-Image and Image Edit leaderboards.
GPT-5 and Claude Opus 4.1 have been added to the Search Leaderboard.
August 22, 2025
New model announcements:
The following have been added to the Text and Vision leaderboards:
Lucid Origin has been added to the Text-to-Image leaderboard.
Ray 2 has been added to the Text-to-Video and Image-to-Video leaderboards.
Runway Gen 4 Turbo has been added to the Image-to-Video leaderboard.
August 20, 2025
New model announcement: Qwen-Image-Edit has been added to the Image Edit leaderboard.
August 18, 2025
New model announcement: Claude Opus 4.1 Thinking has been added to the Text and WebDev Arena leaderboards. Sora has been added to the Text-to-Image leaderboard.
August 15, 2025
New model announcement: three additional gpt-5 models are on the Text Leaderboard. These three reasoning models were configured with the highest reasoning setting.
August 13, 2025
New model announcement: gpt-oss-120b and gpt-oss-20b have been added to the Text and WebDev leaderboards. Hailuo 2 Pro versions have been added to the Text-to-Video and Image-to-Video leaderboards.
August 11, 2025
New model announcement: Claude Opus 4.1 is on the Text and WebDev leaderboards.
August 7, 2025
New model announcement: GPT-5 is on the Text, WebDev, and Vision leaderboards.
August 6, 2025
Big update: three new leaderboards!
Check out the Search, Text-to-Video, and Image-to-Video leaderboards.
Since the video arenas are used through our Discord server, we made a few considerations for handling the votes. Currently, the model identities are revealed after two votes are cast on a generation. For fairness, we only use the votes cast before the model names are revealed when constructing the leaderboard.
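As a rough illustration of the vote handling described above, here is a minimal sketch. The `Vote` record, its field names, and the assumption that the reveal happens immediately after the second vote are all hypothetical and not taken from our pipeline. Restricting the author-only category to anonymous votes is also an assumption.

```python
from dataclasses import dataclass

# Assumption: identities are revealed right after this many votes on a generation.
REVEAL_AFTER = 2


@dataclass
class Vote:
    generation_id: str  # hypothetical id of the pair of generations voted on
    voter_id: str       # who cast the vote
    author_id: str      # who wrote the prompt
    timestamp: float    # when the vote was cast
    winner: str         # "left" or "right"


def anonymous_votes(votes):
    """Keep only votes cast before the model names were revealed."""
    seen = {}
    kept = []
    for v in sorted(votes, key=lambda v: v.timestamp):
        n = seen.get(v.generation_id, 0)
        if n < REVEAL_AFTER:
            kept.append(v)
        seen[v.generation_id] = n + 1
    return kept


def author_votes(votes):
    """Anonymous votes cast by the prompt's own author."""
    return [v for v in anonymous_votes(votes) if v.voter_id == v.author_id]


sample = [
    Vote("g1", "u1", "u1", 1.0, "left"),
    Vote("g1", "u2", "u1", 2.0, "right"),
    Vote("g1", "u3", "u1", 3.0, "left"),   # cast after the reveal, dropped
    Vote("g2", "u4", "u4", 1.5, "right"),
]
overall = anonymous_votes(sample)
by_author = author_votes(sample)
```

Here `overall` would feed the main leaderboard and `by_author` the author-only category.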
The video arenas are also the first arenas where multiple votes can be cast on the same pair of generations, so unlike the other arenas, some votes are cast by people other than the author of the prompt. The overall leaderboard is computed using all anonymous votes, and we've created a new category which uses only the votes cast by the prompt's author.
August 5, 2025
We have updated the "total votes" counts to include battles involving models not included on the leaderboard (for example, due to being deprecated). The battles between these models and models present on the leaderboard are informative of model strengths, even if the former are not shown, and thus help reduce the variance of the scores. The leaderboard computation is not changing; you will only see a change in the vote counts.
August 4, 2025
New model announcements: GLM-4.5 and GLM-4.5 Air are now on the Text leaderboard.
August 1, 2025
New model announcement: Qwen3-235b-a22b-instruct-2507 is now on the Text leaderboard.
July 28, 2025
New model announcements: Qwen3-Coder and Kimi K2 are now on the WebDev leaderboard.
July 25, 2025
New model announcements! Imagen 4 Generate Preview 06-06 v2 and Imagen 4 Ultra Generate Preview 06-06 v2 are now on the Text-to-Image leaderboard.
July 23, 2025
We made improvements to the methodology behind Arena scores!
Our leaderboard uses confidence intervals to represent the uncertainty and variability inherent in estimating scores based on human voting. Up until now, our confidence intervals have been computed via bootstrapping, a process where we resample the dataset many times, calculate scores on each, and then look at the distribution of the scores over all the runs. While statistically sound, this is computationally intensive, especially with a large number of battles. We’ve recently moved to a new method based on the Central Limit Theorem (CLT) for M-estimators, which allows us to compute confidence intervals via a closed-form equation.
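To make the contrast between the two methods concrete, here is a minimal sketch on synthetic data. It assumes a Bradley-Terry style logistic model with model 0 as the reference, which this page does not specify, and every function name and parameter below is illustrative rather than our production code. The CLT version uses the standard sandwich covariance for M-estimators, A⁻¹ B A⁻¹ / n, where A is the average Hessian of the loss and B is the average outer product of per-battle gradients.

```python
import numpy as np


def design(a, b, k):
    """One row per battle: +1 for model a, -1 for model b; model 0 is the reference."""
    X = np.zeros((len(a), k))
    rows = np.arange(len(a))
    X[rows, a] = 1.0
    X[rows, b] = -1.0
    return X[:, 1:]


def fit_bt(X, y, iters=25):
    """Fit the logistic model with plain Newton steps."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (p - y)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        theta -= np.linalg.solve(hess, grad)
    return theta


def clt_se(X, y, theta):
    """Closed-form standard errors from the sandwich covariance."""
    n = len(y)
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    A = (X * (p * (1 - p))[:, None]).T @ X / n   # average Hessian
    B = (X * ((p - y) ** 2)[:, None]).T @ X / n  # average gradient outer product
    A_inv = np.linalg.inv(A)
    return np.sqrt(np.diag(A_inv @ B @ A_inv / n))


def bootstrap_se(X, y, rounds, rng):
    """Standard errors from refitting on resampled battles."""
    n = len(y)
    fits = [fit_bt(X[idx], y[idx])
            for idx in (rng.integers(0, n, size=n) for _ in range(rounds))]
    return np.std(fits, axis=0, ddof=1)


# Synthetic battles between 4 hypothetical models.
rng = np.random.default_rng(0)
k, n = 4, 2000
true_strength = np.array([0.0, 0.3, -0.2, 0.5])
a = rng.integers(0, k, size=n)
b = (a + rng.integers(1, k, size=n)) % k  # guarantees b != a
p_win = 1.0 / (1.0 + np.exp(-(true_strength[a] - true_strength[b])))
y = (rng.random(n) < p_win).astype(float)

X = design(a, b, k)
theta_hat = fit_bt(X, y)
se_clt = clt_se(X, y, theta_hat)           # one pass over the data
se_boot = bootstrap_se(X, y, 200, rng)     # 200 full refits
ci_low, ci_high = theta_hat - 1.96 * se_clt, theta_hat + 1.96 * se_clt
```

The bootstrap needs one full refit per round, while the CLT standard errors come from a single pass over the battles after one fit, which is where the compute savings come from.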
We validated this approach by comparing the confidence intervals computed via bootstrapping with those computed using the CLT, and confirmed that the results are in very close parity (with a fraction of the compute cost and time!). See below:

On LMArena, every vote counts towards producing the leaderboard, but what happens when some models appear more than others? When new models are released, they inevitably have fewer votes than those which have been in use for a while, and when models are deprecated it becomes impossible to collect more votes for them.
To counteract this imbalance and produce a leaderboard that is fair and equally representative of all models, we use an improved reweighting scheme that reweights battles inversely proportionally to how frequently they appear.
The CLT confidence intervals above take these weights into account. Reweighting increases the variance of Arena scores, and we observe wider confidence intervals as a result. This means that the new rankings will have more ties due to overlapping confidence intervals, especially when there are fewer votes per model, as in the vision arena.
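One simple way to picture inverse-frequency reweighting is sketched below. It assumes frequency is counted per unordered model pair and that weights are rescaled to sum to the number of battles. Both choices are assumptions for illustration, and the exact scheme we use may differ. Kish's effective sample size is included to show why uneven weights widen confidence intervals.

```python
from collections import Counter


def pair_key(a, b):
    """Treat (a, b) and (b, a) as the same matchup."""
    return tuple(sorted((a, b)))


def inverse_frequency_weights(battles):
    """Weight each battle by 1 / (number of battles for its model pair)."""
    counts = Counter(pair_key(a, b) for a, b in battles)
    raw = [1.0 / counts[pair_key(a, b)] for a, b in battles]
    scale = len(battles) / sum(raw)  # rescale so the weights sum to len(battles)
    return [w * scale for w in raw]


def effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


# Three battles for an established pair, one for a pair with a new model.
battles = [("A", "B"), ("B", "A"), ("A", "B"), ("A", "C")]
weights = inverse_frequency_weights(battles)
ess = effective_sample_size(weights)
```

In this toy example each pair ends up with the same total weight, and the four battles behave like roughly three equally weighted ones, which is the variance cost of rebalancing.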
July 17, 2025
New model announcements! Kimi K2 is on the Text leaderboard, Seededit 3 is on the Image Edit leaderboard, and Grok 4 is on the Vision leaderboard.
July 15, 2025
We're announcing four new models! Grok 4 is on the Text and WebDev leaderboards, Claude Opus 4 Thinking is on the Text leaderboard, Claude Sonnet 4 Thinking is on the Text leaderboard, and Seedream 3 is on the Text-to-Image leaderboard.
July 14, 2025
We made improvements to our data processing—in particular, we strengthened our deduplication and identity leak detection pipelines.
Deduplication aims to reduce the impact of over-represented or repetitive conversations using a hash-based approach. We count how many times each unique prompt appears. Prompts in the top 0.5% by frequency are considered high-frequency. For these high-frequency prompts, we keep only a limited number of samples and discard the rest. Deduplication filters out around 10% of all submitted votes.
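The deduplication steps above can be sketched roughly as follows. The SHA-256 hash, the case and whitespace normalization, the `max_keep` cap of 3, and treating at least one prompt as high-frequency on small datasets are all assumptions made for this example, not details of our pipeline.

```python
import hashlib
from collections import Counter


def prompt_hash(prompt):
    """Hash a prompt after normalizing case and whitespace (an assumed normalization)."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(prompts, top_fraction=0.005, max_keep=3):
    """Return indices of prompts to keep, capping high-frequency prompts at max_keep."""
    if not prompts:
        return []
    hashes = [prompt_hash(p) for p in prompts]
    counts = Counter(hashes)

    # Rank unique prompts by count; the top fraction is treated as high-frequency.
    ranked = sorted(counts.values(), reverse=True)
    n_high = max(1, int(len(ranked) * top_fraction))
    threshold = ranked[n_high - 1]

    kept = []
    seen = Counter()
    for i, h in enumerate(hashes):
        if counts[h] >= threshold:
            seen[h] += 1
            if seen[h] > max_keep:
                continue  # discard extra copies of a high-frequency prompt
        kept.append(i)
    return kept


prompts = ["hi"] * 10 + ["Hi "] * 2 + [f"question {i}" for i in range(20)]
kept = deduplicate(prompts)
```

Prompts that appear only a few times pass through unchanged, since the cap only bites once a prompt exceeds `max_keep` copies.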
Identity leak detection filters out user prompts whose intent is to reveal model information. We first use an LLM classifier to label conversations as identity_leak if they include user prompts that directly attempt to extract or expose model details (e.g., "What is your name?"). We filter out conversations labeled as identity_leak, as well as associated conversations. Less than 4% of all votes are labeled as identity_leak.
We're excited to continue iterating and improving our data processing pipeline!