A Deep Dive into Recent Arena Data

Today, we're excited to release a new dataset of recent battles from LMArena! The dataset contains 140k conversations from the text arena. Download or view the data here! For a more detailed overview of the data, please refer to this dedicated section.

We take this opportunity to dive deeper into the characteristics of our data, with special attention to its dynamic changes over time.

In this blog, we will cover:

  • Data Overview: the distribution of categories, languages, and topics in recent data
  • Score Changes: how model scores have shifted over time, overall and by category
  • Evaluation Order: a new feature of the new UI and its effect on scores
  • Style Coefficients: style effects over time and by category
  • Released Dataset Overview: the fields in the released data and how to download it

The analysis is based on data collected between April 17th and July 25th inclusive. To illustrate the changes in trends over time, we study differences between the data collected from the new and improved UI, officially launched on May 27th, and the data from the old UI.

Data Overview

We explore the distribution of categories, languages, and topics in our data, focusing both on their shifts over time and their recent distribution.

In this section, for statistics calculated using data collected from the new UI, only votes with evaluation order equal to one are included. Evaluation order is a new feature; we define the nth evaluation order as the nth vote cast in a single continuous conversation (an evaluation session). The underlying assumption is that all votes within the same evaluation session generally come from a single user engaging in a continued conversation about the same topic. To avoid over-representing individual sessions, each evaluation session contributes only one data point to the analyses in this section.

For a more detailed analysis of evaluation order, go to the section here.
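As a concrete illustration, the snippet below sketches this filtering against the released dataset. The field names (evaluation_order, evaluation_session_id) follow the dataset overview at the end of this post; the train split name and the pandas-based approach are assumptions for illustration, not our internal pipeline.

from datasets import load_dataset

# Load the released battles (split name assumed to be "train") and keep one
# vote per evaluation session: the first vote, i.e. evaluation_order == 1.
df = load_dataset("lmarena-ai/arena-human-preference-140k", split="train").to_pandas()
first_votes = df[df["evaluation_order"] == 1]
# Guard against sessions with multiple first-order rows: keep a single row each.
first_votes = first_votes.drop_duplicates(subset="evaluation_session_id")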

Category Distribution

In Table 1, we show the recent distribution of prompt categories. Over a third of submitted prompts are hard prompts.

Category Percentage
Hard Prompt 37.33%
Instruction Following 21.01%
Math 6.94%
Creative Writing 11.65%
Coding 25.78%
Long Query 13.46%
Multi-turn 17.76%

Table 1. Distribution of prompts across categories, on data collected from the new UI only. Categories are not mutually exclusive (a prompt can carry multiple tags), so percentages do not sum to 100%.

In Figure 1 we look at how the category distribution has changed over time. We see a substantial increase in hard prompts and coding, and a decrease in instruction following, math, and creative writing. The decrease in instruction following is explained by a more stringent classifier for labeling prompts as instruction following (details below).

Categories Slope Graph

Figure 1. Comparison of category distribution before vs. after 2025-04-17.

In Figure 2 we provide a more fine-grained analysis: we plot the seven-day rolling average of the category distribution starting from the beginning of 2024.

Categories Distribution Shift Over Time

Figure 2. Change in category distribution over time.

Why does instruction following drop so much? This category distribution shift is largely explained by a change in the classification model (the LLM used to assign categories to individual prompts) from llama-v3-70b-instruct to deepseek-v3-0324.

To select the classification model, we run label_bench, evaluating each candidate's labeling accuracy, precision, and recall on a small dataset against gold labels obtained via four large LLMs: claude-3-7-sonnet, deepseek-r1, gemini-2.0-pro, and gpt-4.5. We define gold labels as True for a particular category if at least two models vote True. More details about the label_bench method can be found here.
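For concreteness, the gold-label rule can be written as a simple majority vote over the four reference models. This is a hypothetical helper for illustration, not the label_bench code itself.

# Gold label for one prompt and one category: True if at least two of the
# four reference LLMs (claude-3-7-sonnet, deepseek-r1, gemini-2.0-pro,
# gpt-4.5) vote True. Sketch only; not the actual label_bench implementation.
def gold_label(reference_votes):
    return sum(bool(v) for v in reference_votes) >= 2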

Results for the candidate models are shown below.

Category Model Accuracy Precision Recall
IF llama-v3-70b-instruct 0.856 0.566 0.886
IF deepseek-v3-0324 0.929 0.886 0.698
Hard llama-v3-70b-instruct 0.730 0.772 0.487
Hard deepseek-v3-0324 0.770 0.791 0.599
Math llama-v3-70b-instruct 0.962 0.798 0.896
Math deepseek-v3-0324 0.968 0.832 0.898
Creative Writing llama-v3-70b-instruct 0.962 0.838 0.819
Creative Writing deepseek-v3-0324 0.970 0.876 0.856

Table 2. Comparison of the old and new classifier models across four categories, showing accuracy, precision, and recall.

As shown in Table 2, deepseek-v3-0324 exhibits significantly higher precision and lower recall in the instruction following (IF) category, strongly suggesting that it is more selective in its labeling, resulting in fewer prompts being classified as IF.

Language Distribution

Recent data from the new UI comes from over 100 languages, with slightly more than half being English. We show the comparison in Figure 3.

Language Distribution Bar Graph

Figure 3. Language distribution comparison between the old and new UI for the top 15 languages on the new UI, ordered by their percentage.

In Figure 4, we show the seven-day rolling average of the language distribution starting from the beginning of 2024.

Language Distribution Over time

Figure 4. Language distribution over time.

Topic Distribution

Using our Arena Explorer pipeline, we build an interactive topic distribution graph below.

Score Changes

To take a deeper look into user preference shifts, we explore model score changes over time.

Recall that Arena scores are invariant to shifts: the relative strengths of models are captured by the differences between their Arena scores, while the absolute value of the scores is determined by anchoring. Anchoring is a method that fixes a particular model to have a particular score; all other scores are derived relative to that model.

The graphs below anchor mixtral-8x7b-instruct-v0.1 at 1114.0: each time the leaderboard is calculated, mixtral's score always stays at 1114. This fixes the scale of the leaderboard across time periods, making the scores comparable.
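A minimal sketch of this anchoring step (illustrative only, not the leaderboard code) looks like this:

# Bradley-Terry-style scores are identified only up to a constant shift, so we
# translate all scores such that the anchor model lands exactly on 1114.0.
ANCHOR_MODEL = "mixtral-8x7b-instruct-v0.1"
ANCHOR_SCORE = 1114.0

def anchor(scores):
    shift = ANCHOR_SCORE - scores[ANCHOR_MODEL]
    return {model: s + shift for model, s in scores.items()}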

In Figure 5, we show Arena score changes over the full distribution of our prompts.

model_ratings_over_time

Figure 5. Model ratings over time.

Below, we show score shifts for selected categories: coding, creative writing, and hard prompts.

Coding
coding_model_ratings_over_time

Figure 6. Coding ratings over time.

Creative Writing
creative_writing_model_ratings_over_time

Figure 7. Creative writing ratings over time.

Hard Prompts
hard_model_ratings_over_time

Figure 8. Hard prompt ratings over time.

For score shift graphs for other categories, please refer to the further analysis section.

In addition to shifts in user preference, several other factors also influence score changes.

In the next section, we dive into the main difference between the old and new UI from a data perspective: Evaluation Order, a new feature of the new UI and a main contributor to the score changes.

For models that show a strong sensitivity to increased evaluation order, we conduct model-specific analysis here. We find that models such as gemini-2.5-flash and gemini-2.5-pro tend to perform better when the evaluation context is restricted to model outputs from the same family.

Evaluation Order

The new UI supports follow-up questions after voting. After a vote is cast and model identities are revealed, a new pair of models will be sampled. To ensure fairness, the winning response from the previous round is passed as context to both models in the next round. If the vote resulted in a tie or a both_bad outcome, then each newly-sampled model instead receives its respective side’s previous response as context.
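The context-passing rule can be summarized in a small sketch (illustrative logic with assumed names, not the serving code):

# After a vote, a new model pair is sampled. If one side won, its response is
# passed as shared context to both new models; on a tie or both_bad, each new
# model receives its own side's previous response instead.
def next_round_context(winner, response_a, response_b):
    if winner == "model_a":
        return response_a, response_a
    if winner == "model_b":
        return response_b, response_b
    # tie or both_bad
    return response_a, response_b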

We formally define the nth evaluation order as the nth vote-casting event within a single evaluation session.

Every vote within an evaluation session, at each evaluation order, is taken into account when leaderboard scores are calculated.

The maximum evaluation order in our data is 118. Around 70% of votes have evaluation order equal to one. See Figure 9.

Evaluation Order Histogram

Figure 9. Histogram of evaluation order distribution.
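With the released data, the evaluation-order distribution can be reproduced roughly as follows (reusing the df loaded in the earlier filtering sketch):

# Share of votes at each evaluation order; order 1 should account for
# roughly 70% of votes, as in Figure 9.
order_share = df["evaluation_order"].value_counts(normalize=True).sort_index()
print(order_share.head(10))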

To visualize the effect of evaluation order, in Figure 10 we compare the scores from data where evaluation order is one only with the scores that use all evaluation orders.

Style Controlled Overall Score Old vs. New UI

Figure 10. Comparison of style-controlled overall model scores between evaluation order = 1 only and all evaluation orders.

We observe that including all evaluation orders decreases the scores of all Gemini models. o4-mini-2025-04-16, gpt-4.1-2025-04-14, and qwen3-235b-a22b also experience notable decreases, while claude-3-7-sonnet-20250219 and llama-4-maverick-17b-128e-instruct exhibit significant increases.

Model-Specific Analysis

A natural question arising from Figure 10 is: why and how are some models' scores affected by evaluation order?

Win Rate Difference by Evaluation Order

Figure 11. Comparison of change in win rate across evaluation orders and different models.

A Deeper Look into Gemini Models.

All Gemini models experience a notable drop in performance, with fewer wins and more losses, as evaluation order increases beyond one.

For a deeper look into the Gemini models, we divide the data into three groups: evaluation order equal to one, two, and greater than two. We compare the performance of Gemini models under two context conditions: all-context (where the previous context can come from any model) and restricted-context.

In the restricted-context setting, we only include battles where the last context (i.e., the prior winning side) also came from a Gemini model. This ensures that for votes with an evaluation order of two, the context Gemini models received came exclusively from Gemini models. However, votes with evaluation order greater than two may include earlier context turns involving non-Gemini models.
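One hedged way to reconstruct this restricted-context split from the released data is to join each vote to the previous vote in the same session and read off which model supplied the incoming context. Column names (model_a, model_b, winner) follow the dataset overview; this reuses the df loaded in the earlier filtering sketch and is illustrative only.

# Tag each vote with the model whose response was carried in as context,
# i.e. the winning model of the previous evaluation order in the same session.
prev = df[["evaluation_session_id", "evaluation_order", "winner",
           "model_a", "model_b"]].copy()
prev["evaluation_order"] += 1  # align each previous vote with the vote that follows it
prev["context_model"] = prev.apply(
    lambda r: r["model_a"] if r["winner"] == "model_a"
    else r["model_b"] if r["winner"] == "model_b"
    else None,  # tie / both_bad: no single winning side
    axis=1,
)
merged = df.merge(
    prev[["evaluation_session_id", "evaluation_order", "context_model"]],
    on=["evaluation_session_id", "evaluation_order"], how="left",
)
# Restricted-context battles for the Gemini analysis: incoming context came from
# a Gemini model (exact for evaluation order two; earlier turns at higher orders
# may still involve other model families).
gemini_context = merged[merged["context_model"].str.contains("gemini", na=False)]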

Gemini Pro Battle Outcome by Evaluation Order

Figure 12. Comparison of change in battle outcome for gemini-2.5-pro under all-context vs. restricted Gemini-only context conditions.

Gemini Flash Battle Outcome by Evaluation Order

Figure 13. Comparison of change in battle outcome for gemini-2.5-flash under all-context vs. restricted Gemini-only context conditions.

Notably, when the context is restricted to Gemini-only, gemini-2.5-pro shows a much smaller decline going from evaluation order one to two than under all-context, and gemini-2.5-flash shows an even smaller decline. Overall, both models perform better at higher evaluation orders when the incoming context comes primarily from other Gemini models.

A Deeper Look into GPT Models.

While o4-mini and o3 both experience notable declines when evaluation order increases beyond one, chatgpt-4o-latest's performance barely changes. Following the same approach used in the Gemini analysis, we divide the data into three groups (evaluation order equal to one, two, and greater than two) and compare the performance shifts for these models under both all-context (where prior context can come from any model) and restricted-context (where the previous winning model is limited to the same model family). We observe mixed results even with restricted context: the win rates sometimes increase and sometimes decrease as evaluation order increases.

ChatGPT-4o Battle Outcome by Evaluation Order

Figure 14. Comparison of change in battle outcome for chatgpt-4o-latest under all-context vs. restricted GPT-only context conditions.

o3 Battle Outcome by Evaluation Order

Figure 15. Comparison of change in battle outcome for o3 under all-context vs. restricted GPT-only context conditions.

o4-mini Battle Outcome by Evaluation Order

Figure 16. Comparison of change in battle outcome for o4-mini under all-context vs. restricted GPT-only context conditions.

A Deeper Look into Claude-3-7-Sonnet.

Claude-3-7-sonnet experiences a notable increase in performance as evaluation order increases beyond one, unlike most other models. We perform the same analysis here as for the previous model families.

Claude 3.7 Sonnet Battle Outcome by Evaluation Order

Figure 17. Comparison of change in battle outcome for claude-3-7-sonnet under all-context vs. restricted Claude-only context conditions.

We observe that the win rate of Claude-3-7-sonnet is higher for higher evaluation orders than for evaluation order equal to one, suggesting strong performance in longer conversations.

Style Coefficients

We take a look at coefficients quantifying the importance of style on recent data. We observe a fairly stable distribution over time; see Figure 18.

style_coefficients_over_time

Figure 18. Style coefficients over time.

We also look at a categorical breakdown of style coefficients in Figure 19.

Style Coefficient New UI

Figure 19. Categorical breakdown of style coefficients on the new UI.

Further Analysis

Categorical Score Shifts

if_model_ratings_over_time

Figure 20. Instruction following ratings over time.

math_model_ratings_over_time

Figure 21. Math ratings over time.

multiturn_model_ratings_over_time

Figure 22. Multi-turn ratings over time.

long_query_model_ratings_over_time

Figure 23. Long query ratings over time.

Released Dataset Overview

This dataset contains user votes collected in the text-only category. Each row represents a single vote judging two models (model_a and model_b) on a user conversation, along with the full conversation history and metadata. Key fields include:

  • id: Unique feedback ID of each vote/row.
  • evaluation_session_id: Unique ID of each evaluation session, which can contain multiple separate votes/evaluations.
  • evaluation_order: Evaluation order of the current vote.
  • winner: Battle result, one of model_a, model_b, tie, or both_bad.
  • conversation_a/conversation_b: Full conversation of the current evaluation order.
  • full_conversation: The entire conversation, including context prompts and answers from all previous evaluation orders. Note that new models are sampled after each vote, so the responding models vary across the full conversation.
  • conv_metadata: Aggregated markdown and token counts for style control.
  • category_tag: Annotation tags including the categories math, creative writing, hard prompts, and instruction following.
  • is_code: Whether the conversation involves code.

To download the dataset, run

# Requires the Hugging Face datasets library (pip install datasets)
from datasets import load_dataset
data = load_dataset("lmarena-ai/arena-human-preference-140k")
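To peek at a single battle and a few of the fields listed above (the train split name is an assumption):

row = data["train"][0]
print(row["winner"], row["evaluation_order"], row["category_tag"])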

We’re excited to share this dataset and hope it serves as a valuable resource for advancing research into real-world AI behavior and human preference.