Studying the Frontier: Arena Expert
Arena Expert is a great way to differentiate between frontier models. In November we launched Arena Expert, a leaderboard built from only the hardest prompts submitted by expert users.
In this analysis, we compare how models perform on 'general' vs 'expert' prompts, focusing on 'thinking' vs 'non-thinking' models. We use LMArena data from December 1, 2025 with style control applied, filtering to models with 1300+ Elo. This includes 139 models: 42 thinking and 97 non-thinking.
TL;DR
- Expert rankings differentiate the best models: Models that seem similar in the General rankings can rank very differently in the Expert rankings. "Expert Advantage" measures how much better (or worse) a model performs with experts vs general users (e.g. Opus 4.5 at +85 vs Grok 4.1 at -25)
- Thinking models have a clear advantage: Median advantage of +15 for thinking models vs -9 for non-thinking (a 24-point gap)
- Opus 4.5 dominates Expert rankings: The non-thinking version scores +85, the highest of any model
Claude Opus 4.5 (non-thinking) is a massive outlier at +85, outperforming even its own thinking version (+52). Sonnet 4.5 Thinking (+57) also does well. On the flip side, Grok 4.1 (-25), GPT-4o (-29 to -39), and ChatGPT-4o-latest (-18) all underperform.
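To make the metric concrete, here is a minimal sketch, assuming "Expert Advantage" is simply a model's Expert Arena score minus its General Arena score; that assumption is consistent with the two examples quoted in this post (the Opus 4.5 and Grok 4.1 scores from the head-to-head section below). The dict layout and model keys are illustrative, not LMArena's actual schema.

```python
# Minimal sketch of the "Expert Advantage" metric, assuming it is simply
# Expert Arena score minus General Arena score for the same model.
# Scores are the two examples quoted in this post; everything else is
# illustrative, not LMArena's actual data format.
general_scores = {"claude-opus-4.5": 1460, "grok-4.1": 1465}
expert_scores = {"claude-opus-4.5": 1545, "grok-4.1": 1440}

def expert_advantage(model: str) -> int:
    """Expert score minus General score for a single model."""
    return expert_scores[model] - general_scores[model]

for model in general_scores:
    print(f"{model}: {expert_advantage(model):+d}")
# claude-opus-4.5: +85
# grok-4.1: -25
```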
Expert Model Preference
Points above the line outperform with experts; points below underperform
Thinking models have a median advantage of +15, while non-thinking models sit at -9 - a gap of 24 points. But there's huge variance: Claude Opus 4.5 (non-thinking) is at +85, while o1-preview (thinking) is at -11. The thinking label alone doesn't guarantee expert preference.
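As a rough sketch of how those medians could be computed, assuming a per-model table with a thinking flag and a precomputed expert-advantage column; the four rows below are models mentioned in this post, not the full set of 139, so the toy gap will not match the +24 reported above. The point is only the shape of the computation.

```python
import pandas as pd

# Illustrative per-model table; the real analysis covers all 139 models.
models = pd.DataFrame({
    "model": ["claude-opus-4.5", "claude-sonnet-4.5-thinking",
              "grok-4.1", "o1-preview"],
    "is_thinking": [False, True, False, True],
    "expert_advantage": [85, 57, -25, -11],
})

# Median expert advantage for thinking vs non-thinking models.
medians = models.groupby("is_thinking")["expert_advantage"].median()
gap = medians.loc[True] - medians.loc[False]
print(medians)
print(f"thinking minus non-thinking gap: {gap:+.0f} points")
```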
The Thinking Model Effect
Distribution of expert advantage by model type
Anthropic leads at +22 with strong Opus and Sonnet 4.5 models. Alibaba follows at +14 with Qwen3 models. OpenAI is near neutral - GPT-5.x does well but older models drag down the average. xAI, Google, and DeepSeek all have negative averages.
Expert Advantage by Company
Average expert advantage across all models per company. (n) = number of models
Thinking models outperform non-thinking at most companies. The gap is biggest at Google (+16 vs -29) and xAI (+14 vs -23). Anthropic is strong across the board - both thinking (+35) and non-thinking (+16) are positive, thanks to Opus and Sonnet 4.5.
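A sketch of the company-level split, assuming the same kind of per-model table with a company column: a pivot over company and model type yields averages like the ones above. The rows here are seeded with the figures quoted in this paragraph for illustration; the real table would hold one row per model, not one per company/type pair.

```python
import pandas as pd

# Illustrative rows seeded with the company-level averages quoted above;
# the real table has one row per model.
rows = pd.DataFrame({
    "company": ["Anthropic", "Anthropic", "Google", "Google", "xAI", "xAI"],
    "is_thinking": [True, False, True, False, True, False],
    "expert_advantage": [35, 16, 16, -29, 14, -23],
})

# Average expert advantage per company, split into thinking vs non-thinking.
pivot = rows.pivot_table(
    index="company",
    columns="is_thinking",
    values="expert_advantage",
    aggfunc="mean",
)
print(pivot)
```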
Thinking vs Non-Thinking by Company
Companies with both model types. (n/n) = thinking/non-thinking models
Performance by Model Family
We grouped models by their version (e.g., Opus 4.5, Opus 4.5 Thinking, Sonnet 4.5, etc.). Numbers in parentheses indicate model variants per family.
- Anthropic: Newer model families perform best with experts - Opus 4.5 leads at +69, with Claude 3.5 and Claude 3 Opus being the only families below the line.
- OpenAI: Clear divide between new and old - GPT-5.x and o3/o4 models are positive, while GPT-4.5 (-17) and GPT-4o/Turbo significantly underperform.
- Google: Gemini 3 and 2.5 models outperform (+9 to +18), but the open-source Gemma models struggle significantly (-48) with expert prompts.
- xAI: Interestingly, older Grok 3 and Grok 4 families outperform the newer Grok 4.1 (-8).
- Alibaba: Qwen3 and QwQ reasoning models perform well (+20 to +25), with older Qwen versions near neutral.
- DeepSeek: V3.1/V3.2 are positive (+8), but older models including R1 are below the line.
(n) = number of models
Anthropic
OpenAI
xAI
Alibaba
DeepSeek
Head-to-Head Comparisons
On the General rankings, some models sit within a few points of each other, making them hard to differentiate. Expert rankings separate these models much more clearly, because experts hold them to higher standards. Hence models that seem similar in the General rankings can rank very differently in the Expert rankings (e.g. Opus 4.5 at 1460 vs Grok 4.1 at 1465 on General, but 1545 vs 1440 on Expert - a 105-point gap).
Anthropic: Sonnet 4.5 Thinking vs Non-Thinking
OpenAI: GPT-5.1-high vs ChatGPT-4o
Google: Gemini 3 Pro vs Gemini 2.5 Pro
xAI: Grok 4.1 Thinking vs Non-Thinking
Alibaba: Qwen3-235B Thinking vs Non-Thinking
Cross-Company: Opus 4.5 vs Grok 4.1
Company Deep Dives
Points above the trend line are gaining ground with experts. Each chart includes a visual legend showing what the colors and fill styles mean.
- Anthropic: Newer models consistently outperform with experts. Opus 4.5 leads at +85 (non-thinking) - interestingly higher than its thinking variant (+53). Earlier Claude 3.5 and 3 Opus models are the only negatives (-5 to -15).
- OpenAI: Sharp generational divide. GPT-5.x and o3/o4 reasoning models are positive (+14 to +35), while the GPT-4 family underperforms (-13 to -39). Even GPT-4.5 Preview sits at -17.
- Google: Gemini 3 Pro (+9) and the Gemini 2.5 models (+14 to +40) outperform with experts, but the open-source Gemma models struggle significantly (-33 to -70).
- xAI: Mixed results. Grok 3 Mini High (+37) and Grok 4 (+22) do well, but the later Grok 4.1 variant shows a decline in performance with expert prompts (-25).
- Alibaba: Strongest overall performance. Qwen3 models dominate, with the thinking variant at +64 leading all Chinese models. Older Qwen2.5 models are mixed, with some negative.
- DeepSeek: Tends to perform worse with experts overall. Only the V3.1 generation shows positive results (+12 to +14), while both R1 reasoning models are slightly negative (-1 to -6) and older V2.5/V3 models underperform (-14 to -19).