Search Arena & What We’re Learning About Human Preference
Search Arena on LMArena goes live today. Read on for what we've learned so far about human preference from search-augmented data.

Contributors:
Logan King
Mihran Miroyan
Patrick Wu
Search-augmented language models are quickly becoming the go-to interface for looking up information, making decisions, and navigating the web. But the real question remains: which of these systems genuinely meet the needs of users with complex, real-world questions?
To address this, we launched Search Arena, initially available on our legacy site in April 2025, and now fully live at lmarena.ai. Search Arena is a real-time evaluation platform that lets the community directly compare search-augmented models. Users vote on the responses they find most helpful, trustworthy, or insightful, bringing authentic human feedback into the heart of AI progress.
Access Search Arena with the Web/Globe icon in the chat box
Why We Built Search Arena
Traditional benchmarks like SimpleQA are built around a specific goal: did the model retrieve the correct answer to a factual question?
Q: "Who invented the lightbulb?"
A: "Thomas Edison" ✅
But this captures only a small part of how people actually use language models. Real users come with messier, more open-ended questions. They ask for comparisons between products, personalized recommendations, explanations of unfamiliar concepts, and help with tasks that span multiple steps.
These kinds of queries don’t have one right answer. They require reasoning, synthesis, and judgment. Factual accuracy still matters, but it’s no longer enough.
This is where human preference becomes essential. When answers are subjective, context-dependent, or nuanced, only real people can judge what’s actually helpful. That’s why we built Search Arena. It’s designed to surface the kinds of signals that traditional benchmarks miss, and to show which models are truly aligned with what people care about.
Who is in the Arena?
When Search Arena first launched, it featured just three model providers. To keep pace with the field, the Arena now hosts a broader set of models from the top players, each with its own strengths, quirks, and search strategies. We currently host 7 models across 5 model providers:
- xAI’s Grok 4
- Anthropic’s Claude Opus 4
- Perplexity’s Sonar Pro High & Sonar Reasoning Pro High
- OpenAI’s o3 & GPT-4o Search Preview
- Google DeepMind’s Gemini 2.5 Pro Grounding
Disclaimer & Attribution: The analyses and figures in this blog distill findings from the paper: “Search Arena: Analyzing Search‑Augmented LLMs”. Please consult the paper for full methodology, statistical details, and supplementary material.
Authors: Mihran Miroyan* · Tsung‑Han Wu* · Logan King · Tianle Li · Jiayi Pan · Xinyan Hu · Wei‑Lin Chiang · Anastasios N. Angelopoulos · Trevor Darrell · Narges Norouzi · Joseph E. Gonzalez
What are People Actually Asking?
To understand how people interact with search-augmented LLMs, we analyzed over 24,000 real-world prompts from more than 11,000 users across 136 countries.
Only about 1 in 5 queries were simple factual questions. The rest spanned a wide range of needs, from analytical breakdowns to creative outputs, product recommendations to real-time news lookups.
This insight led us to build a new classification framework, a taxonomy of user intent, tailored specifically to how people interact with search-enabled models. It includes nine distinct categories ("Other" excluded):

Search Arena queries also spanned more than 70 languages, with over 11% being multilingual. From English and Russian to Japanese and Persian, this dataset reflects a truly global, real-time snapshot of how people search with LLMs across numerous use cases.
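Labeling tens of thousands of queries with an intent category (or a language) has to be automated. The snippet below is a minimal sketch of one common approach, prompting an LLM to pick a single category per query. The category labels and the model choice here are placeholders for illustration, drawn from the examples above; they are not the paper's actual nine-category taxonomy or classification pipeline.

```python
# Illustrative sketch of LLM-based intent classification for Search Arena-style prompts.
# The category labels are placeholders, NOT the paper's exact nine-category taxonomy.
from openai import OpenAI

CATEGORIES = [
    "factual lookup", "analysis", "recommendation",
    "creative generation", "news lookup", "other",
]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def classify_intent(query: str) -> str:
    """Ask a model (hypothetical choice) to pick the single best-fitting category."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Classify the user's query into exactly one of: "
                + ", ".join(CATEGORIES)
                + ". Reply with the category name only.",
            },
            {"role": "user", "content": query},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()


print(classify_intent("Best budget noise-cancelling headphones for flights in 2025?"))
```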

What Makes a Model Win? 3 Patterns of Preference
It’s not just what people ask, but how they vote that reveals what actually matters.
We studied thousands of head-to-head model comparisons to understand what drives human preference. Three patterns stood out:
1. Detailed, Comprehensive Answers
Users consistently favor longer, more thorough responses, even when it means more to read. Verbosity is a strong signal of preference. The more complete the answer, the more confident users feel in its usefulness.

2. Frequent Citations
Simply having more citations makes a model more likely to win a vote.
Even if users don’t click the links, the presence of references makes responses feel grounded and credible.

3. Specific, Trusted Sources
Users prefer answers that cite concrete, recognizable domains: YouTube, Reddit, Substack, tech blogs. On the flip side, vague or generic sources like Wikipedia can hurt a model’s chances, especially for queries that demand timeliness or specificity.
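The paper runs a more careful, controlled analysis of these effects (see the paper for the full methodology), but the basic idea behind the three patterns above can be sketched with a simple regression: take each A-vs-B vote, compute feature differences between the two responses (for example length and citation count), and check which differences predict the winner. Everything below, including the file name and column names, is a hypothetical illustration rather than the paper's actual code.

```python
# Rough sketch: which response features correlate with winning a vote?
# Illustrative only -- the paper's analysis controls for model identity and
# other confounders; the file path and column names here are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

battles = pd.read_csv("search_arena_battles.csv")  # hypothetical export of A-vs-B votes
battles = battles[battles["winner"].isin(["model_a", "model_b"])]  # drop ties

# Feature differences between the two responses shown side by side.
X = pd.DataFrame({
    "len_diff": battles["len_a"] - battles["len_b"],                        # response length
    "cite_diff": battles["num_citations_a"] - battles["num_citations_b"],   # citation count
})
y = (battles["winner"] == "model_a").astype(int)  # 1 if the left response won

model = LogisticRegression().fit(X, y)
for name, coef in zip(X.columns, model.coef_[0]):
    print(f"{name}: {coef:+.3f}")  # positive coefficient => feature associated with winning
```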

What This Tells Us About Human-AI Interaction
Search Arena offers a clearer view into how people evaluate AI responses in practice. As more users cast their votes, we’re building a better understanding of what makes these systems genuinely useful.
Here’s what we’ve learned so far:
- Longer responses feel more thorough. Especially for complex topics, verbosity builds confidence.
- Trust is earned through transparency. Citations help people evaluate and believe in what they’re reading.
- People trust answers that come from familiar sources. Citing recognizable domains signals credibility and search reliability.
So, which model gives the best search-based answers today? That’s not something we decide. It’s something you help us discover. Jump into Search Arena, try side-by-side comparisons, and help shape the future of search-augmented AI. The new Search leaderboard will go live in the coming weeks.