Search Arena & What We’re Learning About Human Preference Search Arena on LMArena goes live today. Read more about what we've learned so far about human preference from the search-augmented data.
Hello from LMArena: The Community Platform for Exploring Frontier AI At LMArena, everything starts with the community. A lot of new members have joined us in the past few months, so we thought it would be a good time to reintroduce ourselves! Created by researchers from UC Berkeley’s SkyLab, LMArena is an open platform where everyone can
LMArena and The Future of AI Reliability About a month ago, we announced that LMArena was becoming a company to better support our growing community platform. As we take this next step, we're staying true to our original mission of rigorous, neutral, and community-driven evaluations. Today, we’re excited to share that we’ve raised
Celebrating Community Impact: 3M+ votes, 400+ models, and 300+ pre-release tests To date, the community has evaluated 400+ public models on LMArena and run 300+ pre-release tests. Tens of millions of battle pairings have been served to users across the world, and each vote has shaped real-world AI performance and development. Around this time two years ago, the community
Does Sentiment Matter Too? Introducing Sentiment Control: Disentangling Sentiment and Substance Contributors: Connor Chen, Wei-Lin Chiang, Tianle Li, Anastasios Angelopoulos. You may have noticed that recent models on Chatbot Arena appear more emotionally expressive than their predecessors. But does this added sentiment actually improve their rankings on the leaderboard? Our previous exploration revealed
How Many User Prompts are New? We investigate 355,575 LLM battles from May 2024 to Dec 2024 to answer the following questions: 1. What proportion of prompts have never been seen before (aka “fresh”)? 2. What are common duplicate prompts? 3. How many prompts appear in widely used benchmarks?
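As a rough illustration of what such a freshness measurement can look like (a minimal Python sketch; the normalization and exact-match rules are assumptions of mine, not the post's actual deduplication pipeline), the snippet below counts duplicates after light normalization and reports the fresh fraction:

```python
from collections import Counter

def normalize(prompt: str) -> str:
    # Lowercase and collapse whitespace so trivial variants count as duplicates.
    return " ".join(prompt.lower().split())

def freshness_report(prompts: list[str], top_k: int = 10):
    counts = Counter(normalize(p) for p in prompts)
    # Treat a prompt as "fresh" if its normalized text has not been seen before,
    # so the fresh fraction is simply distinct prompts over total prompts.
    fresh_fraction = len(counts) / len(prompts)
    duplicates = [(p, c) for p, c in counts.most_common(top_k) if c > 1]
    return fresh_fraction, duplicates

battles = ["Hello", "hello ", "Write a haiku about the ocean", "What is 2+2?"]
fraction, dupes = freshness_report(battles)
print(f"fresh: {fraction:.0%}")  # fresh: 75%
print(dupes)                     # [('hello', 2)]
```

In practice, near-duplicate detection (for example, embedding similarity) would also catch paraphrased prompts that exact matching misses.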
LMArena is Growing to Support our Community Platform LMArena started as a scrappy academic project from UC Berkeley: just a handful of PhD students and undergrads working day and night on a research prototype. Today, we have two announcements: 1. We are starting a company to support LMArena! LMArena will stay neutral, open, and accessible to everyone. We
Introducing the Search Arena: Evaluating Search-Enabled AI Authors: Mihran Miroyan*, Tsung-Han Wu*, Logan King, Tianle Li, Anastasios N. Angelopoulos, Wei-Lin Chiang, Narges Norouzi, Joseph E. Gonzalez. TL;DR 1. We introduce Search Arena, a crowdsourced in-the-wild evaluation platform for search-augmented LLM systems based on human preference. Unlike LM-Arena or SimpleQA, our data focuses on current events and
LMArena Community Updates: Looking Ahead Today, we’re excited to begin sharing community updates in our blog as we continue to make progress towards long-term growth.
WebDev Arena: A Live LLM Leaderboard for Web App Development WebDev Arena allows users to test LLMs in a real-world coding task: building interactive web applications.
RepoChat Arena RepoChat lets models automatically retrieve relevant files from a given GitHub repository. It can resolve issues, review PRs, implement code, and answer higher-level questions about the repository, all without requiring users to provide extensive context.
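This page doesn't explain how the retrieval step works; as a purely illustrative sketch (keyword-overlap scoring is a stand-in of my own, not RepoChat's actual retriever), picking candidate files from a checked-out repository could look like this:

```python
import os
import re
from collections import Counter

def retrieve_relevant_files(repo_root: str, question: str, top_k: int = 5) -> list[str]:
    # Score every file by how often the question's terms appear in it,
    # then return the highest-scoring paths.
    terms = set(re.findall(r"\w+", question.lower()))
    scores = {}
    for dirpath, _, filenames in os.walk(repo_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    text = f.read().lower()
            except OSError:
                continue
            scores[path] = sum(text.count(t) for t in terms)
    return [p for p, s in Counter(scores).most_common(top_k) if s > 0]

# The retrieved files would then be placed in the model's context before it
# answers the user's question about the repository.
```

A production system would more likely rely on embeddings or repository structure, but the overall shape of the pipeline (retrieve relevant files, then answer) is the same.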
Arena Explorer We developed a topic modeling pipeline and the Arena Explorer. This pipeline organizes user prompts into distinct topics, structuring the text data hierarchically to enable intuitive analysis. We believe this tool for hierarchical topic modeling can be valuable to anyone analyzing complex text data.
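The pipeline's internals aren't detailed on this page, but as a hedged sketch of hierarchical topic modeling over user prompts (TF-IDF plus two levels of k-means is my own stand-in, not necessarily what Arena Explorer uses), the idea looks roughly like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def hierarchical_topics(prompts: list[str], n_top: int = 3, n_sub: int = 2) -> dict:
    # Group prompts into top-level topics, then split each topic into subtopics.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(prompts)

    top_labels = KMeans(n_clusters=n_top, n_init=10, random_state=0).fit_predict(X)

    hierarchy = {}
    for topic in range(n_top):
        idx = [i for i, t in enumerate(top_labels) if t == topic]
        members = [prompts[i] for i in idx]
        if len(members) > n_sub:  # only sub-cluster topics with enough prompts
            sub_labels = KMeans(n_clusters=n_sub, n_init=10, random_state=0).fit_predict(X[idx])
            hierarchy[topic] = {s: [m for m, l in zip(members, sub_labels) if l == s]
                                for s in range(n_sub)}
        else:
            hierarchy[topic] = {0: members}
    return hierarchy
```

Each top-level cluster plays the role of a broad topic and each sub-cluster a narrower theme; real pipelines often add an LLM step to label clusters with human-readable names.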
Code Editing in Copilot Arena Copilot Arena supports not only paired code completions but also paired code edits. Unlike code completions, which appear automatically after short pauses, code edits are triggered manually by highlighting a code snippet and then writing a short task description.
Catch me if you can! How to beat GPT-4 with a 13B model Authors: Shuo Yang*, Wei-Lin Chiang*, Lianmin Zheng*, Joseph E. Gonzalez, Ion Stoica. Announcing Llama-rephraser: 13B models reaching GPT-4 performance in major benchmarks (MMLU/GSM-8K/HumanEval)! To ensure result validity, we followed OpenAI’s decontamination method and found no evidence of data contamination. What’s the trick behind it? Well, rephrasing
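The reason a rephrased benchmark question can slip past a standard check is easy to see with a toy example. Below is a minimal sketch of n-gram-overlap decontamination (the 13-gram threshold mirrors OpenAI's published approach; the whitespace tokenization and the example questions are assumptions of mine): a verbatim copy of a benchmark question is flagged, while a light rephrasing is not.

```python
def ngrams(text: str, n: int) -> set[str]:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flagged_as_contaminated(train_doc: str, test_item: str, n: int = 13) -> bool:
    # Standard n-gram decontamination: flag the pair if any length-n token span is shared.
    return bool(ngrams(train_doc, n) & ngrams(test_item, n))

benchmark_q = ("If a train travels 60 miles per hour for 3 hours, "
               "how many miles does it travel in total over the whole trip?")
rephrased_q = ("A train moves at 60 mph; after 3 hours of travel, "
               "what total distance has it covered?")

print(flagged_as_contaminated(benchmark_q, benchmark_q))  # True: verbatim copy is caught
print(flagged_as_contaminated(rephrased_q, benchmark_q))  # False: rephrasing slips through
```

That gap between verbatim overlap and semantic overlap is the failure mode the post goes on to examine.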