Introducing BiomedArena.AI: Evaluating LLMs for Biomedical Discovery
LMArena is honored to partner with the team at DataTecnica to expand BiomedArena.ai, a new domain-specific evaluation track.
A Deep Dive into Recent Arena Data
Today, we're excited to release a new dataset of recent battles from LMArena! The dataset contains 140k conversations from the text arena.
Does Sentiment Matter Too?
Introducing Sentiment Control: Disentangling Sentiment and Substance
Contributors: Connor Chen, Wei-Lin Chiang, Tianle Li, Anastasios Angelopoulos
Introduction
You may have noticed that recent models on Chatbot Arena appear more emotionally expressive than their predecessors. But does this added sentiment actually improve their rankings on the leaderboard? Our previous exploration revealed
How Many User Prompts are New?
We investigate 355,575 LLM battles from May 2024 to Dec 2024 to answer the following questions:
1. What proportion of prompts have never been seen before (aka “fresh”)?
2. What are common duplicate prompts?
3. How many prompts appear in widely used benchmarks?
WebDev Arena: A Live LLM Leaderboard for Web App Development
WebDev Arena allows users to test LLMs in a real-world coding task: building interactive web applications.
RepoChat Arena
RepoChat lets models automatically retrieve relevant files from a given GitHub repository. It can resolve issues, review PRs, implement code, and answer higher-level questions about the repository, all without requiring users to provide extensive context.
Arena Explorer
We developed a topic modeling pipeline and the Arena Explorer. This pipeline organizes user prompts into distinct topics, structuring the text data hierarchically to enable intuitive analysis. We believe this tool for hierarchical topic modeling can be valuable to anyone analyzing complex text data.