Research

Re-introducing Vision Arena Categories

Since we first introduced categories over two years ago, and Vision Arena last year, the AI evaluation landscape has evolved. New categories have been added, existing ones have been updated, and the leaderboards they power are becoming more insightful with each round of community input.

If you need a refresher: categories are how we organize conversations on LMArena. Each message can be tagged into one or more categories, though tagging isn’t required. This flexibility ensures that we capture the richness of real-world use cases without forcing conversations into boxes.

When you view a category leaderboard, you’re seeing the same ranking methodology as the overall Arena leaderboard, but filtered only to conversations within that category. This makes categories a powerful way to zoom in on specific domains of model performance, whether that’s coding, creative writing, or multimodal reasoning.

In Vision Arena, users upload images and give text prompts asking all sorts of questions, from homework to storytelling. Like in Text Arena, many insights can be gained by looking at the categories of these conversations, both in terms of the prompts and the accompanying images.

Today, we’ll define and explain the new Vision Arena categories paired with example prompts and images.

Captioning

The Captioning category covers requests for general descriptions of images. This measures a model’s capacity to understand and describe images, often for users interested in getting overall information about an image, rather than seeking specific information from it.

Prompt: Please describe the photo in English.

Prompt: Translate, analyze, and summarize the contents of this document.

Creative Writing

The Creative Writing category is designed to capture the models’ skills in compositions outside of academic, professional, and technical areas, and typically involve creativity, imagination or storytelling.

Prompt: give 4 titles retrospectively to the abstract pieces depicting the emotional consciousness of Ulysses

Prompt: Give a sensory rich description of this Wabi sabi space

Diagrams

For testing recognition and understanding of technical information in images, we have the Diagrams category. This includes instances where the image contains graphs, charts, or figures, and the user is asking questions which require understanding of the concepts of the visuals.

Prompt: The graph of a parent function is shown. Which function belongs to the same function family as the graphed function? A f\left ( x \right ): =: 5^{x}: - : 4 B f\left ( x \right ): =: \frac{1}{5}\sqrt[3]{x: +: 4} C f\left ( x \right ): =: 5x^{2}: +: 4 D f\left ( x \right ): =: 5\sqrt{x}: -: 4

Prompt: Please answer the question from the attached file with an explanation. Thanks.

Entity Recognition

For recognizing, identifying, and explaining specific objects, items, or people from images, we have the Entity Recognition category. This tests the models on their ability to associate visual features of particular things with the associated knowledge from their pretraining.

Prompt: What kind of pokemon is this

Prompt: who are the people in this image?

Homework

This category measures models’ abilities to understand and solve problems from assignments based on pictures of the assignments. These vary widely in topics and across the level of education.

Prompt: answer this correctly

Prompt: explain the answer

Humor

The Humor category is a lighthearted and fun category which contains images and prompts where users ask the models to identify funny components or explain the joke. Understanding and explaining humor has historically been a difficult task for AI, and adding image understanding makes it even more challenging.

Prompt: whats a better caption to write for us on weekdays vs us on fridays?

Prompt: Can you explain the humor in this meme?

Optical Character Recognition (OCR)

The OCR category contains images with text, with prompts asking questions for information from the textual components of the image. This is the test of the models’ abilities to read and understand text in images.

Prompt: extract the article text

Prompt: send me the text in the image.

Conclusion

These examples highlight the variety of inputs and tasks that people are using Vision Arena’s multimodal capabilities for, while also providing context for the categories.

You can now explore how your favorite AI models compare across these different categories on the Vision Leaderboard!

Re-introducing Vision Arena Categories

Read next

Introducing BiomedArena.AI: Evaluating LLMs for Biomedical Discovery

A Deep Dive into Recent Arena Data

Does Sentiment Matter Too?