RepoChat Arena

A Live Benchmark for AI Software Engineers


Contributors:
Yifan Song
Naman Jain
Manish Shetty
Wei-Lin Chiang
Valerie Chen
Wayne Chi
Anastasios N. Angelopoulos
Ion Stoica


Introduction

Navigating and understanding complex codebases is an integral part of incorporating LLMs into a coding workflow. We launched RepoChat, a new arena that answers code-related queries using a user-provided GitHub repository link.

RepoChat lets models automatically retrieve relevant files from a given GitHub repository. It can resolve issues, review PRs, implement code, and answer higher-level questions about the repository, all without requiring users to provide extensive context.


Video 1. A demo of RepoChat.

So far, RepoChat has collected around 20k battles and over 4k votes. All statistics in this blog are calculated from conversations and votes collected between November 30, 2024 and February 11, 2025, inclusive.

| Metric | Value |
|--------|-------|
| # of conversations | 12,732 |
| # of votes | 4,857 |
| # of unique GitHub links (in conversations) | 5,037 |
| # of unique GitHub links (in votes) | 2,465 |

Table 1. Number of conversations, votes, and unique GitHub links logged from 11/30/2024 to 02/11/2025.

In this blog we will cover the initial leaderboards, how people use RepoChat, how it works, further analysis (including style and retriever control), and how the leaderboards are calculated.

Initial Leaderboards

Since there are two separate components (the retriever and the answer model), we produce two separate leaderboards. Jump to the Leaderboard Calculation section for details on how the leaderboards are computed, and to Further Analysis and Results for additional leaderboards such as the style-controlled one. All leaderboards can be reproduced using our Google Colab notebook.

| Rank | Model | Arena Score | 95% CI | Votes | Org. |
|------|-------|-------------|--------|-------|------|
| 1 | GPT 4o (20241120) | 1088 | +12 / -10 | 1647 | OpenAI |
| 2 | DeepSeek V3 | 1032 | +24 / -21 | 362 | DeepSeek |
| 3 | Claude 3.5 Sonnet (20241022) | 1015 | +16 / -10 | 1737 | Anthropic |
| 4 | Gemini 2.0 Flash Exp | 1005 | +15 / -13 | 1066 | Google |
| 5 | Gemini 1.5 Pro | 974 | +19 / -16 | 789 | Google |
| 6 | GPT 4o mini (20240718) | 970 | +24 / -17 | 521 | OpenAI |
| 7 | LLaMA 3.1 405B Instruct FP8 | 914 | +13 / -22 | 750 | Meta |

Table 2. Arena ratings of seven popular answer models based on over 4K votes collected between November 30, 2024 and February 11, 2025. This leaderboard ranks the models on generating responses.

| Rank | Model | Arena Score | 95% CI | Org. |
|------|-------|-------------|--------|------|
| 1 | Gemini 1.5 Flash 002 | 1013 | +14 / -21 | Google |
| 2 | GPT 4o mini (20240718) | 987 | +21 / -14 | OpenAI |

Table 3. Arena ratings of the two retriever models based on 4K votes collected between November 30, 2024 and February 11, 2025. This leaderboard ranks the models on retrieving relevant files.

How do people use RepoChat?

What types of GitHub links do users input? RepoChat receives a wide range of GitHub links, including repositories, issues, pull requests, and others. We find that the vast majority (almost 80%) of user-provided links point to repositories, followed by issues and branches.

Figure 1. Breakdown of the GitHub link categories in RepoChat by percentage.

What programming languages do people ask about? The following statistics are calculated from the file extensions of the retrieved files and serve as an indicator of the programming languages users most frequently ask about or code in. The abundance of Markdown files is primarily due to README files, which are often retrieved because they contain critical descriptions and instructions for the repository.

Figure 2. Breakdown of retrieved file types by percentage.
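
For illustration, here is a minimal sketch of how such a breakdown could be computed from the retrieved file paths. The extension-to-language mapping and function name are assumptions for the example, not the exact code behind Figure 2.

```python
from collections import Counter
from pathlib import Path

# Hypothetical extension-to-language mapping; the real analysis may use a
# larger or different table.
EXT_TO_LANG = {
    ".py": "Python", ".js": "JavaScript", ".ts": "TypeScript",
    ".md": "Markdown", ".go": "Go", ".rs": "Rust", ".java": "Java",
}

def language_breakdown(retrieved_paths):
    """Return the fraction of retrieved files per language, bucketing unknown extensions as 'Other'."""
    counts = Counter(
        EXT_TO_LANG.get(Path(p).suffix.lower(), "Other") for p in retrieved_paths
    )
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.most_common()}

# Example with made-up retrieved paths:
print(language_breakdown(["README.md", "src/app.py", "src/utils.py", "web/index.ts"]))
```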

What natural languages do the user queries contain? Most of our votes contain user queries in English, followed by Russian and Chinese.

Figure 3. Breakdown of user query languages by count.

How long are the user queries? The user query length varies significantly, ranging from a minimum of 1 token to a maximum of 1,406,325 tokens, with a median of 22 tokens. Short queries mostly consist of prose requesting implementations or instructions, whereas longer queries often include extensive code blocks.

Figure 4. Distribution of user query length.
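
As a rough illustration of how these lengths can be measured, the sketch below computes token-count summary statistics for a set of queries. The choice of tokenizer (tiktoken's `cl100k_base`) is an assumption for the example, not necessarily the tokenizer behind Figure 4.

```python
import statistics
import tiktoken  # assumed tokenizer library for this example

enc = tiktoken.get_encoding("cl100k_base")

def query_length_stats(queries):
    """Summarize user query lengths in tokens (min / median / max)."""
    lengths = [len(enc.encode(q)) for q in queries]
    return {"min": min(lengths), "median": statistics.median(lengths), "max": max(lengths)}

# Example with made-up queries:
print(query_length_stats([
    "fix the CI failure",
    "explain how the retriever module works",
    "implement pagination for the API client",
]))
```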

How long are the retrieved file contexts? The distribution of retrieved file context length is more concentrated, with a median of 8,870 tokens, an average of 12,508, and a maximum of 126,329, along with the occasional empty retrieval.

Figure 5. Distribution of retrieved file context length.

Below is an example of a longer user query that includes an extensive code block. The provided link is https://github.com/sktime/sktime, and the query reads:

debug the code as a professional would do. The code is for running in jupyter.

```
import pandas as pd
import numpy as np
from sktime.forecasting.model_selection import (
    ForecastingOptunaSearchCV,
    ExpandingWindowSplitter,
    temporal_train_test_split
)
from sktime.forecasting.base import ForecastingHorizon
from sktime.performance_metrics.forecasting import MeanAbsolutePercentageError
from sktime.forecasting.compose import TransformedTargetForecaster
from sktime.forecasting.statsforecast import (
    StatsForecastMSTL,
    StatsForecastAutoETS,
    StatsForecastAutoARIMA,
    StatsForecastAutoTheta
)
from sktime.transformations.series.detrend import Detrender
from sktime.transformations.series.deseasonalize import Deseasonalizer
import optuna
import warnings

warnings.filterwarnings('ignore')

# Load your time series data
# Ensure 'pivot_table' is defined and contains the 'PAN4_PIBPMG4' series
y = pivot_table['PAN4_PIBPMG4']

# Split the data into train and test sets
y_train, y_test = temporal_train_test_split(y, test_size=8)

# Define the forecasting horizon
fh = ForecastingHorizon(np.arange(1, 9), is_relative=True)

# Set up cross-validation with an expanding window splitter
cv = ExpandingWindowSplitter(fh=fh, initial_window=len(y_train) - 8)

# Define the parameter space for tuning
param_distributions = {
    'forecaster__season_length': optuna.distributions.CategoricalDistribution([(4,), (8,)]),
    'forecaster__trend_forecaster': optuna.distributions.CategoricalDistribution([
        StatsForecastAutoETS(model="ZZZ"),
        StatsForecastAutoARIMA(seasonal=True),
        StatsForecastAutoTheta()
    ]),
    'forecaster__stl_kwargs': {
        'robust': optuna.distributions.CategoricalDistribution([True, False]),
        'period': optuna.distributions.IntUniformDistribution(4, 8)
    }
}

# Initialize the MSTL forecaster
mstl_forecaster = StatsForecastMSTL()

# Create a pipeline with optional transformations
forecaster = TransformedTargetForecaster(steps=[
    ("detrender", Detrender()),
    ("deseasonalizer", Deseasonalizer()),
    ("mstl_forecaster", mstl_forecaster)
])

# Set up the OptunaSearchCV
optuna_search = ForecastingOptunaSearchCV(
    forecaster=forecaster,
    cv=cv,
    param_distributions=param_distributions,
    scoring=MeanAbsolutePercentageError(symmetric=True),
    n_trials=100,
    random_state=42
)

# Fit the model
optuna_search.fit(y_train)

# Predict
y_pred = optuna_search.predict(fh)

# Evaluate
mape = MeanAbsolutePercentageError(symmetric=True)
final_mape = mape(y_test, y_pred)
print(f"Final sMAPE: {final_mape:.2f}")

# Plot results
import matplotlib.pyplot as plt

plt.figure(figsize=(15, 7))
plt.plot(y_train.index, y_train.values, label='Training Data', color='blue')
plt.plot(y_test.index, y_test.values, label='Test Data', color='green')
plt.plot(y_pred.index, y_pred.values, label='Predictions', color='red', linestyle='--')
plt.title('MSTL Forecast Results with Optuna Optimization')
plt.legend()
plt.grid(True)
plt.show()

# Save the best model
from joblib import dump
dump(optuna_search.best_forecaster_, 'best_mstl_model_optuna.joblib')
print("\nBest model saved as 'best_mstl_model_optuna.joblib'")

# Print additional optimization results
print("\nOptimization Results:")
print("="*50)
print(f"Number of completed trials: {len(optuna_search.cv_results_)}")
print(f"Best trial number: {optuna_search.best_index_}")
print(f"Best sMAPE achieved during optimization: {optuna_search.best_score_:.2f}")

# Print best parameters
print("\nBest Parameters Found:")
print("="*50)
for param, value in optuna_search.best_params_.items():
    print(f"{param}: {value}")
```

How Does It Work?

Each answer is produced by two separate LLMs working together. The retriever model extracts the relevant files from the GitHub repository based on the user input. The extracted file contents are then concatenated with the user query and used as the prompt for the answer model, which generates the response.

Figure 6. A graphical representation of the full RepoChat retriever->answer pipeline.

File Retrieval

We select two models as retrievers: gemini-1.5-flash-002 and gpt-4o-mini.

When a valid GitHub link is provided, the repository is cloned, and its file structure is extracted and formatted as an indented layout. The retriever model is then prompted to strategically identify files within this structure that are potentially useful for answering the query, and to output them in decreasing order of relevance in a specified markdown format.

Retriever prompt:

The following files are found in the repository:
{file structure}
Please provide a list of files that you would like to search for answering the user query.
Enclose the file paths in a list in a markdown code block as shown below:
```
1. [[ filepath_1 ]]\n
2. [[ filepath_2 ]]\n
3. [[ filepath_3 ]]\n
...
```
Think step-by-step and strategically reason about the files you choose to maximize the chances of finding the answer to the query. Only pick the files that are most likely to contain the information you are looking for in decreasing order of relevance. Once you have selected the files, please submit your response in the appropriate format mentioned above (markdown numbered list in a markdown code block). The filepath within [[ and ]] should contain the complete path of the file in the repository.
{query}
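
To make the retrieval step concrete, here is a minimal sketch of the two pieces described above: rendering a cloned repository's file structure as an indented listing, and parsing the retriever's numbered `[[ filepath ]]` list back into file paths. The function names, indentation format, and cap on the number of files are illustrative assumptions, not the exact production code.

```python
import os
import re

def indented_file_structure(repo_root):
    """Render the cloned repository's file tree as an indented listing (illustrative format)."""
    lines = []
    for dirpath, _dirnames, filenames in sorted(os.walk(repo_root)):
        rel = os.path.relpath(dirpath, repo_root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if rel != ".":
            lines.append("  " * (depth - 1) + os.path.basename(dirpath) + "/")
        for name in sorted(filenames):
            lines.append("  " * depth + name)
    return "\n".join(lines)

# Matches "[[ path/to/file.py ]]" entries in the retriever's markdown list.
FILE_PATTERN = re.compile(r"\[\[\s*(.+?)\s*\]\]")

def parse_retrieved_files(retriever_response, max_files=10):
    """Extract file paths from the retriever's response, preserving its relevance order."""
    return FILE_PATTERN.findall(retriever_response)[:max_files]

# Example:
response = "1. [[ sktime/forecasting/compose/__init__.py ]]\n2. [[ README.md ]]"
print(parse_retrieved_files(response))  # ['sktime/forecasting/compose/__init__.py', 'README.md']
```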

Model Response Generation

The contents of relevant files are extracted, concatenated with the user query, and provided to the responding LLM as a prompt in a specified format. If the provided link is not a direct repository link but instead links to issues or pull requests, a query context section containing the content of the issue or PR thread is also included.

Answer model prompt format:

Here is a list of files in the repository that may help you answer the query:
{pairs of (file_name, file_content)}
___

[INSTRUCTION]

You are an expert software engineer. Answer the following user query using provided context retrieved from the {repository_name} repository.

[QUERY CONTEXT]

{issue/PR content}

[USER QUERY]

{user query}
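
As an illustration of this step, here is a minimal sketch that assembles the final prompt from the retrieved (file_name, file_content) pairs, the optional issue/PR context, and the user query, following the format shown above. The function name and the per-file delimiter are assumptions for the example.

```python
def build_answer_prompt(repository_name, retrieved_files, user_query, query_context=None):
    """Assemble the answer model's prompt from retrieved files, optional issue/PR context, and the query.

    retrieved_files: list of (file_name, file_content) pairs, in retrieval order.
    query_context: contents of the issue/PR thread when the link is not a plain repository link.
    """
    file_sections = "\n".join(
        f"FILE: {name}\n{content}\n" for name, content in retrieved_files  # assumed delimiter
    )
    prompt = (
        "Here is a list of files in the repository that may help you answer the query:\n"
        f"{file_sections}\n___\n\n[INSTRUCTION]\n\n"
        "You are an expert software engineer. Answer the following user query using "
        f"provided context retrieved from the {repository_name} repository.\n"
    )
    if query_context:
        prompt += f"\n[QUERY CONTEXT]\n\n{query_context}\n"
    prompt += f"\n[USER QUERY]\n\n{user_query}\n"
    return prompt
```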

Further Analysis and Results

Style Control

Style (length and Markdown formatting) significantly impacts model ranking. A better-formatted or more visually appealing answer, though not necessarily more substantive, is more likely to win. The style-controlled leaderboard isolates writing style from content quality. For details on our methodology, refer to this blog post.

| Rank | Model | Arena Score | 95% CI | Org. |
|------|-------|-------------|--------|------|
| 1 | Claude 3.5 Sonnet (20241022) | 1031 | +28 / -17 | Anthropic |
| 2 | GPT 4o (20241120) | 1028 | +23 / -24 | OpenAI |
| 3 | DeepSeek V3 | 1016 | +27 / -25 | DeepSeek |
| 4 | Gemini 1.5 Pro | 1007 | +21 / -25 | Google |
| 5 | Gemini 2.0 Flash Exp | 997 | +20 / -23 | Google |
| 6 | GPT 4o mini (20240718) | 963 | +26 / -24 | OpenAI |
| 7 | LLaMA 3.1 405B Instruct FP8 | 956 | +23 / -23 | Meta |

Table 4. Style-controlled arena ratings of the answer models.

Claude 3.5 Sonnet's score and ranking have improved significantly with style control, claiming the top spot. GPT-4o's score has decreased, greatly narrowing its lead over the other models. Gemini 1.5 Pro has also seen a notable boost in score, climbing one rank higher.

Figure 7. A comparison of the original and the style-controlled leaderboards of the answer models.

Why does style control affect models like Claude and GPT 4o so much?

This is likely because a significant portion of user queries are not strictly code-writing tasks. Many focus instead on code understanding, such as how to use the repository, what its purpose is, or other high-level questions, rather than specific code implementations. These queries shift the focus away from pure coding ability and place greater emphasis on organization, presentation, and overall style.

Below is an example battle between Claude 3.5 Sonnet (20241022) and GPT 4o (20241120) with identical retrievals.

The GitHub link is https://github.com/xtekky/gpt4free, and the user query is “Explain why in the new 0.4.0.0 version Blackbox provider is several times slower than in version 0.3.9.7”

Retriever Control

Just like style, the retriever also affects the model's answer: a model given a more complete or relevant retrieval will likely produce a better response. Similar to style control, the retriever-controlled leaderboard separates the effect of the retrievers from the answer model ranking.

| Rank | Model | Arena Score | 95% CI | Org. |
|------|-------|-------------|--------|------|
| 1 | GPT 4o (20241120) | 1088 | +12 / -10 | OpenAI |
| 2 | DeepSeek V3 | 1032 | +24 / -21 | DeepSeek |
| 3 | Claude 3.5 Sonnet (20241022) | 1015 | +16 / -10 | Anthropic |
| 4 | Gemini 2.0 Flash Exp | 1005 | +15 / -13 | Google |
| 5 | Gemini 1.5 Pro | 973 | +20 / -16 | Google |
| 6 | GPT 4o mini (20240718) | 971 | +23 / -17 | OpenAI |
| 7 | LLaMA 3.1 405B Instruct FP8 | 914 | +14 / -22 | Meta |

Table 5. Retriever-controlled arena ratings of the answer models.

The retriever-controlled leaderboard shows only slight differences from the original, as the two retrievers perform similarly and have little influence on the rankings.

Style + Retriever Control

| Rank | Model | Arena Score | 95% CI | Org. |
|------|-------|-------------|--------|------|
| 1 | Claude 3.5 Sonnet (20241022) | 1031 | +28 / -17 | Anthropic |
| 2 | GPT 4o (20241120) | 1028 | +23 / -24 | OpenAI |
| 3 | DeepSeek V3 | 1016 | +27 / -25 | DeepSeek |
| 4 | Gemini 1.5 Pro | 1007 | +21 / -25 | Google |
| 5 | Gemini 2.0 Flash Exp | 997 | +20 / -23 | Google |
| 6 | GPT 4o mini (20240718) | 963 | +26 / -24 | OpenAI |
| 7 | LLaMA 3.1 405B Instruct FP8 | 956 | +23 / -23 | Meta |

Table 6. Style-and-retriever-controlled arena ratings of the answer models.

Leaderboard Calculation

How do we get separate leaderboards for the retrievers and the answer models from one set of battles? Extra features!

Chatbot Arena leverages the Bradley-Terry model to score model strength from pairwise battles. We adopt its statistical extension to evaluate the additional subparts, the retrievers, by integrating them as extra features alongside the answer models. Each retriever feature takes values in {-1, 0, 1}, indicating whether that retriever was active in the file retrieval for model_a, for neither/both, or for model_b, respectively. By performing logistic regression on these retriever features along with the original model features, we obtain coefficients that are then scaled to become the leaderboard scores.
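
For illustration, here is a minimal sketch of this extended Bradley-Terry fit, assuming each non-tied battle is encoded as a +1/-1 indicator per answer model plus a {-1, 0, 1} feature per retriever. The scaling constants, tie handling, and data format are simplifying assumptions; see the Colab notebook for the exact computation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_extended_bradley_terry(battles, models, retrievers, scale=400, base=1000):
    """Fit Bradley-Terry-style scores via logistic regression with extra retriever features.

    battles: iterable of dicts with keys model_a, model_b, retriever_a, retriever_b,
             and winner ("model_a" or "model_b"); ties are skipped in this sketch.
    """
    m_idx = {m: i for i, m in enumerate(models)}
    r_idx = {r: len(models) + j for j, r in enumerate(retrievers)}
    X, y = [], []
    for b in battles:
        if b["winner"] not in ("model_a", "model_b"):
            continue  # drop ties for simplicity
        row = np.zeros(len(models) + len(retrievers))
        row[m_idx[b["model_a"]]] = 1.0    # answer model serving side A
        row[m_idx[b["model_b"]]] = -1.0   # answer model serving side B
        row[r_idx[b["retriever_a"]]] += 1.0  # retriever feature: +1 (A only),
        row[r_idx[b["retriever_b"]]] -= 1.0  # -1 (B only), 0 (both or neither)
        X.append(row)
        y.append(1 if b["winner"] == "model_a" else 0)
    coef = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y)).coef_[0]
    model_scores = {m: base + scale * coef[i] for m, i in m_idx.items()}
    retriever_scores = {r: base + scale * coef[j] for r, j in r_idx.items()}
    return model_scores, retriever_scores
```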

Figure 8. Fraction of model A wins for all non-tied battles.

What’s Next?

We are actively collecting more votes and integrating new models, with plans for more comprehensive analysis down the line. Additionally, we are exploring ways to enhance RepoChat by incorporating features such as support for private repositories, GitLab integration, and improvements to our retrieval process. Community contributions are welcome—feel free to ping us if you’d like to get involved!

Appendix

https://github.com/julien-blanchon/arxflix/tree/feat/groq_whisper

when using the gradio interface, the stage of generating video get me this error :

```
INFO:__main__:Generated assets successfully.
INFO:backend.main:Generating video to output.mp4 from tmpmpc_r1vm
Free port: 55427
Exposed directory /var/folders/3j/jv5_hbgn59g9yxccxtfcvz4r0000gp/T/tmpmpc_r1vm
Starting up http-server, serving /var/folders/3j/jv5_hbgn59g9yxccxtfcvz4r0000gp/T/tmpmpc_r1vm

http-server version: 14.1.1

http-server settings:
CORS: true
Cache: 3600 seconds
Connection Timeout: 120 seconds
Directory Listings: visible
AutoIndex: visible
Serve GZIP Files: false
Serve Brotli Files: false
Default File Extension: none

Available on:
http://localhost:55427
Hit CTRL-C to stop the server

INFO:backend.utils.generate_video:Exposed directory /var/folders/3j/jv5_hbgn59g9yxccxtfcvz4r0000gp/T/tmpmpc_r1vm
INFO:backend.utils.generate_video:Generating video to /var/folders/3j/jv5_hbgn59g9yxccxtfcvz4r0000gp/T/tmpmpc_r1vm/output.mp4
(node:93666) ExperimentalWarning: CommonJS module /opt/homebrew/lib/node_modules/npm/node_modules/debug/src/node.js is loading ES Module /opt/homebrew/lib/node_modules/npm/node_modules/supports-color/index.js using require().
Support for loading ES Module in require() is an experimental feature and might change at any time
(Use `node --trace-warnings ...` to show where the warning was created)
Bundled code ━━━━━━━━━━━━━━━━━━ 550ms
[Fri Dec 06 2024 00:02:39 GMT+0100 (Central European Standard Time)] "GET /audio.wav" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/131.0.6778.87 Safari/537.36"
(node:93654) [DEP0066] DeprecationWarning: OutgoingMessage.prototype._headers is deprecated
(Use `node --trace-deprecation ...` to show where the warning was created)
Composition Arxflix
Codec h264
Output /var/folders/3j/jv5_hbgn59g9yxccxtfcvz4r0000gp/T/tmpmpc_r1vm/output.mp4
Concurrency 1x
Rendering frames ━━━━━━━━━━━━━━━━━━ 0/2353
Encoding video ━━━━━━━━━━━━━━━━━━ 0/2353[Fri Dec 06 2024 00:02:39 GMT+0100 (Central European Standard Time)] "GET /subtitles.srt" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/131.0.6778.87 Safari/537.36"
[Fri Dec 06 2024 00:02:39 GMT+0100 (Central European Standard Time)] "GET /rich.json" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/131.0.6778.87 Safari/537.36"
node_modules/.pnpm/react-dom@18.3.1_react@18.3.1/node_modules/react-dom/cjs/react-dom.production.min.js:187 Error: No "src" prop was passed to .
at ImgRefForwarding (http://localhost:3000/bundle.js:26884:15)
at Nh (http://localhost:3000/bundle.js:25838:137)
at Yi (http://localhost:3000/bundle.js:25870:57)
at Vk (http://localhost:3000/bundle.js:25971:275)
at Uk (http://localhost:3000/bundle.js:25957:389)
at Tk (http://localhost:3000/bundle.js:25957:320)
at Ik (http://localhost:3000/bundle.js:25957:180)
at Nk (http://localhost:3000/bundle.js:25948:88)
at Gk (http://localhost:3000/bundle.js:25945:429)
at J (http://localhost:3000/bundle.js:34117:203)

An error occurred while rendering frame 0:
Error No "src" prop was passed to .

at node_modules/.pnpm/remotion@4.0.171_react-dom@18.3.1_react@18.3.1__react@18.3.1/node_modules/remotion/dist/cjs/Img.js:19
17 │ const { delayPlayback } = (0, use_buffer_state_js_1.useBufferState)();
18 │ const sequenceContext = (0, react_1.useContext)(SequenceContext_js_1.SequenceContext);
19 │ if (!src) {
20 │ throw new Error('No "src" prop was passed to .');
21 │ }
22 │ (0, react_1.useImperativeHandle)(ref, () => {

at Nh (node_modules/.pnpm/react-dom@18.3.1_react@18.3.1/node_modules/react-dom/cjs/react-dom.production.min.js:159)
at Yi (node_modules/.pnpm/react-dom@18.3.1_react@18.3.1/node_modules/react-dom/cjs/react-dom.production.min.js:191)
at Vk (node_modules/.pnpm/react-dom@18.3.1_react@18.3.1/node_modules/react-dom/cjs/react-dom.production.min.js:292)
at Uk (node_modules/.pnpm/react-dom@18.3.1_react@18.3.1/node_modules/react-dom/cjs/react-dom.production.min.js:278)
at Tk (node_modules/.pnpm/react-dom@18.3.1_react@18.3.1/node_modules/react-dom/cjs/react-dom.production.min.js:278)
at Ik (node_modules/.pnpm/react-dom@18.3.1_react@18.3.1/node_modules/react-dom/cjs/react-dom.production.min.js:278)
at Nk (node_modules/.pnpm/react-dom@18.3.1_react@18.3.1/node_modules/react-dom/cjs/react-dom.production.min.js:269)
at Gk (node_modules/.pnpm/react-dom@18.3.1_react@18.3.1/node_modules/react-dom/cjs/react-dom.production.min.js:266)
at J (node_modules/.pnpm/scheduler@0.23.2/node_modules/scheduler/cjs/scheduler.production.min.js:12)
at R (node_modules/.pnpm/scheduler@0.23.2/node_modules/scheduler/cjs/scheduler.production.min.js:13)
INFO:backend.utils.generate_video:Generated video to /var/folders/3j/jv5_hbgn59g9yxccxtfcvz4r0000gp/T/tmpmpc_r1vm/output.mp4
http-server stopped.
INFO:__main__:Generated video successfully.
^CKeyboard interruption in main thread... closing server.
```

https://github.com/feathers-studio/telegraf-docs

bot.js

```
require("dotenv").config();

const { Telegraf, session } = require("telegraf");
const { registerCommands } = require("./commands");
const { registerHears } = require("./hears");
const { registerScenes } = require("./scenes");
const sequelize = require("./database");
const User = require("./models/user");
const { t, i18next } = require("./i18n");
const Plans = require("./models/plans");

// sync database
(async () => {
  try {
    await sequelize.sync({ force: true });
    console.log("database synchronized successfully.");

    const bot = new Telegraf(process.env.BOT_TOKEN);

    const defaultPlan = await Plans.findOne({ where: { isDefault: true } });
    if (!defaultPlan) {
      await Plans.create({
        name: "atom",
        isDefault: true,
        price: 0,
      });
      console.log("default plan created.");
    }

    // telegraf.js sessions
    bot.use(
      session({
        defaultSession: () => ({ counter: 0, userName: "" }), // Initialize session variables
      })
    );
    // user check/creation middleware
    bot.use(async (ctx, next) => {
      const telegramId = ctx.from.id;
      try {
        let user = await User.findOne({ where: { telegramId } });
        if (!user) {
          user = await User.create({
            name: ctx.from.first_name || "unknown",
            telegramId: telegramId,
          });
          console.log(
            `new user created: ${user.name} (id: ${user.telegramId})`
          );
        }

        ctx.session.isAuthorized = true;
        ctx.session.user = user;
        ctx.session.language = user.language || "ru";

        i18next.changeLanguage(ctx.session.language);
        return next();
      } catch (error) {
        console.error("error during user check/creation:", error);
        return ctx.reply(t("error_user_creation"));
      }
    });

    // register commands, hears, and scenes
    registerCommands(bot);
    registerHears(bot);
    registerScenes(bot);

    // start the bot
    bot.launch();
    console.log("bot is running.");

    // enable graceful stop
    process.once("SIGINT", () => bot.stop("SIGINT"));
    process.once("SIGTERM", () => bot.stop("SIGTERM"));
  } catch (error) {
    console.error("error initializing bot:", error);
  }
})();
```