LLM Arena
Compare two models side by side on the same prompt.
What is LLM Arena?
LLM Arena sends a single prompt to two models at once and displays their responses together, making it easy to spot differences in quality, style, and accuracy.
Starting a comparison
Navigate to the Arena tab in your evals project. Select two models to compare by choosing a provider and model name for each side. You can compare models from different providers (e.g., GPT-4 vs Claude) or different versions of the same provider's models.
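Conceptually, a comparison is just a pair of (provider, model) selections. The sketch below is a hypothetical illustration of that shape (the field names are assumptions, not the product's actual schema):

```python
# Hypothetical shape of an Arena comparison setup: each side pairs a
# provider with a model name, and the two providers may differ.
comparison = {
    "side_a": {"provider": "openai", "model": "gpt-4"},
    "side_b": {"provider": "anthropic", "model": "claude"},
}

# The two sides are independent selections, so cross-provider
# comparisons need no special handling.
providers = {side["provider"] for side in comparison.values()}
print(providers)
```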
Running comparisons
Type your prompt in the input field and click Send. Both models receive the same prompt and generate responses independently. Responses appear side by side so you can compare them directly.
You can run multiple comparisons in a single session. Each prompt and response pair is saved, letting you build up a collection of comparison results.
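The flow above — the same prompt fanned out to both models independently, with each pair recorded — can be sketched roughly as follows. `call_model` is a hypothetical stand-in for a real provider SDK call, not an actual Arena API:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a real provider request (assumption, not a real API).
def call_model(provider: str, model: str, prompt: str) -> str:
    return f"[{provider}/{model}] response to: {prompt}"

def run_comparison(side_a: tuple[str, str], side_b: tuple[str, str], prompt: str) -> dict:
    """Send the same prompt to both models independently and in parallel."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_a = pool.submit(call_model, *side_a, prompt)
        fut_b = pool.submit(call_model, *side_b, prompt)
        return {"prompt": prompt, "a": fut_a.result(), "b": fut_b.result()}

# Each call yields one saved prompt/response pair; a session is a list of them.
session = [run_comparison(("openai", "gpt-4"), ("anthropic", "claude"), "Explain recursion.")]
result = session[0]
print(result["a"])
print(result["b"])
```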
Evaluating responses
After both models respond, you can optionally run an automated evaluation using a judge LLM. The judge scores each response on metrics like relevancy, helpfulness, and accuracy, giving you a consistent automated signal alongside your own subjective assessment.
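A common pattern for judge evaluation is to prompt a strong model with a rubric and parse structured scores from its reply. The sketch below is a minimal illustration of that pattern, with the judge's reply stubbed out; the `judge` function and the exact metric names are assumptions, not the product's implementation:

```python
import json

def judge(prompt: str, response: str) -> dict:
    """Score one response against its prompt using a judge LLM (stubbed here)."""
    rubric = (
        "Score the response to the prompt from 1 to 5 on relevancy, "
        "helpfulness, and accuracy. Reply with JSON only."
    )
    # In a real implementation, `raw` would come from a judge-model call
    # built from `rubric`, `prompt`, and `response`. Stubbed for illustration:
    raw = '{"relevancy": 4, "helpfulness": 5, "accuracy": 4}'
    return json.loads(raw)

scores = judge("Explain recursion.", "Recursion is when a function calls itself...")
print(scores["helpfulness"])  # 5
```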
Viewing results
Arena results are saved and can be reviewed later. Each result shows the prompt, both model responses, and any evaluation scores. Use this to build evidence for model selection decisions.
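To turn saved results into a model-selection decision, one simple approach is to tally which side scores higher per prompt. The record shape and the scores below are illustrative assumptions, not real Arena data:

```python
# Hypothetical saved results: prompt plus an aggregate score per side.
results = [
    {"prompt": "Q1", "scores": {"a": 4.5, "b": 3.8}},
    {"prompt": "Q2", "scores": {"a": 4.0, "b": 4.2}},
    {"prompt": "Q3", "scores": {"a": 4.7, "b": 4.1}},
]

# Count per-prompt wins for each side as lightweight selection evidence.
wins = {"a": 0, "b": 0}
for r in results:
    winner = max(r["scores"], key=r["scores"].get)
    wins[winner] += 1
print(wins)  # {'a': 2, 'b': 1}
```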