LLM Evals

LLM Evals overview

Introduction to the LLM evaluation platform and key concepts.

What is LLM Evals?

Building an LLM application is one thing—knowing whether it actually works well is another. LLM Evals gives you a systematic way to measure how your models perform before they reach users, and to catch regressions as you iterate on prompts, fine-tune models, or swap providers.

New to LLM Evals? Start with [Running experiments](llm-evals/running-experiments) to create your first evaluation, or explore [Managing datasets](llm-evals/managing-datasets) to set up test cases.

Think of it as automated quality assurance for your AI. Instead of manually testing outputs or waiting for user complaints, you can run structured evaluations that check for the things that matter: Is the response relevant? Is it accurate? Does it contain harmful content? Is the model making things up?

How it works

LLM Evals uses what's called a Judge LLM approach. Here's the basic idea: you send prompts to your model, collect its responses, and then have a separate (usually more capable) model evaluate those responses against your criteria. Learn more about setting up judges in [Running experiments](llm-evals/running-experiments).

For example, if you're building a customer support chatbot, you might have a dataset of 100 common questions with ideal answers. LLM Evals sends each question to your chatbot, then asks GPT-4 or Claude to score how well your chatbot's response matches the expected answer, whether it stays on topic, and whether it contains any problematic content.

Why use a judge model? Human evaluation is the gold standard but doesn't scale. LLM judges provide consistent, repeatable evaluations that correlate well with human judgment—especially for clear-cut quality signals like relevancy and toxicity.
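The loop described above can be sketched in plain Python. Everything here is illustrative: `query_model` and `query_judge` are hypothetical stand-ins for real API calls (stubbed so the sketch runs end to end), and the judge prompt is just one way to phrase the scoring request.

```python
# Minimal sketch of the judge-LLM evaluation pattern.
# `query_model` and `query_judge` are hypothetical stubs, NOT a real
# provider API; swap in actual model calls in a real setup.

JUDGE_PROMPT = (
    "You are an evaluator. Given an expected answer and a candidate "
    "answer, reply with a relevancy score between 0 and 1.\n"
    "Expected: {expected}\nCandidate: {candidate}\nScore:"
)

def query_model(prompt: str) -> str:
    # Stub for the model under test (e.g., your chatbot).
    return "You can reset your password from the account settings page."

def query_judge(prompt: str) -> float:
    # Stub for the judge LLM; a real judge's numeric reply would be
    # parsed here. Fixed value keeps the sketch self-contained.
    return 0.9

def evaluate(test_cases: list[dict]) -> float:
    """Send each prompt to the model, have the judge score the
    response against the expected answer, and return the mean score."""
    scores = []
    for case in test_cases:
        response = query_model(case["prompt"])
        judge_prompt = JUDGE_PROMPT.format(
            expected=case["expected"], candidate=response
        )
        scores.append(query_judge(judge_prompt))
    return sum(scores) / len(scores)

dataset = [
    {"prompt": "How do I reset my password?",
     "expected": "Go to account settings and choose 'Reset password'."},
]
print(f"mean relevancy: {evaluate(dataset):.2f}")
```

The key design point is the separation of roles: the model under test never sees the expected answer, while the judge sees both and only scores.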

Key concepts

Before diving in, it helps to understand how the pieces fit together:

  • Projects: Your workspace for a specific application or use case. A customer support bot would be one project; an internal knowledge assistant would be another. Each project has its own experiments, datasets, and configuration.
  • Use cases: The type of application you're building: Chatbot (conversational AI), RAG (retrieval-augmented generation), or Agent (tool-using assistants). The use case determines which metrics are available and how evaluations are structured.
  • Experiments: A single evaluation run. Each experiment tests a specific model configuration against a dataset and produces scores. Run experiments whenever you change prompts, switch models, or want to compare approaches.
  • Datasets: Collections of test cases. Single-turn datasets have isolated prompts with expected outputs. Multi-turn datasets contain conversations where the model generates responses at each turn. See [Managing datasets](llm-evals/managing-datasets) for details.
  • Scorers: The metrics you're measuring. Core metrics include answer relevancy, bias, and toxicity. Multi-turn chatbot evaluations add conversational metrics like knowledge retention, coherence, and task completion. See [Configuring scorers](llm-evals/configuring-scorers) to customize.
  • Judge LLM: The model that evaluates your model's outputs. This is typically a frontier model like GPT-4 or Claude that can reliably assess quality. You configure the judge separately from the model being evaluated.
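To make the single-turn versus multi-turn distinction concrete, here are illustrative shapes for the two dataset types. The field names are examples chosen for this sketch, not a required schema.

```python
# Illustrative dataset shapes; field names are examples, not a
# required schema for any particular tool.

# Single-turn: isolated prompts, each with an expected output.
single_turn = [
    {"prompt": "What is your refund policy?",
     "expected_output": "Refunds are available within 30 days of purchase."},
]

# Multi-turn: a conversation history. The model under test generates
# the next assistant turn, and conversational scorers (knowledge
# retention, coherence, task completion) evaluate it in context.
multi_turn = [
    {"turns": [
        {"role": "user", "content": "I can't log in."},
        {"role": "assistant", "content": "Are you seeing an error message?"},
        {"role": "user", "content": "Yes, it says 'invalid token'."},
    ]},
]

print(len(single_turn), len(multi_turn[0]["turns"]))
```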

When to run evaluations

Evaluations are most valuable at these moments:

  • Before launch: Establish baseline performance and catch issues before users see them
  • After prompt changes: Verify that improvements in one area don't cause regressions elsewhere
  • When switching models: Compare performance across providers or model versions objectively
  • Periodically in production: Catch drift or degradation over time as the underlying models update

Navigating the interface

Access LLM Evals from the main sidebar. Once inside, you'll see a project dropdown at the top and a navigation panel on the left:

  • Overview: Your project dashboard. See recent experiments, quick stats, and jump into common actions.
  • Experiments: The full history of evaluation runs. Track performance over time with the built-in chart, dig into individual results, or start new experiments.
  • Datasets: Manage your test cases. Browse built-in datasets, upload your own, or edit existing ones to better match your use cases.
  • Scorers: Configure what you're measuring. Enable or disable metrics, adjust thresholds, or create custom scorers.
  • Configuration: Project-level settings like default models and evaluation preferences.
  • Organizations: Manage API keys for different providers. These are stored securely and used when running evaluations.

Running your first evaluation

Ready to try it out? Here's the quickest path to your first results:

  1. Create a project: Click the project dropdown and select "Create new project." Give it a descriptive name like "Customer Support Bot" or "Document Q&A."
  2. Add your API keys: Go to Organizations and add API keys for the model you're testing and the judge model. You'll need at least one of each.
  3. Start a new experiment: Click "New experiment" and follow the wizard. Select your model, pick a built-in dataset to start, choose your judge, and select which metrics to evaluate.
  4. Review results: Once the experiment completes, click into it to see per-metric scores and drill down into individual test cases.

Start with a small dataset (10-20 prompts) and the default metrics. Once you understand how the results look, expand to larger datasets and customize the scorers for your specific needs.

What's next

Once you're comfortable with the basics, explore these areas:

  • Learn how to configure experiments in detail, including use case selection and metric customization
  • Try multi-turn datasets to evaluate how your chatbot handles real conversations
  • Upload custom datasets that reflect your actual user queries and conversation patterns
  • Explore conversational metrics like knowledge retention and task completion for deeper chatbot insights
  • Set up regular evaluation runs to track model performance over time