LangSmith supports two types of evaluations based on when and where they run:

Offline Evaluation

Test before you ship. Run evaluations on curated datasets during development to compare versions, benchmark performance, and catch regressions.

Online Evaluation

Monitor in production. Evaluate real user interactions in real time to detect issues and measure quality on live traffic.

Evaluation workflow

1. Create a dataset

Build a dataset from manually curated test cases, historical production traces, or synthetic data generation.
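As an illustration, dataset examples can be curated as input/output pairs before upload. The `question`/`answer` field names below are arbitrary placeholders, and the commented SDK calls are a sketch (exact signatures vary by SDK version):

```python
# A minimal sketch of curating dataset examples as input/output pairs.
# The "question"/"answer" field names are illustrative, not required.
examples = [
    {
        "inputs": {"question": "What is LangSmith used for?"},
        "outputs": {"answer": "Tracing, evaluating, and monitoring LLM apps."},
    },
    {
        "inputs": {"question": "What are the two evaluation types?"},
        "outputs": {"answer": "Offline and online evaluation."},
    },
]

# With the LangSmith SDK (requires an API key), the examples could then
# be uploaded to a named dataset, roughly:
#
#   from langsmith import Client
#   client = Client()
#   dataset = client.create_dataset(dataset_name="qa-smoke-tests")
#   client.create_examples(dataset_id=dataset.id, examples=examples)

print(len(examples))
```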
2. Define evaluators

Create evaluators to score your application's performance.
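For illustration, a custom evaluator can be a plain function that scores a single example. The pattern below (take inputs, outputs, and reference outputs; return a score dict) is a common shape for custom evaluators, with illustrative field names:

```python
def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Toy exact-match evaluator: score 1 if the answer matches the reference."""
    score = outputs.get("answer") == reference_outputs.get("answer")
    return {"key": "correctness", "score": int(score)}

# Scoring a single record:
result = correctness(
    inputs={"question": "2 + 2?"},
    outputs={"answer": "4"},
    reference_outputs={"answer": "4"},
)
print(result)  # {'key': 'correctness', 'score': 1}
```

Real evaluators might instead call an LLM judge or compute a similarity metric, but the contract (score one example, return a named score) stays the same.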
3. Run an experiment

Execute your application on the dataset to create an experiment. Configure repetitions, concurrency, and caching to optimize runs.
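A local sketch of what an experiment run does conceptually (in practice the LangSmith SDK handles this): execute the target on every dataset example, optionally several times (repetitions) and in parallel (concurrency), scoring each run. All names here are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def run_experiment(target, examples, evaluator, repetitions=1, max_concurrency=4):
    """Run `target` over each example `repetitions` times, scoring every run."""
    jobs = [ex for ex in examples for _ in range(repetitions)]

    def run_one(ex):
        outputs = target(ex["inputs"])
        return evaluator(ex["inputs"], outputs, ex["outputs"])

    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(run_one, jobs))

# Toy target and exact-match evaluator for demonstration:
target = lambda inputs: {"answer": inputs["question"].upper()}
evaluator = lambda i, o, ref: {"score": int(o["answer"] == ref["answer"])}
examples = [{"inputs": {"question": "hi"}, "outputs": {"answer": "HI"}}]

results = run_experiment(target, examples, evaluator, repetitions=3)
print([r["score"] for r in results])  # [1, 1, 1]
```

Repetitions smooth out nondeterministic model behavior; concurrency trades run time against rate limits on the model provider.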
4. Analyze results

Compare experiments for benchmarking, unit tests, regression tests, or backtesting.
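As a sketch of regression testing, comparing two experiments can be as simple as comparing their aggregate scores. The result shape below is illustrative, not the SDK's actual result format:

```python
def mean_score(results):
    """Average evaluator score across an experiment's runs."""
    return sum(r["score"] for r in results) / len(results)

# Hypothetical per-run scores from a baseline and a candidate experiment:
baseline = [{"score": 1}, {"score": 1}, {"score": 0}, {"score": 1}]
candidate = [{"score": 1}, {"score": 0}, {"score": 0}, {"score": 1}]

delta = mean_score(candidate) - mean_score(baseline)
print(f"{delta:+.2f}")  # -0.25 — a drop, flagging a possible regression
```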
For more on the differences between offline and online evaluation, refer to the Evaluation concepts page.

Get started

Evaluation quickstart

Get started with offline evaluation.

Manage datasets

Create and manage datasets for evaluation through the UI or SDK.

Run offline evaluations

Explore evaluation types, techniques, and frameworks for comprehensive testing.

Analyze results

View and analyze evaluation results, compare experiments, filter data, and export findings.

Run online evaluations

Monitor production quality in real-time from the Observability tab.

Follow tutorials

Learn by following step-by-step tutorials, from simple chatbots to complex agent evaluations.
To set up a LangSmith instance, visit the Platform setup section to choose between cloud, hybrid, or self-hosted. All options include observability, evaluation, prompt engineering, and deployment.
