LangSmith supports two types of evaluations based on when and where they run:

Offline Evaluation

Test before you ship. Run evaluations on curated datasets during development to compare versions, benchmark performance, and catch regressions.

Online Evaluation

Monitor in production. Evaluate real user interactions in real time to detect issues and measure quality on live traffic.

Evaluation workflow

1. Create a dataset

Build a dataset from manually curated test cases, historical production traces, or synthetic data generation.
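As an illustration, dataset examples can be curated as input/output pairs before upload. The `question`/`answer` field names below are arbitrary placeholders, and the commented SDK calls are a sketch (exact signatures vary by SDK version):

```python
# A minimal sketch of curating dataset examples as input/output pairs.
# The "question"/"answer" field names are illustrative, not required.
examples = [
    {
        "inputs": {"question": "What is LangSmith used for?"},
        "outputs": {"answer": "Tracing, evaluating, and monitoring LLM apps."},
    },
    {
        "inputs": {"question": "What are the two evaluation types?"},
        "outputs": {"answer": "Offline and online evaluation."},
    },
]

# With the LangSmith SDK (requires an API key), the examples could then
# be uploaded to a named dataset, roughly:
#
#   from langsmith import Client
#   client = Client()
#   dataset = client.create_dataset(dataset_name="qa-smoke-tests")
#   client.create_examples(dataset_id=dataset.id, examples=examples)

print(len(examples))
```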
2. Define evaluators

Create evaluators to score your application's performance.
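For illustration, a custom evaluator can be a plain function that scores a single example. The pattern below (take inputs, outputs, and reference outputs; return a score dict) is a common shape for custom evaluators, with illustrative field names:

```python
def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Toy exact-match evaluator: score 1 if the answer matches the reference."""
    score = outputs.get("answer") == reference_outputs.get("answer")
    return {"key": "correctness", "score": int(score)}

# Scoring a single record:
result = correctness(
    inputs={"question": "2 + 2?"},
    outputs={"answer": "4"},
    reference_outputs={"answer": "4"},
)
print(result)  # {'key': 'correctness', 'score': 1}
```

Real evaluators might instead call an LLM judge or compute a similarity metric, but the contract (score one example, return a named score) stays the same.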
3. Run an experiment

Execute your application on the dataset to create an experiment. Configure repetitions, concurrency, and caching to optimize runs.
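A local sketch of what an experiment run does conceptually (in practice the LangSmith SDK handles this): execute the target on every dataset example, optionally several times (repetitions) and in parallel (concurrency), scoring each run. All names here are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def run_experiment(target, examples, evaluator, repetitions=1, max_concurrency=4):
    """Run `target` over each example `repetitions` times, scoring every run."""
    jobs = [ex for ex in examples for _ in range(repetitions)]

    def run_one(ex):
        outputs = target(ex["inputs"])
        return evaluator(ex["inputs"], outputs, ex["outputs"])

    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(run_one, jobs))

# Toy target and exact-match evaluator for demonstration:
target = lambda inputs: {"answer": inputs["question"].upper()}
evaluator = lambda i, o, ref: {"score": int(o["answer"] == ref["answer"])}
examples = [{"inputs": {"question": "hi"}, "outputs": {"answer": "HI"}}]

results = run_experiment(target, examples, evaluator, repetitions=3)
print([r["score"] for r in results])  # [1, 1, 1]
```

Repetitions smooth out nondeterministic model behavior; concurrency trades run time against rate limits on the model provider.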
4. Analyze results

Compare experiments for benchmarking, unit tests, regression tests, or backtesting.
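As a sketch of regression testing, comparing two experiments can be as simple as comparing their aggregate scores. The result shape below is illustrative, not the SDK's actual result format:

```python
def mean_score(results):
    """Average evaluator score across an experiment's runs."""
    return sum(r["score"] for r in results) / len(results)

# Hypothetical per-run scores from a baseline and a candidate experiment:
baseline = [{"score": 1}, {"score": 1}, {"score": 0}, {"score": 1}]
candidate = [{"score": 1}, {"score": 0}, {"score": 0}, {"score": 1}]

delta = mean_score(candidate) - mean_score(baseline)
print(f"{delta:+.2f}")  # -0.25 — a drop, flagging a possible regression
```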
For more on the differences between offline and online evaluation, refer to the Evaluation concepts page.

Get started

Evaluation quickstart

Get started with offline evaluation.

Manage datasets

Create and manage datasets for evaluation through the UI or SDK.

Run offline evaluations

Explore evaluation types, techniques, and frameworks for comprehensive testing.

Analyze results

View and analyze evaluation results, compare experiments, filter data, and export findings.

Run online evaluations

Monitor production quality in real-time from the Observability tab.

Follow tutorials

Learn by following step-by-step tutorials, from simple chatbots to complex agent evaluations.
To set up a LangSmith instance, visit the Platform setup section to choose between cloud, hybrid, or self-hosted. All options include observability, evaluation, prompt engineering, and deployment.
