Evaluations

Evaluations let you test and measure an AI chatbot agent's response quality with reusable evaluation sets. Use them to check important answers, review subjective quality, or replay real conversations before and after changes to your agent.

Accessing Evaluations

1. In the left sidebar, select the AI chatbot agent you want to test.
2. Open Evaluations.

The page contains two sections:

Evaluation Sets: saved tests you can edit, duplicate, run, or delete.
Run History: past and in-progress evaluation runs, each with its status, score, the user who started it, credit usage, date, and result actions.

[Image: Evaluations page with evaluation sets and run history]

Evaluation Types

Each evaluation set uses one type. Choose the type based on what you want to measure.

Type	Best For	Setup
Predefined	Accuracy checks against reference answers.	Add one or more user questions and an expected answer for each question.
AI-Driven	Qualitative checks such as tone, helpfulness, coverage, or style.	Describe the goals the AI should be judged against.
Session Replay	Regression and repeatability checks using a real conversation.	Enter a recorded Chat Session ID to replay and compare new output against historical replies.
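To plan evaluation sets before entering them in the UI, the three types can be modelled as plain records. The sketch below is only an illustration: the class names, field names, and the `missing_fields` helper are hypothetical and are not part of any Chatislav API. It also assumes that Name is required for every type.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class PredefinedCase:
    question: str
    expected_answer: str


@dataclass
class PredefinedSet:
    name: str
    cases: List[PredefinedCase] = field(default_factory=list)


@dataclass
class AIDrivenSet:
    name: str
    evaluation_goals: str


@dataclass
class SessionReplaySet:
    name: str
    chat_session_id: str


# Hypothetical union of the three evaluation types described in the table.
EvaluationSet = Union[PredefinedSet, AIDrivenSet, SessionReplaySet]


def missing_fields(eval_set: EvaluationSet) -> List[str]:
    """Return human-readable problems found in a planned evaluation set."""
    problems = []
    if not eval_set.name.strip():
        problems.append("name is empty")  # assumption: Name is required
    if isinstance(eval_set, PredefinedSet):
        if not eval_set.cases:
            problems.append("no questions added")
        for i, case in enumerate(eval_set.cases, start=1):
            if not case.question.strip():
                problems.append(f"question {i} is empty")
            if not case.expected_answer.strip():
                problems.append(f"question {i} has no expected answer")
    elif isinstance(eval_set, AIDrivenSet):
        if not eval_set.evaluation_goals.strip():
            problems.append("evaluation goals are empty")
    elif isinstance(eval_set, SessionReplaySet):
        if not eval_set.chat_session_id.strip():
            problems.append("chat session ID is empty")
    return problems
```

Running `missing_fields` over a draft list of sets is a quick way to catch a predefined question that still lacks its expected answer before you start typing it into the Create Evaluation modal.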

Creating a Predefined Evaluation

Use Predefined when there are specific questions your agent must answer correctly.

1. Click Create Evaluation to open the Create Evaluation modal.
2. Enter a Name.
3. Select Predefined.
4. Add each test Question.
5. Add the required Expected Answer for each question.
6. Click Create Evaluation in the modal to save the evaluation set.

[Image: Create Evaluation modal with Predefined evaluation selected]

Creating an AI-Driven Evaluation

Use AI-Driven when you want to judge qualities that cannot be checked against a strict reference answer, such as tone, helpfulness, coverage, or style.

1. Click Create Evaluation to open the Create Evaluation modal.
2. Enter a Name.
3. Select AI-Driven.
4. Describe the Evaluation Goals.
5. Click Create Evaluation in the modal to save the evaluation set.

AI-Driven evaluations are useful for qualitative checks, but they are less reliable than Predefined evaluations for verifying factual accuracy against a source of truth.

[Image: Create Evaluation modal with AI-Driven evaluation selected and Evaluation Goals field]

Creating a Session Replay Evaluation

Use Session Replay when you want to check whether a real conversation still behaves correctly after prompt, training, model, or action changes.

1. Click Create Evaluation to open the Create Evaluation modal.
2. Enter a Name.
3. Select Session Replay.
4. Enter the Chat Session ID from a recorded conversation.
5. Click Create Evaluation in the modal to save the evaluation set.

[Image: Create Evaluation modal with Session Replay selected and Chat Session ID field]
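The idea behind a replay, comparing new replies against the historical ones turn by turn, can be illustrated offline. This page does not describe how Chatislav scores a replay, so the sketch below substitutes a simple text-similarity ratio from Python's standard `difflib` module. The function names and the 0.6 threshold are assumptions made for illustration only.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def reply_similarity(original: str, replayed: str) -> float:
    """Rough 0..1 text similarity; a stand-in, not the product's judge."""
    return SequenceMatcher(None, original.lower(), replayed.lower()).ratio()


def flag_drift(pairs: List[Tuple[str, str]], threshold: float = 0.6) -> List[int]:
    """Return 0-based turn indexes whose replayed reply falls below the threshold.

    Each pair is (historical reply, newly generated reply).
    """
    return [
        index
        for index, (original, replayed) in enumerate(pairs)
        if reply_similarity(original, replayed) < threshold
    ]
```

Character-level similarity only catches wording drift. A reply can be reworded and still be correct, which is why a judged comparison is more useful than a raw diff for real regression checks.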

Creating Evaluations from Chats

You can also start from a real conversation in Chats.

1. Open Chats for the selected agent.
2. Select a conversation.
3. Click Create Evaluation.
4. Choose Replay eval to create a session replay draft from the conversation, or Predefined eval to build an editable predefined evaluation from the chat.

Chatislav opens the Evaluations page with the draft ready to review before saving.

[Image: Chats screen with Create Evaluation menu showing Replay eval and Predefined eval]
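Conceptually, turning a chat into a predefined evaluation means pairing each user question with the agent reply that followed it, so the reply becomes an editable expected answer. The sketch below shows one way to do that pairing. The message shape (`role` and `text` keys, with roles `user` and `agent`) is a hypothetical format, and the actual conversion Chatislav performs may differ.

```python
from typing import Dict, List, Optional, Tuple


def pairs_from_transcript(messages: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Pair each user message with the first agent reply that follows it."""
    pairs: List[Tuple[str, str]] = []
    pending_question: Optional[str] = None
    for message in messages:
        role = message["role"]
        text = message["text"].strip()
        if role == "user":
            # A later user message replaces one that was never answered.
            pending_question = text
        elif role == "agent" and pending_question is not None:
            pairs.append((pending_question, text))
            pending_question = None  # extra agent messages are not paired
    return pairs
```

Whatever produces the draft, review each pair before saving: a reply taken from a live chat is only a good expected answer if it was correct in the first place.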

Running an Evaluation

From Evaluation Sets, click Run on the set you want to test. Chatislav asks you to confirm before starting, because a run consumes credits for both the agent calls and the judge calls it executes.

While a run is active, it appears in Run History with a running status. Running evaluations can be cancelled. If a run is cancelled after some results were written, those partial results remain available.

Reading Results

Open a run from Run History to review the result details.

Run details can include:

Status and Score.
Started by and run date.
Progress and number of result items.
Bot Credits, Judge Credits, optional Question Generation credits, and Total Credits.
A summary of average metrics and overall score.
Detailed result rows with the tested question, agent answer, expected or original answer, metric badges, and feedback.

[Image: Evaluation Run Details modal with score, credits, summary, and detailed result item]
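If you copy run details into your own notes or spreadsheet, the credit and summary figures can be cross-checked with a few lines of code. The sketch below assumes that Total Credits is the plain sum of the listed credit categories and that the summary averages each metric over the result rows that report it. Neither assumption is confirmed on this page, and the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional


def total_credits(
    bot: float, judge: float, question_generation: Optional[float] = None
) -> float:
    """Sum the credit categories; question generation counts only when present."""
    return bot + judge + (question_generation or 0)


def average_metrics(rows: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric across the result rows that report it."""
    names = sorted({name for row in rows for name in row})
    return {
        name: mean(row[name] for row in rows if name in row)
        for name in names
    }
```

For example, `total_credits(12, 8, 3)` adds the three categories together, and `average_metrics` skips rows that lack a given metric rather than treating them as zero.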

Best Practices

Use Predefined evaluations for business-critical factual answers.
Use AI-Driven evaluations for subjective quality checks such as tone and coverage.
Use Session Replay before publishing major prompt, training data, action, or model changes.
Duplicate an evaluation set before making large edits so you can compare old and new versions.
Review Run History over time to spot regressions after agent changes.
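Spotting regressions in Run History amounts to comparing each run's score with the one before it. As a minimal sketch, assuming you have noted the scores of successive runs of the same evaluation set in chronological order, the hypothetical helper below flags any run that dropped by more than a chosen tolerance. The 0.05 default is arbitrary.

```python
from typing import List, Tuple


def find_regressions(
    scores: List[Tuple[str, float]], tolerance: float = 0.05
) -> List[Tuple[str, float]]:
    """Return (run label, drop) for runs that scored lower than the previous run.

    `scores` is a chronological list of (run label, score) pairs; a run is
    flagged only when its score fell by more than `tolerance`.
    """
    regressions = []
    for (_, previous), (label, current) in zip(scores, scores[1:]):
        drop = previous - current
        if drop > tolerance:
            regressions.append((label, round(drop, 4)))
    return regressions
```

A small tolerance is useful because AI-judged scores can vary slightly between runs even when nothing about the agent has changed.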
