Evaluations

Evaluations let you test and measure an AI chatbot agent's response quality with reusable evaluation sets. Use them to check important answers, review subjective quality, or replay real conversations before and after changes to your agent.

Accessing Evaluations

  1. In the left sidebar, select the AI chatbot agent you want to test.
  2. Open Evaluations.

The page contains two sections:

  • Evaluation Sets: saved tests you can edit, duplicate, run, or delete.
  • Run History: past and in-progress runs, each showing status, score, the user who started the run, credit usage, date, and result actions.

[Image: Evaluations page with evaluation sets and run history]

Evaluation Types

Each evaluation set uses one type. Choose the type based on what you want to measure.

| Type | Best For | Setup |
| --- | --- | --- |
| Predefined | Accuracy checks against reference answers. | Add one or more user questions and an expected answer for each question. |
| AI-Driven | Qualitative checks such as tone, helpfulness, coverage, or style. | Describe the goals the AI should be judged against. |
| Session Replay | Regression and repeatability checks using a real conversation. | Enter a recorded Chat Session ID to replay and compare new output against historical replies. |
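
One way to keep the three types apart is to picture each as a small record with its own required setup fields. The Python sketch below is only an illustration of that idea: the class names, field names, and the setup_summary helper are hypothetical and do not correspond to any documented Chatislav API or export format.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class PredefinedSet:
    """Accuracy checks: each question is paired with an expected answer."""
    name: str
    cases: list[tuple[str, str]]  # (question, expected_answer) pairs


@dataclass
class AIDrivenSet:
    """Qualitative checks: the judge is given free-text goals."""
    name: str
    evaluation_goals: str


@dataclass
class SessionReplaySet:
    """Regression checks: a recorded conversation is replayed."""
    name: str
    chat_session_id: str


# Hypothetical umbrella type: every evaluation set uses exactly one type.
EvaluationSet = Union[PredefinedSet, AIDrivenSet, SessionReplaySet]


def setup_summary(eval_set: EvaluationSet) -> str:
    """Describe the setup each hypothetical record carries."""
    if isinstance(eval_set, PredefinedSet):
        return f"{eval_set.name}: {len(eval_set.cases)} question(s) with expected answers"
    if isinstance(eval_set, AIDrivenSet):
        return f"{eval_set.name}: judged against goals"
    return f"{eval_set.name}: replays session {eval_set.chat_session_id}"
```

The point of the sketch is that the setup differs per type: question and answer pairs, free-text goals, or a single session ID.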

 

Creating a Predefined Evaluation

Use Predefined when there are specific questions your agent must answer correctly.

  1. Click Create Evaluation.
  2. Enter a Name.
  3. Select Predefined.
  4. Add each test Question.
  5. Add the required Expected Answer for each question.
  6. Click Create Evaluation.

[Image: Create Evaluation modal with Predefined evaluation selected]
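
Because every question needs an expected answer, it can help to draft the pairs offline and check them before typing them into the modal. The following is a minimal Python sketch of such a check; the dictionary keys, the find_incomplete_cases helper, and the sample questions are assumptions made for illustration, not part of Chatislav.

```python
def find_incomplete_cases(cases: list[dict[str, str]]) -> list[int]:
    """Return 1-based positions of cases missing a question or an expected answer."""
    incomplete = []
    for position, case in enumerate(cases, start=1):
        question = case.get("question", "").strip()
        expected = case.get("expected_answer", "").strip()
        if not question or not expected:
            incomplete.append(position)
    return incomplete


# Hypothetical draft: the second case has no expected answer yet.
draft = [
    {"question": "What are your opening hours?",
     "expected_answer": "9:00 to 17:00, Monday to Friday."},
    {"question": "Do you ship internationally?", "expected_answer": ""},
]
print(find_incomplete_cases(draft))  # [2]
```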

 

Creating an AI-Driven Evaluation

Use AI-Driven when you want to judge qualities that cannot be checked by matching a reference answer, such as tone, helpfulness, coverage, or style.

  1. Click Create Evaluation.
  2. Enter a Name.
  3. Select AI-Driven.
  4. Describe the Evaluation Goals.
  5. Click Create Evaluation.

AI-Driven evaluations are useful for qualitative checks, but they are less reliable than Predefined evaluations when you need to verify factual accuracy against a source of truth.

[Image: Create Evaluation modal with AI-Driven evaluation selected and Evaluation Goals field]

Creating a Session Replay Evaluation

Use Session Replay when you want to check whether a real conversation still behaves correctly after prompt, training, model, or action changes.

  1. Click Create Evaluation.
  2. Enter a Name.
  3. Select Session Replay.
  4. Enter the Chat Session ID from a recorded conversation.
  5. Click Create Evaluation.

[Image: Create Evaluation modal with Session Replay selected and Chat Session ID field]
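
To illustrate what comparing a replay against the recorded conversation means, the Python sketch below lines up the original replies with the new ones, turn by turn, and flags turns that changed noticeably. It uses plain string similarity from the standard library's difflib as a stand-in; how Chatislav's judge actually compares replies is not described here, and the changed_turns helper and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def changed_turns(original_replies: list[str],
                  new_replies: list[str],
                  threshold: float = 0.8) -> list[int]:
    """Return 1-based turn numbers whose text similarity falls below the threshold.

    String similarity is only a rough stand-in for a real judge.
    zip() stops at the shorter list, so extra turns on either side are ignored.
    """
    flagged = []
    for turn, (old, new) in enumerate(zip(original_replies, new_replies), start=1):
        similarity = SequenceMatcher(None, old, new).ratio()  # 0.0 to 1.0
        if similarity < threshold:
            flagged.append(turn)
    return flagged
```

A check like this only catches wording drift. A reply can be reworded and still be correct, which is why a judge-based comparison is more useful than raw text matching.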

 

Creating Evaluations from Chats

You can also start from a real conversation in Chats.

  1. Open Chats for the selected agent.
  2. Select a conversation.
  3. Click Create Evaluation.
  4. Choose Replay eval to create a session replay draft from the conversation, or Predefined eval to build an editable predefined evaluation from the chat.

Chatislav opens the Evaluations page with the draft ready to review before saving.

[Image: Chats screen with Create Evaluation menu showing Replay eval and Predefined eval]

Running an Evaluation

From Evaluation Sets, click Run on the set you want to test. Chatislav asks you to confirm before starting, because a run uses credits for the agent calls and judge calls it executes.

While a run is active, it appears in Run History with a running status. You can cancel a running evaluation; any results written before the cancellation remain available.
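
The cancellation behavior can be pictured as a loop that checks for a cancel request between items and keeps whatever it has already recorded. This Python sketch is a conceptual model only: run_with_cancellation, answer_fn, and should_cancel are hypothetical names, and Chatislav's internal implementation is not documented here.

```python
from typing import Callable


def run_with_cancellation(
    questions: list[str],
    answer_fn: Callable[[str], str],
    should_cancel: Callable[[], bool],
) -> tuple[str, list[tuple[str, str]]]:
    """Process questions in order; stop early on cancel but keep partial results."""
    results: list[tuple[str, str]] = []
    for question in questions:
        if should_cancel():
            # Results written so far are returned, not discarded.
            return "cancelled", results
        results.append((question, answer_fn(question)))
    return "completed", results
```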

 

Reading Results

Open a run from Run History to review the result details.

Run details can include:

  • Status and Score.
  • Started by and run date.
  • Progress and number of result items.
  • Bot Credits, Judge Credits, optional Question Generation credits, and Total Credits.
  • A summary of average metrics and overall score.
  • Detailed result rows with the tested question, agent answer, expected or original answer, metric badges, and feedback.

[Image: Evaluation Run Details modal with score, credits, summary, and detailed result item]
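
As a rough guide to reading these numbers, the Python sketch below assumes that Total Credits is the sum of the listed credit categories and that the summary averages each metric across the result rows. Both are assumptions for illustration, and the metric names are hypothetical; the run details in Chatislav remain the authoritative figures.

```python
from statistics import mean
from typing import Optional


def total_credits(bot: int, judge: int, question_generation: Optional[int] = None) -> int:
    """Sum the credit categories; question generation is optional (assumed additive)."""
    return bot + judge + (question_generation or 0)


def average_metrics(rows: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric over the result rows that report it."""
    names = sorted({name for row in rows for name in row})
    return {name: mean(row[name] for row in rows if name in row) for name in names}


# Hypothetical result rows with made-up metric names and values.
rows = [
    {"accuracy": 1.0, "tone": 0.5},
    {"accuracy": 0.5, "tone": 0.5},
]
print(total_credits(bot=10, judge=4))   # 14
print(average_metrics(rows))            # {'accuracy': 0.75, 'tone': 0.5}
```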

 

Best Practices

  • Use Predefined evaluations for business-critical factual answers.
  • Use AI-Driven evaluations for subjective quality checks such as tone and coverage.
  • Use Session Replay before publishing major prompt, training data, action, or model changes.
  • Duplicate an evaluation set before making large edits so you can compare old and new versions.
  • Review Run History over time to spot regressions after agent changes.
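
If you note down the scores from Run History after each agent change, spotting regressions becomes a simple comparison between consecutive runs. The Python sketch below shows one way to do that by hand; the find_regressions helper, the tolerance value, and the 0 to 1 score scale are assumptions for illustration, not a Chatislav feature.

```python
def find_regressions(scores: list[float], tolerance: float = 0.0) -> list[int]:
    """Return 0-based indices of runs whose score dropped by more than
    `tolerance` compared with the run immediately before them.

    `scores` is expected in chronological order, oldest run first.
    """
    return [
        index
        for index in range(1, len(scores))
        if scores[index - 1] - scores[index] > tolerance
    ]


# Hypothetical scores copied from four consecutive runs.
history = [0.90, 0.92, 0.80, 0.85]
print(find_regressions(history, tolerance=0.05))  # [2]
```

A small tolerance avoids flagging minor score fluctuations between runs that do not reflect a real change in the agent.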