Add AI Evaluations documentation in Glific#618

Merged
tanuprasad530 merged 7 commits into main from add-new-ai-evals-documentation on May 22, 2026

Conversation

@mahajantejas mahajantejas commented May 21, 2026

Added documentation for AI Evaluations in Glific, covering prerequisites, navigation, running evaluations, reviewing results, understanding cosine similarity, and best practices.

Summary by CodeRabbit

  • Documentation
    • Added comprehensive guide on AI Evaluations feature, including setup instructions, evaluation creation workflow, and results review with cosine similarity scoring details.
    • Included documentation on Golden QA library usage and dataset management best practices.

coderabbitai Bot commented May 21, 2026

Warning

Rate limit exceeded

@mahajantejas has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 52 minutes and 18 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 200c3246-7cb4-4867-a04c-972aa482ad11

📥 Commits

Reviewing files that changed from the base of the PR and between ff8044f and bbeea80.

📒 Files selected for processing (1)
  • docs/5. Integrations/AI Evaluations in Glific.md
📝 Walkthrough

New documentation page for AI Evaluations in Glific covering setup, running evaluations with Golden QA datasets, reviewing results via cosine similarity scoring, and managing the evaluation dataset library. 141 lines added across introduction, workflow guides, result interpretation, and reference material.

Changes

AI Evaluations in Glific

| Layer / File(s) | Summary |
|---|---|
| Overview and Running Evaluations: `docs/5. Integrations/AI Evaluations in Glific.md` (lines 1–75) | Page header with read time and difficulty metadata; overview of the AI Evaluations feature; prerequisites and navigation instructions to reach the AI Evals area; end-to-end guide to creating an evaluation by selecting or uploading a Golden QA CSV, setting duplication factor, choosing an AI assistant version, naming the run, and executing it. |
| Reviewing Results and Golden QA Library: `docs/5. Integrations/AI Evaluations in Glific.md` (lines 75–141) | Instructions for downloading and analyzing completed evaluation results and interpreting cosine similarity scores with guidance for low and high scoring cases; best practices for running evaluations; and comprehensive Golden QA library section describing how to upload, browse, search, sort, and download datasets for reuse in future evaluations. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Suggested reviewers

  • SangeetaMishr
  • mdshamoon

Poem

🐰 A guide for evaluations, thorough and bright,
Golden QA datasets now shine in the light,
Cosine similarity shows the way,
AI Assistants help every day,
Glific's truth measured with care and delight! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main change: adding documentation for AI Evaluations in Glific. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

github-actions Bot commented May 21, 2026

@github-actions github-actions Bot temporarily deployed to pull request May 21, 2026 08:42 Inactive

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/5. Integrations/AI Evaluations in Glific.md`:
- Line 94: Fix the typo "roq" to "row" in the sentence describing question_id in
the results csv: locate the line containing "In the results csv \"question_id\"
is referring to the question number from the golden QA list. This means question
id of the question in the first roq of the Golden QA csv will be 1 and so on."
and change "roq" to "row" so it reads "first row of the Golden QA csv". Ensure
the corrected text preserves existing punctuation and casing for "question_id"
and "Golden QA csv".
- Line 34: Change the Step headings from h4 to h3 so headings increment
correctly (h2 → h3 → h4...). Specifically update the "Step 1: Click \"+ Create
AI Evaluation\"" heading and apply the same change to the subsequent Step 2–Step
5 headings in the "AI Evaluations in Glific" section so each step uses an h3
instead of h4.
- Line 27: Fix the garbled text in the "Evaluation Name" column description:
update the sentence that currently reads "The name you gave the evaluation,
along with the AI Assistant version and Golden QA dataset used along with its
duplication factor.extra “.”." to a clean, grammatically correct form (e.g.,
"The name you gave the evaluation, plus the AI Assistant version and the Golden
QA dataset used, including its duplication factor."). Locate the "Evaluation
Name" description in the same document and replace the corrupted fragment
"duplication factor.extra “.”." with the corrected phrasing.
- Line 53: Replace the non-descriptive link text "here" in the Tip line ("Tip:
Your CSV must follow the format... Access the template from the link
[here](https://docs.google.com/.../copy)") with a descriptive label such as "CSV
template" or "CSV template (Google Sheets)"; update the markdown link so the
visible text clearly describes the destination (e.g., "CSV template (Google
Sheets)") while keeping the existing URL intact to address the MD059
descriptive-link-text violation.
- Line 88: The sentence containing "Open the results CSV in a google spreadsheet
to perform further analysis and interpret the results of the evaluation."
incorrectly lowercases the proper noun "Google"; update that string to "Open the
results CSV in a Google spreadsheet to perform further analysis and interpret
the results of the evaluation." — edit the markdown where that sentence appears
(search for the exact phrase) and replace "google" with "Google".

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eb88ab06-8184-45fc-b301-d2d39fd66861

📥 Commits

Reviewing files that changed from the base of the PR and between 925f03f and ff8044f.

📒 Files selected for processing (1)
  • docs/5. Integrations/AI Evaluations in Glific.md

Comment thread docs/5. Integrations/AI Evaluations in Glific.md Outdated (5 threads)
mahajantejas and others added 6 commits May 21, 2026 14:16
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@github-actions github-actions Bot temporarily deployed to pull request May 21, 2026 08:50 Inactive
<img width="1408" height="771" alt="Screenshot 2026-05-21 at 10 50 06 AM" src="https://github.com/user-attachments/assets/4601f2e2-23b0-49f2-9507-fc8fbbe334ca" />

The page shows a table of all past evaluations with the following columns:
- Evaluation Name — The name you gave the evaluation, along with the AI Assistant version and Golden QA dataset used along with its duplication factor.

Here, "Golden QA dataset" and "duplication factor" are both new terms for users. It might be good to add something like "(More on this below)", as you did for cosine similarity.

### Step 5: Run the Evaluation
Click the "Run Evaluation" button to start the evaluation.

Glific will now send each question from your Golden QA dataset to the selected AI Assistant and compare the responses against the expected answers. The evaluation will appear in the AI Evaluations list with a "**Completed**" status once it finishes. Time taken to complete the evaluation run depends on the number of golden questions and answers. A good estimation of time range would be 15-30 mins, can even go to 45 mins.

Add "compare the responses generated by this version of AI Assistant against the expected answer" or similar.

Right now, it is slightly unclear which responses we are referring to.

## Part 2: Reviewing Results
### Viewing Evaluation Results
Once an evaluation is complete, it appears in the AI Evaluations tab with its status, cosine similarity score, and completion timestamp.

Suggestion: We should add a screenshot here showing how the evaluation appears once completed on the evaluation page, along with highlighting the button that users need to click to download the results.

In the results csv "question_id" is referring to the question number from the golden QA list. This means question id of the question in the first row of the Golden QA csv will be 1 and so on.

## Understanding Cosine Similarity
The Cosine Similarity score tells you how meaningfully similar the AI Assistant's actual answers were to the expected "golden" answers. You can hover over the ⓘ icon next to the column header to see an explanation for what cosine similarity means.

Suggestion: We should add a screenshot here as well showing where the ⓘ icon is located, since it may be difficult for a first-time user to identify or navigate to it easily, especially given that it is a small icon.
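
For readers of this thread who want a concrete picture of the score being documented, here is a minimal sketch of cosine similarity between two vectors. It assumes the answers have already been turned into numeric embeddings; the embedding model Glific uses is not stated in this PR, so the three-dimensional vectors below are made up purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" (illustrative values, not produced by any real model).
golden = [1.0, 2.0, 3.0]
close = [2.0, 4.0, 6.0]   # same direction as golden, different length
far = [3.0, 0.0, -1.0]    # perpendicular to golden

print(cosine_similarity(golden, close))  # approximately 1.0
print(cosine_similarity(golden, far))    # approximately 0.0
```

The score depends only on the direction of the vectors, not their length, which is why two answers of different lengths can still score close to 1.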

| < 0.3 | The response has drifted significantly in meaning, even if some words overlap — the assistant may need tuning |


- **Analyze answers that are below 0.3** — Cosine similarity can be good starting indicator to weed out answers that are not aligned at all. So starting with answers that are low scoring and figuring out how to improve the scores on these is a great start. Consistently scoring above 0.7 is a good indicator that the AI answers are aligned to your expectations. However, following nuances can be kept in mind:

I feel NGOs may not understand "weed out".
Suggestion: use a simpler phrase here, such as "identify" or "filter out".
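
The review step described in the excerpt above (start with answers scoring below 0.3 and trace them back to the Golden QA list) could be sketched roughly as follows. Only `question_id` and its 1-based mapping to Golden QA rows come from the documentation under review; the column names `cosine_similarity` and `Question`, the helper name, and the sample rows are hypothetical and would need to be adjusted to the real CSV headers.

```python
import csv
import io

LOW_SCORE_THRESHOLD = 0.3  # low-score band mentioned in the documentation


def low_scoring_questions(golden_csv, results_csv, threshold=LOW_SCORE_THRESHOLD):
    """Return (question_id, question, score) tuples scoring below threshold."""
    # question_id is 1-based: the first data row of the Golden QA CSV is 1.
    golden_rows = list(csv.DictReader(golden_csv))
    flagged = []
    for row in csv.DictReader(results_csv):
        score = float(row["cosine_similarity"])  # hypothetical column name
        if score < threshold:
            qid = int(row["question_id"])
            question = golden_rows[qid - 1]["Question"]  # hypothetical column name
            flagged.append((qid, question, score))
    # Lowest scores first, so the weakest answers are reviewed first.
    return sorted(flagged, key=lambda item: item[2])


# Made-up sample data standing in for the two downloaded CSV files.
golden_sample = io.StringIO(
    "Question,Answer\n"
    "What are your office hours?,We are open 9 am to 5 pm.\n"
    "How do I reset my password?,Use the forgot password link.\n"
)
results_sample = io.StringIO(
    "question_id,cosine_similarity\n"
    "1,0.82\n"
    "2,0.21\n"
)

flagged = low_scoring_questions(golden_sample, results_sample)
print(flagged)  # [(2, 'How do I reset my password?', 0.21)]
```

With real files, the two `io.StringIO` objects would be replaced by opened file handles for the Golden QA CSV and the downloaded results CSV.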

Each dataset is a CSV file containing a set of questions paired with their ideal (or "golden") answers with a certain duplication factor. Once a dataset has been uploaded, it can be seen here and can be re-used from for multiple eval runs.

## How to Use It
**Uploading a Golden QA Dataset** Golden QA datasets can be uploaded from the Create AI Evaluation form (accessed via the `+ Create AI Evaluation` button on the AI Evaluations tab). On that form, click Upload Golden QA to upload a new CSV file. A template is available via link on the create form to help you get started quickly.

Suggestion: Could we link the sample CSV template here? Also, could we link the AI Evaluation section as well?


**Downloading a Dataset** Each row in the table has a download icon (↓) in the Actions column. Clicking it downloads the corresponding CSV file, which is useful for reviewing or auditing the question-answer pairs, or for making edits before re-uploading a revised version.

**Using a Dataset in an Evaluation** When creating an AI Evaluation, use the "Search or select a Golden QA" dropdown to pick an existing dataset from the Golden QA library. Combine it with an AI Assistant selection and an Evaluation Name, then click Run Evaluation. The platform will send each question in the dataset to the chosen AI Assistant, compare the responses to the golden answers, and report a cosine similarity score once the run is complete.

This seems repetitive and redundant, since it has already been included. It can be removed.


- **Analyze answers that are below 0.3** — Cosine similarity can be good starting indicator to weed out answers that are not aligned at all. So starting with answers that are low scoring and figuring out how to improve the scores on these is a great start. Consistently scoring above 0.7 is a good indicator that the AI answers are aligned to your expectations. However, following nuances can be kept in mind:
- **Low-scoring evaluations don't always mean failure** — review the downloaded results to identify which specific questions scored poorly. You may find patterns that can guide improvements to your assistant's knowledge base or prompt instructions. For some questions, it may be ok to get lower scores ex- your AI assistant is catching edge cases and not answering to harmful or potentially misleading questions.
- **High-scoring evaluations more than 0.7 don’t always mean correct answers** — review the results to identify if the answers are also complete. Once the majority of the answers are scoring high on cosine similarity more evaluators can be added to help further improve the correctness and completeness of answers. Connect with Glific team to understand how this can be enabled.

@tanuprasad530 tanuprasad530 May 22, 2026

Along with completeness, correctness also needs to be reviewed. For example, both the Golden Answer and the generated answer may contain a numerical value, but the values themselves could be different. In such cases, the generated answer may appear complete but still be incorrect. However, since the rest of the response is semantically similar, it may still receive a high cosine similarity score.

Might be good to add it in the first line too - along with checking for completeness.
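
As one illustration of the correctness point above, a reviewer could flag high-scoring rows where the numbers in the generated answer differ from those in the golden answer. This is only a rough heuristic sketch, not a feature of Glific: the helper names and the sample sentences are hypothetical, and the check ignores units, number words, and formatting such as thousands separators.

```python
import re

# Matches integers and simple decimals such as 5 or 2.5.
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def numbers_in(text):
    """Return the numeric values found in text, sorted for comparison."""
    return sorted(float(match) for match in NUMBER_PATTERN.findall(text))


def numbers_match(golden_answer, generated_answer):
    """True when both answers contain exactly the same numeric values."""
    return numbers_in(golden_answer) == numbers_in(generated_answer)


# Made-up example: nearly identical wording, but one value differs.
golden = "The training runs for 5 days and starts at 10 am."
generated = "The training runs for 7 days and starts at 10 am."

print(numbers_match(golden, generated))  # False: 5 vs 7
print(numbers_match(golden, golden))     # True
```

A mismatch here does not prove the answer is wrong, but it points to rows worth reading by hand even when the cosine similarity score is high.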

@tanuprasad530 tanuprasad530 merged commit d78040a into main May 22, 2026
7 checks passed
@tanuprasad530 tanuprasad530 deleted the add-new-ai-evals-documentation branch May 22, 2026 03:22
2 participants