Add AI Evaluations documentation in Glific #618
Added documentation for AI Evaluations in Glific, covering prerequisites, navigation, running evaluations, reviewing results, understanding cosine similarity, and best practices.
📝 Walkthrough
New documentation page for AI Evaluations in Glific covering setup, running evaluations with Golden QA datasets, reviewing results via cosine similarity scoring, and managing the evaluation dataset library. 141 lines added across introduction, workflow guides, result interpretation, and reference material.
🚀 Deployed on https://deploy-preview-618--glific-docs.netlify.app
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/5. Integrations/AI Evaluations in Glific.md`:
- Line 94: Fix the typo "roq" to "row" in the sentence describing question_id in
the results csv: locate the line containing "In the results csv \"question_id\"
is referring to the question number from the golden QA list. This means question
id of the question in the first roq of the Golden QA csv will be 1 and so on."
and change "roq" to "row" so it reads "first row of the Golden QA csv". Ensure
the corrected text preserves existing punctuation and casing for "question_id"
and "Golden QA csv".
- Line 34: Change the Step headings from h4 to h3 so headings increment
correctly (h2 → h3 → h4...). Specifically update the "Step 1: Click \"+ Create
AI Evaluation\"" heading and apply the same change to the subsequent Step 2–Step
5 headings in the "AI Evaluations in Glific" section so each step uses an h3
instead of h4.
- Line 27: Fix the garbled text in the "Evaluation Name" column description:
update the sentence that currently reads "The name you gave the evaluation,
along with the AI Assistant version and Golden QA dataset used along with its
duplication factor.extra “.”." to a clean, grammatically correct form (e.g.,
"The name you gave the evaluation, plus the AI Assistant version and the Golden
QA dataset used, including its duplication factor."). Locate the "Evaluation
Name" description in the same document and replace the corrupted fragment
"duplication factor.extra “.”." with the corrected phrasing.
- Line 53: Replace the non-descriptive link text "here" in the Tip line ("Tip:
Your CSV must follow the format... Access the template from the link
[here](https://docs.google.com/.../copy)") with a descriptive label such as "CSV
template" or "CSV template (Google Sheets)"; update the markdown link so the
visible text clearly describes the destination (e.g., "CSV template (Google
Sheets)") while keeping the existing URL intact to address the MD059
descriptive-link-text violation.
- Line 88: The sentence containing "Open the results CSV in a google spreadsheet
to perform further analysis and interpret the results of the evaluation."
incorrectly lowercases the proper noun "Google"; update that string to "Open the
results CSV in a Google spreadsheet to perform further analysis and interpret
the results of the evaluation." — edit the markdown where that sentence appears
(search for the exact phrase) and replace "google" with "Google".
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: eb88ab06-8184-45fc-b301-d2d39fd66861
📒 Files selected for processing (1)
docs/5. Integrations/AI Evaluations in Glific.md
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
| <img width="1408" height="771" alt="Screenshot 2026-05-21 at 10 50 06 AM" src="https://github.com/user-attachments/assets/4601f2e2-23b0-49f2-9507-fc8fbbe334ca" /> | ||
| The page shows a table of all past evaluations with the following columns: | ||
| - Evaluation Name — The name you gave the evaluation, along with the AI Assistant version and Golden QA dataset used along with its duplication factor. |
Here, "Golden QA dataset" and "duplication factor" are both new terms for users. It might be good to add something like "(More on this below)", as you did for cosine similarity.
| ### Step 5: Run the Evaluation | ||
| Click the "Run Evaluation" button to start the evaluation. | ||
| Glific will now send each question from your Golden QA dataset to the selected AI Assistant and compare the responses against the expected answers. The evaluation will appear in the AI Evaluations list with a "**Completed**" status once it finishes. Time taken to complete the evaluation run depends on the number of golden questions and answers. A good estimation of time range would be 15-30 mins, can even go to 45 mins. |
Add "compare the responses generated by this version of AI Assistant against the expected answer" or similar.
Right now, it is slightly unclear which responses we are referring to.
| ## Part 2: Reviewing Results | ||
| ### Viewing Evaluation Results | ||
| Once an evaluation is complete, it appears in the AI Evaluations tab with its status, cosine similarity score, and completion timestamp. | ||
Suggestion: We should add a screenshot here showing how the evaluation appears once completed on the evaluation page, along with highlighting the button that users need to click to download the results.
| In the results csv "question_id" is referring to the question number from the golden QA list. This means question id of the question in the first row of the Golden QA csv will be 1 and so on. | ||
| ## Understanding Cosine Similarity | ||
| The Cosine Similarity score tells you how meaningfully similar the AI Assistant's actual answers were to the expected "golden" answers. You can hover over the ⓘ icon next to the column header to see an explanation for what cosine similarity means. |
Suggestion: We should add a screenshot here as well showing where the ⓘ icon is located, since it may be difficult for a first-time user to identify or navigate to it easily, especially given that it is a small icon.
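On the `question_id` sentence quoted in this hunk: a short sketch might help readers see how a results row maps back to the Golden QA list. The snippet below is only an illustration. The column names (`question`, `answer`, `cosine_similarity`) and the sample rows are hypothetical, and it assumes the CSV header row is not counted, so `question_id` 1 is the first data row.

```python
import csv
import io

# Hypothetical sample data; real Golden QA and results CSVs may use other columns.
GOLDEN_CSV = """question,answer
Sample question one,Sample answer one
Sample question two,Sample answer two
"""

RESULTS_CSV = """question_id,cosine_similarity
1,0.82
2,0.27
"""


def join_results_to_golden(golden_text, results_text):
    """Attach each results row to its Golden QA row via the 1-based question_id."""
    golden_rows = list(csv.DictReader(io.StringIO(golden_text)))
    joined = []
    for result in csv.DictReader(io.StringIO(results_text)):
        # Assumption: question_id 1 refers to the first data row (header excluded).
        index = int(result["question_id"]) - 1
        joined.append({**golden_rows[index], **result})
    return joined


rows = join_results_to_golden(GOLDEN_CSV, RESULTS_CSV)
for row in rows:
    print(row["question_id"], row["question"], row["cosine_similarity"])
```

If the real files count rows differently, only the `- 1` offset would need to change.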
| | < 0.3 | The response has drifted significantly in meaning, even if some words overlap — the assistant may need tuning | | ||
| - **Analyze answers that are below 0.3** — Cosine similarity can be good starting indicator to weed out answers that are not aligned at all. So starting with answers that are low scoring and figuring out how to improve the scores on these is a great start. Consistently scoring above 0.7 is a good indicator that the AI answers are aligned to your expectations. However, following nuances can be kept in mind: |
I feel NGOs will not understand "weed out".
Suggestion - We should use a simpler word here - something like "identify or filter out"
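To make the "identify or filter out" step concrete, the doc could also show how a user might sort scores into groups after downloading the results. This is a rough sketch only: the 0.3 and 0.7 cut-offs come from the quoted text, while the function name and the sample scores are made up for illustration.

```python
def bucket_scores(scores, low=0.3, high=0.7):
    """Group question ids by cosine similarity score.

    `scores` maps question_id -> score. The default thresholds mirror the
    0.3 and 0.7 values mentioned in the documentation under review.
    """
    buckets = {"needs_review": [], "middle": [], "aligned": []}
    for question_id, score in sorted(scores.items()):
        if score < low:
            buckets["needs_review"].append(question_id)
        elif score > high:
            buckets["aligned"].append(question_id)
        else:
            buckets["middle"].append(question_id)
    return buckets


# Hypothetical scores, keyed by question_id.
sample_scores = {1: 0.82, 2: 0.27, 3: 0.55}
buckets = bucket_scores(sample_scores)
print(buckets["needs_review"])  # question ids to look at first
```

Starting with the `needs_review` group matches the advice in the bullet: look at the lowest-scoring answers first.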
| Each dataset is a CSV file containing a set of questions paired with their ideal (or "golden") answers with a certain duplication factor. Once a dataset has been uploaded, it can be seen here and can be re-used from for multiple eval runs. | ||
| ## How to Use It | ||
| **Uploading a Golden QA Dataset** Golden QA datasets can be uploaded from the Create AI Evaluation form (accessed via the `+ Create AI Evaluation` button on the AI Evaluations tab). On that form, click Upload Golden QA to upload a new CSV file. A template is available via link on the create form to help you get started quickly. |
Suggestion: Could we link the sample CSV template here? Also, could we link the AI Evaluation section as well?
| **Downloading a Dataset** Each row in the table has a download icon (↓) in the Actions column. Clicking it downloads the corresponding CSV file, which is useful for reviewing or auditing the question-answer pairs, or for making edits before re-uploading a revised version. | ||
| **Using a Dataset in an Evaluation** When creating an AI Evaluation, use the "Search or select a Golden QA" dropdown to pick an existing dataset from the Golden QA library. Combine it with an AI Assistant selection and an Evaluation Name, then click Run Evaluation. The platform will send each question in the dataset to the chosen AI Assistant, compare the responses to the golden answers, and report a cosine similarity score once the run is complete. |
This seems repetitive and redundant, since it has already been included. It can be removed.
| - **Analyze answers that are below 0.3** — Cosine similarity can be good starting indicator to weed out answers that are not aligned at all. So starting with answers that are low scoring and figuring out how to improve the scores on these is a great start. Consistently scoring above 0.7 is a good indicator that the AI answers are aligned to your expectations. However, following nuances can be kept in mind: | ||
| - **Low-scoring evaluations don't always mean failure** — review the downloaded results to identify which specific questions scored poorly. You may find patterns that can guide improvements to your assistant's knowledge base or prompt instructions. For some questions, it may be ok to get lower scores ex- your AI assistant is catching edge cases and not answering to harmful or potentially misleading questions. | ||
| - **High-scoring evaluations more than 0.7 don’t always mean correct answers** — review the results to identify if the answers are also complete. Once the majority of the answers are scoring high on cosine similarity more evaluators can be added to help further improve the correctness and completeness of answers. Connect with Glific team to understand how this can be enabled. |
Along with completeness, correctness also needs to be reviewed. For example, both the Golden Answer and the generated answer may contain a numerical value, but the values themselves could be different. In such cases, the generated answer may appear complete but still be incorrect. However, since the rest of the response is semantically similar, it may still receive a high cosine similarity score.
Might be good to add it in the first line too - along with checking for completeness.
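A small worked example may make this point easier to see. The sketch below is a simplification: it computes cosine similarity over plain word counts, whereas the evaluation presumably compares embedding vectors, and the two sentences are invented. It only illustrates how a wrong number can leave the score high.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between word-count vectors of two texts.

    Simplified stand-in: word counts are used here instead of embeddings.
    """
    counts_a = Counter(re.findall(r"\w+", text_a.lower()))
    counts_b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(counts_a[token] * counts_b[token] for token in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented example: same wording, different number.
golden = "The registration fee is 500 rupees per student"
generated = "The registration fee is 200 rupees per student"

score = cosine_similarity(golden, generated)
print(round(score, 3))  # 0.875: seven of eight words match, yet the fee is wrong
```

The score lands above 0.7 even though the generated answer states the wrong amount, which is why correctness needs a manual check alongside completeness.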