This is a Python library for assessing the quality of question-answering systems, such as systems built with LLM-based agents. It is agnostic to the agent implementation and the LLM it uses.
The evaluation is based on a user-provided reference dataset containing queries, reference responses, and optional reference steps, such as expected tool calls. The evaluator compares these references with the agent's actual responses and executed steps. Reference steps can be grouped so that the steps within a group may occur in any order.
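To make the idea of grouped reference steps concrete, here is a minimal sketch in plain Python. It is not the library's API or dataset schema: the field names (`query`, `reference_response`, `reference_steps`), the tool names, and the helper `groups_satisfied` are all hypothetical. It also assumes that groups must occur in the listed order, that steps inside a group may occur in any order, and that extra agent steps are ignored.

```python
from collections import Counter

# Hypothetical reference entry; field and tool names are illustrative only.
reference_entry = {
    "query": "Which substations feed transformer T1?",
    "reference_response": "Substations A and B.",
    # Two groups: the first two steps may run in either order,
    # but both must finish before the final query step.
    "reference_steps": [
        ["retrieve_schema", "search_entities"],
        ["run_query"],
    ],
}


def groups_satisfied(reference_groups, actual_steps):
    """Return True if every group is matched, in group order.

    Steps inside one group may appear in any order. Actual steps that
    are not expected by the current group are skipped as extras.
    """
    pos = 0
    for group in reference_groups:
        remaining = Counter(group)  # multiset of steps still expected
        while remaining and pos < len(actual_steps):
            step = actual_steps[pos]
            if remaining[step] > 0:
                remaining[step] -= 1
                if remaining[step] == 0:
                    del remaining[step]
            pos += 1
        if remaining:  # ran out of actual steps before the group was complete
            return False
    return True


groups = reference_entry["reference_steps"]
print(groups_satisfied(groups, ["search_entities", "retrieve_schema", "run_query"]))  # True
print(groups_satisfied(groups, ["run_query", "retrieve_schema", "search_entities"]))  # False
```

The first call succeeds because the two steps of the first group are merely swapped. The second fails because the query step ran before the first group was complete. The library's own step comparison may differ, for example by returning a score rather than a boolean.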
The library provides built-in evaluation metrics and supports user-defined custom metrics (§ Metrics).
Developed and maintained by Graphwise. For issues and feature requests, please open a GitHub issue.
Apache-2.0 License. See the LICENSE file for details.
