Merged
5 changes: 3 additions & 2 deletions .env.example
@@ -1,4 +1,5 @@
ANTHROPIC_API_KEY=YOUR_ANTHROPIC_API_KEY
OPENAI_API_KEY=YOUR_OPENAI_API_KEY
-GEMINI_API_KEY=YOUR_GEMINI_API_KEY
-OPENROUTER_API_KEY=YOUR_OPENROUTER_API_KEY
+GOOGLE_API_KEY=YOUR_GOOGLE_API_KEY
+OPENROUTER_API_KEY=YOUR_OPENROUTER_API_KEY
+OLLAMA_BASE_URL=http://localhost:11434
31 changes: 31 additions & 0 deletions .github/workflows/test-and-eval.yml
@@ -0,0 +1,31 @@
name: Tests and Evaluations

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  test-and-eval:
    runs-on: ubuntu-latest
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install uv
        run: |
          curl -LsSf https://astral.sh/uv/install.sh | sh
          echo "$HOME/.cargo/bin" >> $GITHUB_PATH
      - name: Install dependencies
        run: |
          uv sync --dev
      - name: Tests
        run: uv run pytest -q
      - name: Evaluations (Mock Mode)
        run: uv run python -m intent_kit.evals.run_all_evals --quiet --mock
5 changes: 5 additions & 0 deletions .gitignore
@@ -36,6 +36,11 @@ ENV/
htmlcov/
.pytest_cache/
.tox/
reports/

# Evaluation Results
intent_kit/evals/results/
intent_kit/evals/reports/

# Visualization
intentkit_graphs/
201 changes: 200 additions & 1 deletion README.md
@@ -801,6 +801,67 @@ pytest tests/

---

## Evaluation & Benchmarking

intent-kit provides a built-in evaluation framework for benchmarking intent graphs and nodes against real datasets. This is separate from unit/integration tests and is designed for large-scale, reproducible evaluation.

The evaluation framework is now part of the main `intent_kit` package and can be imported as:

```python
from intent_kit.evals import run_all_evaluations, evaluate_node, generate_markdown_report
```

**Organized Structure:**
- **Latest results**: Always available in `intent_kit/evals/results/latest/` and `intent_kit/evals/reports/latest/`
- **Date-based archives**: Historical runs are automatically archived in date-based directories
- **Clean separation**: Reports and raw results are organized separately for easy access
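
As a rough illustration of the layout described above, the sketch below writes a file into a date-stamped archive directory and mirrors it into `latest/`. The helper name `publish_run` and its parameters are hypothetical and are not part of intent-kit; only the `latest/` and `YYYY-MM-DD/` directory naming comes from this README.

```python
import shutil
from datetime import date
from pathlib import Path


def publish_run(source: Path, base_dir: Path, run_date: date) -> Path:
    """Copy `source` into a dated archive directory and mirror it into `latest/`.

    Hypothetical helper for illustration; intent-kit's own archiving may differ.
    """
    archive_dir = base_dir / run_date.isoformat()  # e.g. results/2025-01-31
    latest_dir = base_dir / "latest"
    archive_dir.mkdir(parents=True, exist_ok=True)
    latest_dir.mkdir(parents=True, exist_ok=True)

    archived = archive_dir / source.name
    shutil.copy2(source, archived)
    # Overwrite whatever the previous run left in latest/.
    shutil.copy2(source, latest_dir / source.name)
    return archived
```

Keeping `latest/` as a plain copy rather than a symlink is one possible choice; it stays portable across platforms at the cost of some duplicated disk space.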

### Running All Evals

To run all evaluations and generate comprehensive markdown reports:

```bash
# Run with real API calls (requires API keys)
uv run run-evals

# Run in mock mode (no API keys required)
uv run run-evals --mock
```

- Generates a comprehensive report at `reports/comprehensive_report.md`
- Generates individual reports for each dataset in `reports/`
- Mock mode uses simulated responses for testing without API costs
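
The idea behind mock mode can be pictured with a small stand-in: when a flag is set, a deterministic stub replaces the network call. Everything in this sketch (`mock_llm`, `real_llm`, `get_llm`) is a hypothetical illustration and does not describe how `run-evals --mock` is actually implemented.

```python
from typing import Callable


def mock_llm(prompt: str) -> str:
    """Return a canned, deterministic reply so runs are free and repeatable."""
    return f"[mock] {prompt.strip().lower()}"


def real_llm(prompt: str) -> str:
    """Placeholder for a real provider call; deliberately unimplemented here."""
    raise RuntimeError("real LLM calls need an API key and network access")


def get_llm(mock: bool) -> Callable[[str], str]:
    """Choose the backend once, so the evaluation loop never checks the flag."""
    return mock_llm if mock else real_llm
```

Selecting the backend up front keeps the evaluation code identical in both modes, which is what makes a mock run a meaningful smoke test in CI.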

### Running a Specific Eval

To run a specific node evaluation (with markdown output):

```bash
uv run eval-node --dataset handler_node_llm --output reports/my_eval_report.md
```

- Replace `handler_node_llm` with any dataset name (without the `.yaml` extension)
- Add `--output <file.md>` to save the report to a specific file
- Reports are automatically saved to the `reports/` directory

### Adding New Evals
- Add new YAML datasets to `intent_kit/evals/datasets/`
- Add corresponding node implementations to `intent_kit/evals/sample_nodes/`
- The framework will automatically discover and evaluate them
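
Discovery of this kind could work by pairing each dataset file with a node module of the same name. The sketch below assumes that naming convention (`<name>.yaml` next to `<name>.py`); the function `discover_evals` is hypothetical and the framework's real discovery logic may differ.

```python
from pathlib import Path
from typing import Dict, Optional


def discover_evals(datasets_dir: Path, nodes_dir: Path) -> Dict[str, Optional[Path]]:
    """Map each dataset stem to the node file with the same stem, or None if missing."""
    pairs: Dict[str, Optional[Path]] = {}
    for dataset in sorted(datasets_dir.glob("*.yaml")):
        node_file = nodes_dir / f"{dataset.stem}.py"
        pairs[dataset.stem] = node_file if node_file.is_file() else None
    return pairs
```

Returning `None` for unmatched datasets, rather than skipping them, makes a missing node implementation visible to the caller.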

### Where are the results?
- **Latest reports**: `intent_kit/evals/reports/latest/`
- **Latest results**: `intent_kit/evals/results/latest/`
- **Date-based archives**: `intent_kit/evals/reports/YYYY-MM-DD/` and `intent_kit/evals/results/YYYY-MM-DD/`
- Reports are in markdown format for easy sharing and review
- Raw results are in CSV format for detailed analysis
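
Because the raw results are CSV, they can be post-processed with the standard library alone. The following sketch computes an accuracy figure from CSV text; the column name `passed` and its `True`/`False` values are assumptions made for illustration, so check the header of an actual results file before relying on them.

```python
import csv
import io


def accuracy_from_csv(csv_text: str, passed_column: str = "passed") -> float:
    """Return the fraction of rows whose `passed_column` reads as true.

    The column name is an assumption, not a documented intent-kit schema.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 0.0
    passed = sum(1 for row in rows if row[passed_column].strip().lower() == "true")
    return passed / len(rows)
```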

### When to use evals vs. tests?
- **Unit/Integration tests** (in `tests/`): For correctness, fast feedback, and CI
- **Evals** (in `intent_kit/evals/`): For benchmarking, regression, and real-world performance

---

## Project Structure

```
@@ -837,6 +898,13 @@ intent-kit/
│ │ ├── google_client.py
│ │ ├── ollama_client.py
│ │ └── __init__.py
│ ├── evals/ # Evaluation framework
│ │ ├── __init__.py # Evaluation exports
│ │ ├── run_all_evals.py # Run all evaluations
│ │ ├── run_node_eval.py # Individual node evaluation
│ │ ├── datasets/ # Evaluation datasets
│ │ ├── sample_nodes/ # Sample nodes for evaluation
│ │ └── reports/ # Generated evaluation reports
│ ├── types.py # Type definitions
│ ├── exceptions/ # Custom exceptions
│ └── utils/ # Utilities
@@ -855,4 +923,135 @@ intent-kit/

## License

-MIT License
+MIT License

## Evaluation API

The evaluation API provides a clean Python interface for testing your nodes against YAML datasets.

### Basic Usage

```python
from intent_kit.evals import load_dataset, run_eval
from intent_kit.evals.sample_nodes.classifier_node_llm import classifier_node_llm

# Load a dataset
dataset = load_dataset("intent_kit/evals/datasets/classifier_node_llm.yaml")

# Run evaluation
result = run_eval(dataset, classifier_node_llm)

# Check results
print(f"Accuracy: {result.accuracy():.1%}")
print(f"Passed: {result.passed_count()}/{result.total_count()}")

# Save results (using default locations)
csv_path = result.save_csv()
json_path = result.save_json()
md_path = result.save_markdown()

# Or specify custom paths
result.save_csv("my_results.csv")
result.save_json("my_results.json")
result.save_markdown("my_report.md")
```

### Convenience Functions

```python
from intent_kit.evals import run_eval_from_path, run_eval_from_module

# Evaluate from file path
result = run_eval_from_path(
    "intent_kit/evals/datasets/classifier_node_llm.yaml",
    classifier_node_llm
)

# Evaluate with module loading
result = run_eval_from_module(
    "intent_kit/evals/datasets/classifier_node_llm.yaml",
    "intent_kit.evals.sample_nodes.classifier_node_llm",
    "classifier_node_llm"
)
```

### Custom Comparison

```python
# Case-insensitive comparison
def case_insensitive_comparator(expected, actual):
    return str(expected).lower().strip() == str(actual).lower().strip()

result = run_eval(dataset, node, comparator=case_insensitive_comparator)
```

### Programmatic Datasets

```python
from intent_kit.evals import EvalTestCase, Dataset

# Create test cases programmatically
test_cases = [
    EvalTestCase(
        input="What's the weather like?",
        expected="Weather response",
        context={"user_id": "test"}
    )
]

dataset = Dataset(
    name="my_dataset",
    description="Custom test dataset",
    node_type="classifier",
    node_name="my_node",
    test_cases=test_cases
)

result = run_eval(dataset, my_node)
```

### Dataset Format

YAML datasets should follow this format:

```yaml
dataset:
  name: "my_dataset"
  description: "Test dataset for my node"
  node_type: "classifier"
  node_name: "my_node"

test_cases:
  - input: "What's the weather like in New York?"
    expected: "Weather in New York: Sunny with a chance of rain"
    context:
      user_id: "user123"

  - input: "Cancel my flight"
    expected: "Successfully cancelled flight"
    context:
      user_id: "user123"
```
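
A dataset in this shape can be sanity-checked once it has been parsed into a dictionary. The sketch below is a hypothetical validator, not intent-kit's own: which keys are mandatory is an assumption (here `name`, `node_type`, `node_name`, plus `input` and `expected` per test case, with `test_cases` at the top level), and it works on an already-parsed `dict` so no YAML library is needed.

```python
from typing import Any, Dict, List

# Assumed required keys; the real loader may enforce a different set.
REQUIRED_DATASET_KEYS = ("name", "node_type", "node_name")
REQUIRED_CASE_KEYS = ("input", "expected")


def validate_dataset(data: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems: List[str] = []
    meta = data.get("dataset")
    if not isinstance(meta, dict):
        problems.append("missing 'dataset' mapping")
    else:
        for key in REQUIRED_DATASET_KEYS:
            if key not in meta:
                problems.append(f"dataset is missing '{key}'")
    cases = data.get("test_cases")
    if not isinstance(cases, list) or not cases:
        problems.append("'test_cases' must be a non-empty list")
    else:
        for index, case in enumerate(cases):
            for key in REQUIRED_CASE_KEYS:
                if not isinstance(case, dict) or key not in case:
                    problems.append(f"test case {index} is missing '{key}'")
    return problems
```

Collecting every problem instead of raising on the first one lets a single run report all mistakes in a hand-written dataset.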

### Error Handling

The API handles errors gracefully:

- **Node exceptions**: Caught and recorded in results
- **Missing files**: Clear error messages
- **Malformed datasets**: Validation with helpful error messages
- **Fail-fast option**: Stop evaluation on first failure

```python
# Fail-fast evaluation
result = run_eval(dataset, node, fail_fast=True)
```
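
The behaviour listed above (recording node exceptions, optionally stopping at the first failure) can be sketched as a small loop. This is an illustrative stand-in rather than the actual `run_eval` implementation; `CaseResult` and `run_cases` are hypothetical names, and cases are simplified to `(input, expected)` pairs.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Optional, Tuple


@dataclass
class CaseResult:
    input: Any
    expected: Any
    actual: Any
    passed: bool
    error: Optional[str] = None  # set when the node raised


def run_cases(
    cases: Iterable[Tuple[Any, Any]],
    node: Callable[[Any], Any],
    fail_fast: bool = False,
) -> List[CaseResult]:
    """Run `node` on each case, recording exceptions instead of propagating them."""
    results: List[CaseResult] = []
    for case_input, expected in cases:
        try:
            actual = node(case_input)
            result = CaseResult(case_input, expected, actual, actual == expected)
        except Exception as exc:  # a crashing node counts as a failed case
            message = f"{type(exc).__name__}: {exc}"
            result = CaseResult(case_input, expected, None, False, error=message)
        results.append(result)
        if fail_fast and not result.passed:
            break  # stop at the first failure or error
    return results
```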

### Output Locations

By default, results are saved to the existing intent-kit directory structure:

- **CSV/JSON results**: `intent_kit/evals/results/latest/`
- **Markdown reports**: `intent_kit/evals/reports/latest/`

Files are automatically timestamped to avoid conflicts. You can also specify custom paths if needed.
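
Timestamped file names of this kind are straightforward to build. The sketch below shows one possible scheme; the exact format intent-kit uses is not stated here, so the pattern `<stem>_<YYYY-MM-DD_HH-MM-SS><suffix>` and the helper name `timestamped_path` are assumptions.

```python
from datetime import datetime
from pathlib import Path


def timestamped_path(directory: Path, stem: str, suffix: str, now: datetime) -> Path:
    """Build a file path whose name embeds `now`, so repeated runs do not collide.

    The naming pattern is illustrative, not intent-kit's documented format.
    """
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    return directory / f"{stem}_{stamp}{suffix}"
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the helper deterministic and easy to test.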
14 changes: 14 additions & 0 deletions env.example
@@ -0,0 +1,14 @@
# Example .env file for Intent Kit LLM evaluations
# Copy this to .env and add your actual API keys

# OpenAI API Key (for GPT models)
OPENAI_API_KEY=your-openai-api-key-here

# Anthropic API Key (for Claude models)
ANTHROPIC_API_KEY=your-anthropic-api-key-here

# Google API Key (for Gemini models)
GOOGLE_API_KEY=your-google-api-key-here

# Ollama (local models - no API key needed)
# OLLAMA_BASE_URL=http://localhost:11434
6 changes: 3 additions & 3 deletions examples/advanced_remediation_demo.py
@@ -28,12 +28,12 @@
# --- Setup LLM configs ---
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") or "sk-mock-openai"
-GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") or "sk-mock-gemini"
+GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY") or "sk-mock-gemini"

 LLM_CONFIG_1 = {"provider": "openai",
                 "model": "gpt-4.1-mini", "api_key": OPENAI_API_KEY}
 LLM_CONFIG_2 = {"provider": "google",
-                "model": "gemini-2.5-flash", "api_key": GEMINI_API_KEY}
+                "model": "gemini-2.5-flash", "api_key": GOOGLE_API_KEY}

# --- Core Handler: Simulates model confusion and ambiguity ---

@@ -134,7 +134,7 @@ def main():
     print("• Consensus voting: Multiple models must agree before output is accepted.")
     print("• Alternate prompt: Handler retries with a new prompt if it can't answer.")

-    if "mock" in OPENAI_API_KEY or "mock" in GEMINI_API_KEY:
+    if "mock" in OPENAI_API_KEY or "mock" in GOOGLE_API_KEY:
         print("\n💡 Pro Tip: For real LLM behavior, add your OpenAI and Gemini API keys to a .env file.")

