Stephen-Collins-tech · stephenc222 · Jul 7, 2025 · Jul 4, 2025
diff --git a/.env.example b/.env.example
@@ -1,4 +1,5 @@
 ANTHROPIC_API_KEY=YOUR_ANTHROPIC_API_KEY
 OPENAI_API_KEY=YOUR_OPENAI_API_KEY
-GEMINI_API_KEY=YOUR_GEMINI_API_KEY
-OPENROUTER_API_KEY=YOUR_OPENROUTER_API_KEY
+GOOGLE_API_KEY=YOUR_GOOGLE_API_KEY
+OPENROUTER_API_KEY=YOUR_OPENROUTER_API_KEY
+OLLAMA_BASE_URL=http://localhost:11434
diff --git a/.github/workflows/test-and-eval.yml b/.github/workflows/test-and-eval.yml
@@ -0,0 +1,31 @@
+name: Tests and Evaluations
+
+on:
+  push:
+    branches: [ main ]
+  pull_request:
+    branches: [ main ]
+
+jobs:
+  test-and-eval:
+    runs-on: ubuntu-latest
+    env:
+      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+      GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
+    steps:
+      - uses: actions/checkout@v3
+      - uses: actions/setup-python@v4
+        with:
+          python-version: '3.11'
+      - name: Install uv
+        run: |
+          curl -LsSf https://astral.sh/uv/install.sh | sh
+          echo "$HOME/.local/bin" >> "$GITHUB_PATH"
+      - name: Install dependencies
+        run: |
+          uv sync --dev
+      - name: Tests
+        run: uv run pytest -q
+      - name: Evaluations (Mock Mode)
+        run: uv run python -m intent_kit.evals.run_all_evals --quiet --mock
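Review note: the last workflow step passes `--quiet --mock` to `intent_kit.evals.run_all_evals`. How that module parses the flags is not shown in this diff. The following is a hypothetical sketch of such an entry point using `argparse`, in which `--mock` swaps the real node for a deterministic stub so CI needs no API keys. `build_parser`, `mock_node`, and `select_node` are illustrative names, not identifiers from the package.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical flag set mirroring the two flags used in the workflow.
    parser = argparse.ArgumentParser(description="Run evaluations (sketch)")
    parser.add_argument("--mock", action="store_true",
                        help="use simulated responses instead of API calls")
    parser.add_argument("--quiet", action="store_true",
                        help="suppress per-case output")
    return parser


def mock_node(text: str) -> str:
    # Deterministic stand-in for an LLM-backed node.
    return f"mock response for: {text}"


def select_node(real_node, use_mock: bool):
    # In mock mode the real node is never called, so no key is required.
    return mock_node if use_mock else real_node


args = build_parser().parse_args(["--quiet", "--mock"])
node = select_node(real_node=lambda text: "real call", use_mock=args.mock)
reply = node("hello")
```

If the real runner works along these lines, the three `*_API_KEY` secrets in the job's `env` block would only matter once a non-mock evaluation step is added.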
diff --git a/.gitignore b/.gitignore
@@ -36,6 +36,11 @@ ENV/
 htmlcov/
 .pytest_cache/
 .tox/
+reports/
+
+# Evaluation Results
+intent_kit/evals/results/
+intent_kit/evals/reports/
 
 # Visualization
 intentkit_graphs/

diff --git a/README.md b/README.md
@@ -801,6 +801,67 @@ pytest tests/
 
 ---
 
+## Evaluation & Benchmarking
+
+intent-kit provides a built-in evaluation framework for benchmarking intent graphs and nodes against real datasets. This is separate from unit/integration tests and is designed for large-scale, reproducible evaluation.
+
+The evaluation framework is now part of the main `intent_kit` package and can be imported as:
+
+```python
+from intent_kit.evals import run_all_evaluations, evaluate_node, generate_markdown_report
+```
+
+**Organized Structure:**
+- **Latest results**: Always available in `intent_kit/evals/results/latest/` and `intent_kit/evals/reports/latest/`
+- **Date-based archives**: Historical runs are automatically archived in date-based directories
+- **Clean separation**: Reports and raw results are organized separately for easy access
+
+### Running All Evals
+
+To run all evaluations and generate comprehensive markdown reports:
+
+```bash
+# Run with real API calls (requires API keys)
+uv run run-evals
+
+# Run in mock mode (no API keys required)
+uv run run-evals --mock
+```
+
+- Generates a comprehensive report at `reports/comprehensive_report.md`
+- Generates individual reports for each dataset in `reports/`
+- Mock mode uses simulated responses for testing without API costs
+
+### Running a Specific Eval
+
+To run a specific node evaluation (with markdown output):
+
+```bash
+uv run eval-node --dataset handler_node_llm --output reports/my_eval_report.md
+```
+
+- Replace `handler_node_llm` with any dataset name (without the `.yaml` extension)
+- Add `--output <file.md>` to save the report to a specific file
+- Reports are automatically saved to the `reports/` directory
+
+### Adding New Evals
+- Add new YAML datasets to `intent_kit/evals/datasets/`
+- Add corresponding node implementations to `intent_kit/evals/sample_nodes/`
+- The framework will automatically discover and evaluate them
+
+### Where are the results?
+- **Latest reports**: `intent_kit/evals/reports/latest/`
+- **Latest results**: `intent_kit/evals/results/latest/`
+- **Date-based archives**: `intent_kit/evals/reports/YYYY-MM-DD/` and `intent_kit/evals/results/YYYY-MM-DD/`
+- Reports are in markdown format for easy sharing and review
+- Raw results are in CSV format for detailed analysis
+
+### When to use evals vs. tests?
+- **Unit/Integration tests** (in `tests/`): For correctness, fast feedback, and CI
+- **Evals** (in `intent_kit/evals/`): For benchmarking, regression tracking, and real-world performance
+
+---
+
 ## Project Structure
 
 ```
@@ -837,6 +898,13 @@ intent-kit/
 │   │   ├── google_client.py
 │   │   ├── ollama_client.py
 │   │   └── __init__.py
+│   ├── evals/               # Evaluation framework
+│   │   ├── __init__.py      # Evaluation exports
+│   │   ├── run_all_evals.py # Run all evaluations
+│   │   ├── run_node_eval.py # Individual node evaluation
+│   │   ├── datasets/        # Evaluation datasets
+│   │   ├── sample_nodes/    # Sample nodes for evaluation
+│   │   └── reports/         # Generated evaluation reports
 │   ├── types.py             # Type definitions
 │   ├── exceptions/          # Custom exceptions
 │   └── utils/               # Utilities
@@ -855,4 +923,135 @@ intent-kit/
 
 ## License
 
-MIT License
+MIT License
+
+## Evaluation API
+
+The evaluation API provides a clean Python interface for testing your nodes against YAML datasets.
+
+### Basic Usage
+
+```python
+from intent_kit.evals import load_dataset, run_eval
+from intent_kit.evals.sample_nodes.classifier_node_llm import classifier_node_llm
+
+# Load a dataset
+dataset = load_dataset("intent_kit/evals/datasets/classifier_node_llm.yaml")
+
+# Run evaluation
+result = run_eval(dataset, classifier_node_llm)
+
+# Check results
+print(f"Accuracy: {result.accuracy():.1%}")
+print(f"Passed: {result.passed_count()}/{result.total_count()}")
+
+# Save results (using default locations)
+csv_path = result.save_csv()
+json_path = result.save_json()
+md_path = result.save_markdown()
+
+# Or specify custom paths
+result.save_csv("my_results.csv")
+result.save_json("my_results.json")
+result.save_markdown("my_report.md")
+```
+
+### Convenience Functions
+
+```python
+from intent_kit.evals import run_eval_from_path, run_eval_from_module
+from intent_kit.evals.sample_nodes.classifier_node_llm import classifier_node_llm
+# Evaluate from file path
+result = run_eval_from_path(
+    "intent_kit/evals/datasets/classifier_node_llm.yaml",
+    classifier_node_llm
+)
+
+# Evaluate with module loading
+result = run_eval_from_module(
+    "intent_kit/evals/datasets/classifier_node_llm.yaml",
+    "intent_kit.evals.sample_nodes.classifier_node_llm",
+    "classifier_node_llm"
+)
+```
+
+### Custom Comparison
+
+```python
+# Case-insensitive comparison
+def case_insensitive_comparator(expected, actual):
+    return str(expected).lower().strip() == str(actual).lower().strip()
+
+result = run_eval(dataset, node, comparator=case_insensitive_comparator)
+```
+
+### Programmatic Datasets
+
+```python
+from intent_kit.evals import EvalTestCase, Dataset
+
+# Create test cases programmatically
+test_cases = [
+    EvalTestCase(
+        input="What's the weather like?",
+        expected="Weather response",
+        context={"user_id": "test"}
+    )
+]
+
+dataset = Dataset(
+    name="my_dataset",
+    description="Custom test dataset",
+    node_type="classifier",
+    node_name="my_node",
+    test_cases=test_cases
+)
+
+result = run_eval(dataset, my_node)
+```
+
+### Dataset Format
+
+YAML datasets should follow this format:
+
+```yaml
+dataset:
+  name: "my_dataset"
+  description: "Test dataset for my node"
+  node_type: "classifier"
+  node_name: "my_node"
+
+test_cases:
+  - input: "What's the weather like in New York?"
+    expected: "Weather in New York: Sunny with a chance of rain"
+    context:
+      user_id: "user123"
+
+  - input: "Cancel my flight"
+    expected: "Successfully cancelled flight"
+    context:
+      user_id: "user123"
+```
+
+### Error Handling
+
+The API handles errors gracefully:
+
+- **Node exceptions**: Caught and recorded in results
+- **Missing files**: Clear error messages
+- **Malformed datasets**: Validation with helpful error messages
+- **Fail-fast option**: Stop evaluation on first failure
+
+```python
+# Fail-fast evaluation
+result = run_eval(dataset, node, fail_fast=True)
+```
+
+### Output Locations
+
+By default, results are saved to the existing intent-kit directory structure:
+
+- **CSV/JSON results**: `intent_kit/evals/results/latest/`
+- **Markdown reports**: `intent_kit/evals/reports/latest/`
+
+Files are automatically timestamped to avoid conflicts. You can also specify custom paths if needed.
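Review note: the behavior described under "Custom Comparison" and "Error Handling" can be illustrated without importing `intent_kit`. The block below is a minimal, self-contained sketch and not the package's implementation. `run_eval_sketch`, `CaseResult`, and `SketchResult` are hypothetical names, test cases are plain dicts, and a node is assumed to be a callable taking `(input, context)`.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class CaseResult:
    input: Any
    expected: Any
    actual: Any = None
    passed: bool = False
    error: Optional[str] = None  # set when the node raised


@dataclass
class SketchResult:
    results: list = field(default_factory=list)

    def total_count(self) -> int:
        return len(self.results)

    def passed_count(self) -> int:
        return sum(1 for r in self.results if r.passed)

    def accuracy(self) -> float:
        # Guard against an empty dataset.
        return self.passed_count() / self.total_count() if self.results else 0.0


def run_eval_sketch(test_cases, node, comparator=None, fail_fast=False):
    """Hypothetical stand-in for run_eval (assumed semantics)."""
    compare = comparator or (lambda expected, actual: expected == actual)
    out = SketchResult()
    for case in test_cases:
        record = CaseResult(input=case["input"], expected=case["expected"])
        try:
            record.actual = node(case["input"], case.get("context", {}))
            record.passed = bool(compare(record.expected, record.actual))
        except Exception as exc:  # node exceptions are recorded, not re-raised
            record.error = f"{type(exc).__name__}: {exc}"
        out.results.append(record)
        if fail_fast and not record.passed:
            break  # stop at the first failing or erroring case
    return out


def case_insensitive(expected, actual):
    return str(expected).lower().strip() == str(actual).lower().strip()


def toy_node(text, context):
    # Toy node for illustration only: raises on empty input.
    if not text:
        raise ValueError("empty input")
    return text.upper()


cases = [
    {"input": "hello", "expected": "Hello"},
    {"input": "", "expected": "anything"},
    {"input": "bye", "expected": "BYE"},
]

full = run_eval_sketch(cases, toy_node, comparator=case_insensitive)
early = run_eval_sketch(cases, toy_node, comparator=case_insensitive, fail_fast=True)
```

In this sketch `full` evaluates all three cases and records the `ValueError` on the second one, while `early` stops after that same case. Whether the real `fail_fast` treats a node exception as a failure is an assumption here and worth confirming against the implementation.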
diff --git a/env.example b/env.example
@@ -0,0 +1,14 @@
+# Example .env file for Intent Kit LLM evaluations
+# Copy this to .env and add your actual API keys
+
+# OpenAI API Key (for GPT models)
+OPENAI_API_KEY=your-openai-api-key-here
+
+# Anthropic API Key (for Claude models)
+ANTHROPIC_API_KEY=your-anthropic-api-key-here
+
+# Google API Key (for Gemini models)
+GOOGLE_API_KEY=your-google-api-key-here
+
+# Ollama (local models - no API key needed)
+# OLLAMA_BASE_URL=http://localhost:11434
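Review note: this change renames `GEMINI_API_KEY` to `GOOGLE_API_KEY`, so a `.env` file that still uses the old name would fall through to the mock placeholder in `examples/advanced_remediation_demo.py`. One way to soften the migration is a lookup that prefers the new name and accepts the old one. The sketch below is a suggestion under that assumption and is not part of this PR; `resolve_key` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional, Sequence


def resolve_key(names: Sequence[str], mock_value: str,
                env: Optional[Mapping[str, str]] = None) -> str:
    """Return the first non-empty variable among names, else mock_value."""
    source = os.environ if env is None else env
    for name in names:
        value = source.get(name)
        if value:  # unset and empty string are treated the same way
            return value
    return mock_value


# Prefer the new name, accept the legacy one, else use the mock placeholder.
legacy_env = {"GEMINI_API_KEY": "real-legacy-key"}
google_key = resolve_key(
    ["GOOGLE_API_KEY", "GEMINI_API_KEY"], "sk-mock-gemini", env=legacy_env
)
missing_key = resolve_key(
    ["GOOGLE_API_KEY", "GEMINI_API_KEY"], "sk-mock-gemini", env={}
)
```

Passing `env` explicitly keeps the helper testable. Omitting it reads from `os.environ`, which matches how the demo calls `os.getenv` after `load_dotenv()`.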
diff --git a/examples/advanced_remediation_demo.py b/examples/advanced_remediation_demo.py
@@ -28,12 +28,12 @@
 # --- Setup LLM configs ---
 load_dotenv()
 OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") or "sk-mock-openai"
-GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") or "sk-mock-gemini"
+GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY") or "sk-mock-gemini"
 
 LLM_CONFIG_1 = {"provider": "openai",
                 "model": "gpt-4.1-mini", "api_key": OPENAI_API_KEY}
 LLM_CONFIG_2 = {"provider": "google",
-                "model": "gemini-2.5-flash", "api_key": GEMINI_API_KEY}
+                "model": "gemini-2.5-flash", "api_key": GOOGLE_API_KEY}
 
 # --- Core Handler: Simulates model confusion and ambiguity ---
 
@@ -134,7 +134,7 @@ def main():
     print("• Consensus voting: Multiple models must agree before output is accepted.")
     print("• Alternate prompt: Handler retries with a new prompt if it can't answer.")
 
-    if "mock" in OPENAI_API_KEY or "mock" in GEMINI_API_KEY:
+    if "mock" in OPENAI_API_KEY or "mock" in GOOGLE_API_KEY:
         print("\n💡 Pro Tip: For real LLM behavior, add your OpenAI and Gemini API keys to a .env file.")