[AISOS-741] Add auto review primitive to Forge#119
Open
JoshSalomon wants to merge 27 commits into
Open
Conversation
added 21 commits
July 2, 2026 12:37
… ReviewCycleData dataclasses Detailed description: - Created containers/review.py with: - Verdict StrEnum with APPROVED and REJECTED values - ReviewConfig dataclass with max_retries (int, default 3) and instructions (str) - ReviewCycleData dataclass with all specified fields - parse_review_config() function for parsing YAML frontmatter - Function supports AUTO_REVIEW_MAX_RETRIES env var fallback (SC-005) - Malformed YAML logs warning and returns defaults (SC-012/BR-006) - Added 23 unit tests in tests/unit/containers/test_review.py Closes: AISOS-2053
Implemented detect_review_md(skill_name, ticket_key, skills_base) -> Path | None
that locates review.md files with project override precedence:
- Mirrors precedence logic from src/forge/skills/resolver.py
- Checks project-specific path first: skills/{project}/{skill_name}/review.md
- Falls back to default: skills/default/{skill_name}/review.md
- Extracts project key from ticket_key (e.g., 'AISOS-123' -> 'aisos')
- Returns None if review.md doesn't exist (SC-010)
- Uses is_file() to correctly handle edge case of directory named review.md
Key files modified:
- containers/review.py: Added detect_review_md function with debug logging
- tests/unit/containers/test_review.py: Added 11 comprehensive unit tests
- tests/unit/containers/conftest.py: Created to isolate from forge imports
Supports:
- SC-006: Project override precedence
- SC-010: No reviewer when review.md absent
Closes: AISOS-2054
Added parse_verdict(output_text: str) -> tuple[Verdict, str] to containers/review.py: - Performs case-insensitive search for APPROVED and REJECTED markers - Returns (Verdict.APPROVED, "") when APPROVED marker found first (SC-002) - Returns (Verdict.REJECTED, feedback) when REJECTED marker found first, extracting all text after the marker as feedback (SC-002) - Returns (Verdict.REJECTED, "Verdict could not be parsed") when neither marker present - When both markers present, first occurrence wins (position-based) - Feedback text is stripped of leading/trailing whitespace - Empty feedback handled correctly (SC-003 edge case) Added 23 comprehensive unit tests covering: - Case-insensitive matching (uppercase, lowercase, mixed case) - Feedback extraction from text after REJECTED marker - Edge cases: empty string, whitespace-only, neither marker - Both markers present with position-based precedence - Real-world review scenarios All 57 tests pass. Closes: AISOS-2055
Added write_cycle_file(workspace, step_name, cycle_data) to containers/review.py:
- Creates .forge/{step_name}/ directory if it doesn't exist
- Writes review_cycle_N.json where N is cycle_data.cycle
- Serializes ReviewCycleData to JSON with 2-space indentation
- Ensures verdict is lowercase string ('approved' or 'rejected')
- Handles both Verdict enum and string values for verdict field
- Uses ensure_ascii=False for proper Unicode handling
Added 18 unit tests in TestWriteCycleFile class verifying:
- Directory creation (.forge/ and step subdirectory)
- File output path pattern (review_cycle_N.json)
- All required JSON fields (SC-007)
- ISO 8601 UTC timestamp format preservation
- Verdict lowercase conversion and enum handling
- Multiple cycles written separately
- JSON pretty printing
- File overwrite behavior
- Empty/multiline/special character feedback handling
- Float precision for elapsed_seconds
- Step name variations (hyphens, underscores)
Closes: AISOS-2056
Integrates the review loop into the container entrypoint after skill execution:
- Import review functions from containers/review.py (detect_review_md,
parse_review_config, parse_verdict, write_cycle_file, ReviewCycleData, Verdict)
- Add run_reviewer_agent() to spawn reviewer with review.md instructions (SC-001)
- Add load_conversation_history() to load from .forge/history/{task_key}.json
- Add run_worker_with_feedback() for retry with feedback injection (SC-003):
- Injects '## Reviewer Feedback' section into task description
- Loads conversation history for context
- Add run_review_loop() for complete review loop:
- Parse config via parse_review_config (SC-004, SC-005)
- Track cycle timing with time.perf_counter()
- Parse verdict via parse_verdict (SC-002)
- Write cycle file to .forge/{step_name}/review_cycle_N.json (SC-007)
- Exit on APPROVED or max retries exhausted (BR-005)
- Modify main() to:
- Extract skill_name from task_data or FORGE_SKILL_NAME env var
- Check for review.md via detect_review_md after run_agent_task
- Run review loop if review.md exists (SC-001, SC-010)
Add 17 unit tests covering all acceptance criteria:
- load_conversation_history: loads existing, handles missing, handles JSON error
- run_worker_with_feedback: feedback injection, history loading
- run_review_loop: APPROVED exits, REJECTED retries, max retries, frontmatter,
env var fallback, cycle file path, reviewer instructions, timing, error handling
- main integration: skips when no skill_name, skips when no review.md, runs when exists
Closes: AISOS-2057
… mode Add terminal progress output to the review loop in containers/entrypoint.py: - Added TTY detection with _IS_TTY = sys.stdout.isatty() - Added _print_review_progress() function for terminal output: - Format: "Review cycle N/M: VERDICT - [truncated feedback]" - Only outputs when stdout is a TTY (local/terminal mode) - Truncates feedback to 200 chars with ellipsis when exceeded - Uses flush=True for immediate output (SC-011: within 1 second) - Integrated into run_review_loop() immediately after verdict determination - Added 8 unit tests for the new functionality Supports SC-011: terminal displays progress within 1 second of verdict. Closes: AISOS-2058
Implements async polling for review cycle files during container execution.
Changes:
- Create ReviewCyclePoller class in src/forge/observability/review_poller.py
- Accepts step_name parameter for step-specific path detection
- Polls .forge/{step-name}/review_cycle_*.json files
- Implements async polling loop with configurable interval
- JSON parsing with retry handles race conditions with container writes
- Tracks already-processed files to avoid duplicates
- Add ReviewCycleData dataclass for orchestrator-side review cycle data
- Add AUTO_REVIEW_POLL_INTERVAL (default 5s) to forge.config.Settings
- Add aiofiles>=24.1.0 dependency to pyproject.toml
- Export ReviewCyclePoller and ReviewCycleData from forge.observability
- Add 32 unit tests with full coverage
Closes: AISOS-2059
…ode switching
Implemented review cycle recording functionality for observability:
- Created src/forge/observability/review_recorder.py with:
- ReviewCycleData dataclass with fields: cycle, max_cycles, verdict, feedback,
skill, elapsed_seconds, timestamp (datetime type)
- to_dict() and to_json() methods for serialization
- ReviewCycleRecorder class with step_name parameter and mode switching:
- mode='log': logs cycle data at INFO level with structured format
- mode='copy': copies files to {recording_dir}/{step-name}/review_cycle_*.json
- mode=None: recording disabled
- step_dir property for step-specific subdirectory path
- record() method for logging, record_file() method for copying
- Added AUTO_REVIEW_RECORD_POLLED_FILES setting to forge.config.Settings
- Type: Literal['log', 'copy'] | None (default: None)
- Placed under Auto Review Configuration section
- Updated observability module __init__.py to export new classes
- Created comprehensive unit tests (34 tests) in
tests/unit/observability/test_review_recorder.py covering:
- ReviewCycleData fields, to_dict(), to_json()
- ReviewCycleRecorder initialization with all mode combinations
- Log mode INFO level logging with structured format
- Copy mode file copying and directory creation
- Error handling for missing files and directory creation failures
- Settings configuration validation
All tests pass.
Closes: AISOS-2060
Extend src/forge/api/routes/metrics.py with Prometheus metrics for review cycle monitoring: - forge_review_cycles_total Counter: Total review cycles detected, labeled by skill and step - forge_review_verdicts_total Counter: Review verdicts by outcome (approved/rejected), labeled by skill, step, and verdict - forge_review_duration_seconds Histogram: Review cycle duration, labeled by skill and step, using same buckets as AGENT_DURATION [1, 5, 10, 30, 60, 120, 300, 600] Added helper functions for recording metrics: - record_review_cycle(skill, step) - record_review_verdict(skill, step, verdict) - observe_review_duration(skill, step, duration) Added 11 unit tests in TestReviewCycleMetrics class verifying: - Counter and histogram existence and labeling - Helper function increment/observe behavior - Metrics visibility in /metrics endpoint output Closes: AISOS-2061
Create src/forge/observability/review_notifier.py module for posting Jira comments when review cycles are detected during container execution. Changes: - Add ReviewJiraNotifier class with rate limiting support (default 30s) - Format ReviewCycleData as readable Jira comments with: - Cycle number and max cycles with header - Verdict with ✅/❌ icons (APPROVED/REJECTED) - Skill name and duration (seconds or minutes) - Feedback summary (truncated at 500 chars with word boundary) - Handle Jira API errors gracefully: log WARNING and continue - Add NotifyResult dataclass for notify operation results - Export ReviewJiraNotifier and NotifyResult from forge.observability - Add 31 comprehensive unit tests covering: - Initialization and defaults - Rate limiting behavior - Comment formatting - Error handling - Integration scenarios Follows existing JiraClient patterns from forge.workflow.utils.jira_status Closes: AISOS-2062
…name parameter
Implementation:
- Extended ContainerResult dataclass with review_cycles field (list[ReviewCycleData])
- Added optional step_name parameter to ContainerRunner.run() method
- Added _poll_review_cycles() background task that:
- Polls workspace for .forge/{step-name}/review_cycle_*.json files
- Records cycles via ReviewCycleRecorder (log mode)
- Emits Prometheus metrics for observability
- Added _poller_to_recorder_cycle() helper for ReviewCycleData conversion
- Background polling task cancelled cleanly on container exit
- Final poll performed after container exits to catch remaining files
- Collected review_cycles included in all ContainerResult returns
Unit tests:
- 14 tests in tests/unit/sandbox/test_runner_review_polling.py covering:
- ContainerResult review_cycles field
- Poller to recorder cycle conversion
- Step name parameter acceptance
- Background polling task lifecycle
- Review cycle collection and metrics recording
- Step name path organization
Closes: AISOS-2063
Added _sweep_review_cycles() private method to ContainerRunner that performs
a synchronous post-execution sweep after the container exits. This catches any
review cycle files that may have been missed during async polling, especially
when containers exit quickly after writing files.
Implementation details:
- Scans .forge/{step-name}/review_cycle_*.json for all files
- Deduplicates against files already processed by async poller
- Parses and validates JSON, adding to review_cycles list
- Records via recorder and emits Prometheus metrics
- Logs warning when files are missed during async polling
- Handles edge cases: invalid JSON, missing fields, empty files
Added 11 unit tests covering:
- Basic sweep functionality and deduplication
- Edge case handling (missing dir, invalid JSON, etc.)
- Integration tests for fast-exit scenario
- Warning logging verification
Closes: AISOS-2064
Updated all workflow nodes that invoke ContainerRunner to pass the
step_name parameter for proper review cycle file organization:
- implementation.py: step_name='implement_task'
- local_reviewer.py: step_name='local_review' (bug and feature reviews)
- ci_evaluator.py: step_name='analyze_ci' and step_name='fix_ci'
- code_review.py: step_name='code_review'
- docs_updater.py: step_name='update_docs'
- implement_review.py: step_name='implement_review_analyze' and 'implement_review_fix'
- plan_bug_fix.py: step_name='plan_bug_fix'
- rca_analysis.py: step_name='analyze_bug' and step_name='reflect_rca'
- rebase.py: step_name='rebase'
This enables the orchestrator to poll for and record files at
.forge/{step-name}/review_cycle_*.json during container execution.
Added unit tests for step_name propagation in both implementation.py
and local_reviewer.py. All 630+ unit tests pass.
Closes: AISOS-2065
…fication
Created docs/reference/review-md-schema.md with complete schema specification:
- Overview explaining purpose of review.md files
- File location and naming convention (review.md alongside SKILL.md)
- Per-project override precedence rules (project > default)
- YAML frontmatter schema with max_retries field (optional, default: 3)
- Priority chain: frontmatter > AUTO_REVIEW_MAX_RETRIES env var > default
- Prose body format for reviewer instructions with best practices
- Edge cases: empty file, frontmatter-only, missing frontmatter, malformed YAML
- Review cycle output path: .forge/{step-name}/review_cycle_N.json
- JSON schema for review cycle output files
- Validation summary table
Follows existing documentation style from docs/reference/ files and
is consistent with SKILL.md frontmatter patterns.
Closes: AISOS-2066
Add default review instructions for code implementation tasks with: - YAML frontmatter with max_retries: 3 - Four review criteria: test coverage, error handling, documentation, no debug code - Language-agnostic examples (Python, Go, Node.js, Rust) - Instructions to use git diff origin/main...HEAD to see changes - APPROVED/REJECTED verdict markers - Structured feedback format for rejections with file/line/criterion Closes: AISOS-2067
Add default review instructions for pre-PR code review quality gate. The review.md complements the existing SKILL.md by adding an automated quality gate that runs after the local-code-review skill completes. Key features: - YAML frontmatter with max_retries: 2 (lower than implement-task's 3 since this is a secondary review pass) - Instructions to detect and run project test command (pyproject.toml, package.json, Makefile, Cargo.toml, go.mod) - Four review criteria: 1. Breaking changes: API/interface changes without migration 2. Test failures: all tests must pass locally 3. Security issues: hardcoded secrets, SQL injection, path traversal 4. Spec alignment: changes match task description - APPROVED/REJECTED verdict format with structured feedback examples Closes: AISOS-2068
…plement-task Created skills/aisos/implement-task/review.md as a reference example demonstrating per-project skill customization. The override: - Sets max_retries: 2 (customized from default of 3) - Demonstrates inheritance by referencing default review criteria - Adds Forge-specific criteria: - Testing with pytest (fixtures, parametrize, pytest-asyncio) - Type hints required on all public functions (PEP 604 style) - Logging via logging module (no print() for diagnostics) - Proper async/await patterns (gather vs sequential, no blocking I/O) - Includes documentation for skill authors on how to create overrides - Shows feedback format examples with Forge-specific file paths This serves as documentation by example for teams creating their own project-specific skill overrides. Closes: AISOS-2069
Detailed description:
- Created docs/guide/auto-review.md documenting the auto-review primitive
- Covers all required sections:
- Overview: What auto-review is and when to use it
- Quick Start: Minimal review.md example
- Configuration: max_retries in frontmatter, AUTO_REVIEW_MAX_RETRIES env var, default (3)
- Writing Review Instructions: How to write effective reviewer prompts
- Verdict Protocol: APPROVED/REJECTED markers with examples and feedback format
- Per-Project Overrides: How skill resolution works for review.md
- Observability: Review cycle files at .forge/{step-name}/review_cycle_N.json
- Troubleshooting: Timeout, parsing failures, endless loops, and more
- Includes mermaid flowchart showing review loop
- Links to related docs (skills authoring, review-md-schema)
- Follows documentation style from existing docs/guide/ files
Closes: AISOS-2070
Updated three documentation files to reference the new auto-review capability: - docs/skills/authoring.md: Added Auto-Review section explaining that skills can include review.md for automatic quality gates, with link to the guide - docs/skills/index.md: Updated directory layout to show review.md alongside SKILL.md, added Optional Files table with review.md description - docs/skills/defaults.md: Added Auto-review notes under implement-task and local-code-review sections describing what their review.md files check All changes link to docs/guide/auto-review.md for full details. Changes are minimal and focused — the full documentation is in the auto-review guide. Closes: AISOS-2071
Fixed ruff I001 lint error in src/forge/observability/__init__.py: - Changed multi-line import blocks to single-line imports - Imports now properly sorted alphabetically by module name All tests pass, no breaking issues found in code review. Closes: AISOS-741-review
…ig reference Added new Auto-Review section to docs/reference/config.md documenting: - AUTO_REVIEW_MAX_RETRIES: default max retry attempts - AUTO_REVIEW_POLL_INTERVAL: polling interval for review cycle files - AUTO_REVIEW_RECORD_POLLED_FILES: recording mode for observability These settings were added to src/forge/config.py as part of the auto-review feature but were not reflected in the configuration reference documentation. Closes: AISOS-741-docs
JoshSalomon
commented
Jul 2, 2026
The entrypoint imports from review.py but the Containerfile only copied entrypoint.py. This caused ImportError at container startup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The test_review.py tests import from the containers/ directory which isn't a Python package. Added sys.path setup in the test conftest following the existing pattern used by integration tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
poll() was declared async but only did 'return self'. As an async def, this wraps self in a coroutine, breaking 'async for' on Python 3.12 (CI) which requires __aiter__ on the direct return value. Making it a regular method returns self directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
poll() is correctly async (returns a coroutine wrapping self). The caller in runner.py must await it to get the async iterator. The previous fix incorrectly made poll() sync, which broke the poller tests on Python 3.12. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The poll() method is async, so the mock must be AsyncMock for await to work on Python 3.12. MagicMock returns a non-awaitable which fails with "can't be used in 'await' expression". Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
poll() only sets a flag and returns self — no I/O, no reason to be async. As async def, it wraps self in a coroutine which breaks 'async for' on Python 3.11 (CI) where coroutines don't satisfy __aiter__. Making it sync aligns with how runner.py calls it: 'async for new_cycles in poller.poll():' Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR implements the Auto-Review Primitive, a new capability that enables automated quality gates for skill execution in Forge. When a skill includes a
review.mdfile, the system automatically spawns a reviewer agent after execution to validate output quality, providing structured feedback and retry loops until the work is approved or max retries are exhausted. This feature reduces manual review burden and catches quality issues earlier in the development cycle.Changes
Container Review Loop Engine (
containers/review.py)Verdictenum (APPROVED/REJECTED) andReviewConfig/ReviewCycleDatadataclassesparse_review_config()for parsing YAML frontmatter from review.md filesdetect_review_md()with project override precedence (mirrors skill resolver pattern)parse_verdict()for case-insensitive APPROVED/REJECTED marker extractionwrite_cycle_file()for JSON output to.forge/{step-name}/review_cycle_N.jsonContainer Entrypoint Integration (
containers/entrypoint.py)run_review_loop()orchestrating the reviewer agent spawn, verdict parsing, and retry logicrun_reviewer_agent()andrun_worker_with_feedback()for feedback injectionload_conversation_history()for loading prior context on retries_print_review_progress())Observability Infrastructure (
src/forge/observability/)ReviewCyclePollerclass with async polling loop for detecting review cycle files during container executionReviewCycleRecorderwith mode switching (log/copy/None) for recording detected cyclesReviewJiraNotifierfor posting formatted Jira comments with rate limitingforge_review_cycles_total,forge_review_verdicts_total,forge_review_duration_secondsContainer Runner Integration (
src/forge/sandbox/runner.py)ContainerResultdataclass withreview_cyclesfieldstep_nameparameter toContainerRunner.run()method_poll_review_cycles()background task for async polling during execution_sweep_review_cycles()for post-execution sync sweep to catch missed filesWorkflow Node Updates
step_nameparameter:implementation.py: step_name='implement_task'local_reviewer.py: step_name='local_review'ci_evaluator.py: step_name='analyze_ci'/'fix_ci'code_review.py: step_name='code_review'docs_updater.py: step_name='update_docs'implement_review.py: step_name='implement_review_analyze'/'implement_review_fix'plan_bug_fix.py: step_name='plan_bug_fix'rca_analysis.py: step_name='analyze_bug'/'reflect_rca'rebase.py: step_name='rebase'Configuration (
src/forge/config.py)AUTO_REVIEW_MAX_RETRIES: default max retry attempts (default: 3)AUTO_REVIEW_POLL_INTERVAL: polling interval in seconds (default: 5)AUTO_REVIEW_RECORD_POLLED_FILES: recording mode ('log'/'copy'/None)Sample Review Files
skills/default/implement-task/review.mdwith generic code review criteriaskills/default/local-code-review/review.mdwith pre-PR review criteriaskills/aisos/implement-task/review.mdas project-specific override exampleDocumentation
docs/reference/review-md-schema.mdwith formal schema specificationdocs/guide/auto-review.mdwith comprehensive user documentationdocs/skills/authoring.md,docs/skills/index.md,docs/skills/defaults.mdto reference auto-reviewdocs/reference/config.mdwith new configuration settingsDependencies
aiofiles>=24.1.0topyproject.tomlfor async file operationsImplementation Notes
detect_review_md()mirrors the skill resolver pattern—project-specific overrides (skills/{project}/{skill}/review.md) take precedence over defaults (skills/default/{skill}/review.md)Testing
tests/unit/containers/test_review.py: 57 tests for review.py functionstests/unit/observability/test_review_poller.py: 32 tests for async pollingtests/unit/observability/test_review_recorder.py: 34 tests for recording modestests/unit/observability/test_review_notifier.py: 31 tests for Jira notificationtests/unit/sandbox/test_runner_review_polling.py: 25+ tests for runner integrationRelated Tickets
Generated by Forge SDLC Orchestrator