feat: add structured category and severity fields to review findings #29
mvanhorn wants to merge 1 commit into
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Please merge this; it is an important feature for end users.
Thanks for the great work, @mvanhorn! 🙏 This is a really solid contribution, adding structured
One thing I'd like to evaluate before merging: this change modifies the review prompts. I want to carefully assess whether the additional prompt instructions for category/severity classification could affect the quality or focus of the review output itself. I'll run some comparative reviews today and get back to you with my findings.
Appreciate your patience. Expect an update later today!
Thanks @mvanhorn for this well-implemented PR! The code quality is solid, and the structured category and severity fields will be valuable for downstream CI integrations.
However, after careful evaluation on our benchmark suite, I've observed that these changes cause a noticeable degradation in the tool's overall review quality. The additional prompt instructions for category/severity classification appear to affect the focus and accuracy of the review output itself.
We're currently investigating the root cause of this regression in depth. Once we identify the underlying issue, we'll provide specific improvement suggestions, potentially around prompt engineering, model behavior, or field population strategies.
Please keep this PR open. We believe this feature is important and want to work through the quality concerns rather than close it. We'll follow up soon with concrete next steps and any necessary adjustments. Appreciate your patience and contribution!
Maybe this tagging should run as a separate step after the original review, so that it doesn't affect the quality of the ocr review itself.
Same idea. Maybe create a new review task for classifying the findings.
What I found while reviewing the code change is that there is no need to change the prompt: just mark severity and category as required in the function declaration, and the LLM will set them at the code_comment phase.
Hi @mvanhorn, first of all, thank you so much for this contribution! The structured category/severity idea is exactly what we want for CI integrations, and we appreciate the thoughtfulness in your design (backward compatibility, tests, docs).
We owe you an apology for the delayed response. The project was recently open-sourced and we've been heads-down on urgent issues. We now have time to properly evaluate this feature and would love to move it forward with you. Could you rebase onto the latest
Feedback
1. Do not modify the system prompt (
We've found that adding the category/severity instruction to the system prompt causes a measurable regression on our public evaluation benchmarks. The root cause is difficult to isolate, so our policy is to keep the system prompt stable. The tool schema itself (point 2) is sufficient for the model to understand and populate these fields; no prompt-level guidance is needed.
2. Tool schema: remove
The schema design is great overall. However, we'd like
During our internal classification experiments we found that LLMs consistently struggle to distinguish between
3. Rendering: badge-style for CLI, flat fields for JSON
For JSON output,
For CLI terminal output, we'd prefer the badge approach (recommended) where the tag is inline with the comment text, with color driven by severity. Example:
An alternative (line-by-line) is also acceptable but we recommend the badge:
Thanks again for your patience and for driving this forward. Looking forward to the next iteration!
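The badge idea in point 3 (tag inline with the comment text, color driven by severity) might be sketched roughly as below. This is not the maintainers' example, which did not survive in this thread. The `[severity/category]` tag format, the ANSI color choices, and the helper names `badge` and `renderLine` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// ansiBySeverity is an illustrative palette (assumption), not the project's.
var ansiBySeverity = map[string]string{
	"critical": "\x1b[1;31m", // bold red
	"high":     "\x1b[31m",   // red
	"medium":   "\x1b[33m",   // yellow
	"low":      "\x1b[36m",   // cyan
	"info":     "\x1b[2m",    // dim
}

const ansiReset = "\x1b[0m"

// badge builds an inline tag such as "[high/bug]". It returns "" when both
// fields are empty, so untagged findings render exactly as before.
func badge(severity, category string, color bool) string {
	var parts []string
	if severity != "" {
		parts = append(parts, severity)
	}
	if category != "" {
		parts = append(parts, category)
	}
	if len(parts) == 0 {
		return ""
	}
	tag := "[" + strings.Join(parts, "/") + "]"
	if code, ok := ansiBySeverity[severity]; ok && color {
		return code + tag + ansiReset
	}
	return tag
}

// renderLine prefixes the comment text with the badge, if there is one.
func renderLine(severity, category, comment string, color bool) string {
	if b := badge(severity, category, color); b != "" {
		return b + " " + comment
	}
	return comment
}

func main() {
	fmt.Println(renderLine("high", "bug", "possible nil dereference", false))
	fmt.Println(renderLine("", "", "consider renaming this variable", false))
}
```

Passing `color` as a parameter keeps the sketch testable and lets a caller turn colors off when output is not a terminal.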
Summary
Adds two optional structured fields, category and severity, to every review
finding. They flow through the model, tool-call parsing, JSON output, agent
output, and the human-readable text renderer, and are populated by the review
LLM via the code_comment tool schema and a short prompt-template instruction.
Allowed values match the issue's tables:
severity: critical, high, medium, low, info
category: bug, security, performance, maintainability, test, style, documentation, other
Why this matters
Per #16, the machine-readable output of ocr review exposes finding text,
location, and suggestion, but no structured category/severity per finding. CI
integrations (GitHub Actions, GitLab CI) currently have to re-parse
natural-language comment text to sort, group, filter, or gate builds by
importance. The maintainers asked the reporter to open this dedicated issue and
laid out the enum tables plus acceptance criteria this PR implements:
category and severity per finding when the model provides them.
omitempty, and the tool schema does not mark them required, so the keys are
omitted entirely when empty and older/less-capable models still emit valid tool
calls.
The change is backward-compatible by construction (optional + omitempty + not
required).
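As a minimal sketch of the "optional + omitempty" behavior described above: the `Finding` struct and its non-category fields are hypothetical stand-ins, not the project's actual type. Only the `category` and `severity` JSON keys come from this PR description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Finding is a hypothetical stand-in for the project's finding type.
// Path, Line and Comment are assumed names; Category and Severity mirror
// the two optional fields described in this PR.
type Finding struct {
	Path     string `json:"path"`
	Line     int    `json:"line"`
	Comment  string `json:"comment"`
	Category string `json:"category,omitempty"`
	Severity string `json:"severity,omitempty"`
}

// encode marshals a finding to its JSON text.
func encode(f Finding) string {
	out, err := json.Marshal(f)
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	tagged := Finding{Path: "main.go", Line: 12, Comment: "possible nil dereference", Category: "bug", Severity: "high"}
	untagged := Finding{Path: "main.go", Line: 40, Comment: "consider renaming"}

	fmt.Println(encode(tagged))
	// The untagged finding carries no "category" or "severity" key at all,
	// rather than an empty string.
	fmt.Println(encode(untagged))
}
```

Because the keys disappear when the fields are empty, a consumer written before this change sees the same JSON shape it always did.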
Out of scope by design (the issue frames these as follow-ups, design questions
#3/#4): no --severity CLI filtering flags and no confidence field. This PR
lands the data first; filtering/gating can be a separate change now that the
fields exist.
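To illustrate what the structured field makes possible downstream, here is a hedged sketch of a CI-side gate that keeps only findings at or above a severity threshold. It is not part of this PR. The JSON shape (a flat array of findings with a `comment` key) and the helper `atOrAbove` are assumptions; only the severity values come from the tables above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// finding models only the keys this sketch needs; the "comment" key name
// is an assumption about the JSON output.
type finding struct {
	Comment  string `json:"comment"`
	Severity string `json:"severity,omitempty"`
}

// severityRank orders the severity values listed in this PR.
var severityRank = map[string]int{
	"info": 1, "low": 2, "medium": 3, "high": 4, "critical": 5,
}

// atOrAbove returns the findings whose severity is at least threshold.
// Findings with a missing or unknown severity have rank 0 and are skipped.
func atOrAbove(findings []finding, threshold string) []finding {
	floor, ok := severityRank[threshold]
	if !ok {
		return nil
	}
	var kept []finding
	for _, f := range findings {
		if severityRank[f.Severity] >= floor {
			kept = append(kept, f)
		}
	}
	return kept
}

func main() {
	raw := `[
		{"comment": "SQL built by string concatenation", "severity": "critical"},
		{"comment": "typo in doc comment", "severity": "info"},
		{"comment": "untagged note"}
	]`

	var findings []finding
	if err := json.Unmarshal([]byte(raw), &findings); err != nil {
		panic(err)
	}

	blocking := atOrAbove(findings, "high")
	fmt.Printf("%d blocking finding(s)\n", len(blocking))
}
```

A real gate would also have to decide how to treat untagged findings; this sketch simply ignores them.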
Testing
go build ./... : success
go vet ./... : clean
go test ./... : all packages pass (198 tests)
internal/tool/code_comment_test.go: category/severity are parsed when present
when empty (no "category":"")
Fixes #16
AI was used for assistance.