Skip to content

feat: add structured category and severity fields to review findings#29

Open
mvanhorn wants to merge 1 commit into
alibaba:mainfrom
mvanhorn:feat/16-structured-category-severity
Open

feat: add structured category and severity fields to review findings#29
mvanhorn wants to merge 1 commit into
alibaba:mainfrom
mvanhorn:feat/16-structured-category-severity

Conversation

@mvanhorn

@mvanhorn mvanhorn commented Jun 3, 2026

Copy link
Copy Markdown

Summary

Adds two optional structured fields, category and severity, to every review
finding. They flow through the model, tool-call parsing, JSON output, agent
output, and the human-readable text renderer, and are populated by the review
LLM via the code_comment tool schema and a short prompt-template instruction.

Allowed values match the issue's tables:

  • severity: critical, high, medium, low, info
  • category: bug, security, performance, maintainability, test, style, documentation, other

Why this matters

Per #16, the machine-readable output of ocr review exposes finding text,
location, and suggestion, but no structured category/severity per finding. CI
integrations (GitHub Actions, GitLab CI) currently have to re-parse
natural-language comment text to sort, group, filter, or gate builds by
importance. The maintainers asked the reporter to open this dedicated issue and
laid out the enum tables plus acceptance criteria this PR implements:

  • JSON and agent output now include category and severity per finding when
    the model provides them.
  • README and README.zh-CN document the allowed values and semantics.
  • Existing consumers that ignore the fields are unaffected: the struct fields use
    omitempty, and the tool schema does not mark them required, so the keys are
    omitted entirely when empty and older/less-capable models still emit valid tool
    calls.

The change is backward-compatible by construction (optional + omitempty + not
required).

Out of scope by design (the issue frames these as follow-ups, design questions
#3/#4): no --severity CLI filtering flags and no confidence field. This PR
lands the data first; filtering/gating can be a separate change now that the
fields exist.

Testing

  • go build ./... — success
  • go vet ./... — clean
  • go test ./... — all packages pass (198 tests)
  • New unit tests in internal/tool/code_comment_test.go:
    • category/severity are parsed when present
    • a finding without them is still accepted and leaves the fields zero-valued
    • JSON serialization includes the fields when set and omits the keys entirely
      when empty (no "category":"")

Fixes #16

AI was used for assistance.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Superskyyy

Copy link
Copy Markdown

Please merge this, it is an important feature to end users.

@stay-foolish-forever

Copy link
Copy Markdown
Contributor

Thanks for the great work, @mvanhorn! 🙏

This is a really solid contribution — adding structured category and severity fields to the review findings is an important enhancement that lays the groundwork for downstream filtering, CI gate policies, and better machine-readable output.

One thing I'd like to evaluate before merging: this change introduces modifications to the review prompts. I want to carefully assess whether the additional prompt instructions for category/severity classification could have any impact on the quality or focus of the review output itself. I'll run some comparative reviews today and get back to you with my findings.

Appreciate your patience — expect an update later today!

@stay-foolish-forever

Copy link
Copy Markdown
Contributor

Thanks @mvanhorn for this well-implemented PR! The code quality is solid, and the structured category and severity fields will be valuable for downstream CI integrations.

However, after conducting careful evaluations on our benchmark suite, I've observed that introducing these changes results in a noticeable degradation in the overall review quality of the tool. The additional prompt instructions for category/severity classification appear to be affecting the focus and accuracy of the review output itself.

We're currently investigating the root cause of this regression in depth. Once we identify the underlying issue, we'll provide specific improvement suggestions — potentially around prompt engineering, model behavior, or field population strategies.

Please keep this PR open — we believe this feature is important and want to work through the quality concerns rather than close it. We'll follow up soon with concrete next steps and any necessary adjustments.

Appreciate your patience and contribution!

@MuoDoo

MuoDoo commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

However, after conducting careful evaluations on our benchmark suite, I've observed that introducing these changes results in a noticeable degradation in the overall review quality of the tool. The additional prompt instructions for category/severity classification appear to be affecting the focus and accuracy of the review output itself.

We're currently investigating the root cause of this regression in depth. Once we identify the underlying issue, we'll provide specific improvement suggestions — potentially around prompt engineering, model behavior, or field population strategies.

Maybe this tagging process should be run after an original review which doesn't affect the effect of ocr.

@ScarletCarpet

Copy link
Copy Markdown

然而,在对我们的基准套件进行仔细评估后,我观察到引入这些更改会导致工具整体审查质量显著下降。用于类别/严重性分类的额外提示说明似乎影响了审查输出本身的重点和准确性。
我们目前正在深入调查这一回归的根本原因。一旦我们确定根本问题,我们将提供具体的改进建议——可能是关于提示工程、模型行为或领域人口策略。

也许这个标记过程应该在原始审查之后进行,这不会影响OCR的效果。

the same idea. maybe create a new review task for classifying the finding issues.

@ScarletCarpet

Copy link
Copy Markdown

what i found for reviewing the code change is that unnecessary to change prompt but just set severity and categories to required at function declaration, and at code_comment phase, the LLM will set it.

+                },
+                "severity": {
+                  "type": "string",
+                  "description": "Issue severity. One of: high, medium, low."
+                },
+                "category": {
+                  "type": "string",
+                  "description": "Issue category. One of: bug, security, performance, maintainability, test, style, documentation, improvement, other."
                 }
               },
               "required": [
+                "severity",
+                "category",
                 "content",
                 "existing_code"
               ]

@lizhengfeng101

Copy link
Copy Markdown
Collaborator

Hi @mvanhorn — first of all, thank you so much for this contribution! The structured category/severity idea is exactly what we want for CI integrations, and we appreciate the thoughtfulness in your design (backward compatibility, tests, docs).

We owe you an apology for the delayed response. The project was recently open-sourced and we've been heads-down on urgent issues. We now have time to properly evaluate this feature and would love to move it forward with you.

Could you rebase onto the latest main and rework the implementation based on the feedback below? Quite a bit has changed since you opened this PR.


Feedback

1. Do not modify the system prompt (task_template.json)

We've found that adding the category/severity instruction to the system prompt causes a measurable regression on our public evaluation benchmarks. The root cause is difficult to isolate, so our policy is to keep the system prompt stable. The tool schema itself (point 2) is sufficient for the model to understand and populate these fields — no prompt-level guidance is needed.

2. Tool schema: remove info from severity

The schema design is great overall. However, we'd like severity to be limited to:

critical, high, medium, low

During our internal classification experiments we found that LLMs consistently struggle to distinguish between low and info. Reducing the options produces more decisive, reliable outputs from the model.

3. Rendering: badge-style for CLI, flat fields for JSON

For JSON output, category and severity should be sibling fields alongside content, start_line, etc. — which your current implementation already does correctly.

For CLI terminal output, we'd prefer the badge approach (recommended) where the tag is inline with the comment text, with color driven by severity. Example:

─── internal/mcp/client.go:27-27 ───
[security · high] Potential environment variable leak: `os.Environ()` passes all
environment variables of the current process (which may contain sensitive values
like `API_KEY`, `SECRET_TOKEN`) to the MCP subprocess. Since MCP servers may be
provided by third parties, consider passing only necessary variables (e.g. PATH
and those explicitly specified in the `env` config), or at minimum document this
behavior clearly.

An alternative (line-by-line) is also acceptable but we recommend the badge:

─── internal/mcp/client.go:27-27 ───
Category : security
Severity : high

Potential environment variable leak: ...

Thanks again for your patience and for driving this forward. Looking forward to the next iteration!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(agent): add structured category and severity fields to review output

6 participants