Progress towards tree-sitter feature #3102

Draft
saniyafatima07 wants to merge 12 commits into mandiant:master from saniyafatima07:script-feature

saniyafatima07 (Collaborator) commented Jun 10, 2026

Addresses PR #1080 and #2931

  • Updated a few outdated tree-sitter queries
  • Ensured all related test cases pass

Checklist

  • No CHANGELOG update needed
  • No new tests needed
  • No documentation update needed
  • This submission includes AI-generated code and I have provided details in the description.

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request introduces Tree-Sitter Script Analysis to capa, enabling feature extraction from script languages such as C#, Python, HTML, and ASPX templates. It adds a new Tree-Sitter-based feature extractor, auto-detection capabilities, and signature-based tools, along with comprehensive tests and updated dependencies. The code review feedback primarily addresses compatibility issues with the upgraded tree-sitter library (version 0.25.0), specifically pointing out that QueryCursor has been removed and the Parser instantiation has changed in tree-sitter versions >= 0.21.0. The feedback provides actionable suggestions to execute queries directly and update parser usage. Additionally, it identifies a bug in integer suffix parsing and recommends replacing a deprecated importlib.resources API.

Comment on lines +140 to +146
    def get_function_call_names(self, node: Node) -> Iterator[Node]:
        cursor = QueryCursor(self.query.function_call_name)
        yield from self.get_captured_nodes(cursor, node)

    def get_imported_constants(self, node: Node) -> Iterator[Node]:
        cursor = QueryCursor(self.query.imported_constant_name)
        yield from self.get_captured_nodes(cursor, node)

critical

Since QueryCursor is removed, you can pass the query directly to get_captured_nodes without instantiating a cursor.

Suggested change
-    def get_function_call_names(self, node: Node) -> Iterator[Node]:
-        cursor = QueryCursor(self.query.function_call_name)
-        yield from self.get_captured_nodes(cursor, node)
-    def get_imported_constants(self, node: Node) -> Iterator[Node]:
-        cursor = QueryCursor(self.query.imported_constant_name)
-        yield from self.get_captured_nodes(cursor, node)
+    def get_function_call_names(self, node: Node) -> Iterator[Node]:
+        yield from self.get_captured_nodes(self.query.function_call_name, node)
+    def get_imported_constants(self, node: Node) -> Iterator[Node]:
+        yield from self.get_captured_nodes(self.query.imported_constant_name, node)

Comment on lines +164 to +174
    def get_namespaces(self, node: Optional[Node] = None) -> List[Tuple[Node, str]]:
        target_node = self.tree.root_node if node is None else node
        cursor = QueryCursor(self.query.namespace)
        namespace_captures: List[Tuple[Node, str]] = []

        for query_name, nodes in cursor.captures(target_node).items():
            for namespace_node in nodes:
                namespace_captures.append((namespace_node, query_name))

        return sorted(namespace_captures, key=self.get_node_capture_sort_key)

critical

Since QueryCursor is removed in tree-sitter >= 0.21.0, you should execute the query directly using self.query.namespace.captures(target_node). This returns a list of (Node, str) tuples representing the captured nodes and their capture names.

Suggested change
-    def get_namespaces(self, node: Optional[Node] = None) -> List[Tuple[Node, str]]:
-        target_node = self.tree.root_node if node is None else node
-        cursor = QueryCursor(self.query.namespace)
-        namespace_captures: List[Tuple[Node, str]] = []
-        for query_name, nodes in cursor.captures(target_node).items():
-            for namespace_node in nodes:
-                namespace_captures.append((namespace_node, query_name))
-        return sorted(namespace_captures, key=self.get_node_capture_sort_key)
+    def get_namespaces(self, node: Optional[Node] = None) -> List[Tuple[Node, str]]:
+        target_node = self.tree.root_node if node is None else node
+        captures = self.query.namespace.captures(target_node)
+        namespace_captures: List[Tuple[Node, str]] = []
+        for namespace_node, query_name in captures:
+            namespace_captures.append((namespace_node, query_name))
+        return sorted(namespace_captures, key=self.get_node_capture_sort_key)
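The two versions of get_namespaces above disagree about what captures returns: the PR code iterates a mapping of capture name to nodes, while the suggestion iterates (node, name) pairs. Which shape applies depends on the installed py-tree-sitter release, so it is worth checking against that release's documentation. A minimal sketch of a helper that accepts either shape is shown below. FakeNode, flatten_captures, and the sort key (start byte, then capture name) are hypothetical stand-ins, not the PR's get_node_capture_sort_key.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple, Union


@dataclass(frozen=True)
class FakeNode:
    """Hypothetical stand-in for tree_sitter.Node; only start_byte is modelled."""

    start_byte: int


Captures = Union[Dict[str, List[FakeNode]], Iterable[Tuple[FakeNode, str]]]


def flatten_captures(captures: Captures) -> List[Tuple[FakeNode, str]]:
    """Normalize either capture shape into sorted (node, capture_name) pairs."""
    if isinstance(captures, dict):
        # Mapping shape: {capture_name: [node, ...]}
        pairs = [(node, name) for name, nodes in captures.items() for node in nodes]
    else:
        # Pair shape: [(node, capture_name), ...]
        pairs = list(captures)
    # Assumed ordering: position in the source first, capture name as tie-breaker.
    return sorted(pairs, key=lambda pair: (pair[0].start_byte, pair[1]))


a, b = FakeNode(10), FakeNode(2)
as_mapping = {"namespace": [a], "alias": [b]}
as_pairs = [(a, "namespace"), (b, "alias")]
```

Both inputs produce the same ordered list, so the rest of the extractor would not need to know which shape the library returned.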

Comment on lines +39 to +41
def _contains_errors(ts_language, node: Node) -> bool:
    query = Query(ts_language, "(ERROR) @error")
    return bool(QueryCursor(query).captures(node))

critical

Since QueryCursor is removed in tree-sitter >= 0.21.0, you should execute the query directly using query.captures(node).

Suggested change
-def _contains_errors(ts_language, node: Node) -> bool:
-    query = Query(ts_language, "(ERROR) @error")
-    return bool(QueryCursor(query).captures(node))
+def _contains_errors(ts_language, node: Node) -> bool:
+    query = Query(ts_language, "(ERROR) @error")
+    return bool(query.captures(node))

Comment on lines +156 to +162
    def get_string_literals(self, node: Node) -> Iterator[Node]:
        cursor = QueryCursor(self.query.string_literal)
        yield from self.get_captured_nodes(cursor, node)

    def get_integer_literals(self, node: Node) -> Iterator[Node]:
        cursor = QueryCursor(self.query.integer_literal)
        yield from self.get_captured_nodes(cursor, node)

critical

Since QueryCursor is removed, you can pass the query directly to get_captured_nodes without instantiating a cursor.

Suggested change
-    def get_string_literals(self, node: Node) -> Iterator[Node]:
-        cursor = QueryCursor(self.query.string_literal)
-        yield from self.get_captured_nodes(cursor, node)
-    def get_integer_literals(self, node: Node) -> Iterator[Node]:
-        cursor = QueryCursor(self.query.integer_literal)
-        yield from self.get_captured_nodes(cursor, node)
+    def get_string_literals(self, node: Node) -> Iterator[Node]:
+        yield from self.get_captured_nodes(self.query.string_literal, node)
+    def get_integer_literals(self, node: Node) -> Iterator[Node]:
+        yield from self.get_captured_nodes(self.query.integer_literal, node)

Comment on lines +50 to +52
    def parse(self) -> Tree:
        parser = Parser(self.query.language)
        return parser.parse(self.buf)

critical

In tree-sitter >= 0.21.0, the Parser constructor does not accept a Language argument. You must set the language attribute after instantiating the parser.

Suggested change
-    def parse(self) -> Tree:
-        parser = Parser(self.query.language)
-        return parser.parse(self.buf)
+    def parse(self) -> Tree:
+        parser = Parser()
+        parser.language = self.query.language
+        return parser.parse(self.buf)
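Whether the Parser constructor takes the language, or the language must be assigned afterwards, depends on the installed py-tree-sitter release. If both forms need to be tolerated, one option is a small shim that tries the constructor first and falls back to attribute assignment. The sketch below is an illustration only: make_parser, CtorParser, and AttrParser are hypothetical names, and the two classes are stand-ins rather than the real tree_sitter.Parser.

```python
from typing import Any, Callable


def make_parser(parser_cls: Callable[..., Any], language: Any) -> Any:
    """Build a parser, trying the one-argument constructor first."""
    try:
        return parser_cls(language)
    except TypeError:
        # Fall back to the no-argument constructor plus attribute assignment.
        parser = parser_cls()
        parser.language = language
        return parser


class CtorParser:
    """Hypothetical parser whose constructor takes the language."""

    def __init__(self, language):
        self.language = language


class AttrParser:
    """Hypothetical parser that only accepts the language as an attribute."""

    def __init__(self):
        self.language = None


ctor_parser = make_parser(CtorParser, "python")
attr_parser = make_parser(AttrParser, "python")
```

One caveat of this design: a TypeError raised inside the constructor for an unrelated reason also triggers the fallback, so pinning a single supported version range is usually the simpler choice.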

Comment on lines +222 to +224
    def get_content_sections(self) -> Iterator[Node]:
        cursor = QueryCursor(self.query.content)
        yield from self.get_captured_nodes(cursor, self.tree.root_node)

critical

Since QueryCursor is removed, you can pass the query directly to get_captured_nodes without instantiating a cursor.

Suggested change
-    def get_content_sections(self) -> Iterator[Node]:
-        cursor = QueryCursor(self.query.content)
-        yield from self.get_captured_nodes(cursor, self.tree.root_node)
+    def get_content_sections(self) -> Iterator[Node]:
+        yield from self.get_captured_nodes(self.query.content, self.tree.root_node)

import re
from typing import List, Tuple, Iterator, Optional

from tree_sitter import Node, Tree, Parser, QueryCursor

high

In tree-sitter >= 0.21.0, QueryCursor has been removed. Import Query instead to execute queries directly.

Suggested change
-from tree_sitter import Node, Tree, Parser, QueryCursor
+from tree_sitter import Node, Tree, Parser, Query

from typing import Optional
from pathlib import Path

from tree_sitter import Node, Tree, Query, Parser, Language, QueryCursor

high

In tree-sitter >= 0.21.0, QueryCursor has been completely removed. Executing queries is now done directly via Query.captures or Language.query. Importing QueryCursor will raise an ImportError at runtime.

Suggested change
-from tree_sitter import Node, Tree, Query, Parser, Language, QueryCursor
+from tree_sitter import Node, Tree, Query, Parser, Language

Comment on lines +166 to +173
    def parse_integer(self, integer: str) -> int:
        for suffix in self.integer_suffixes:
            if integer.endswith(suffix):
                integer = integer[:-1]
        for prefix, base in self.integer_prefixes:
            if integer.startswith(prefix):
                return int(integer, base)
        return int(integer)

high

There is a bug in parse_integer when stripping suffixes. If an integer literal has multiple suffixes (e.g., 100ul in C#), the loop only strips the last character of the suffix that matched first, leaving the other suffix character intact (e.g., leaving 100u), which then fails to parse as an integer and raises a ValueError. Using rstrip with a combined set of uppercase and lowercase suffixes is more robust and correctly handles multi-character suffixes.

Suggested change
-    def parse_integer(self, integer: str) -> int:
-        for suffix in self.integer_suffixes:
-            if integer.endswith(suffix):
-                integer = integer[:-1]
-        for prefix, base in self.integer_prefixes:
-            if integer.startswith(prefix):
-                return int(integer, base)
-        return int(integer)
+    def parse_integer(self, integer: str) -> int:
+        suffixes = "".join(self.integer_suffixes)
+        if suffixes:
+            integer = integer.rstrip(suffixes + suffixes.upper())
+        for prefix, base in self.integer_prefixes:
+            if integer.startswith(prefix):
+                return int(integer, base)
+        return int(integer)
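The rstrip-based approach can be tried in isolation. The sketch below is a standalone function rather than the PR's method, and its default suffix set ("u", "l") and prefix table ("0x", "0b") are assumptions chosen to resemble C# literals, not values taken from the PR. It also lowercases the literal before the prefix check, which the suggestion above does not do.

```python
from typing import Sequence, Tuple


def parse_integer(
    literal: str,
    suffixes: Sequence[str] = ("u", "l"),  # assumed suffix characters
    prefixes: Sequence[Tuple[str, int]] = (("0x", 16), ("0b", 2)),  # assumed
) -> int:
    """Strip trailing type suffixes, then parse using the base implied by the prefix."""
    suffix_chars = "".join(suffixes)
    # rstrip treats its argument as a set of characters, so "ul", "LU", "UL" all go.
    stripped = literal.rstrip(suffix_chars + suffix_chars.upper())
    lowered = stripped.lower()
    for prefix, base in prefixes:
        if lowered.startswith(prefix):
            # int() accepts the matching prefix when the base is given explicitly.
            return int(stripped, base)
    return int(stripped)


values = [parse_integer(text) for text in ("100ul", "0xFFu", "0b101", "42")]
```

Because rstrip works on a character set, this is only safe while no suffix character is also a valid digit in a supported base. Adding "f" or "d" to the suffix set, for example, would eat the trailing digits of a hexadecimal literal such as 0xFF.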

Comment on lines +74 to +76
    def load_import_signatures(self, signature_file: str) -> Dict[str, set[str]]:
        signatures = json.loads(importlib.resources.read_text(capa.features.extractors.ts.signatures, signature_file))
        return {category: set(names) for category, names in signatures.items()}

medium

importlib.resources.read_text is deprecated since Python 3.11. Use importlib.resources.files instead, which is fully supported in Python 3.10+.

Suggested change
-    def load_import_signatures(self, signature_file: str) -> Dict[str, set[str]]:
-        signatures = json.loads(importlib.resources.read_text(capa.features.extractors.ts.signatures, signature_file))
-        return {category: set(names) for category, names in signatures.items()}
+    def load_import_signatures(self, signature_file: str) -> Dict[str, set[str]]:
+        ref = importlib.resources.files(capa.features.extractors.ts.signatures) / signature_file
+        signatures = json.loads(ref.read_text(encoding="utf-8"))
+        return {category: set(names) for category, names in signatures.items()}
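The files-based loading pattern can be exercised without capa installed. The sketch below builds a throwaway package on disk and reads a JSON resource from it. The package name demo_signatures, the file cs.json, and its contents are made up for the demonstration, and the function takes the package as a parameter instead of hard-coding capa.features.extractors.ts.signatures.

```python
import importlib
import importlib.resources
import json
import sys
import tempfile
from pathlib import Path
from typing import Dict, Set


def load_import_signatures(package, signature_file: str) -> Dict[str, Set[str]]:
    """Read a JSON resource from a package and convert each name list to a set."""
    ref = importlib.resources.files(package) / signature_file
    signatures = json.loads(ref.read_text(encoding="utf-8"))
    return {category: set(names) for category, names in signatures.items()}


with tempfile.TemporaryDirectory() as tmp:
    # Create a minimal, hypothetical package containing one JSON resource.
    pkg_dir = Path(tmp) / "demo_signatures"
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("", encoding="utf-8")
    payload = {"network": ["System.Net", "System.Net"]}
    (pkg_dir / "cs.json").write_text(json.dumps(payload), encoding="utf-8")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        package = importlib.import_module("demo_signatures")
        result = load_import_signatures(package, "cs.json")
    finally:
        # Undo the import-system changes made for the demonstration.
        sys.path.remove(tmp)
        sys.modules.pop("demo_signatures", None)
```

Duplicate names in the JSON collapse into a single entry, since each category is stored as a set.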

saniyafatima07 marked this pull request as ready for review June 10, 2026 18:58
saniyafatima07 marked this pull request as draft June 10, 2026 18:58
Comment thread capa/capabilities/common.py
saniyafatima07 force-pushed the script-feature branch 2 times, most recently from c78e0a6 to eb48b8a June 19, 2026 11:39