feat(reachability): library-mode for all parsers + silent-blackout warning #120

Open
gadievron wants to merge 2 commits into staging/parser-fix-stack from fix/reachability-library-mode
@gadievron

Collaborator

feat(reachability): library-mode for all parsers + silent-blackout warning

Two coupled changes that make OpenAnt usable on data-processing libraries (parsers, deserializers, codecs), where the reachability filter otherwise prunes the entire public-API surface. Both touch the same files (core/parser_adapter.py, utilities/agentic_enhancer/entry_point_detector.py, the six subprocess pipelines), so they ship together.

Base / dependency

This PR is based on staging/parser-fix-stack (the integrated #70–#117 parser-fix stack), not master. It extends #117's library-mode, complements #75's zero-seed net, and builds on the zig parser state from #75/#85/#87/#110, so it only applies on top of that stack. Depends-on: #75, #117 (and the parser-fix stack generally). Retarget the base to master once the stack lands.


Part 1 — silent-blackout warning (advisory; never changes filtering)

Problem: the #75 zero-seed net only fires at exactly zero entry points. A library trips a handful of incidental seeds (code merely containing an input-reading pattern), so a 90%+ reduction looks like a successful filter while the public-API core was dropped. tree-sitter lib/: 661 → 24 units (all wasm_store.c); parser.c/lexer.c/stack.c/subtree.c removed, reported as 96% "reduction" with no warning.

Fix: blackout_warning() in utilities/agentic_enhancer (shared by all filter sites). Warns when 0 of N kept, OR ≥90% pruned with no structural entry point (every seed is an incidental input_pattern match, never unit_type:/name:main/decorator:). Suppressed under library_mode. Wired into core/parser_adapter.py + all six subprocess pipelines. Advisory only — never changes which units are kept. Calibration: silent on a real app (Arkime C: 63%, real main/cli_handler seeds), fires on a library.
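The two triggers and the suppression rule can be sketched as below. This is a minimal illustration, not the code in this PR: the signature, the seed-reason string format, and the message wording are assumptions; only the 0-of-N trigger, the ≥90% threshold, the structural prefixes, and the library_mode suppression come from the description above.

```python
from typing import Iterable, Optional

# Reason prefixes treated as structural entry points. The prefixes are the
# ones named in the PR description; the exact string format is assumed.
STRUCTURAL_PREFIXES = ("unit_type:", "name:main", "decorator:")
PRUNE_THRESHOLD = 0.90  # the ">=90% pruned" trigger


def blackout_warning(total_units: int, kept_units: int,
                     seed_reasons: Iterable[str],
                     library_mode: bool = False) -> Optional[str]:
    """Return an advisory message, or None. Never alters which units are kept."""
    if library_mode or total_units == 0:
        return None  # suppressed under library mode; nothing to report on empty input
    if kept_units == 0:
        return f"reachability kept 0 of {total_units} units; consider --library-mode"
    pruned = 1 - kept_units / total_units
    has_structural = any(r.startswith(STRUCTURAL_PREFIXES) for r in seed_reasons)
    if pruned >= PRUNE_THRESHOLD and not has_structural:
        return (f"{pruned:.0%} of units pruned and every seed is an incidental "
                "input-pattern match; consider --library-mode")
    return None
```

With the tree-sitter numbers quoted above (24 of 661 kept, incidental seeds only) this sketch returns a message; with a structural seed such as name:main present it stays silent.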

Part 2 — opt-in library-mode for all parsers + CLI

Problem: #117 added --library-mode (seed the exported public API) only on the Python path. The other six parsers run as subprocesses with their own reachability copy, so a C/JS/Go/Ruby/PHP/Zig library still collapsed.

Fix: centralize the seed logic into library_seed_ids() (handles both is_exported snake_case on disk and isExported camelCase from the pipelines' normalize). Thread an opt-in library_mode through all four surfaces of each subprocess parser (dispatch → subprocess cmd → --library-mode argparse → seed union before the BFS), plus scan_repository and the openant parse/scan CLI flags. Union-only — never drops a structural entry point, so app scans are unaffected. #117's Python path is behavior-identical.
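A rough sketch of the union-before-BFS step follows. The unit and call-graph shapes (plain dicts keyed by unit id) and the helper names reachable and filter_units are hypothetical; only library_seed_ids, the two key casings, and the union-only rule are taken from the description. The name heuristic covered by the tests is not reproduced here.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def library_seed_ids(units: Dict[str, dict]) -> Set[str]:
    """Ids of units flagged as exported under either key casing."""
    return {uid for uid, unit in units.items()
            if unit.get("is_exported") or unit.get("isExported")}


def reachable(entry_points: Iterable[str],
              call_graph: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first search over caller -> callees edges."""
    seen = set(entry_points)
    queue = deque(seen)
    while queue:
        for callee in call_graph.get(queue.popleft(), []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def filter_units(units: Dict[str, dict], call_graph: Dict[str, List[str]],
                 entry_points: Iterable[str],
                 library_mode: bool = False) -> Set[str]:
    seeds = set(entry_points)
    if library_mode:
        # Union only: structural entry points are never dropped.
        seeds |= library_seed_ids(units)
    return reachable(seeds, call_graph)
```

Because the exported ids are only ever added to the seed set, a scan with structural entry points keeps everything it kept before; the flag can only widen the reachable set.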

Verified end-to-end (parse, no API):

| Project | lang | without --library-mode | with --library-mode |
| --- | --- | --- | --- |
| tree-sitter lib/ | C | 24/661 (blackout) | 550/661 (parser core reachable) |
| google/uuid | Go | 0/78 | 77/78 |
| unicode-display_width | Ruby | 0/13 | 13/13 |
| slugify | JS | 0/8 | 7/8 |
| nikic/iter | PHP | 0/120 | 119/120 |
| ziglibs/known-folders | Zig | 9/21 | 21/21 |

Arkime (a real app) is unchanged either way. (Zig CLI selection needs the companion fix(cli) PR to be reachable via openant parse --language zig.)

Not a security fix

Tooling/coverage changes (they change which code reaches the analyzer), not code-vulnerability patches.

Tests

15 new: tests/test_blackout_warning.py (7 — both triggers + structural/library-mode suppression) and tests/test_library_seed_ids.py (8 — both casings + name heuristic). #117's test_library_mode_reachability.py unchanged (behavior-preserving). Full libs/openant-core suite: 624 passed, 22 skipped.

The #75 zero-seed net only fires at EXACTLY 0 entry points. A library (e.g.
tree-sitter) trips a handful of INCIDENTAL seeds (code merely containing an
input-reading pattern), yielding a 96.6% reduction (712 -> 24, all wasm) that
looks like a successful filter while the real public-API core was dropped.

Add blackout_warning() in utilities/agentic_enhancer (shared by all 7 filter
sites). Advisory ONLY -- never changes which units are kept. Warns on total
blackout (0 kept) OR >=90% pruned with NO structural entry point (route/main/
CLI/handler) -- i.e. all seeds are incidental input-pattern matches. Suppressed
under library_mode (high reduction is then the intended precise result).
Suggests --library-mode in the message. Wired into core/parser_adapter.py + all
6 subprocess pipelines (c/js/go/ruby/php/zig).

Calibration: silent on Arkime (63%/54% reductions, real route/main seeds),
fires on tree-sitter. Tests: 7 new covering both triggers + structural-seed and
library-mode suppression.

#117's library-mode (seed the exported public API so a library with no main/
route/CLI entry point isn't blacked out) lived only in the Python path
(core/parser_adapter). The other six parsers run as subprocesses with their own
reachability copy, so a C/JS/etc. library still collapsed: tree-sitter's C core
pruned 661 -> 24 (all wasm), the public ts_parser_* API never seeded.

Centralize _library_seed_ids into utilities/agentic_enhancer.library_seed_ids
(now handles both is_exported snake_case on disk and isExported camelCase from
the pipelines' in-memory normalize). Thread an opt-in library_mode through all
4 parallel surfaces for each of the 6 subprocess parsers (parse dispatch ->
_parse_<lang> subprocess cmd -> test_pipeline --library-mode argparse -> union
library_seed_ids into entry_points before the BFS), plus scan_repository and the
'openant parse' / 'openant scan' CLI flags. Union-only: never drops a
structurally detected entry point, so app scans are unaffected.
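
How the flag might travel from the dispatcher to a subprocess pipeline is sketched below. The function names build_pipeline_cmd and parse_pipeline_args, the positional repo_path argument, and the use of a Python interpreter for the pipeline script are illustrative assumptions; only the --library-mode flag and the argparse step are named in this commit message.

```python
import argparse
import sys
from typing import List


def build_pipeline_cmd(script: str, repo_path: str,
                       library_mode: bool = False) -> List[str]:
    """Dispatcher side: append the flag only when the caller opted in."""
    cmd = [sys.executable, script, repo_path]
    if library_mode:
        cmd.append("--library-mode")
    return cmd


def parse_pipeline_args(argv: List[str]) -> argparse.Namespace:
    """Pipeline side: the flag defaults to False, so existing callers are unchanged."""
    parser = argparse.ArgumentParser()
    parser.add_argument("repo_path")
    parser.add_argument("--library-mode", action="store_true")
    return parser.parse_args(argv)
```

Keeping the flag as store_true with a False default means a command built without it parses exactly as before.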

Verified end-to-end on tree-sitter C: without the flag the blackout warning
fires (24/661); with --library-mode, 352 public-API seeds -> 550 reachable, the
parser core (parser.c/lexer.c/stack.c/subtree.c/query.c/node.c) now analysable.

Tests: 8 new (library_seed_ids both casings + name heuristic); #117 Python path
unchanged (6 green).