Skip to content

fix(parsers/c): header static-inline call edges + same-(file,name) func_id collisions#121

Open
gadievron wants to merge 2 commits into
staging/parser-fix-stackfrom
fix/c-parser-bugs
Open

fix(parsers/c): header static-inline call edges + same-(file,name) func_id collisions#121
gadievron wants to merge 2 commits into
staging/parser-fix-stackfrom
fix/c-parser-bugs

Conversation

@gadievron

Copy link
Copy Markdown
Collaborator

Two pre-existing C/C++ parser-correctness fixes, stacked on staging/parser-fix-stack because both build on the C-parser commits already in this stack (deterministic include-resolution, u1/u2 extractor fixes, member-dispatch) — none of which are on master yet. Surfaced while auditing a tree-sitter parse; details in the dossier carried alongside this work.

Commit 1 — static inline functions in included headers resolve as call edges

_is_visible_from rejected every cross-file static callee. That is right for a .c translation unit but wrong for a static inline helper in an #included header (each including TU gets its own copy and can call it). The include-resolution loop already found the header callee via include_map, then dropped the edge as internal-linkage.

The fix is scoped to the include-resolution site only: a candidate whose file ends in a header extension (.h/.hpp/.hxx/.hh) is visible regardless of static. The repo-wide unique-name fallback and the prototype fallback stay strict — a static helper in another .c still does not resolve cross-TU (the #84 precision invariant test_unique_name_fallback_skips_static_in_other_file stays green).

Commit 2 — same-(file,name) functions no longer collapse to one func_id

func_id = "<file>:<name>" carries no signature or preprocessor condition, and the store was an unguarded keep-last assignment. Two functions sharing (file, name) silently overwrote each other, corrupting the inventory and the call graph: on tree-sitter, 14 real wasm_store.c implementations were replaced by their #else no-op stubs, and C++ overloads collapsed onto one node so a call into one overload reached the other's callee.

New _store_function disambiguates collision-only — uniquely-named functions keep the exact path:name id, so the existing hardcoded id literals are untouched:

  • Same signature (#ifdef/#else, ODR-dup): keep one node, prefer the larger body — the stub is shorter regardless of which arm tree-sitter emits first.
  • Different signature (overload): both are real; fold a colon-free signature discriminator into the func_id key only. name stays bare so the call-graph builder (which resolves by name) still finds both overloads. Colon-free because downstream id splits rsplit on : (std::stringstd..string).

Both store sites (process-function + lambda) route through the helper. This complements the existing template-specialization disambiguation (bug [39]), which folds its discriminator into the name (correct there — a template call is written g<int>; an overload call is written bare).

Tests

New: tests/parsers/c/test_c_static_inline_header_resolution.py, tests/parsers/c/test_c_collision_discriminator.py.

$ pytest tests/parsers/c/ -q
71 passed

Full repo suite: 609 passed, 22 skipped.

Scope / follow-ups

The same (file,name)-collapse exists in the JS/TS, Python, PHP, and Ruby extractors (each with different collision semantics and parameter shapes). Those are intentionally not in this PR — each needs its own fix and tests. This PR addresses the C/C++ parser, where the measured impact (14 lost wasm_store.c functions) was demonstrated.

🤖 Generated with Claude Code

`_is_visible_from` rejected any cross-file `static` callee, which is correct
for a `.c` translation unit but wrong for a `static inline` function in an
`#include`d header (the header-inline idiom — every including TU gets its own
copy and can call it). The include-resolution loop found the header callee via
`include_map`, then dropped the edge as internal-linkage.

Fix is scoped to the include-resolution site only: a candidate whose file ends
in a header extension (.h/.hpp/.hxx/.hh) is visible regardless of `static`. The
repo-wide unique-name fallback and the prototype fallback stay strict — a
`static` helper in another `.c` still does not resolve cross-TU.

Tests: tests/parsers/c/test_c_static_inline_header_resolution.py (static-inline
header resolves; static-in-other-.c still rejected; extern header still
resolves). Precision invariant preserved:
  $ pytest tests/parsers/c/test_c_static_inline_header_resolution.py \
           tests/parsers/c/test_c_call_resolution_precision.py -q
  12 passed in 0.66s

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@gadievron

Copy link
Copy Markdown
Collaborator Author

Verified end-to-end on tree-sitter (post-fix, parse-only — no API)

Re-parsed tree-sitter with both fixes in place to confirm they deliver, with before/after derived from the raw definition stream (_store_function spy on lib/src/wasm_store.c):

Commit 2 (collision): 67 raw function_definition nodes → 53 distinct func_ids → 14 collide on (file,name). Under the old keep-last store, 13 of those kept the shorter #else stub; the discriminator's prefer-larger now keeps the real implementation. Examples:

ts_wasm_language_release:     old kept line2105 (len 71 stub) -> new keeps line1963 (len 1594 real)
ts_wasm_store_call_scanner_scan: old kept line2057 (len 199)  -> new keeps line1854 (len 886 real)
ts_wasm_store_delete:         old kept line2013 (len 62 stub) -> new keeps line1039 (len 481 real)

(13 flips total — correcting the PR description's "14": 14 names collide, 13 had the real body displaced by a stub.)

Commit 1 (header static-inline): parsing lib/src/*.c + lib/src/*.h + lib/include/tree_sitter/*.h and building the call graph yields 22 edges that land on static inline functions defined in headers — e.g. ts_subtree_total_size -> ts_subtree_size, ts_subtree_total_size -> ts_subtree_padding, ts_point_edit -> point_add. These are exactly the edges _is_visible_from dropped pre-fix (the unit test pins the single-edge case as [] before the change). Total resolved edges in that scope: 835.

@gadievron

Copy link
Copy Markdown
Collaborator Author

Blast-radius review (independent re-derivation + judge)

Verified that both changes are surgical and fully contained. Re-derived independently, then reconciled. Confirmed:

  • Header-visibility change touches exactly one site. _is_visible_from has 3 call sites (call_graph_builder.py:510/518/529); only the include-resolution branch (510) gains the header bypass. The repo-wide unique-name fallback (518) and prototype fallback (529) stay strict — a static helper in another .c still does not resolve cross-TU.
  • Collision discriminator is collision-only and colon-free. Uniquely-named functions keep a byte-identical path:name id (pinned by test_unique_name_id_is_byte_identical); the discriminator goes into the func_id key only, name stays bare so name-based call resolution still finds both overloads. Every file-recovery split downstream uses index-0 / rsplit(':',1)[0] (diff_filter.py:179, repository_index.py:174, export_csv.py:90, generate_report.py:52, reporter.py:264/299, cli.py:667) — all immune to a colon-free discriminator. CSV uses csv.DictWriter (auto-quotes the comma/parens). No split(':')[1] / partition / jsonschema id-pattern consumer exists. The discriminated id flows through dataset.json, the call graph, and the reachability filter as a consistent opaque string.

Known edge cases (documented, not blocking)

  1. (med) LLM-reachability recall. core/llm_reachability.py:291 silently skips a unit_id the model fails to echo verbatim; an overload id with spaces/parens is marginally harder to copy, so an overload's LLM-derived reachability signal can be dropped. Fail-soft (it over-seeds elsewhere).
  2. (low) Display labels. Sites that derive a name via split(':')[-1] (experiment.py:373/595, generate_report.py:356, cli.py:690) now show pick(int a,int b) instead of pick. Lookups are unaffected — the full id remains the dict key.
  3. (low) ODR-pathological headers. Two different headers each defining static helper() and both included by one TU now resolve to one deterministic (sorted-first) edge vs none pre-fix. ODR-illegal input; strictly an improvement.
  4. (inherent, unchanged) Inbound overload resolution. A call to an overloaded name still resolves to the first same-file overload (name-based resolution can't arg-type-match). Both overloads' outgoing edges are now correct — strictly better than the pre-fix conflation.

Tests: 609 passed / 22 skipped.

…c_id

`func_id = "<file>:<name>"` carries no signature or preprocessor condition, and
the store was an unguarded `self.functions[func_id] = ...` (keep-last). Two
functions sharing (file, name) silently overwrote each other, corrupting the
inventory AND the call graph: on tree-sitter, 14 real wasm_store.c
implementations were replaced by their `#else` no-op stubs; C++ overloads
collapsed onto one node so a call into one overload reached the other's callee.

New `_store_function` disambiguates collision-only (unique names keep the exact
`path:name` id — the hundreds of hardcoded id literals in the suite are
untouched):
  - SAME signature (#ifdef/#else, ODR-dup): keep ONE node, prefer the larger
    body — the stub is shorter regardless of which arm tree-sitter emits first.
  - DIFFERENT signature (overload): both are real; fold a COLON-FREE signature
    discriminator into the func_id key only. `name` stays bare so the call-graph
    builder (which resolves by `name`) still finds both. Colon-free because
    downstream id splits rsplit on ':' (std::string -> std..string).

Both store sites (process-function + lambda) route through the helper.

Tests: tests/parsers/c/test_c_collision_discriminator.py (overloads survive,
callees not conflated, #ifdef keeps larger body, ids colon-free, unique-name id
byte-identical):
  $ pytest tests/parsers/c/test_c_collision_discriminator.py -q
  5 passed in 0.47s
  $ pytest tests/parsers/c/ -q
  71 passed in 5.71s

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant