Replace core parsing/resolution engine with Rubydex #447

Open: paracycle wants to merge 8 commits into main from ufuk/rubydex-rewrite

@paracycle
Member

@paracycle paracycle commented Apr 13, 2026

Summary

Replace the per-file AST walking, constant extraction, and resolution pipeline with Rubydex, a high-performance Ruby indexer written in Rust. Constants are resolved from actual parsed definitions rather than Zeitwerk path conventions, enabling more thorough analysis with comparable speed.

  • Replaces constant_resolver, parallel, and direct parser/ast deps with rubydex
  • Replaces better_html with Herb for ERB parsing (preserves block-spanning tags and source positions)
  • Drops the runtime Rails dependency entirely (used only for old constant_resolver autoload paths)
  • Removes the MD5-based file cache (Rubydex is fast enough without it)
  • Removes the parallel gem (fork overhead exceeded the work being done)

Architecture

Before: for each file → parse with Prism → walk AST → extract constant references → resolve via constant_resolver (Zeitwerk paths) → check violations

After: index all workspace files in one Rubydex::Graph batch call → resolve all constants → iterate Declaration#references per declaration → check violations. Association detection (has_many, belongs_to, etc.) runs as a supplementary Prism-based pass since Rubydex treats those as method calls, not constant references — and is targeted to only files that contain association method references via graph.method_references (~1% of files on Shopify Core).

Dependency changes

Removed              Added
constant_resolver    rubydex
parallel             herb
better_html
ast (direct)
parser (direct)

Kept: prism (for association detection), activesupport, bundler, benchmark.

Key design decisions

  1. Full workspace indexing: index_and_resolve indexes ALL Ruby files for cross-package resolution, even when a scoped check (packwerk check components/foo) is requested. Only the scoped files are checked for violations.
  2. Per-declaration iteration: the hot loop walks graph.declarations and uses Declaration#references rather than iterating all 9M+ constant references globally. Skips declarations with no in-scope references entirely.
  3. URI-keyed lookups: the hot loop compares URIs as opaque strings via precomputed hash maps (@checked_uris, @uri_to_package, @uri_to_relative_path). No URI parsing in the inner loop.
  4. Singleton class deduplication: Foo.bar produces references to both Foo and Foo::<Foo> (the singleton class). Skipping Rubydex::SingletonClass declarations gives each user-visible reference once.
  5. Zeitwerk-canonical target paths: when a constant has multiple definitions (e.g. canonical file + class reopenings), prefer the file matching the Zeitwerk naming convention (ApiClient → app/models/api_client.rb) for the violation message's "defined in" hint.
  6. Shared namespace handling: if any definition of a constant is in the source file's package, the reference is treated as local (not a violation). Correctly handles cases like module GraphApi reopened across many packages.
  7. Targeted ERB and association parsing: ERB files are extracted via Herb's extract_ruby which preserves line numbers, then fed into graph.index_source. Association detection only re-parses files that graph.method_references shows contain association method calls.
  8. 0/1-based location conversion: Rubydex uses 0-based line/column, Prism uses 1-based line + 0-based column, Packwerk displays 1-based for both. Conversions happen at the boundary, with regression tests asserting reported locations match the 1-based display convention for both source paths.
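
Design decisions 5 and 6 can be illustrated with a minimal plain-Ruby sketch. It does not use Rubydex or any code from this branch: a constant's definitions are modelled as a list of workspace-relative paths, and the helper names (zeitwerk_suffix, canonical_target, local_reference?, package_for) are hypothetical.

```ruby
# Hypothetical model, not code from this PR: each definition of a constant is
# represented only by the workspace-relative path of the file that defines it.

# "::ApiClient" -> "api_client.rb", "Sales::Order" -> "sales/order.rb".
# Simplified underscoring: it only splits on lower-to-upper boundaries, so
# acronyms such as "HTTPClient" are not handled the way Zeitwerk would.
def zeitwerk_suffix(constant_name)
  segments = constant_name.delete_prefix("::").split("::")
  segments.map { |segment| segment.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase }.join("/") + ".rb"
end

# Decision 5: prefer the definition whose path ends with the conventional
# file name; otherwise fall back to the first definition.
def canonical_target(constant_name, definition_paths)
  suffix = "/" + zeitwerk_suffix(constant_name)
  definition_paths.find { |path| path.end_with?(suffix) } || definition_paths.first
end

# Decision 6: a reference is local when ANY definition of the constant lives
# in the referencing file's package.
def local_reference?(source_package, definition_paths, package_for)
  definition_paths.any? { |path| package_for.call(path) == source_package }
end

# Toy package lookup: "components/<name>" is a package, everything else is ".".
package_for = ->(path) { path[%r{\Acomponents/[^/]+}] || "." }

paths = [
  "components/shop/config/initializers/api_client_patch.rb", # reopening
  "components/platform/app/models/api_client.rb"             # canonical file
]

canonical_target("::ApiClient", paths)
# => "components/platform/app/models/api_client.rb"
local_reference?("components/shop", paths, package_for)    # => true
local_reference?("components/billing", paths, package_for) # => false
```

In this model a reference from components/shop is not a violation because one definition is in its own package, while a reference from components/billing would be reported against the canonical file.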

Performance on Shopify Core (~60k files)

We're within 3-4s of released speed while doing significantly more analysis. The 31k extra offenses come from references the old ConstNodeInspector/constant_resolver pipeline missed — primarily ERB files (where the old concatenated-Ruby approach often produced unparseable output) and shared namespace references. After running packwerk update-todo, only ~620 strict-mode violations remain (which need manual resolution since strict packages reject new entries in package_todo.yml).

Deleted code

16 source files, 11 test files, 3 RBI files removed:

  • AST walking pipeline: file_processor, node_processor, node_processor_factory, node_visitor, node_helpers
  • Constant extraction: const_node_inspector, constant_name_inspector, constant_discovery, parsed_constant_definitions, reference_extractor, unresolved_reference, association_inspector
  • Per-file caching: cache
  • Parsers: parsers/ruby, parsers/factory, parsers/parser_interface
  • Rails integration: rails_load_paths

Tests

134 tests pass (was 132 + 2 new integration tests for line/column locations). Sorbet typecheck clean. Rubocop clean.

The new line/column integration tests assert reported locations for both code paths (Rubydex source for plain constant references, Prism source for ActiveRecord associations) and catch off-by-one regressions.

Commits

  • Replace core parsing/resolution engine with Rubydex
  • Optimize Rubydex pipeline for correctness and performance
  • Refactor `extract_refs_by_file` into focused helpers
  • Convert Rubydex/Prism 0-based locations to Packwerk's 1-based convention
  • Refactor: URI-keyed lookups, named types, simpler helpers
  • Replace better_html with Herb for ERB parsing
  • Bump gem RBI files

@paracycle paracycle requested a review from a team as a code owner April 13, 2026 23:25
@paracycle paracycle force-pushed the ufuk/rubydex-rewrite branch 3 times, most recently from 7fa6be2 to 46eea11 Compare April 14, 2026 00:00
@exterm
Contributor

exterm commented Apr 14, 2026

Cool stuff, Ufuk. Would this mean packwerk doesn't depend on the interrogated codebase using zeitwerk anymore?

@paracycle
Member Author

> Cool stuff, Ufuk. Would this mean packwerk doesn't depend on the interrogated codebase using zeitwerk anymore?

That's correct. Rubydex can do proper Ruby constant resolution, so we don't need to use Zeitwerk heuristics to figure out what constant references resolve to based on their filename. At least, we shouldn't, and, if there are any problems with the resolution, then we should fix them in Rubydex.

@paracycle paracycle force-pushed the ufuk/rubydex-rewrite branch from eda8221 to 3342d3d Compare April 16, 2026 19:58
Replaces the entire AST walking, constant extraction, and resolution
pipeline with Rubydex, a high-performance Ruby indexer written in Rust.

Architectural changes:
- RunContext rewritten to create a Rubydex::Graph, index files,
  resolve all constants, and iterate resolved references to detect
  cross-package violations
- ParseRun simplified to two phases (index_and_resolve + find_offenses)
  instead of per-file parallel processing
- Association detection rewritten using Prism native AST (no parser
  gem translation layer needed), runs as a supplementary pass since
  Rubydex doesn't understand ActiveRecord associations
- ERB support: Parsers::Erb simplified to extract_ruby_source which
  feeds into Rubydex's index_source API
- ApplicationValidator uses Rubydex::Graph instead of ConstantResolver

Dependency changes:
- Removed: constant_resolver, parallel, ast (direct), parser (direct)
- Added: rubydex
- Kept: prism (for association detection), better_html (for ERB)

Deleted 16 source files, 11 test files, 3 RBI files.

Also regenerates Sorbet RBIs and updates CI matrix to 3.2-3.4.
@paracycle paracycle force-pushed the ufuk/rubydex-rewrite branch 5 times, most recently from 361ea57 to 9580fff Compare May 12, 2026 20:54
Correctness fixes:
- Fix NotFileUriError for ERB locations using file:// URI prefix
- Prefix constant names with :: to match package_todo.yml format
- Use Rubydex::SingletonClass type check (with #owner) to map metaclass
  names (Foo::<Foo>) back to their owning class (Foo)
- Skip SingletonClass declarations to avoid double-counting references:
  Foo.bar produces refs to BOTH Foo and Foo::<Foo>, so only iterating
  Class declarations gives us each user-visible reference once
- Select canonical Zeitwerk definition path when constant has multiple
  definitions (e.g. canonical file + class reopenings)
- Fix shared namespace false positives by checking all definitions:
  if any definition is in source package, the reference is local

Performance optimizations:
- Iterate per-declaration via Declaration#references instead of
  iterating all 2.7M+ constant references globally. Skips unreferenced
  declarations entirely and amortizes per-constant work
- Use location.uri instead of to_file_path for 5x faster extraction
  (raw string access vs URI parsing on 2.7M+ refs)
- Target association parsing via graph.method_references: only ~1% of
  files have associations, so we parse 1,200 files instead of 57,000
- Eliminate redundant Dir.glob (8.3s saved on Core) by reusing the
  file list from FilesForProcessing
- Inline URI->path conversion in hot loops
- Cache per-file package lookups
- Lazy compute defn_packages only for declarations with in-set refs
- Remove parallel gem: fork overhead exceeded the work being done
  (benchmarks showed 2-3x slowdown)
- Remove Rails dependency: load_paths were only used by old
  ConstantResolver, no longer needed with Rubydex

Update rubydex dependency to >= 0.2.3.

End-to-end check on Shopify Core (~60k files): ~45s total.
@paracycle paracycle force-pushed the ufuk/rubydex-rewrite branch from 9580fff to dbdf2a0 Compare May 12, 2026 20:55
paracycle added 3 commits May 13, 2026 00:25
Split the 70+ line method with nested loops, lazy initialization, and
inline URI parsing into 4 focused methods:

- extract_refs_by_file: linear main loop -- collect refs in set,
  compute defn data, check each ref
- collect_in_set_references: gathers [source_path, location] pairs
  for refs that land in the check set
- definition_packages_and_target: computes the package set and
  canonical Zeitwerk-preferred target path for a declaration
- uri_to_relative_path: URI->path conversion used by both helpers

No behavioral or performance changes.
Rubydex reports locations with 0-based line and 0-based column.
Prism reports 1-based line and 0-based column.
Packwerk displays locations as 1-based line:column.

Without normalization, every reported offense pointed to the wrong line
(off by one) and column (off by one). For example, a reference on line
84 of a file was reported as 83.

Add integration tests asserting reported locations for both code paths:
- Rubydex source: plain constant reference (Foo.new)
- Prism source: ActiveRecord association (belongs_to :foo)
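
The boundary conversion described above can be sketched as follows. This is an illustration of the three conventions stated in this commit message, not the code from this branch; DisplayLocation and both helper names are hypothetical.

```ruby
# Display convention from the text above: 1-based line and 1-based column.
DisplayLocation = Struct.new(:line, :column) do
  def to_s
    "#{line}:#{column}"
  end
end

# For a source that reports 0-based line and 0-based column
# (what the text above says Rubydex does).
def from_zero_based(line, column)
  DisplayLocation.new(line + 1, column + 1)
end

# For a source that reports 1-based line and 0-based column
# (what the text above says Prism does).
def from_one_based_line(line, column)
  DisplayLocation.new(line, column + 1)
end

# A reference on display line 84 arrives from a 0-based source as line 83.
from_zero_based(83, 4).to_s     # => "84:5"
from_one_based_line(84, 4).to_s # => "84:5"
```

Doing the conversion once at the boundary means both code paths hand the same display coordinates to the offense formatter, which is what the two integration tests assert.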
Iterate per-declaration using URI strings as opaque keys throughout the
hot reference loop. Builds three lookup tables once after indexing:

- @checked_uris        : URIs of files in the check set
- @uri_to_package      : URI -> owning Package
- @uri_to_relative_path: URI -> workspace-relative path

This replaces inline URI string parsing in each iteration with single
hash lookups, and eliminates the redundant @file_uri_prefix path.
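
A rough sketch of how three such tables could be built once and then consulted in the hot loop. It is plain Ruby under stated assumptions: build_lookup_tables, in_scope_package, and the LookupTables struct are hypothetical names, packages are represented by their root path strings, and URIs are formed by simple concatenation without percent-encoding.

```ruby
require "set"

# Hypothetical container mirroring the three tables named above.
LookupTables = Struct.new(:checked_uris, :uri_to_package, :uri_to_relative_path)

# Build all tables in one pass over the workspace file list.
# package_roots is a list like [".", "components/a"]; "." is the root package.
def build_lookup_tables(workspace_root, relative_paths, checked_paths, package_roots)
  checked_set = Set.new(checked_paths)
  # Longest root first, so nested packages win over their parents.
  nested_roots = package_roots.reject { |root| root == "." }.sort_by { |root| -root.length }
  tables = LookupTables.new(Set.new, {}, {})

  relative_paths.each do |relative_path|
    # Simplification: no percent-encoding of special characters.
    uri = "file://#{File.join(workspace_root, relative_path)}"
    tables.uri_to_relative_path[uri] = relative_path
    tables.uri_to_package[uri] =
      nested_roots.find { |root| relative_path.start_with?("#{root}/") } || "."
    tables.checked_uris << uri if checked_set.include?(relative_path)
  end

  tables
end

# Inner-loop use: the URI is only ever compared as an opaque string.
def in_scope_package(tables, uri)
  return nil unless tables.checked_uris.include?(uri)

  tables.uri_to_package[uri]
end

tables = build_lookup_tables(
  "/work",
  ["components/a/app/models/x.rb", "lib/y.rb"],
  ["components/a/app/models/x.rb"],
  [".", "components/a"]
)
in_scope_package(tables, "file:///work/components/a/app/models/x.rb") # => "components/a"
in_scope_package(tables, "file:///work/lib/y.rb")                     # => nil
```

All string manipulation happens once per file at build time; each reference then costs one set membership test and at most two hash lookups.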

Also:
- Replace tuple type aliases with PORO classes ExtractedRef and
  AssociationReference for clarity at call sites
- Replace tuple return from definition_set_for with a DefinitionSet PORO
- Rename collect_in_check_references to checked_references (filter_map)
- Remove redundant tuple element from checked_references return value
- Replace ||= accumulator mutation in definition_set_for with three
  focused passes: collect URIs, derive packages, pick canonical target
- Skip Rubydex::SingletonClass declarations: their references duplicate
  the regular class's references (Foo.bar produces refs to BOTH Foo and
  Foo::<Foo>), so iterating Class declarations gives each user-visible
  reference once

No behavioral change. Runs in ~30s on Shopify Core (down from ~48s).
@paracycle paracycle requested a review from rafaelfranca May 12, 2026 23:13
paracycle added 2 commits May 13, 2026 15:19
Herb is Shopify's purpose-built ERB parser that:
- Handles ERB blocks spanning tags (e.g. <%= form_for ... do |f| %>...<% end %>)
- Preserves original line/column positions in extracted Ruby
- Has a much smaller dependency footprint than better_html

Better_html worked but pulled in parser, ast, action_view, and other
heavy transitive dependencies. Herb is a single Rust-backed gem.

Replaces ~50 lines of better_html-specific code with a 3-line
Herb.extract_ruby call.

Also adds an explicit require for active_support/core_ext/object/blank
which better_html was loading transitively.

On Shopify Core: ~30s -> ~22s of packwerk check time. Same 31,411
offense count -- behavior preserved.
@paracycle paracycle force-pushed the ufuk/rubydex-rewrite branch from e03e231 to b7f7281 Compare May 13, 2026 12:22
@exterm
Copy link
Copy Markdown
Contributor

exterm commented May 13, 2026

The blog announcement mentions about 3x more constant references resolved in our monolith. I don't know how the codebase has developed over the last 4 years but I used packwerk pretty heavily on it before that and I'm sure we were not accidentally missing over 60% of constant references. Can you expand on what references this new implementation finds that the old one didn't?

I'm worried there may be a change in semantics here.

…haviors

Three integration tests covering the behavior changes between the
pre-Rubydex pipeline and this branch:

- Each namespace prefix in a nested reference is reported as a separate
  offense. `Sales::Order::Error.new` now produces 3 offenses (for
  `::Sales`, `::Sales::Order`, `::Sales::Order::Error`) instead of 1.
  This explains the bulk of the additional offenses observed when
  running this branch on real codebases.

- The same prefix-expansion applies to constant references inside ERB
  files, which is why ERB templates contribute disproportionately to
  the new offense count.

- Namespace reopenings whose constant is also defined in the source
  file's package are not flagged as cross-package violations. An
  earlier iteration of this branch incorrectly produced ~219k bogus
  offenses on the Shopify monolith from this; this test guards against
  reintroducing that bug.

The first two tests fail on origin/main; the third passes on both
branches and serves only as a regression guard.
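
To make the prefix expansion concrete, here is a small plain-Ruby sketch of the naming involved. It only manipulates strings and is not how either pipeline extracts references; namespace_prefixes and leaf_only are hypothetical helpers.

```ruby
# Every namespace prefix of a nested constant path, each prefixed with "::"
# to match the package_todo.yml naming mentioned earlier in this PR.
def namespace_prefixes(constant_path)
  parts = constant_path.delete_prefix("::").split("::")
  (1..parts.length).map { |count| "::" + parts.first(count).join("::") }
end

# Reporting only the full (leaf) constant gives one name per reference.
def leaf_only(constant_path)
  namespace_prefixes(constant_path).last(1)
end

namespace_prefixes("Sales::Order::Error")
# => ["::Sales", "::Sales::Order", "::Sales::Order::Error"]
leaf_only("Sales::Order::Error")
# => ["::Sales::Order::Error"]
```

A reference nested N levels deep therefore yields N candidate offenses under prefix expansion instead of 1, which is consistent with deeply nested references inflating the offense count.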
@exterm
Copy link
Copy Markdown
Contributor

exterm commented May 15, 2026

Thank you @paracycle, that explains it.

@paracycle
Copy link
Copy Markdown
Member Author

> The blog announcement mentions about 3x more constant references resolved in our monolith. I don't know how the codebase has developed over the last 4 years but I used packwerk pretty heavily on it before that and I'm sure we were not accidentally missing over 60% of constant references. Can you expand on what references this new implementation finds that the old one didn't?
>
> I'm worried there may be a change in semantics here.

Valid concern, thanks for asking me to dig deeper here.

I got an LLM to analyze the new offenses flagged on this branch and had it add 3 new tests in this commit. From what I understand, most of the new violations come from us flagging each nested reference as a separate violation, and nested references are much more prevalent in ERB files.

This does seem to be a change in behaviour, but I am not sure if this is better or worse.

@exterm
Copy link
Copy Markdown
Contributor

exterm commented May 15, 2026

From my perspective it is worse (a steep increase in the number of violations without adding much signal, as these are the same violation at a different level of abstraction), but I am only one user in the end.

@paracycle
Copy link
Copy Markdown
Member Author

@exterm There seems to be another category of violations that accounts for about 9000 new violations in Core:

Shopify::Foo is defined in lib/shopify/foo.rb, which lives in the root package (`.`). Any reference to it from components/X becomes a violation against `.`.

The old pipeline:

  • Shopify::Foo.foo → extract just Shopify
  • constant_resolver looks for shopify.rb in autoload paths
  • Probably doesn't find a definition (because lib/ isn't typically in autoload paths) → silently skipped
  • Result: no violation

The new pipeline:

  • Shopify::Foo.foo → extract Shopify::Foo (after collapse, the leaf)
  • Rubydex finds Shopify::Foo defined in lib/shopify/foo.rb
  • Source package is components/baz, target package is `.`
  • Result: dependency violation against `.`

This is a real cross-package reference that the old pipeline missed because it didn't know lib/shopify/ defines Shopify::Foo.

What do you think about this one?

@exterm
Copy link
Copy Markdown
Contributor

exterm commented May 16, 2026

I think there is something wrong with that analysis (I don't see why packwerk on current main would extract only Shopify from Shopify::Foo.foo; it should extract Shopify::Foo).

However there is also something very valid in there - Packwerk with constant_resolver can only resolve constants defined in autoloaded code. That was an implementation limitation, not the preferred behavior. So, barring a misunderstanding on my side, these 9000 new violations are probably valid.

I believe modern Rails applications autoload lib by default, in which case it will make much less of a difference. Still, I see it as a considerable improvement if packwerk can now trace constants to non-autoloaded code.
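
The limitation discussed in this thread can be modelled in a few lines of plain Ruby. This is a deliberately simplified sketch, not the constant_resolver or Rubydex implementation: resolve_by_path, resolve_by_index, and conventional_path are hypothetical helpers, and the definition index is a hand-written hash standing in for what an indexer would build from parsed definitions.

```ruby
# "Shopify::Foo" -> "shopify/foo.rb" (simplified underscoring, no acronyms).
def conventional_path(constant_name)
  constant_name.delete_prefix("::").split("::")
    .map { |segment| segment.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase }
    .join("/") + ".rb"
end

# Path-convention lookup: a constant is found only if a conventionally named
# file exists under one of the autoload roots.
def resolve_by_path(constant_name, autoload_roots, files)
  relative = conventional_path(constant_name)
  autoload_roots.each do |root|
    candidate = File.join(root, relative)
    return candidate if files.include?(candidate)
  end
  nil
end

# Definition-index lookup: a constant is found wherever it was defined.
def resolve_by_index(constant_name, definition_index)
  definition_index[constant_name.delete_prefix("::")]
end

files = ["lib/shopify/foo.rb", "app/models/order.rb"]
definition_index = { "Shopify::Foo" => "lib/shopify/foo.rb", "Order" => "app/models/order.rb" }

resolve_by_path("Shopify::Foo", ["app/models"], files)        # => nil
resolve_by_path("Shopify::Foo", ["app/models", "lib"], files) # => "lib/shopify/foo.rb"
resolve_by_index("Shopify::Foo", definition_index)            # => "lib/shopify/foo.rb"
```

In this model the path-based lookup silently misses Shopify::Foo unless lib is an autoload root, while the index-based lookup finds it either way, so the difference shrinks for applications that autoload lib.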
