Replace core parsing/resolution engine with Rubydex by paracycle · Pull Request #447 · Shopify/packwerk

paracycle · 2026-04-13T23:25:48Z

Summary

Replace the per-file AST walking, constant extraction, and resolution pipeline with Rubydex, a high-performance Ruby indexer written in Rust. Constants are resolved from actual parsed definitions rather than Zeitwerk path conventions, enabling more thorough analysis with comparable speed.

Replaces constant_resolver, parallel, and direct parser/ast deps with rubydex
Replaces better_html with Herb for ERB parsing (preserves block-spanning tags and source positions)
Drops the runtime Rails dependency entirely (used only for old constant_resolver autoload paths)
Removes the MD5-based file cache (Rubydex is fast enough without it)
Removes the parallel gem (fork overhead exceeded the work being done)

Architecture

Before: for each file → parse with Prism → walk AST → extract constant references → resolve via constant_resolver (Zeitwerk paths) → check violations

After: index all workspace files in one Rubydex::Graph batch call → resolve all constants → iterate Declaration#references per declaration → check violations. Association detection (has_many, belongs_to, etc.) runs as a supplementary Prism-based pass since Rubydex treats those as method calls, not constant references — and is targeted to only files that contain association method references via graph.method_references (~1% of files on Shopify Core).

Dependency changes

Removed	Added
`constant_resolver`	`rubydex`
`parallel`	`herb`
`better_html`
`ast` (direct)
`parser` (direct)

Kept: prism (for association detection), activesupport, bundler, benchmark.

Key design decisions

Full workspace indexing: index_and_resolve indexes ALL Ruby files for cross-package resolution, even when a scoped check (packwerk check components/foo) is requested. Only the scoped files are checked for violations.
Per-declaration iteration: the hot loop walks graph.declarations and uses Declaration#references rather than iterating all 9M+ constant references globally. Skips declarations with no in-scope references entirely.
URI-keyed lookups: the hot loop compares URIs as opaque strings via precomputed hash maps (@checked_uris, @uri_to_package, @uri_to_relative_path). No URI parsing in the inner loop.
Singleton class deduplication: Foo.bar produces references to both Foo and Foo::<Foo> (the singleton class). Skipping Rubydex::SingletonClass declarations gives each user-visible reference once.
Zeitwerk-canonical target paths: when a constant has multiple definitions (e.g. canonical file + class reopenings), prefer the file matching the Zeitwerk naming convention (ApiClient → app/models/api_client.rb) for the violation message's "defined in" hint.
Shared namespace handling: if any definition of a constant is in the source file's package, the reference is treated as local (not a violation). Correctly handles cases like module GraphApi reopened across many packages.
Targeted ERB and association parsing: ERB files are extracted via Herb's extract_ruby which preserves line numbers, then fed into graph.index_source. Association detection only re-parses files that graph.method_references shows contain association method calls.
0/1-based location conversion: Rubydex uses 0-based line/column, Prism uses 1-based line + 0-based column, Packwerk displays 1-based for both. Conversions happen at the boundary, with regression tests asserting reported locations match the 1-based display convention for both source paths.

Performance on Shopify Core (~60k files)

We're within 3-4s of released speed while doing significantly more analysis. The 31k extra offenses come from references the old ConstNodeInspector/constant_resolver pipeline missed — primarily ERB files (where the old concatenated-Ruby approach often produced unparseable output) and shared namespace references. After running packwerk update-todo, only ~620 strict-mode violations remain (which need manual resolution since strict packages reject new entries in package_todo.yml).

Deleted code

16 source files, 11 test files, 3 RBI files removed:

AST walking pipeline: file_processor, node_processor, node_processor_factory, node_visitor, node_helpers
Constant extraction: const_node_inspector, constant_name_inspector, constant_discovery, parsed_constant_definitions, reference_extractor, unresolved_reference, association_inspector
Per-file caching: cache
Parsers: parsers/ruby, parsers/factory, parsers/parser_interface
Rails integration: rails_load_paths

Tests

134 tests pass (was 132 + 2 new integration tests for line/column locations). Sorbet typecheck clean. Rubocop clean.

The new line/column integration tests assert reported locations for both code paths (Rubydex source for plain constant references, Prism source for ActiveRecord associations) and catch off-by-one regressions.

Commits

Replace core parsing/resolution engine with Rubydex
Optimize Rubydex pipeline for correctness and performance
Refactor `extract_refs_by_file` into focused helpers
Convert Rubydex/Prism 0-based locations to Packwerk's 1-based convention
Refactor: URI-keyed lookups, named types, simpler helpers
Replace better_html with Herb for ERB parsing
Bump gem RBI files

exterm · 2026-04-14T13:40:19Z

Cool stuff, Ufuk. Would this mean packwerk doesn't depend on the interrogated codebase using zeitwerk anymore?

paracycle · 2026-04-14T14:21:36Z

Cool stuff, Ufuk. Would this mean packwerk doesn't depend on the interrogated codebase using zeitwerk anymore?

That's correct. Rubydex can do proper Ruby constant resolution, so we don't need to use Zeitwerk heuristics to figure out what constant references resolve to based on their filename. At least, we shouldn't, and, if there are any problems with the resolution, then we should fix them in Rubydex.

Replaces the entire AST walking, constant extraction, and resolution pipeline with Rubydex, a high-performance Ruby indexer written in Rust. Architectural changes: - RunContext rewritten to create a Rubydex::Graph, index files, resolve all constants, and iterate resolved references to detect cross-package violations - ParseRun simplified to two phases (index_and_resolve + find_offenses) instead of per-file parallel processing - Association detection rewritten using Prism native AST (no parser gem translation layer needed), runs as a supplementary pass since Rubydex doesn't understand ActiveRecord associations - ERB support: Parsers::Erb simplified to extract_ruby_source which feeds into Rubydex's index_source API - ApplicationValidator uses Rubydex::Graph instead of ConstantResolver Dependency changes: - Removed: constant_resolver, parallel, ast (direct), parser (direct) - Added: rubydex - Kept: prism (for association detection), better_html (for ERB) Deleted 16 source files, 11 test files, 3 RBI files. Also regenerates Sorbet RBIs and updates CI matrix to 3.2-3.4.

Correctness fixes: - Fix NotFileUriError for ERB locations using file:// URI prefix - Prefix constant names with :: to match package_todo.yml format - Use Rubydex::SingletonClass type check (with #owner) to map metaclass names (Foo::<Foo>) back to their owning class (Foo) - Skip SingletonClass declarations to avoid double-counting references: Foo.bar produces refs to BOTH Foo and Foo::<Foo>, so only iterating Class declarations gives us each user-visible reference once - Select canonical Zeitwerk definition path when constant has multiple definitions (e.g. canonical file + class reopenings) - Fix shared namespace false positives by checking all definitions: if any definition is in source package, the reference is local Performance optimizations: - Iterate per-declaration via Declaration#references instead of iterating all 2.7M+ constant references globally. Skips unreferenced declarations entirely and amortizes per-constant work - Use location.uri instead of to_file_path for 5x faster extraction (raw string access vs URI parsing on 2.7M+ refs) - Target association parsing via graph.method_references: only ~1% of files have associations, so we parse 1,200 files instead of 57,000 - Eliminate redundant Dir.glob (8.3s saved on Core) by reusing the file list from FilesForProcessing - Inline URI->path conversion in hot loops - Cache per-file package lookups - Lazy compute defn_packages only for declarations with in-set refs - Remove parallel gem: fork overhead exceeded the work being done (benchmarks showed 2-3x slowdown) - Remove Rails dependency: load_paths were only used by old ConstantResolver, no longer needed with Rubydex Update rubydex dependency to >= 0.2.3. End-to-end check on Shopify Core (~60k files): ~45s total.

Split the 70+ line method with nested loops, lazy initialization, and inline URI parsing into 4 focused methods: - extract_refs_by_file: linear main loop -- collect refs in set, compute defn data, check each ref - collect_in_set_references: gathers [source_path, location] pairs for refs that land in the check set - definition_packages_and_target: computes the package set and canonical Zeitwerk-preferred target path for a declaration - uri_to_relative_path: URI->path conversion used by both helpers No behavioral or performance changes.

Rubydex reports locations with 0-based line and 0-based column. Prism reports 1-based line and 0-based column. Packwerk displays locations as 1-based line:column. Without normalization, every reported offense pointed to the wrong line (off by one) and column (off by one). For example, a reference on line 84 of a file was reported as 83. Add integration tests asserting reported locations for both code paths: - Rubydex source: plain constant reference (Foo.new) - Prism source: ActiveRecord association (belongs_to :foo)

Iterate per-declaration using URI strings as opaque keys throughout the hot reference loop. Builds three lookup tables once after indexing: - @checked_uris : URIs of files in the check set - @uri_to_package : URI -> owning Package - @uri_to_relative_path: URI -> workspace-relative path This replaces inline URI string parsing in each iteration with single hash lookups, and eliminates the redundant @file_uri_prefix path. Also: - Replace tuple type aliases with PORO classes ExtractedRef and AssociationReference for clarity at call sites - Replace tuple return from definition_set_for with a DefinitionSet PORO - Rename collect_in_check_references to checked_references (filter_map) - Remove redundant tuple element from checked_references return value - Replace ||= accumulator mutation in definition_set_for with three focused passes: collect URIs, derive packages, pick canonical target - Skip Rubydex::SingletonClass declarations: their references duplicate the regular class's references (Foo.bar produces refs to BOTH Foo and Foo::<Foo>), so iterating Class declarations gives each user-visible reference once No behavioral change. Runs in ~30s on Shopify Core (down from ~48s).

Herb is Shopify's purpose-built ERB parser that: - Handles ERB blocks spanning tags (e.g. <%= form_for ... do |f| %>...<% end %>) - Preserves original line/column positions in extracted Ruby - Has a much smaller dependency footprint than better_html Better_html worked but pulled in parser, ast, action_view, and other heavy transitive dependencies. Herb is a single Rust-backed gem. Replaces ~50 lines of better_html-specific code with a 3-line Herb.extract_ruby call. Also adds an explicit require for active_support/core_ext/object/blank which better_html was loading transitively. On Shopify Core: ~30s -> ~22s of packwerk check time. Same 31,411 offense count -- behavior preserved.

exterm · 2026-05-13T19:03:05Z

The blog announcement mentions about 3x more constant references resolved in our monolith. I don't know how the codebase has developed over the last 4 years but I used packwerk pretty heavily on it before that and I'm sure we were not accidentally missing over 60% of constant references. Can you expand on what references this new implementation finds that the old one didn't?

I'm worried there may be a change in semantics here.

…haviors Three integration tests covering the behavior changes between the pre-Rubydex pipeline and this branch: - Each namespace prefix in a nested reference is reported as a separate offense. `Sales::Order::Error.new` now produces 3 offenses (for `::Sales`, `::Sales::Order`, `::Sales::Order::Error`) instead of 1. This explains the bulk of the additional offenses observed when running this branch on real codebases. - The same prefix-expansion applies to constant references inside ERB files, which is why ERB templates contribute disproportionately to the new offense count. - Namespace reopenings whose constant is also defined in the source file's package are not flagged as cross-package violations. An earlier iteration of this branch incorrectly produced ~219k bogus offenses on the Shopify monolith from this; this test guards against reintroducing that bug. The first two tests fail on origin/main; the third passes on both branches and serves only as a regression guard.

exterm · 2026-05-15T20:11:48Z

Thank you @paracycle , that explains it.

paracycle · 2026-05-15T20:12:18Z

The blog announcement mentions about 3x more constant references resolved in our monolith. I don't know how the codebase has developed over the last 4 years but I used packwerk pretty heavily on it before that and I'm sure we were not accidentally missing over 60% of constant references. Can you expand on what references this new implementation finds that the old one didn't?

I'm worried there may be a change in semantics here.

Valid concern, thanks for asking me to dig deeper here.

I got an LLM to analyze the new offenses flagged on with this branch and made it add 3 new tests in this commit. From what I understand, most of the new violations are from us flagging nested references as separate violations, which are much more prevalent in ERB files.

This does seem to be a change in behaviour, but I am not sure if this is better or worse.

exterm · 2026-05-15T20:22:27Z

From my perspective it is worse (steep increase in number of violations without adding a lot of signal, as these are the same violation at a different level of abstraction) but I am only one user in the end

paracycle · 2026-05-15T21:38:42Z

@exterm There seems to be another category of violations that account for about 9000 new violations in Core:

So Shopify::Foo is defined in lib/shopify/foo.rb which lives in the root package . Any reference to it from components/X becomes a violation against .

The old pipeline:

Shopify::Foo.foo → extract just Shopify

constant_resolver looks for shopify.rb in autoload paths

Probably doesn't find a definition (because lib/ isn't typically in autoload paths) → silently skipped

Result: zero violation

The new pipeline:

Shopify::Foo.foo → extract Shopify::Foo (after collapse, the leaf)

Rubydex finds Shopify::Foo defined in lib/shopify/foo.rb

Source package is components/baz, target package is .

Result: dependency violation against .

This is a real cross-package reference that the old pipeline missed because it didn't know lib/shopify/ defines Shopify::Foo.

What do you think about this one?

exterm · 2026-05-16T21:23:29Z

I think there is something wrong with that analysis (I don't see why packwerk on current main would extract only Shopify from Shopify::Foo.foo; it should extract Shopify::Foo).

However there is also something very valid in there - Packwerk with constant_resolver can only resolve constants defined in autoloaded code. That was an implementation limitation, not the preferred behavior. So, barring a misunderstanding on my side, these 9000 new violations are probably valid.

I believe modern Rails applications autoload lib by default, in which case it will make much less of a difference. Still, I see it as a considerable improvement if packwerk can now trace constants to non-autoloaded code.

paracycle requested a review from a team as a code owner April 13, 2026 23:25

paracycle force-pushed the ufuk/rubydex-rewrite branch 3 times, most recently from 7fa6be2 to 46eea11 Compare April 14, 2026 00:00

paracycle force-pushed the ufuk/rubydex-rewrite branch from eda8221 to 3342d3d Compare April 16, 2026 19:58

paracycle force-pushed the ufuk/rubydex-rewrite branch 5 times, most recently from 361ea57 to 9580fff Compare May 12, 2026 20:54

paracycle force-pushed the ufuk/rubydex-rewrite branch from 9580fff to dbdf2a0 Compare May 12, 2026 20:55

paracycle added 3 commits May 13, 2026 00:25

paracycle requested a review from rafaelfranca May 12, 2026 23:13

paracycle added 2 commits May 13, 2026 15:19

Bump gem RBI files

b7f7281

paracycle force-pushed the ufuk/rubydex-rewrite branch from e03e231 to b7f7281 Compare May 13, 2026 12:22

paracycle mentioned this pull request May 14, 2026

RBSCommentsToSorbetSigs emits two consecutive sig blocks for overloaded #: annotations, which sorbet-runtime rejects at load time Shopify/spoom#913

Open

dduugg mentioned this pull request May 14, 2026

perf: Replace better_html dependency with herb #462

Closed

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replace core parsing/resolution engine with Rubydex#447

Replace core parsing/resolution engine with Rubydex#447
paracycle wants to merge 8 commits into
mainfrom
ufuk/rubydex-rewrite

paracycle commented Apr 13, 2026 •

edited

Loading

Uh oh!

exterm commented Apr 14, 2026 •

edited

Loading

Uh oh!

paracycle commented Apr 14, 2026

Uh oh!

exterm commented May 13, 2026 •

edited

Loading

Uh oh!

exterm commented May 15, 2026

Uh oh!

paracycle commented May 15, 2026

Uh oh!

exterm commented May 15, 2026 •

edited

Loading

Uh oh!

paracycle commented May 15, 2026

Uh oh!

exterm commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

paracycle commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Architecture

Dependency changes

Key design decisions

Performance on Shopify Core (~60k files)

Deleted code

Tests

Commits

Uh oh!

exterm commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

paracycle commented Apr 14, 2026

Uh oh!

exterm commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

exterm commented May 15, 2026

Uh oh!

paracycle commented May 15, 2026

Uh oh!

exterm commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

paracycle commented May 15, 2026

Uh oh!

exterm commented May 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

paracycle commented Apr 13, 2026 •

edited

Loading

exterm commented Apr 14, 2026 •

edited

Loading

exterm commented May 13, 2026 •

edited

Loading

exterm commented May 15, 2026 •

edited

Loading