Replace core parsing/resolution engine with Rubydex#447
Conversation
7fa6be2 to
46eea11
Compare
|
Cool stuff, Ufuk. Would this mean packwerk doesn't depend on the interrogated codebase using zeitwerk anymore? |
That's correct. Rubydex can do proper Ruby constant resolution, so we don't need to use Zeitwerk heuristics to figure out what constant references resolve to based on their filename. At least, we shouldn't, and, if there are any problems with the resolution, then we should fix them in Rubydex. |
eda8221 to
3342d3d
Compare
Replaces the entire AST walking, constant extraction, and resolution pipeline with Rubydex, a high-performance Ruby indexer written in Rust. Architectural changes: - RunContext rewritten to create a Rubydex::Graph, index files, resolve all constants, and iterate resolved references to detect cross-package violations - ParseRun simplified to two phases (index_and_resolve + find_offenses) instead of per-file parallel processing - Association detection rewritten using Prism native AST (no parser gem translation layer needed), runs as a supplementary pass since Rubydex doesn't understand ActiveRecord associations - ERB support: Parsers::Erb simplified to extract_ruby_source which feeds into Rubydex's index_source API - ApplicationValidator uses Rubydex::Graph instead of ConstantResolver Dependency changes: - Removed: constant_resolver, parallel, ast (direct), parser (direct) - Added: rubydex - Kept: prism (for association detection), better_html (for ERB) Deleted 16 source files, 11 test files, 3 RBI files. Also regenerates Sorbet RBIs and updates CI matrix to 3.2-3.4.
361ea57 to
9580fff
Compare
Correctness fixes: - Fix NotFileUriError for ERB locations using file:// URI prefix - Prefix constant names with :: to match package_todo.yml format - Use Rubydex::SingletonClass type check (with #owner) to map metaclass names (Foo::<Foo>) back to their owning class (Foo) - Skip SingletonClass declarations to avoid double-counting references: Foo.bar produces refs to BOTH Foo and Foo::<Foo>, so only iterating Class declarations gives us each user-visible reference once - Select canonical Zeitwerk definition path when constant has multiple definitions (e.g. canonical file + class reopenings) - Fix shared namespace false positives by checking all definitions: if any definition is in source package, the reference is local Performance optimizations: - Iterate per-declaration via Declaration#references instead of iterating all 2.7M+ constant references globally. Skips unreferenced declarations entirely and amortizes per-constant work - Use location.uri instead of to_file_path for 5x faster extraction (raw string access vs URI parsing on 2.7M+ refs) - Target association parsing via graph.method_references: only ~1% of files have associations, so we parse 1,200 files instead of 57,000 - Eliminate redundant Dir.glob (8.3s saved on Core) by reusing the file list from FilesForProcessing - Inline URI->path conversion in hot loops - Cache per-file package lookups - Lazy compute defn_packages only for declarations with in-set refs - Remove parallel gem: fork overhead exceeded the work being done (benchmarks showed 2-3x slowdown) - Remove Rails dependency: load_paths were only used by old ConstantResolver, no longer needed with Rubydex Update rubydex dependency to >= 0.2.3. End-to-end check on Shopify Core (~60k files): ~45s total.
9580fff to
dbdf2a0
Compare
Split the 70+ line method with nested loops, lazy initialization, and inline URI parsing into 4 focused methods: - extract_refs_by_file: linear main loop -- collect refs in set, compute defn data, check each ref - collect_in_set_references: gathers [source_path, location] pairs for refs that land in the check set - definition_packages_and_target: computes the package set and canonical Zeitwerk-preferred target path for a declaration - uri_to_relative_path: URI->path conversion used by both helpers No behavioral or performance changes.
Rubydex reports locations with 0-based line and 0-based column. Prism reports 1-based line and 0-based column. Packwerk displays locations as 1-based line:column. Without normalization, every reported offense pointed to the wrong line (off by one) and column (off by one). For example, a reference on line 84 of a file was reported as 83. Add integration tests asserting reported locations for both code paths: - Rubydex source: plain constant reference (Foo.new) - Prism source: ActiveRecord association (belongs_to :foo)
Iterate per-declaration using URI strings as opaque keys throughout the hot reference loop. Builds three lookup tables once after indexing: - @checked_uris : URIs of files in the check set - @uri_to_package : URI -> owning Package - @uri_to_relative_path: URI -> workspace-relative path This replaces inline URI string parsing in each iteration with single hash lookups, and eliminates the redundant @file_uri_prefix path. Also: - Replace tuple type aliases with PORO classes ExtractedRef and AssociationReference for clarity at call sites - Replace tuple return from definition_set_for with a DefinitionSet PORO - Rename collect_in_check_references to checked_references (filter_map) - Remove redundant tuple element from checked_references return value - Replace ||= accumulator mutation in definition_set_for with three focused passes: collect URIs, derive packages, pick canonical target - Skip Rubydex::SingletonClass declarations: their references duplicate the regular class's references (Foo.bar produces refs to BOTH Foo and Foo::<Foo>), so iterating Class declarations gives each user-visible reference once No behavioral change. Runs in ~30s on Shopify Core (down from ~48s).
Herb is Shopify's purpose-built ERB parser that: - Handles ERB blocks spanning tags (e.g. <%= form_for ... do |f| %>...<% end %>) - Preserves original line/column positions in extracted Ruby - Has a much smaller dependency footprint than better_html Better_html worked but pulled in parser, ast, action_view, and other heavy transitive dependencies. Herb is a single Rust-backed gem. Replaces ~50 lines of better_html-specific code with a 3-line Herb.extract_ruby call. Also adds an explicit require for active_support/core_ext/object/blank which better_html was loading transitively. On Shopify Core: ~30s -> ~22s of packwerk check time. Same 31,411 offense count -- behavior preserved.
e03e231 to
b7f7281
Compare
|
The blog announcement mentions I'm worried there may be a change in semantics here. |
…haviors Three integration tests covering the behavior changes between the pre-Rubydex pipeline and this branch: - Each namespace prefix in a nested reference is reported as a separate offense. `Sales::Order::Error.new` now produces 3 offenses (for `::Sales`, `::Sales::Order`, `::Sales::Order::Error`) instead of 1. This explains the bulk of the additional offenses observed when running this branch on real codebases. - The same prefix-expansion applies to constant references inside ERB files, which is why ERB templates contribute disproportionately to the new offense count. - Namespace reopenings whose constant is also defined in the source file's package are not flagged as cross-package violations. An earlier iteration of this branch incorrectly produced ~219k bogus offenses on the Shopify monolith from this; this test guards against reintroducing that bug. The first two tests fail on origin/main; the third passes on both branches and serves only as a regression guard.
|
Thank you @paracycle , that explains it. |
Valid concern, thanks for asking me to dig deeper here. I got an LLM to analyze the new offenses flagged on with this branch and made it add 3 new tests in this commit. From what I understand, most of the new violations are from us flagging nested references as separate violations, which are much more prevalent in ERB files. This does seem to be a change in behaviour, but I am not sure if this is better or worse. |
|
From my perspective it is worse (steep increase in number of violations without adding a lot of signal, as these are the same violation at a different level of abstraction) but I am only one user in the end |
|
@exterm There seems to be another category of violations that account for about 9000 new violations in Core:
What do you think about this one? |
|
I think there is something wrong with that analysis (I don't see why packwerk on current However there is also something very valid in there - Packwerk with I believe modern Rails applications autoload |
Summary
Replace the per-file AST walking, constant extraction, and resolution pipeline with Rubydex, a high-performance Ruby indexer written in Rust. Constants are resolved from actual parsed definitions rather than Zeitwerk path conventions, enabling more thorough analysis with comparable speed.
constant_resolver,parallel, and directparser/astdeps withrubydexbetter_htmlwith Herb for ERB parsing (preserves block-spanning tags and source positions)constant_resolverautoload paths)parallelgem (fork overhead exceeded the work being done)Architecture
Before: for each file → parse with Prism → walk AST → extract constant references → resolve via
constant_resolver(Zeitwerk paths) → check violationsAfter: index all workspace files in one
Rubydex::Graphbatch call → resolve all constants → iterateDeclaration#referencesper declaration → check violations. Association detection (has_many,belongs_to, etc.) runs as a supplementary Prism-based pass since Rubydex treats those as method calls, not constant references — and is targeted to only files that contain association method references viagraph.method_references(~1% of files on Shopify Core).Dependency changes
constant_resolverrubydexparallelherbbetter_htmlast(direct)parser(direct)Kept:
prism(for association detection),activesupport,bundler,benchmark.Key design decisions
index_and_resolveindexes ALL Ruby files for cross-package resolution, even when a scoped check (packwerk check components/foo) is requested. Only the scoped files are checked for violations.graph.declarationsand usesDeclaration#referencesrather than iterating all 9M+ constant references globally. Skips declarations with no in-scope references entirely.@checked_uris,@uri_to_package,@uri_to_relative_path). No URI parsing in the inner loop.Foo.barproduces references to bothFooandFoo::<Foo>(the singleton class). SkippingRubydex::SingletonClassdeclarations gives each user-visible reference once.ApiClient→app/models/api_client.rb) for the violation message's "defined in" hint.module GraphApireopened across many packages.extract_rubywhich preserves line numbers, then fed intograph.index_source. Association detection only re-parses files thatgraph.method_referencesshows contain association method calls.Performance on Shopify Core (~60k files)
We're within 3-4s of released speed while doing significantly more analysis. The 31k extra offenses come from references the old
ConstNodeInspector/constant_resolverpipeline missed — primarily ERB files (where the old concatenated-Ruby approach often produced unparseable output) and shared namespace references. After runningpackwerk update-todo, only ~620 strict-mode violations remain (which need manual resolution since strict packages reject new entries inpackage_todo.yml).Deleted code
16 source files, 11 test files, 3 RBI files removed:
file_processor,node_processor,node_processor_factory,node_visitor,node_helpersconst_node_inspector,constant_name_inspector,constant_discovery,parsed_constant_definitions,reference_extractor,unresolved_reference,association_inspectorcacheparsers/ruby,parsers/factory,parsers/parser_interfacerails_load_pathsTests
134 tests pass (was 132 + 2 new integration tests for line/column locations). Sorbet typecheck clean. Rubocop clean.
The new line/column integration tests assert reported locations for both code paths (Rubydex source for plain constant references, Prism source for ActiveRecord associations) and catch off-by-one regressions.
Commits