Plugin Directory: Tokenise PHP source for block / dashboard widget detection #626

Open
dd32 wants to merge 3 commits into WordPress:trunk from dd32:add/claude/plugin-importer-tokenize-php-scans

dd32 commented May 8, 2026

Summary

Replaces the regex paths in find_blocks_in_file() (PHP branches) and find_dashboard_widgets_in_file() with a single token-based extractor in a new Tools\Tokenisation_Helpers class. Fixes false positives inside comments, strings, and method/static calls; restores title extraction for both register_block_type() and new WP_Block_Type(); and corrects the dashboard widget label capture for plugins that pass class constants as the widget ID (e.g. Jetpack, where the previous regex picked up the word Stats from an adjacent doc comment).

Builds on #625 (merged) and addresses the deferred Copilot review comments around tokenization and escaped quotes.

Behavioural changes

| Form | Original regex | This PR |
| --- | --- | --- |
| Bare literal 1st arg | matched | matched |
| Wrapped in __, _x, esc_html__, etc. | first quote in arg | first literal-string token reachable in arg |
| Class const as 1st arg + literal label later | order-sensitive; could pick text from inline doc comments | label correctly extracted regardless of preceding non-literal args |
| Inline doc comment between args | comment contents could be captured | ignored (not a literal) |
| Escaped quote inside literal | truncated value | correctly unescaped |
| Call inside line/block comment | false positive | ignored |
| Call inside a string literal | false positive | ignored |
| $obj->name(), Class::name(), function name() declaration | false positive | ignored |
| \register_block_type() (leading backslash) | incidental substring match | matched (T_NAME_FULLY_QUALIFIED) |
| Foo\register_block_type() (arbitrary namespace) | incidental substring match | not matched (documented shortcut) |
| title in second-arg options array | captured for new WP_Block_Type only | captured for both, including __()-wrapped values |
| Concat name 'prefix-' . $var | skipped via negative lookahead | partial literal captured, then filtered by \w+/\w+ shape check (same outcome) |
| Variable label in wp_add_dashboard_widget | call dropped (no entry) | call reported with empty string (so the section term still applies; meta is still skipped) |
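
As a rough illustration of the "Escaped quote inside literal" row, the sketch below shows one way a raw T_CONSTANT_ENCAPSED_STRING token (quotes included) could be turned into its value. The function name sketch_unquote_literal() is hypothetical and is not taken from the PR; the double-quoted branch uses stripcslashes() as an approximation and does not cover every PHP escape (for example \u{...}).

```php
<?php
/**
 * Hypothetical helper (not the PR's code): convert the raw text of a
 * T_CONSTANT_ENCAPSED_STRING token, quotes included, into its value.
 */
function sketch_unquote_literal( string $raw ): string {
    if ( strlen( $raw ) < 2 ) {
        return $raw; // Not a quoted literal; return unchanged.
    }

    $quote = $raw[0];
    $inner = substr( $raw, 1, -1 );

    if ( "'" === $quote ) {
        // Single-quoted PHP strings only recognise \\ and \' as escapes.
        return strtr( $inner, array( '\\\\' => '\\', "\\'" => "'" ) );
    }

    // Double-quoted: approximate with C-style unescaping (\" \\ \n \t ...).
    return stripcslashes( $inner );
}
```

A regex that stops at the first unescaped-looking quote would cut `'It\'s here'` short at `It\`; unescaping the whole token text avoids that truncation.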

Examples

Block titles via the options array (now also for register_block_type):

register_block_type( 'my-plugin/foo', array( 'title' => 'Foo' ) );
register_block_type( 'my-plugin/bar', [ 'title' => __( 'Bar', 'td' ) ] );
new WP_Block_Type( 'my-plugin/baz', array( 'title' => 'Baz' ) );

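Tokenisation_Helpers::find_function_call_first_arg_and_array_value( $src, 'register_block_type', 1, 'title' ) returns only the two register_block_type() entries; the new WP_Block_Type() call is matched in a separate pass with the 'new WP_Block_Type' needle: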

array(
    'my-plugin/foo' => 'Foo',
    'my-plugin/bar' => 'Bar',
)

Class const + inline doc comment between args (the Jetpack pattern):

wp_add_dashboard_widget(
    My_Widget_Class::DASHBOARD_WIDGET_ID,
    /** This is a comment */
    'My Widget',
    array( $instance, 'render' )
);

find_function_call_arg_strings( $src, 'wp_add_dashboard_widget', 1 ) returns array( 'My Widget' ). The previous regex was sensitive to the contents of any inline comment between args: when the comment contained quote characters, the inner regex matched those quotes first and captured the wrong text. The original Jetpack source has /** "Stats" is a product name. */ between the ID and the label, which is what triggered the regression.
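
The idea of splitting the argument list at top-level commas and taking the first literal in the target argument can be sketched roughly as follows. This is a simplified illustration, not the PR's find_function_call_arg_strings(): the name sketch_first_literal_in_arg() is hypothetical, it handles only the first matching call, it omits the method/static/declaration checks, and it strips quotes without unescaping.

```php
<?php
/**
 * Hypothetical sketch: return the first string literal found in argument
 * $arg_index (0-based) of the first call to $function_name, '' when the
 * call exists but that argument has no literal, or null when no call is found.
 */
function sketch_first_literal_in_arg( string $source, string $function_name, int $arg_index ): ?string {
    $tokens = token_get_all( $source );
    $total  = count( $tokens );

    for ( $i = 0; $i < $total; $i++ ) {
        $tok = $tokens[ $i ];
        if ( ! is_array( $tok ) || T_STRING !== $tok[0] || 0 !== strcasecmp( $tok[1], $function_name ) ) {
            continue;
        }

        // The name must be followed by "(" (whitespace allowed in between).
        $j = $i + 1;
        while ( $j < $total && is_array( $tokens[ $j ] ) && T_WHITESPACE === $tokens[ $j ][0] ) {
            $j++;
        }
        if ( $j >= $total || '(' !== $tokens[ $j ] ) {
            continue;
        }

        $depth = 1; // Bracket nesting relative to the call's own parentheses.
        $arg   = 0; // Index of the argument currently being walked.
        for ( $k = $j + 1; $k < $total && $depth > 0; $k++ ) {
            $t = $tokens[ $k ];
            if ( is_array( $t ) ) {
                // Comments are T_COMMENT / T_DOC_COMMENT tokens, so quotes
                // inside them are never mistaken for a literal.
                if ( $arg === $arg_index && T_CONSTANT_ENCAPSED_STRING === $t[0] ) {
                    return substr( $t[1], 1, -1 ); // Strip quotes (no unescaping here).
                }
                // "{$var}" interpolation opens with an array token but closes with "}".
                if ( in_array( $t[0], array( T_CURLY_OPEN, T_DOLLAR_OPEN_CURLY_BRACES ), true ) ) {
                    $depth++;
                }
                continue;
            }
            if ( in_array( $t, array( '(', '[', '{' ), true ) ) {
                $depth++;
            } elseif ( in_array( $t, array( ')', ']', '}' ), true ) ) {
                $depth--;
            } elseif ( ',' === $t && 1 === $depth ) {
                $arg++; // Only top-level commas separate arguments.
            }
        }
        return ''; // Call found, but the target argument held no literal.
    }
    return null;
}
```

Because the literal search descends into nested parentheses within the target argument, a wrapped label such as __( 'Label', 'td' ) yields its inner string, which is also why any wrapping call is accepted rather than only known i18n helpers.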

Variable as label still tags the section:

wp_add_dashboard_widget( 'id', $variable, 'cb' );

returns array( '' ) — the call is detected (so the dashboard-widgets plugin_section term is applied), but no dashboard_widget_name meta value is stored (the empty entry is filtered at the meta-storage point).
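
The contract described here (every detected call is reported, but empty labels never reach post meta) could be consumed along these lines. This is a minimal sketch: sketch_plan_widget_storage() and its return shape are hypothetical and only stand in for the relevant part of the importer.

```php
<?php
/**
 * Hypothetical sketch: decide what to store given the labels reported for a
 * plugin, where '' means "call detected, label not parseable".
 *
 * @param string[] $widget_names One entry per detected wp_add_dashboard_widget() call.
 * @return array { apply_section_term: bool, meta_values: string[] }
 */
function sketch_plan_widget_storage( array $widget_names ): array {
    return array(
        // Any detected call, even one with an unparseable label, tags the section.
        'apply_section_term' => count( $widget_names ) > 0,
        // Only non-empty labels are candidates for dashboard_widget_name meta.
        'meta_values'        => array_values(
            array_filter(
                $widget_names,
                static function ( $name ) {
                    return '' !== $name;
                }
            )
        ),
    );
}
```

Keeping the empty entry in the helper's return value and filtering only at the storage point lets the two consumers (section term and meta) make different decisions from the same result.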

False positives are now ignored:

// wp_add_dashboard_widget( 'x', 'Y', 'cb' );
/* wp_add_dashboard_widget( 'x', 'Y', 'cb' ); */
$str = "wp_add_dashboard_widget( 'x', 'Y', 'cb' );";
$obj->wp_add_dashboard_widget( 'x', 'Y', 'cb' );
SomeClass::wp_add_dashboard_widget( 'x', 'Y', 'cb' );
function wp_add_dashboard_widget( $id, $name, $cb ) {}

All return array().
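
To see why these forms drop out under tokenisation, here is a rough sketch that counts plain global calls by looking at token types. It assumes PHP 8.0+ (for T_NULLSAFE_OBJECT_OPERATOR); the function names are hypothetical and the logic is a simplification, not the PR's walker.

```php
<?php
/**
 * Hypothetical sketch: true when the next significant token after index $i
 * is an opening parenthesis.
 */
function sketch_next_is_open_paren( array $tokens, int $i ): bool {
    for ( $j = $i + 1, $n = count( $tokens ); $j < $n; $j++ ) {
        $t = $tokens[ $j ];
        if ( is_array( $t ) && in_array( $t[0], array( T_WHITESPACE, T_COMMENT, T_DOC_COMMENT ), true ) ) {
            continue;
        }
        return '(' === $t;
    }
    return false;
}

/**
 * Hypothetical sketch: count unqualified calls to $function_name, skipping
 * comments, string literals, method/static calls and declarations.
 */
function sketch_count_global_calls( string $source, string $function_name ): int {
    $tokens = token_get_all( $source );
    $count  = 0;
    $prev   = null; // Token ID of the previous significant token, if any.

    foreach ( $tokens as $i => $tok ) {
        if ( ! is_array( $tok ) ) {
            $prev = null; // Punctuation such as ";" or "(".
            continue;
        }
        if ( in_array( $tok[0], array( T_WHITESPACE, T_COMMENT, T_DOC_COMMENT ), true ) ) {
            continue; // Never treated as the "previous" token.
        }

        // Comments and strings arrive as single T_COMMENT or
        // T_CONSTANT_ENCAPSED_STRING tokens, so text inside them is never T_STRING.
        $is_name    = ( T_STRING === $tok[0] && 0 === strcasecmp( $tok[1], $function_name ) );
        $is_excluded = in_array(
            $prev,
            array( T_OBJECT_OPERATOR, T_NULLSAFE_OBJECT_OPERATOR, T_DOUBLE_COLON, T_FUNCTION ),
            true
        );

        if ( $is_name && ! $is_excluded && sketch_next_is_open_paren( $tokens, $i ) ) {
            $count++;
        }
        $prev = $tok[0];
    }
    return $count;
}
```

Running it over the six lines above should find no calls, while a plain wp_add_dashboard_widget( 'x', 'Y', 'cb' ); statement is counted once.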

Documented shortcuts

Four test cases, each marked with a "Shortcut (to reduce complexity):" doc comment, record deliberate trade-offs:

  • Concatenated literal-plus-expression captures only the leading literal (consumers validate the result downstream).
  • Only \register_block_type (T_NAME_FULLY_QUALIFIED) is treated as the global function; arbitrary Foo\Bar\register_block_type is not matched.
  • The "first literal in arg" rule does not validate that the surrounding wrapper is a known i18n helper, so any wrapping call (e.g. $obj->method( 'Inner' )) yields its inner literal.
  • For the positional-string method, an array literal at the target position returns the first inner string (typically a key) — callers wanting key/value semantics use find_function_call_first_arg_and_array_value() instead.
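
The namespace shortcut relies on how PHP 8.0+ tokenises names: an unqualified name is T_STRING, a leading-backslash name is a single T_NAME_FULLY_QUALIFIED token, and a namespaced name is a single T_NAME_QUALIFIED token. The sketch below (hypothetical function name, PHP 8.0+ assumed) lists the token kind for each name found in a snippet.

```php
<?php
/**
 * Hypothetical sketch (PHP 8.0+): map each name token's text to its token
 * type so the three spellings of a function call can be told apart.
 *
 * @return array<string,string> Token text => token name.
 */
function sketch_name_token_kinds( string $source ): array {
    $kinds = array();
    foreach ( token_get_all( $source ) as $tok ) {
        if ( ! is_array( $tok ) ) {
            continue;
        }
        if ( in_array( $tok[0], array( T_STRING, T_NAME_QUALIFIED, T_NAME_FULLY_QUALIFIED ), true ) ) {
            $kinds[ $tok[1] ] = token_name( $tok[0] );
        }
    }
    return $kinds;
}
```

Matching only T_STRING and T_NAME_FULLY_QUALIFIED therefore accepts register_block_type() and \register_block_type() while leaving Foo\Bar\register_block_type() unmatched, without any namespace resolution.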

Test plan

  • 37 unit tests in tests/Tokenisation_Helpers_Test.php cover the cases above and the four documented shortcuts.
  • PHPCS clean (phpcs-branch.php exit 0).
  • Re-import a sample of plugins known to use blocks and confirm find_blocks_in_file() still returns the same set of blocks (no regression for plugins that use register_block_type or new WP_Block_Type with literal names).
  • Re-import the Jetpack plugin and confirm dashboard_widget_name post meta now contains Jetpack Stats (was Stats).
  • Confirm a plugin that calls wp_add_dashboard_widget with a variable label still gets the dashboard-widgets section term assigned, and no empty dashboard_widget_name rows are stored.

🤖 Generated with Claude Code

dd32 and others added 2 commits May 8, 2026 15:59
…tection.

Replace the regex paths in find_blocks_in_file() and find_dashboard_widgets_in_file() with a single token-based extractor in a new Tools\Tokenisation_Helpers class. The helper walks tokens once per file, ignores matches in comments and string literals, and follows i18n wrappers like __(), _x(), and esc_html__() to the inner string value.

Includes a 29-test PHPUnit class covering the supported cases plus the deliberate shortcuts taken to keep the helper small.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oard widget calls.

- Add Tokenisation_Helpers::find_function_call_first_arg_and_array_value() for the registration pattern (first-arg identifier plus an inline options array). Uses it in find_blocks_in_file() to capture the optional title for register_block_type() and new WP_Block_Type().
- find_function_call_arg_strings() now returns one entry per matched call, yielding an empty string when the target arg has no literal. find_dashboard_widgets_in_file() therefore reports every detected call, allowing the section term to be applied even when the label is not parseable; import_from_svn skips empty values when storing dashboard_widget_name post meta.
- Tests: rename the dashboard-widget class-constant case to use generic identifiers, update the variable / class-constant assertions for the new contract, and add coverage for the new title-extraction method (long/short array, translation wrapper, missing key, no options, variable value, non-literal first arg).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 8, 2026 06:23

github-actions Bot commented May 8, 2026

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dd32.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR replaces regex-based PHP source scanning for block and dashboard widget detection with a shared token-based extractor to avoid false positives in comments/strings and improve argument/title extraction.

Changes:

  • Added Tools\Tokenisation_Helpers to parse PHP tokens and extract string-literal arguments and array metadata.
  • Updated importer logic to use tokenisation for register_block_type / new WP_Block_Type and wp_add_dashboard_widget, and to avoid storing empty widget-name meta.
  • Added PHPUnit coverage for the new tokenisation behaviors and documented shortcuts.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| wordpress.org/public_html/wp-content/plugins/plugin-directory/tools/class-tokenisation-helpers.php | Introduces token-walker and string/array literal extraction utilities. |
| wordpress.org/public_html/wp-content/plugins/plugin-directory/tests/Tokenisation_Helpers_Test.php | Adds unit tests for token-based detection, including false-positive prevention and edge cases. |
| wordpress.org/public_html/wp-content/plugins/plugin-directory/cli/class-import.php | Switches block/widget detection to tokenisation helper and filters empty widget meta values. |

return array();
}

$is_new = str_starts_with( $function_name, 'new ' );
continue;
}
$matches_simple = ( T_STRING === $tok[0] && 0 === strcasecmp( $tok[1], $needle ) );
$matches_global = ( T_NAME_FULLY_QUALIFIED === $tok[0] && 0 === strcasecmp( $tok[1], $global_form ) );
$prev_id = is_array( $pt ) ? $pt[0] : null;
break;
}
if ( in_array( $prev_id, array( T_OBJECT_OPERATOR, T_DOUBLE_COLON, T_FUNCTION, T_NULLSAFE_OBJECT_OPERATOR ), true ) ) {
Comment on lines +1201 to +1203
if ( $contents ) {
    foreach ( array( 'register_block_type', 'new WP_Block_Type' ) as $needle ) {
        foreach ( Tokenisation_Helpers::find_function_call_first_arg_and_array_value( $contents, $needle, 1, 'title' ) as $name => $title ) {
Comment on lines +75 to +76
$tokens = @token_get_all( $contents );
if ( ! $tokens ) {
Comment on lines +68 to +74
/**
 * Walk PHP tokens for calls to `$function_name` and yield each call's
 * arg-list tokens, split into per-arg slices at top-level commas.
 *
 * @return array[] One entry per matched call: [ arg0_tokens, arg1_tokens, ... ].
 */
private static function walk_calls( $contents, $function_name ) {