Improve performance in Utf8String::semanticSplit() by live627 · Pull Request #9196 · SimpleMachines/SMF

live627 · 2026-04-30T13:44:31Z

Before

╔═══════════════════════════════════════════════════════════════╗
║  Utf8String::extractWords()                                   ║
║  (Based on actual behavior from SMF implementation)           ║
╚═══════════════════════════════════════════════════════════════╝

⏱️ Performance Test
Mean: 752.776 ms
Words extracted: 600
❌ Performance slow (threshold: 100ms)

After

╔═══════════════════════════════════════════════════════════════╗
║  Utf8String::extractWords()                                   ║
║  (Based on actual behavior from SMF implementation)           ║
╚═══════════════════════════════════════════════════════════════╝

📝 ASCII & Alphanumeric Tests
✅ PASS: ascii words
✅ PASS: camelCase
✅ PASS: mixed case
✅ PASS: single word
✅ PASS: empty string

🔤 Punctuation Tests (Preserved in words)
✅ PASS: dot abbreviation
✅ PASS: multi dots
✅ PASS: apostrophe word
✅ PASS: multiple apostrophes

🔢 Number Tests
✅ PASS: simple number
✅ PASS: decimal
✅ PASS: comma number
✅ PASS: mixed num alpha
✅ PASS: alpha num boundary

🌍 Unicode Script Tests
✅ PASS: hebrew letters
✅ PASS: hebrew apostrophe
✅ PASS: hebrew quotes

🇯🇵 CJK & Japanese Tests
✅ PASS: katakana sequence
✅ PASS: katakana mix

🎯 Combining Marks Tests
✅ PASS: combining acute (alone)
✅ PASS: multiple combining
✅ PASS: combining in word

⚙️ Special Characters
✅ PASS: underscore word
✅ PASS: mixed underscore

😊 Emoji Tests
✅ PASS: emoji sequence
✅ PASS: emoji + text
✅ PASS: family emoji (ZWJ)
✅ PASS: zwj mixed

🚩 Flag Emoji Tests
✅ PASS: single flag
✅ PASS: multiple flags
✅ PASS: odd RI sequence

🎪 Complex Mixed Cases
✅ PASS: full sentence
✅ PASS: unicode chaos

🌐 Whitespace & Edge Cases
✅ PASS: double space
✅ PASS: tabs
✅ PASS: newline
✅ PASS: mixed whitespace

✨ Emoji Modifiers & Variation Selectors
✅ PASS: emoji with variation selector

⏱️ Performance Test
Mean: 76.442 ms
Words extracted: 600
✅ Performance OK

🔄 Invariant Tests (No Crashes)

╔═══════════════════════════════════════════════════════════════╗
║  Test Summary                                                 ║
╚═══════════════════════════════════════════════════════════════╝
✅ Passed: 38
❌ Failed: 0
🔄 Invariant Passed: 100
💥 Invariant Failed: 0
Success Rate: 100.0%

🎉 All tests passed!

Test script

<?php

/**
 * Standalone CLI test script for Utf8String optimization validation
 * 
 * Usage: php test-utf8-optimization.php
 */

declare(strict_types=1);

define('SMF', 1);
require_once __DIR__ . '/Sources/Unicode/Utf8String.php';
require_once __DIR__ . '/Sources/Unicode/CombiningClasses.php';
require_once __DIR__ . '/Sources/Unicode/Metadata.php';
require_once __DIR__ . '/vendor/autoload.php';

use SMF\Unicode\Utf8String;

$passed = 0;
$failed = 0;

function assertEqual($label, $input, $expected)
{
    global $passed, $failed;
    $obj = Utf8String::create($input);
    $result = $obj->extractWords(0);

    // Sort for comparison
    sort($result);
    sort($expected);

    if ($result !== $expected) {
        echo "❌ FAIL: $label\n";
        echo "Input:    " . json_encode($input, JSON_UNESCAPED_UNICODE) . "\n";
        echo "Expected: " . json_encode($expected, JSON_UNESCAPED_UNICODE) . "\n";
        echo "Got:      " . json_encode($result, JSON_UNESCAPED_UNICODE) . "\n\n";
        $failed++;
    } else {
        echo "✅ PASS: $label\n";
        $passed++;
    }
}

echo "╔═══════════════════════════════════════════════════════════════╗\n";
echo "║  Utf8String::extractWords()                                   ║\n";
echo "║  (Based on actual behavior from SMF implementation)           ║\n";
echo "╚═══════════════════════════════════════════════════════════════╝\n\n";

// === ASCII & Alphanumeric ===
echo "📝 ASCII & Alphanumeric Tests\n";
assertEqual('ascii words', 'hello world', ['hello', 'world']);
assertEqual('camelCase', 'helloWorldTest', ['helloWorldTest']);
assertEqual('mixed case', 'ABCdefGHI', ['ABCdefGHI']);
assertEqual('single word', 'hello', ['hello']);
assertEqual('empty string', '', []);

// === Punctuation (KEPT in words per Unicode word break rules) ===
echo "\n🔤 Punctuation Tests (Preserved in words)\n";
assertEqual('dot abbreviation', 'e.g.', ['e.g']);  // Kept together
assertEqual('multi dots', 'U.S.A.', ['U.S.A']);  // Kept together per word break rules
assertEqual('apostrophe word', "don't", ["don't"]);  // Apostrophe kept per MidLetter rule
assertEqual('multiple apostrophes', "rock'n'roll", ["rock'n'roll"]);  // All kept together

// === Numbers ===
echo "\n🔢 Number Tests\n";
assertEqual('simple number', '123456', ['123456']);
assertEqual('decimal', '3.14159', ['3.14159']);  // Decimal point kept (MidNum rule)
assertEqual('comma number', '1,234,567', ['1,234,567']);  // Commas kept (MidNum rule)
assertEqual('mixed num alpha', 'A1B2C3', ['A1B2C3']);
assertEqual('alpha num boundary', 'foo123bar', ['foo123bar']);

// === Unicode Scripts ===
echo "\n🌍 Unicode Script Tests\n";
assertEqual('hebrew letters', 'אבגדה', ['אבגדה']);
assertEqual('hebrew apostrophe', "א'ב", ["א'ב"]);  // Kept together (MidLetter rule)
assertEqual('hebrew quotes', "א\"ב", ["א\"ב"]);  // Kept together per word break

// === CJK & Japanese ===
echo "\n🇯🇵 CJK & Japanese Tests\n";
assertEqual('katakana sequence', 'カタカナ', ['カタカナ']);
assertEqual('katakana mix', 'カタabcナ', ['カタ', 'abc', 'ナ']);  // Different scripts break

// === Combining Marks ===
echo "\n🎯 Combining Marks Tests\n";
assertEqual('combining acute (alone)', "a\u{0301}", ["a\u{0301}"]);  // Has base, kept
assertEqual('multiple combining', "a\u{0301}\u{0302}", ["a\u{0301}\u{0302}"]);  // Has base, kept
assertEqual('combining in word', "e\u{0301}cole", ["e\u{0301}cole"]);  // Base + text

// === Special Characters (underscores are word chars) ===
echo "\n⚙️ Special Characters\n";
assertEqual('underscore word', 'foo_bar', ['foo_bar']);  // Underscore is \w
assertEqual('mixed underscore', 'foo_bar123_baz', ['foo_bar123_baz']);

// === Emoji ===
echo "\n😊 Emoji Tests\n";
assertEqual('emoji sequence', '🙂🙂🙂', ['🙂', '🙂', '🙂']);  // Split individually!
assertEqual('emoji + text', 'hi🙂there', ['hi', '🙂', 'there']);
assertEqual(
    'family emoji (ZWJ)',
    "👨‍👩‍👧‍👦",
    ["👨‍👩‍👧‍👦"]  // ZWJ keeps together
);
assertEqual(
    'zwj mixed',
    "a👨‍👩‍👧‍👦b",
    ['a', '👨‍👩‍👧‍👦', 'b']
);

// === Flag Emoji (Regional Indicators) ===
echo "\n🚩 Flag Emoji Tests\n";
assertEqual(
    'single flag',
    "🇺🇸",
    ["🇺🇸"]  // 2 RIs = 1 flag
);
assertEqual(
    'multiple flags',
    "🇺🇸🇨🇦🇯🇵",
    ["🇺🇸", "🇨🇦", "🇯🇵"]  // Even number of RIs = multiple flags
);
assertEqual(
    'odd RI sequence',
    "🇺🇸🇨",
    ["🇺🇸"]  // 🇺🇸 = flag, 🇨 alone is filtered (single RI)
);

// === Complex Mixed Cases ===
echo "\n🎪 Complex Mixed Cases\n";
assertEqual(
    'full sentence',
    "Hello, world! It's 3.14 e.g. 🇺🇸🙂 foo_bar",
    [
        "3.14",        // Decimal kept
        "e.g",         // Dot abbreviation kept
        "foo_bar",     // Underscore kept
        "Hello",       // Basic word
        "It's",        // Apostrophe kept
        "world",       // Basic word
        "🇺🇸",         // Flag (2 RIs)
        "🙂"           // Emoji
    ]
);

assertEqual(
    'unicode chaos',
    "a\u{0301}b🇺🇸👨‍👩‍👧‍👦3.14",
    [
        "3.14",                   // Decimal
        "áb",                  // Base mark + letter
        "🇺🇸",                    // Flag
        "👨‍👩‍👧‍👦"               // Family emoji (ZWJ)
    ]
);

// === Whitespace & Edge Cases ===
echo "\n🌐 Whitespace & Edge Cases\n";
assertEqual('double space', 'a  b', ['a', 'b']);  // Spaces removed
assertEqual('tabs', "a\tb", ['a', 'b']);  // Tabs removed
assertEqual('newline', "a\nb", ['a', 'b']);  // Newlines removed
assertEqual('mixed whitespace', "a \t b", ['a', 'b']);  // All whitespace removed

// === Emoji with Variation Selectors ===
echo "\n✨ Emoji Modifiers & Variation Selectors\n";
assertEqual(
    'emoji with variation selector',
    '☺️',  // Smiley + variation selector
    ['☺️']
);

// === Performance Test ===
echo "\n⏱️ Performance Test\n";
$input = str_repeat("Hello🙂🇺🇸👨‍👩‍👧‍👦 3.14 foo_bar ", 100);

$times = [];

for ($i = 0; $i < 5; $i++) {
    $start = hrtime(true);

    $words = Utf8String::create($input)->extractWords(0);

    $end = hrtime(true);

    $times[] = ($end - $start) / 1e6; // ms
}

$mean = array_sum($times) / count($times);

echo "Mean: " . number_format($mean, 3) . " ms\n";
echo "Words extracted: " . count($words) . "\n";

if ($mean < 100) {
    echo "✅ Performance OK\n";
} else {
    echo "❌ Performance slow (threshold: 100ms)\n";
}

// === Invariant Tests ===
echo "\n🔄 Invariant Tests (No Crashes)\n";
$invariantPassed = 0;
$invariantFailed = 0;

function invariantTest($input)
{
    global $invariantPassed, $invariantFailed;
    try {
        $obj = Utf8String::create($input);
        $words = $obj->extractWords(0);
        // Should not crash, that's the main test
        $invariantPassed++;
    } catch (\Throwable $e) {
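        // substr() can cut a multibyte character in half; substitute invalid bytes so json_encode() does not return false
        echo "💥 Crash: " . json_encode(substr($input, 0, 50), JSON_INVALID_UTF8_SUBSTITUTE | JSON_UNESCAPED_UNICODE) . "\n";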
        echo "   Error: " . $e->getMessage() . "\n";
        $invariantFailed++;
    }
}

function randomString($len)
{
    $chars = [
        'a','b','c','1','2','3',
        '.',',','\'','_',
        'א','ב','カ','ナ',
        '🙂','🇺🇸',
        "\u{0301}",
        "\u{200D}",
        ' '
    ];

    $s = '';
    for ($i = 0; $i < $len; $i++) {
        $s .= $chars[array_rand($chars)];
    }

    return $s;
}

for ($i = 0; $i < 100; $i++) {
    $input = randomString(50);
    invariantTest($input);
}

// === Summary ===
echo "\n╔═══════════════════════════════════════════════════════════════╗\n";
echo "║  Test Summary                                                 ║\n";
echo "╚═══════════════════════════════════════════════════════════════╝\n";
echo "✅ Passed: $passed\n";
echo "❌ Failed: $failed\n";
echo "🔄 Invariant Passed: $invariantPassed\n";
echo "💥 Invariant Failed: $invariantFailed\n";

$total = $passed + $failed;
if ($total > 0) {
    $percentage = ($passed / $total) * 100;
    echo "Success Rate: " . number_format($percentage, 1) . "%\n";
}

if ($failed === 0 && $invariantFailed === 0) {
    echo "\n🎉 All tests passed!\n";
    exit(0);
} else {
    exit(1);
}

live627 · 2026-05-01T06:12:52Z

I experimented with sending the offset to preg_match instead of using $substring_after to try to reduce string copying, but it turned out to be slightly slower, surprisingly enough.
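
As a rough illustration of the two approaches compared above, here is a minimal sketch. The pattern and function names are hypothetical and are not taken from the SMF source; only `$substring_after` comes from the comment. Note that when an offset is passed to `preg_match()`, `^` does not anchor at that offset, so the sketch uses `\G` instead.

```php
<?php

declare(strict_types=1);

// Hypothetical pattern for illustration only; not the pattern SMF uses.
// \G anchors the match at the offset passed to preg_match(), or at the
// start of the subject when no offset is given.
const WORD_AT_CURSOR = '/\G\w+/u';

/** Variant A: copy the remainder of the string, then match at its start. */
function matchViaSubstring(string $text, int $offset): ?string
{
    $substring_after = substr($text, $offset);

    return preg_match(WORD_AT_CURSOR, $substring_after, $m) === 1 ? $m[0] : null;
}

/** Variant B: match in place by passing the byte offset to preg_match(). */
function matchViaOffset(string $text, int $offset): ?string
{
    return preg_match(WORD_AT_CURSOR, $text, $m, 0, $offset) === 1 ? $m[0] : null;
}

$text = 'héllo wörld';
$offset = strlen('héllo '); // byte offset (not character offset) of "wörld"

// Both variants should find the same word at the cursor.
var_dump(matchViaSubstring($text, $offset) === matchViaOffset($text, $offset));
```

Both variants take byte offsets, and with the `u` modifier the offset has to fall on a character boundary. Which one is faster depends on the input and the PHP build, so it is worth timing both, as was done here.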

Improve performance in Utf8String::semanticSplit()

8f35127

live627 added the Performance and Charset/Encoding (UTF8 & mb4 encoding related issues) labels Apr 30, 2026

live627 linked an issue Apr 30, 2026 that may be closed by this pull request

Undefined variables in Utf8String::semanticSplit() #9195

Open

Sesquipedalian requested changes Apr 30, 2026

Comment thread Sources/Unicode/Utf8String.php

Sesquipedalian mentioned this pull request Apr 30, 2026

Undefined variables in Utf8String::semanticSplit() #9195

Open

live627 and others added 4 commits April 30, 2026 15:42

Precompute regex patterns

c54f59e

Co-authored-by: Jon Stovell <jonstovell@gmail.com>

undefined vars

1800c09

Precompute pattern in Utf8String::extractWords()

16f12ae

don't need two load this array

f8a67cf
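
The diff itself is not reproduced on this page, so the following is only a generic sketch of what "precomputing" a regex pattern can look like in PHP. The class name, method, and character classes are all hypothetical, not SMF's. The idea is to assemble the pattern string once and reuse it, instead of rebuilding it on every call or loop iteration.

```php
<?php

declare(strict_types=1);

// Hypothetical helper; names and character classes are illustrative only.
final class WordPatternCache
{
    /** @var array<string, string> assembled patterns keyed by their alternation */
    private static array $patterns = [];

    /** Assemble the pattern once per distinct class list and reuse it afterwards. */
    public static function get(array $classes): string
    {
        $key = implode('|', $classes);

        return self::$patterns[$key] ??= '/(?:' . $key . ')+/u';
    }
}

// The first call builds the string; later calls with the same list reuse it.
$pattern = WordPatternCache::get(['\p{L}', '\p{N}', '_']);

preg_match_all($pattern, 'foo_bar123 baz', $matches);
print_r($matches[0]);
```

PHP already keeps an internal cache of compiled patterns, so what a cache like this saves is mainly the repeated string assembly and any table lookups that feed it.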

live627 force-pushed the semanticSplit branch from a6bb47f to f8a67cf May 1, 2026 05:37

live627 commented May 1, 2026

Comment thread Sources/Unicode/Utf8String.php Outdated

live627 commented May 1, 2026

Comment thread Sources/Unicode/Utf8String.php Outdated

Apply suggestions from code review

8a31f0d

Co-authored-by: John Rayes <live627@gmail.com>

Improve performance in Utf8String::semanticSplit() #9196

live627 wants to merge 6 commits into SimpleMachines:release-3.0 from live627:semanticSplit
