Improve performance in Utf8String::semanticSplit() #9196

Open
live627 wants to merge 6 commits into SimpleMachines:release-3.0 from live627:semanticSplit

Conversation

Contributor

live627 commented Apr 30, 2026

Before

╔═══════════════════════════════════════════════════════════════╗
║  Utf8String::extractWords()                                   ║
║  (Based on actual behavior from SMF implementation)           ║
╚═══════════════════════════════════════════════════════════════╝

⏱️ Performance Test
Mean: 752.776 ms
Words extracted: 600
❌ Performance slow (threshold: 100ms)

After

╔═══════════════════════════════════════════════════════════════╗
║  Utf8String::extractWords()                                   ║
║  (Based on actual behavior from SMF implementation)           ║
╚═══════════════════════════════════════════════════════════════╝

📝 ASCII & Alphanumeric Tests
✅ PASS: ascii words
✅ PASS: camelCase
✅ PASS: mixed case
✅ PASS: single word
✅ PASS: empty string

🔤 Punctuation Tests (Preserved in words)
✅ PASS: dot abbreviation
✅ PASS: multi dots
✅ PASS: apostrophe word
✅ PASS: multiple apostrophes

🔢 Number Tests
✅ PASS: simple number
✅ PASS: decimal
✅ PASS: comma number
✅ PASS: mixed num alpha
✅ PASS: alpha num boundary

🌍 Unicode Script Tests
✅ PASS: hebrew letters
✅ PASS: hebrew apostrophe
✅ PASS: hebrew quotes

🇯🇵 CJK & Japanese Tests
✅ PASS: katakana sequence
✅ PASS: katakana mix

🎯 Combining Marks Tests
✅ PASS: combining acute (alone)
✅ PASS: multiple combining
✅ PASS: combining in word

⚙️ Special Characters
✅ PASS: underscore word
✅ PASS: mixed underscore

😊 Emoji Tests
✅ PASS: emoji sequence
✅ PASS: emoji + text
✅ PASS: family emoji (ZWJ)
✅ PASS: zwj mixed

🚩 Flag Emoji Tests
✅ PASS: single flag
✅ PASS: multiple flags
✅ PASS: odd RI sequence

🎪 Complex Mixed Cases
✅ PASS: full sentence
✅ PASS: unicode chaos

🌐 Whitespace & Edge Cases
✅ PASS: double space
✅ PASS: tabs
✅ PASS: newline
✅ PASS: mixed whitespace

✨ Emoji Modifiers & Variation Selectors
✅ PASS: emoji with variation selector

⏱️ Performance Test
Mean: 76.442 ms
Words extracted: 600
✅ Performance OK

🔄 Invariant Tests (No Crashes)

╔═══════════════════════════════════════════════════════════════╗
║  Test Summary                                                 ║
╚═══════════════════════════════════════════════════════════════╝
✅ Passed: 38
❌ Failed: 0
🔄 Invariant Passed: 100
💥 Invariant Failed: 0
Success Rate: 100.0%

🎉 All tests passed!
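
The flag cases in this output ('single flag', 'multiple flags', 'odd RI sequence') expect regional indicator code points to be consumed two at a time, with an unpaired trailing indicator dropped, as the expected values in the test script show. A minimal sketch of that pairing rule follows. It is not SMF's implementation: the function name `pairRegionalIndicators()` is hypothetical, and the regex is just one simple way to express left-to-right pairing.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of SMF): returns the complete flags in
 * $text, i.e. pairs of regional indicators (U+1F1E6..U+1F1FF), pairing
 * from left to right. An unpaired indicator is simply not returned.
 */
function pairRegionalIndicators(string $text): array
{
    preg_match_all('/[\x{1F1E6}-\x{1F1FF}]{2}/u', $text, $matches);

    return $matches[0];
}

// Flags written as escapes so the sketch does not depend on file encoding.
$us = "\u{1F1FA}\u{1F1F8}";
$ca = "\u{1F1E8}\u{1F1E6}";
$jp = "\u{1F1EF}\u{1F1F5}";
$loneC = "\u{1F1E8}";

$multiple = pairRegionalIndicators($us . $ca . $jp); // even count: every RI is paired
$odd = pairRegionalIndicators($us . $loneC);         // odd count: the last RI is left over

echo count($multiple) . " flags, then " . count($odd) . " flag\n";
```

Because the pairing is purely positional, a stray indicator in front of a flag shifts every following pair, which is why the odd-length case is worth testing separately.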

Test script

<?php

/**
 * Standalone CLI test script for Utf8String optimization validation
 * 
 * Usage: php test-utf8-optimization.php
 */

declare(strict_types=1);

define('SMF', 1);
require_once __DIR__ . '/Sources/Unicode/Utf8String.php';
require_once __DIR__ . '/Sources/Unicode/CombiningClasses.php';
require_once __DIR__ . '/Sources/Unicode/Metadata.php';
require_once __DIR__ . '/vendor/autoload.php';

use SMF\Unicode\Utf8String;

$passed = 0;
$failed = 0;

function assertEqual(string $label, string $input, array $expected): void
{
    global $passed, $failed;
    $obj = Utf8String::create($input);
    $result = $obj->extractWords(0);

    // Sort for comparison
    sort($result);
    sort($expected);

    if ($result !== $expected) {
        echo "❌ FAIL: $label\n";
        echo "Input:    " . json_encode($input, JSON_UNESCAPED_UNICODE) . "\n";
        echo "Expected: " . json_encode($expected, JSON_UNESCAPED_UNICODE) . "\n";
        echo "Got:      " . json_encode($result, JSON_UNESCAPED_UNICODE) . "\n\n";
        $failed++;
    } else {
        echo "✅ PASS: $label\n";
        $passed++;
    }
}

echo "╔═══════════════════════════════════════════════════════════════╗\n";
echo "║  Utf8String::extractWords()                                   ║\n";
echo "║  (Based on actual behavior from SMF implementation)           ║\n";
echo "╚═══════════════════════════════════════════════════════════════╝\n\n";

// === ASCII & Alphanumeric ===
echo "📝 ASCII & Alphanumeric Tests\n";
assertEqual('ascii words', 'hello world', ['hello', 'world']);
assertEqual('camelCase', 'helloWorldTest', ['helloWorldTest']);
assertEqual('mixed case', 'ABCdefGHI', ['ABCdefGHI']);
assertEqual('single word', 'hello', ['hello']);
assertEqual('empty string', '', []);

// === Punctuation (KEPT in words per Unicode word break rules) ===
echo "\n🔤 Punctuation Tests (Preserved in words)\n";
assertEqual('dot abbreviation', 'e.g.', ['e.g']);  // Kept together
assertEqual('multi dots', 'U.S.A.', ['U.S.A']);  // Kept together per word break rules
assertEqual('apostrophe word', "don't", ["don't"]);  // Apostrophe kept per MidLetter rule
assertEqual('multiple apostrophes', "rock'n'roll", ["rock'n'roll"]);  // All kept together

// === Numbers ===
echo "\n🔢 Number Tests\n";
assertEqual('simple number', '123456', ['123456']);
assertEqual('decimal', '3.14159', ['3.14159']);  // Decimal point kept (MidNum rule)
assertEqual('comma number', '1,234,567', ['1,234,567']);  // Commas kept (MidNum rule)
assertEqual('mixed num alpha', 'A1B2C3', ['A1B2C3']);
assertEqual('alpha num boundary', 'foo123bar', ['foo123bar']);

// === Unicode Scripts ===
echo "\n🌍 Unicode Script Tests\n";
assertEqual('hebrew letters', 'אבגדה', ['אבגדה']);
assertEqual('hebrew apostrophe', "א'ב", ["א'ב"]);  // Kept together (MidLetter rule)
assertEqual('hebrew quotes', "א\"ב", ["א\"ב"]);  // Kept together per word break

// === CJK & Japanese ===
echo "\n🇯🇵 CJK & Japanese Tests\n";
assertEqual('katakana sequence', 'カタカナ', ['カタカナ']);
assertEqual('katakana mix', 'カタabcナ', ['カタ', 'abc', 'ナ']);  // Different scripts break

// === Combining Marks ===
echo "\n🎯 Combining Marks Tests\n";
assertEqual('combining acute (alone)', "a\u{0301}", ["a\u{0301}"]);  // Has base, kept
assertEqual('multiple combining', "a\u{0301}\u{0302}", ["a\u{0301}\u{0302}"]);  // Has base, kept
assertEqual('combining in word', "e\u{0301}cole", ["e\u{0301}cole"]);  // Base + text

// === Special Characters (underscores are word chars) ===
echo "\n⚙️ Special Characters\n";
assertEqual('underscore word', 'foo_bar', ['foo_bar']);  // Underscore is \w
assertEqual('mixed underscore', 'foo_bar123_baz', ['foo_bar123_baz']);

// === Emoji ===
echo "\n😊 Emoji Tests\n";
assertEqual('emoji sequence', '🙂🙂🙂', ['🙂', '🙂', '🙂']);  // Split individually!
assertEqual('emoji + text', 'hi🙂there', ['hi', '🙂', 'there']);
assertEqual(
    'family emoji (ZWJ)',
    "👨‍👩‍👧‍👦",
    ["👨‍👩‍👧‍👦"]  // ZWJ keeps together
);
assertEqual(
    'zwj mixed',
    "a👨‍👩‍👧‍👦b",
    ['a', '👨‍👩‍👧‍👦', 'b']
);

// === Flag Emoji (Regional Indicators) ===
echo "\n🚩 Flag Emoji Tests\n";
assertEqual(
    'single flag',
    "🇺🇸",
    ["🇺🇸"]  // 2 RIs = 1 flag
);
assertEqual(
    'multiple flags',
    "🇺🇸🇨🇦🇯🇵",
    ["🇺🇸", "🇨🇦", "🇯🇵"]  // Even number of RIs = multiple flags
);
assertEqual(
    'odd RI sequence',
    "🇺🇸🇨",
    ["🇺🇸"]  // 🇺🇸 = flag, 🇨 alone is filtered (single RI)
);

// === Complex Mixed Cases ===
echo "\n🎪 Complex Mixed Cases\n";
assertEqual(
    'full sentence',
    "Hello, world! It's 3.14 e.g. 🇺🇸🙂 foo_bar",
    [
        "3.14",        // Decimal kept
        "e.g",         // Dot abbreviation kept
        "foo_bar",     // Underscore kept
        "Hello",       // Basic word
        "It's",        // Apostrophe kept
        "world",       // Basic word
        "🇺🇸",         // Flag (2 RIs)
        "🙂"           // Emoji
    ]
);

assertEqual(
    'unicode chaos',
    "a\u{0301}b🇺🇸👨‍👩‍👧‍👦3.14",
    [
        "3.14",                   // Decimal
        "áb",                  // Base mark + letter
        "🇺🇸",                    // Flag
        "👨‍👩‍👧‍👦"               // Family emoji (ZWJ)
    ]
);

// === Whitespace & Edge Cases ===
echo "\n🌐 Whitespace & Edge Cases\n";
assertEqual('double space', 'a  b', ['a', 'b']);  // Spaces removed
assertEqual('tabs', "a\tb", ['a', 'b']);  // Tabs removed
assertEqual('newline', "a\nb", ['a', 'b']);  // Newlines removed
assertEqual('mixed whitespace', "a \t b", ['a', 'b']);  // All whitespace removed

// === Emoji with Variation Selectors ===
echo "\n✨ Emoji Modifiers & Variation Selectors\n";
assertEqual(
    'emoji with variation selector',
    '☺️',  // Smiley + variation selector
    ['☺️']
);

// === Performance Test ===
echo "\n⏱️ Performance Test\n";
$input = str_repeat("Hello🙂🇺🇸👨‍👩‍👧‍👦 3.14 foo_bar ", 100);

$times = [];

for ($i = 0; $i < 5; $i++) {
    $start = hrtime(true);

    $words = Utf8String::create($input)->extractWords(0);

    $end = hrtime(true);

    $times[] = ($end - $start) / 1e6; // ms
}

$mean = array_sum($times) / count($times);

echo "Mean: " . number_format($mean, 3) . " ms\n";
echo "Words extracted: " . count($words) . "\n";

if ($mean < 100) {
    echo "✅ Performance OK\n";
} else {
    echo "❌ Performance slow (threshold: 100ms)\n";
}

// === Invariant Tests ===
echo "\n🔄 Invariant Tests (No Crashes)\n";
$invariantPassed = 0;
$invariantFailed = 0;

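function invariantTest(string $input): void
{
    global $invariantPassed, $invariantFailed;
    try {
        // Not crashing is the whole test; the result itself is not inspected.
        Utf8String::create($input)->extractWords(0);
        $invariantPassed++;
    } catch (\Throwable $e) {
        // substr() can cut a multibyte character in half, which would make
        // json_encode() return false; substitute invalid bytes instead.
        echo "💥 Crash: " . json_encode(substr($input, 0, 50), JSON_INVALID_UTF8_SUBSTITUTE) . "\n";
        echo "   Error: " . $e->getMessage() . "\n";
        $invariantFailed++;
    }
}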

function randomString($len)
{
    $chars = [
        'a','b','c','1','2','3',
        '.',',','\'','_',
        'א','ב','','',
        '🙂','🇺🇸',
        "\u{0301}",
        "\u{200D}",
        ' '
    ];

    $s = '';
    for ($i = 0; $i < $len; $i++) {
        $s .= $chars[array_rand($chars)];
    }

    return $s;
}

for ($i = 0; $i < 100; $i++) {
    $input = randomString(50);
    invariantTest($input);
}

// === Summary ===
echo "\n╔═══════════════════════════════════════════════════════════════╗\n";
echo "║  Test Summary                                                 ║\n";
echo "╚═══════════════════════════════════════════════════════════════╝\n";
echo "✅ Passed: $passed\n";
echo "❌ Failed: $failed\n";
echo "🔄 Invariant Passed: $invariantPassed\n";
echo "💥 Invariant Failed: $invariantFailed\n";

$total = $passed + $failed;
if ($total > 0) {
    $percentage = ($passed / $total) * 100;
    echo "Success Rate: " . number_format($percentage, 1) . "%\n";
}

if ($failed === 0 && $invariantFailed === 0) {
    echo "\n🎉 All tests passed!\n";
    exit(0);
} else {
    exit(1);
}
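
The performance section of the script averages five timed runs, and the first of those may also pay one-time costs such as autoloading and regex compilation. A sketch of a slightly sturdier measurement is shown here, under stated assumptions: `medianMs()` is a hypothetical helper, not part of SMF or this PR, and the workload is a stand-in `preg_split()` call so the block runs without SMF loaded.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of SMF): runs $fn a few times untimed,
 * then returns the median wall-clock time of $runs timed calls, in ms.
 */
function medianMs(callable $fn, int $runs = 5, int $warmup = 1): float
{
    $runs = max(1, $runs); // Always take at least one sample.

    for ($i = 0; $i < $warmup; $i++) {
        $fn();
    }

    $times = [];

    for ($i = 0; $i < $runs; $i++) {
        $start = hrtime(true);
        $fn();
        $times[] = (hrtime(true) - $start) / 1e6; // ns -> ms
    }

    sort($times);
    $mid = intdiv(count($times), 2);

    // Odd count: middle element. Even count: mean of the two middle ones.
    return count($times) % 2 === 1
        ? $times[$mid]
        : ($times[$mid - 1] + $times[$mid]) / 2;
}

// Stand-in workload (an assumption) so the sketch is self-contained.
$input = str_repeat('Hello 3.14 foo_bar ', 100);
$ms = medianMs(fn () => preg_split('/\s+/u', $input, -1, PREG_SPLIT_NO_EMPTY));

echo 'Median: ' . number_format($ms, 3) . " ms\n";
```

In the test script, the closure would instead be `fn () => Utf8String::create($input)->extractWords(0)`. The median is less sensitive than the mean to a single slow run, which matters when the pass/fail threshold is a fixed 100 ms.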

live627 added the Performance and Charset/Encoding (UTF8 & mb4 encoding related issues) labels Apr 30, 2026
live627 linked an issue Apr 30, 2026 that may be closed by this pull request
Comment thread Sources/Unicode/Utf8String.php
Comment thread Sources/Unicode/Utf8String.php Outdated
Comment thread Sources/Unicode/Utf8String.php Outdated
Co-authored-by: John Rayes <live627@gmail.com>
Contributor Author

live627 commented May 1, 2026

I experimented with passing an offset to preg_match() instead of building $substring_after, hoping to reduce string copying, but surprisingly it turned out to be slightly slower.
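
For readers comparing the two approaches, a minimal sketch follows. It is not the code from this PR: the function names, the `\p{L}+` pattern, and the sample string are assumptions chosen for illustration. One detail that matters when switching between them is that `^` does not anchor at the offset passed to preg_match(), so the in-place variant uses `\G` instead.

```php
<?php

declare(strict_types=1);

/**
 * Variant 1 (hypothetical name): copy the remainder of the string,
 * then match at the start of the copy.
 */
function matchViaSubstring(string $string, int $offset): ?string
{
    $substring_after = substr($string, $offset);

    return preg_match('/^\p{L}+/u', $substring_after, $m) === 1 ? $m[0] : null;
}

/**
 * Variant 2 (hypothetical name): match in place, starting at $offset.
 * \G anchors the match at the offset; ^ would still mean "start of subject".
 */
function matchViaOffset(string $string, int $offset): ?string
{
    return preg_match('/\G\p{L}+/u', $string, $m, 0, $offset) === 1 ? $m[0] : null;
}

$string = 'héllo world';
$offset = strlen('héllo '); // Offsets are in bytes, not characters.

var_dump(matchViaSubstring($string, $offset) === matchViaOffset($string, $offset));
```

The two variants are not interchangeable for every pattern: with an offset, lookbehind assertions can still see the text before the offset, whereas a substring hides it. Which one is faster depends on the input and the PHP/PCRE build, so it is worth measuring rather than assuming.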

Labels

Charset/Encoding UTF8 & mb4 encoding related issues Performance

Development

Successfully merging this pull request may close these issues.

Undefined variables in Utf8String::semanticSplit()

2 participants