vocab : refactor normalizer flags into options struct, add strip_accents #24371
Conversation
CISC
left a comment
Should perhaps also handle type being set to NFC/NFD (or even NFKC/NFKD, but that's beyond this PR)...
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
I'll think about how to best handle explicit
Hi @CISC, I re-ran the tests after applying your suggestions. Tests (
    if (normalizer_opts.strip_accents && flags.is_accent_mark) {
        continue;
    }
Is this necessary? Shouldn't NFD have dealt with it?
Also, this would be a behavioral change; if it is necessary, when does it occur?
It is a no-op for precomposed input, but it is needed when the input is already decomposed (NFD form).
unicode_cpts_normalize_nfd folds precomposed characters straight to their base letter (é U+00E9 → e), but a standalone combining mark (cafe + U+0301) survives, and WordPiece then tokenizes the word as [UNK].
| café | transformers | with skip | without skip |
|---|---|---|---|
| precomposed | [7668] | [7668] | [7668] |
| decomposed | [7668] | [7668] | [100] |
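To illustrate why the skip matters only for decomposed input, here is a minimal sketch. It assumes a toy lookup table in place of the real NFD data, and every name in it (`toy_nfd_base`, `toy_is_accent_mark`, `toy_strip_accents`) is hypothetical rather than taken from the PR. The accent check covers only the Combining Diacritical Marks block (U+0300..U+036F).

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the real lookup: maps a precomposed code point
// straight to its base letter, as described in the reply above.
static const std::unordered_map<uint32_t, uint32_t> toy_nfd_base = {
    {0x00E9, 0x0065}, // é -> e
    {0x00F6, 0x006F}, // ö -> o
    {0x00FC, 0x0075}, // ü -> u
};

// Toy check: only the Combining Diacritical Marks block (U+0300..U+036F).
static bool toy_is_accent_mark(uint32_t cpt) {
    return cpt >= 0x0300 && cpt <= 0x036F;
}

// Folds precomposed code points to their base letter; when skip_marks is set,
// also drops standalone combining marks from already-decomposed input.
static std::vector<uint32_t> toy_strip_accents(const std::vector<uint32_t> & cpts, bool skip_marks) {
    std::vector<uint32_t> out;
    out.reserve(cpts.size());
    for (uint32_t cpt : cpts) {
        if (skip_marks && toy_is_accent_mark(cpt)) {
            continue; // combining mark: nothing to fold, so drop it
        }
        auto it = toy_nfd_base.find(cpt);
        out.push_back(it != toy_nfd_base.end() ? it->second : cpt);
    }
    return out;
}
```

With the skip enabled, both `c a f é` (precomposed) and `c a f e U+0301` (decomposed) reduce to `c a f e`. Without it, the decomposed form keeps its trailing U+0301, which is the case that ends up as [UNK] in the table.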
WPM previously applied NFD unconditionally, so accented words on case-sensitive models (e.g. German_Semantic_V3, which sets strip_accents: false) didn't match transformers. NFD is now applied only when strip_accents is set.

Tests (German_Semantic_V3):

| | | | |
|---|---|---|---|
| schön | [102, 2602, 103] | [102, 778, 103] | [102, 2602, 103] |
| Müller | [102, 5730, 103] | [102, 7107, 522, 103] | [102, 5730, 103] |
| Größe | [102, 5281, 103] | [102, 8812, 103] | [102, 5281, 103] |
| café | [102, 3416, 30897, 30969, 103] | [102, 3416, 604, 103] | [102, 3416, 30897, 30969, 103] |
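The gating described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: only the `strip_accents` field name appears in the diff, while the struct name `sketch_normalizer_opts` and the helpers `sketch_fold` and `sketch_normalize` are hypothetical, and the folding covers just two code points.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical options struct; only the strip_accents field name comes from the PR diff.
struct sketch_normalizer_opts {
    bool strip_accents = false;
};

// Toy stand-in for NFD folding: handles just ö (U+00F6) and ü (U+00FC).
static uint32_t sketch_fold(uint32_t cpt) {
    switch (cpt) {
        case 0x00F6: return 0x006F; // ö -> o
        case 0x00FC: return 0x0075; // ü -> u
        default:     return cpt;
    }
}

// Applies the folding only when strip_accents is set; otherwise the input
// is returned unchanged so accented code points can still match the vocab.
static std::vector<uint32_t> sketch_normalize(const std::vector<uint32_t> & cpts,
                                              const sketch_normalizer_opts & opts) {
    if (!opts.strip_accents) {
        return cpts;
    }
    std::vector<uint32_t> out;
    out.reserve(cpts.size());
    for (uint32_t cpt : cpts) {
        out.push_back(sketch_fold(cpt));
    }
    return out;
}
```

With `strip_accents` left at `false`, `schön` passes through untouched. Setting it to `true` yields `schon`, which would explain a different token ID for a model whose vocabulary contains the accented form.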