vocab : refactor normalizer flags into options struct, add strip_accents #24371

Merged: ggerganov merged 3 commits into ggml-org:master from o7si:normalizer-options-strip-accents on Jun 11, 2026

o7si (Contributor) commented Jun 9, 2026

The WPM (WordPiece) tokenizer previously applied NFD unconditionally, so on case-sensitive models (e.g. German_Semantic_V3, which sets strip_accents: false) accented words were tokenized differently than in transformers.

NFD is now applied only when strip_accents is set.


Tests (German_Semantic_V3):

| Input | transformers | master | this PR |
|---|---|---|---|
| schön | [102, 2602, 103] | [102, 778, 103] | [102, 2602, 103] |
| Müller | [102, 5730, 103] | [102, 7107, 522, 103] | [102, 5730, 103] |
| Größe | [102, 5281, 103] | [102, 8812, 103] | [102, 5281, 103] |
| café | [102, 3416, 30897, 30969, 103] | [102, 3416, 604, 103] | [102, 3416, 30897, 30969, 103] |
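
The behavior change described above can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: `normalizer_options_sketch`, `fold_accent_sketch`, and `normalize_sketch` are hypothetical names, the `lowercase` field is an assumed example of another flag, and the small lookup table stands in for real NFD data.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical options struct; the real field set in llama.cpp may differ.
struct normalizer_options_sketch {
    bool lowercase     = true;  // assumed flag, ASCII-only in this sketch
    bool strip_accents = true;  // gate for accent folding
};

// Toy stand-in for NFD-based accent folding: maps a few precomposed
// code points to their base letter. Real code would use full Unicode data.
static uint32_t fold_accent_sketch(uint32_t cpt) {
    switch (cpt) {
        case 0x00E9: return U'e'; // é
        case 0x00F6: return U'o'; // ö
        case 0x00FC: return U'u'; // ü
        default:     return cpt;
    }
}

// Accent folding runs only when strip_accents is set, so a model with
// strip_accents: false keeps "schön" intact.
static std::vector<uint32_t> normalize_sketch(const std::vector<uint32_t> & cpts,
                                              const normalizer_options_sketch & opts) {
    std::vector<uint32_t> out;
    out.reserve(cpts.size());
    for (uint32_t cpt : cpts) {
        if (opts.strip_accents) {
            cpt = fold_accent_sketch(cpt);
        }
        if (opts.lowercase && cpt >= U'A' && cpt <= U'Z') {
            cpt += U'a' - U'A';
        }
        out.push_back(cpt);
    }
    return out;
}
```

With `strip_accents = false` the input code points pass through unchanged, which is the case that previously diverged from transformers in the table above.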

github-actions bot added the `python` label (python script changes) Jun 9, 2026
o7si marked this pull request as ready for review June 9, 2026 17:39
o7si requested a review from CISC as a code owner June 9, 2026 17:39

CISC (Member) left a comment

Should perhaps also handle type being set to NFC/NFD (or even NFKC/NFKD, but that's beyond this PR)...

Comment thread src/llama-vocab.h Outdated
Comment thread src/llama-vocab.cpp
o7si and others added 2 commits June 10, 2026 03:16
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
o7si (Contributor, Author) commented Jun 9, 2026

> Should perhaps also handle type being set to NFC/NFD (or even NFKC/NFKD, but that's beyond this PR)...

I'll think about how to best handle explicit NFC/NFD normalizers and track it in a separate PR.

o7si (Contributor, Author) commented Jun 9, 2026

Hi @CISC, I re-ran the tests after applying your suggestions.


Tests (German_Semantic_V3):

| Input | transformers | master | this PR |
|---|---|---|---|
| schön | [102, 2602, 103] | [102, 778, 103] | [102, 2602, 103] |
| Müller | [102, 5730, 103] | [102, 7107, 522, 103] | [102, 5730, 103] |
| Größe | [102, 5281, 103] | [102, 8812, 103] | [102, 5281, 103] |
| café | [102, 3416, 30897, 30969, 103] | [102, 3416, 604, 103] | [102, 3416, 30897, 30969, 103] |

Comment thread src/llama-vocab.cpp
Comment on lines +834 to +836

    if (normalizer_opts.strip_accents && flags.is_accent_mark) {
        continue;
    }

Member:

Is this necessary? Shouldn't NFD have dealt with it?

Also, this would be a behavioral change. If it is necessary, when does it occur?

Contributor, Author:

No-op for precomposed input, but needed when the input is already decomposed (NFD form).

unicode_cpts_normalize_nfd folds precomposed chars straight to base (é U+00E9 → e), but a standalone combining mark (cafe + U+0301) survives and reaches WordPiece as [UNK].

| café | transformers | with skip | without skip |
|---|---|---|---|
| precomposed | [7668] | [7668] | [7668] |
| decomposed | [7668] | [7668] | [100] |
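
The precomposed versus decomposed case can be illustrated with a small sketch. It assumes, as described in the reply above, that the folding step only rewrites precomposed characters and leaves standalone combining marks alone. All names here (`is_accent_mark_sketch`, `fold_precomposed_sketch`, `strip_accents_sketch`) are hypothetical, and the accent-mark test covers only the Combining Diacritical Marks block (U+0300 to U+036F) rather than the full set of Unicode marks.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Approximation: treat only U+0300..U+036F as accent marks.
static bool is_accent_mark_sketch(uint32_t cpt) {
    return cpt >= 0x0300 && cpt <= 0x036F;
}

// Toy stand-in for folding a precomposed character to its base letter.
static uint32_t fold_precomposed_sketch(uint32_t cpt) {
    return cpt == 0x00E9 ? U'e' : cpt; // é -> e
}

// When skip_marks is true, standalone combining marks are dropped, so
// already-decomposed input ends up the same as precomposed input.
static std::vector<uint32_t> strip_accents_sketch(const std::vector<uint32_t> & cpts,
                                                  bool skip_marks) {
    std::vector<uint32_t> out;
    for (uint32_t cpt : cpts) {
        if (skip_marks && is_accent_mark_sketch(cpt)) {
            continue; // e.g. U+0301 following a plain 'e'
        }
        out.push_back(fold_precomposed_sketch(cpt));
    }
    return out;
}
```

For precomposed `café` the skip is a no-op. For `cafe` followed by U+0301, the mark survives without the skip, which matches the row where the tokenizer produced [100] instead of [7668].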

CISC added the `merge ready` label (A maintainer can use this label to indicate that they consider the changes final and ready to merge.) Jun 10, 2026
ggerganov merged commit 68f3066 into ggml-org:master Jun 11, 2026
27 checks passed