vocab : refactor normalizer flags into options struct, add strip_accents by o7si · Pull Request #24371 · ggml-org/llama.cpp

o7si · 2026-06-09T17:12:36Z

WPM previously applied NFD unconditionally, so accented words on case-sensitive models (e.g. German_Semantic_V3, which sets strip_accents: false) didn't match transformers.

NFD is now applied only when strip_accents is set.

Tests (German_Semantic_V3):

Input	transformers	master	this PR
`schön`	`[102, 2602, 103]`	`[102, 778, 103]`	`[102, 2602, 103]`
`Müller`	`[102, 5730, 103]`	`[102, 7107, 522, 103]`	`[102, 5730, 103]`
`Größe`	`[102, 5281, 103]`	`[102, 8812, 103]`	`[102, 5281, 103]`
`café`	`[102, 3416, 30897, 30969, 103]`	`[102, 3416, 604, 103]`	`[102, 3416, 30897, 30969, 103]`
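
The gating described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `normalizer_options` layout (beyond a `strip_accents` field), the `fold_accent` helper, and the `normalize` function are all hypothetical, and the toy folding table covers only a few code points from the test inputs instead of using llama.cpp's Unicode tables.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the PR's options struct; only strip_accents is
// named in the PR, the lowercase field is an assumption for illustration.
struct normalizer_options {
    bool lowercase     = false;
    bool strip_accents = false;
};

// Toy accent folding for a few precomposed code points from the tests above.
// Assumption: the real code relies on llama.cpp's NFD tables instead.
static uint32_t fold_accent(uint32_t cpt) {
    switch (cpt) {
        case 0x00E9: return U'e'; // é
        case 0x00F6: return U'o'; // ö
        case 0x00FC: return U'u'; // ü
        default:     return cpt;
    }
}

// Apply accent folding only when strip_accents is set, so case-sensitive
// models that keep accents see the original code points.
static std::vector<uint32_t> normalize(const std::vector<uint32_t> & cpts,
                                       const normalizer_options & opts) {
    std::vector<uint32_t> out;
    out.reserve(cpts.size());
    for (uint32_t cpt : cpts) {
        if (opts.strip_accents) {
            cpt = fold_accent(cpt);
        }
        if (opts.lowercase && cpt >= U'A' && cpt <= U'Z') {
            cpt += U'a' - U'A'; // ASCII-only lowercasing, for the sketch
        }
        out.push_back(cpt);
    }
    return out;
}
```

With `strip_accents` left at `false`, `schön` passes through unchanged, which is the behavior a model like German_Semantic_V3 needs to reach its accented vocabulary entries.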

CISC

Should perhaps also handle type being set to NFC/NFD (or even NFKC/NFKD, but that's beyond this PR)...

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

o7si · 2026-06-09T19:26:53Z

> Should perhaps also handle type being set to NFC/NFD (or even NFKC/NFKD, but that's beyond this PR)...

I'll think about how to best handle explicit NFC/NFD normalizers and track it in a separate PR.

o7si · 2026-06-09T19:35:43Z

Hi @CISC, I re-ran the tests after applying your suggestions.

Tests (German_Semantic_V3):

Input	`transformers`	master	this PR
`schön`	`[102, 2602, 103]`	`[102, 778, 103]`	`[102, 2602, 103]`
`Müller`	`[102, 5730, 103]`	`[102, 7107, 522, 103]`	`[102, 5730, 103]`
`Größe`	`[102, 5281, 103]`	`[102, 8812, 103]`	`[102, 5281, 103]`
`café`	`[102, 3416, 30897, 30969, 103]`	`[102, 3416, 604, 103]`	`[102, 3416, 30897, 30969, 103]`

CISC · 2026-06-10T05:17:51Z

+            if (normalizer_opts.strip_accents && flags.is_accent_mark) {
+                continue;
+            }


Is this necessary? Shouldn't NFD have dealt with it?

Also, this would be a behavioral change; if it is necessary, when does it occur?

o7si

It is a no-op for precomposed input, but it is needed when the input is already decomposed (NFD form).

`unicode_cpts_normalize_nfd` folds precomposed chars straight to their base (é U+00E9 → e), but a standalone combining mark (cafe + U+0301) survives and reaches WordPiece as [UNK].

`café`	`transformers`	with skip	without skip
precomposed	`[7668]`	`[7668]`	`[7668]`
decomposed	`[7668]`	`[7668]`	`[100]`
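
The difference between the two rows can be illustrated with a small sketch. It is not the PR's implementation: `fold_to_base`, `is_accent_mark`, and `strip_accents` here are hypothetical helpers, the folding handles only é, and the accent check is approximated by the Combining Diacritical Marks block (U+0300..U+036F) rather than llama.cpp's per-codepoint flags.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-in for the folding step: maps precomposed é (U+00E9) to 'e'
// and leaves every other code point untouched.
static uint32_t fold_to_base(uint32_t cpt) {
    return cpt == 0x00E9 ? U'e' : cpt;
}

// Assumption: approximate "is an accent mark" by the Combining Diacritical
// Marks block; the real code reads flags.is_accent_mark.
static bool is_accent_mark(uint32_t cpt) {
    return cpt >= 0x0300 && cpt <= 0x036F;
}

// Fold precomposed characters and, when skip_marks is true, also drop
// standalone combining marks that folding cannot remove.
static std::vector<uint32_t> strip_accents(const std::vector<uint32_t> & cpts,
                                           bool skip_marks) {
    std::vector<uint32_t> out;
    for (uint32_t cpt : cpts) {
        cpt = fold_to_base(cpt);
        if (skip_marks && is_accent_mark(cpt)) {
            continue; // mirrors the `continue` in the diff above
        }
        out.push_back(cpt);
    }
    return out;
}
```

For precomposed `café` both settings give `cafe`. For decomposed input (`cafe` followed by U+0301), only the variant with the skip yields `cafe`; without it the combining mark stays in the output, which is the case that ends up as `[UNK]` in the table.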

vocab : refactor normalizer flags into options struct, add strip_accents

d890289

github-actions bot added the python label Jun 9, 2026

o7si marked this pull request as ready for review June 9, 2026 17:39

o7si requested a review from CISC as a code owner June 9, 2026 17:39

CISC reviewed Jun 9, 2026


Comment thread src/llama-vocab.h Outdated

Comment thread src/llama-vocab.cpp

o7si and others added 2 commits June 10, 2026 03:16

Update src/llama-vocab.h

e54ccc4

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

Update src/llama-vocab.cpp

b5b4522

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

CISC approved these changes Jun 10, 2026


CISC added the merge ready label Jun 10, 2026

ggerganov merged commit 68f3066 into ggml-org:master Jun 11, 2026
27 checks passed

ggerganov merged 3 commits into ggml-org:master from o7si:normalizer-options-strip-accents

3 participants
