### System info

- transformers: 5.3.0
- tokenizers: 0.22.2
- Python: 3.12 / Linux
### Who can help?

@ArthurZucker
### Reproduction

Tokenizer-only, ~7 MB download:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3")
print(repr(tok.decode(tok.encode("hello world", add_special_tokens=False))))
# prints 'helloworld', expected 'hello world'
```
### Expected vs actual

| | v4.x / raw `tokenizers.Tokenizer.from_file` | v5.x `AutoTokenizer` |
|---|---|---|
| round-trip of `"hello world"` | `"hello world"` | `"helloworld"` |
| `backend_tokenizer.pre_tokenizer` | `Sequence([..., ByteLevel])` (from `tokenizer.json`) | `Metaspace(replacement="▁", split=False)` |
### Root cause

DeepSeek's `tokenizer_config.json` sets `"tokenizer_class": "LlamaTokenizerFast"`, but `tokenizer.json` describes a ByteLevel BPE tokenizer (GPT-2 style, not SentencePiece). In v5, `LlamaTokenizer.__init__` unconditionally installs a `Metaspace` pre-tokenizer, overwriting whatever `tokenizer.json` declared. Since the vocab has no `▁`-prefixed tokens, the metaspace marker is silently dropped during encoding, so spaces vanish on decode.

Loading the same file via `tokenizers.Tokenizer.from_file(...)` or `PreTrainedTokenizerFast.from_pretrained(...)` works correctly; the bug is specifically in `LlamaTokenizer`'s `__init__`.
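The mechanism can be reproduced without any download. This is a stand-alone sketch using only the `tokenizers` library (not transformers code): a tiny ByteLevel BPE tokenizer round-trips spaces fine, but overwriting its pre-tokenizer with `Metaspace`, as `LlamaTokenizer.__init__` does in v5, makes them vanish, because the `▁` marker maps to no token in a ByteLevel vocab and is dropped.

```python
# Demonstration of the mechanism with a toy ByteLevel BPE tokenizer.
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tok.decoder = decoders.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=300,
    show_progress=False,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tok.train_from_iterator(["hello world"] * 10, trainer=trainer)

before = tok.decode(tok.encode("hello world").ids)
print(repr(before))  # 'hello world'

# Simulate the v5 behavior: clobber the declared ByteLevel pre-tokenizer.
tok.pre_tokenizer = pre_tokenizers.Metaspace(replacement="▁", split=False)
after = tok.decode(tok.encode("hello world").ids)
print(repr(after))  # the space is gone: the ▁ marker has no vocab entry
```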
### Affected models

Any model shipping `tokenizer_class: Llama*Tokenizer` together with a ByteLevel `tokenizer.json`. Confirmed broken on all five DeepSeek V3/R1 upstream repos: `DeepSeek-V3`, `DeepSeek-V3.1`, `DeepSeek-V3.2-Exp`, `DeepSeek-R1`, `DeepSeek-R1-0528`.

Note: `LlamaTokenizer` is an alias for `LlamaTokenizerFast` in v5, so either class name in `tokenizer_config.json` triggers the bug.
### Downstream impact

Catastrophic silent accuracy loss in any HF-based inference stack. For example, GSM8K accuracy on DeepSeek-V3-Lite drops from ~63.7% to ~26.1% purely from this tokenizer regression.
### Suggested fix

In `LlamaTokenizer.__init__`, only install the default `Metaspace` pre-tokenizer when `tokenizer.json` is absent (the "train from scratch" path). When loading an existing `tokenizer.json`, trust its `pre_tokenizer`.
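A minimal sketch of that guard, with hypothetical names (`install_default_pre_tokenizer` is an illustrative helper, not the actual transformers source):

```python
from tokenizers import Tokenizer, models, pre_tokenizers

def install_default_pre_tokenizer(tok: Tokenizer) -> None:
    # Only install the SentencePiece-style default when the loaded tokenizer
    # declares no pre-tokenizer (the "train from scratch" path); otherwise
    # trust what tokenizer.json declared.
    if tok.pre_tokenizer is None:
        tok.pre_tokenizer = pre_tokenizers.Metaspace(replacement="▁", split=False)

# Case 1: tokenizer.json already declares ByteLevel, so it is left alone.
loaded = Tokenizer(models.BPE())
loaded.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
install_default_pre_tokenizer(loaded)
pieces = [p for p, _ in loaded.pre_tokenizer.pre_tokenize_str("a b")]
print(pieces)  # ByteLevel split preserved: ['a', 'Ġb']

# Case 2: no pre-tokenizer declared, so the Metaspace default is installed.
fresh = Tokenizer(models.BPE())
install_default_pre_tokenizer(fresh)
print(fresh.pre_tokenizer is not None)
```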
### Workaround

Load through the base fast-tokenizer class, which uses `tokenizer.json` as-is:

```python
from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast.from_pretrained("deepseek-ai/DeepSeek-V3")
```