LlamaTokenizer in v5 overrides tokenizer.json's ByteLevel pre-tokenizer with Metaspace, silently breaks DeepSeek V3/R1 family #45488

@dc3671

Description

System info

  • transformers: 5.3.0
  • tokenizers: 0.22.2
  • Python: 3.12 / Linux

Who can help?

@ArthurZucker

Reproduction

Tokenizer-only, ~7 MB download:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3")
print(repr(tok.decode(tok.encode("hello world", add_special_tokens=False))))
# 'helloworld'  — expected 'hello world'

Expected vs actual

| | v4.x / raw `tokenizers.Tokenizer.from_file` | v5.x `AutoTokenizer` |
|---|---|---|
| round-trip of `"hello world"` | `'hello world'` | `'helloworld'` |
| `backend_tokenizer.pre_tokenizer` | `Sequence([..., ByteLevel])` (from tokenizer.json) | `Metaspace(replacement="▁", split=False)` |

Root cause

DeepSeek's tokenizer_config.json sets "tokenizer_class": "LlamaTokenizerFast", but its tokenizer.json is ByteLevel BPE (GPT-2 style, not SentencePiece). In v5, LlamaTokenizer.__init__ unconditionally installs a Metaspace pre-tokenizer, overwriting whatever tokenizer.json declared. Since the vocab has no ▁-prefixed tokens, the ▁ marker is dropped during encoding and the spaces vanish from the round trip.
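To see why the marker is dropped, here is a toy illustration in plain Python (not the real tokenizers library): a ByteLevel vocab stores spaces inside Ġ-prefixed pieces, while Metaspace rewrites spaces to ▁ before vocab lookup, so the marker has nothing to match against.

```python
# Toy GPT-2-style (ByteLevel) vocab: spaces live in 'Ġ'-prefixed pieces,
# and there are no '▁'-prefixed pieces at all.
BYTELEVEL_VOCAB = {"hello", "world", "Ġworld"}

def toy_encode(text: str, vocab: set[str]) -> list[str]:
    # Greedy longest-match segmentation; characters with no vocab entry are
    # simply skipped, mirroring how the '▁' marker vanishes here.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            i += 1  # unknown character: dropped
    return tokens

def toy_decode(tokens: list[str]) -> str:
    # ByteLevel-style decoding maps 'Ġ' back to a space
    return "".join(tokens).replace("Ġ", " ")

# Correct path: a ByteLevel pre-tokenizer keeps the space as 'Ġ'
print(toy_decode(toy_encode("helloĠworld", BYTELEVEL_VOCAB)))  # hello world

# Buggy path: Metaspace(split=False) rewrites the space to '▁' first
pre = "hello world".replace(" ", "▁")  # 'hello▁world'
print(toy_decode(toy_encode(pre, BYTELEVEL_VOCAB)))  # helloworld
```

The real pipeline is BPE rather than greedy longest-match, but the failure mode is the same: once the space has been rewritten to a symbol the vocab does not contain, no token can carry it back out.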

Loading the same file via tokenizers.Tokenizer.from_file(...) or PreTrainedTokenizerFast.from_pretrained(...) works — the bug is specifically in LlamaTokenizer's __init__.

Affected models

All models shipping tokenizer_class: Llama*Tokenizer + a ByteLevel tokenizer.json. Confirmed broken on all 5 DeepSeek V3/R1 upstream repos:
DeepSeek-V3, DeepSeek-V3.1, DeepSeek-V3.2-Exp, DeepSeek-R1, DeepSeek-R1-0528.

Note: LlamaTokenizer is an alias of LlamaTokenizerFast in v5, so either class name in tokenizer_config.json triggers the bug.

Downstream impact

Catastrophic silent accuracy loss on any HF-based inference stack. For example, GSM8K on DeepSeek-V3-Lite drops ~63.7% → ~26.1% purely from this tokenizer regression.

Suggested fix

In LlamaTokenizer.__init__, only install the default Metaspace pre-tokenizer when tokenizer.json is absent (the "train from scratch" path). When loading an existing tokenizer.json, trust its pre_tokenizer.
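As a sketch, the guard could look like this (hypothetical helper and parameter names, not the actual transformers source):

```python
# Hypothetical sketch of the proposed logic; names are illustrative only.
def pick_pre_tokenizer(pre_tokenizer_from_json, default_metaspace):
    """Keep tokenizer.json's pre-tokenizer when one was loaded; install the
    Llama default Metaspace only on the train-from-scratch path."""
    if pre_tokenizer_from_json is not None:
        return pre_tokenizer_from_json
    return default_metaspace

# Loading an existing tokenizer.json: its declared pre-tokenizer wins
print(pick_pre_tokenizer("Sequence([..., ByteLevel])", "Metaspace"))
# No tokenizer.json present: fall back to the Llama default
print(pick_pre_tokenizer(None, "Metaspace"))
```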

Workaround

from transformers import PreTrainedTokenizerFast
tok = PreTrainedTokenizerFast.from_pretrained("deepseek-ai/DeepSeek-V3")
