
🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal domains, for both inference and training.


rm slow tokenizers (#40936)

* fixes missed

* gemma test fix

* refactor

* rm legacy from llama

* added renaming

* add _model

* update legacy

* update legacy

* fix docstring

* always load blank, then set _tokenizer if we have it
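The "load blank, then set `_tokenizer`" bullet above can be sketched roughly as follows. This is an illustrative stand-in only, assuming a two-phase construction pattern; the class and attribute names here (other than `_tokenizer`) are hypothetical, not the actual transformers internals.

```python
# Hypothetical sketch of the pattern: always construct the tokenizer wrapper
# in a blank state first, then attach a backend object to `_tokenizer` only
# when one is actually available.

class Backend:
    """Stand-in for a tokenizers-backend object (illustrative only)."""
    def __init__(self, vocab):
        self.vocab = vocab

class TokenizerShell:
    def __init__(self, backend=None):
        # Phase 1: always start blank.
        self._tokenizer = None
        # Phase 2: set _tokenizer only if we have a backend.
        if backend is not None:
            self._tokenizer = backend

    @property
    def is_loaded(self):
        return self._tokenizer is not None

blank = TokenizerShell()
loaded = TokenizerShell(Backend({"hello": 0}))
print(blank.is_loaded, loaded.is_loaded)  # False True
```

The blank-first construction keeps a single code path for both "tokenizer file present" and "tokenizer file absent" cases.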

* new toks

* update all berttokenizer based models

* apply feedback - delete bert duplicates

* more models --> fast only

* more convert_slow models

* fix common test refs

* updating fast only tokenizers

* openai and pegasus

* enable sentencepiecebackend

* more models

* code gen

* t5

* code gen tests

* speecht5

* mbart

* mbart50

* more models

* more models

* layoutlmv2

* update tests

* update tests

* update tests

* pretrainedtokenizer

* whisper

* whisper

* layoutxlm and storing backends

* refactor sentencepiecebackend and additional_special_tokens
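The sentencepiece-backend and `additional_special_tokens` refactor mentioned above might look roughly like this minimal sketch. The class here is a hypothetical stand-in (not the real transformers `SentencePieceBackend`), assuming the backend is responsible for registering extra special tokens without clobbering existing vocabulary ids.

```python
# Illustrative-only sketch: a backend that appends additional_special_tokens
# to its vocabulary, assigning fresh ids only to genuinely new tokens.

class SentencePieceBackendSketch:
    def __init__(self, vocab, additional_special_tokens=()):
        self.vocab = dict(vocab)
        self.additional_special_tokens = []
        for tok in additional_special_tokens:
            self.add_special_token(tok)

    def add_special_token(self, token):
        if token not in self.vocab:
            # Assign the next free id rather than overwriting an existing one.
            self.vocab[token] = len(self.vocab)
        if token not in self.additional_special_tokens:
            self.additional_special_tokens.append(token)

backend = SentencePieceBackendSketch({"<s>": 0, "</s>": 1}, ["<mask>", "<s>"])
print(backend.vocab)  # {'<s>': 0, '</s>': 1, '<mask>': 2}
```

Note that `<s>` is recorded as a special token but keeps its original id; only `<mask>` gets a new one.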

* renaming tokenization_utils --> tokenization_python

* update tests

* bert test

* blenderbot

* clip

* codegen

* code_llama

* cohere

* deberta, deberta v2, funnel

* gpt2

* batch update tests

* pegasus qwen2 roberta

* more models

* layout tests

* some renaming

* fix references to utils_fast

* fix refs

* fix refs

* fix refs

* fix refs

* fix refs

* fix refs

* fix refs

* fix some tests

* regression

* fix refs

* fix refs

* missed the most crucial file in my last commit

* fix refs

* fix refs

* fix refs

* batch encode fix

* fix some tests

* keep BC for batch_decode because of too many refs
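The backward-compatibility shim for `batch_decode` hinted at above can be sketched as a thin alias that loops over `decode`. The class below is illustrative, not the real library code; only the `decode`/`batch_decode` method names are taken from the actual tokenizer API.

```python
# Hedged sketch: since many call sites still reference batch_decode, keep it
# as a BC wrapper that simply delegates each sequence to decode().

class DecoderSketch:
    def __init__(self, id_to_token):
        self.id_to_token = id_to_token

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)

    def batch_decode(self, batch_ids):
        # BC wrapper: per-sequence delegation, same output shape as before.
        return [self.decode(ids) for ids in batch_ids]

dec = DecoderSketch({0: "hello", 1: "world"})
print(dec.batch_decode([[0, 1], [1]]))  # ['hello world', 'world']
```

Keeping the alias avoids touching every downstream reference while the single-sequence `decode` remains the one real implementation.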

* more tests

* fix more tests

* fix for processors

* fixing more models

* deleted mbart50 by accident

* seamless m4t

* albert fix

* whisper

* layout3

* attempt to fix cached tokenizers on CI

* trying another fix on CI

* again try to work around CI

* bertweet

* tapas

* mbart50

* luke

* mluke

* markuplm

* markuplm

* fix some more auto tests

* some random model failures

* mistralcommontester

* more fixes

* ref fix

* siglip

* marian

* plbart

* update utils toks

* seamless m4t

* roc bert

* update byt5 test

* xlm

* esm

* roformer

* code llama

* biogpt

* m2m100

* dpr and flaubert

* xlm and speech to text

* tok backend pass object

* tokenizer object pass

* wav2vec2

* wav2vec2

* cpmant

* update utils tokenizers

* cpmant

* bartpho

* test apply chat template assistant mask

* apply chat template video

* apply chat template assistant mask

* test torch

* update from slow in base and fix donut processor errors

* auto to point to tokenizers backend, fix kosmos2

* some non-model fixes for old slow models that no longer have their own tokenizer file, as they are the same as bert

* missed file from last commit

* idefics2

* fixup

* fixup

* pretrained tokenizer fast test update

* stash

* bad merge

* cherry pick more stuff that did not merge well

* fix gptsw3

* nit warn for now

* update error raising

* just ran fixup

* bring back bert legacy

* fix

* nit

* fix 56 errors on blenderbotsmall?

* 18 for blenderbotsmall

* tok auto

* missed clip

* fix tests

* something missed

* token healing

* tok common tests update - nonmodel

* try to fix non-model test in test_tokenization_utils

* fix hub tests

* try to fix hub tests

* custom vocab related fixed

* bert jap

* BERT JAP

* rename bert legacy to bert legacy

* Wav2vec2

* fix in tok python to update total vocab size - fixes speech t5
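The total-vocab-size fix above (which repaired SpeechT5) suggests bookkeeping along these lines: the total size must account for both the base vocabulary and added tokens, recomputed rather than cached. All names in this sketch are hypothetical, assuming a base-vocab plus added-tokens split like the Python tokenizer's.

```python
# Illustrative sketch: total vocab size derived from both tables on every
# access, so adding tokens can never leave a stale cached count behind.

class VocabSketch:
    def __init__(self, base_vocab):
        self.base_vocab = dict(base_vocab)
        self.added_tokens = {}

    def add_tokens(self, tokens):
        for tok in tokens:
            if tok not in self.base_vocab and tok not in self.added_tokens:
                # New ids continue after the base vocabulary.
                self.added_tokens[tok] = len(self.base_vocab) + len(self.added_tokens)

    @property
    def total_vocab_size(self):
        # Recompute instead of caching.
        return len(self.base_vocab) + len(self.added_tokens)

v = VocabSketch({"a": 0, "b": 1})
v.add_tokens(["<extra>"])
print(v.total_vocab_size)  # 3
```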

* blender bot small

* forgot test file

* test failures

* marian

* gpt2 tiktoken

* big bird / marian

* udop

* forgot couple changes

* test_serve fix

* missing import

* a couple processors fixes

* style partly

* fix to fetch tests ci

* Revert branch back to commit f5bc69ef state

* revert branch to styling

* update mistral after merge

* fixes for non model tests

* some processor test fixes

* more processor test fixes

* more processor fixes

* hub tests

* python tok utils

* fix hub test

* make style for now

* remove problematic fix copies

* python utils/check_copies.py --fix_and_overwrite

* more styling

* fixup

* silence docstring

* fix import?

* fix imports

* add the local test as well

* throw spm error

* llamas

* fix a couple tests

* broke ci

* broke ci

* broke ci

* broke ci

* add logs to debug gemma on ci

* gemma and llama

* gemma

* revert last commit

* gemma debug

* gemma debug

* gemma

* safely import spiece backend

* tok tests

* check none

* setup and qual

* ruff

* del dev files

* tok auto

* fill docstrings

* update auto

* blenderbot small nit

* add migration guide

* move mixtral patch to `TokenizersBackend`, move `TokenizerExtractor`

* rename MistralCommonTokenizer to MistralCommonBackend

* nit

* fix failures

* fixup

* remove one old test

* mark the slow one as slow

* very small fixes

* update auto mapping for missing ones

* fixup lorsd

* fixup doc and stuff

* should be the final fix

* processing update

* update

* FIX or brute AI fix the llava test

* style

* slow?

* fix is offline mode?

* fix mt5

* One tok utils (#42462)

* consolidate python and utils tokenization files, they are copies

* ruff and ref

* Format

* fix cohere

* ?

* up

* am I dumb?

* grumble

---------

Co-authored-by: Arthur <[email protected]>
Ita Zaporozhets committed
05c0e1d39082ee8b69064ed4ea9c239cd17405e9
Parent: 01c5159
Committed by GitHub <[email protected]> on 11/27/2025, 6:24:50 PM