feat: add supertokenizers (#236)

* remove multiword warning

* add superbpe tokenizers

* fix issue with mwe

* form

* working version

* first pass

* small fixes, many comments

* fix e5 bug

* Adjust arcane formulae

* fix: logging

* wip

* wip

* wip

* lower complexity

* add lock file

* fix: metaspace pretokenizer

* fix: bug in vocab

* feat: spaces/commas etc.

* turn tokenizer into package

* add annotations

* feat: turn tokenizer into package

* fix: future

* add tokenizer function

* update lockfile

* feat: improve segmentation of unigram

* fix: broken merge

* fix interpunct tokens

* fix tests, make tokenizer changes better

* update lock file

* fix comment, add additional check for pad token

* tests: add a lot of tests

* fix: 3.9 error
Stephan Tulkens committed
80338f21cbc41589079c65f1d8a0a60e936c4915
Parent: 86d5378
Committed by GitHub <noreply@github.com> on 5/26/2025, 12:14:12 PM