feat: add supertokenizers (#236)
* remove multiword warning
* add superbpe tokenizers
* fix issue with mwe
* form
* working version
* first pass
* small fixes, many comments
* fix e5 bug
* Adjust arcane formulae
* fix: logging
* wip
* wip
* wip
* lower complexity
* add lock file
* fix: metaspace pretokenizer
* fix: bug in vocab
* feat: spaces/commas etc.
* turn tokenizer into package
* add annotations
* feat: turn tokenizer into package
* fix: future
* add tokenizer function
* update lockfile
* feat: improve segmentation of unigram
* fix: broken merge
* fix interpunct tokens
* fix tests, make tokenizer changes better
* update lock file
* fix comment, add additional check for pad token
* tests: add a lot of tests
* fix: 3.9 error
Stephan Tulkens committed
80338f21cbc41589079c65f1d8a0a60e936c4915
Parent: 86d5378
Committed by GitHub <noreply@github.com>
on 5/26/2025, 12:14:12 PM