Regex is (almost) all you need
Nice new read on tokenization!
You've heard about the SolidGoldMagikarp token, which breaks GPT-2 because it was present in the training set of the tokenizer, but not in that of the LLM trained later.
This paper digs in in a lot more depth and detail, on a lot more models, discovering a less extreme version of the above -...
Andrej Karpathy, x.com
