Could anyone who's an expert comment on why there seems to be such a focus on discussing tokenizers? It seems every other day there's a new article or implementation of a tokenizer on HN. But downstream from that, rarely anything. As a non-expert I would have thought tokenizing is just one step.
It's trending today because of the phenomenon of Glitch Tokens. They thought all Glitch Tokens had been removed by GPT-4 but apparently one is still left. If you go down the rabbit hole on Glitch Tokens it gets ... really really weird.
But does the tokenizer have anything to do with Glitch Tokens? Glitch Tokens seem more like a function of the neural network. I'm saying this with only a surface level understanding of glitch tokens.
It does a bit, because the fact that they're able to persist is sort of an artifact of how naive the tokenizer is (it's a counting operation based on n-grams), and of the fact that it runs as a separate step. There's no feedback from the transformer to the tokenizer to say "hey, this token is actually pretty meaningless, maybe try again on that one". That means that strings of characters that are common but of very low semantic value, like the example of Reddit usernames that mostly post on /r/counting, will be included in the model's vocabulary even though they're not interesting.
When humans see extremely low-information-density data, we can forget it. And the model can too, but only kind of - it can forget (or rather, never learn) what the "word" means, but it can't forget that it's a word.
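To make the "counting operation" point concrete, here is a minimal sketch of BPE-style merge learning in Python. It's a toy, not the tokenizer any particular model uses; the function `train_bpe` and the username `xqzbot` are made up for illustration. The only signal it looks at is how often adjacent symbols co-occur, so a frequent junk string earns a vocabulary slot just as easily as a meaningful word.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn BPE-style merges using nothing but frequency counts."""
    # Each whitespace-separated word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the winning pair fused into one symbol.
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    vocab = {s for symbols in words for s in symbols}
    return merges, vocab

# Hypothetical corpus: one meaningless username repeated many times,
# plus a handful of ordinary words.
corpus = "xqzbot " * 50 + "the cat sat on the mat"
merges, vocab = train_bpe(corpus, num_merges=5)
```

With five merges, the junk username ends up as a single vocabulary entry while "the" is still split into characters, purely because of the counts. Nothing in this step asks whether the string means anything.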
Tokens are the primitives that most LLMs (and broadly a lot of NLP) work with. While you and I would expect whole words to be tokens, many tokens are shorter - 3 to 4 characters - and don't always match the sentence structure you and I expect.
This can create some interesting challenges and unexpected behavior. It also makes certain things, like vectorization, a challenge since tokens may not map 1:1 with the words you intend to weight them against.
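A rough sketch of the mismatch, assuming a tiny made-up vocabulary and a greedy longest-match splitter (real BPE tokenizers apply learned merges instead, so treat `tokenize` and `VOCAB` as illustrative only):

```python
def tokenize(text, vocab):
    """Greedy longest-match segmentation; unknown characters become one-char tokens."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

# Hypothetical subword vocabulary; note the leading spaces inside some tokens.
VOCAB = {"token", "ization", " is", " un", "expected", "ly", " odd"}

text = "tokenization is unexpectedly odd"
tokens = tokenize(text, VOCAB)
words = text.split()
```

Here four words come out as seven tokens, and the token boundaries fall inside words rather than between them, so any per-word weighting has to be mapped onto several tokens.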
> While you and I would expect whole words to be tokens, many tokens are shorter - 3 to 4 characters - and don't always match the sentence structure you and I expect.
There is a phenomenon called Broca's Aphasia which is, essentially, the inability to connect words into sentences. This mostly prevents the patient from communicating via language. But patients with this condition can reveal quite a bit about the structure of the language they can no longer speak.
One example discussed in The Language Instinct is someone who works at (and was injured at) a mill. He is unable to produce utterances that are more than one word long, though he seems to do well at understanding what people say to him. One of his single-word utterances, describing the mill where he works, is "Four hundred tons a day!".
This is the opposite of what you describe, a single token that is longer than one word in the base language instead of being shorter. But it appears to be the same kind of thing.
By the way, if you study a highly inflectional language such as Latin or Russian, you will lose the assumption that interpretive tokens should be whole words. You'd still expect them to align closely with sentence structure, though.
You can observe (what I assume is) the same tokenization phenomenon in people who are struggling to speak (for example because they’re distracted by something or not native speakers): stock fragments will come out all at once, and less common words will get split, usually on affixes or at the join point of compound words.
Your answer explains what tokenizers are, which isn't what I asked. You also told me something interesting about tokenizers, which is also not what I asked. Can you tell me anything NOT about tokenizers? This is my point.
The reason it's not discussed much is that what goes on downstream of tokenization is extremely opaque. It's many layers of the transformer network: the overall structure is documented, but what exactly the numbers inside it mean is hard to figure out.
There's an article here where the structure of an image generation network is explored a bit:
With all due respect, this feels like asking me to talk about math without talking about numbers.
Tokens are so closely tied to modern LLMs that it’s basically impossible to not talk about them. They’re getting a lot of attention because they are the primitive. They’re the thing of most interest for improving performance.
> ...It seems every other day there's a new article or implementation of a tokenizer on HN. But downstream from that, rarely anything. As a non-expert I would have thought tokenizing is just one step.
If someone points out a preponderance of information on one step relative to all other steps, they probably are not asking for even more information about that step.
People like to chip in with what they've recently learned, so one answer is that most people on HN don't understand much beyond the input layer. A better answer is that the relative complexity of the processes in subsequent layers increases substantially, along with the requisite background to understand them. They also don't share the relative commonality of the input layer, so fewer people are qualified to discuss them with any authority.
That's where I am, so I get it. I'm working on building learning resources for a symposium, and it feels very much like "Step 1: Tokenize, Step 2: ???, Step 3: Output!".
Tokenizing is just one very trivial step, and it is probably the simplest and least interesting part of the process. Embedding vectors are dramatically more interesting and actually useful.
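For anyone stuck at "Step 2: ???", the very first thing downstream is an embedding lookup: each token ID indexes a row of a learned table, and the geometry of those rows is what carries meaning. A minimal sketch follows; the table `EMBED` and its 2-dimensional values are invented for illustration (real models learn vectors with hundreds or thousands of dimensions).

```python
import math

# Hypothetical embedding table: token ID -> vector. Values are made up.
EMBED = {
    0: [1.0, 0.0],
    1: [0.8, 0.6],
    2: [0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

token_ids = [0, 1, 2]                      # what the tokenizer hands over
vectors = [EMBED[t] for t in token_ids]    # what the network actually sees

sim_near = cosine(vectors[0], vectors[1])  # vectors pointing in similar directions
sim_far = cosine(vectors[0], vectors[2])   # orthogonal vectors
```

Tokens whose vectors point in similar directions are treated as related; the ID numbers themselves carry no meaning at all.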
There is a mad rush to write articles in the LLM / ML / AI space to show that you haven't been left behind (like a FOMO, but more a FO-looking-like-you-MO). Tokenizers are by far the easiest part of that stack to grok, so the end result is a seemingly infinite selection of tokenization submissions.
Most of the shitty behavior of LLMs on syntactic and lexical tasks is due to the tokenizer and not due to the LLM itself. Even tiny changes in tokenization have massive downstream effects on LLM behavior.
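A small sketch of why that happens, using a made-up vocabulary and made-up IDs (`encode` and `VOCAB` are hypothetical, and the greedy matching stands in for a real merge-based tokenizer): a single leading space is enough to hand the model a completely different input.

```python
def encode(text, vocab):
    """Greedy longest-match encoding of text into token IDs."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # No vocabulary entry starts at this position.
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

# Hypothetical vocabulary: token string -> integer ID.
VOCAB = {"straw": 101, "berry": 102, " strawberry": 103}

without_space = encode("strawberry", VOCAB)
with_space = encode(" strawberry", VOCAB)
```

The two encodings share no IDs, and neither exposes the individual letters, so tasks like spelling, counting characters, or rhyming depend on whatever the model happened to memorize about each token rather than on anything it can read off the input.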