If you replace a tokenizer that averages 5 bytes per token with a byte-level representation, you now need 5 times as much memory and (depending on the specifics of the attention mechanism) 11 to 25 times as much compute.
At the scales we're talking about, that's quite a hefty price to pay, and it doesn't even take into account that you might need more layers to replace the processing that was implicitly done by the tokenizer.
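As a rough back-of-the-envelope check on where such multipliers come from, here's a small sketch (mine, not from the original argument). It uses the standard FLOP approximation for a transformer layer; the `d_model` and `seq_len` values are purely illustrative assumptions. The factor stays near 5 when the linear (projection/MLP) terms dominate and approaches 25 when quadratic attention dominates.

```python
# Rough sketch: how much extra compute a byte-level model needs when the same
# text occupies ~5x as many positions. FLOP counts are the usual approximation
# for one transformer layer; the concrete d_model/seq_len values below are
# illustrative assumptions, not taken from the original text.

def layer_flops(seq_len: int, d_model: int) -> float:
    """Approximate forward FLOPs of one transformer layer on one sequence."""
    proj = 8 * seq_len * d_model**2   # Q, K, V, and output projections
    attn = 4 * seq_len**2 * d_model   # attention scores + weighted sum (quadratic in length)
    mlp = 16 * seq_len * d_model**2   # feed-forward block with 4x expansion
    return proj + attn + mlp

def compute_multiplier(seq_len: int, d_model: int, expansion: int = 5) -> float:
    """Compute factor when the same text takes `expansion` times as many positions."""
    return layer_flops(expansion * seq_len, d_model) / layer_flops(seq_len, d_model)

if __name__ == "__main__":
    # Short contexts, wide model: linear terms dominate, factor stays near 5.
    print(compute_multiplier(seq_len=512, d_model=4096))    # ~5.4
    # GPT-2-small-like shape: lands around 11.
    print(compute_multiplier(seq_len=2048, d_model=768))    # ~11
    # Very long contexts: quadratic attention dominates, factor approaches 25.
    print(compute_multiplier(seq_len=65536, d_model=768))   # ~24
```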