If you replace a tokenizer that averages 5 bytes per token with a byte-level representation, you now need 5 times as much memory and (depending on the specifics of the attention mechanism) 11 to 25 times as much compute.
At the scales we're talking about, that's quite a hefty price to pay, and it doesn't even take into account that you might need more layers to replace the processing that was implicitly done by the tokenizer.
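As a rough back-of-the-envelope check on where such multipliers come from, here's a small sketch (mine, not from the original argument). It uses the standard FLOP approximation for a transformer layer; the `d_model` and `seq_len` values are purely illustrative assumptions. The factor stays near 5 when the linear (projection/MLP) terms dominate and approaches 25 when quadratic attention dominates.

```python
# Rough sketch: how much extra compute a byte-level model needs when the same
# text occupies ~5x as many positions. FLOP counts are the usual approximation
# for one transformer layer; the concrete d_model/seq_len values below are
# illustrative assumptions, not taken from the original text.

def layer_flops(seq_len: int, d_model: int) -> float:
    """Approximate forward FLOPs of one transformer layer on one sequence."""
    proj = 8 * seq_len * d_model**2   # Q, K, V, and output projections
    attn = 4 * seq_len**2 * d_model   # attention scores + weighted sum (quadratic in length)
    mlp = 16 * seq_len * d_model**2   # feed-forward block with 4x expansion
    return proj + attn + mlp

def compute_multiplier(seq_len: int, d_model: int, expansion: int = 5) -> float:
    """Compute factor when the same text takes `expansion` times as many positions."""
    return layer_flops(expansion * seq_len, d_model) / layer_flops(seq_len, d_model)

if __name__ == "__main__":
    # Short contexts, wide model: linear terms dominate, factor stays near 5.
    print(compute_multiplier(seq_len=512, d_model=4096))    # ~5.4
    # GPT-2-small-like shape: lands around 11.
    print(compute_multiplier(seq_len=2048, d_model=768))    # ~11
    # Very long contexts: quadratic attention dominates, factor approaches 25.
    print(compute_multiplier(seq_len=65536, d_model=768))   # ~24
```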