Interesting idea. Aren't there a few models somewhat closer to what he suggests than a typical LLM? There are one or two experiments that operate on raw bytes, and some robotics diffusion transformers like NVIDIA's, though I guess that one keeps its action/motion tokens separate. Are there vision-language models that treat text and images more or less the same?
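Just to pin down what "treating them the same" can mean: here's a minimal sketch of the early-fusion idea, where image patches are projected into the same embedding space as text tokens and a single transformer runs over the combined sequence (roughly what Fuyu does with linear patch projections; Chameleon instead discretizes images into the text vocabulary). This isn't any specific model's code; all names are made up for illustration, and causal masking is omitted for brevity.

```python
import torch
import torch.nn as nn

class UnifiedDecoder(nn.Module):
    """Toy early-fusion model: one transformer over text + image patches."""

    def __init__(self, vocab_size=32000, d_model=512, n_heads=8,
                 n_layers=4, patch_dim=16 * 16 * 3):
        super().__init__()
        # Text tokens get an ordinary embedding lookup...
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # ...while raw image patches are linearly projected into the same
        # d_model space, so downstream layers can't tell the modalities apart.
        self.patch_proj = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, text_ids, image_patches):
        # Concatenate patch embeddings and text embeddings into one sequence;
        # one set of transformer weights handles both modalities.
        seq = torch.cat(
            [self.patch_proj(image_patches), self.text_embed(text_ids)],
            dim=1)
        return self.lm_head(self.blocks(seq))

toy = UnifiedDecoder()
patches = torch.randn(1, 9, 16 * 16 * 3)   # 9 flattened 16x16 RGB patches
tokens = torch.randint(0, 32000, (1, 12))  # 12 text token ids
print(toy(tokens, patches).shape)          # torch.Size([1, 21, 32000])
```

The separate action/motion tokens in the robotics models would be the opposite design choice: a distinct vocabulary and head per modality rather than one shared stream.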
For it to be science, "AGI" should be defined. It's used in an imprecise way even in papers like this.
Also, for this to be constructive, he should build a machine learning model.