
A multimodal AGI will be more useful than one that isn't. People want an AI that can work with and understand audio, images, videos, etc.


You didn't read the article. The thesis is that current "merely" multimodal approaches, which project distinct kinds of inputs into the same latent space, are insufficient for building a general world model that can be used for general internal reasoning. An example is the "Rs in strawberry" question: answering it requires LLMs to be trained on that information explicitly, since they see subword tokens rather than the individual characters in a word. It's an artifact of how LLMs learn: by predicting text, rather than the way humans learn, by interacting with the world.
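
You can see the token issue directly. A minimal sketch, assuming the open-source tiktoken library and its "cl100k_base" encoding (the exact splits depend on the vocabulary):

    import tiktoken

    # The model never observes letters, only subword token IDs.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode("strawberry")
    print(tokens)                             # a short list of integer token IDs
    print([enc.decode([t]) for t in tokens])  # subword chunks, not individual letters
    # The letter "r" is never a unit the model directly sees, so
    # "how many Rs are in strawberry?" must be memorized, not read off.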

More elaborately, they don't have a natural understanding of pragmatics. Transformers are best at modelling syntax, and their semantic understanding seems to come from rote memorization and "manipulating symbols" rather than from building general world models.


I did read it, and even granting their idea of focusing on a world model, an AGI that can also operate on audio, images, and videos (i.e., one that is multimodal) will be more useful than one that operates purely on text.


I'm skeptical you read it, because he doesn't make that argument. In fact, I've literally never heard anyone argue that text-only is more useful than multimodal.



