I really want to understand this post, but I can't. Could you please direct me, and noobs like me, to resources that would let us read this? (Help anyone climb the knowledge ladder.) ELI3
-
EDIT - GPT helped me understand better:
--
>>> "This model is special because it can do similar tasks as the big models but requires much less computational power1. It’s like having a small but powerful engine that can do the work of a big one. This makes it more accessible for more people to use it"
---
>>> "TinyGPT-V is built on another model called Phi-2 and uses pre-trained vision modules from BLIP-2 or CLIP1. It has 2.8 billion parameters (these are like the model’s brain cells) and can be further compressed to fit on devices with 8GB memory1. This means you could potentially run this model on your personal computer or even some high-end smartphones1"
----
>>> "In summary, TinyGPT-V is a step towards making powerful AI models more accessible and efficient, which could lead to their use in a wide range of real-world applications1. The authors have also shared their code and training weights for others to use and learn from1"
-----
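To make the "2.8 billion parameters on an 8GB device" claim concrete, here is some quick back-of-the-envelope math (weights only; activations, KV cache, and runtime overhead add more on top):

    # Rough weight-memory footprint of a 2.8B-parameter model
    # at different precisions. Weights only; real usage is higher.
    PARAMS = 2.8e9  # TinyGPT-V's reported parameter count

    bytes_per_param = {
        "fp32": 4,    # full precision
        "fp16": 2,    # half precision, the usual inference default
        "int8": 1,    # 8-bit quantization
        "int4": 0.5,  # 4-bit quantization
    }

    for precision, nbytes in bytes_per_param.items():
        gib = PARAMS * nbytes / 2**30
        print(f"{precision}: ~{gib:.1f} GiB of weights")

    # fp32: ~10.4 GiB, fp16: ~5.2 GiB, int8: ~2.6 GiB, int4: ~1.3 GiB,
    # which is why a quantized 2.8B model fits comfortably under 8 GB.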
This gets really interesting if you fan out the implications over time.
Here is my thinking:
Assume this paper results in a way of building "compression-alyzed vision" into a model (a tiny, compressed view into a model).
Then, in a few years, one can imagine "laser views" that slice through fractals of models to find the result. That would give you tiny agents with a heat-seeking fractal laser that can navigate giant data by knowing instantaneously what to exclude (meaning the path is defined by the walls you already know you do not want to hit, so every step you take is one that moves you forward).
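To make the "navigating by exclusion" intuition concrete, here is a toy sketch (a plain pruned search, purely illustrative of the idea, not any real system): a search where a dead-end check discards whole subtrees instantly, so the only steps actually taken are forward steps.

    # Toy rendering of "the path is defined by the walls you won't hit":
    # depth-first search over binary choices, where a dead-end oracle
    # prunes a subtree the moment it is known to be excludable.
    def search(path, depth, is_dead_end, is_goal):
        if is_dead_end(path):        # instantly exclude this subtree
            return None
        if is_goal(path):
            return path
        if depth == 0:
            return None
        for choice in (0, 1):
            found = search(path + [choice], depth - 1, is_dead_end, is_goal)
            if found is not None:
                return found
        return None

    # Example: find the bit string [1, 0, 1, 1] while pruning any prefix
    # that already disagrees with the target (the "walls").
    target = [1, 0, 1, 1]
    dead_end = lambda p: p != target[:len(p)]
    goal = lambda p: p == target
    print(search([], 4, dead_end, goal))  # [1, 0, 1, 1]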
--
Or am I stating something obvious to all you brainiacs? (no shame, I like thinking out loud)
I am no brainiac, and it isn't super clear from your post what you're describing, but here is some info that might help you better convey your question:
This is a neural net built by conjoining Phi-2 [the best available small LLM, if you'll pardon the contradiction in terms] with pre-trained vision models like BLIP or CLIP. Models are piles of weights[/parameters] that are generated by training on datasets.
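If it helps to see the conjoining mechanically, here is a toy sketch of the usual recipe (stand-in modules and dimensions, not TinyGPT-V's actual code): project frozen vision features into the LLM's embedding space and prepend them to the text tokens, so only the small projection needs training.

    # Toy vision-language "conjoining": frozen encoder + frozen LLM
    # embeddings, glued together by one trainable projection layer.
    import torch
    import torch.nn as nn

    VISION_DIM, LLM_DIM = 768, 2560  # e.g. CLIP-style features -> Phi-2 width

    class ToyVisionLanguageModel(nn.Module):
        def __init__(self):
            super().__init__()
            # Stand-ins for the pre-trained, frozen components.
            self.vision_encoder = nn.Linear(3 * 224 * 224, VISION_DIM)  # pretend CLIP
            self.llm_embed = nn.Embedding(51200, LLM_DIM)               # pretend Phi-2
            for p in list(self.vision_encoder.parameters()) + list(self.llm_embed.parameters()):
                p.requires_grad = False
            # The only new, trainable piece: map image features into LLM space.
            self.projection = nn.Linear(VISION_DIM, LLM_DIM)

        def forward(self, image, token_ids):
            img_feat = self.vision_encoder(image.flatten(1))     # (B, VISION_DIM)
            img_tokens = self.projection(img_feat).unsqueeze(1)  # (B, 1, LLM_DIM)
            txt_tokens = self.llm_embed(token_ids)               # (B, T, LLM_DIM)
            # Prepend the visual "token" to the text sequence; a real model
            # would now run this through the LLM's transformer stack.
            return torch.cat([img_tokens, txt_tokens], dim=1)

    model = ToyVisionLanguageModel()
    out = model(torch.randn(1, 3, 224, 224), torch.randint(0, 51200, (1, 8)))
    print(out.shape)  # torch.Size([1, 9, 2560])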
Work has already shown that training a multimodal model from the start results in a smaller, more effective model. If you want to know more, check out recent work from CVPR [a vision machine learning conference] from '23[0] and upcoming work for this year[1].
edit to add:
The work of MS researcher Chunyuan Li[2] is worth keeping an eye on; in particular, recent work like LLaVA-Interactive[3], a multimodal, multi-task AI system, might be what you're trying to describe with your laser/fractal-view phrasing.