TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones (github.com/dlyuangod)
237 points by T-A on Jan 3, 2024 | hide | past | favorite | 37 comments


From the paper's abstract [1]:

It stands out by requiring merely a 24G GPU for training and an 8G GPU or CPU for inference. Built upon Phi-2, TinyGPT-V couples an effective language backbone with pre-trained vision modules from BLIP-2 or CLIP. TinyGPT-V's 2.8B parameters can undergo a unique quantisation process, suitable for local deployment and inference tasks on 8G various devices.

[1] https://arxiv.org/abs/2312.16862
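
For a rough sense of what "8G" local inference looks like in practice, here is a minimal sketch using Hugging Face transformers + bitsandbytes. To be clear, this is not TinyGPT-V's own quantisation pipeline, and the model id below is just the Phi-2 language backbone as a placeholder, not the released multimodal weights:

    # Hedged sketch: generic 8-bit inference with transformers + bitsandbytes.
    # Not TinyGPT-V's actual quantisation process; the model id is a placeholder.
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "microsoft/phi-2"  # placeholder: the 2.7B language backbone only
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # keeps ~2.8B params under 8 GB
        device_map="auto",
    )

    inputs = tokenizer("An 8 GB GPU can locally run", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(out[0], skip_special_tokens=True))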


While the non-commercial Phi-2 license continues to be a letdown, I'm excited to see additional development in the space of these ultracompact LLMs. On-device AI excites me far more than relying on yet another cloud-based API.

On my M1 MacBook Air, current LLMs already run surprisingly quickly locally.
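
For anyone curious, here is a minimal sketch of what that looks like with the llama-cpp-python bindings (the GGUF path and prompt are placeholders, nothing specific to this paper):

    # Hedged sketch: local CPU/Metal inference via llama-cpp-python.
    # The model path is a placeholder for any quantised GGUF file on disk.
    from llama_cpp import Llama

    llm = Llama(model_path="models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048)
    out = llm("Q: Name three uses for a small on-device LLM.\nA:", max_tokens=128, stop=["Q:"])
    print(out["choices"][0]["text"])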


> On-device AI excites me far more than relying on yet another cloud-based API.

Agreed. The huge leap forward in capability already boggles my mind: the fact that I can run it _on my desktop CPU_ and have an interactive, natural-language oracle of the internet in a mere ~11GB file.

Everyone seems to be chasing the network-accessible API approach because lock-in is easy and if you're at the bleeding-edge (training) you've got the compute for running it anyway.

But now with accessible, local models my bet is on bored young hackers coming up with the best use cases and killer apps, not Microsoft.


> Everyone seems to be chasing the network-accessible API approach because lock-in is easy

This feels like a lazy critique. Sure companies may like the lock-in, but people are chasing the network-accessible API approach because it is way more powerful than the stuff you can run locally. Admittedly it’d be great if local models could get up to ChatGPT’s level, but if we’re being honest with ourselves they’re not there yet and are miles away from GPT-4, so it’s totally understandable that people want what OpenAI is selling.


Which local models and CPU are you using?


> But now with accessible, local models my bet is on bored young hackers coming up with the best use cases and killer apps, not Microsoft.

That's only one part of the equation, because these young hackers need companies with huge capitalization to deliver the models to them for free. What I hope actually happens is that model building becomes modular and can therefore be crowdsourced. I'm not sure this is even theoretically possible, but it would help a lot.


I see an MIT license; am I missing something?


The Phi-2 model weights (the LLM backbone in TinyGPT-V) are licensed by Microsoft under a non-commercial, research-only license [1].

[1]: https://huggingface.co/microsoft/phi-2/blob/main/LICENSE


TBF, it's now been re-licensed under the more liberal MIT license.


The code is (edit - may be? I can't seem to find it right now), but the model itself is under this license: https://huggingface.co/microsoft/phi-2/resolve/main/LICENSE


> You need to execute the above code 17 times to complete the first stage of training.

Am I missing something here? Did the authors forget about for loops? What happens if you only do it 16 times?
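
For what it's worth, the obvious workaround is just wrapping the command in a loop; the script name and config path below are guesses, not necessarily the repo's actual entry point:

    # Hedged sketch: repeating the stage-1 training command 17 times.
    # "train.py" and the config path are assumptions, not verified against the repo.
    import subprocess

    for i in range(17):
        print(f"stage-1 pass {i + 1}/17")
        subprocess.run(
            ["python", "train.py", "--cfg-path", "train_configs/tinygptv_stage1.yaml"],
            check=True,
        )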


Ever feed a Gremlin after midnight? Same thing.


Please explain this reference for non-US/Western-hemisphere folks.


In the 1984 movie Gremlins, the protagonist is warned not to feed the titular creatures (a type of transformer) after midnight, otherwise they promptly lose alignment and begin to hallucinate.


That particular rule always irked me, because it's always after midnight, and a meal happens over a stretch of time, not at a precise point in time.

I should really have got over it after 40 years.


By that logic it's always a full moon somewhere, and yet we're not seeing massive werewolf issues during the day.

Explain that!


with nowhere to hide, eradicating werewolves is much easier, though slightly more dangerous.

you can’t see what’s not there.

you aren’t a language model that ate after midnight.


> I should really have got over it after 40 years.

I am with you. The movie has no hints or clues as to when it's okay to feed again. Dawn? Sunrise?


I think you've just found the perfect storm for a killer LLM joke...

- transformer
- alignment
- hallucinate

wait, unless you were joking in the first place.


Doesn't Phi-2 have test-data contamination, which is why it's performing well on these benchmarks?

Most professionals in the field that I'm near would not touch that model with a 10-foot pole. We desperately need better validation / data-contamination detection methods.


I think you're thinking of phi-1.5 https://twitter.com/suchenzang/status/1701615026648605095

Unless there was something else for phi-2 as well?


Can you explain why you wouldn't want to use it, i.e. what data contamination is?


Contamination: training on the questions/answers used for LLM benchmarks.

The top of the Hugging Face Open LLM Leaderboard was pretty meaningless for a while, because many of the top scores were achieved by training on the evaluation data.
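
A crude way to check for that kind of leakage is long n-gram overlap between the training corpus and the benchmark items; here is a toy sketch of the idea (real decontamination pipelines do much more, e.g. normalisation and near-duplicate detection):

    # Toy contamination check: flag benchmark items that share a long n-gram
    # with any training document. Illustrative only, not any lab's pipeline.
    def ngrams(text, n=13):
        toks = text.lower().split()
        return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def flag_contaminated(train_docs, benchmark_items, n=13):
        train_grams = set()
        for doc in train_docs:
            train_grams |= ngrams(doc, n)
        return [item for item in benchmark_items if ngrams(item, n) & train_grams]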


Funny, I was just looking for something to substitute for GPT-4V, as they are limiting API usage to a few requests per day. Sadly, this project is built on top of Phi-2, which is under Microsoft's non-commercial research license.


We released an even smaller model, UForm-Gen, a couple of weeks ago. It's fully open source, so it may help, but I wouldn't expect results anywhere close to GPT-4V, given there is over a 1000x difference in model size and consumed resources.


Microsoft just changed the license of phi-2 to MIT!


Can anyone comment on an open-source multimodal LLM that can produce structured outputs based on an image? I have not found a good open-source one yet (this one included); only closed-source models seem to do this reliably well. Any suggestions are very welcome!


Something like this?

https://imgur.com/a/hPAaZUv

https://huggingface.co/spaces/Qwen/Qwen-VL-Plus

You can also ask it to give you bounding boxes of objects.


I've only used LLaVA / BakLLaVA. It falls under the Llama 2 Community License. Not sure if you consider that open source or not.
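
If it helps, the usual trick with these open models is to ask for JSON in the prompt and then validate/retry, rather than trusting one reply. Here is a rough sketch with the llava-hf 1.5 checkpoint via transformers (treat the model id, prompt format, and file name as assumptions):

    # Hedged sketch: asking LLaVA 1.5 (via Hugging Face transformers) for JSON
    # and validating the reply. Reliability is exactly the weak point mentioned
    # above, so retry/validate instead of trusting a single answer.
    import json
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"  # assumption: any LLaVA-style checkpoint works similarly
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

    image = Image.open("receipt.jpg")  # hypothetical input image
    prompt = ('USER: <image>\nList the line items as JSON, e.g. '
              '[{"name": "...", "price": 0.0}]. Reply with JSON only. ASSISTANT:')

    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    reply = processor.decode(output[0], skip_special_tokens=True).split("ASSISTANT:")[-1]

    try:
        items = json.loads(reply)
    except json.JSONDecodeError:
        items = None  # in practice: retry, or use a constrained-decoding library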


MobileVLM [1] is another recent small multimodal model. They trained their own 1.4B/2.7B LLaMa from scratch using RedPajama and Vicuna instead of leveraging Phi-2.

The papers share only one benchmark (GQA, where MobileVLM scores better), so it's hard to say how they compare otherwise.

[1] https://arxiv.org/abs/2312.16886


Their results seem comparable to BLIP-2, shifted over in the diagram.


I really want to understand this post, but I can't. Could you please point me, and other noobs like me, to resources that would help me read this? (Help anyone climb the knowledge ladder.) ELI3

-

EDIT - GPT helped me understand better:

--

>>> "This model is special because it can do similar tasks as the big models but requires much less computational power. It's like having a small but powerful engine that can do the work of a big one. This makes it more accessible for more people to use it"

---

>>> "TinyGPT-V is built on another model called Phi-2 and uses pre-trained vision modules from BLIP-2 or CLIP. It has 2.8 billion parameters (these are like the model's brain cells) and can be further compressed to fit on devices with 8GB memory. This means you could potentially run this model on your personal computer or even some high-end smartphones"

----

>>> "In summary, TinyGPT-V is a step towards making powerful AI models more accessible and efficient, which could lead to their use in a wide range of real-world applications. The authors have also shared their code and training weights for others to use and learn from"

-----

This is really interesting if you fan out the implications over time.

Here is my thinking:

Assume this paper results in a way of "compression-alyzing" vision into a model (a tiny, compressed view into a model).

Then, in a few years, one can imagine "laser views" that slice through fractals of models to find the result, resulting in tiny agents with a heat-seeking fractal laser that can navigate giant data by knowing instantaneously what to exclude (meaning the path is defined by the walls you already know you do not want to hit, so every step is one that helps you forward).

--

Or am I stating something obvious to all you brainiacs?

(no shame, I like thinking out loud)


I am no brainiac, and it isn't super clear from your post what you're describing, but here is some info that might help you better convey your question:

This is a neural net built by conjoining Phi-2 [the best available small LLM, if you'll pardon the contradiction in terms] with pre-trained vision models like BLIP or CLIP. Models are piles of weights[/parameters] that are generated by training on datasets.
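
Roughly, the "conjoining" is a small projection that maps frozen vision-encoder features into the LLM's token-embedding space, so image patches are fed to the language model as if they were word embeddings. Here is a toy sketch of that glue layer (dimensions and names are illustrative, not TinyGPT-V's actual code):

    # Toy sketch of how a vision backbone is "conjoined" with an LLM: frozen image
    # features are projected into the LLM embedding space and prepended to the text
    # embeddings. Dimensions and names are illustrative, not TinyGPT-V's real code.
    import torch
    import torch.nn as nn

    class VisionToLLMBridge(nn.Module):
        def __init__(self, vision_dim=1408, llm_dim=2560):  # e.g. BLIP-2 ViT width -> Phi-2 hidden size
            super().__init__()
            self.proj = nn.Linear(vision_dim, llm_dim)

        def forward(self, image_features, text_embeddings):
            # image_features: (batch, num_patches, vision_dim) from a frozen CLIP/BLIP-2 encoder
            # text_embeddings: (batch, seq_len, llm_dim) from the LLM's embedding table
            image_tokens = self.proj(image_features)                   # (batch, num_patches, llm_dim)
            return torch.cat([image_tokens, text_embeddings], dim=1)   # fed to the (mostly frozen) LLM

    bridge = VisionToLLMBridge()
    fused = bridge(torch.randn(1, 32, 1408), torch.randn(1, 16, 2560))
    print(fused.shape)  # torch.Size([1, 48, 2560])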

Work has already shown that training a multi-modal model from the start results in smaller, more effective models. If you want to know more, check out recent work from CVPR [a vision machine learning conference] from '23 [0] and upcoming work for this year [1].

edit to add:

The work of MS researcher Chunyuan Li [2] is worth keeping an eye on; in particular, recent work like LLaVA-Interactive [3], a multi-modal, multi-task AI system, might be what you're trying to describe with your laser/fractal-view phrasing.

[0]https://www.youtube.com/@VLPTutorial

[1]https://arxiv.org/search/?query=cvpr+2024&searchtype=all&sou...

[2]https://chunyuan.li/

[3]https://llava-vl.github.io/llava-interactive/


Lovely, getting downvoted on HN for a request to learn.


thanks for sharing this!


Is it related to tinygrad[1]?

[1] https://github.com/geohot/tinygrad


Not at all. tinygrad is a deep learning framework.

If you check the dependencies of TinyGPT-V you'll see that it does not depend on tinygrad but rather torch... https://github.com/DLYuanGod/TinyGPT-V/blob/main/environment...




