This package seems to use llama_cpp for local inference [1], so you can probably use anything supported by that [2]. However, it looks like it just passes the OCR output to the model for correction - the language model never actually sees the original image.
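Something like this pattern, I'd guess (untested sketch using the llama-cpp-python bindings; the model path and prompt are made up):

```python
from llama_cpp import Llama

# Hypothetical GGUF path - any chat model llama.cpp supports should do.
llm = Llama(model_path="./models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=4096)

ocr_output = "Tbe quick brovvn fox jumps ovcr the lazy d0g."

# Only the OCR text goes in - the model never sees the source image.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Fix the OCR errors. Output only the corrected text."},
        {"role": "user", "content": ocr_output},
    ],
    temperature=0.0,
)
print(response["choices"][0]["message"]["content"])
```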
That said, there are some large language models you can run locally that accept image input: Phi-3-Vision [3], LLaVA [4], MiniCPM-V [5], etc.
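llama.cpp can actually run some of these itself. Here's an untested sketch using llama-cpp-python's LLaVA chat handler (the weight and projector paths are made up):

```python
import base64
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

def to_data_uri(path: str) -> str:
    # The chat handler takes image URLs; a base64 data URI works for local files.
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

# Hypothetical paths: LLaVA GGUF weights plus the matching CLIP projector file.
llm = Llama(
    model_path="./models/llava-v1.5-7b.Q4_K_M.gguf",
    chat_handler=Llava15ChatHandler(clip_model_path="./models/mmproj-model-f16.gguf"),
    n_ctx=2048,
)

response = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": to_data_uri("scan.png")}},
            {"type": "text", "text": "Transcribe the text in this image."},
        ],
    }],
)
print(response["choices"][0]["message"]["content"])
```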
LLaVA specifically might not be great for OCR, though; IIRC it scales all input images to 336 x 336, meaning it will only spot details that are visible at that scale.
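You can preview roughly what the model "sees" by doing the same resize yourself with Pillow - if you can't read the text in the 336 x 336 copy, LLaVA probably can't either:

```python
from PIL import Image

# Squash a page scan down to the encoder's input size and eyeball it.
img = Image.open("scan.png")
img.resize((336, 336)).save("scan_336.png")  # small print usually turns to mush
```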
I've had very poor results using LLaVA for OCR. It's slow and usually can't transcribe more than a few words. I think this is because it just uses CLIP to encode the image into a single embedding vector for the LLM.
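You can see that bottleneck with the plain CLIP encoder from transformers (illustrative only, not what this package does): the pooled output is one fixed-size vector no matter how much text is on the page.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

inputs = processor(images=Image.open("scan.png"), return_tensors="pt")
with torch.no_grad():
    embedding = model.get_image_features(**inputs)

# One vector for the entire page, however dense the text.
print(embedding.shape)  # torch.Size([1, 768])
```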
The latest LLaVA architecture is supposed to improve on this, but there are better-suited architectures if all you want is OCR.
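TrOCR is one example of a purpose-built option - an image encoder paired with a text decoder, trained only on transcription. Untested sketch via transformers; note it expects single text lines, so full pages need line detection first:

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

# One cropped line of printed text in, a transcription out.
pixel_values = processor(images=Image.open("line.png").convert("RGB"),
                         return_tensors="pt").pixel_values
ids = model.generate(pixel_values)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```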