The 32b sounds like it has some useful small tweaks: output that's more human-friendly, better mathematical reasoning, better fine-grained understanding. https://qwenlm.github.io/blog/qwen2.5-vl-32b/ https://news.ycombinator.com/item?id=43464068
Qwen2.5-VL-72b was released two months ago (to little fanfare in submissions, I think, though with some very enthusiastic comments, e.g. rabid enthusiasm for its handwriting recognition) and it's already very interesting. It's actually one of the releases that turned me on to AI, that broke through some of my skepticism & grumpiness. There are pretty good release notes detailing its capabilities here; well-done blog post. https://qwenlm.github.io/blog/qwen2.5-vl/
One thing that really piqued my interest was Qwen's HTML output mode, where it can provide bounding boxes in HTML format alongside its output. That really closes the loop for me: it makes the output something I can imagine quickly building useful visual feedback around, or easily consuming as structured data. I can't imagine an easier-to-use output format.
I suppose none of these models can output bounding box coordinates for extracted text? That seems to be a big advantage of traditional OCR over LLMs.
For applications I'm interested in, until we can get to 95+% accuracy, it will require human double-checking / corrections, which seems unfeasible w/o bounding boxes to quickly check for errors.
qwen2.5-vl-72b-instruct seems perfectly happy outputting bounding boxes in my testing.
There's also a paper https://arxiv.org/pdf/2409.12191 where they explicitly say some of their training included bounding boxes and coordinates.
We're also looking to test Qwen and others for bounding box support. Simon Willison had a great demo page where he used Gemini 2.5 to draw bounding boxes, and the results were pretty impressive. It would probably be pretty easy to drop Qwen into the same UI.
https://simonwillison.net/2025/Mar/25/gemini
Actually, Qwen 2.5 VL is trained to provide bounding boxes.
Yep, this is true. I was poking around on their GitHub and they have examples in their “cookbooks” section, e.g.:
https://github.com/QwenLM/Qwen2.5-VL/blob/main/cookbooks/ocr...
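For anyone who wants to try this without cloning the cookbook, here's a rough sketch of the kind of call involved. To be clear about assumptions: it uses an OpenAI-compatible endpoint (OpenRouter shown, but a local vLLM or LM Studio server works the same way), a placeholder API key, file name, and model id, and a JSON shape that I'm requesting in the prompt rather than whatever format the cookbook itself uses.

    # Hedged sketch: ask Qwen2.5-VL for per-line text plus pixel bounding boxes
    # via an OpenAI-compatible chat endpoint, then parse the JSON reply.
    import base64
    import json
    from openai import OpenAI

    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

    with open("page.png", "rb") as f:  # placeholder image
        image_b64 = base64.b64encode(f.read()).decode()

    prompt = (
        "Extract all text from this image. Reply with only a JSON list where each "
        'item is {"bbox_2d": [x1, y1, x2, y2], "text": "..."} in pixel coordinates.'
    )

    resp = client.chat.completions.create(
        model="qwen/qwen2.5-vl-72b-instruct",  # model id depends on the provider
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        temperature=0,  # transcription, so no sampling randomness
    )

    for item in json.loads(resp.choices[0].message.content):
        print(item["bbox_2d"], item["text"])

In my limited testing the model occasionally wraps the JSON in a markdown fence, so a small strip step before json.loads is worth having.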
If you're limited to open source models, that's very true. But for larger models and depending on your document needs, we're definitely seeing very high accuracy (95%-99%) for direct-to-JSON extraction (no intermediate markdown step) with our solution at https://doctly.ai.
In addition, Gemini 2.5 Pro does really well with bounding boxes, but yeah, not open source :(
I'd guess that it wouldn't be a huge effort to fine tune them to produce bounding boxes.
I haven't done it with OCR tasks, but I have fine tuned other models to produce them instead of merely producing descriptive text. I'm not sure if there are datasets for this already, but creating one shouldn't be very difficult.
Downloading the MLX version of Qwen2.5-VL-32b-Instruct (8-bit) via LM Studio right now, since it's not yet available on Ollama and I can run it locally... I have an OCR side project for it to work on and want to see how performant it is on my M4... will report back.
I'm very curious about the results - I've been using mistral-ocr for the last 2 weeks and it's worked really well.
Its errors are interesting (averaging around one per paragraph): semantically correct, but imprecise. A simple example: the English word "ardour" is transcribed as "ardor", and a foreign word like "palazzo", which is meant to stay as-is, gets translated to "palace". I'm still messing with temp/presence/frequency/top-p/top-k/prompting to see if I can squeeze some more precision out of it, but I'm running out of time.
Not sure if it matters, but I exported a PDF page as a PNG at 200 dpi and used that.
It seems like it's reading the text but getting the details wrong.
I would not be comfortable using this in an official capacity without more accuracy. I could see using this for words that another OCR system is uncertain about, though, as a fallback.
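On the 200 dpi export mentioned above: it may be worth re-rendering at a higher resolution, since VL models sometimes fumble small glyphs. A minimal sketch, assuming the pdf2image package (a wrapper around poppler) and a placeholder file name:

    # Rasterize PDF pages to PNG at a chosen DPI before sending them to the model.
    from pdf2image import convert_from_path  # requires poppler to be installed

    pages = convert_from_path("document.pdf", dpi=300)  # vs. the 200 dpi used above
    for i, page in enumerate(pages):
        page.save(f"page-{i:03d}.png")

For plain transcription I'd also start from greedy-ish settings (temperature 0, top_p 1) before touching presence/frequency penalties, though that's a general habit rather than anything Qwen-specific.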
You mention that you measured cost and latency in addition to accuracy - would you be willing to share those results as well? (I understand that for these open models they would vary between providers, but it would be useful to have an approximate baseline.)
Yes, I'll add that to the writeup! You're right, I initially excluded it because it's really provider-dependent, so there's lots of variance, especially with the Qwen models.
High level results were:
- Qwen 32b => $0.33/1000 pages => 53s/page
- Qwen 72b => $0.71/1000 pages => 51s/page
- Llama 90b => $8.50/1000 pages => 44s/page
- Llama 11b => $0.21/1000 pages => 08s/page
- Gemma 27b => $0.25/1000 pages => 22s/page
- Mistral => $1.00/1000 pages => 03s/page
One of these things is not like the others. $8.50/1000?? Any chance that's a typo? Otherwise, for someone who has no experience with LLM pricing models, why is Llama 90b so expensive?
It's not uncommon when using brokers to see outliers like this. Basically, some models are very popular and have many different providers, and are priced "close to the metal", since the routing will normally pick the cheapest option that meets the specified requirements (like context size). Other models - typically more specialized ones - are only hosted by a single provider, and that provider can then price them much higher than raw compute cost.
E.g. if you look at https://openrouter.ai/models?order=pricing-high-to-low, you'll see that there are some 7B and 8B models that are more expensive than Claude Sonnet 3.7.
I'll add that some big-name suppliers with big models might be running at or near a loss on purpose to draw in customers. That behavior is often encouraged by funders who gave them over $100 million to capture the market.
Their theory is they can raise prices once their competitors go out of business. The companies open-sourcing pretrained models are countering that. So, we see a mix of huge models underpriced by scheming companies and open-source models priced for inference with free market principles.
That was the cost when we ran Llama 90b via TogetherAI. But it's quite hard to standardize, since it depends a lot on who is hosting the model (e.g. Together, OpenRouter, Groq, etc.).
I think that in order to run a proper cost comparison, we would need to run each model on an AWS GPU instance and compare the runtime required.
A 2d plot would be great
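Something like this would do it, using the numbers quoted above (assumes matplotlib; log x-axis because the Llama 90b price is such an outlier):

    # Scatter of cost vs. latency from the figures above.
    import matplotlib.pyplot as plt

    results = {  # name: ($ per 1000 pages, seconds per page)
        "Qwen 32b": (0.33, 53),
        "Qwen 72b": (0.71, 51),
        "Llama 90b": (8.50, 44),
        "Llama 11b": (0.21, 8),
        "Gemma 27b": (0.25, 22),
        "Mistral": (1.00, 3),
    }

    fig, ax = plt.subplots()
    for name, (cost, latency) in results.items():
        ax.scatter(cost, latency)
        ax.annotate(name, (cost, latency), textcoords="offset points", xytext=(5, 5))

    ax.set_xscale("log")
    ax.set_xlabel("cost ($ per 1000 pages)")
    ax.set_ylabel("latency (s per page)")
    plt.show()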
I've been consistently surprised by Gemini's OCR capabilities. And yeah, Qwen is climbing the vision ladder _fast_.
In my workflows I often have multiple models competing side by side, so I get to compare the same task executed on, say, 4o, Gemini, and Qwen, and I deal with a very wide range of vision-related tasks. The newest Qwen models are not only better overall than their previous release by a good margin, but also much more stable (less prone to glitching) and easier to fine-tune. I'm not at all surprised they're topping the OCR benchmark.
What bugs me though is OpenAI. Outside of OCR, 4o is still king in terms of overall understanding of images. But 4o is now almost a year old, and in all that time they have neither improved the vision performance in any newer releases, nor have they improved OCR. OpenAI's OCR has been bad for a long time, and it's both odd and annoying.
Taken with a grain of salt since again I've only had it in my workflow for about a week or two, but I'd say Qwen 2.5 VL 72b beats Gemini for general vision. That lands it in second place for me. And it can be run _locally_. That's nuts. I'm going to laugh if Qwen drops another iteration in a couple months that beats 4o.
I've been doing some experiments with the OCR API on macOS lately and wonder how it compares to these LLMs.
Overall, it's very impressive, but it makes some mistakes (on easy images, i.e. obviously wrong ones) that require human intervention.
I would like to compare it to these models, but this benchmark goes beyond OCR: it extracts structured JSON.
Tesseract can manage 99% accuracy on anything other than handwriting. Without being an LLM.
Is there an advantage of using an LLM here?
I'm really curious about this too! I don't know!
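For reference, the concrete thing Tesseract gives you for free is word-level boxes and confidences, which is exactly the error-checking affordance people were asking about upthread. A minimal sketch, assuming pytesseract and Pillow are installed and a placeholder image name:

    # Word-level text, confidence, and bounding boxes straight from Tesseract.
    import pytesseract
    from PIL import Image
    from pytesseract import Output

    data = pytesseract.image_to_data(Image.open("page.png"), output_type=Output.DICT)

    for text, conf, x, y, w, h in zip(
        data["text"], data["conf"], data["left"], data["top"],
        data["width"], data["height"],
    ):
        if text.strip() and float(conf) >= 0:  # conf == -1 marks layout-only rows
            print(f"{float(conf):5.1f}  ({x},{y},{w},{h})  {text}")

Whether the LLM's higher accuracy on messy inputs outweighs losing that built-in geometry is basically the question in this subthread.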
There are some comments I've run across saying Qwen2.5-VL is really good at handwriting recognition.
It'd also be interesting to see how Tesseract compares when trying to OCR more mixed text+graphic media. Some possible examples: high-design magazines with color backgrounds, TikTok posts, maps, cardboard hold-up signs at political gatherings.
I've been very impressed with Qwen in my testing, I think people are underestimating it
I wrote a small, client-side-JS-only app that does OCR and TTS on board game cards, so my friends and I can listen to someone read the cards' flavor text. On a few pages of text in total so far, Qwen has made zero mistakes. It's very impressive.
How does one configure an LLM interface using this to process multiple files with a single prompt?
Do you mean you want to process multiple files with a single LLM call or process multiple files using the same prompt across multiple LLM calls?
(I would recommend the latter)
Multiple files with a single LLM call.
I have a prompt which works for a single file in Copilot, but it's slower than opening the file myself, finding the one specific piece of information, re-saving it manually, running a .bat file to rename it with more of the information, and then filling in the last two bits when entering things.
It depends, what is your setup? You can always find more support on r/LocalLLaMA.
Using Copilot, and currently running jan.ai. /r/LocalLLaMA seems to tend towards the typical Reddit cesspool.
Let me rephrase:
What locally-hosted LLM would be suited to batch processing image files?
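Qwen2.5-VL (the 7b or 32b variants) behind LM Studio or a similar local server is a reasonable fit. A minimal sketch of the batch loop, assuming LM Studio's default OpenAI-compatible endpoint on localhost:1234, a placeholder model id, and a placeholder prompt standing in for whatever field you're pulling out of each file:

    # Same prompt over many image files, one call per file, against a local server.
    import base64
    import pathlib
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
    PROMPT = "Return only the <specific piece of information> from this document."  # placeholder

    def extract(path: pathlib.Path) -> str:
        b64 = base64.b64encode(path.read_bytes()).decode()
        resp = client.chat.completions.create(
            model="qwen2.5-vl-32b-instruct",  # whatever id your local server exposes
            messages=[{"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text", "text": PROMPT},
            ]}],
            temperature=0,
        )
        return resp.choices[0].message.content.strip()

    for img in sorted(pathlib.Path("scans").glob("*.png")):
        print(img.name, "=>", extract(img))  # or rename the file here instead

You can stack several image_url parts into one message if you really want a single call, but one file per call is usually easier to debug, which is probably why it was recommended above.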
Nice work Tyler and team!
Is there a reason Surya isn’t included?
What about MiniCPM-V 2.6?
News update: OCR company touts new benchmark that shows its own products are the most performant.
I searched for any link between OmniAI and Alibaba's Qwen, but I can't find any link. Do you know anything I don't know?
All of these models are open source (I think?). They could presumably build their work on any of these options. It behooves them to pick well. And establish some authority along the way.
The model with the best accuracy in the linked benchmark is "OmniAI" (OP's company) which looks like a paid model, not open source [1].
[1]: https://getomni.ai/pricing
Someone should try to reproduce this and post the results here. I can't; my PC is about 15 years old. :(
(It is not a joke.)
Reproducing the whole benchmark would be expensive; OmniAI starts at $250/month.
Generally running the whole benchmark is ~$200, since all the providers cost money. But if anyone wants to specifically benchmark Omni just drop us a note and we'll make the credits available.
So not all of them are local and open source? Ugh.
I don't see why you couldn't run any of those locally if you buy the right hardware?
I haven't checked myself, so I'm not sure, others might be able to provide the answer though.
If they (all of the mentioned ones) are open source and can be run locally, then most likely, yes.
From what I remember, they are all local and open source, so the answer is yes, if I am correct.
Mistral OCR is closed source.
Thanks!
To be fair, they didn't include themselves at all in the graph.
They did. It’s in the #1 spot
Update: looks like they removed themselves from the graph since I saw it earlier today!
Yup, they did.
The beauty of version control: https://github.com/getomni-ai/benchmark/commit/0544e2a439423...