As I mentioned yesterday - I recently needed to process hundreds of low-quality images of invoices (for a construction project). I had a script that used PIL/OpenCV, pytesseract, and OpenAI as a fallback. It still had a staggering number of failures.
Today I tried a handful of the really poor-quality invoices and Qwen spat out all the information I needed without an issue. What's crazier is that it gave me the bounding boxes to improve tesseract.
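For anyone curious what that kind of fallback chain can look like, here is a minimal sketch; the preprocessing steps, confidence threshold, and the VLM hook are my assumptions, not the original script:

```python
# Minimal sketch of a "pytesseract first, VLM as fallback" pipeline.
# The preprocessing, the confidence threshold, and the VLM stub are
# illustrative assumptions, not the poster's actual script.
import cv2
import pytesseract
from pytesseract import Output

def ocr_with_fallback(path: str, min_conf: float = 60.0) -> str:
    # Basic cleanup: grayscale + Otsu threshold often helps low-quality scans.
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    _, binarized = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Ask tesseract for per-word confidences so we can decide whether to bail out.
    data = pytesseract.image_to_data(binarized, output_type=Output.DICT)
    confs = [float(c) for c in data["conf"] if float(c) >= 0]
    mean_conf = sum(confs) / len(confs) if confs else 0.0

    if mean_conf >= min_conf:
        return pytesseract.image_to_string(binarized)

    # Low confidence: hand the original image to a vision-language model instead.
    return query_vlm(path)

def query_vlm(path: str) -> str:
    # Stub - wire this to your VLM of choice (OpenAI, Qwen, a local server, ...).
    raise NotImplementedError("connect this to your VLM endpoint")
```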
So where did you load up Qwen, and how did you supply the PDF or photo files? I don't know how to use these models, but I want to learn.
LM Studio[0] is the best "i'm new here and what is this!?" tool for dipping your toes in the water.
If the model supports "vision" or "sound", that tool makes it relatively painless to take your input file + text and feed it to the model.
[0]: https://lmstudio.ai/
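If you'd rather script it than click around, LM Studio also exposes a local OpenAI-compatible server (it listens on port 1234 by default), and vision models accept images as base64 data URLs. A rough sketch; the model identifier and prompt are placeholders, use whatever your loaded model reports:

```python
# Rough sketch: querying a vision model served by LM Studio's local
# OpenAI-compatible endpoint. Port 1234 is LM Studio's default; the model
# name below is a placeholder - use the identifier your loaded model shows.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

with open("invoice.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen3-vl",  # placeholder identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the invoice number, date, and total."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```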
Thank you! I will give it a try and see if I can get that 4090 working a bit.
Interesting. I have in the past tried to get bounding boxes of property boundaries on satellite maps estimated by VLMs, but had no success. Do you have any tips on how to improve the results?
Gemini has purpose-built post-training for bounding boxes, if you haven't tried it.
The latest update on Gemini live does real time bounding boxes on objects it's talking about, it's pretty neat.
With Qwen I went as stupid as I could: please provide the bounding box metadata for pytesseract for the above image.
And it spat it out.
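In case it helps anyone reproduce this: once you have boxes like that, the wiring into pytesseract is just cropping and re-OCRing each region. A sketch, assuming the model returned pixel-space [x1, y1, x2, y2] boxes (some VLMs return coordinates normalized to a 0-1000 grid, so rescale if needed); the box values below are made up:

```python
# Sketch: crop the regions a VLM returned and OCR each crop with tesseract.
# Assumes pixel-space [x1, y1, x2, y2] boxes; if your model returns a
# 0-1000 normalized grid, rescale to the image size first.
from PIL import Image
import pytesseract

boxes = [
    {"label": "invoice_number", "bbox": [812, 64, 1045, 110]},   # example values
    {"label": "total",          "bbox": [690, 1320, 1040, 1378]},
]

img = Image.open("invoice.jpg")
for item in boxes:
    x1, y1, x2, y2 = item["bbox"]
    crop = img.crop((x1, y1, x2, y2))
    # --psm 7 treats the crop as a single text line, which suits field snippets.
    text = pytesseract.image_to_string(crop, config="--psm 7").strip()
    print(item["label"], "->", text)
```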
It’s funny that many of us say please. I don’t think it impacts the output, but it also feels wrong without it sometimes.
Depends on the model, but e.g. [1] found many models perform better if you are more polite. Though interestingly being rude can also sometimes improve performance at the cost of higher bias
Intuitively it makes sense. The best sources tend to be either of moderately high politeness (professional language) or 4chan-like (rude, biased but honest)
1: https://arxiv.org/pdf/2402.14531
When I want an LLM to be brief, I will say things like "be brief", "don't ramble", etc.
When that fails, "shut the fuck up" always seems to do the trick.
I ripped into cursor today. It didn't change anything but I felt better lmao
Before GPT-5 was released, I already had the feeling that the web UI responses were declining, so I started trying to get more out of them. Dissing it and saying how useless its response was did actually improve the output (I think).
Do you have some example images and the prompt you tried?
Also a documented stack setup, if you could.
I've tried that too, detecting the scan layout to get better OCR, but it didn't really beat a fine-tuned Qwen 2.5 VL 7B. I'd say fine-tuning is the way to go.
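For anyone who wants to go down the fine-tuning road, the basic LoRA wiring with transformers + PEFT looks roughly like this; the target modules and ranks are common defaults rather than tuned values, it assumes a transformers version that ships the Qwen2.5-VL classes, and the data prep / training loop are left out:

```python
# Rough sketch of LoRA setup for a Qwen2.5-VL checkpoint with transformers + PEFT.
# Requires a recent transformers release that provides Qwen2_5_VLForConditionalGeneration;
# ranks and target modules are common defaults, not tuned values.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# From here: build (image, prompt, target) examples with the processor and
# train with your own loop or something like TRL's SFTTrainer.
```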
The Chinese are doing what they have been doing to the manufacturing industry as well: take the core technology and just optimize, optimize, optimize for 10x the cost-efficiency. As simple as that. Super impressive. These models might be benchmaxxed, but as another comment said, I see so many benchmarks that it might as well be the most impressive benchmaxxing today, if not just a genuinely SOTA open-source model. They even released a closed-source 1-trillion-parameter model today that is sitting at No. 3 (!) on LM Arena. Even their 80B model is 17th; gpt-oss 120b is 52nd. https://qwen.ai/blog?id=241398b9cd6353de490b0f82806c7848c5d2...
They still suck at explaining which model they serve is which, though.
They also released Qwen3-VL Plus [1] today, alongside Qwen3-VL 235B [2], and they don't tell us which one is better. Note that Qwen3-VL-Plus is a very different model from Qwen-VL-Plus.
Also, qwen-plus-2025-09-11 [3] vs qwen3-235b-a22b-instruct-2507 [4]. What's the difference? Which one is better? Who knows.
You know it's bad when OpenAI has a clearer naming scheme.
[1] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
[2] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
[3] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
[4] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
Eh, I mean, innovation often comes just from letting a lot of fragmented, small teams of cracked nerds try stuff out. It's way too early in the game. I mean, Qwen's release statements have anime, etc. IBM, Bell, Google, Dell, and many others did it similarly, letting small focused teams take many attempts at cracking the same problem. All modern quant firms do basically the same. Anthropic is actually an exception, more like Apple.
> Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency. As simple as that. Super impressive.
This "just" is incorrect.
The Qwen team invented things like DeepStack https://arxiv.org/abs/2406.04334
(Also, I hate this "The Chinese" thing. Do we say "The British" if it came from a DeepMind team in the UK? Or what if there are Chinese-born US citizens working in Paris for Mistral?
Give credit to the Qwen team rather than a whole country. China has both great labs and mediocre labs, just like the rest of the world.)
> Do we say "The British"
Yes.
The Americans do that all the time. :P
Yeah it's just weird Orientalism all over again
> Also I hate this "The Chinese" thing
to me it was positive assessment, I adore their craftsmanship and persistence in moving forward for long period of time.
It erases the individuals doing the actual research by viewing Chinese people as a monolith.
If you're in SF, you don't want to miss this. The Qwen team is making their first public appearance in the United States, with the VP of Qwen Lab speaking at the meetup below during SF Tech Week. https://partiful.com/e/P7E418jd6Ti6hA40H6Qm A rare opportunity to directly engage with Qwen team members.
Let’s hope they’re allowed in the country and get a visa… it’s 50/50 these days
Registration full :-(
Sadly it still fails the "extra limb" test.
I have a few images of animals with an extra limb photoshopped onto them: a dog with a leg coming out of its stomach, or a cat with two front right legs.
Like every other model I have tested, it insists that the animals have their anatomically correct number of limbs. Even when I point out there is a leg coming from the dog's stomach, it will push back and insist I am confused, insist it counted again and there are definitely only four. Qwen took it a step further: even after I told it the image was edited, it told me it wasn't and there were only four limbs.
It fails on any edge case, like all other VLMs. The last time a vision model succeeded at reading analog clocks, a notoriously difficult task, it was revealed they had trained on nearly 1 million artificial clock images[0] to make it work. In a similar vein, I have encountered no model that could read, for example, a D20 correctly.[1]
It could probably identify extra limbs in your pictures if you too made a million example images to train it on, but until then it will keep failing. And of course you'll get to keep making millions more example images for every other issue you run into.
[0] https://huggingface.co/datasets/allenai/pixmo-clocks
[1] https://files.catbox.moe/ocbr35.jpg
I wonder if you used their image editing feature if it would insist on “correcting” the number of limbs even if you asked for unrelated changes.
Definitely not a good model for accurately counting limbs on mutant species, then. Might be good at other things that have greater representation in the training set.
Can't seem to connect to qwen.ai with DNSSEC enabled
> resolvectl query qwen.ai
> qwen.ai: resolve call failed: DNSSEC validation failed: no-signature
And
https://dnsviz.net/d/qwen.ai/dnssec/ shows
aliyunga0019.com/DNSKEY: No response was received from the server over UDP (tried 4 times). See RFC 1035, Sec. 4.2. (8.129.152.246, UDP_-_EDNS0_512_D_KN)
The biggest takeaway is that they claim SOTA for multi-modal stuff, even ahead of proprietary models, and still released it as open weights. My first tests suggest this might actually be true; will continue testing. Wow.
Most multi-modal input implementations suck, and a lot of them suck big time.
Doesn't seem to be far ahead of existing proprietary implementations. But it's still good that someone's willing to push that far and release the results. Getting multimodal input to work even this well is not at all easy.
I feel like most Open Source releases regardless of size claim to be similar in output quality to SOTA closed source stuff.
So the 235B-parameter Qwen3-VL ships in FP16, so practically it requires at least 512 GB of RAM to run? Possibly even more for a reasonable context window?
Assuming I don’t want to run it on a CPU, what are my options to run it at home under $10k?
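Back-of-envelope for the weights alone (KV cache, activations, and a long context come on top; and note it's an A22B MoE, so only ~22B parameters are active per token, but all 235B still have to sit in memory):

```python
# Back-of-envelope weight memory for a 235B-parameter model at various precisions.
# Weights only - KV cache, activations, and framework overhead come on top.
params = 235e9

for name, bytes_per_param in [("FP16/BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name:9s} ~{gb:,.0f} GB")

# FP16/BF16 ~470 GB  -> hence the ">= 512 GB RAM" estimate
# INT8      ~235 GB
# INT4      ~118 GB  -> roughly what a 128 GB unified-memory box can hold, barely
```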
China is winning the hearts of developers in this race so far. At least, they won mine already.
Arguably they've already won. Check the names at the top the next time you see a paper from an American company; a lot of them are Chinese.
you can’t tell if someone is American or Chinese by looking at their name
I actually claim something even stronger, which is it’s what’s in your heart that really determines if you’re American :)
Cute but the US president is currently on a mass deportation campaign, so it appears what's in peoples' hearts doesn't really matter.
They don't have to ever make a profit, so the game they are playing is a bit different.
OpenAI wasn't founded to be profit-driven either. It is sad to see where they are now.
When you are not owned by the chosen people, you can actually focus on innovation to improve humanity. Who would have known!
so.. why do you think they are trying this hard to win your heart?
They might have dozens of reasons, but they already did what they did.
Some of the reasons could be:
- mitigation of US AI supremacy
- commodify AI use to push innovation forward and sell the platforms to run it, e.g. if the iPhone wins local intelligence, it benefits China, because China manufactures those phones
- talent war inside China
- soften the sentiment against China in the US
- they're just awesome people
- and many more
I can see how it would be in China's interest to make sure there was an LLM that produced cutting edge performance in Chinese-language conversations.
And some uses of LLMs are intensely political; think of a student using an LLM to learn about the causes of the civil war. I can understand a country wanting their own LLMs for the same reason they write their own history textbooks.
By releasing the weights they can get free volunteer help, win hearts and minds with their open approach, weaken foreign corporations, give their citizens robust performance in their native language, and exercise narrative control - all at the same time.
I don't think they care about winning hearts exactly, but I do think they (correctly) realize that LLMs are racing rapidly toward being commoditized, and they are still going to be way ahead of us on manufacturing the hardware to run them on.
Watching the US stock market implode from a bubble built by investors over here who don't realize this is happening will be a nice bonus for them, I guess, and constantly shipping open SOTA models will speed that along.
they aren’t even trying hard, it’s just that no one else is trying
Maybe they just want to see one of the biggest stock bubble pops of all time in the US.
Surprising this is the first time I’ve seen anyone say this out loud.
Because it doesn’t make sense. The reason there’s a bubble is investor belief that AI will unlock tons of value. The reason the bubble is concentrated in silicon and model providers is because investors believe they have the most leverage to monetize this new value in the short term.
If all of that stuff becomes free, the money will just move a few layers up to all of the companies whose cost structure has suddenly been cut dramatically.
There is no commoditization of expensive technology that results in a net loss of market value. It just moves around.
But global spend on AI is still measured in the tens of billions? Tiny in the grand scheme of things. So what 'money' is moving up? Not revenue, and in the case of a bubble bursting, not speculative capital.
I know I do
China has been creating high quality cultural artifacts for thousands of years.
Thank you, Qwen team, for your generosity. I'm already using their thinking model to build some cool workflows that help with boring tasks within my org.
https://openrouter.ai/qwen/qwen3-235b-a22b-thinking-2507
Now with this I will use it to identify and caption meal pictures and user pictures for other workflows. Very cool!
Models:
- https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Thinking
- https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct
The open source models are no longer catching up. They are leading now.
I spent a little time with the thinking model today. It's good. It's not better than GPT-5 Pro. It might be better than the smallest GPT-5, though.
My current go-to test is to ask the LLM to construct a charging solution for my MacBook Pro with the model on it, but sadly, I and the Pro have been sent to 15th-century Florence with no money and no charger. I explain I only have two to three hours of inference time, which can be spread out, but in that time I need to construct a working charging solution.
So far GPT-5 Pro has been by far the best, not just in its electrical specifications (drawings of a commutator): it generated instructions for jewelers and blacksmiths in what it claims is 15th-century Florentine Italian, and furnished a year-by-year set of events with trading/banking predictions, a short rundown of how to get to the right folks in the Medici family... it was comprehensive.
Generally models suggest building an alternating-current setup, then rectifying to 5V DC and trickle charging over the USB-C pins that allow it. There's a lot of variation in how they suggest we get to DC power, and oftentimes not a lot of help on key questions, like, say, "how do I know I don't have too much voltage using only 15th-century tools?"
Qwen3-VL is a mixed bag. It's the only model other than GPT-5 I've talked to that suggested building a voltaic pile, estimated the voltage generated by the number of plates, gave me some tests to check voltage (lick a lemon, touch your tongue: mild tingling, good; strong tingling, remove a few plates), and was overall helpful.
On the other hand, its money-making strategy was laughable: predicting Halley's comet and, in exchange, demanding a workshop and 20 copper pennies from the Medicis.
Anyway, interesting showing, definitely real, and definitely useful.
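For reference on the voltaic-pile math: a zinc/copper cell with a brine-soaked separator gives very roughly 0.8-1.0 V under light load, so reaching the ~5 V needed for USB default-power trickle charging takes about six cells in series. A worked guess, with the per-cell voltage being the big assumption:

```python
# Rough estimate of how many zinc/copper brine cells a voltaic pile needs to
# reach USB-level voltage. The ~0.9 V per-cell figure under load is a loose
# assumption; real piles sag with current draw and electrode condition.
import math

volts_per_cell = 0.9     # assumed practical Zn/Cu brine cell under light load
target_voltage = 5.0     # USB default power (the trickle-charge case)

cells_needed = math.ceil(target_voltage / volts_per_cell)
print(f"~{cells_needed} cells in series for ~{target_voltage} V")   # ~6 cells

# Checking that you're not over voltage with 15th-century tools is the hard
# part; the model's "tongue test" is a crude ordinal check, not a measurement.
```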
Funny enough, I did a little bit of ChatGPT-assisted research into a loosely similar scenario not too long ago. LPT: if you happen to know in advance that you'll be in Renaissance Florence, make sure to pack as many synthetic diamonds as you can afford.
> predicting Halley's comet, and in exchange demanding a workshop and 20 copper pennies from the Medicis
I love this! Simple and probably effective (or would get you killed for witchcraft)
That is a freaking insanely cool answer from gpt5
I JUST had a very intense dream that there was a catastrophic event that set humanity back massively, to the point that the internet was nonexistent and our laptops suddenly became priceless. The first thought I had was absolutely hating myself for not bothering to download a local LLM. A local LLM at the level of qwen is enough to massively jump start civilization.
That has got to be the most benchmarks I've ever seen posted with an announcement. Kudos for not just cherrypicking a favorable set.
We should stop reporting saturated benchmarks.
Roughly 1/10 the cost of Opus 4.1 and 1/2 the cost of Sonnet 4 on a per-token inference basis. Impressive. I'd love to see a fast (Groq-style) version of this served. I wonder if the architecture is amenable.
Isn't it a 3x rate difference? $0.70 for Qwen3-VL vs $3 for Sonnet 4?
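The ratio depends on whether you compare input tokens, output tokens, or a blend. A quick sanity check on input-token list prices, using the Qwen figure quoted above and Anthropic's published rates (treat all of these as snapshot assumptions):

```python
# Quick sanity check on the price ratios using input-token list prices only
# (per million tokens). The Qwen figure is the one quoted in this thread;
# blended input+output ratios will differ, since output tokens cost more.
prices_in = {
    "Qwen3-VL (quoted above)": 0.70,
    "Claude Sonnet 4": 3.00,
    "Claude Opus 4.1": 15.00,
}

qwen = prices_in["Qwen3-VL (quoted above)"]
for name, p in prices_in.items():
    print(f"{name:24s} ${p:5.2f}/MTok  ->  {p / qwen:.1f}x Qwen")

# Sonnet 4 works out to ~4.3x and Opus 4.1 to ~21x on input tokens alone,
# so "1/10 of Opus, 1/2 of Sonnet" presumably refers to a different blend.
```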
Cerebras are hosting other Qwen models via OpenRouter, so probably
Incredible release! Qwen has been leading the open source vision models for a while now. Releasing a really big model is amazing for a lot of use cases.
I would love to see a comparison to the latest GLM model. I would also love to see no one use OS World ever again, it’s a deeply flawed benchmark.
Team Qwen keeps cooking! Qwen2.5-VL was already my preferred visual model for querying images; I'll look at upgrading if they release a smaller model we can run locally.
This model is literally amazing. Everyone should try to get their hands on an H100 and just call it a day.
How does it compare to Omni?
This demo is crazy: "At what time was the goal scored in this match, who scored it, and how was it scored?"
Imagine the demand for a 128GB/256GB/512GB unified memory stuffed hardware linux box shipping with Qwen models already up and running.
Although I'm agAInst steps towards AGI, it feels safer to have these things running locally and disconnected from each other than some giant GW cloud agentic data centers connected to everyone and everything.
I bought a GMKtec evo 2, which is a 128 GB unified-memory system. Strong recommend.
Cool! Pity they are not releasing a smaller A3B MoE model
https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct
Their A3B Omni paper mentions that the Omni at that size outperformed the (unreleased I guess) VL. Edit: I see now that there is no Omni-235B-A22B; disregard the following. ~~Which is interesting - I'd have expected the larger model to have more weights to "waste" on additional modalities and thus for the opposite to be true (or for the VL to outperform in both cases, or for both to benefit from knowledge transfer).~~
Relevant comparison is on page 15: https://arxiv.org/abs/2509.17765