I can't speak to the specifics, but I've always theorized that one of the problems was difficulty understanding periodic data (hands, banana bunches, piano keys), particularly in the context of a larger cohesive image (768×512 in the SD days).
Those kinds of structures tend to cause issues. One possible reason Flux got much better at handling them is simply that it's a much larger model: at around twelve billion parameters, it's roughly four times the size of its SDXL predecessors.
On a slightly related note, I actually added a test of deliberately four- and six-fingered hands to my comparison site, because hands had become such a solved problem that they turned into an interesting benchmark. It let me check whether models could effectively generate images outside the enormous bias in the training data toward hands with exactly five digits.
https://genai-showdown.specr.net/#count-tyrone-rugen
The same way text models improved.
Remind me what that was?
More trainable parameters, more data, higher quality data.