Benchmark to measure AI on graphic design tasks

(arxiv.org)

5 points | by purvanshi 5 hours ago

6 comments

$haonanz 5 hours ago

Single layout generation is hard. Generating template variants, i.e., multiple layouts that share a structure but differ in style, color, and content, is a completely different problem.
We tested style completion and recoloring on template families. Structural fidelity is high (position and area preservation near 100% for structural generation), but palette coverage lags at 77.6%. Worse, SSIM and LPIPS actively mislead: a structurally valid, style-consistent output scores lower than a hallucinated one that happens to agree more on pixels.
The takeaway is that pixel metrics are the wrong evaluation substrate for design. The field needs structure-aware metrics that operate on extracted primitives, such as bounding boxes, color tokens, and font properties, instead of raw pixels.
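To make "metrics on extracted primitives" concrete, here is a rough sketch of one possible palette coverage score: the fraction of reference palette colors that have a close match in the generated palette. This is an assumed definition for illustration, not necessarily the one used in the benchmark. The function names, the plain RGB distance, and the tolerance of 30 are all hypothetical choices.

```python
from math import dist


def hex_to_rgb(color):
    """Convert a '#rrggbb' color token into an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def palette_coverage(reference, generated, tolerance=30.0):
    """Fraction of reference colors matched by some generated color.

    A reference color counts as covered when at least one generated color
    lies within `tolerance` of it in RGB space (Euclidean distance).
    Both the distance and the tolerance are illustrative assumptions.
    """
    if not reference:
        return 1.0  # nothing to cover
    generated_rgb = [hex_to_rgb(c) for c in generated]
    covered = sum(
        1
        for ref in reference
        if any(dist(hex_to_rgb(ref), g) <= tolerance for g in generated_rgb)
    )
    return covered / len(reference)


reference_palette = ["#ff0000", "#00ff00", "#0000ff", "#ffffff"]
generated_palette = ["#fe0101", "#00ff00", "#fafafa"]
score = palette_coverage(reference_palette, generated_palette)
print(score)  # 0.75: red, green, and white are covered, blue is not
```

A score like this depends only on the extracted color tokens, so it does not change when the layout shifts by a few pixels. A perceptual color space would likely be a better choice than raw RGB in practice.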
$eladlica 5 hours ago

We asked frontier models to detect components in a graphic design layout. The best result: 6.4% mAP@0.5. For context, natural image detection benchmarks sit above 60%. We're an order of magnitude behind on a task every professional design tool does trivially. So while models can talk about design, they still struggle to locate it. In other words, these models lack the spatial grounding needed for reliable editing, completion, or structure-aware manipulation.
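For readers unfamiliar with the "@0.5" part: a predicted box only counts as a hit if its intersection-over-union (IoU) with a not-yet-matched ground-truth box is at least 0.5. The sketch below shows that matching step for a single class, assuming boxes are given as (x1, y1, x2, y2) tuples. The function names are hypothetical, and this is not the benchmark's evaluation code. A full mAP computation would additionally build a precision-recall curve per class and average over classes.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_true_positives(predictions, ground_truth, threshold=0.5):
    """Greedily match predictions to ground-truth boxes for one class.

    `predictions` is a list of (score, box) pairs. Higher-scoring
    predictions are matched first, and each ground-truth box can be
    matched at most once.
    """
    matched = set()
    true_positives = 0
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt_box in enumerate(ground_truth):
            if idx in matched:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= threshold:
            matched.add(best_idx)
            true_positives += 1
    return true_positives


ground_truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [
    (0.9, (0, 0, 10, 10)),     # exact match with the first box
    (0.8, (5, 0, 15, 10)),     # overlaps the first box, which is already taken
    (0.7, (21, 20, 30, 30)),   # IoU 0.9 with the second box
]
print(count_true_positives(predictions, ground_truth))  # 2
```

With this rule, a model that places a box in roughly the right region but with the wrong extent gets no credit, which is one reason localization scores can be far lower than a model's ability to describe a layout in words.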
$purvanshi 5 hours ago

[dead]
$whydoanything 5 hours ago

[dead]
$Jaejung 5 hours ago

[dead]
$adriennedeg 5 hours ago

[dead]