Research · June 17, 2026 · 7 min read

Introducing Design Crit: we taught AI to judge design like a designer.

Ten professional designers ranked four frontier image models across nine dimensions of real design work. The models can make the work. Nothing on the market could reliably judge it, until we trained on the right data.

  1. Ten designers ranked four image models on nine criteria, one rating per axis.
  2. No off-the-shelf judge beat 55% agreement. The best hit 54.3%; a human hits 74.1%.
  3. Scaling doesn't help: Qwen3-VL at 4B, 8B, and 32B all stall near 51–54%.
  4. One in ten finished designs carried a major hallucination.
  5. Trained on Design Crit, a model reaches 0.611 and ties a human on the hardest splits.
Contra Labs
Design Crit (Criteria-Resolved Image Taste), a Lica × Contra collaboration.

Everyone keeps talking about taste. But you can't improve what you can't measure. So we measured it. Design Crit is a dataset of ten professional designers ranking four frontier image models across nine dimensions of real design work. The models can produce the work. Nothing on the market can reliably judge it. The good news is that what they're missing can be learned.

Text-to-image models have matured from research demos into deployed design tools, shipping posters, social posts, UI mockups, and logos straight into production. But the preference data that trains and grades them was collected on photo-style, scene-based generation, where a single "which one is better?" verdict captures almost everything that matters, like sharpness and prompt alignment.

Designers, however, don't judge a design on a single characteristic, and design doesn't collapse into one axis. A graphic design can nail the spatial structure and butcher the color intent. Another can satisfy the brief and break the typographic hierarchy. Both get the same overall thumbs-up, for completely different reasons. The signal designers actually use lives in the dimensions a single label averages away.

So we built the missing layer. Design Crit (Criteria-Resolved Image Taste) is a designer-annotated preference dataset for AI-generated graphic design. It records one rating per design-quality criterion instead of one verdict per image. That lets a system score a design on every axis at once, make sharper calls than a single label allows, and in time learn to weigh those axes the way a designer would.

10 designers, 4 models, 9 ways to be wrong

We put four current text-to-image models head to head: FLUX.2 max, GPT Image 1.5, Nano Banana 2, and Seedream 5.0 Lite. All were shown to raters behind blind code names, so no one was anchoring on a brand.

Ten professional designers, recruited through Contra's network of creative experts, split into two cohorts of five. One cohort judged aesthetics, covering overall preference, mood and tone, visual hierarchy, color harmony, and typographic craft. The other judged description fidelity, covering overall preference, color accuracy, spatial accuracy, and whether the text the brief asked for was actually rendered.

One of the AI-generated graphic designs in the set. Designers rated work like this one axis at a time, not with a single overall verdict.

We narrowed the nine criteria from a longer list through pilot studies and interviews, building on Contra's Human Creativity Benchmark and keeping the axes designers consistently treated as separate.

Nine criteria, 80 prompts each, every prompt scored by five designers across four models, for 1,600 ratings per criterion. Designers ran all six pairwise comparisons per prompt, and we aggregated those into strict four-way rankings. On the two overall-preference tracks they also flagged every image for hallucination.
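The counts in that paragraph can be checked in a few lines, along with one plausible way to turn six pairwise picks into a strict four-way ranking. This is a minimal sketch under assumptions: the article does not say which aggregation rule was used, so win counting is a stand-in, and the code names `A` to `D` and the function `rank_from_pairwise` are hypothetical.

```python
from itertools import combinations

MODELS = ["A", "B", "C", "D"]  # hypothetical blind code names

# Four models give six unordered pairs per prompt.
PAIRS = list(combinations(MODELS, 2))

# 80 prompts x 5 designers x 4 models = 1,600 ratings per criterion.
RATINGS_PER_CRITERION = 80 * 5 * 4


def rank_from_pairwise(winners):
    """Turn one designer's six pairwise picks into a strict ranking.

    `winners` maps each pair from PAIRS to the preferred model.
    Assumption: rank by number of wins. A strict ranking exists only
    when the win counts are exactly 3, 2, 1, 0 (no preference cycle).
    """
    wins = {m: 0 for m in MODELS}
    for pair in PAIRS:
        wins[winners[pair]] += 1
    if sorted(wins.values()) != [0, 1, 2, 3]:
        raise ValueError("picks contain a cycle; no strict ranking exists")
    return sorted(MODELS, key=lambda m: -wins[m])


example_picks = {
    ("A", "B"): "B", ("A", "C"): "A", ("A", "D"): "A",
    ("B", "C"): "B", ("B", "D"): "B", ("C", "D"): "D",
}
print(rank_from_pairwise(example_picks))  # ['B', 'A', 'D', 'C']
```

Raising on cycles is a design choice for the sketch; a real pipeline would need some tie-breaking rule, which the article does not describe.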

These ratings come from working professionals, and they fill the part of the stack everyone training these models has been flying blind on.

Design is subjective, but real and consistent

A fair question is whether designers just don't agree, leaving no signal to learn. We tested that directly, checking whether designer agreement exceeds what random raters would produce, against exact null distributions.

They do agree. Designers agree on good design about as much as people agree on their favorite movie, and less than crowds agree on which photo is sharper. And the way they disagree is healthy. They share a rough sense of "good," with some personal variation on top. There are no rival camps with opposite taste. That's exactly the kind of pattern a model can learn.
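To make "exact null distribution" concrete: with only four items there are 24 possible strict rankings, so the chance distribution of a rank-agreement statistic can be enumerated rather than simulated. The sketch below does this for Kendall's τ between two rankings. It illustrates the idea only; the study's actual statistic over five raters and its test procedure are not described here, and the function names are assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, permutations


def kendall_tau(a, b):
    """Kendall's tau for two strict rankings; a[i] is the rank of item i."""
    pairs = list(combinations(range(len(a)), 2))
    score = 0
    for i, j in pairs:
        # +1 if both rankings order items i and j the same way, else -1.
        score += 1 if (a[i] - a[j]) * (b[i] - b[j]) > 0 else -1
    return Fraction(score, len(pairs))


def exact_null(n_items=4):
    """Distribution of tau when one ranking is uniformly random."""
    reference = tuple(range(n_items))
    counts = Counter(
        kendall_tau(reference, p) for p in permutations(range(n_items))
    )
    total = sum(counts.values())
    return {tau: Fraction(c, total) for tau, c in sorted(counts.items())}


def p_value(observed_tau, null):
    """Chance of seeing agreement at least this high under the null."""
    return sum(p for tau, p in null.items() if tau >= observed_tau)


null = exact_null(4)
print(p_value(Fraction(2, 3), null))  # 1/6
```

Two random raters reach τ of 2/3 or higher one time in six, which is why agreement has to be pooled across many prompts and raters before it says anything.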

Where design sits on the subjective-to-objective scale: more agreement than favorite-movie picks, less than judging which photo is sharper.

But that agreement isn't even across the dimensions. We measured how often designers landed on the same call for each criterion, and the gap is wide. The axes you can check against the brief draw the tightest agreement: whether the requested text actually rendered, whether the layout is right, whether the colors match what was asked for. The axes that come down to pure feel draw the least, with color harmony at the bottom.

The clearest read is the matched pairs. Designers agree far more on whether the right text rendered than on whether the type is well set. They agree more on whether the requested color appeared than on whether the colors sit well together. Same subject each time, the checkable version high, the felt version low. The signal is real on every axis. It just gets noisier the more the call comes down to taste.

Designer agreement by criterion (Kendall's τ). Checkable axes like text rendering rank highest; felt axes like color harmony rank lowest.

No off-the-shelf judge does much better than a coin flip

So the signal is real. The question is whether anything on the market can read it. We benchmarked nine pre-trained systems as design judges. Three were dedicated preference and aesthetic scorers (HPSv2.1, PickScore-v1, LAION-Aesthetic-V2) and six were open-weight vision-language models prompted to pick the better image.

Not one cleared 55% agreement with the five-designer majority. Chance is 50%. The best system, HPSv2.1, was trained on more than 640,000 human image comparisons, and it lands at 54.3%. LAION-Aesthetic-V2 actually scores below chance. A human designer agrees with the panel 74.1% of the time. Every machine judge sits in the dead zone just above a coin flip.
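The metric behind those percentages can be sketched as follows: for each image pair, take the side most of the five designers chose, then count how often the judge picked the same side. This is an illustration with made-up toy votes, not data from the study, and the function names are assumptions.

```python
from collections import Counter


def majority(votes):
    """Side chosen by most panelists (five binary votes never tie)."""
    return Counter(votes).most_common(1)[0][0]


def agreement_with_majority(judge_picks, panel_votes):
    """Fraction of pairs where the judge matches the panel majority."""
    hits = sum(
        pick == majority(votes)
        for pick, votes in zip(judge_picks, panel_votes)
    )
    return hits / len(judge_picks)


# Toy example: four image pairs, five designer votes each ("L" or "R").
toy_panel = [
    ["L", "L", "L", "R", "R"],
    ["R", "R", "R", "R", "L"],
    ["L", "L", "L", "L", "L"],
    ["R", "L", "R", "L", "R"],
]
toy_judge = ["L", "L", "L", "R"]
print(agreement_with_majority(toy_judge, toy_panel))  # 0.75
```

On a binary choice, a judge that picks at random lands near 0.5 on this metric, which is the chance line the benchmark compares against.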

Agreement with the five-designer majority. Every off-the-shelf judge clusters just above chance; a human designer sits far above at 74.1%.

You can't compute your way to taste

Scaling doesn't move the number. Qwen3-VL at 4B, 8B, and 32B all land between 51% and 54%. The reason is a trade-off. Bigger models carry less position bias, so their pick barely changes when you swap the order of the two images. They're more internally consistent. But that consistency buys no accuracy. On the calls they commit to, the bigger models are no better, and if anything a little worse. Smaller models lean on position more, yet the calls they do commit to are sharper. The more a model leans on position, the better it judges when it doesn't (Spearman ρ = +0.94), so the two effects cancel and the total never moves. The bottleneck is the data.
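The two quantities in that trade-off can be separated with an order-swap check: show each pair twice with the images in opposite order, call the judge consistent when it picks the same image both times, and score accuracy only on those consistent calls. The sketch below assumes that protocol; the record fields and toy values are hypothetical, not taken from the study.

```python
def position_bias_report(records):
    """Split a judge's behavior into consistency and committed accuracy.

    Each record holds the image the judge picked in the original order
    ("pick_ab"), in the swapped order ("pick_ba"), and the designer
    majority ("majority"). Field names are assumptions for illustration.
    """
    consistent = [r for r in records if r["pick_ab"] == r["pick_ba"]]
    consistency = len(consistent) / len(records)
    correct = sum(r["pick_ab"] == r["majority"] for r in consistent)
    committed_accuracy = correct / len(consistent) if consistent else 0.0
    return consistency, committed_accuracy


toy_records = [
    {"pick_ab": "img1", "pick_ba": "img1", "majority": "img1"},
    {"pick_ab": "img3", "pick_ba": "img4", "majority": "img3"},  # flipped with order
    {"pick_ab": "img5", "pick_ba": "img5", "majority": "img6"},
    {"pick_ab": "img7", "pick_ba": "img7", "majority": "img7"},
]
consistency, committed_accuracy = position_bias_report(toy_records)
print(consistency, round(committed_accuracy, 3))  # 0.75 0.667
```

A model can raise the first number without raising the second, which is the pattern the paragraph above describes for the larger checkpoints.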

Model size vs. agreement for Qwen3-VL at 4B, 8B, and 32B. Scaling the model leaves accuracy flat between 51% and 54%.

One in ten designs hallucinates something the prompt never asked for

Reading design isn't the only place the models slip. While they ranked the designs, the same designers flagged every image on the overall-preference tracks for hallucination, meaning elements that drifted from the brief or had nothing to do with it. Across 1,600 flags per cohort, about 55% came back clean, 35% minor, and 10% major. One in ten finished designs carried a major hallucination, something the prompt never asked for. These are the kinds of failures a designer catches at a glance and the model that made them does not.

Hallucination flags across 1,600 designs per cohort: 55% clean, 35% minor, 10% major.

Train on the data, and nearly half the gap disappears

Then comes the turn. We trained a deliberately modest model directly on Design Crit: a small pairwise-difference head on top of a frozen vision-language encoder, with no fine-tuning of the backbone.
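For readers unfamiliar with the setup, here is a minimal sketch of what a pairwise-difference head can look like. It assumes the simplest form: a linear layer applied to the difference of two frozen embeddings, trained with logistic loss. The actual head architecture, encoder, and training recipe are not specified in this article, and the two-dimensional embeddings below are hand-made toy values standing in for encoder outputs.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_head(pairs, dim, lr=0.5, epochs=200):
    """Fit weights w so that sigmoid(w . (emb_a - emb_b)) predicts the label.

    `pairs` holds (emb_a, emb_b, label), label 1 if designers preferred a.
    The embeddings are treated as fixed inputs: only w is updated, which
    mirrors keeping the encoder frozen.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for emb_a, emb_b, label in pairs:
            diff = [a - b for a, b in zip(emb_a, emb_b)]
            p = sigmoid(sum(wi * di for wi, di in zip(w, diff)))
            for i in range(dim):
                w[i] += lr * (label - p) * diff[i]  # logistic-loss gradient step
    return w


def prefers_a(w, emb_a, emb_b):
    """True if the head scores image a above image b."""
    return sum(wi * (a - b) for wi, a, b in zip(w, emb_a, emb_b)) > 0


# Toy stand-ins for frozen-encoder embeddings (not real data).
toy_pairs = [
    ([1.0, 0.2], [0.1, 0.3], 1),
    ([0.2, 0.9], [0.8, 0.1], 0),
    ([0.9, 0.5], [0.3, 0.6], 1),
    ([0.1, 0.4], [0.7, 0.2], 0),
]
w = train_head(toy_pairs, dim=2)
print(prefers_a(w, [0.8, 0.3], [0.2, 0.5]))  # True
```

Because the score depends only on the difference of the two embeddings, swapping the images flips the sign of the score, so a head of this form has no position bias by construction.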

It reaches 0.611 agreement with designers. That closes roughly 46% of the entire gap between a coin flip (0.500) and the human ceiling (0.741), and it's the first configuration in our sweep to clear the noise floor that standard regularization couldn't budge. The lesson from the benchmark holds in reverse. The signal was always there. It just had to be trained on the right data instead of borrowed from photo preference.
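The 46% figure is the share of the chance-to-human gap that the score covers. A short worked check, using only the numbers quoted above (the function name is ours):

```python
def gap_closed(score, chance=0.500, ceiling=0.741):
    """Fraction of the distance from chance to the human ceiling covered."""
    return (score - chance) / (ceiling - chance)


# (0.611 - 0.500) / (0.741 - 0.500) = 0.111 / 0.241
print(f"{gap_closed(0.611):.1%}")  # 46.1%
```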

A small model trained on Design Crit reaches 0.611, closing about 46% of the gap between chance and the human ceiling.

On the genuinely hard calls, it already ties a human

Roughly half of all pairwise comparisons are 3-2 splits, cases where the designers themselves are nearly evenly divided and even a perfect predictor is partly guessing. Those are the calls that actually test judgment.

On exactly those cases, a model trained on Design Crit scores 0.602 against a human ceiling of 0.600. When the panel splits three to two, even one designer only agrees with the majority three times in five, and the model now matches that. The gap to human agreement stays wide on the cases designers find easy. On the ones they find hardest, it shrinks to almost nothing.
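The 0.600 ceiling follows from simple counting. The sketch below assumes each designer is scored against the majority of the full five-person panel, their own vote included, which is consistent with the "three times in five" figure but is not spelled out in this article.

```python
def mean_agreement_with_majority(votes):
    """Average rate at which a panelist's vote matches the panel majority."""
    majority = max(set(votes), key=votes.count)
    return sum(v == majority for v in votes) / len(votes)


print(mean_agreement_with_majority(["a", "a", "a", "b", "b"]))  # 0.6 on a 3-2 split
print(mean_agreement_with_majority(["a", "a", "a", "a", "b"]))  # 0.8 on a 4-1 split
print(mean_agreement_with_majority(["a", "a", "a", "a", "a"]))  # 1.0 when unanimous
```

Under that assumption, no single rater can average better than 0.6 on 3-2 splits, so a model at 0.602 is at the human level for those cases.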

On the hardest 3-2 splits, the Design Crit model (0.602) matches the human ceiling (0.600).

Why it matters

Design Crit (Criteria-Resolved Image Taste) lets you build a decision layer for design generation. Its criterion-level structure means you can route between generators by what a job actually needs. Pick the model that's strongest on typography for a logo, or on spatial accuracy for a layout, rather than trusting one aggregate score. The same structure works as supervision for training preference judges and reward models that optimize for specific design dimensions rather than a blurry average.
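As a rough illustration of criterion-level routing, the sketch below picks a generator by weighting per-criterion scores according to what a job needs. Everything in it is hypothetical: the model names, the scores, the job types, and the weights are placeholders invented for the example, not results from the study.

```python
# Placeholder per-criterion scores for two unnamed generators (not real results).
CRITERION_SCORES = {
    "model_a": {"typographic_craft": 0.70, "spatial_accuracy": 0.40,
                "color_harmony": 0.50, "visual_hierarchy": 0.50},
    "model_b": {"typographic_craft": 0.45, "spatial_accuracy": 0.75,
                "color_harmony": 0.55, "visual_hierarchy": 0.60},
}

# Hypothetical weights describing which criteria matter for each job type.
JOB_WEIGHTS = {
    "logo": {"typographic_craft": 0.7, "color_harmony": 0.3},
    "layout": {"spatial_accuracy": 0.8, "visual_hierarchy": 0.2},
}


def route(job, scores=CRITERION_SCORES, weights=JOB_WEIGHTS):
    """Return the generator with the best weighted score for this job."""
    needs = weights[job]
    return max(
        scores,
        key=lambda m: sum(w * scores[m][c] for c, w in needs.items()),
    )


print(route("logo"))    # model_a
print(route("layout"))  # model_b
```

A single aggregate score would send both jobs to the same generator; the per-criterion view lets the two jobs go to different ones.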

The headline finding is blunt. AI can generate design, but it cannot yet reliably tell good design from bad, and no amount of scale fixes that on its own. The hopeful finding is that the missing signal, taste, is real and can be learned from expert data. That's the layer Contra's network is built to provide.

Our first Design Crit dataset, “TASTE,” is published on arXiv: arxiv.org/abs/2605.20731

Limitations

The sample is small. Each prompt was rated by five designers, enough to measure agreement and rule out noise but not enough to be sure of any single comparison. Each criterion used its own set of 80 prompts, so no design was ever rated on two criteria at once. That kept each rating clean, but it means we can't see how one designer weighs color against typography on the same design, because no one judged the same design on both. Every prompt was in English, so taste across languages isn't captured here. And the nine criteria cover a lot, but not everything. Accessibility, brand consistency, motion, and audience fit are all natural axes to add as the work grows.

Future research

The obvious next move is to run this at a wider scale, with more designers per prompt, more languages, and the criteria expanded. Rating the same designs on every axis might reveal how designers balance color against hierarchy or fidelity against feel, the trade-offs a single score hides.

So far we've shown the signal can be learned as a judge. The open question is whether it makes models better designers. Training a generator against these per-dimension rewards could push on typography or color directly, then show whether the work actually improves.


Connecting with the missing signal: taste

Contra connects top creative minds with AI teams training models to understand taste. This is expert input, not crowd labor. It's the creative layer powering the next generation of AI.

Designers

Writers

Marketers

Engineers

Social Media Experts

Video Editors & Animators

Music & Audio Engineers

1.5M+

creative experts

400+

Skills and tools represented

$250M+

verified expert earnings
