White Paper · April 14, 2026 · 6 min read

We tested 4 AI models with professional web designers. Claude won, but not the way you'd expect.

A structured design benchmark across four leading AI models: Claude Opus 4.6, Gemini 3.1 Pro Preview, ChatGPT 5.3 Codex, and Qwen 3.5. The winner shifted at every phase of the creative process.

Contra Labs
Research

We ran a structured design benchmark across four of the leading AI models: Anthropic's Claude Opus 4.6, Gemini 3.1 Pro Preview, OpenAI's ChatGPT 5.3 Codex, and Alibaba's Qwen 3.5. We tasked each with building landing pages from identical prompts, judged by real designers across three distinct phases of the design process.

The results weren't what we expected.

Four leading AI models, evaluated across every phase of the creative process by working designers.

How the creative process actually works

Before the numbers, some context. Most AI benchmarks test a single output at a single moment. We structured ours around how creative work actually unfolds, across three phases.

Ideation is the generative, open-ended stage. The goal is direction, not precision. Think brainstorming, concept exploration, and early layout ideas. The stakes of any individual output are low, breadth matters more than accuracy, and it's where most creatives feel most comfortable handing work to AI.

Mockup is where a chosen direction gets translated into something structured: a composed layout, a functional prototype. Professional conventions and medium-specific rules start to constrain the work. Accuracy begins to matter.

Refinement is the precision stage. A near-final artifact gets edited toward production readiness, where small adjustments carry outsized consequences. This is where domain expertise exerts the most pressure, and where AI-generated content is most likely to be discarded or heavily reworked.

Each model was evaluated at all three stages. The results looked very different depending on where you looked.

Claude dominated ideation, by a lot

In the ideation phase, Claude Opus 4.6 posted a 79.8% win rate, a commanding lead over every other model tested. Creative experts consistently rated its outputs higher on usability, visual cohesion, and prompt alignment.

Claude posted a 79.8% win rate in ideation against Gemini, ChatGPT, and Qwen.

But the number doesn't tell the whole story. What stood out wasn't any single design decision. It was the feeling of the outputs.

Without coordination, the creative experts kept reaching for the same words: “intentional,” “considered,” “cohesive,” “polished.”

Claude's work felt designed, not generated.

The creative experts agreed.

“The Claude version stood out because the sections felt more intentional and not just dropped into generic blocks. The layout and visual choices felt more considered overall.”
Blake Steven, creative director

“Claude did incredible at understanding the use case of the site and then also taking the subject of the site and pulling it into the design—hierarchy of not only the page but the sections.”
Afton Negrea, brand identity and UI design

“The design looks intentionally created rather than autogenerated with bugs.”
Alex Karpodinis, UI/UX and Framer designer

Then the rules changed

When the task moved from open-ended ideation into structured mockup work, where adhering to design system specifications actually matters, Gemini took over.

Gemini posted a 68.9% win rate in the mockup phase. Claude dropped to second place.

Gemini won 68.9% in mockup; Claude narrowed the gap to 60% in refinement. The winner shifted at every phase.

By the refinement phase, Claude had narrowed Gemini's lead to a 60% win rate, but the pattern was clear: Claude's free-form creative strength doesn't translate as well when constrained by rigid design specifications.


What this actually means

There's a tempting takeaway here: Claude is the best AI for design. But that's not what the data shows.

The more accurate read is this: different models excel at different phases of the creative process. Claude's holistic, craft-driven outputs shine when the brief is open. When precision and constraint take over, other models close the gap fast.

For real-world creatives building AI-assisted design workflows, that's a meaningful operational insight: model selection should follow the task, not the hype.

What's coming

The full whitepaper, with complete data across all three phases of the creative workflow, all four models, and the methodology behind the benchmark, is coming soon.

The full Human Creativity Benchmark white paper is coming soon.

Connecting with the missing signal: taste

Contra connects top creative minds with AI teams training models to understand taste. This is expert input, not crowd labor. It's the creative layer powering the next generation of AI.

Designers

Writers

Marketers

Engineers

Social Media Experts

Video Editors & Animators

Music & Audio Engineers

1.5M+

creative experts

400+

skills and tools represented

$250M+

verified expert earnings
