White Paper · April 14, 2026 · 6 min read

We tested 4 AI models with professional web designers. Claude won, but not the way you'd expect.

Claude Opus 4.6, Gemini 3.1 Pro, ChatGPT 5.3 Codex, Qwen 3.5. The winner shifted at every phase.

  1. Claude won ideation: 79.8% win rate. Led on usability, visual cohesion, prompt alignment.
  2. Gemini took mockup at 68.9%. Beat Claude on structured design-system work.
  3. Claude narrowed the gap to 60% at refinement.
  4. The winner shifted at every phase. Model selection should follow the task.
Contra Labs
Research

We ran a structured design benchmark across four leading AI models: Anthropic's Claude Opus 4.6, Google's Gemini 3.1 Pro Preview, OpenAI's ChatGPT 5.3 Codex, and Alibaba's Qwen 3.5. Each model built landing pages from identical prompts, and professional designers judged the results across three distinct phases of the design process.

The results weren't what we expected.


How the creative process actually works

Before the numbers, some context. Most AI benchmarks test a single output at a single moment. We structured ours around how creative work actually unfolds, across three phases.

Ideation is the generative, open-ended stage. The goal is direction, not precision. Think brainstorming, concept exploration, and early layout ideas. The stakes of any individual output are low, breadth matters more than accuracy, and it's where most creatives feel most comfortable handing work to AI.

Mockup is where a chosen direction gets translated into something structured: a composed layout, a functional prototype. Professional conventions and medium-specific rules start to constrain the work. Accuracy begins to matter.

Refinement is the precision stage. A near-final artifact gets edited toward production readiness, where small adjustments carry outsized consequences. This is where domain expertise exerts the most pressure, and where AI-generated content is most likely to be discarded or heavily reworked.

Each model was evaluated at all three stages. The results looked very different depending on where you looked.

Claude dominated ideation, by a lot

In the ideation phase, Claude Opus 4.6 posted a 79.8% win rate, a commanding lead over every other model tested. Creative experts consistently rated its outputs higher on usability, visual cohesion, and prompt alignment.
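
The benchmark's methodology has not been published yet, so the following is only a minimal sketch of how a win rate like this could be computed, assuming each designer judgment is a head-to-head comparison with a single winner. The `Judgment` record, the `win_rate` function, and the sample data are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One hypothetical head-to-head comparison made by a designer."""
    model_a: str
    model_b: str
    winner: str  # expected to equal model_a or model_b


def win_rate(judgments, model):
    """Share of the comparisons involving `model` that `model` won."""
    appearances = 0
    wins = 0
    for j in judgments:
        if model in (j.model_a, j.model_b):
            appearances += 1
            if j.winner == model:
                wins += 1
    if appearances == 0:
        raise ValueError(f"no judgments involve {model!r}")
    return wins / appearances


# Illustrative data only; these are NOT results from the benchmark.
sample = [
    Judgment("claude", "gemini", "claude"),
    Judgment("claude", "chatgpt", "claude"),
    Judgment("claude", "qwen", "claude"),
    Judgment("gemini", "claude", "gemini"),
    Judgment("gemini", "qwen", "gemini"),
]

print(f"claude: {win_rate(sample, 'claude'):.1%}")
print(f"gemini: {win_rate(sample, 'gemini'):.1%}")
```

Under this assumption, a win rate counts only the comparisons a model took part in, so two models' rates need not sum to 100%.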


But the number doesn't tell the whole story. What stood out wasn't any single design decision. It was the feeling of the outputs.

Without coordination, the creative experts kept reaching for the same words: “intentional,” “considered,” “cohesive,” “polished.”

Claude's work felt designed, not generated.

The creative experts agreed.

“The Claude version stood out because the sections felt more intentional and not just dropped into generic blocks. The layout and visual choices felt more considered overall.”
Blake Steven, creative director

“Claude did incredible at understanding the use case of the site and then also taking the subject of the site and pulling it into the design: hierarchy of not only the page but the sections.”
Afton Negrea, brand identity and UI design

“The design looks intentionally created rather than autogenerated with bugs.”
Alex Karpodinis, UI/UX and Framer designer

Then the rules changed

When the task moved from open-ended ideation into structured mockup work, where adhering to design system specifications actually matters, Gemini took over.

Gemini posted a 68.9% win rate in the mockup phase. Claude dropped to second place.


By the refinement phase, Claude had narrowed the gap to 60%, but the pattern was clear: Claude's free-form creative strength doesn't translate as well when constrained by rigid design specifications.


What this actually means

There's a tempting takeaway here: Claude is the best AI for design. But that's not what the data shows.

The more accurate read is this: different models excel at different phases of the creative process. Claude's holistic, craft-driven outputs shine when the brief is open. When precision and constraint take over, other models close the gap fast, and in the mockup phase Gemini pulled ahead.

For real-world creatives building AI-assisted design workflows, that's a meaningful operational insight: model selection should follow the task, not the hype.
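
As a rough illustration of letting model selection follow the task, the sketch below picks a model per phase from a table of win rates. The `pick_model` helper and the table layout are hypothetical. The table holds only the two phase leaders the article reports; refinement is left out because the article does not clearly name its leader.

```python
# Win rates as reported in the article (phase leaders only).
# Refinement is omitted: the article does not clearly state who led it.
REPORTED_WIN_RATES = {
    "ideation": {"Claude Opus 4.6": 0.798},
    "mockup": {"Gemini 3.1 Pro": 0.689},
}


def pick_model(phase, win_rates, default=None):
    """Return the model with the highest win rate for `phase`, else `default`."""
    candidates = win_rates.get(phase)
    if not candidates:
        return default
    return max(candidates, key=candidates.get)


for phase in ("ideation", "mockup", "refinement"):
    choice = pick_model(phase, REPORTED_WIN_RATES, default="undecided")
    print(f"{phase} -> {choice}")
```

Keeping the table separate from the selection logic means the mapping can be updated once the full per-phase data is published, without touching the helper.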

What's coming

The full Human Creativity Benchmark white paper, with complete data across all three phases of the creative workflow, all four models, and the methodology behind the benchmark, is coming soon.

Connecting with the missing signal: taste

Contra connects top creative minds with AI teams training models to understand taste. This is expert input, not crowd labor. It's the creative layer powering the next generation of AI.

Designers

Writers

Marketers

Engineers

Social Media Experts

Video Editors & Animators

Music & Audio Engineers

1.5M+

creative experts

400+

Skills and tools represented

$250M+

verified expert earnings
