We ran a structured design benchmark across four leading AI models: Anthropic's Claude Opus 4.6, Google's Gemini 3.1 Pro Preview, OpenAI's ChatGPT 5.3 Codex, and Alibaba's Qwen 3.5. We tasked each with building landing pages from identical prompts, then had real designers judge the outputs across three distinct phases of the design process.
The results weren't what we expected.

How the creative process actually works
Before the numbers, some context. Most AI benchmarks test a single output at a single moment. We structured ours around how creative work actually unfolds, across three phases.
Ideation is the generative, open-ended stage. The goal is direction, not precision. Think brainstorming, concept exploration, and early layout ideas. The stakes of any individual output are low, breadth matters more than accuracy, and it's where most creatives feel most comfortable handing work to AI.
Mockup is where a chosen direction gets translated into something structured: a composed layout, a functional prototype. Professional conventions and medium-specific rules start to constrain the work. Accuracy begins to matter.
Refinement is the precision stage. A near-final artifact gets edited toward production readiness, where small adjustments carry outsized consequences. This is where domain expertise exerts the most pressure, and where AI-generated content is most likely to be discarded or heavily reworked.
Each model was evaluated at all three stages. The results looked very different depending on where you looked.
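The full scoring methodology is coming in the whitepaper. As a rough illustration of the core metric, here is one plausible way a phase-level win rate could be derived from pairwise designer judgments, sketched in Python; the matchup data and model labels below are hypothetical placeholders, not our actual results.

```python
# Minimal sketch: derive per-model win rates within one phase from
# pairwise expert judgments. Data and labels here are hypothetical.
from collections import defaultdict

# Each judgment records (phase, winning_model, losing_model).
judgments = [
    ("ideation", "claude", "gemini"),
    ("ideation", "claude", "qwen"),
    ("ideation", "gpt", "claude"),
    ("mockup", "gemini", "claude"),
    ("mockup", "gemini", "gpt"),
]

def win_rates(judgments, phase):
    """Return each model's share of pairwise matchups won within a phase."""
    wins, appearances = defaultdict(int), defaultdict(int)
    for p, winner, loser in judgments:
        if p != phase:
            continue
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return {m: wins[m] / appearances[m] for m in appearances}

print(win_rates(judgments, "ideation"))
# {'claude': 0.666..., 'gemini': 0.0, 'qwen': 0.0, 'gpt': 1.0}
```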
Claude dominated ideation, by a lot
In the ideation phase, Claude Opus 4.6 posted a 79.8% win rate, a commanding lead over every other model tested. Creative experts consistently rated its outputs higher on usability, visual cohesion, and prompt alignment.

But the number doesn't tell the whole story. What stood out wasn't any single design decision. It was the feeling of the outputs.
Without coordination, the creative experts kept reaching for the same words: “intentional,” “considered,” “cohesive,” “polished.”
Claude's work felt designed, not generated.
The creative experts agreed.
“The Claude version stood out because the sections felt more intentional and not just dropped into generic blocks. The layout and visual choices felt more considered overall.”
Blake Steven, creative director

“Claude did incredible at understanding the use case of the site and then also taking the subject of the site and pulling it into the design—hierarchy of not only the page but the sections.”
Afton Negrea, brand identity and UI design

“The design looks intentionally created rather than autogenerated with bugs.”
Alex Karpodinis, UI/UX and Framer designer
Then the rules changed
When the task moved from open-ended ideation into structured mockup work, where adhering to design system specifications actually matters, Gemini took over.
Gemini posted a 68.9% win rate in the mockup phase. Claude dropped to second place.

By the refinement phase, Claude had narrowed the gap, with Gemini's win rate falling to 60%, but the pattern was clear: Claude's free-form creative strength doesn't carry over as well once the work is constrained by rigid design specifications.

What this actually means
There's a tempting takeaway here: Claude is the best AI for design. But that's not what the data shows.
The more accurate read is this: different models excel at different phases of the creative process. Claude's holistic, craft-driven outputs shine when the brief is open. When precision and constraint take over, other models close the gap fast.
For real-world creatives building AI-assisted design workflows, that's a meaningful operational insight: model selection should follow the task, not the hype.
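If you're wiring that insight into a pipeline, it can be as literal as a phase-to-model routing table. A minimal sketch in Python, seeded with the phase winners above; the model identifiers are hypothetical placeholders, not real API strings, and the mapping is one team's judgment call, not a prescription.

```python
# Hypothetical phase-aware model routing for an AI-assisted design pipeline.
# Model identifiers are illustrative, not real API model strings.
PHASE_MODEL_MAP = {
    "ideation": "claude-opus",   # open-ended concepting favored Claude
    "mockup": "gemini-pro",      # spec-constrained layout favored Gemini
    "refinement": "gemini-pro",  # Gemini's lead narrowed here but held
}

def pick_model(phase: str) -> str:
    """Select a model for a design task based on its workflow phase."""
    try:
        return PHASE_MODEL_MAP[phase]
    except KeyError:
        raise ValueError(f"Unknown phase: {phase!r}") from None

assert pick_model("ideation") == "claude-opus"
```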
What's coming
The full whitepaper, with complete data across all three phases of the creative workflow, all four models, and the methodology behind the benchmark, is coming soon.