Benchmark · May 29, 2026 · 6 min read

Cursor took 60% of head-to-heads. Claude Code took 63% of client meetings.

Four coding tools, 24 outputs, five working designers. The tool designers preferred to look at and the tool they'd put their name on turned out to be different.

01 Cursor won the head-to-heads: 60% pairwise win rate, top Elo (1575).
02 Client readiness flipped it: Claude Code 63%, Antigravity 53%, Cursor 47%, Codex 30%.
03 Pairwise wins reward surface polish; client readiness rewards defensible work.
04 No tool led every format: Cursor took landing pages, Antigravity web apps.
05 Mobile was weakest: no tool cleared a 3.0 composite out of 5.

Contra Labs

Research

The SF Standard ran a headline: “'Engineer' is so 2025. In AI land, everyone's a 'builder' now.” Figma's own 2025 report found 56% of non-designers are now performing design-centric tasks. The conclusion seems obvious. The tools have caught up, the role is dissolving, anyone can ship beautiful interfaces.

The interfaces all look the same. Give the same prompt to any of these tools and what comes back is the same left-rail sidebar, the same rounded card grid, the same indigo accent that's become the unofficial flag of AI-generated interfaces. Designers have started calling it “product slop.” The tools give you the average of what they've seen because nothing in the input asks them not to.

Two takes on the same Patina brief. Left, an output one designer called “the clearest articulation of Patina's concept… the kind of design that makes you slow down rather than scroll past.” Right, one that drew the opposite verdict: “the typography is inconsistent, the border radiuses are inconsistent, the spacing and shadows are chaotic… it looks so template-ish.”

AI coding tools can produce design. The question is whether a designer would put their name on what comes back. We ran 24 outputs from Cursor, Claude Code, Codex, and Antigravity past five working designers in three rounds: a blind tournament, individual ratings, then the question that decides everything: would you present this to a client?

The answer split.

The split in one card. Cursor takes the tournament; Claude Code takes the client meeting.

Cursor won the head-to-head tournament with about a 60% overall pairwise win rate and the highest Elo rating at 1575. In a side-by-side comparison, designers picked Cursor more often than any other tool. Claude Code came second at ~50%, Antigravity third at ~45%, and Codex last at ~40%.

Overall pairwise win rate across head-to-head matchups per tool.
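For readers who want to reproduce this kind of leaderboard on their own data, here is a minimal sketch of how a pairwise win rate and an Elo rating can be derived from a list of matchups. The article does not describe its Elo settings, so the starting rating of 1500, the K-factor of 32, and the `(winner, loser)` input shape are all assumptions made for illustration, and the function names are hypothetical.

```python
from collections import defaultdict


def pairwise_win_rates(matchups):
    """Share of matchups each tool won.

    matchups: iterable of (winner, loser) tool names (assumed input shape).
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in matchups:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {tool: wins[tool] / games[tool] for tool in games}


def elo_ratings(matchups, k=32.0, start=1500.0):
    """Standard sequential Elo update; k and start are assumed defaults."""
    ratings = defaultdict(lambda: start)
    for winner, loser in matchups:
        # Probability the eventual winner was expected to win, given current ratings.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical example: one win between two unrated tools moves them
# from 1500 / 1500 to 1516 / 1484 with k=32.
example = elo_ratings([("tool_a", "tool_b")])
```

Sequential Elo depends on the order in which matchups are processed, so a published figure such as 1575 can only be matched exactly with the same ordering and settings.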

But when designers were asked whether they'd present the output to a client, Claude Code led at 63%. Antigravity followed at 53%. Cursor, the pairwise winner, dropped to 47%. And Codex landed at 30%. Designers declined to present it seven times out of ten.

The tool designers preferred to look at and the tool they'd trust with their name on it turned out to be different.

Client readiness flips the leaderboard. Claude Code drew the most yeses (63%), Cursor split the room nearly evenly (47%), and designers declined Codex seven times in ten.

What wins side-by-side vs. what wins the room

The gap between pairwise preference and client readiness tells you something about how designers actually evaluate work. In a tournament, they're comparing surfaces: which layout feels tighter, which color palette reads better, which typography has more presence. Client readiness seems to be an entirely different filter: designers weigh whether they can defend every decision in the output, and whether anything in it would cost them credibility.

Vaultra, the crypto-culture landing brief. Designers said this looked “bold and modern… it looks less like a typical crypto site” while also listing fixes: “I'd update the layout, typography, some of the colors and add icons; I don't like the arrow-style in the CTA.”

When designers explained why they ranked an output first, the reasons were dominated by usability and visual appeal. When they explained why they'd present something to a client, the reasons shifted toward layout, hierarchy, and whether the output felt like a real product rather than a template.

Designers scored outputs lower when they followed the brief but looked dated, and forgave trendy ones that drifted from it. Prompt adherence only got you in the door: outputs scoring below 2.5 on following the brief were called client-ready just 16% of the time. But once an output cleared that floor, how current it looked drove client readiness and perceived maturity more than fidelity to the brief did.
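The adherence floor described above is a conditional rate: split outputs at a prompt-adherence threshold and compare how often each group was called client-ready. The sketch below assumes a hypothetical record shape with an `adherence` score and a boolean `client_ready` verdict; whether the study computed this per rating or per output is not stated, so treat the unit of analysis as an assumption.

```python
def client_ready_rates(outputs, floor=2.5):
    """Return (rate below the floor, rate at or above it).

    outputs: dicts with 'adherence' (1-5 score) and 'client_ready' (bool).
    Field names are hypothetical; floor=2.5 is the threshold quoted in the text.
    """
    def rate(group):
        # None signals an empty group rather than a 0% rate.
        return sum(o["client_ready"] for o in group) / len(group) if group else None

    below = [o for o in outputs if o["adherence"] < floor]
    cleared = [o for o in outputs if o["adherence"] >= floor]
    return rate(below), rate(cleared)
```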

Landing pages, mobile apps, web apps. Three different contests

The overall rankings don't tell the whole story, which is that no tool is reliably good. On a per-brief basis, Claude Code won 14 of 15 matchups on Mise (a mobile wardrobe app) and 12 of 15 on Lumen (a web analytics dashboard), but lost all 15 of its matchups on Dial (a mobile coffee-logging app). There, not a single evaluator ranked it first, and a dark-mode call left most text failing basic readability. Antigravity produced the mirror image, topping Dial while no evaluator ranked it first on Mise. Every output makes a single aesthetic bet, and when that bet is wrong for the use case, the whole thing collapses no matter how strong the tool is on average. Cursor held steady on most briefs without ever crashing, which is why it leads overall.

Mobile, best case against worst case. Left, a wardrobe output praised as “calm and minimal… close to the Mise brand direction.” Right, a coffee log where “most text fails basic readability standards.” The designer questioned whether dark mode was even the right call.
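One way to read the "of 15" figures above: with four tools per brief, each evaluator's rank order implies three head-to-head results for a given tool, and five evaluators give fifteen. That derivation is an inference, not something the article states, so the sketch below is an illustration under that assumption, with hypothetical tool names.

```python
def matchup_record(rankings, tool):
    """Count implied pairwise wins and losses for one tool on one brief.

    rankings: list of rank orders (best first), one per evaluator.
    Assumes every rank order contains the tool exactly once.
    """
    wins = losses = 0
    for order in rankings:
        position = order.index(tool)
        wins += len(order) - 1 - position  # tools ranked below this one
        losses += position                 # tools ranked above this one
    return wins, losses


# Hypothetical brief: ranked first by four evaluators, second by one.
rankings = [["a", "b", "c", "d"]] * 4 + [["b", "a", "c", "d"]]
record = matchup_record(rankings, "a")  # (14, 1)
```

Under this reading, a 14-of-15 record means a tool was ranked first by four of five evaluators and second by the fifth.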

That volatility shows up again when you group by design type. On landing pages, Cursor led with a composite score of 3.65 out of 5, where the composite is the average of the four rated dimensions: prompt adherence, usability, visual appeal, and modernity. On web applications, Antigravity took the lead at 3.52, with Claude Code close behind at 3.4. On mobile apps, everything dropped. Cursor led at 2.83, but no tool cleared 3.0.

The rankings reshuffle by format. Cursor leads landing pages (3.65) and mobile (2.83). Antigravity tops web apps (3.52).
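The composite described above is a plain average of the four rated dimensions. A minimal sketch follows, assuming each rating is a dict keyed by hypothetical field names; the article does not say whether tool-level composites average per-rating composites or per-dimension means, though the two agree when no ratings are missing.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical field names for the four rated dimensions.
DIMENSIONS = ("prompt_adherence", "usability", "visual_appeal", "modernity")


def composite(rating):
    """Average of the four 1-5 dimension scores for a single rating."""
    return mean(rating[d] for d in DIMENSIONS)


def composite_by_tool(rows):
    """Mean composite per tool; rows are dicts with 'tool' plus the four scores."""
    by_tool = defaultdict(list)
    for row in rows:
        by_tool[row["tool"]].append(composite(row))
    return {tool: mean(scores) for tool, scores in by_tool.items()}
```

Filtering `rows` to one format (landing page, web app, or mobile) before calling `composite_by_tool` gives the per-format view used in the chart.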

Mobile was the weakest format across the board. Designers flagged the same problems regardless of tool: inconsistent spacing, navigation that didn't feel native, and card layouts that packed too much information into too little space. One designer summarized their reaction to a mobile output by saying it looked like it was built by a template, with recurring padding and spacing problems that wouldn't survive a client presentation.

What kept outputs from clients

The reasons for rejecting outputs were more uniform than the reasons for ranking them first: outputs were generic, and layouts needed structural work.

“Generic / needs refinement” was the most cited reason across all four tools when designers said they wouldn't present to a client, with Codex drawing 19 mentions, nearly double any other tool. But even the top performers weren't immune. Designers said yes to Claude Code's client readiness 63% of the time, but the “yes” rationales still frequently included caveats about refinement.

Layout / hierarchy was the second most common rejection theme. Codex drew 15 mentions and Claude Code drew 9. The complaints were specific: oversized headings wasting viewport space, filter bars that didn't look interactive, hero sections with no product below the fold. One designer described a marketplace output that was entirely a hero with nothing after it: a headline, body text, and then the page ended.

Same Mise brief, two outcomes. Left drew structural complaints: “the top and bottom menus do not feel stable… the information hierarchy needs improvement.” Right won the room: “it focuses on the user's demand… it communicates the brand identity… filters work and it's fast.”

Mobile apps drew the toughest feedback. Evaluators pointed to issues with information architecture, visual styles that felt outdated, and outputs not refined enough for the brief. Of the three format types, mobile had the widest gap between what designers expected and what the tools delivered.

Designers also classified each output by how far along it felt. Across all four tools, production-ready was the exception: no tool got more than a quarter of its outputs there, and most landed at brainstorming or mockup. Codex barely reached production-ready at all.

Design-stage classification per tool. Production-ready was the exception; most outputs landed at brainstorming or mockup.

The verdict

Every tool won something and every tool lost something. Cursor took the most head-to-head matchups but split the room on client readiness. Claude Code drew the most client-ready verdicts but collapsed on briefs that didn't suit it. Antigravity swung between the best output in the set and the worst. Codex trailed on nearly every measure.

The finding that matters most is that client readiness and side-by-side preference run on different logic. In the tournament, all four dimensions we rated predicted winning a matchup at similar strength, with prompt adherence marginally the strongest. Client readiness flipped that. Once an output met a baseline of adherence, what mattered was whether it felt current: whether a designer looked at it and thought, this feels like something shipping today. Prompt adherence edges the side-by-side. Taste wins the client's yes. And the tools that won did so by understanding the product. The ones that lost often produced visually capable work that simply missed what was asked for.

So the advice for a designer picking these up today is specific: reach for Cursor on landing pages, Antigravity or Claude Code on web apps, and for mobile, pick whichever you know best and budget real rework since no tool cleared the bar. And treat whatever comes back as a starting point. The tool gives you an opening move. Finishing the work is still the designer's job.

The gap between “looks good in a side-by-side” and “I'd put my name on this” is the gap these tools still need to close.

Methodology

Five working designers in UI/UX, web, and mobile app design, sourced from Contra's top-earning talent, evaluated six briefs across three format types: two landing pages, two mobile apps, and two web applications, each set in a distinct product domain (crypto, developer tooling, specialty coffee, analytics, luxury fashion, and curated e-commerce). Briefs were written to professional standards and contained a project overview, objective, target audience, brand identity, mood references, and required screens or sections. No color palettes or typography specs were provided, forcing each tool to make its own aesthetic decisions from the brief's tone and references alone.

Each brief was run through all four tools using each tool's default model configuration, producing 24 total outputs:

Cursor — cursor-agent v2026.05.24-dda726e, composer-2.5
Claude Code — v2.1.150, claude-sonnet-4-6
OpenAI Codex — v0.134.0, gpt-5.5-medium
Google Antigravity — v1.0.2, gemini-3.5-flash-high

A single standardized prompt was used per brief across all tools:

You are an expert web developer and visual designer. Build a complete, polished web page based on the user's instructions.

Technical requirements:
- Generate a complete and valid HTML document with DOCTYPE, meta charset, and viewport meta tag.
- Return raw HTML that can be used directly without any additional processing.
- NO frameworks or libraries (React, Vue, Angular, Tailwind, Bootstrap, etc.). Pure vanilla HTML, CSS, and JavaScript only.
- When an external dependency is genuinely needed (e.g., an icon set or animation library), load it from UNPKG via a script/link tag.
- Use semantic HTML elements (nav, main, section, article, header, footer, etc.).
- All interactive elements must have visible hover/focus/active states.
- Be accessible: use ARIA labels, sufficient color contrast (WCAG AA), focus-visible styles, and logical tab order.

Design requirements:
- Always set an explicit background color on html and body.
- Define a cohesive color palette using CSS custom properties on :root.
- Establish a clear typographic hierarchy. Import a web font from Google Fonts if it elevates the design.
- Use generous whitespace and a consistent spacing rhythm.
- Apply subtle visual polish: box-shadows, border-radius, smooth transitions on interactive states.

CSS architecture:
- Write all styles in a single <style> block in <head>.
- Use short, descriptive, flat class names. No deep nesting.
- Mobile-first responsive design using modern CSS (Grid, Flexbox, clamp()).

Interactivity:
- All JavaScript in a single <script> block at the end of <body>.
- Use hash-based routing for multi-page navigation so it works inside iframes.
- Use localStorage to persist user-entered data.

Output only a markdown code block containing the complete HTML.

The following is the client brief:

Outputs were evaluated through an interactive interface where each one could be opened, scrolled, clicked, and fully interacted with as a live artifact. Evaluation followed a blinded, pairwise tournament format, with two side-by-side matchups, then a final comparison between winners, producing a rank order per brief. After ranking, evaluators rated each output individually on four dimensions (Prompt Adherence, Usability, Visual Appeal, and Modernity & Trend Relevance) using a shared 1–5 rubric. Ratings triggered conditional follow-ups to capture what specifically drove the score. Evaluators then classified each output by design stage and noted what changes would be required to make it production-ready.

Evaluation funnel: blind tournament → 1–5 scalar rubric → stage classification → client-readiness verdict.

Designers' free-text reasons for not presenting an output to a client were coded into themes using GPT-4o mini, then spot-checked by a member of the team. The seven themes were Layout / hierarchy, Typography / color, Visual polish / aesthetics, Functionality / UX, Brief / prompt alignment, Client-ready foundation, and Generic / needs refinement. We report each theme as the number of evaluator comments that mentioned it. Because these are raw mention counts, they reflect both how often an issue arose and how much detail an evaluator volunteered, so we treat them as indicative rather than exact.
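The mention counts above amount to a tally of coded comments per tool and theme. A minimal sketch, assuming each coded comment arrives as a `(tool, themes)` pair (a hypothetical shape; the study's actual coding output format is not described):

```python
from collections import Counter

# The seven themes named in the methodology.
THEMES = (
    "Layout / hierarchy",
    "Typography / color",
    "Visual polish / aesthetics",
    "Functionality / UX",
    "Brief / prompt alignment",
    "Client-ready foundation",
    "Generic / needs refinement",
)


def theme_mentions(coded_comments):
    """Count evaluator comments mentioning each theme, per tool.

    coded_comments: iterable of (tool, themes) pairs, one per comment.
    A theme is counted at most once per comment, however often it recurs.
    """
    counts = Counter()
    for tool, themes in coded_comments:
        for theme in set(themes):
            if theme in THEMES:  # ignore labels outside the codebook
                counts[(tool, theme)] += 1
    return counts
```

Because a single long comment can touch several themes, totals across themes can exceed the number of comments, which is one reason to read the counts as indicative.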

Limitations

Evaluators interacted with live outputs, but the results reflect first-pass capability, not each tool's ceiling under refinement. Each tool used its own default model, so the comparison is at the product level and differences in output reflect the full stack of model selection, prompt engineering, and UI scaffolding, not the underlying model alone. The briefs are fictional, which removes client feedback loops and real-world constraint negotiation from the equation. Only the design surface was evaluated; code quality, performance, and accessibility were not. Evaluator preferences carry inherent subjectivity, and the rubric anchors judgment but doesn't eliminate taste. The sample is small enough that we treat these findings as directional rather than definitive. Inter-rater reliability across the five evaluators was modest (Krippendorff's α = 0.37 for pairwise rankings, 0.15–0.30 for the scalar ratings), meaning individual scores varied widely between evaluators. We place more weight on patterns across tools and briefs than on any single rating.
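For context on the reliability figures, Krippendorff's α is one minus the ratio of observed to expected disagreement, so 1 means perfect agreement and 0 means agreement no better than chance. The sketch below uses the interval (squared-difference) distance; the article does not say which distance metric it used for rankings or ratings, so that choice, and the input shape, are assumptions.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with squared-difference distance.

    units: list of lists; each inner list holds the scores one output
    received from the evaluators who rated it (assumed input shape).
    """
    # Scores without a partner in the same unit carry no agreement information.
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)

    # Observed disagreement: ordered pairs within each unit.
    observed = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: ordered pairs across all pooled scores.
    expected = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all scores are identical")
    return 1.0 - observed / expected
```

Values near the reported 0.15 to 0.37 range mean evaluators agreed only somewhat more than chance, which is why the text leans on patterns rather than single scores.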


The world's leading independent

human data & creative evaluation lab

Request partnership

Creative Human Data

Benchmark

Research

Datasets

Jobs
